Android Question [RESOLVED] Different results from: string.GetBytes, File.ReadBytes and RandomAccessFile.ReadBytes

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Experimenting with reading large files and writing data to SQLite.
With this I noticed there are differences with the above 3 methods to get a byte array.

For example testing with this string (either in a file or as a string):

Phone¬Email¬Website¬Fax
13043¬FA512¬Pharmacy¬Community Pharmacy¬Visible¬

Reading this as string I get bytes -62 and -84 for the ¬ character and and only byte 10 for the linebreak
Reading this string from a text file (.csv) I get only byte -84 for the ¬ character and bytes 13 and 10 for the linebreak.

These are the code fragments to get the byte array:

B4X:
'reading string:
    arrBytes = strString.GetBytes("UTF8")

'reading file, using File:
arrBytes = File.ReadBytes(tFF.strFolder, tFF.strFile)

'reading the file using RAF:
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = RAF.Size
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)

I can work around this, but would like to understand why this happens and how I can predict what will happen.

RBS
 

emexes

Expert
Licensed User
Probably will need user input, that is for them to specify the column separator string at least

If the column separator is a single character, then you could probably identify it on the basis that:

1/ it is unlikely to be a letter or digit
2/ there will be at least n of them on almost every line

except for if the separator is space and can be one or more spaces. ?

Commas, tabs, pipes and spaces are the most common separators I've seen, but I've also seen semicolons, exclamation marks and forward and backward slashes. And now commas-with-high-bit-set too, thanks to your NHS data.

I'm sure that there are systems somewhere in the world that use other characters too. In fact, I'm starting to vaguely remember some data that used the ASCII field separator control code Chr(28) aka FS

Argh, just remembered your continental comrades using commas instead of decimal points. Also, if the data contains account or id numbers with dashes in them, then those dashes could easily be misidentified as field separators. Hmm. Perhaps these bridges are best left uncrossed until we you get to them. ?
 
Upvote 0

emexes

Expert
Licensed User
I specified ¬ as the value separator and Chr(34) as the value encloser and not sure yet how to determine these 2.

It occured to me that there should always be a even number of value enclosers on each line = that might be a way to determine them automatically.
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
> 2/ there will be at least n of them on almost every line

I coded something based on a similar idea:

B4X:
Sub GetCharCounts(strLine As String) As Map
    
    Dim i As Int
    Dim mapCountChars As Map
    Dim iCount As Int
    Dim oChar As Char
    
    mapCountChars.Initialize
    
    For i = 0 To strLine.Length - 1
        oChar = strLine.CharAt(i)
        If mapCountChars.ContainsKey(oChar) Then
            iCount = mapCountChars.Get(oChar)
            mapCountChars.Put(oChar, iCount + 1)
        Else
            mapCountChars.Put(oChar, 1)
        End If
    Next
    
    Return mapCountChars
    
End Sub

Sub DetectCharacterEncoding(TextBytes() As Byte) As ResumableSub

    Dim i As Long
    Dim PreviousByte As Int    'actually just the top 2 bits
    Dim ThisByte As Int = 0    'actually just the top 2 bits
    Dim InvalidUTF8Flag As Boolean = False

    For i = 0 To TextBytes.Length - 1
        PreviousByte = ThisByte
        ThisByte = Bit.And(TextBytes(I), 0xC0)    'top 2 bits

        If PreviousByte = 0xC0 Then      'if multibyte start
            If ThisByte <> 0x80 Then     'not followed by multibyte continuation
                InvalidUTF8Flag = True
                Exit
            End If
        else if ThisByte = 0x80 Then       'if multibyte continuation
            If PreviousByte < 0x80 Then    'not preceded by multibyte start or continuation
                InvalidUTF8Flag = True
                Exit
            End If
        End If
    Next
 
    If InvalidUTF8Flag Then
        Return "windows-1252"
    Else
        Return "UTF-8"    'or plain ASCII
    End If
 
End Sub

Sub GetSeparatorCharFromFile(strFolder As String, strFile As String, strCharSet As String, iLines As Int) As ResumableSub
    
    Dim i As Int
    Dim n As Int
    Dim oTR As TextReader
    Dim strLine As String
    Dim iCount As Int
    Dim lstMaps As List
    Dim oMap As Map
    Dim oMapCharAndCount As Map
    Dim strKey As String
    
    oTR.Initialize2(File.OpenInput(strFolder, strFile), strCharSet)
    lstMaps.Initialize
    
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then Exit
        lstMaps.Add(GetCharCounts(strLine))
        n = n + 1
        If n = iLines Then Exit
    Loop
    
    oTR.Close
    
    oMapCharAndCount.Initialize
    
    For i = 0 To lstMaps.Size - 1
        oMap = lstMaps.Get(i)
        For Each oKey As Object In oMap.Keys
            strKey = oKey & oMap.Get(oKey)
            If oMapCharAndCount.ContainsKey(strKey) Then
                iCount = oMapCharAndCount.Get(strKey)
                oMapCharAndCount.Put(strKey, iCount + 1)
            Else
                oMapCharAndCount.Put(strKey, 1)
            End If
        Next
    Next
    
    For Each oKey As Object In oMapCharAndCount.Keys
        If iLines = oMapCharAndCount.Get(oKey) Then
            Return CStr(oKey).CharAt(0)
        End If
    Next
    
End Sub

Sub CStr(o As String) As String
    Return o
End Sub

Sub TestTextReader
    
    Dim c As Int 'ignore
    Dim n As Int
    Dim oTR As TextReader
    Dim lBytes As Long
    Dim strLine As String
    Dim iCols As Int
    Dim lstCSV As List
    Dim arrDataTypes() As String
    
    Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
                                      "Pick a .csv or .txt file or a folder", _
                                       File.DirRootExternal & "/PhonePats", _
                                       Array As String("csv", "txt"), _
                                       True)
    Wait For (rs) Complete (tFF As tFolderAndFile)
    
    StartSW(0)
    
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = 1000 'this is enough at least in case of the SCL.csv file
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)
    
    Dim rs1 As ResumableSub  = DetectCharacterEncoding(arrBytes)
    Wait For (rs1) Complete (strCharset As String)
    
    Log(strCharset) '>>   windows-1252
    
    Dim rs1 As ResumableSub  = GetSeparatorCharFromFile(tFF.strFolder, tFF.strFile, strCharset, 4)
    Wait For (rs1) Complete (oChar As Char)
    
    Log("separator char: " & oChar)
    
    Return

Will test later.

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Forgot to exclude certain chars from being the separator char, corrected that now and seems to work fine so far.

B4X:
Sub GetSeparatorCharFromFile(strFolder As String, strFile As String, strCharSet As String, iLines As Int) As ResumableSub
    
    Dim i As Int
    Dim n As Int
    Dim oTR As TextReader
    Dim strLine As String
    Dim iCount As Int
    Dim lstMaps As List
    Dim oMap As Map
    Dim oMapCharAndCount As Map
    Dim strKey As String
    Dim strPossibleSeparator As Char
    Dim strExclude As String = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    
    oTR.Initialize2(File.OpenInput(strFolder, strFile), strCharSet)
    lstMaps.Initialize
    
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then Exit
        lstMaps.Add(GetCharCounts(strLine))
        n = n + 1
        If n = iLines Then Exit
    Loop
    
    oTR.Close
    
    oMapCharAndCount.Initialize
    
    For i = 0 To lstMaps.Size - 1
        oMap = lstMaps.Get(i)
        For Each oKey As Object In oMap.Keys
            strKey = oKey & oMap.Get(oKey)
            If oMapCharAndCount.ContainsKey(strKey) Then
                iCount = oMapCharAndCount.Get(strKey)
                oMapCharAndCount.Put(strKey, iCount + 1)
            Else
                oMapCharAndCount.Put(strKey, 1)
            End If
        Next
    Next
    
    For Each oKey As Object In oMapCharAndCount.Keys
        Log(oKey)
        If iLines = oMapCharAndCount.Get(oKey) Then
            strPossibleSeparator = CStr(oKey).CharAt(0)
            If strExclude.Contains(strPossibleSeparator) = False Then
                Return strPossibleSeparator
            End If
        End If
    Next
    
End Sub

RBS
 
Upvote 0

emexes

Expert
Licensed User
Forgot to exclude certain chars from being the separator char, corrected that now and seems to work fine so far.

I haven't run your code here, so I could be wrong about this, but I feel like string-enclosing characters should be excluded too, otherwise if you have a data file with many columns of enclosed strings, those characters are going to have high and consistent counts per line, and thus look like field separators.

B4X:
Dim strExclude As String = " ' "" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"    'includes quote and double-quote
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Good point, will change that.

RBS
 
Upvote 0

emexes

Expert
Licensed User
feel like string-enclosing characters should be excluded too

Created test file of first thousand pharmacies, with 36 double-quotes added to each line, to test separator identifying.

(during ad breaks watching Crocodile Dundee on tv ? )
 

Attachments

  • PharmacyQ1000.txt
    285.1 KB · Views: 138
Last edited:
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
Created test file of first thousand pharmacies, with 36 double-quotes added to each line, to test separator identifying.

(during ad breaks watching Crocodile Dundee on tv ? )
Have tested your file now and all working perfectly fine.
Finding the right column separator char (¬) and the data in the SQLite table was all fine as well.
As the SQLite table column names had leading and trailing double quotes as well I added some code to take these off, although there was no harm from it.

As all seems sorted now I will summarize this very long thread now in a bit and call it solved.

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
I spoke too soon as there was a major bug in the code to determine any separator Char.
Hadn't tested fully, that is not tested on enclosed separators and also not tested on files with no separators, so where there is only one column.
Also in the code to determine the separator char there could be less lines in the file than the specified number of lines to test and that needed fixing.
Have fixed this now, but won't close this thread for now.

Adjusted code:

B4X:
Sub CountFieldsInTextLine(strCSVLine As String, strSeparator As String, strEncloser As String) As Int
    
    Dim c As Int
    Dim iEndIndex As Int
    Dim bInsideQuotes As Boolean = False
    
    If strSeparator.Length = 0 Then
        Return 1
    End If
        
    For iEndIndex = 0 To strCSVLine.Length - 1
        If strCSVLine.CharAt(iEndIndex) = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If strCSVLine.CharAt(iEndIndex) = strSeparator Then
                If bInsideQuotes = False Then
                    c = c + 1
                End If
            End If
        End If
    Next
    
    Return c + 1
    
End Sub

Sub GetCharCounts(strLine As String, strEncloser As String) As Map
    
    Dim i As Int
    Dim bInsideQuotes As Boolean
    Dim mapCountChars As Map
    Dim iCount As Int
    Dim oChar As Char
    
    mapCountChars.Initialize
    
    For i = 0 To strLine.Length - 1
        oChar = strLine.CharAt(i)
        If oChar = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If bInsideQuotes = False Then
                If mapCountChars.ContainsKey(oChar) Then
                    iCount = mapCountChars.Get(oChar)
                    mapCountChars.Put(oChar, iCount + 1)
                Else
                    mapCountChars.Put(oChar, 1)
                End If
            End If
        End If
    Next
    
    Return mapCountChars
    
End Sub

Sub GetSeparatorCharFromFile(strFolder As String, strFile As String, _
                             strCharSet As String, strEncloser As String, iLines As Int) As ResumableSub
    
    Dim i As Int
    Dim n As Int
    Dim oTR As TextReader
    Dim strLine As String
    Dim iCount As Int
    Dim lstMaps As List
    Dim oMap As Map
    Dim oMapCharAndCount As Map
    Dim strKey As String
    Dim strPossibleSeparator As Char
    Dim strExclude As String = " ' "" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"    'includes quote and double-quote
    
    oTR.Initialize2(File.OpenInput(strFolder, strFile), strCharSet)
    lstMaps.Initialize
    
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then
            iLines = n
            Exit
        End If
        lstMaps.Add(GetCharCounts(strLine, strEncloser))
        n = n + 1
        If n = iLines Then
            Exit
        End If
    Loop
    
    oTR.Close
    
    oMapCharAndCount.Initialize
    
    For i = 0 To lstMaps.Size - 1
        oMap = lstMaps.Get(i)
        For Each oKey As Object In oMap.Keys
            strKey = oKey & oMap.Get(oKey)
            If oMapCharAndCount.ContainsKey(strKey) Then
                iCount = oMapCharAndCount.Get(strKey)
                oMapCharAndCount.Put(strKey, iCount + 1)
            Else
                oMapCharAndCount.Put(strKey, 1)
            End If
        Next
    Next
    
    For Each oKey As Object In oMapCharAndCount.Keys
        'Log(oKey)
        If iLines = oMapCharAndCount.Get(oKey) Then
            strPossibleSeparator = CStr(oKey).CharAt(0)
            'Log("strPossibleSeparator: |" & strPossibleSeparator & "|")
            If strExclude.Contains(strPossibleSeparator) = False Then
                Return strPossibleSeparator
            End If
        End If
    Next
    
    'need to return something and can't return Null
    Return ""
    
End Sub

Sub ParseCSVLine(strCSVLine As String, iCols As Int, strSeparator As String, strEncloser As String) As String()
    
    'Log("ParseCSVLine, iCols: "& iCols)
    
    Dim c As Int
    Dim arrValues(iCols) As String
    Dim iStartIndex As Int
    Dim iEndIndex As Int
    Dim bInsideQuotes As Boolean = False
    
    If strSeparator.Length = 0 Then
        arrValues(0) = strCSVLine
        Return arrValues
    End If
        
    For iEndIndex = 0 To strCSVLine.Length - 1
        If strCSVLine.CharAt(iEndIndex) = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If strCSVLine.CharAt(iEndIndex) = strSeparator Then
                If bInsideQuotes = False Then
                    arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex)
                    iStartIndex = iEndIndex + 1
                    c = c + 1
                End If
            End If
        End If
    Next
    
    arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex )
    
    Return arrValues
    
End Sub

Also I am not sure if it is OK to use TextReader.
Have seen posts where it is not recommended to use this, unless you are dealing with files that are not UTF-8 encoded.
I wonder if this should be stated as:
unless you are dealing with files that are not UTF-8 encoded or files that are very large.

Also I am interested to know what exactly is wrong with using TextReader.

RBS
 
Upvote 0

RB Smissaert

Well-Known Member
Licensed User
Longtime User
As this is now a very long thread I will summarize and re-save it as resolved.

As far as I see it these are the relevant thing I learned from all this:

1. If you are dealing with text and byte arrays obtained from that text you need be aware of the pitfalls of different
text encodings as explained nicely in the second post by drgottjr.

2. From Exemes I got a nice bit of code (sofar working nicely) to get the text encoding by looking at the plain byte array:

B4X:
Sub GetCharacterEncodingFromFile(tFF As tFolderAndFile, lLookAtXBytes As Long) As String
    
    Dim i As Long
    Dim RAF As RandomAccessFile
    Dim lBytes As Long
    Dim PreviousByte As Int    'actually just the top 2 bits
    Dim ThisByte As Int = 0    'actually just the top 2 bits
    Dim InvalidUTF8Flag As Boolean = False
    
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = lLookAtXBytes
    Dim TextBytes(lBytes) As Byte
    RAF.ReadBytes(TextBytes, 0, lBytes, 0)
    
    For i = 0 To TextBytes.Length - 1
        PreviousByte = ThisByte
        ThisByte = Bit.And(TextBytes(i), 0xC0)    'top 2 bits

        If PreviousByte = 0xC0 Then      'if multibyte start
            If ThisByte <> 0x80 Then     'not followed by multibyte continuation
                InvalidUTF8Flag = True
                Exit
            End If
        else if ThisByte = 0x80 Then       'if multibyte continuation
            If PreviousByte < 0x80 Then    'not preceded by multibyte start or continuation
                InvalidUTF8Flag = True
                Exit
            End If
        End If
    Next
    
    RAF.Close
 
    If InvalidUTF8Flag Then
        Return "windows-1252"
    Else
        Return "UTF-8"    'or plain ASCII
    End If
    
End Sub

3. If you end up with long and complex code to a job that is normally done in a much simpler way then probably you are doing
something wrong. I refer to the complex custom code I posted as an attached file. I think I don't need this anymore now.

4. To determine the separator character (usually will be comma) of a text file I posted some code that works well for me and
that might be useful to other people:

B4X:
Sub GetCharCounts(strLine As String, strEncloser As String) As Map
    
    Dim i As Int
    Dim bInsideQuotes As Boolean
    Dim mapCountChars As Map
    Dim iCount As Int
    Dim oChar As Char
    
    mapCountChars.Initialize
    
    For i = 0 To strLine.Length - 1
        oChar = strLine.CharAt(i)
        If oChar = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If bInsideQuotes = False Then
                If mapCountChars.ContainsKey(oChar) Then
                    iCount = mapCountChars.Get(oChar)
                    mapCountChars.Put(oChar, iCount + 1)
                Else
                    mapCountChars.Put(oChar, 1)
                End If
            End If
        End If
    Next
    
    Return mapCountChars
    
End Sub

Sub GetSeparatorCharFromFile(strFolder As String, strFile As String, _
                             strCharSet As String, strEncloser As String, iLines As Int) As ResumableSub
    
    Dim i As Int
    Dim n As Int
    Dim oTR As TextReader
    Dim strLine As String
    Dim iCount As Int
    Dim lstMaps As List
    Dim oMap As Map
    Dim oMapCharAndCount As Map
    Dim strKey As String
    Dim strPossibleSeparator As Char
    Dim strExclude As String = " ' "" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"    'includes quote and double-quote
    
    oTR.Initialize2(File.OpenInput(strFolder, strFile), strCharSet)
    lstMaps.Initialize
    
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then
            iLines = n
            Exit
        End If
        lstMaps.Add(GetCharCounts(strLine, strEncloser))
        n = n + 1
        If n = iLines Then
            Exit
        End If
    Loop
    
    oTR.Close
    
    oMapCharAndCount.Initialize
    
    For i = 0 To lstMaps.Size - 1
        oMap = lstMaps.Get(i)
        For Each oKey As Object In oMap.Keys
            strKey = oKey & oMap.Get(oKey)
            If oMapCharAndCount.ContainsKey(strKey) Then
                iCount = oMapCharAndCount.Get(strKey)
                oMapCharAndCount.Put(strKey, iCount + 1)
            Else
                oMapCharAndCount.Put(strKey, 1)
            End If
        Next
    Next
    
    For Each oKey As Object In oMapCharAndCount.Keys
        'Log(oKey)
        If iLines = oMapCharAndCount.Get(oKey) Then
            strPossibleSeparator = CStr(oKey).CharAt(0)
            'Log("strPossibleSeparator: |" & strPossibleSeparator & "|")
            If strExclude.Contains(strPossibleSeparator) = False Then
                Return strPossibleSeparator
            End If
        End If
    Next
    
    'need to return something and can't return Null
    Return ""
    
End Sub

5. I wrote some code as well to get any encloser character of a text file. This I could do once I could determine the
separator character. Working all well, but will need some further testing before posting.

6. As somehow TextReader had a bit of a "bad name" (mentioned in Erel's code smells) I wrote a simple class that does
the same as TextReader.ReadLine. Again all working well, but needs a bit of further testing before posting.

Will save this thread now as resolved.

RBS
 
Upvote 0
Cookies are required to use this site. You must accept them to continue using the site. Learn more…