Android Question [RESOLVED] Different results from: string.GetBytes, File.ReadBytes and RandomAccessFile.ReadBytes

RB Smissaert · Dec 7, 2023

Experimenting with reading large files and writing data to SQLite.
With this I noticed there are differences with the above 3 methods to get a byte array.

For example testing with this string (either in a file or as a string):

Phone¬Email¬Website¬Fax
13043¬FA512¬Pharmacy¬Community Pharmacy¬Visible¬

Reading this as string I get bytes -62 and -84 for the ¬ character and and only byte 10 for the linebreak
Reading this string from a text file (.csv) I get only byte -84 for the ¬ character and bytes 13 and 10 for the linebreak.

These are the code fragments to get the byte array:

B4X:

'reading string:
    arrBytes = strString.GetBytes("UTF8")

'reading file, using File:
arrBytes = File.ReadBytes(tFF.strFolder, tFF.strFile)

'reading the file using RAF:
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = RAF.Size
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)

I can work around this, but would like to understand why this happens and how I can predict what will happen.

RBS

MicroDrie · Dec 9, 2023

Here is a B4J and B4A program that takes the following steps:

It loads the csv file directly from the specified URL (from the hospital overview) into "temp.csv"
It replaces the separator with a tab value and writes the record to "temp2.csv"
It reads the file "temp2.csv" in the database test.db in the table "Hospitals"

The program is not perfect, but it creates or overwrites a table in an specified SQLite database. In B4J the program works without any problems, in B4A the reading starts for a reason that is unclear to me. Due to the current Android restrictions on directories I could not read the result in a file exploxer. In addition, the size of the files places quite a large demand on memory. It may be worth considering a more client server-like solution in which the client only downloads a limited subset of the information to reduce the resource requirement.

You can further expand this concept to your own wishes by making the filename, database table name and strQuery structure references multiple.

Because all fields in the database table TEXT have values, you may find it easier to define the strQuery with a loop. The current method allows for this in contrast to a different format.

emexes · Dec 9, 2023

RB Smissaert said:
That file (SCL.csv) looks indeed a bit more complex:

-110 10
-106 20
-96 1
-84 593649
-55 2
-37 1
-19 2

That makes it look like the files are encoded with Windows-1252 eg the accented E in FÉ Care

Simplest solution is probably ByteConverter.StringFromBytes(bytes(), "windows-1252")

https://www.b4x.com/android/help/byteconverter.html#byteconverter_stringfrombytes

emexes · Dec 9, 2023

MicroDrie said:
It loads the csv file directly from the specified URL (from the hospital overview) into "temp.csv"

emexes said:
Simplest solution is probably ByteConverter.StringFromBytes(bytes(), "windows-1252")

or, in the @MicroDrie masterpiece, HttpJob.GetString2("windows-1252") and the field separator changed to "¬"

https://www.b4x.com/b4j/help/jhttputils2.html#httpjob_getstring2

RB Smissaert · Dec 10, 2023

MicroDrie said:
Here is a B4J and B4A program that takes the following steps:

It loads the csv file directly from the specified URL (from the hospital overview) into "temp.csv"

It replaces the separator with a tab value and writes the record to "temp2.csv"

It reads the file "temp2.csv" in the database test.db in the table "Hospitals"

The program is not perfect, but it creates or overwrites a table in an specified SQLite database. In B4J the program works without any problems, in B4A the reading starts for a reason that is unclear to me. Due to the current Android restrictions on directories I could not read the result in a file exploxer. In addition, the size of the files places quite a large demand on memory. It may be worth considering a more client server-like solution in which the client only downloads a limited subset of the information to reduce the resource requirement.

You can further expand this concept to your own wishes by making the filename, database table name and strQuery structure references multiple.

Because all fields in the database table TEXT have values, you may find it easier to define the strQuery with a loop. The current method allows for this in contrast to a different format.

Thanks, and will have a look at your code.

There are really 2 things to deal with:
1. Very large files. This I think I have sorted.
2. Working out somehow what the charset of the file is.
This I haven't figured out yet.

I convert from bytes to string eg like this:

B4X:

arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")

As I understand it there are files where "UTF-8" is not the right charset to use.

RBS

emexes · Dec 11, 2023

RB Smissaert said:
2. Working out somehow what the charset of the file is.
This I haven't figured out yet.

If this is a once-off mission done when you first use data from a new source, then use something like https://abctool.gitlab.io/encoding-detector/

If it is an ability that you want to build into your program then that is a tad harder but this will get you there 99.9% of the time:

Go with UTF-8 unless the text bytes contain any invalid UTF-8 multibyte encodings, in which case go with windows-1252

Invalid UTF-8 multibyte encodings are:
a/ bytes beginning with high bits 10 are that are not preceded by a byte beginning with high bit 1
b/ bytes beginning with high bits 11 that are not followed by a byte beginning with high bits 10

edited: to simplify

RB Smissaert · Dec 11, 2023

emexes said:
If this is a once-off mission done when you first use data from a new source, then use something like https://abctool.gitlab.io/encoding-detector/

If it is an ability that you want to build into your program, that is a tad harder, but:

1/ if all byte values are ASCII ie 0 thru 127 (or even better: 9, 10, 13 and 32 thru 126) then it is interpretable as US-ASCII (or as any of the three other encodings below, which are all supersets of ASCII)

2/ otherwise if all bytes beginning with high bits 11 are followed by a byte beginning with high bits 10, and all bytes beginning with high bits 10 are preceded by a byte beginning with high bit 1, then it is probably UTF-8

3/ otherwise it is probably windows-1252

but if you want to be more certain, then:

4/ if the count of IBM437 accented letters (~byte values 128 thru 165) is greater than the count of windows-1252 accented letters (byte values 192 thru 255) then it is more probably IBM437 (the original IBM PC character set, with line-drawing characters 179 thru 218 that are unlikely to appear in text data)

I will need to build this in the app, so users can view or import (to DB) text files without manipulating (eg in Excel) these files themselves.
Your steps 1 to 3 will probably be sufficient to handle the files that I am dealing with so that is very helpful.
Do you have code that will take a byte array and produce the charset although it looks I can work this out myself.
I take it I have to look at all the bytes of the file, so not for example the first 1000 or so.

RBS

emexes · Dec 11, 2023

RB Smissaert said:
Do you have code that will take a byte array and produce the charset

I do but it's in PowerBasic.

I was thinking about it during lunch and realised that nowadays you can probably ignore the old IBM PC set anyway, and simplified the preceding post accordingly.

RB Smissaert said:
I take it I have to look at all the bytes of the file, so not for example the first 1000 or so.

All would be better, but capped at the first 1000 or 10000 will do the job most-if-not-all times.

With your NHS data where the field separator is invalid UTF-8 and appears often and on each line, then cqpped will do the job all times.

emexes · Dec 11, 2023

RB Smissaert said:
Do you have code that will take a byte array and produce the charset

~~I do but it's in PowerBasic.~~

Now in B4X (simplified by ignoring old IBM PC IBM437) :

B4X:

Sub DetectCharacterEncoding(TextBytes() As Byte) As String

    Dim InvalidUTF8Flag As Boolean = False
 
    Dim PreviousByte As Int    'actually just the top 2 bits
    Dim ThisByte As Int = 0    'actually just the top 2 bits

    For I = 0 To TextBytes.Length - 1
        PreviousByte = ThisByte
        ThisByte = Bit.And(TextBytes(I), 0xC0)    'top 2 bits

        If PreviousByte = 0xC0 Then    'if multibyte start
            If ThisByte <> 0x80 Then     'not followed by multibyte continuation
                InvalidUTF8Flag = True
                Exit
            End If
        else if ThisByte = 0x80 Then    'if multibyte continuation
            If PreviousByte < 0x80 Then    'not preceded by multibyte start or continuation
                InvalidUTF8Flag = True
                Exit
            End If
        End If
    Next
 
    If InvalidUTF8Flag Then
        Return "windows-1252"
    Else
        Return "UTF-8"    'or plain ASCII
    End If
 
End Sub

RB Smissaert · Dec 11, 2023

emexes said:

~~I do but it's in PowerBasic.~~

Now in B4X (simplified by ignoring old IBM PC IBM437) :

B4X:

Sub DetectCharacterEncoding(TextBytes() As Byte) As String

    Dim InvalidUTF8Flag As Boolean = False
 
    Dim PreviousByte As Int    'actually just the top 2 bits
    Dim ThisByte As Int = 0    'actually just the top 2 bits

    For I = 0 To TextBytes.Length - 1
        PreviousByte = ThisByte
        ThisByte = Bit.And(TextBytes(I), 0xC0)    'top 2 bits

        If PreviousByte = 0xC0 Then    'if multibyte start
            If ThisByte <> 0x80 Then     'not followed by multibyte continuation
                InvalidUTF8Flag = True
                Exit
            End If
        else if ThisByte = 0x80 Then    'if multibyte continuation
            If PreviousByte < 0x80 Then    'not preceded by multibyte start or continuation
                InvalidUTF8Flag = True
                Exit
            End If
        End If
    Next
 
    If InvalidUTF8Flag Then
        Return "windows-1252"
    Else
        Return "UTF-8"    'or plain ASCII
    End If
 
End Sub

Thanks for that and your code picks out rightly that SCL.csv file (the one with triple "C") as windows-1252.

Would it be possible to do this:

B4X:

Do While True
    Dim L As String = TR.ReadLine
    If L = Null Then
        Exit
    End If

    L = L.Replace(Chr(65533), Chr(9))    'invalid UTF-8 translates to Chr(65533); change those to tabs

In my byte arrays rather than on strings.

I am rewriting my CSV2List Sub, so it always has a start and end byte index (if -1 will be first and last) and also make it work with a byte array directly
rather than with folder and file strings. All to make it simpler and smaller subs.

RBS

MicroDrie · Dec 11, 2023

What I don't understand is why you want to use a byte array to read a CSV text file that, according to the CVS file by default, can only contain text characters. Of course the advantage of a standard is that everyone can create their own and therefore different standard with exceptions to the rule. You refer to the NHS files, but they are purely text. A text-only file can of course start with a number of bytes written before the definitions/signature of the application of that file or a different character code. A base64 image file may be included, but those image bytes are still written as base64 text characters and can be read as a text string by your app and written to the text-based SQLite database. Why do you want to use a bytre array?

emexes · Dec 11, 2023

RB Smissaert said:
Would it be possible to do this:

B4X:

L = L.Replace(Chr(65533), Chr(9)) 'invalid UTF-8 translates to Chr(65533); change those to tabs

In my byte arrays rather than on strings.

Yes, but "manually" eg:

B4X:

Dim BA() As Byte = bc.StringToBytes("first¬second¬third", "windows-1252")

Log("Before" & TAB & bc.StringFromBytes(BA, "windows-1252"))

Dim ChangeFrom As Byte = Asc("¬")
Dim ChangeTo As Byte = Asc(TAB)

For I = 0 To BA.Length - 1
    If BA(I) = ChangeFrom Then BA(I) = ChangeTo    'standardise field separator to TAB
Next

Log("After" & TAB & bc.StringFromBytes(BA, "windows-1252"))

RB Smissaert · Dec 11, 2023

MicroDrie said:
What I don't understand is why you want to use a byte array to read a CSV text file that, according to the CVS file by default, can only contain text characters. Of course the advantage of a standard is that everyone can create their own and therefore different standard with exceptions to the rule. You refer to the NHS files, but they are purely text. A text-only file can of course start with a number of bytes written before the definitions/signature of the application of that file or a different character code. A base64 image file may be included, but those image bytes are still written as base64 text characters and can be read as a text string by your app and written to the text-based SQLite database. Why do you want to use a bytre array?

I started to use byte arrays as it was then easy to read a file (with RandomAccessFile) in small parts, to deal with large files.
The other reason was that I was dealing with misformed files, for example a trailing comma (as field delimiter) when there shouldn't be one.
The last thing to mention is that using a byte array just gives you full control, admittedly at the cost of extra complexity.

I had a go with TextReader to read the file line by line, but for some reason that still gave me out of memory errors.

In any case I have a feeling that I am not far off from making it work OK, so will persevere for now.

RBS

emexes · Dec 11, 2023

RB Smissaert said:
I had a go with TextReader to read the file line by line, but for some reason that still gave me out of memory errors.

Was this only with B4A? I had no problems using TextReader.ReadLine in B4J on the NHS files.

Perhaps the problem is caused by .ReadLine expecting *valid* UTF-8 and choking on the -84 byte (which is *not* valid UTF-8).

RB Smissaert · Dec 11, 2023

emexes said:
Was this only with B4A? I had no problems using TextReader.ReadLine in B4J on the NHS files.

Perhaps the problem is caused by .ReadLine expecting *valid* UTF-8 and choking on the -84 byte (which is *not* valid UTF-8).

Yes this was with B4A, not using anything else.

RBS

RB Smissaert · Dec 11, 2023

emexes said:
Was this only with B4A? I had no problems using TextReader.ReadLine in B4J on the NHS files.

Perhaps the problem is caused by .ReadLine expecting *valid* UTF-8 and choking on the -84 byte (which is *not* valid UTF-8).

I can see now that TextReader has Initialize2, which allows specifying a characterset. Not used that before and
will have a look at that.

RBS

emexes · Dec 11, 2023

RB Smissaert said:
I can see now that TextReader has Initialize2, which allows specifying a characterset.

Lol I am blind as a bat sometimes.

RB Smissaert said:
will have a look at that.

Definitely worth a shot.

TextReader.ReadLine should be much easier than manually doing it via byte arrays.

Although... it's still a mild worry if TextReader.ReadLine can choke on invalid input.

RB Smissaert · Dec 11, 2023

emexes said:
Lol I am blind as a bat sometimes.

Definitely worth a shot.

TextReader.ReadLine should be much easier than manually doing it via byte arrays.

Although... it's still a mild worry if TextReader.ReadLine can choke on invalid input.

Specifying the charset does seem to work OK. Tested with that TCL.csv file:

B4X:

Sub TestTextReader
    
    Dim oTR As TextReader
    Dim lBytes As Long
    Dim strLine As String
    
    Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
                                      "Pick a .csv or .txt file or a folder", _
                                       File.DirRootExternal & "/PhonePats", _
                                       Array As String("csv", "txt"), _
                                       True)
    Wait For (rs) Complete (tFF As tFolderAndFile)
    
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = 10000 'enough at least in this file to give me characterset
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)
    
    Dim rs1 As ResumableSub  = DetectCharacterEncoding(arrBytes)
    Wait For (rs1) Complete (strCharset As String)
    
    Log(strCharset) '>>   windows-1252
    
    oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset)
    
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then Exit
        If strLine.StartsWith("8329479") Then 'this is the line with Complete Contemporary Care Triple "C" etc.
            Log(strLine)
        End If
    Loop
    
    oTR.Close
    
End Sub

Will give me this:

8329479¬1-4207346477¬Care homes and care at home¬UNKNOWN¬Visible¬False¬Complete Contemporary Care Triple "C" Ltd¬Suite 1, Regent business centre¬16 to 20 Regent Street¬¬Leeds¬¬LS2 7QA¬53.801853179931641¬-1.5328665971755981¬1-4111445769¬Complete Contemporary Care. Triple "C". Ltd¬07733800660¬¬¬¬

And that looks all fine.

RBS

RB Smissaert · Dec 11, 2023

RB Smissaert said:

Specifying the charset does seem to work OK. Tested with that TCL.csv file:

B4X:

Sub TestTextReader
   
    Dim oTR As TextReader
    Dim lBytes As Long
    Dim strLine As String
   
    Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
                                      "Pick a .csv or .txt file or a folder", _
                                       File.DirRootExternal & "/PhonePats", _
                                       Array As String("csv", "txt"), _
                                       True)
    Wait For (rs) Complete (tFF As tFolderAndFile)
   
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = 10000 'enough at least in this file to give me characterset
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)
   
    Dim rs1 As ResumableSub  = DetectCharacterEncoding(arrBytes)
    Wait For (rs1) Complete (strCharset As String)
   
    Log(strCharset) '>>   windows-1252
   
    oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset)
   
    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then Exit
        If strLine.StartsWith("8329479") Then 'this is the line with Complete Contemporary Care Triple "C" etc.
            Log(strLine)
        End If
    Loop
   
    oTR.Close
   
End Sub

Will give me this:

8329479¬1-4207346477¬Care homes and care at home¬UNKNOWN¬Visible¬False¬Complete Contemporary Care Triple "C" Ltd¬Suite 1, Regent business centre¬16 to 20 Regent Street¬¬Leeds¬¬LS2 7QA¬53.801853179931641¬-1.5328665971755981¬1-4111445769¬Complete Contemporary Care. Triple "C". Ltd¬07733800660¬¬¬¬

And that looks all fine.

RBS

Fully tested now with TextReader and all seems OK and good performance as well:

B4X:

Sub CountFieldsInTextLine(strCSVLine As String, strSeparator As String, strEncloser As String) As Int
    
    Dim c As Int
    Dim iEndIndex As Int
    Dim bInsideQuotes As Boolean = False
        
    For iEndIndex = 0 To strCSVLine.Length - 1
        If strCSVLine.CharAt(iEndIndex) = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If strCSVLine.CharAt(iEndIndex) = strSeparator Then
                If bInsideQuotes = False Then
                    c = c + 1
                End If
            End If
        End If
    Next
    
    Return c + 1
    
End Sub

Sub ParseCSVLine(strCSVLine As String, iCols As Int, strSeparator As String, strEncloser As String) As String()
    
    Dim c As Int
    Dim arrValues(iCols) As String
    Dim iStartIndex As Int
    Dim iEndIndex As Int
    Dim bInsideQuotes As Boolean = False
        
    For iEndIndex = 0 To strCSVLine.Length - 1
        If strCSVLine.CharAt(iEndIndex) = strEncloser Then
            bInsideQuotes = bInsideQuotes = False
        Else
            If strCSVLine.CharAt(iEndIndex) = strSeparator Then
                If bInsideQuotes = False Then
                    arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex)
                    iStartIndex = iEndIndex + 1
                    c = c + 1
                End If
            End If
        End If
    Next
    
    arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex )
    
    Return arrValues
    
End Sub

Sub TestTextReader
    
    Dim c As Int 'ignore
    Dim oTR As TextReader
    Dim lBytes As Long
    Dim strLine As String
    Dim iCols As Int
    Dim lstCSV As List
    
    Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
                                      "Pick a .csv or .txt file or a folder", _
                                       File.DirRootExternal & "/PhonePats", _
                                       Array As String("csv", "txt"), _
                                       True)
    Wait For (rs) Complete (tFF As tFolderAndFile)
    
    StartSW(0)
    
    RAF.Initialize(tFF.strFolder, tFF.strFile, True)
    lBytes = 1000 'this is enough at least in case of the SCL.csv file
    Dim arrBytes(lBytes) As Byte
    RAF.ReadBytes(arrBytes, 0, lBytes, 0)
    
    Dim rs1 As ResumableSub  = DetectCharacterEncoding(arrBytes)
    Wait For (rs1) Complete (strCharset As String)
    
    Log(strCharset) '>>   windows-1252
    
    oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset)
    lstCSV.Initialize
    
    'first line will be column names
    strLine = oTR.ReadLine
    iCols = CountFieldsInTextLine(strLine, "¬", Chr(34))
    Dim arrColumnNames() As String = ParseCSVLine(strLine, iCols, "¬", Chr(34)) 'ignore, will be used to create SQLite table

    Do While True
        strLine = oTR.ReadLine
        If strLine = Null Then Exit
        Dim arr() As String = ParseCSVLine(strLine, iCols, "¬", Chr(34))
'        If arr(0) = "8329479" Then
'            For c = 0 To arr.Length - 1
'                Log(arr(c))
'            Next
'        End If
        lstCSV.Add(arr) 'will need to move list to DB if gets too big and Dim a new list
    Loop
    Log("lstCSV.Size: " & lstCSV.Size & ", " & FormatMilliSecs(StopSW(0), False)) 'lstCSV.Size: 28269, 110 milli-seconds
    
    oTR.Close
        
End Sub

That troublesome line with comes out fine as well:

8329479
1-4207346477
Care homes and care at home
UNKNOWN
Visible
False
Complete Contemporary Care Triple "C" Ltd
Suite 1, Regent business centre
16 to 20 Regent Street
Leeds
LS2 7QA
53.801853179931641
-1.5328665971755981
1-4111445769
Complete Contemporary Care. Triple "C". Ltd
07733800660

Not sure it is all solved as I specified ¬ as the value separator and Chr(34) as the value encloser and not sure yet how to determine these 2.
Still, at least using TextReader makes it all simpler.

RBS

emexes · Dec 11, 2023

RB Smissaert said:
Not sure it is all solved as I specified ¬ as the value separator and Chr(34) as the value encloser and not sure yet how to determine these 2.

As in: how to automatically determine, without human help?

That'd be an interesting puzzle to solve.

Although: even if you did solve it, the computer then has to know what to do with each column... like, how is it going handle phone number vs fax number?

RB Smissaert · Dec 11, 2023

emexes said:
As in: how to automatically determine, without human help?

That'd be an interesting puzzle to solve.

Although: even if you did solve it, the computer then has to know what to do with each column... like, how is it going handle phone number vs fax number?

Probably will need user input, that is for them to specify the column separator string at least, although if there are field names in row 1 analysing then
analysing them will often make it clear what that should be, especially if there are no characters past 127.

Handling the colums I have solved with a simple function that takes a list of arrays and produces the data types, that is text, integer, long or double.

RBS

Android Question [RESOLVED] Different results from: string.GetBytes, File.ReadBytes and RandomAccessFile.ReadBytes

Well-Known Member

Well-Known Member

Attachments

Expert

Expert

Well-Known Member

Expert

Well-Known Member

Expert

Expert

Well-Known Member

Well-Known Member

Expert

Well-Known Member

Expert

Well-Known Member

Well-Known Member

Expert

Well-Known Member

Well-Known Member

Expert

Well-Known Member

Similar Threads