'reading string:
arrBytes = strString.GetBytes("UTF8")
'reading file, using File:
arrBytes = File.ReadBytes(tFF.strFolder, tFF.strFile)
'reading the file using RAF:
RAF.Initialize(tFF.strFolder, tFF.strFile, True)
lBytes = RAF.Size
Dim arrBytes(lBytes) As Byte
RAF.ReadBytes(arrBytes, 0, lBytes, 0)
That file (SCL.csv) looks indeed a bit more complex:
-110 10
-106 20
-96 1
-84 593649
-55 2
-37 1
-19 2
- It loads the csv file directly from the specified URL (from the hospital overview) into "temp.csv"
Simplest solution is probably ByteConverter.StringFromBytes(bytes(), "windows-1252")
Thanks, and will have a look at your code.Here is a B4J and B4A program that takes the following steps:
The program is not perfect, but it creates or overwrites a table in an specified SQLite database. In B4J the program works without any problems, in B4A the reading starts for a reason that is unclear to me. Due to the current Android restrictions on directories I could not read the result in a file exploxer. In addition, the size of the files places quite a large demand on memory. It may be worth considering a more client server-like solution in which the client only downloads a limited subset of the information to reduce the resource requirement.
- It loads the csv file directly from the specified URL (from the hospital overview) into "temp.csv"
- It replaces the separator with a tab value and writes the record to "temp2.csv"
- It reads the file "temp2.csv" in the database test.db in the table "Hospitals"
You can further expand this concept to your own wishes by making the filename, database table name and strQuery structure references multiple.
Because all fields in the database table TEXT have values, you may find it easier to define the strQuery with a loop. The current method allows for this in contrast to a different format.
arrData(c) = BytesToString(arrBytes, lSeparatorPos + 1, (i - 1) - lSeparatorPos, "UTF-8")
2. Working out somehow what the charset of the file is.
This I haven't figured out yet.
I will need to build this in the app, so users can view or import (to DB) text files without manipulating (eg in Excel) these files themselves.If this is a once-off mission done when you first use data from a new source, then use something like https://abctool.gitlab.io/encoding-detector/
If it is an ability that you want to build into your program, that is a tad harder, but:
1/ if all byte values are ASCII ie 0 thru 127 (or even better: 9, 10, 13 and 32 thru 126) then it is interpretable as US-ASCII (or as any of the three other encodings below, which are all supersets of ASCII)
2/ otherwise if all bytes beginning with high bits 11 are followed by a byte beginning with high bits 10, and all bytes beginning with high bits 10 are preceded by a byte beginning with high bit 1, then it is probably UTF-8
3/ otherwise it is probably windows-1252
but if you want to be more certain, then:
4/ if the count of IBM437 accented letters (~byte values 128 thru 165) is greater than the count of windows-1252 accented letters (byte values 192 thru 255) then it is more probably IBM437 (the original IBM PC character set, with line-drawing characters 179 thru 218 that are unlikely to appear in text data)
Do you have code that will take a byte array and produce the charset
I take it I have to look at all the bytes of the file, so not for example the first 1000 or so.
Do you have code that will take a byte array and produce the charset
Sub DetectCharacterEncoding(TextBytes() As Byte) As String
Dim InvalidUTF8Flag As Boolean = False
Dim PreviousByte As Int 'actually just the top 2 bits
Dim ThisByte As Int = 0 'actually just the top 2 bits
For I = 0 To TextBytes.Length - 1
PreviousByte = ThisByte
ThisByte = Bit.And(TextBytes(I), 0xC0) 'top 2 bits
If PreviousByte = 0xC0 Then 'if multibyte start
If ThisByte <> 0x80 Then 'not followed by multibyte continuation
InvalidUTF8Flag = True
Exit
End If
else if ThisByte = 0x80 Then 'if multibyte continuation
If PreviousByte < 0x80 Then 'not preceded by multibyte start or continuation
InvalidUTF8Flag = True
Exit
End If
End If
Next
If InvalidUTF8Flag Then
Return "windows-1252"
Else
Return "UTF-8" 'or plain ASCII
End If
End Sub
Thanks for that and your code picks out rightly that SCL.csv file (the one with triple "C") as windows-1252.I do but it's in PowerBasic.
Now in B4X (simplified by ignoring old IBM PC IBM437) :
B4X:Sub DetectCharacterEncoding(TextBytes() As Byte) As String Dim InvalidUTF8Flag As Boolean = False Dim PreviousByte As Int 'actually just the top 2 bits Dim ThisByte As Int = 0 'actually just the top 2 bits For I = 0 To TextBytes.Length - 1 PreviousByte = ThisByte ThisByte = Bit.And(TextBytes(I), 0xC0) 'top 2 bits If PreviousByte = 0xC0 Then 'if multibyte start If ThisByte <> 0x80 Then 'not followed by multibyte continuation InvalidUTF8Flag = True Exit End If else if ThisByte = 0x80 Then 'if multibyte continuation If PreviousByte < 0x80 Then 'not preceded by multibyte start or continuation InvalidUTF8Flag = True Exit End If End If Next If InvalidUTF8Flag Then Return "windows-1252" Else Return "UTF-8" 'or plain ASCII End If End Sub
Do While True
Dim L As String = TR.ReadLine
If L = Null Then
Exit
End If
L = L.Replace(Chr(65533), Chr(9)) 'invalid UTF-8 translates to Chr(65533); change those to tabs
Would it be possible to do this:
In my byte arrays rather than on strings.B4X:L = L.Replace(Chr(65533), Chr(9)) 'invalid UTF-8 translates to Chr(65533); change those to tabs
Dim BA() As Byte = bc.StringToBytes("first¬second¬third", "windows-1252")
Log("Before" & TAB & bc.StringFromBytes(BA, "windows-1252"))
Dim ChangeFrom As Byte = Asc("¬")
Dim ChangeTo As Byte = Asc(TAB)
For I = 0 To BA.Length - 1
If BA(I) = ChangeFrom Then BA(I) = ChangeTo 'standardise field separator to TAB
Next
Log("After" & TAB & bc.StringFromBytes(BA, "windows-1252"))
I started to use byte arrays as it was then easy to read a file (with RandomAccessFile) in small parts, to deal with large files.What I don't understand is why you want to use a byte array to read a CSV text file that, according to the CVS file by default, can only contain text characters. Of course the advantage of a standard is that everyone can create their own and therefore different standard with exceptions to the rule. You refer to the NHS files, but they are purely text. A text-only file can of course start with a number of bytes written before the definitions/signature of the application of that file or a different character code. A base64 image file may be included, but those image bytes are still written as base64 text characters and can be read as a text string by your app and written to the text-based SQLite database. Why do you want to use a bytre array?
I had a go with TextReader to read the file line by line, but for some reason that still gave me out of memory errors.
Yes this was with B4A, not using anything else.Was this only with B4A? I had no problems using TextReader.ReadLine in B4J on the NHS files.
Perhaps the problem is caused by .ReadLine expecting *valid* UTF-8 and choking on the -84 byte (which is *not* valid UTF-8).
I can see now that TextReader has Initialize2, which allows specifying a characterset. Not used that before andWas this only with B4A? I had no problems using TextReader.ReadLine in B4J on the NHS files.
Perhaps the problem is caused by .ReadLine expecting *valid* UTF-8 and choking on the -84 byte (which is *not* valid UTF-8).
I can see now that TextReader has Initialize2, which allows specifying a characterset.
will have a look at that.
Specifying the charset does seem to work OK. Tested with that TCL.csv file:Lol I am blind as a bat sometimes.
Definitely worth a shot.
TextReader.ReadLine should be much easier than manually doing it via byte arrays.
Although... it's still a mild worry if TextReader.ReadLine can choke on invalid input.
Sub TestTextReader
Dim oTR As TextReader
Dim lBytes As Long
Dim strLine As String
Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
"Pick a .csv or .txt file or a folder", _
File.DirRootExternal & "/PhonePats", _
Array As String("csv", "txt"), _
True)
Wait For (rs) Complete (tFF As tFolderAndFile)
RAF.Initialize(tFF.strFolder, tFF.strFile, True)
lBytes = 10000 'enough at least in this file to give me characterset
Dim arrBytes(lBytes) As Byte
RAF.ReadBytes(arrBytes, 0, lBytes, 0)
Dim rs1 As ResumableSub = DetectCharacterEncoding(arrBytes)
Wait For (rs1) Complete (strCharset As String)
Log(strCharset) '>> windows-1252
oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset)
Do While True
strLine = oTR.ReadLine
If strLine = Null Then Exit
If strLine.StartsWith("8329479") Then 'this is the line with Complete Contemporary Care Triple "C" etc.
Log(strLine)
End If
Loop
oTR.Close
End Sub
Fully tested now with TextReader and all seems OK and good performance as well:Specifying the charset does seem to work OK. Tested with that TCL.csv file:
B4X:Sub TestTextReader Dim oTR As TextReader Dim lBytes As Long Dim strLine As String Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _ "Pick a .csv or .txt file or a folder", _ File.DirRootExternal & "/PhonePats", _ Array As String("csv", "txt"), _ True) Wait For (rs) Complete (tFF As tFolderAndFile) RAF.Initialize(tFF.strFolder, tFF.strFile, True) lBytes = 10000 'enough at least in this file to give me characterset Dim arrBytes(lBytes) As Byte RAF.ReadBytes(arrBytes, 0, lBytes, 0) Dim rs1 As ResumableSub = DetectCharacterEncoding(arrBytes) Wait For (rs1) Complete (strCharset As String) Log(strCharset) '>> windows-1252 oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset) Do While True strLine = oTR.ReadLine If strLine = Null Then Exit If strLine.StartsWith("8329479") Then 'this is the line with Complete Contemporary Care Triple "C" etc. Log(strLine) End If Loop oTR.Close End Sub
Will give me this:
8329479¬1-4207346477¬Care homes and care at home¬UNKNOWN¬Visible¬False¬Complete Contemporary Care Triple "C" Ltd¬Suite 1, Regent business centre¬16 to 20 Regent Street¬¬Leeds¬¬LS2 7QA¬53.801853179931641¬-1.5328665971755981¬1-4111445769¬Complete Contemporary Care. Triple "C". Ltd¬07733800660¬¬¬¬
And that looks all fine.
RBS
Sub CountFieldsInTextLine(strCSVLine As String, strSeparator As String, strEncloser As String) As Int
Dim c As Int
Dim iEndIndex As Int
Dim bInsideQuotes As Boolean = False
For iEndIndex = 0 To strCSVLine.Length - 1
If strCSVLine.CharAt(iEndIndex) = strEncloser Then
bInsideQuotes = bInsideQuotes = False
Else
If strCSVLine.CharAt(iEndIndex) = strSeparator Then
If bInsideQuotes = False Then
c = c + 1
End If
End If
End If
Next
Return c + 1
End Sub
Sub ParseCSVLine(strCSVLine As String, iCols As Int, strSeparator As String, strEncloser As String) As String()
Dim c As Int
Dim arrValues(iCols) As String
Dim iStartIndex As Int
Dim iEndIndex As Int
Dim bInsideQuotes As Boolean = False
For iEndIndex = 0 To strCSVLine.Length - 1
If strCSVLine.CharAt(iEndIndex) = strEncloser Then
bInsideQuotes = bInsideQuotes = False
Else
If strCSVLine.CharAt(iEndIndex) = strSeparator Then
If bInsideQuotes = False Then
arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex)
iStartIndex = iEndIndex + 1
c = c + 1
End If
End If
End If
Next
arrValues(c) = strCSVLine.SubString2(iStartIndex, iEndIndex )
Return arrValues
End Sub
Sub TestTextReader
Dim c As Int 'ignore
Dim oTR As TextReader
Dim lBytes As Long
Dim strLine As String
Dim iCols As Int
Dim lstCSV As List
Dim rs As ResumableSub = PickFile("Parse a .csv or .txt file", _
"Pick a .csv or .txt file or a folder", _
File.DirRootExternal & "/PhonePats", _
Array As String("csv", "txt"), _
True)
Wait For (rs) Complete (tFF As tFolderAndFile)
StartSW(0)
RAF.Initialize(tFF.strFolder, tFF.strFile, True)
lBytes = 1000 'this is enough at least in case of the SCL.csv file
Dim arrBytes(lBytes) As Byte
RAF.ReadBytes(arrBytes, 0, lBytes, 0)
Dim rs1 As ResumableSub = DetectCharacterEncoding(arrBytes)
Wait For (rs1) Complete (strCharset As String)
Log(strCharset) '>> windows-1252
oTR.Initialize2(File.OpenInput(tFF.strFolder, tFF.strFile), strCharset)
lstCSV.Initialize
'first line will be column names
strLine = oTR.ReadLine
iCols = CountFieldsInTextLine(strLine, "¬", Chr(34))
Dim arrColumnNames() As String = ParseCSVLine(strLine, iCols, "¬", Chr(34)) 'ignore, will be used to create SQLite table
Do While True
strLine = oTR.ReadLine
If strLine = Null Then Exit
Dim arr() As String = ParseCSVLine(strLine, iCols, "¬", Chr(34))
' If arr(0) = "8329479" Then
' For c = 0 To arr.Length - 1
' Log(arr(c))
' Next
' End If
lstCSV.Add(arr) 'will need to move list to DB if gets too big and Dim a new list
Loop
Log("lstCSV.Size: " & lstCSV.Size & ", " & FormatMilliSecs(StopSW(0), False)) 'lstCSV.Size: 28269, 110 milli-seconds
oTR.Close
End Sub
Not sure it is all solved as I specified ¬ as the value separator and Chr(34) as the value encloser and not sure yet how to determine these 2.
Probably will need user input, that is for them to specify the column separator string at least, although if there are field names in row 1 analysing thenAs in: how to automatically determine, without human help?
That'd be an interesting puzzle to solve.
Although: even if you did solve it, the computer then has to know what to do with each column... like, how is it going handle phone number vs fax number?
We use cookies and similar technologies for the following purposes:
Do you accept cookies and these technologies?
We use cookies and similar technologies for the following purposes:
Do you accept cookies and these technologies?