Moving data from a text file to SQLite database.
If I have a line in the text file like for example this: 123, "John,", Smith
Then what would be the fastest way to split this line in these 3 fields:
123
John,
Smith
As these files can be large, I need to process the file line by line, so I can't use for example StringUtils.
Below is an optimised CSV line parsing algorithm which avoids intermediate copies in a StringBuilder by using start and end indexes in the original string, and which uses the Char type for the current character rather than String. ( chr(34) = QUOTE and chr(44) = COMMA)
Compared with the algorithm proposed in #14, this saves around 50% of the execution time.
B4X:
Sub ParseCSVLine(csvLine As String) As List
Dim values As List
values.Initialize
Dim startIndex, endIndex As Int
Dim insideQuotes As Boolean = False
Dim c As Char
For endIndex = 0 To csvLine.Length - 1
c = csvLine.CharAt(endIndex)
If c = Chr(34) Then
insideQuotes = Not (insideQuotes)
Else
If (c = Chr(44)) And Not (insideQuotes) Then
values.Add(csvLine.SubString2(startIndex, endIndex).trim)
startIndex = endIndex + 1
End If
End If
Next
values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
Return values
End Sub
The algorithm, yours and mine, requires tweaking to account for escaped quotes inside quotes.
Also, a more general approach should allow for multi-character separators.
Here is a test string to consider:
"Far and wide" and "Lynx, Cat, Tiger" and "A double quote -> \" and a single quote -> '" and and
Split by "and" to yield
Far and wide
Lynx, Cat, Tiger
A double quote -> " and a single quote -> '
-empty-
-empty-
Below is an optimised CSV line parsing algorithm which avoids intermediate copies in a StringBuilder by using start and end indexes in the original string, and which uses the Char type for the current character rather than String. ( chr(34) = QUOTE and chr(44) = COMMA)
Compared with the algorithm proposed in #14, this saves around 50% of the execution time.
B4X:
Sub ParseCSVLine(csvLine As String) As List
Dim values As List
values.Initialize
Dim startIndex, endIndex As Int
Dim insideQuotes As Boolean = False
Dim c As Char
For endIndex = 0 To csvLine.Length - 1
c = csvLine.CharAt(endIndex)
If c = Chr(34) Then
insideQuotes = Not (insideQuotes)
Else
If (c = Chr(44)) And Not (insideQuotes) Then
values.Add(csvLine.SubString2(startIndex, endIndex).trim)
startIndex = endIndex + 1
End If
End If
Next
values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
Return values
End Sub
Fast CSV decoding by not using a StringBuilder and by limiting implicit type conversions.
Compared with the previous version, this version adds a few lines to handle escape quotes (a quote inside a string delimited by quotes must be doubled).
B4X:
Sub ParseCSVLine(csvLine As String) As List
Dim const qotecar As Char = Chr(34)
Dim const separator As Char = Chr(44)
Dim startIndex, endIndex As Int
Dim insideQuotes, DoubleQuote As Boolean
Dim c As Char
Dim values As List : values.Initialize
For endIndex = 0 To csvLine.Length - 1
c = csvLine.CharAt(endIndex)
If c = qotecar Then
If endIndex < csvLine.Length-1 And csvLine.CharAt(endIndex+1) = qotecar Then
endIndex = endIndex +1
DoubleQuote = True
Else
insideQuotes = Not (insideQuotes)
End If
Else
If (c = separator ) And Not (insideQuotes) Then
If DoubleQuote Then
DoubleQuote = False
values.Add(csvLine.SubString2(startIndex, endIndex).trim.Replace(qotecar & qotecar ,qotecar ))
Else
values.Add(csvLine.SubString2(startIndex, endIndex).trim)
End If
startIndex = endIndex + 1
End If
End If
Next
values.Add(csvLine.SubString2(startIndex, endIndex ).trim)
Return values
End Sub