Bug? Using Map with a text file encoded as UTF-8 without BOM

Theera · Sep 24, 2013

Hi all,
I found a problem when using Map with a text file encoded as UTF-8 without BOM. I wrote some simple code to test it, attached below.

P.S. Doing the same with a List works without problems.

Erel · Sep 24, 2013

Unlike all other methods, File.ReadMap / WriteMap use ISO 8859-1 encoding by default and this cannot be changed. Note that if you do not need a text file you can use RandomAccessFile.WriteObject (or the KeyValueStore class) instead.
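
A minimal sketch of the RandomAccessFile alternative, assuming the RandomAccessFile library is referenced in the project. The sub names and the file name "dictionary.dat" are hypothetical. The resulting file is binary, so it cannot be edited by hand in a text editor.

```b4x
' Hypothetical helpers; require the RandomAccessFile library.
' "dictionary.dat" is an assumed file name used only for illustration.
Sub SaveDictionary(words As Map)
    Dim raf As RandomAccessFile
    raf.Initialize(File.DirInternal, "dictionary.dat", False) ' False = read/write
    raf.WriteObject(words, True, 0) ' compressed, written at position 0
    raf.Close
End Sub

Sub LoadDictionary As Map
    Dim raf As RandomAccessFile
    raf.Initialize(File.DirInternal, "dictionary.dat", True) ' True = read-only
    Dim words As Map = raf.ReadObject(0) ' read the object stored at position 0
    raf.Close
    Return words
End Sub
```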

Theera · Sep 24, 2013

I need to use a text file so that words can be added to the dictionary used by my ThaiKaraokeTranslator library. Thai users could then add new words to the text file by themselves. Please advise me on how I should do this.

P.S. 98% of Thai people are not good at English (me included).

Erel · Sep 24, 2013

If you can use ISO-8859-1 encoding then it will work. If you want to use UTF8 then you will need to manually parse the file. You can read it with File.ReadList and then go over the lines and split them with Regex.Split.
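
A minimal sketch of the manual parsing, assuming one key=value pair per line and no escape sequences in the file. The sub name ReadUtf8KeyValues is hypothetical. It uses IndexOf instead of Regex.Split so that only the first "=" separates the key from the value.

```b4x
' Hypothetical helper: reads "key=value" lines from a UTF-8 text file into a Map.
' Assumes one pair per line; lines starting with # and lines without "=" are skipped.
Public Sub ReadUtf8KeyValues(Dir As String, FileName As String) As Map
    Dim result As Map
    result.Initialize
    Dim lines As List = File.ReadList(Dir, FileName) ' ReadList decodes the file as UTF-8
    For Each line As String In lines
        ' Split on the first "=" only, so the value may itself contain "="
        Dim pos As Int = line.IndexOf("=")
        If pos > 0 And line.StartsWith("#") = False Then
            result.Put(line.SubString2(0, pos).Trim, line.SubString(pos + 1).Trim)
        End If
    Next
    Return result
End Sub
```

It could be called as ReadUtf8KeyValues(File.DirAssets, "dictionary.txt"), where the file name is again only an example.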

Theera · Sep 24, 2013

I think that if I use ISO-8859-1 encoding, the data will not display Thai correctly when I need to show it. The other way, using File.ReadList and Regex, interests me. One question: could you change the encoding mentioned in post #2 in the next version?

P.S. I think a text file is the best easy database for simple developers, because they are not experts.

Thank you in advance.

Erel · Sep 25, 2013

It is not possible to change the encoding in this case.

Johnmcenroy · Nov 25, 2013

In my app I had a similar problem. I needed maps in various languages in UTF-8, but Map only supports ISO-8859-1. It is possible to do this. A Map file is a Java properties file, and it can be translated into various languages, including Chinese. Here is a tool:
https://code.google.com/a/eclipselabs.org/p/tapiji/

Also, from Wikipedia:
The encoding of a .properties file is ISO-8859-1, also known as Latin-1. All non-Latin-1 characters must be entered by using Unicode escape characters, e.g. \uHHHH where HHHH is a hexadecimal index of the character in the Unicode character set. This allows for using .properties files as resource bundles for localization. A non-Latin-1 text file can be converted to a correct .properties file by using the native2ascii tool that is shipped with the JDK or by using a tool, such as po2prop, that manages the transformation from a bilingual localization format into .properties escaping.

And the easiest way is Tapiji Translator.
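
The \uHHHH escaping described in the Wikipedia excerpt can also be sketched directly in B4X. The sub name EscapeUnicode is hypothetical, and this is only an approximation of what native2ascii does: it escapes every non-ASCII character (code above 127). Escaping the 128 to 255 range as well is harmless and keeps the file pure ASCII.

```b4x
' Hypothetical helper: replaces every non-ASCII character with a \uHHHH escape,
' so the result can be stored in an ISO-8859-1 properties file.
Public Sub EscapeUnicode(s As String) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To s.Length - 1
        Dim c As Char = s.CharAt(i)
        Dim code As Int = Asc(c)
        If code > 127 Then
            Dim hexCode As String = Bit.ToHexString(code)
            ' Left-pad to four hex digits
            Do While hexCode.Length < 4
                hexCode = "0" & hexCode
            Loop
            sb.Append("\u").Append(hexCode)
        Else
            sb.Append(c)
        End If
    Next
    Return sb.ToString
End Sub
```

For example, the Thai letter ก (U+0E01) would be written as \u0e01.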

LucaMs · Apr 24, 2015

Erel said:
If you can use ISO-8859-1 encoding then it will work. If you want to use UTF8 then you will need to manually parse the file. You can read it with File.ReadList and then go over the lines and split them with Regex.Split.

I'm trying to create a function to read a UTF-8 text file into a Map, with the help of this UnescapeUnicode function posted by Erel.
Can you find the error, please?

Thank you

[P.S. The Map being read was saved using File.WriteMap.]

B4X:

Public Sub ReadUTF8Map(Dir As String, FileName As String) As Map
    If Not(File.Exists(Dir, FileName)) Then Return Null

    Dim mapResult As Map : mapResult.Initialize
    Dim lstLines As List = File.ReadList(Dir, FileName)
    Dim LineTexts(2) As String
    Dim Line As String
    Dim LineFirstChar As String
    Dim EqualCharPos As Int

    For i = 0 To lstLines.Size - 1
        Line = lstLines.Get(i)
        Log(Line)
        Line = UnescapeUnicode(Line) ' <--- Error
        Log(Line)
        EqualCharPos = Line.IndexOf("=")
        ' Skip blank lines and lines without "=" (SubString2 would fail on them)
        If EqualCharPos > -1 Then
            ' First character of the line itself: index 0, not the loop index i
            LineFirstChar = Line.SubString2(0, 1)
            If LineFirstChar <> "#" AND LineFirstChar <> "!" Then ' if it is not a comment line
'                LineTexts = Regex.Split(Line, "=") ' this does not work
                ' Split on the first "=" only, so values may contain "="
                LineTexts(0) = Line.SubString2(0, EqualCharPos)
                LineTexts(1) = Line.SubString(EqualCharPos + 1)
                mapResult.Put(LineTexts(0), LineTexts(1))
            End If
        End If
    Next

    Return mapResult
End Sub



Public Sub UnescapeUnicode(s As String) As String
   Dim sb As StringBuilder
   sb.Initialize
   Dim i As Int
   Do While i < s.Length
      Dim c As Char = s.CharAt(i)
      If c = "\" AND i < s.Length - 1 AND s.CharAt(i + 1) = "u" Then
         Dim unicode As StringBuilder
         unicode.Initialize
         i = i + 2
         Do While i < s.Length
            Dim cc As String = s.CharAt(i)
            Dim n As Int = Asc(cc.ToLowerCase)
            If (n >= Asc("0") AND n <= Asc("9")) OR (n >= Asc("a") AND n <= Asc("z")) Then
               unicode.Append(s.CharAt(i))
            Else
               i = i - 1
               Exit
            End If
            i = i + 1
         Loop
         sb.Append(Chr(Bit.ParseInt(unicode.ToString, 16)))
      Else
         sb.Append(c)
      End If
      i = i + 1
   Loop
   Return sb.ToString
End Sub

LucaMs · Apr 24, 2015

After modifying UnescapeUnicode, it now seems to work:

B4X:

Public Sub UnescapeUnicode(s As String) As String
    Dim sbResult As StringBuilder
    sbResult.Initialize
    Dim Unicode As String
    Dim HexCharsOnly As Boolean
    Dim n As Int

    Dim i As Int
    Do While i < s.Length
        Dim c As Char = s.CharAt(i)
        ' A \uXXXX escape needs six characters, so check that they all exist
        If c = "\" AND i + 5 < s.Length AND s.CharAt(i + 1) = "u" Then
            Unicode = s.SubString2(i + 2, i + 6)
            ' Check that the four characters after \u are hex digits (0-9, a-f)
            HexCharsOnly = True
            For k = 0 To 3
                n = Asc(Unicode.ToLowerCase.CharAt(k))
                HexCharsOnly = (n >= Asc("0") AND n <= Asc("9")) OR (n >= Asc("a") AND n <= Asc("f"))
                If Not(HexCharsOnly) Then Exit
            Next
            If HexCharsOnly Then
                sbResult.Append(Chr(Bit.ParseInt(Unicode, 16)))
                i = i + 5 ' skip \uXXXX; the final i = i + 1 covers the sixth character
            Else
                sbResult.Append(c) ' not a valid escape: keep the backslash as it is
            End If
        Else If c = "\" AND i < s.Length - 1 AND s.CharAt(i + 1) = " " Then
            sbResult.Append(" ") ' "\ " is an escaped space
            i = i + 1
        Else
            sbResult.Append(c)
        End If
        i = i + 1
    Loop
    Return sbResult.ToString
End Sub
