Android Code Snippet Unescape Unicode sequences for Spanish language

GeoT · Sep 15, 2023

Erel created a sub to unescape (decode) Unicode sequences, that is, to turn escapes of the form \uXXXX into real characters:
https://www.b4x.com/android/forum/t...-results-i-cant-read-hebrew.27461/post-159533
It is useful for processing text scraped from web pages, for example.

But he wrote it to handle Hebrew text, which can contain several Unicode sequences in a row.
I have adapted it for Spanish by adding the condition

B4X:

And unicode.Length < 4

So that it only processes the 6 characters that make up a Unicode sequence (\u plus four hexadecimal digits). This avoids errors when the sequence is followed by more characters between 0 and 9 or between a and f.
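A small example of why the limit matters, as a sketch that assumes the UnescapeUnicode sub shown in this post is in the same module. In "España" the letter right after the escape is "a", which is also a valid hexadecimal digit:

```b4x
' Assumes the UnescapeUnicode sub from this post is available in the same module.
' Without the 4-digit limit the parser would read "00f1a" (code point 0x0F1A)
' instead of "00f1" (ñ), and the trailing "a" would be lost.
Log(UnescapeUnicode("Espa\u00f1a"))   'prints: España
```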

With that change, his code looks like this:

B4X:

Log(UnescapeUnicode("p\u00fablico.")) 'prints: público.    (Or pass a text with several words, in Unicode or not)

Sub UnescapeUnicode(s As String) As String
 
    Dim sb As StringBuilder
    sb.Initialize
    Dim i As Int
 
    Do While i < s.Length
        Dim c As Char = s.CharAt(i)
        If c = "\" And i < s.Length - 1 And s.CharAt(i + 1) = "u" Then
            Dim unicode As StringBuilder
            unicode.Initialize
            i = i + 2
       
            Do While i < s.Length
           
                Dim cc As String = s.CharAt(i)
                Dim n As Int = Asc(cc.ToLowerCase)
           
                'Accept at most 4 hexadecimal digits (0-9, a-f) after \u
                If ((n >= Asc("0") And n <= Asc("9")) Or (n >= Asc("a") And n <= Asc("f"))) And unicode.Length < 4 Then
                    unicode.Append(s.CharAt(i))
'                    Log(unicode.ToString)
'                    Log(unicode.Length)
                Else
                    i = i - 1
                    Exit
                End If

                i = i + 1
            Loop
       
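            If unicode.Length > 0 Then
                sb.Append(Chr(Bit.ParseInt(unicode.ToString, 16)))
            Else
                sb.Append("\u")   'No hex digits followed \u: keep it as literal text
            End If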
        Else
            sb.Append(c)
        End If
        i = i + 1
    Loop
 
    Return sb.ToString
End Sub

GeoT · Sep 15, 2023

I have written a second way to do the same thing, this time with a regular expression:

B4X:

Log(DecodeUnicode("canci\u00f3n p\u00fablica"))    'prints: canción pública

Sub DecodeUnicode(strOriginal As String) As String
   
    ' Pattern to find Unicode escape sequences like \uXXXX
    Dim m As Matcher
    m = Regex.Matcher("\\u[0-9a-fA-F]{4}", strOriginal)   'Double backslash matches a literal '\' in the regular expression

    Dim resultBuilder As StringBuilder
    resultBuilder.Initialize
   
    Dim currentIndex As Int   'To track the current position in the text
   
    Do While m.Find
        Dim match As String
        match = m.Match
'        LogColor(match, Colors.Green)
       
        If match <> "" Then
           
            ' Take actions with the matches found
'            Log("Match found in position: " & m.GetStart(0))        'Match Positions
           
            ' Append the unmatched text between the current position and the start of this match
            resultBuilder.Append(strOriginal.SubString2(currentIndex, m.GetStart(0)))
           
            ' Add the substitute character to the StringBuilder
            Dim unicodeValue As Int
            unicodeValue = Bit.ParseInt(match.SubString(2), 16)  'Convert the hex digits to an integer, omitting the leading '\u'
            Dim charValue As String
            charValue = Chr(unicodeValue)  'Convert Unicode value to normal character
            resultBuilder.Append(charValue)
           
            ' Updates current position at the end of the match
            currentIndex = m.GetEnd(0)
        End If
    Loop
   
    ' Append any remaining text after the last match
    If currentIndex < strOriginal.Length Then
        resultBuilder.Append(strOriginal.SubString(currentIndex))
    End If
   
    ' Now you have all the characters (matches and non-matches) in resultBuilder  
    Return resultBuilder.ToString
End Sub
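A couple of extra calls to illustrate how the pattern behaves, as a sketch that assumes the DecodeUnicode sub above is in the same module. Because the pattern requires exactly four hexadecimal digits, a "\u" that is not followed by four of them (as in a Windows-style path) is copied through unchanged, and several sequences in a row are each decoded separately:

```b4x
' Assumes the DecodeUnicode sub from this post is available in the same module.
' B4X string literals do not treat "\" as an escape character,
' so every backslash below is a literal backslash.
Log(DecodeUnicode("C:\users\ni\u00f1o"))   'prints: C:\users\niño
Log(DecodeUnicode("\u00e1\u00e9\u00ed"))   'prints: áéí
```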

GeoT · Sep 15, 2023

I see that they also work for other Romance languages.
But I don't know whether they also work for other kinds of languages.

I would appreciate confirmations or comments.

byz · Nov 25, 2023

GeoT said:
I see that they also work for other Romance languages.
But I don't know whether they also work for other kinds of languages.

I would appreciate confirmations or comments.

Hi, Chinese is OK!

GeoT · Nov 25, 2023

Hi byz!
Ok, good!
Thank you for your info.
