Android Question Read file with UTF8 encoding

Rusty · Oct 30, 2013

How can I read a UTF-8 encoded file from within my application?
The file was created with Notepad++ and encoded as UTF-8 without a BOM (but I can change this).
I have copied the file from Windows to the Android tablet and want to read it into a string.
Your advice is appreciated.
Thanks,
Rusty

NJDude · Oct 30, 2013

Check File.GetText

Theera · Oct 31, 2013

Please don't bump the question. This link

Dim Baznr · Aug 23, 2014

I use these two complementary functions to convert between a string that holds one UTF-8 byte per character and a normal 16-bit Android string.
They are also very useful if you want to exchange strings between a PHP script and an Android app.

B4X:

' Returns a String in which each Char holds one byte (0-255) of the
' UTF-8 encoding of the given 16-bit Android String
Sub android2utf(android As String) As String
    Dim c() As Byte = android.GetBytes("UTF-8")
    Dim utf As String = ""
    Dim m As Int = c.Length-1
    For i=0 To m
        ' Bytes are signed in B4A, so mask to 0-255 before converting to a Char
        utf = utf & Chr(Bit.And(c(i), 0xFF))
    Next
    Return utf
End Sub

' Returns a 16-bit Android String from a String that holds one UTF-8 byte per Char
Sub utf2android(utf As String) As String
    Dim m As Int = utf.Length-1
    Dim i As Int
    Dim android(m+1) As Byte
    For i=0 To m
        ' Only the low 8 bits of each Char are kept
        android(i) = Asc(utf.CharAt(i))
    Next
    Return BytesToString(android, 0, android.Length, "UTF-8")
End Sub

Erel · Aug 24, 2014

Can you explain when these methods are useful?

It is very simple to read text with any encoding and UTF8 is the simplest one.
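
As a rough illustration of the point above, here is a minimal plain-Java sketch (not B4A code) of reading a file with an explicit UTF-8 decoding step. The class and method names are hypothetical. It also drops a leading BOM, on the assumption that the file may or may not have been saved with one.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8FileReader {
    // Decodes raw bytes as UTF-8 and drops a leading BOM if one is present.
    static String decodeUtf8(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        // The UTF-8 BOM (bytes EF BB BF) decodes to the single char U+FEFF.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text;
    }

    // Reads the whole file into memory, then decodes it.
    static String readUtf8(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return decodeUtf8(buf.toByteArray());
        }
    }
}
```

The key step is naming the charset explicitly when the bytes are turned into a string, so the result does not depend on a platform default.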

Dim Baznr · Aug 24, 2014

Hi Erel,

Suppose that we want to send a string of non-Latin characters (mixed with numeric data) via POST (or GET) to a PHP script on a remote server.

Suppose also that we want to guarantee the integrity of that data in some manner, so we decide to Base64-encode it (or perhaps also encrypt it, add a CRC, etc.).
So we must break the UTF strings into single bytes (to guarantee the consistency of the encoding/decoding process).

For example, say that you try to send the string "ΔW" (uppercase delta, W) to a PHP script.
In a 16-bit Android string, "Δ" is stored as code point 916 (decimal), and a call such as

"ΔW".getbytes("UTF-8") returns the 3-byte array [-50, -108, 87], because in B4A the Byte type is signed.

If you do the same thing on the PHP side:
array_slice( unpack("C*", "\0"."ΔW"), 1 ) you get the array [206, 148, 87], because in PHP the byte "type" is unsigned.

In the situation described, if you send encoded numeric data mixed with UTF strings, the transmission is faulty.
But if you use these two functions before encoding or after decoding, everything goes smoothly.

Perhaps the problem could be resolved, if there was a sub, something like
str.getchars("UTF8").

Certainly, an analogous procedure (data conditioning) could be done on the PHP side, leaving the B4A code intact.

Example PHP code for similar conversions:

PHP:

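// Returns a UTF-8 byte string built from an array of 16-bit character codes.
// Handles only codes below 2048 (1- and 2-byte UTF-8 sequences).
function android2utf($a){
    $s="";
    foreach ($a as $r) {
        if ($r < 128 ){ // ASCII: one byte
            $s .= chr($r);
        }else{ // two-byte sequence: 110xxxxx 10xxxxxx
            $z = intval($r / 64); // upper bits of the code
            // lead byte = 192 + upper bits, trail byte = 128 + lower 6 bits
            $s .= chr($z+192) . chr($r - ($z-2)*64) ;
        }
    }
    return $s;
}

// Returns an array of 16-bit character codes from a UTF-8 byte string.
// Handles only 1- and 2-byte UTF-8 sequences.
function utf2android($s){
    $m=strlen($s);
    $i=0;
    $r=array();
    while ($i < $m) {
        if (ord($s[$i]) < 128 ){ // ASCII: one byte
            $r[] = ord($s[$i]);
            $i++;
        }else{ // two-byte sequence
            // same as (lead - 192)*64 + (trail - 128)
            $r[] = (ord($s[$i])-194 )*64 + ord($s[$i+1]) ;
            $i +=2;
        }
    }
    return $r;
}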

Erel · Aug 24, 2014

Dim Baznr said:
In the situation described, if you send encoded numeric data mixed with UTF strings, the transmission is faulty.

This is not correct. It doesn't matter whether the bytes are signed or unsigned. The same value is sent.
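
To make the signed/unsigned point concrete, here is a small plain-Java sketch (B4A's Byte behaves like Java's signed byte). The class and method names are made up for illustration, and `java.util.Base64` is used only as a convenient stand-in for whatever Base64 routine the app actually uses. The byte printed as -50 and the byte PHP reports as 206 are the same eight bits, so the Base64 text that goes over the wire is identical either way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class, for illustration only.
final class SignedByteDemo {
    // Unsigned view (0..255) of a signed byte (-128..127); the bits are unchanged.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Base64 text of the UTF-8 bytes of a string.
    static String utf8Base64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "\u0394W" is the "ΔW" string from the example above.
        for (byte b : "\u0394W".getBytes(StandardCharsets.UTF_8)) {
            System.out.println(b + " is the same byte as " + unsigned(b));
        }
        System.out.println(utf8Base64("\u0394W"));
    }
}
```

Decoding that Base64 text on the PHP side yields the bytes 206, 148, 87, which is exactly what PHP sees for a UTF-8 "ΔW".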

Dim Baznr · Aug 25, 2014

Hi Erel,

I tried to analyze the issue, and after several hours of debugging (because I first had to make up a client-server setup) I concluded that you're right!

The problem, in fact, lies in the Bit.ToBinaryString(num) function, which has a strange behavior that ruins the encoding and causes other strange behaviors. (Those problems were "magically" solved by using the two functions, and so I took the... wrong way...)

Bit.ToBinaryString(num):
if num is a negative Byte, it returns a 32-bit binary string (!)
if num is a negative Int, it returns a 16-bit binary string
if num is a negative Long, it returns a 16-bit binary string

Is this a bug?

(I think the function would be more versatile if it were something like Bit.ToBinaryString(num, bits) with MSB zero-padding.)

Thanks!

Erel · Aug 25, 2014

You can read the full description of this method here: http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#toBinaryString(int)

You shouldn't use it for bit calculations.
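
For reference, a short Java sketch of what the linked method does with a negative byte, plus one possible way to get a fixed-width result. The helper name `toBits` is hypothetical. A byte is widened to a 32-bit int before the call, so a negative byte is sign-extended and comes out as 32 digits. Masking with 0xFF first and then left-padding gives eight digits.

```java
// Hypothetical demo class, for illustration only.
final class BinaryStringDemo {
    // Eight-character, zero-padded binary form of one byte.
    static String toBits(byte b) {
        // Mask first so the sign extension is dropped.
        String bits = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(bits.length()) + bits;
    }

    public static void main(String[] args) {
        byte b = (byte) -50; // same bits as unsigned 206
        System.out.println(Integer.toBinaryString(b)); // sign-extended to 32 digits
        System.out.println(toBits(b));                 // fixed 8 digits
    }
}
```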

Phayao · Jun 25, 2015

Hello, that helped me a lot! I want to read text in Thai letters from a text file into a variable in B4A. When the text file is UTF-8 encoded, the text is converted into something like: à¹à¸¥à¸µà¹à¸¢à¸§à¸à¸§à¸²
When I use Dim Baznr's utf2android function, I get the correct Thai text.
Thanks a lot!

Chris
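
The garbled text above looks like UTF-8 bytes that were decoded with a single-byte charset, which would explain why a byte-by-byte conversion repairs it. The plain-Java sketch below reproduces that effect and its reversal. It assumes the wrong charset was ISO-8859-1, and the class and method names are hypothetical. The cleaner fix is to read the file as UTF-8 in the first place.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class MojibakeDemo {
    // Simulates the fault: UTF-8 bytes decoded as ISO-8859-1, one char per byte.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses it: turn each char back into one byte, then decode as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String thai = "\u0E44\u0E17\u0E22"; // three Thai letters, three UTF-8 bytes each
        String garbled = misread(thai);
        System.out.println(garbled.length());            // one char per UTF-8 byte
        System.out.println(repair(garbled).equals(thai));
    }
}
```

If the file was actually misread as Windows-1252 rather than ISO-8859-1, a few byte values map to different characters and this exact round trip may not restore every letter.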

moster67 · Jun 14, 2016

If you don't know the character encoding beforehand, my ICUB4A library may help:

https://www.b4x.com/android/forum/threads/icub4a-detecting-character-encoding-formats.65411/
