Android Question Read file with UTF8 encoding

Rusty

Well-Known Member
Licensed User
Longtime User
How can I read a UTF8 encoded file from within my application?
The file was created with Notepad++ and encoded as UTF8 with no BOM (but I can change this).
I have copied the file from Windows to the Android tablet and wish to read it into a string.
Your advice is appreciated.
Thanks,
Rusty
 

Theera

Well-Known Member
Licensed User
Longtime User
Please don't bump the question. This link
 

Dim Baznr

Member
Licensed User
Longtime User

I use these two complementary functions to convert between a UTF-8 byte string (one character per byte) and a 16-bit Android string.
They are also very useful if you want to exchange strings between a PHP script and an Android app.

B4X:
' Returns a string holding one character (0-255) per UTF-8 byte of a 16-bit Android string
Sub android2utf(android As String) As String
    Dim c() As Byte = android.GetBytes("UTF-8")
    Dim utf As String = ""
    Dim m As Int = c.Length - 1
    For i = 0 To m
        ' Bytes are signed in B4A; mask to 0-255 before converting to a character
        utf = utf & Chr(Bit.And(c(i), 0xFF))
    Next
    Return utf
End Sub

' Returns a 16-bit Android string from a string holding one character per UTF-8 byte
Sub utf2android(utf As String) As String
    Dim m As Int = utf.Length-1
    Dim i As Int
    Dim android(m+1) As Byte
    For i=0 To m
        android(i) = Asc(utf.CharAt(i))
    Next
    Return BytesToString(android, 0, android.Length, "UTF-8")
End Sub
 

Dim Baznr

Member
Licensed User
Longtime User
Can you explain when these methods are useful?

It is very simple to read text with any encoding and UTF8 is the simplest one.
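
For reference, a minimal sketch of the direct approach in B4A. The folder (File.DirRootExternal), the file name "data.txt" and the sub names are assumptions for illustration only. File.ReadString decodes the file as UTF8; TextReader.Initialize2 accepts any other encoding name. The BOM check is a precaution, on the assumption that a leading BOM is not stripped automatically.

```b4x
' Sketch only: folder and file name are hypothetical.
Sub ReadUtf8File As String
    ' File.ReadString always decodes the file as UTF8
    Dim text As String = File.ReadString(File.DirRootExternal, "data.txt")
    ' If the file was saved with a BOM, drop the leading U+FEFF character
    If text.Length > 0 And text.StartsWith(Chr(0xFEFF)) Then
        text = text.SubString(1)
    End If
    Return text
End Sub

' Same idea for a file in another encoding, e.g. "ISO-8859-1"
Sub ReadFileWithEncoding(encoding As String) As String
    Dim tr As TextReader
    tr.Initialize2(File.OpenInput(File.DirRootExternal, "data.txt"), encoding)
    Dim text As String = tr.ReadAll
    tr.Close
    Return text
End Sub
```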

Hi Erel,

Suppose that we want to send a string of non-Latin characters (mixed with numeric data) via POST (or GET) to a PHP script on a remote server.

Suppose also that we want to guarantee, in some manner, the integrity of the data, so we decide to Base64-encode it (or perhaps also encrypt it internally, add a CRC, etc.).
So we must break the UTF-8 strings into single bytes (to guarantee the consistency of the encoding/decoding process).

For example, say that you try to send the string "ΔW" (uppercase Delta, W) to a PHP script.
In Android's 16-bit strings, "Δ" has the character code 916 (decimal), and a call such as

"ΔW".GetBytes("UTF-8") returns the 3-byte array [-50, -108, 87], because in B4A the Byte type is signed.

If you do the same on the PHP side:
array_slice( unpack("C*", "\0"."ΔW"), 1 ) you get the array [206, 148, 87], because in PHP the byte values are unsigned.

In the situation described, if you send encoded numeric data mixed with UTF-8 strings, the transmission is corrupted.
But if you apply these two functions before encoding or after decoding, everything goes smoothly.
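
Note that the two arrays above hold the same bit patterns: -50 and 206 are both 0xCE, read as signed and unsigned respectively. As a hedged sketch (not the method used in this post), the raw bytes can be Base64-encoded directly, and a signed Byte can be viewed as unsigned with a mask. The sub names are hypothetical, and the StringUtils library is assumed to be checked in the IDE.

```b4x
' Sketch only: requires the StringUtils library.
Sub EncodeForPost(text As String) As String
    Dim su As StringUtils
    Dim data() As Byte = text.GetBytes("UTF8")   ' "ΔW" gives [-50, -108, 87]
    Return su.EncodeBase64(data)                 ' Base64 works on the raw bytes
End Sub

Sub DecodeFromPost(b64 As String) As String
    Dim su As StringUtils
    Dim data() As Byte = su.DecodeBase64(b64)
    Return BytesToString(data, 0, data.Length, "UTF8")
End Sub

' Unsigned view of a signed Byte, e.g. -50 becomes 206
Sub ToUnsigned(b As Byte) As Int
    Return Bit.And(b, 0xFF)
End Sub
```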

Perhaps the problem could be resolved if there were a sub, something like
str.getchars("UTF8").

Certainly, an analogous procedure (data conditioning) could be done on the PHP side, leaving the B4A code intact.

Example PHP code for similar conversions:
B4X:
// returns a UTF-8 string from an array of 16-bit Android character codes (handles codes below 2048 only)
function android2utf($a){
    $s="";
    foreach ($a as $r) {
        if ($r < 128 ){ //ascii
            $s .= chr($r);
        }else{ // utf
            $z = intval($r / 64);
            $s .= chr($z+192) . chr($r - ($z-2)*64) ;
        }
    }
    return $s;
}

// returns an array of 16-bit Android character codes from a UTF-8 string (1- and 2-byte sequences only)
function utf2android($s){
    $m=strlen($s);
    $i=0;
    $r=array();
    while ($i < $m) {
        if (ord($s[$i]) < 128 ){ //ascii
            $r[] = ord($s[$i]);
            $i++;              
        }else{ // utf
            $r[] = (ord($s[$i])-194 )*64 + ord($s[$i+1]) ;
            $i +=2;
        }
    }
    return $r;
}
 

Dim Baznr

Member
Licensed User
Longtime User
Hi Erel,

I tried to analyze the issue and, after several hours of debugging (because I first had to build a client-server setup), I concluded that you're right!

The problem, in fact, lies in the Bit.ToBinaryString(num) function, which behaves strangely, ruining the encoding and causing other odd behavior. (Those problems were "magically" solved by using the two functions, and so I went down the wrong path.)

Bit.ToBinaryString(num):
if num is a negative Byte, it returns a 32-bit binary string (!)
if num is a negative Int, it returns a 16-bit binary string
if num is a negative Long, it returns a 16-bit binary string

Is this a bug?

(I think that the function would be more versatile if it were something like Bit.ToBinaryString(num, bits) with MSB zero-padding.)
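
Such a helper can be sketched in B4A itself. ToBinaryPadded is a hypothetical name, not part of the Bit library. The sketch assumes that Bit.ToBinaryString takes an Int, so a negative Byte is sign-extended to 32 bits before conversion; it keeps only the lowest bits and left-pads short results with zeros.

```b4x
' Sketch only: bits is expected to be between 1 and 32.
Sub ToBinaryPadded(num As Int, bits As Int) As String
    Dim s As String = Bit.ToBinaryString(num)
    ' Negative values come back as 32 digits; keep the lowest "bits" digits
    If s.Length > bits Then
        Return s.SubString(s.Length - bits)
    End If
    ' Positive values come back without leading zeros; pad on the left
    Do While s.Length < bits
        s = "0" & s
    Loop
    Return s
End Sub
```

With this, ToBinaryPadded(-50, 8) gives "11001110" and ToBinaryPadded(87, 8) gives "01010111".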

Thanks!
 

Phayao

Active Member
Licensed User
Longtime User

Hello, that helped me a lot! I want to read a text in Thai letters from a text file into a variable in B4A. When the text file is UTF8 encoded, the text is converted into something like: เลี้ยวขวา
When I use your utf2android function, I get the correct Thai characters.
Thanks a lot!

Chris
 