Android Question Discard Unicode In Listview

bocker77 · Sep 20, 2022

I use CSV data downloaded from a website, which I import into a database. One of the fields is a string that can contain Unicode characters, and I add these strings to a Listview. Some of these characters are displayed in the Listview as either a box or the replacement character �. I can handle the replacement character easily enough by replacing it with a single quote, but I can't seem to handle the others. The culprits are typically the left and right double quotes and a few others. When sending the strings (variables) to the B4A Log, I noticed that the offending characters are discarded. I am wondering what the Log command does to achieve this. I was going to use the log so that I could see the hex codes in an editor, but none of those characters show up.

Thanks,
Greg

agraham · Sep 20, 2022

I suspect that your csv data is somehow being converted to ASCII when you download it, as Android strings consist of Unicode UTF-16 characters and should handle it with no problem.

bocker77 · Sep 20, 2022

Here is what it looks like in the csv file before importing into a database.

Lincolnﾒs Farewell to Springfield

but shows up in the database as BLOB using DBBrowser for SQLite.

When viewed in the Listview it displays the "ﾒ" as a box. All the other ASCII characters are displayed.

I use sqlite3.exe in a VBScript to import into a new database. That database file is then brought into my app by an SMB2 function. Maybe I can find an encoding that replaces those characters with �. I would also have to find out how sqlite3.exe can use the new encoding value if I find one. When importing using "DB Browser for SQLite" they all show up as the replacement character �. If I can get sqlite3.exe to do that, then I can handle it with a string replacement. I will contact the SQLite forum and ask this question.

Still this doesn't answer the question as how the B4A Log command discards the Unicode. However that is done would be nice to know.

agraham · Sep 20, 2022

bocker77 said:
Still this doesn't answer the question as how the B4A Log command discards the Unicode.

I don't think it does. I think it's converted before getting to a string in B4A. When you convert Unicode to 7 or 8 bit Windows code pages, the out-of-page characters appear as question marks or boxes.

teddybear · Sep 21, 2022

This is a text encoding issue. Manually download the csv data to a local file and check its encoding using Notepad; I guess it might be ANSI or UNICODE. Save it again with UTF-8 encoding, then load it into the database and see if it is OK.

Erel · Sep 21, 2022

You can use TextReader together with CSVParser to load non-UTF8 CSV files: https://www.b4x.com/android/forum/threads/b4x-csvparser-csv-parser-and-generator.110901/#content
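A minimal sketch of that suggestion, assuming the CSVParser class module from the linked thread has been added to the project. The file name markers.csv, the sub name LoadMarkers and the "Windows-1252" encoding are all placeholders for illustration; substitute whatever encoding the file actually uses.

```b4x
' Sketch: read a CSV file with an explicit encoding, then parse it.
' Assumptions: "markers.csv" sits in File.DirAssets and is Windows-1252 encoded.
Sub LoadMarkers As List
    Dim tr As TextReader
    tr.Initialize2(File.OpenInput(File.DirAssets, "markers.csv"), "Windows-1252")
    Dim text As String = tr.ReadAll
    tr.Close
    Dim parser As CSVParser
    parser.Initialize
    ' True = skip the header row; each list item is a String array (one row).
    Return parser.Parse(text, ",", True)
End Sub
```

Choosing the encoding only helps if the bytes in the file are valid in that encoding. If the file already contains the wrong characters, no reader setting will restore them.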

bocker77 · Sep 29, 2022

Let me try this again and see if I can explain this a little better because I am so confused. Some of the data in my database looks like these lines below which are in utf8. These lines are originally coming from HTML downloaded in a csv file.

Lincolnﾒs Farewell to Springfield (the unprintable character is a right single quote, hex code "0xef 0xbe 0x92")
The K�enster Building (the unprintable character is an umlaut "u", hex code "0xef 0xbf 0xbd")

From the forum I use this code.

B4X:

bFix = strTitle.GetBytes("utf8")
strTitle = BytesToString(bFix, 0, bFix.Length, "windows-1252")

Then I get this. BTW character set "Windows-1252" is the closest I come in seeing some of the characters.

Lincolnï¾’s Farewell to Springfield (the right single quote is displayed along with garbage in front, hex code "0xc3 0xaf 0xc2 0xbe 0xe2 0x80 0x99")
The Kï¿½enster Building (the umlaut "u" and btw all umlauts look like this, hex code "0xc3 0xaf 0xc2 0xbf 0xc2 0xbd")

For the first one I can do something like this but that just doesn't seem that I should need to do this.

B4X:

strTitle = strTitle.Replace("ï¾", "")

I have tried numerous encodings and am getting nowhere. As you can see, I am not proficient in encoding/decoding techniques. Also I am not sure why this is so hard.

DonManfred · Sep 29, 2022

bocker77 said:
These lines are originally coming from HTML downloaded in a csv file.

Can you post the csv file please? With its encoding UNCHANGED.

teddybear · Sep 29, 2022

First you should check what encoding the downloaded csv file uses, as that is the key to converting it. You can see the encoding using Notepad or Notepad++.

bocker77 · Sep 29, 2022

I actually think that it is how the csv file is saved. I will attach the csv file but feel I need to see if the editor of the website could save his HTML data in the Title field a different way.

bocker77 · Sep 29, 2022

I am not seeing the csv file being attached. Let me zip it and see if that makes a difference. That hopefully fixed it.

bocker77 · Sep 29, 2022

BTW I use Apache Open Office Calc to view it but I do not save it. It, by default, uses utf8.

bocker77 · Sep 29, 2022

I just came across this article and am wondering if I should kindly ask the editor of the website to save their csv files this way; maybe that would solve the problem.

https://support.meistertask.com/hc/en-us/articles/4406395262354-How-Do-I-Encode-My-CSV-File-Using-the-UTF-8-Format-

emexes · Sep 29, 2022

My first pass was a ~~fail~~ success (turns out I can't count bits accurately).

The what-seems-to-be-an-apostrophe in Lincoln's is a three-byte UTF-8 sequence:

hex: EF BE 92
binary: 1110 1111 : 1011 1110 : 1001 0010

which should be Unicode character:

binary: 1111 111110 010010 = 1111 1111 1001 0010
hex: FF92

which is:

ﾒ Halfwidth Katakana Letter Me
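The hand decoding above can be cross-checked in code. This is only a sketch: it decodes the same three bytes as UTF-8 and logs the code point of the first resulting Char in hex, which should match the FF92 worked out above if the bit arithmetic is right.

```b4x
' Decode the three bytes as UTF-8 and log the resulting code point in hex.
Dim b() As Byte = Array As Byte(0xEF, 0xBE, 0x92)
Dim s As String = BytesToString(b, 0, b.Length, "UTF8")
Log(s.Length)                            ' number of Chars in the decoded string
Log(Bit.ToHexString(Asc(s.CharAt(0))))   ' hex value of the first Char
```

Looping over every Char of a title in the same way gives a hex dump of the string without relying on how the log or an editor renders the characters.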

bocker77 · Sep 30, 2022

Here is the website that I download the csv files from. To view one of these markers you can use the number in the first column in an advanced search on the site.

The Historical Marker Database: www.hmdb.org

emexes · Sep 30, 2022

bocker77 said:
When viewed in the Listview it displays the "ﾒ" as a box. All the other ascii characters are displayed.

My first guess would be that the font used in the Listview does not contain glyphs for all ~140,000 Unicode characters, and that "ﾒ" is one of the missing glyphs.

When I first load the file into Windows Notepad it shows as:

but when I change the font to something more comprehensive, like Arial:

then, instead of the "unknown" character displaying as a placeholder character, it now shows correctly(?) :

bocker77 · Sep 30, 2022

emexes,

Yes, I saw what you have discovered, but getting to the "right quote" requires the string replace that I noted above. That seems to work for those characters ("left quote", "left double quote", etc.), but the umlauts are a different story.

The website's content is user contributed and stored in their database, then saved to a csv file, so whoever adds a historical marker can enter Unicode characters in the titles. The problem seems to be how the csv file is saved. I get whatever their csv download function provides. As stated, the Listview in my app displays garbage for these characters. Not very professional, I might add.

emexes · Sep 30, 2022

bocker77 said:
Still this doesn't answer the question as how the B4A Log command discards the Unicode. However that is done would be nice to know.

ASCII 0x00 - 0x7F = Unicode 0x0000 - 0x007F = UTF-8 0x00 - 0x7F, i.e. the high bit is 0.

Any character beyond ASCII (i.e. > 0x7F) encodes to a multibyte UTF-8 sequence in which every byte has its high bit set to 1.

So to filter out non-ASCII characters from UTF-8: discard all bytes in the string that have the high bit set.

Or convert to an array of Chars, and if any of the Chars are > 127, rebuild the String from the array, leaving out the Chars > 127.
(If none of the Chars are > 127, you can just use the original string, i.e. no need to rebuild it.)
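The Char-based variant could look something like the sketch below. The sub name StripNonAscii is made up for illustration, and the sketch always rebuilds the string rather than first checking whether a rebuild is needed.

```b4x
' Sketch: return s with every Char above 127 removed.
Sub StripNonAscii(s As String) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To s.Length - 1
        Dim c As Char = s.CharAt(i)
        ' Keep only plain ASCII (0 - 127); everything else is dropped.
        If Asc(c) <= 127 Then sb.Append(c)
    Next
    Return sb.ToString
End Sub
```

Calling it as StripNonAscii(strTitle) before adding the title to the Listview drops the offending characters instead of showing boxes. The cost is that legitimate accented letters are lost as well.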

bocker77 · Sep 30, 2022

emexes,

I believe that the Listview has no problem displaying any encoding. The hex codes for these characters in the DB I created from the downloaded csv files are corrupted. Garbage in, garbage out.

emexes · Sep 30, 2022

bocker77 said:
but to get to the "right quote" requires that string replace that I noted above.
That seems to work for those characters, "Left Quote, Left double quote. etc."
but the umlauts are a different story.

Does this get the different story back on track? :

B4X:

Dim Historical As String = "4611,Lincolnﾒs Farewell to Springfield,39.79933,-89.64238"
Dim Filtered As String = Historical.Replace(Chr(0xFF92), "'")
Log(Filtered)
