Android Question Discard Unicode In Listview

bocker77

Active Member
Licensed User
Longtime User
I use downloaded csv data from a website that I import into a database. One of the fields is a string that can contain Unicode characters and I add these to a Listview. These Unicode characters are displayed as either a box or replacement character � in the Listview. I can replace the replacement character easy enough which I replace with a single quote but I can't seem to handle the others. The culprits are typically the left and right double quotes and a few others. When sending the strings (variables) to the B4A Log I noticed that the offending characters are discarded. I am wondering what is used in the Log command to do this. I was going to use the log so that I could see the hex codes in an editor but none of those characters show up.

Thanks,
Greg
 

bocker77

Active Member
Licensed User
Longtime User
I believe I need to explain the process better in populating the database in my app. The app comes with a database with old data in it and just for a few states. I built that with downloaded files from the website way back when. It only contains a few state's historical markers. To get around this I decided to write a VBScript that can, from downloaded csv files, create a new database on the PC that has only the states the user desires. I need to use sqlite3.exe in the script to accomplish this. Since sqlite3 doesn't support Windows-1252 the data gets imported in UTF-8 encoding. The new database is imported into the app using SMB. From inside the app I can't do any conversion that can put it back to its original form. Or at least I can't figure out how to do it.
 
Upvote 0

teddybear

Well-Known Member
Licensed User
Dim b() As Byte
b = File.ReadBytes(File.DirAssets, "Markers.csv")
Dim s As String
Dim bc As ByteConverter
s = bc.StringFromBytes(b, "windows-1252")
File.WriteString("d:\\", "output.csv", s)
Did you try doing with the snippet, what result is it?
 
Upvote 0

emexes

Expert
Licensed User
I believe I need to explain the process better in populating the database in my app. The app comes with a database with old data in it and just for a few states. I built that with downloaded files from the website way back when. It only contains a few state's historical markers. To get around this I decided to write a VBScript that can, from downloaded csv files, create a new database on the PC that has only the states the user desires. I need to use sqlite3.exe in the script to accomplish this. Since sqlite3 doesn't support Windows-1252 the data gets imported in UTF-8 encoding. The new database is imported into the app using SMB. From inside the app I can't do any conversion that can put it back to its original form. Or at least I can't figure out how to do it.

I'm still not sure about:

a/ who downloads the csv from hmdb: you, or your users?
b/ who runs the VBScript and sqlite3: you, or your users?

I think it's you, but in that case, why not just replace your csv-to-sqlite VBScript with B4J? Or have a B4J program that rencodes your Windows-1252+HTML csv to UTF-8 or whatever Unicode encoding that SQLite3.exe can handle?

Btw I downloaded all the data when we started a couple of weeks back : it was 53 MB of csv, zips to 11.4 MB. If you're already distributing data with your app, maybe it is plausible to distribute all of it, not just custom subsections.
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
My users can download the csv files and run the VBScript that uses the sqlite3 executable. The csv files that are downloaded have numerous columns that are not needed in the app's database. It only needs four of the columns. The VBScript deletes the unneeded columns then merges the files into one. It is this file that is imported into the newly created database. sqlite3 has a PRAGMA Encoding statement but it does not support Windows-1252. Because I have DB Browser for SQLite I can import the merged csv file and use Windows-1252 which honors the encoding. I don't expect the users to install DB Browser For SQLite. Since sqlite3 uses UTF-8 the text data gets converted and once converted there is no way that I can find to fix it. The problem characters are all mucked up and show as BLOBS. This is not really a big problem it just for those Titles that have them have funky replacement characters in them. The normal text looks good.

An FYI, the entries in the website are user contributed so they can use these non standard characters such as right/left double quotes and umlauts. Windows-1252 encompasses them and has no problem displaying. Most contributors do not enter these characters.
 
Upvote 0

emexes

Expert
Licensed User
Putting aside momentarily the question of how to do it - just trust in B4X ? - is there any reason that the csv download and reformat and add to the local Android SQLite database can't all be done by the app? Like, is your usual use case that a teacher or walking-group leader does the download and CSV generation once, and then those CSV files are distributed to their students or members? Eg so that the teacher knows that all students have the relevant data, instead of later finding out that some students downloaded "Virginia" instead of "West Virginia"?

If you were ok with searching by "all markers within x degrees of (lat, lon)" - and presumably (lat, lon) would usually be the user's current location, which they wouldn't even need to type in - then the list of all marker (lat, lon)s is 1.6 MB and Zips to 1.2 MB, which seems a feasible file to package with your app. The user could specify the search-around point, and the app could then download the marker information from hmdb.org, capped at say the nearest 2000-5000 entries so as not to upset the server administrator.

Or, your app could do the search-by-state directly eg:

https://www.hmdb.org/results.asp?Search=State&State=Wyoming

https://www.hmdb.org/results.asp?Search=Keyword&SearchFor=arizona,wyoming&KeywordEnhancement=CommaIsOr

and then download the associated csv and add it to user's local SQLite database.
 
Last edited:
Upvote 0

emexes

Expert
Licensed User
then the list of all marker (lat, lon)s is 1.6 MB and Zips to 682 kB, which seems a feasible file to package with your app.

I just realised that the search-by (lat, lon) coordinates don't have to be accurate to a millionth of a degree eg within a couple of inches. So if we just store it to the nearest minute aka nautical mile eg within 1000 yards, then the uncompressed data shrinks by half (from 4-byte Ints to 2-byte Shorts) and the Zip to 408 kB.

Hey, that might even be small enough to sneak under the forum's attachment size limit... ?
 

Attachments

  • latlon.zip
    398 KB · Views: 190
Upvote 0

emexes

Expert
Licensed User
searching by "all markers within x degrees of (lat, lon)" - and presumably (lat, lon) would usually be the user's current location, which they wouldn't even need to type in

Better again might be "the closest 2000 markers to (lat, lon)" ?

The tightest cluster of 2000 markers has a radius of 30 miles. The tightest cluster of 5000 markers has a radius of 60 miles. How keen are these marker visitors?
 
Last edited:
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
Let me explain what my app does. It was created when taking car rides in the countryside with a friend where we would notice historical markers on the side of the road. Going fast it was hard to even see the titles. Living in Ohio I found the state's website that has all of its markers recorded. When looking at other states it became obvious that by trying to code for each state that this was not going to be feasible. During my searching though I stumbled across the website that made this app possible. The editor of the site even directed me to where you can download there markers by state in csv format. Exactly what I needed to proceed. My app needed the marker number, title, and the GPS coordinates. So my app is continuously receiving GPS coordinates and scanning my data to see if any markers are in reach. If found it will read the markers title and text to the user as one is driving. It does a bunch of other things also but I succeeded in my original intent. Now my users can enjoy going on vacation and as long as they load the app with the states that they are going through and to they can listen to historical happening along the way. With the bluetooth audio that is in most cars you can hear it over the car's speakers. The only problem with that is you have to forgo listening to music.
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
This problem is solved. I worked with the owner of the website and he graciously converted his csv files to UTF-8. He thanked me for bringing this to his attention since it solved a problem that he was having. He stated that this should have been done a long time ago when UTF-8 won the encoding war.

For all those that took the time to give me suggestions on how to fix this problem I thank you. Like I said before this is an awesome community.
 
Upvote 0

emexes

Expert
Licensed User
This problem is solved. I worked with the owner of the website and he graciously converted his csv files to UTF-8. He thanked me for bringing this to his attention since it solved a problem that he was having.

You'd reckon he'd've done the HTML entity numbers at the same time, but on the other hand that's probably more than just flicking a switch.

https://www.hmdb.org/m.asp?m=137240

137240,,"Faith ★ Honor ★ Virtue"

1666738162570.png
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
I believe he now runs the csv files through some Microsoft utility to convert from Windows-1252 to UTF-8 but am not sure. I still need to eliminate any HTML code in my app though. These characters in question as stated by you, emexes in your last post do cause the text-to-speech engine say some weird stuff.

As a side note, Google's text-to-speech engine renamed Speech Services by Google is now garbling the beginning words if there is a lot of text being sent to it. I let Google know about this but as usual no response back from them. I believe this has been caused by recent updates to the engine. I think they want you to use their new cloud TTS but I am not going to do it. Any of the other TTS engines that I have looked at want you to pay for something or other.
 
Upvote 0

emexes

Expert
Licensed User
Lol you're right, heck there's a lot of them &# characters. With a few &#x characters to muck us about, too.

You could define a translation table like:
Say unicode character as:
Dim SayAs() As String = Array As String( _
    "★ black star", _
    "→ right arrow", _
    "← left arrow", _
    "☆ star", _
    "Σ sigma", _
    "Ω omega", _
    "Λ lambda", _
    "Φ phi", _
    "✯ star", _
    "Δ delta", _
    "№ number", _
    "½ one-half", _
    "♀ female", _
    "⚜ fleur-de-lis", _
    "? eagle", _
    "? elephant", _
    "? possibly a penguin", _
    "Ψ trident", _
    "✵ star", _
    "͠  tilde", _
    "π pie", _
    "❤ heart", _
    "़ circle", _
    "⅛ one-eighth", _
    "? pie", _
    "◆ diamond", _
    "? bricks, or maybe blocks of wood", _
    "? flag", _
    "? rainbow", _
    "? cold hard cash", _
    "♂ male", _
    "? shirt and tie", _
    "⛪ church", _
    "? sheep", _
    "Ꮺ squiggle", _
    "⑤ five", _
    "⑦ seven", _
    "⑪ eleven", _
    "❄ snow flake", _
    "? music", _
    "✞ cross", _
    "㉑ twenty-one", _
    "㊵ forty", _
    "㊷ the answer to life, the universe, and everything", _
    "Ꭳ what looks like a penny-farthing bicycle", _
    "리 half the answer to life, the universe, and everything", _
    "? pineapple, or a very large acorn", _
    "? bananas, perhaps", _
    "? key", _
    "? evergreen tree", _
    "? fire", _
    "● big solid dot", _
    "? railway track", _
    "? police officer", _
    "μ mu", _
    "♫ beamed eighth notes", _
    "↔ left right arrows", _
    "? Don't tell mom the babysitter's dead", _
    "➝ right arrow", _
    "? potato", _
    "? shield", _
    "? crown", _
    "? right arrow", _
    "? left arrow", _
    "⅓ one-third" _
)

and then despatch the current wtfru-sayings with:

Replace unicode characters with something speakable:
Dim Example As String = "Faith ★ Honor ★ Virtue"
For Each Replacement As String in SayAs
    '''Example = Example.Replace( Replacement.CharAt(0), " " & Replacement.SubString(2).Trim & " " )
    Dim EntityHtml As String = "&#" + Asc(Replacement.CharAt(0)) + ";"    'eg "★"
    Dim EntitySpeakable As String = Replacement.SubString(2).Trim         'eg "black star"
    Example = Example.Replace(EntityHtml, " " & EntitySpeakable & " ")
Next
Example = Example.Replace("  ", " ").Replace(" . ", ". ").Replace(" ! ", "! ").Replace(" ? ", "? ")    'tidy up extraneous spaces (not that TTS likely to care...)

HMDB HTML entities in descending frequency order:
1118   9733
ı   1097   305
ł    843   322781   8594718   8592
İ    671   304
č    510   269
ş    474   351
ć    418   263
ę    313   281
ō    299   333
ś    295   347264   9734208   8212
ǧ    195   487
ū    193   363
ā    192   257
ą    190   261
ı    160   x131
Ś    155   346
ń    154   324145   64257
Σ    124   931
ğ    107   287
ī    101   299
Ω     94   937
ə     90   601
ה     87   1492
Λ     78   923
Φ     78   93474   1003172   8206
ר     68   1512
Δ     66   916
Ş     66   350
ř     65   345
י     64   1497
ת     61   1514
Π     60   928
מ     60   1502
ו     58   149356   65039
о     55   108653   8470
½     52   189
א     52   1488
ğ     52   x11F
Θ     48   920
ş     48   x15F
♀     48   9792
υ     48   965
ο     48   95946   9884
Α     46   913
Γ     40   91538   823237   7497
е     36   1077
ʔ     36   660
ʷ     36   695
ק     35   1511
ż     32   380
с     32   1089
σ     30   963
и     30   1080
ב     30   1489
ע     29   150629   20154
ρ     28   961
ד     28   1491
ε     28   949
ǎ     27   462
т     27   1090
ʼ     27   70025   64259
ש     24   1513
ט     24   1496
τ     24   964
פ     24   1508
ס     23   1505
ל     23   1500
?     22   129413
?     22   128024
н     22   1085
ʉ     22   649
ן     22   1503
    22   8203
Ż     22   379
đ     22   273
ň     21   328
в     21   107421   24055
а     20   1072
נ     20   150420   8211
р     20   108819   30340
Ł     19   321
ח     19   1495
ς     18   962
ġ     18   289
α     18   945
κ     18   954
Ο     18   927
ɬ     18   620
Υ     18   93317   5122817   51452
Ş     16   x15E
м     16   1084
?     16   128039
Ψ     16   936
ʻ     15   699
ź     14   37814   2343514   2157014   6425814   6503814   39321
ό     13   97213   2485913   21644
Κ     13   92213   2713813   872913   24179
л     12   108312   820512   8610
?     12   127480
̴     12   820
殿     12   2758312   10037
ا     12   1575
?     12   127482
ί     12   943
̇     12   775
ι     12   953
γ     12   947
λ     12   955
ν     12   957
Ι     12   921
ű     12   369
χ     12   96712   1253912   64256
Τ     12   932
Х     12   1061
͠     12   864
Ε     12   917
Ō     12   33212   3377512   8213
ě     11   28311   49437
д     11   107610   44397
ג     10   149010   44288
й     10   108110   44277
כ     10   149910   3490310   5462010   4812010   2456510   4932410   4582410   3200010   21776
Ի      8   1339
Ա      8   1329
      8   65279
?      8   127894
π      8   960
Ս      8   1357
Χ      8   9357   260317   48169
̓      7   7877   224787   385977   23277   213267   336557   23697   474767   236217   382157   100847   23257   23647   8539
?      7   1205877   367987   23477   223127   201267   23667   33290
ŭ      7   3657   389577   54616
Ľ      7   317
́      7   7697   78897   30908
Đ      7   2727   97447   45224
ˀ      7   7047   9642
ē      7   2757   460287   460207   49457
Β      7   9147   392087   327527   335907   265197   78917   9670
?      7   120507
?      6   1295216   50725
?      6   1279876   42889
?      6   1277526   51204
?      6   1281816   498286   551846   9794
?      6   1280846   9962
?      6   1280176   50986   93166   93186   50786   50866   50506   93226   10052
?      6   1279326   100146   128816   129816   129836   50276   475326   97686   494366   538046   522966   539526   511166   9767
?      6   1278216   50256   54596
?      6   129439
?      6   128273
η      6   951
С      6   1057
θ      6   952
ύ      6   9736   123986   21407
ј      6   11126   37912
ή      6   9426   214646   364686   362086   212706   27578
έ      6   9416   387066   366556   228236   222836   200136   287606   280236   320436   201136   254906   38660
צ      6   1510
у      6   1091
Р      6   10566   9532
к      6   10826   26371
ם      6   15016   280246   204496   260856   234476   236656   318596   11063
ז      6   14946   279346   26361
?      6   127996
?      6   127794
?      6   1282936   235466   96796   282076   26705
?      6   128740
?      6   128110
μ      6   9566   218306   406436   242156   23567
В      6   10426   96636   242666   98356   202106   85966   22275
Η      6   919
Ν      6   9256   28006
Ϊ      6   9386   21315
Ρ      6   929
г      4   1075
?      4   1293624   10141
И      4   10484   30865
?      4   129364
ф      4   1092
Г      4   1043
Ζ      4   918
ц      4   1094
?      4   128737
?      4   128081
Б      4   1041
ч      4   1095
я      4   1103
?      4   1290464   510884   512014   546444   10013
?      4   1290444   52712
ի      4   1387
վ      4   1406
Դ      4   1332
Թ      4   13374   210334   228104   22495
Ī      4   2984   8531
Č      4   268
ů      4   367
Ց      4   1361
Ն      4   1350
Ո      4   1352
ش      4   1588
و      4   1608
ه      4   1607
ف      4   16014   397464   37806
ط      4   1591
م      4   16054   82214   8220
غ      4   1594
ي      4   1610
ة      4   1577
ر      4   1585
 
Last edited:
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
Thanks for the suggestion. I already run the spoken text through a routine to clean it up a bit. I'll look into adding your code or something similar to it so the TTS engine doesn't "f" it up too much.
 
Upvote 0
Top