Android Question Discard Unicode In Listview

bocker77

Active Member
Licensed User
Longtime User
I use downloaded csv data from a website that I import into a database. One of the fields is a string that can contain Unicode characters and I add these to a Listview. These Unicode characters are displayed as either a box or replacement character � in the Listview. I can replace the replacement character easy enough which I replace with a single quote but I can't seem to handle the others. The culprits are typically the left and right double quotes and a few others. When sending the strings (variables) to the B4A Log I noticed that the offending characters are discarded. I am wondering what is used in the Log command to do this. I was going to use the log so that I could see the hex codes in an editor but none of those characters show up.

Thanks,
Greg
 

bocker77

Active Member
Licensed User
Longtime User
I believe I need to explain the process better in populating the database in my app. The app comes with a database with old data in it and just for a few states. I built that with downloaded files from the website way back when. It only contains a few state's historical markers. To get around this I decided to write a VBScript that can, from downloaded csv files, create a new database on the PC that has only the states the user desires. I need to use sqlite3.exe in the script to accomplish this. Since sqlite3 doesn't support Windows-1252 the data gets imported in UTF-8 encoding. The new database is imported into the app using SMB. From inside the app I can't do any conversion that can put it back to its original form. Or at least I can't figure out how to do it.
 
Upvote 0

teddybear

Well-Known Member
Licensed User
Dim b() As Byte
b = File.ReadBytes(File.DirAssets, "Markers.csv")
Dim s As String
Dim bc As ByteConverter
s = bc.StringFromBytes(b, "windows-1252")
File.WriteString("d:\\", "output.csv", s)
Did you try doing with the snippet, what result is it?
 
Upvote 0

emexes

Expert
Licensed User
I believe I need to explain the process better in populating the database in my app. The app comes with a database with old data in it and just for a few states. I built that with downloaded files from the website way back when. It only contains a few state's historical markers. To get around this I decided to write a VBScript that can, from downloaded csv files, create a new database on the PC that has only the states the user desires. I need to use sqlite3.exe in the script to accomplish this. Since sqlite3 doesn't support Windows-1252 the data gets imported in UTF-8 encoding. The new database is imported into the app using SMB. From inside the app I can't do any conversion that can put it back to its original form. Or at least I can't figure out how to do it.

I'm still not sure about:

a/ who downloads the csv from hmdb: you, or your users?
b/ who runs the VBScript and sqlite3: you, or your users?

I think it's you, but in that case, why not just replace your csv-to-sqlite VBScript with B4J? Or have a B4J program that rencodes your Windows-1252+HTML csv to UTF-8 or whatever Unicode encoding that SQLite3.exe can handle?

Btw I downloaded all the data when we started a couple of weeks back : it was 53 MB of csv, zips to 11.4 MB. If you're already distributing data with your app, maybe it is plausible to distribute all of it, not just custom subsections.
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
My users can download the csv files and run the VBScript that uses the sqlite3 executable. The csv files that are downloaded have numerous columns that are not needed in the app's database. It only needs four of the columns. The VBScript deletes the unneeded columns then merges the files into one. It is this file that is imported into the newly created database. sqlite3 has a PRAGMA Encoding statement but it does not support Windows-1252. Because I have DB Browser for SQLite I can import the merged csv file and use Windows-1252 which honors the encoding. I don't expect the users to install DB Browser For SQLite. Since sqlite3 uses UTF-8 the text data gets converted and once converted there is no way that I can find to fix it. The problem characters are all mucked up and show as BLOBS. This is not really a big problem it just for those Titles that have them have funky replacement characters in them. The normal text looks good.

An FYI, the entries in the website are user contributed so they can use these non standard characters such as right/left double quotes and umlauts. Windows-1252 encompasses them and has no problem displaying. Most contributors do not enter these characters.
 
Upvote 0

emexes

Expert
Licensed User
Putting aside momentarily the question of how to do it - just trust in B4X ? - is there any reason that the csv download and reformat and add to the local Android SQLite database can't all be done by the app? Like, is your usual use case that a teacher or walking-group leader does the download and CSV generation once, and then those CSV files are distributed to their students or members? Eg so that the teacher knows that all students have the relevant data, instead of later finding out that some students downloaded "Virginia" instead of "West Virginia"?

If you were ok with searching by "all markers within x degrees of (lat, lon)" - and presumably (lat, lon) would usually be the user's current location, which they wouldn't even need to type in - then the list of all marker (lat, lon)s is 1.6 MB and Zips to 1.2 MB, which seems a feasible file to package with your app. The user could specify the search-around point, and the app could then download the marker information from hmdb.org, capped at say the nearest 2000-5000 entries so as not to upset the server administrator.

Or, your app could do the search-by-state directly eg:

https://www.hmdb.org/results.asp?Search=State&State=Wyoming

https://www.hmdb.org/results.asp?Search=Keyword&SearchFor=arizona,wyoming&KeywordEnhancement=CommaIsOr

and then download the associated csv and add it to user's local SQLite database.
 
Last edited:
Upvote 0

emexes

Expert
Licensed User
then the list of all marker (lat, lon)s is 1.6 MB and Zips to 682 kB, which seems a feasible file to package with your app.

I just realised that the search-by (lat, lon) coordinates don't have to be accurate to a millionth of a degree eg within a couple of inches. So if we just store it to the nearest minute aka nautical mile eg within 1000 yards, then the uncompressed data shrinks by half (from 4-byte Ints to 2-byte Shorts) and the Zip to 408 kB.

Hey, that might even be small enough to sneak under the forum's attachment size limit... ?
 

Attachments

  • latlon.zip
    398 KB · Views: 154
Upvote 0

emexes

Expert
Licensed User
searching by "all markers within x degrees of (lat, lon)" - and presumably (lat, lon) would usually be the user's current location, which they wouldn't even need to type in

Better again might be "the closest 2000 markers to (lat, lon)" ?

The tightest cluster of 2000 markers has a radius of 30 miles. The tightest cluster of 5000 markers has a radius of 60 miles. How keen are these marker visitors?
 
Last edited:
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
Let me explain what my app does. It was created when taking car rides in the countryside with a friend where we would notice historical markers on the side of the road. Going fast it was hard to even see the titles. Living in Ohio I found the state's website that has all of its markers recorded. When looking at other states it became obvious that by trying to code for each state that this was not going to be feasible. During my searching though I stumbled across the website that made this app possible. The editor of the site even directed me to where you can download there markers by state in csv format. Exactly what I needed to proceed. My app needed the marker number, title, and the GPS coordinates. So my app is continuously receiving GPS coordinates and scanning my data to see if any markers are in reach. If found it will read the markers title and text to the user as one is driving. It does a bunch of other things also but I succeeded in my original intent. Now my users can enjoy going on vacation and as long as they load the app with the states that they are going through and to they can listen to historical happening along the way. With the bluetooth audio that is in most cars you can hear it over the car's speakers. The only problem with that is you have to forgo listening to music.
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
This problem is solved. I worked with the owner of the website and he graciously converted his csv files to UTF-8. He thanked me for bringing this to his attention since it solved a problem that he was having. He stated that this should have been done a long time ago when UTF-8 won the encoding war.

For all those that took the time to give me suggestions on how to fix this problem I thank you. Like I said before this is an awesome community.
 
Upvote 0

emexes

Expert
Licensed User
This problem is solved. I worked with the owner of the website and he graciously converted his csv files to UTF-8. He thanked me for bringing this to his attention since it solved a problem that he was having.

You'd reckon he'd've done the HTML entity numbers at the same time, but on the other hand that's probably more than just flicking a switch.

https://www.hmdb.org/m.asp?m=137240

137240,,"Faith ★ Honor ★ Virtue"

1666738162570.png
 
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
I believe he now runs the csv files through some Microsoft utility to convert from Windows-1252 to UTF-8 but am not sure. I still need to eliminate any HTML code in my app though. These characters in question as stated by you, emexes in your last post do cause the text-to-speech engine say some weird stuff.

As a side note, Google's text-to-speech engine renamed Speech Services by Google is now garbling the beginning words if there is a lot of text being sent to it. I let Google know about this but as usual no response back from them. I believe this has been caused by recent updates to the engine. I think they want you to use their new cloud TTS but I am not going to do it. Any of the other TTS engines that I have looked at want you to pay for something or other.
 
Upvote 0

emexes

Expert
Licensed User
Lol you're right, heck there's a lot of them &# characters. With a few &#x characters to muck us about, too.

You could define a translation table like:
Say unicode character as:
Dim SayAs() As String = Array As String( _
    "★ black star", _
    "→ right arrow", _
    "← left arrow", _
    "☆ star", _
    "Σ sigma", _
    "Ω omega", _
    "Λ lambda", _
    "Φ phi", _
    "✯ star", _
    "Δ delta", _
    "№ number", _
    "½ one-half", _
    "♀ female", _
    "⚜ fleur-de-lis", _
    "? eagle", _
    "? elephant", _
    "? possibly a penguin", _
    "Ψ trident", _
    "✵ star", _
    "͠  tilde", _
    "π pie", _
    "❤ heart", _
    "़ circle", _
    "⅛ one-eighth", _
    "? pie", _
    "◆ diamond", _
    "? bricks, or maybe blocks of wood", _
    "? flag", _
    "? rainbow", _
    "? cold hard cash", _
    "♂ male", _
    "? shirt and tie", _
    "⛪ church", _
    "? sheep", _
    "Ꮺ squiggle", _
    "⑤ five", _
    "⑦ seven", _
    "⑪ eleven", _
    "❄ snow flake", _
    "? music", _
    "✞ cross", _
    "㉑ twenty-one", _
    "㊵ forty", _
    "㊷ the answer to life, the universe, and everything", _
    "Ꭳ what looks like a penny-farthing bicycle", _
    "리 half the answer to life, the universe, and everything", _
    "? pineapple, or a very large acorn", _
    "? bananas, perhaps", _
    "? key", _
    "? evergreen tree", _
    "? fire", _
    "● big solid dot", _
    "? railway track", _
    "? police officer", _
    "μ mu", _
    "♫ beamed eighth notes", _
    "↔ left right arrows", _
    "? Don't tell mom the babysitter's dead", _
    "➝ right arrow", _
    "? potato", _
    "? shield", _
    "? crown", _
    "? right arrow", _
    "? left arrow", _
    "⅓ one-third" _
)

and then despatch the current wtfru-sayings with:

Replace unicode characters with something speakable:
Dim Example As String = "Faith ★ Honor ★ Virtue"
For Each Replacement As String in SayAs
    '''Example = Example.Replace( Replacement.CharAt(0), " " & Replacement.SubString(2).Trim & " " )
    Dim EntityHtml As String = "&#" + Asc(Replacement.CharAt(0)) + ";"    'eg "★"
    Dim EntitySpeakable As String = Replacement.SubString(2).Trim         'eg "black star"
    Example = Example.Replace(EntityHtml, " " & EntitySpeakable & " ")
Next
Example = Example.Replace("  ", " ").Replace(" . ", ". ").Replace(" ! ", "! ").Replace(" ? ", "? ")    'tidy up extraneous spaces (not that TTS likely to care...)

HMDB HTML entities in descending frequency order:
★   1118   9733
ı   1097   305
ł    843   322
→    781   8594
←    718   8592
İ    671   304
č    510   269
ş    474   351
ć    418   263
ę    313   281
ō    299   333
ś    295   347
☆    264   9734
—    208   8212
ǧ    195   487
ū    193   363
ā    192   257
ą    190   261
ı    160   x131
Ś    155   346
ń    154   324
fi    145   64257
Σ    124   931
ğ    107   287
ī    101   299
Ω     94   937
ə     90   601
ה     87   1492
Λ     78   923
Φ     78   934
✯     74   10031
‎     72   8206
ר     68   1512
Δ     66   916
Ş     66   350
ř     65   345
י     64   1497
ת     61   1514
Π     60   928
מ     60   1502
ו     58   1493
️     56   65039
о     55   1086
№     53   8470
½     52   189
א     52   1488
ğ     52   x11F
Θ     48   920
ş     48   x15F
♀     48   9792
υ     48   965
ο     48   959
⚜     46   9884
Α     46   913
Γ     40   915

     38   8232
ᵉ     37   7497
е     36   1077
ʔ     36   660
ʷ     36   695
ק     35   1511
ż     32   380
с     32   1089
σ     30   963
и     30   1080
ב     30   1489
ע     29   1506
人     29   20154
ρ     28   961
ד     28   1491
ε     28   949
ǎ     27   462
т     27   1090
ʼ     27   700
ffi     25   64259
ש     24   1513
ט     24   1496
τ     24   964
פ     24   1508
ס     23   1505
ל     23   1500
?     22   129413
?     22   128024
н     22   1085
ʉ     22   649
ן     22   1503
    22   8203
Ż     22   379
đ     22   273
ň     21   328
в     21   1074
巷     21   24055
а     20   1072
נ     20   1504
–     20   8211
р     20   1088
的     19   30340
Ł     19   321
ח     19   1495
ς     18   962
ġ     18   289
α     18   945
κ     18   954
Ο     18   927
ɬ     18   620
Υ     18   933
제     17   51228
주     17   51452
Ş     16   x15E
м     16   1084
?     16   128039
Ψ     16   936
ʻ     15   699
ź     14   378
宋     14   23435
呂     14   21570
fl     14   64258
︎     14   65038
香     14   39321
ό     13   972
愛     13   24859
和     13   21644
Κ     13   922
樂     13   27138
∙     13   8729
平     13   24179
л     12   1083
‍     12   8205
↢     12   8610
?     12   127480
̴     12   820
殿     12   27583
✵     12   10037
ا     12   1575
?     12   127482
ί     12   943
̇     12   775
ι     12   953
γ     12   947
λ     12   955
ν     12   957
Ι     12   921
ű     12   369
χ     12   967
・     12   12539
ff     12   64256
Τ     12   932
Х     12   1061
͠     12   864
Ε     12   917
Ō     12   332
華     12   33775
―     12   8213
ě     11   283
석     11   49437
д     11   1076
국     10   44397
ג     10   1490
관     10   44288
й     10   1081
공     10   44277
כ     10   1499
街     10   34903
한     10   54620
미     10   48120
念     10   24565
사     10   49324
대     10   45824
紀     10   32000
唐     10   21776
Ի      8   1339
Ա      8   1329
      8   65279
?      8   127894
π      8   960
Ս      8   1357
Χ      8   935
斯      7   26031
방      7   48169
̓      7   787
城      7   22478
雅      7   38597
ग      7   2327
华      7   21326
荷      7   33655
ु      7   2369
르      7   47476
居      7   23621
镇      7   38215
❤      7   10084
क      7   2325
़      7   2364
⅛      7   8539
?      7   120587
达      7   36798
फ      7   2347
在      7   22312
亞      7   20126
ा      7   2366
舊      7   33290
ŭ      7   365
頭      7   38957
하      7   54616
Ľ      7   317
́      7   769
ố      7   7889
碼      7   30908
Đ      7   272
☐      7   9744
남      7   45224
ˀ      7   704
▪      7   9642
ē      7   275
돌      7   46028
도      7   46020
성      7   49457
Β      7   914
館      7   39208
翰      7   32752
茶      7   33590
林      7   26519
ồ      7   7891
◆      7   9670
?      7   120507
?      6   129521
옥      6   50725
?      6   127987
꞉      6   42889
?      6   127752
전      6   51204
?      6   128181
스      6   49828
힐      6   55184
♂      6   9794
?      6   128084
⛪      6   9962
?      6   128017
Ꮺ      6   5098
⑤      6   9316
⑦      6   9318
Ꮦ      6   5078
Ꮮ      6   5086
Ꮊ      6   5050
⑪      6   9322
❄      6   10052
?      6   127932
✞      6   10014
㉑      6   12881
㊵      6   12981
㊷      6   12983
Ꭳ      6   5027
리      6   47532
☨      6   9768
서      6   49436
투      6   53804
챈      6   52296
틀      6   53952
재      6   51116
☧      6   9767
?      6   127821
Ꭱ      6   5025
필      6   54596
?      6   129439
?      6   128273
η      6   951
С      6   1057
θ      6   952
ύ      6   973
の      6   12398
原      6   21407
ј      6   1112
鐘      6   37912
ή      6   942
变      6   21464
蹴      6   36468
走      6   36208
化      6   21270
殺      6   27578
έ      6   941
露      6   38706
輯      6   36655
大      6   22823
國      6   22283
中      6   20013
灘      6   28760
海      6   28023
紫      6   32043
云      6   20113
排      6   25490
霄      6   38660
צ      6   1510
у      6   1091
Р      6   1056
┼      6   9532
к      6   1082
會      6   26371
ם      6   1501
浸      6   28024
信      6   20449
日      6   26085
宗      6   23447
山      6   23665
米      6   31859
⬷      6   11063
ז      6   1494
洞      6   27934
曹      6   26361
?      6   127996
?      6   127794
?      6   128293
寺      6   23546
●      6   9679
港      6   28207
桑      6   26705
?      6   128740
?      6   128110
μ      6   956
商      6   21830
黃      6   40643
店      6   24215
小      6   23567
В      6   1042
▿      6   9663
廊      6   24266
♫      6   9835
仲      6   20210
↔      6   8596
圃      6   22275
Η      6   919
Ν      6   925
浦      6   28006
Ϊ      6   938
千      6   21315
Ρ      6   929
г      4   1075
?      4   129362
➝      4   10141
И      4   1048
碑      4   30865
?      4   129364
ф      4   1092
Г      4   1043
Ζ      4   918
ц      4   1094
?      4   128737
?      4   128081
Б      4   1041
ч      4   1095
я      4   1103
?      4   129046
자      4   51088
적      4   51201
해      4   54644
✝      4   10013
?      4   129044
취      4   52712
ի      4   1387
վ      4   1406
Դ      4   1332
Թ      4   1337
利      4   21033
多      4   22810
域      4   22495
Ī      4   298
⅓      4   8531
Č      4   268
ů      4   367
Ց      4   1361
Ն      4   1350
Ո      4   1352
ش      4   1588
و      4   1608
ه      4   1607
ف      4   1601
魂      4   39746
鎮      4   37806
ط      4   1591
م      4   1605
”      4   8221
“      4   8220
غ      4   1594
ي      4   1610
ة      4   1577
ر      4   1585
 
Last edited:
Upvote 0

bocker77

Active Member
Licensed User
Longtime User
Thanks for the suggestion. I already run the spoken text through a routine to clean it up a bit. I'll look into adding your code or something similar to it so the TTS engine doesn't "f" it up too much.
 
Upvote 0
Top