Android Question Discard Unicode In Listview

bocker77

Active Member
Licensed User
Longtime User
I use CSV data downloaded from a website, which I import into a database. One of the fields is a string that can contain Unicode characters, and I add these strings to a Listview. Some of these characters are displayed in the Listview as either a box or the replacement character �. I can handle the replacement character easily enough (I replace it with a single quote), but I can't seem to handle the others. The culprits are typically the left and right double quotes and a few others. When I send the strings (variables) to the B4A Log, the offending characters are discarded. I am wondering what the Log command uses to do this. I was going to use the log so that I could see the hex codes in an editor, but none of those characters show up.

Thanks,
Greg
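
One way to see exactly which characters a string contains, without relying on how a log or editor renders them, is to list the code point of every non-ASCII character. The sketch below is in Python rather than B4X, purely as an illustration; `describe_non_ascii` is a hypothetical helper name, and the sample title is one of the marker names quoted later in the thread.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, 'U+XXXX', Unicode name) for each non-ASCII character."""
    found = []
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F:
            # unicodedata.name raises for unnamed characters unless a default is given
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

# Sample input: a title containing a right single quotation mark (U+2019)
report = describe_non_ascii("Lincoln\u2019s Farewell to Springfield")
for entry in report:
    print(entry)
```

A report like this shows whether a character is a genuine curly quote or an already-substituted U+FFFD, which matters because the second case cannot be repaired after the fact.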
 

bocker77

Active Member
Licensed User
Longtime User
emexes,

I will try your suggestion once I find the B4X methods (code) to accomplish this. Since this is one of my hobbies it may take a while, and my app is only being used by some of my friends and acquaintances who love history.

Other than this small problem the app works great. Before I retired I was looking for an IDE to do some programming. Luckily I came across B4A and have been enjoying coding in it immensely. And this Forum is top notch.
 

bocker77

Active Member
Licensed User
Longtime User
emexes,

Converting to "Windows-1252" from "utf8" will display those characters prefixed with "ï¾". I can simply strip that prefix by replacing it with "". This works for those characters, but as I stated, the umlauts are a different story.

The K�enster Building

0x54 0x68 0x65 0x20 0x4b 0xef 0xbf 0xbd 0x65 0x6e 0x73 0x74 0x65 0x72 0x20 0x42 0x75 0x69 0x6c 0x64 0x69 0x6e 0x67

umlaut "u"
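
The bytes 0xef 0xbf 0xbd in this dump are the UTF-8 encoding of U+FFFD, the replacement character itself, which suggests the original "ü" byte had already been lost by the time the string was dumped. One plausible cause (an assumption at this point in the thread) is that a Windows-1252 byte 0xFC was decoded as UTF-8. A minimal sketch of that failure mode, in Python rather than B4X for illustration:

```python
# The hex dump from the post, turned back into raw bytes
hex_dump = ("0x54 0x68 0x65 0x20 0x4b 0xef 0xbf 0xbd 0x65 0x6e 0x73 0x74 "
            "0x65 0x72 0x20 0x42 0x75 0x69 0x6c 0x64 0x69 0x6e 0x67")
raw = bytes(int(token, 16) for token in hex_dump.split())

# Decoding as UTF-8 succeeds and yields a literal U+FFFD after the "K"
text = raw.decode("utf-8")

# Assumed original: the title stored as Windows-1252, where "ü" is the single byte 0xFC
original = "The K\u00fcenster Building".encode("cp1252")

# Reading those bytes as UTF-8 replaces the invalid 0xFC with U+FFFD
misread = original.decode("utf-8", errors="replace")

# Reading them with the matching charset keeps the umlaut
intended = original.decode("cp1252")

print(text == misread)   # both contain U+FFFD where the umlaut should be
print(intended)
```

Once U+FFFD has been written into the data, the original character cannot be recovered from it, so the fix has to happen where the file is first decoded.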
 

teddybear

Well-Known Member
Licensed User
then, instead of the "unknown" character displaying as a placeholder character, it now shows correctly(?) :
There is the "メ" in it.
I went to the website, searched by marker number,
The results are:
4611 Lincoln’s Farewell to Springfield
140348 The Küenster Building
160821 "Wisteria Café" Mural
......
In your csv file they are:
4611,Lincolnメs Farewell to Springfield,
140348,The K�enster Building
160821,"""Wisteria Caf�"" Mural"

How did you download or generate the csv file?
 

emexes

Expert
Licensed User
Rather than discarding Unicode, could try keeping it instead:

B4X:
Dim TestMarkerID() As Int = Array As Int( _
      4611,   4683,   4861,  11030,  39210,  47609,  47692,  47842, _
     54765,  55618,  61452,  76532,  82400,  94232,  99733, 105653, _
    110681, 115585, 132021, 138435, 140348, 143986, 144862, 160588, _
    160821, 177500, 185190, 188068, 190190 _
)

For Each MarkerID As Int In TestMarkerID
    Wait For(GetMarkerName(MarkerID)) Complete (MarkerName As String)    
    Log(MarkerID & " = """ & MarkerName & """")
Next


Sub GetMarkerName(MarkerID As Int) As ResumableSub
 
    Dim dlh As HttpJob

    dlh.Initialize("", Me)
    
    Dim MarkerURL As String = "https://www.hmdb.org/m.asp?m=" & MarkerID

    Dim WholePage As String = ""
    dlh.Download(MarkerURL)
    Wait For (dlh) JobDone(dlh As HttpJob)
    If(dlh.Success) Then
        WholePage = dlh.GetString2("Windows-1252")
    End If
    dlh.Release
    
    If WholePage.Length <> 0 Then
        Dim Temp As String = WholePage
        
        Dim I As Int = Temp.ToUpperCase.IndexOf("<TITLE")    'Marker name is HTML title
        If I >= 0 Then
            Temp = Temp.SubString(I)
            I = Temp.IndexOf(">")
            If I >= 0 Then
                Temp = Temp.SubString(I + 1)    'after the closing ">"
                I = Temp.ToUpperCase.IndexOf("</TITLE")
                If I >= 0 Then
                    Temp = Temp.SubString2(0, I)
                    
                    'better do some basic tidy-ups ?
                    Temp = Temp.Replace("<U>", "").Replace("</U>", "").Replace("<u>", "").Replace("</u>", "")
                    Temp = Temp.Replace("<B>", "").Replace("</B>", "").Replace("<b>", "").Replace("</b>", "")
                    Temp = Temp.Replace("<I>", "").Replace("</I>", "").Replace("<i>", "").Replace("</i>", "")

                    Return Temp
                End If
            End If
        End If
    End If

    Return "!!!"    'or some other indication that couldn't get marker name

End Sub

Log output:
Waiting for debugger to connect...
Program started.
4611 = "Lincoln’s Farewell to Springfield Historical Marker"
4683 = "Victory, World War I Black Soldiers’ Memorial, a War Memorial"
4861 = "Lincoln’s Tomb Historical Marker"
11030 = "“You can fool all the people part of the time . . .” Historical Marker"
39210 = "The Burlington Zephyrs / Articulated Trains Historical Marker"
47609 = "Old Town’s Entrepreneur Spirit (#1) Historical Marker"
47692 = "“Hubbard’s Folly” Historical Marker"
47842 = "“Rites of Spring” Historical Marker"
54765 = "Abraham Lincoln and the Talisman Historical Marker"
55618 = "Lincoln’s First Illinois Home Historical Marker"
61452 = "The Eastland Disaster Historical Marker"
76532 = "McHenry County’s First Courthouse Historical Marker"
82400 = "Former Site of the “Zum Deutschen Eck” Restaurant Historical Marker"
94232 = "Aurora Hotel • Leland Hotel Historical Marker"
99733 = "Louis Jolliet & Père Jacques Marquette Historical Marker"
105653 = "Jardin aux Potages Historical Marker"
110681 = "Logan Square • Palmer Square Historical Marker"
115585 = "Haymarket Martyrs’ Monument Historical Marker"
132021 = "Mary Bartelme, Illinois’ First Female Judge Historical Marker"
138435 = "The Ariston Café, Litchfield, Illinois Historical Marker"
140348 = "The Küenster Building Historical Marker"
143986 = "The Küenster Building Historical Marker"
144862 = "Grant’s March to Naples Historical Marker"
160588 = ""Palms Grill Café" Mural Historical Marker"
160821 = ""Wisteria Café" Mural Historical Marker"
177500 = "Noël Le Vasseur Historical Marker"
185190 = "Père Marquette Historical Marker"
188068 = "László Moholy-Nagy Historical Marker"
190190 = "A Lot of Activism in the Neighborhood Historical Marker"
 

bocker77

Active Member
Licensed User
Longtime User
teddybear,

I select download from the website once I log in and get the csv file as is. I have sent a message to the editor of the website letting him know what the issue is and am waiting for a response. The problem is that the site is user contributed. A person can add a historical marker and in the Title field can enter any character they wish. This I assume gets added to their database and then the csv file is created from the database. I just ran a test using Open Office Calc where I was able to save a csv file with these characters using utf-8 character set and all looked good.
 

bocker77

Active Member
Licensed User
Longtime User
Emexes,

Thank you for your hard work on this, but I do believe the issue is in the way the csv file is created and saved on the website. To find out how I am hoping to resolve this, please read my response to teddybear. If I get no response from the editor then I will look into your solution.

Also, here is some code that I use to handle the HTML. I found the code on this forum and it works great. You may find a need for it in the future.

B4X:
If strTitle.Contains("</") Then
    strTitle = RegexReplace("<[^>]*>", strTitle, "")
End If

Sub RegexReplace(Pattern As String, Text As String, Replacement As String) As String
    Dim m As Matcher
    m = Regex.Matcher(Pattern, Text)
    Dim r As Reflector
    r.Target = m
    Return r.RunMethod2("replaceAll", Replacement, "java.lang.String")
End Sub
 

bocker77

Active Member
Licensed User
Longtime User
One other thing that I learned is that you can add a new keyboard in Windows using US English International that allows one to enter umlauts in text files. Something good to know.
 

emexes

Expert
Licensed User
I select download from the website once I log in and get the csv file as is.

Lol after I worked out that I don't have to log in, I downloaded a csv file containing the two Küenster entries. The file is encoded as Windows 1252, and reads into B4J no problem (although I've just realised I should check that it works in ListView too...)

If it is easier for your program to process UTF-8, then the simplest way to convert it is to load the CSV into Notepad, and then Save As UTF-8 (without the Byte Order Mark, unless you're a glutton for punishment).
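
For anyone who would rather script the conversion than use Notepad, here is a minimal sketch in Python (not B4X). The function name `convert_cp1252_to_utf8` and the file names in the usage comment are hypothetical; it assumes the whole file fits in memory.

```python
from pathlib import Path

def convert_cp1252_to_utf8(src, dst):
    """Re-encode a Windows-1252 text file as UTF-8 without a Byte Order Mark."""
    raw = Path(src).read_bytes()
    # Python's cp1252 codec raises UnicodeDecodeError on the five bytes that
    # Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    text = raw.decode("cp1252")
    # "utf-8" writes no BOM; "utf-8-sig" would add one
    Path(dst).write_bytes(text.encode("utf-8"))

# Usage (hypothetical file names):
# convert_cp1252_to_utf8("Markers.csv", "Markers-utf8.csv")
```

Working on bytes rather than text-mode files keeps the line endings exactly as they were in the download.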

Righto, now to check that ListView...

edit: I don't have B4A set up ready for testing, so I'll leave the ListView test for you
 

Attachments

  • MarkersInTwoEncodings.zip

emexes

Expert
Licensed User
I have sent a message to the editor of the website letting him know what the issue is and am waiting for a response.

I agree that download-as-UTF-8 would be better, but your editor's probably also thinking:

heck, if I change that, then I'm going to get a heap of complaints from people whose currently-working downloads no longer work properly

so best brace yourself for inaction.
 

emexes

Expert
Licensed User
One other thing that I learned is that you can add a new keyboard in Windows using US English International that allows one to enter umlauts in text files. Something good to know.

Back in the '80s I bought a VT-220 terminal, and the keyboard had a Compose key, which would "add" the next two typed characters into one, eg:

[Compose] [e] ['] would type é
[Compose] [u] ["] would type ü
[Compose] [a] [e] would type æ
[Compose] [c] [,] would type ç
[Compose] [+] [-] would type ±
[Compose] [?] [?] would type ¿
[2] [5] [Compose] [o] [o] [C] would type 25°C

I don't understand why IBM didn't do similar with the PC. Maybe DEC had a patent on it.

Hey, I just discovered that [Alt [1] [3] [0]] still types é. Far out.
 

teddybear

Well-Known Member
Licensed User
A person can add a historical marker and in the Title field can enter any character they wish.
There is no problem with this; as you know, everything displays correctly on the website. I read your markers.csv, its encoding is ANSI. I would like to know whether it displays correctly when you open it with Notepad on your PC. Which country do you live in, or which language is your Windows set to?
 

emexes

Expert
Licensed User
I read your markers.csv, its encoding is ANSI

It is "windows-1252", with mostly HTML &#n; character codes used to represent characters not included in "windows-1252".

Clues to this are the dagger/cross in : https://www.hmdb.org/m.asp?m=27199

and the black star in : https://www.hmdb.org/m.asp?m=192870

Also a variety of HTML formatting tags are used, eg for italics in : https://www.hmdb.org/m.asp?m=156154

edit: and in the downloaded CSV files, quotation marks " are escaped (doubled-up) ie ""
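
Putting those three observations together (doubled quotation marks, HTML formatting tags, and &#n; character codes), a cleanup pass could look roughly like the Python sketch below. It is an illustration rather than B4X code; `clean_field` is a hypothetical helper, the first sample row is the Wisteria Café line quoted earlier in the thread, and the second row is invented to exercise a tag and a numeric character reference.

```python
import csv
import html
import io
import re

def clean_field(value):
    """Strip HTML tags first, then decode &#n; style character references."""
    without_tags = re.sub(r"<[^>]*>", "", value)
    return html.unescape(without_tags)

sample = (
    '160821,"""Wisteria Caf\u00e9"" Mural"\r\n'   # row quoted in the thread
    '1,<i>Sample</i> &#8224; title\r\n'           # invented row for illustration
)

# csv.reader undoes the doubled-up quotation marks inside quoted fields
reader = csv.reader(io.StringIO(sample, newline=""))
rows = [[clean_field(field) for field in row] for row in reader]

for row in rows:
    print(row)
```

Stripping tags before unescaping means an escaped `&lt;` in a title is not mistaken for the start of a tag.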


 

teddybear

Well-Known Member
Licensed User
Does markers.csv display correctly when you open it with Notepad?
 

bocker77

Active Member
Licensed User
Longtime User
If anyone is interested, here is what I found out. The sqlite3.exe command-line tool, which I use from a VBScript, does not support Windows-1252 encoding, and that is the encoding of the csv file downloaded from the website. If I use DB Browser for SQLite I can import the csv file as Windows-1252, and this works. The website creates the csv file using Windows-1252 because UTF-8 was not the standard when the site was built. As of now they do not have any plans to change this. Until then my app will have to live with the funky characters in the Listview.
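
If the import tool cannot be told the file's encoding, one workaround is to decode the csv before the rows reach SQLite, so the database only ever receives Unicode text. A minimal sketch in Python (not VBScript or B4X); the function name, table name and two-column layout are assumptions for illustration, since the real file's columns are not shown in this thread.

```python
import csv
import io
import sqlite3

def import_markers(csv_bytes, conn):
    """Decode Windows-1252 csv bytes and insert the rows as Unicode text."""
    text = csv_bytes.decode("cp1252")
    reader = csv.reader(io.StringIO(text, newline=""))
    # Hypothetical schema: marker id and title only
    conn.execute("CREATE TABLE IF NOT EXISTS markers (id INTEGER, title TEXT)")
    conn.executemany(
        "INSERT INTO markers (id, title) VALUES (?, ?)",
        ((int(row[0]), row[1]) for row in reader if row),
    )
    conn.commit()

# Usage with an in-memory database and two rows encoded as Windows-1252
data = (b"140348,The K\xfcenster Building\r\n"
        b"4611,Lincoln\x92s Farewell to Springfield\r\n")
conn = sqlite3.connect(":memory:")
import_markers(data, conn)
titles = dict(conn.execute("SELECT id, title FROM markers"))
```

The decode step is the only encoding-specific part; everything after it works on ordinary strings.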
 

DonManfred

Expert
Licensed User
Longtime User
Until then my app will have to live with the funky characters in the Listview.
Why? You can easily read text files in a charset other than UTF-8.

Example written in B4J

B4X:
    File.Copy(File.DirAssets,"windows-1252.txt",File.DirTemp,"windows-1252.txt")
  
    Dim Reader As TextReader
    Reader.Initialize2(File.OpenInput(File.DirTemp, "windows-1252.txt"),"Windows-1252") ' Use the correct Encoding when reading a File. In this Case Windows-1252
    Dim line As String
    line = Reader.ReadLine
    Do While line <> Null
        Log(line) ' The german Umlauts are correctly shown in B4X
        line = Reader.ReadLine
    Loop
    Reader.Close

B4A

B4X:
    Activity.LoadLayout("Layout")
    File.Copy(File.DirAssets,"windows-1252.txt",File.DirInternal,"windows-1252.txt")
    
    Dim Reader As TextReader
    Reader.Initialize2(File.OpenInput(File.DirInternal, "windows-1252.txt"),"Windows-1252")
    Dim line As String
    line = Reader.ReadLine
    Do While line <> Null
        ListView1.AddSingleLine(line) ' Shows the correct characters in Listview...
        line = Reader.ReadLine
    Loop
    Reader.Close
 

teddybear

Well-Known Member
Licensed User
I tested importing the csv (encoded as Windows-1252) using DB Browser for SQLite, and I also tested converting the csv to UTF-8 using B4J and then importing the UTF-8 csv. The results are exactly the same. I think that if DB Browser works, then B4J does too. The csv I downloaded from the website tests correctly.
B4J code snippet as follows:
B4X:
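    ' Read the raw bytes so that no charset is assumed yet
    Dim b() As Byte
    b = File.ReadBytes(File.DirAssets, "Markers.csv")
    Dim s As String
    Dim bc As ByteConverter    ' from the ByteConverter library
    s = bc.StringFromBytes(b, "windows-1252")    ' decode the bytes as Windows-1252
    ' File.WriteString saves as UTF-8; B4X strings do not need the backslash escaped
    File.WriteString("d:\", "output.csv", s)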
 