Android Question Special chars and TextReader

GiovanniPolese · Jan 31, 2025

Hi to everybody here
I am using TextReader with files whose encoding I don't know. In practice it may be either "Windows-1252" or "UTF-8". The problem arises when these files contain special characters. Although the encoding is, in theory, declared in the file itself, in practice the declaration is not respected. Some files are read correctly with TextReader when one encoding is used, while others, under the same conditions, are not. In other words, although all the files are normally declared as "Windows-1252", very often the TextReader must be initialized with UTF8 to read them correctly.

Besides the explicit (often false) declaration inside the file, I also tried analysing the very first characters of the file, with no success. The only way to get a sure answer is to read the raw bytes and see whether, and where, the special characters appear. This creates new problems, because the files are often huge (over 1 GB) and cannot be loaded entirely into memory with File.ReadBytes. Alternatively, maybe I must read byte by byte. I use TextReader because, given the file size, I read the files line by line; moreover, the structure of these files requires line-by-line processing.

Is there any way to read raw bytes from a text file, line by line, without using TextReader? Thanks in advance for any suggestion.

GiovanniPolese · Feb 3, 2025

And in tablet. Colors are not the same, for clarity.

GiovanniPolese · Feb 3, 2025

In the tablet, special chars are converted to their English counterparts. As you can see, some are missing, so the code that I posted here should be updated, or another sub should be written.

emexes · Feb 3, 2025

GiovanniPolese said:
Inside a circle there were added all special chars.

Those seem to have come through to this end ok.

As you see, some chars are still fooling me

Which characters inside the circle are fooling you? They look ok to me, but on the other hand, I don't know exactly what characters they are meant to be, other than special, which I understand to mean: characters outside of ASCII characters 0..127

Ignore that last bit, I just spotted the next two posts, including the tablet screenshot showing the accented characters having been converted to their unaccented ASCII equivalents.

Do you know which program is removing the accents, ie replacing the accented characters with their ASCII equivalents?

Is that the EXACT same DXF file displayed in both of those screenshots? ie NOT two different versions of the one file.

emexes · Feb 3, 2025

GiovanniPolese said:
In the tablet, special chars are converted to their English counterparts.

What software is being used on the tablet to display the DXF file?

Is it a third-party app? In which case, you should be asking them about why their app is not correctly displaying accented characters like DwgTrueViewer and the sharecad.org online viewer do.

Or is it your app, in which case: what does your app show if you turn off or bypass your translation-to-ASCII fix? I think what you actually need is a Unicode-to-AutoCad-8-bit-narrow-characters-which-are-actually-Windows-1252-characters translation.

emexes · Feb 3, 2025

In your app, what is drawing the text strings and line segments from the DXF file, onto the screen as shown in the tablet screenshot?

Is it a "black box" library or class that you're calling, or did you write your own code to do it?

emexes · Feb 4, 2025

Ok, I think I'm getting closer to the problem.

Is it that some of your .DXF files encode non-ASCII characters using Windows-1252 code page, and others encode non-ASCII characters using UTF-8?

And that the tablet library needs to know which encoding is used?

I have a feeling that there is an obscure entry in the DXF headers that indicates which encoding is used.

Perhaps the tablet library is ignoring that.

The Portuguese characters DXF file is using UTF-8 for the line of letter e's with various accent marks.

But if I copy that exact same line as UTF-8 bytes into my hand-written test DXF file, then it displays two wrong characters in place of each letter e.

If I then change my hand-written test DXF file so that the letter e's are encoded as Windows-1252 rather than UTF-8 ie as single bytes rather than multi-byte sequences, then my test DXF file displays correctly.

It just occurred to me that the Euro currency symbol might be a good test case. Characters 160 to 255 are the same in both Windows-1252 and Unicode, whereas the Euro currency symbol is 0x80 (128 decimal) in Windows-1252 but 0x20AC in Unicode.

ASCII Code 128 (Windows-1252): in the Windows-1252 character set, code 128 is the character €, also known as the euro sign. (www.ascii-code.com)

Unicode Compart: U+20AC is the Unicode hex value of the character Euro Sign. (www.compart.com)
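
To see that difference at the byte level, something along these lines could be run. This is only a sketch: it assumes the ByteConverter library is referenced and that the runtime accepts the "windows-1252" charset name, and the sub name is made up for illustration.

```b4x
'Hypothetical helper: logs the bytes of the Euro sign under both encodings.
Sub LogEuroBytes
    Dim bc As ByteConverter
    Dim euro As String = Chr(0x20AC)    'Unicode code point of the Euro sign

    'Windows-1252 stores the Euro sign as the single byte 0x80
    Log("windows-1252: " & bc.HexFromBytes(euro.GetBytes("windows-1252")))

    'UTF-8 stores it as the three bytes 0xE2 0x82 0xAC
    Log("UTF-8: " & bc.HexFromBytes(euro.GetBytes("UTF-8")))
End Sub
```

A file that holds a lone 0x80 byte where a Euro sign is expected is therefore Windows-1252, while one that holds 0xE2 0x82 0xAC is UTF-8.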

GiovanniPolese · Feb 4, 2025

Hi. I don't use any third party code to read the dxf because, as far as I know, none exists. I posted some code in the first messages, and the discussion should be related to it.

Things are rather simple: I open that dxf with my app and get the strings. (A dxf is a tag-oriented file: you always meet a label (tag) followed by a data item, and the tag indicates the meaning of the data that follows. If you open the dxf with Notepad++ and look for tag 304, you will find the texts.) After getting the strings, I convert them to English chars. So the various A with accents are all converted to A. If in the tablet a character doesn't appear, it is because it is not correctly understood by the code that I posted here. This means that, while you see an A with accent in Autocad, in B4A, with TextReader, it is read as a "diamond" symbol or something else. My code tried to avoid this.

As I wrote in other posts here, the codepage indicated at the beginning of the dxf may not be true. Assuming that it is true, it should be enough to use "Windows_1252" (if the codepage is ANSI_1252) to get a correct conversion, but this doesn't happen. In this particular dxf, written by Autocad, the codepage must be correct. In the very first rows of the dxf you will find CODEPAGE, followed by ANSI_1252. Then try to use a TextReader on the file opened with Windows-1252: you will meet my problems, some characters are read and others are not. This is the problem.

Once they are correctly read, forget the rest: I convert them with another sub. The focus in this example is: assuming that the strings are coded as Windows-1252 (or ANSI_1252, hoping that they are the same), can we read them correctly with B4A? Try it and forget everything else. Maybe I was wrong in my tests...

emexes · Feb 4, 2025

I am slogging through that paragraph, sentence by sentence. And I can tell that it's good, which is why I momentarily liked it, then thought... perhaps I should read it all first.

Also, are you in Portugal or Italy? Not that it matters: I get the vibe that you speak at least three languages, which is pretty impressive too. I did French and then German at high school here, and now all school students have to do a language from year 0: my son did Greek for the first few years (because Melbourne is apparently the largest Greek city outside of Greece, or something like that: a third of the students at that school had Greek parents or grandparents), and then Italian, and then French. A lot of schools do Indonesian, but I suspect that's just as an excuse to have a school trip to Bali.

Sorry, I've gone off the rails. First question is:

GiovanniPolese said:
I don't use any third party code to read the dxf

Does that mean you wrote the code that draws the yellow annotations on that tablet app? (And did you write the rest of the tablet app too?!?!)

I would have thought that implementing that AutoCad DXF rendering would be a massive job. Or do you only need to implement a few elements of it, to do with locating and labelling points on the underlying aerial imagery?

When you render text, is it always from a "proper" font (that has all the Portuguese and other common Unicode characters)? Or will you at some point have to also do the AutoCad-style plotter-pen text too? Not that it matters: if you're up to rendering DXF files, then organising a Unicode plotter-pen font should be easy.

emexes · Feb 4, 2025

GiovanniPolese said:
look for tag 304, you will find the texts

I looked for space+e-acute+space, but... same result

great minds think alike!

Also I went in with PSPad text editor in hex mode, could see that it was UTF-8.

GiovanniPolese said:
After getting the strings, I convert them to English chars. So the various A with accents are all converted to A.

Why do that? Don't the fonts that you are using, have the accented characters?

Actually, now that I have a closer look at those lower-case e's ... they do look like they might be "manually" plotted, rather than a TrueType / OpenType font.

Are you plotting those characters yourself? Is it from a font file that you created? ie can you modify it? Does it have the accented characters etc in it?

emexes · Feb 4, 2025

GiovanniPolese said:
If in the tablet a character doesn't appear, it is because it is not correctly understood by the code that I posted here. This means that, while you see an A with accent in Autocad, in B4A, with TextReader, it is read as a "diamond" symbol or something else.

Agreed. In the file, Chr(193) = "Á" is stored as the two UTF-8 bytes 0xC3 0x81, and according to:

https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout

character 0x81 is one of five unused character codes in Windows-1252, so it gets translated to the diamond character Chr(65533) which is described at:

https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
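
A quick way to watch this happen is to decode those two bytes both ways and log the resulting character codes. The sketch below assumes the ByteConverter library is referenced; both sub names are invented for illustration. What the second decode yields for the 0x81 byte depends on the runtime's Windows-1252 table: in this thread it showed up as Chr(65533).

```b4x
'Hypothetical helper: decodes the UTF-8 bytes of "Á" as UTF-8 and as Windows-1252.
Sub LogDecodedBothWays
    Dim bc As ByteConverter
    Dim raw() As Byte = Array As Byte(0xC3, 0x81)    'UTF-8 encoding of Á (U+00C1)

    LogCharCodes("UTF-8", bc.StringFromBytes(raw, "UTF-8"))
    LogCharCodes("windows-1252", bc.StringFromBytes(raw, "windows-1252"))
End Sub

'Hypothetical helper: logs the numeric code of every character in Text.
Sub LogCharCodes(Label As String, Text As String)
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To Text.Length - 1
        sb.Append(Asc(Text.CharAt(i))).Append(" ")
    Next
    Log(Label & ": " & sb.ToString)
End Sub
```

The UTF-8 decode should give one character (code 193), while the Windows-1252 decode gives two: 195 ("Ã") followed by whatever the runtime substitutes for the undefined byte 0x81.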

GiovanniPolese · Feb 4, 2025

Hi. Sorry, I wrote the answer but didn't send it, due to a little stress here. I am Italian and also speak Portuguese (but I learned it with the Parrot method). About the dxf: I did the massive job of reading the dxf, but it is devoted only to geometrical entities; it is not a general function. It reads lines, polylines, hatches, splines and other entities; it also handles blocks, but I limit them to one level (I mean no block inside block inside block, etc.). Of course it is a nightmare. Then the entities are passed to the OpenGL engine, which has no fonts (at least the library on B4A). So fonts are hand defined, not done by me, and for this I haven't the special chars.

emexes · Feb 4, 2025

GiovanniPolese said:
Then try to use a TextReader on the file opened with Windows-1252: you will meet my problems

That is indeed the problem. Decode with UTF-8 rather than Windows-1252. It should mostly work, since Unicode characters 160 to 255 are also Windows-1252 characters 160 to 255.

But what about the Windows-1252 characters from 128 to 159? The Euro is the main one that stands out. In Windows-1252 it is character 128, but in Unicode it is code point 8364. I say, get the main bit working first, and then we can worry about that set of 32 mostly-obscure leftover characters.

I thought I read that DXF files were either UTF-8 or plain ASCII, but now I am finding conflicting information again.

Should be no problem, though - it's usually pretty easy to distinguish between Windows-1252 non-ASCII bytes, and UTF-8 non-ASCII bytes. But first, try reading the file as UTF-8, and not do any extra translation on that - I think you'll be pleasantly surprised.
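
As a rough illustration of how that distinction can be made on raw bytes: UTF-8 non-ASCII characters always consist of a lead byte followed by one to three continuation bytes in the range 0x80 to 0xBF, a pattern that ordinary Windows-1252 text almost never produces by accident. The sub below is a simplified sketch (the name is made up, and it does not reject every overlong or surrogate sequence that a strict validator would).

```b4x
'Hypothetical helper: True if Data is structurally valid UTF-8 (pure ASCII also passes).
Sub LooksLikeUtf8(Data() As Byte) As Boolean
    Dim i As Int = 0
    Do While i < Data.Length
        Dim b As Int = Bit.And(Data(i), 0xFF)    'bytes are signed, so mask to 0..255
        Dim extra As Int                         'number of continuation bytes expected
        If b < 0x80 Then
            extra = 0                            'plain ASCII
        Else If b >= 0xC2 And b <= 0xDF Then
            extra = 1                            'two-byte sequence
        Else If b >= 0xE0 And b <= 0xEF Then
            extra = 2                            'three-byte sequence
        Else If b >= 0xF0 And b <= 0xF4 Then
            extra = 3                            'four-byte sequence
        Else
            Return False                         'stray continuation byte or invalid lead byte
        End If

        If i + extra >= Data.Length Then Return False    'sequence is cut off at end of data

        For k = 1 To extra
            Dim c As Int = Bit.And(Data(i + k), 0xFF)
            If c < 0x80 Or c > 0xBF Then Return False    'not a continuation byte
        Next
        i = i + extra + 1
    Loop
    Return True
End Sub
```

A line for which this returns False can be decoded as Windows-1252 instead; a line that is pure ASCII decodes identically either way.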

emexes · Feb 4, 2025

GiovanniPolese said:
I did the massive job of reading the dxf

Holy smokes.

GiovanniPolese said:
Of course it is a nightmare.

I imagine it is one of those jobs where, every time you think you're getting close to the finish line, it turns out you have another lap to go.

emexes · Feb 4, 2025

GiovanniPolese said:
So fonts are hand defined, not done by me, and for this I haven't the special chars.

Are all the Windows-1252 characters defined?

And, for a future rainy-day project:

Is the font format able to handle more than 256 characters?

Is it in that AutoCAD SHX font format? Or is it a file of line segment coordinates or something easy like that?

I'm already dreaming up an automatic way to display all the characters from a free and comprehensive OpenType font, and construct line segment representations from that. Like with ASCII art, but in reverse.

Lol reminds me of when I used to print electrical diagrams on a daisy-wheel printer. I learned quick to use more than just the full stop character.

GiovanniPolese · Feb 4, 2025

Finally: I confirm that the code that I posted before solves the problem of reading dxf files with special characters (at least for the Portuguese language). After the string is read correctly, its characters are remapped to English. A screenshot of what happens is attached. The remapping is the following (before, I was missing some entries):

B4X:

Dim PortugueseToEnglish As Map = CreateMap( "Ç":"C", _
                                                "é":"e", _
                                                "â":"a", _
                                                "ã":"a", _
                                                "á":"a", _
                                                "à":"a", _
                                                "ç":"c", _
                                                "ê":"e", _
                                                "Ê":"E", _
                                                "Í":"I", _
                                                "í":"i", _
                                                "Ô":"O", _
                                                "Ã":"A", _
                                                "Â":"A", _
                                                "Á":"A", _
                                                "À":"A", _
                                                "É":"E", _
                                                "ô":"o", _
                                                "õ":"o", _
                                                "Ú":"U", _
                                                "Õ":"O", _
                                                "Ó":"O", _
                                                "ó":"o", _
                                                "ú":"u", _
                                                "ñ":"n", _
                                                "Ñ":"N")

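The code that applies this map is not shown in this post. One way it might be applied, character by character, is sketched below; the sub name is invented, and this is not necessarily how the original code does it.

```b4x
'Hypothetical helper: replaces each character found in Table with its mapped value.
Sub MapChars(Text As String, Table As Map) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To Text.Length - 1
        'Use a String (not a Char) so that it matches the String keys of the map
        Dim ch As String = Text.CharAt(i)
        'Characters without an entry in the table are kept unchanged
        sb.Append(Table.GetDefault(ch, ch))
    Next
    Return sb.ToString
End Sub
```

With the map above, MapChars("São João", PortugueseToEnglish) would be expected to return "Sao Joao".
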
GiovanniPolese · Feb 4, 2025

In theory, it shouldn't matter whether the file is Windows-1252 or UTF-8. I use an InputStream for reading. So I don't specify the file encoding when opening, and I handle the cases that I have met (which is not to say that it will work in every case) at the line-reading level, which was the problem. In other words, simply using a TextReader initialized with a given encoding (which may also be wrong or not available) was not satisfactory. This is why I wrote that code. Of course, if I didn't miss anything... Thanks for reading.

emexes · Feb 4, 2025

GiovanniPolese said:
I use an InputStream for reading.

Lol so did I, but I was waiting to see whether we needed any adaptive code page handling. If your code works, then brilliant.

Given that it's already written, here's how I did it, in case it inspires any ideas.

Main difference is that I split the line reading up into two phases: first I read a line of bytes, then I convert those bytes to a string.

Oh, and it doesn't matter whether lines are terminated with CR or LF or both: whichever one it finds first in the file, it uses that as the line terminator, and ignores the other.

B4X:

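'Reads one line of raw bytes (line terminator excluded). Relies on the global LineTerminator declared in Process_Globals.
Sub Txr_ReadLineBytes(InFile As InputStream) As Byte()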
    Dim buffer(1) As Byte
 
    Dim bb As B4XBytesBuilder
    bb.Initialize
 
    Do While InFile.BytesAvailable <> 0
        InFile.ReadBytes(buffer, 0, 1)    'just checked to ensure at least 1 byte was available
     
        Dim Ch As Byte = buffer(0)
     
        If Ch = 13 Or Ch = 10 Then    'CR, LF 
            If LineTerminator = 0 Then
                LineTerminator = Ch    'whichever we hit first, that's our line terminator
            End If
         
            If Ch = LineTerminator Then
                Exit 'do
            End If
        Else
            bb.Append(buffer)
        End If
    Loop
 
    Return bb.ToArray
End Sub

B4X:

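'Reads one line and decodes it as UTF-8, falling back to Windows-1252 if the UTF-8 decode produced a replacement character.
Sub Txr_ReadLine(InFile As InputStream) As String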
    Dim buffer() As Byte = Txr_ReadLineBytes(InFile)
 
    If buffer.Length = 0 Then Return ""    'bypass needless processing below
 
    Dim bc As ByteConverter
    Dim TextLine As String = bc.StringFromBytes(buffer, "UTF-8")

    If TextLine.Contains(Chr(65533)) Then    'if Unicode replacement character, then wasn't valid UTF-8
        TextLine = bc.StringFromBytes(buffer, "windows-1252")
    End If
 
    Return TextLine
End Sub

B4X:

Sub Process_Globals
    Dim LineTerminator As Byte
End Sub

Sub AppStart (Args() As String)
    Dim InFile As InputStream = File.OpenInput("e:\dxf", "floorplan.dxf")
    LineTerminator = 0    'restart cr/lf line terminator discovery
    Dim NumLines As Int = 0
 
    Do While InFile.BytesAvailable <> 0
        Dim L As String = Txr_ReadLine(InFile)
        NumLines = NumLines + 1
        Log(NumLines & TAB & L)
    Loop

    InFile.Close
End Sub

emexes · Feb 4, 2025

GiovanniPolese said:
The remapping is the following

Brilliant. I like that your method could also do a replacement like "ß" to "ss".

I usually do it... probably "inside out" compared to your method: I use String.Replace to do the substitutions, working through the substitution table rather than working through the string being de-accented.
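
That table-driven variant could look roughly like this (a sketch with an invented sub name, not code taken from either poster):

```b4x
'Hypothetical helper: runs String.Replace once per entry of the substitution table.
Sub RemoveAccents(Text As String, Table As Map) As String
    For Each accented As String In Table.Keys
        Dim plain As String = Table.Get(accented)
        Text = Text.Replace(accented, plain)
    Next
    Return Text
End Sub
```

Because each entry is a plain string replacement, a one-to-many entry such as "ß":"ss" needs no special handling. The trade-off is one pass over the string per table entry, instead of one map lookup per character.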

emexes · Feb 4, 2025

GiovanniPolese said:
Thanks for reading.

Thanks for sharing the puzzle. From this safe distance, it looks like an interesting app. And some challenges are fun. Although it was probably more fun for me than for you, because I don't have customers whining at me to "fix it, by yesterday if possible".

GiovanniPolese · Feb 4, 2025

Yes. I will look at your code more closely asap. My situation is rather frustrating, but not due to this topic. My main problem is that Android closes my app without any message, sometimes yes, sometimes no. It is not a problem that the debugger can resolve. Moreover, with big text files, or with OpenGL, the debugger is not usable... definitely not fun. And customers say: "Why does the tablet crash randomly? Autocad reads a 1 GB file with no problem. The tablet has 16 GB!!" Etc. Etc. Thanks. Bye.
