Android Question Special chars and TextReader

GiovanniPolese

Well-Known Member
Licensed User
Longtime User
Hi everybody,

I am using TextReader with files whose encoding I don't know; it may be either "Windows-1252" or "UTF-8". The problem arises when these files contain special characters. Although the encoding is, in theory, declared in the file itself, in practice the declaration is not respected. Some files are read successfully by TextReader with one encoding, while others, under the same conditions, are not. In other words, although all the files are normally declared as "Windows-1252", very often the TextReader must be initialized with UTF8 to read them correctly.

Besides checking the explicit (often false) declaration inside the file, I also tried analysing the very first characters of the file, with no success. The only way to get a sure answer is to read the raw bytes and find out whether and where the special characters appear. This creates new problems, because the files are often huge (over 1 GB) and cannot be loaded entirely into memory with File.ReadBytes. Alternatively, maybe I must read byte by byte.

I use TextReader because, due to the file size, I read the files line by line. Moreover, the structure of these files requires line-by-line processing. Is there any way to read raw bytes from a text file, line by line, without using TextReader? Thanks in advance for any suggestion.
 
Solution
I am posting here the code that solved my problem. It reads from an externally defined (global) InputStream. It is a sub that reads one line of an ASCII file, byte by byte.
It returns a decoded line. It works for Portuguese special chars written in dxf files (which sometimes declare their encoding incorrectly). It first decodes the line as UTF8 and, if special chars are empirically detected, redoes the conversion using Windows-1252. Any improvement is welcome. As a matter of fact, I have doubts about my empirical detection with Asc(char)>256, although it works for my cases.

B4X:
private Sub Txr_ReadLine As String
#if TEXTR    
    Return Txr.Readline ' if TEXTR is defined, it uses a global TextReader Txr. of course this is not the case...

GiovanniPolese

On the tablet, special chars are converted to their English counterparts. As you see, some are missing, so the code that I posted here should be updated, or another sub should be written.
 

emexes

Expert
Licensed User
Inside a circle there were added all special chars.

Those seem to have come through to this end ok. 🏆

As you see, some chars are still fooling me

Which characters inside the circle are fooling you? They look ok to me, but on the other hand, I don't know exactly what characters they are meant to be, other than special, which I understand to mean: characters outside of ASCII characters 0..127

Ignore that last bit, I just spotted the next two posts, including the tablet screenshot showing the accented characters having been converted to their unaccented ASCII equivalents.

Do you know which program is removing the accents, ie replacing the accented characters with their ASCII equivalents?

Is that the EXACT same DXF file displayed in both of those screenshots? ie NOT two different versions of the one file.



emexes

In the tablet, special chars are converted to their english counterpart.

What software is being used on the tablet to display the DXF file?

Is it a third-party app? In which case, you should be asking them about why their app is not correctly displaying accented characters like DwgTrueViewer and the sharecad.org online viewer do.

Or is it your app, in which case: what does your app show if you turn off or bypass your translation-to-ASCII fix? I think what you actually need is a Unicode-to-AutoCad-8-bit-narrow-characters-which-are-actually-Windows-1252-characters translation.
 

emexes

In your app, what is drawing the text strings and line segments from the DXF file, onto the screen as shown in the tablet screenshot?

Is it a "black box" library or class that you're calling, or did you write your own code to do it?


emexes

Ok, I think I'm getting closer to the problem.

Is it that some of your .DXF files encode non-ASCII characters using Windows-1252 code page, and others encode non-ASCII characters using UTF-8?

And that the tablet library needs to know which encoding is used?

I have a feeling that there is an obscure entry in the DXF headers that indicates which encoding is used.

Perhaps the tablet library is ignoring that.

The Portuguese characters DXF file is using UTF-8 for the line of letter e's with various accent marks.

But if I copy that exact same line as UTF-8 bytes into my hand-written test DXF file, then it displays two wrong characters in place of each letter e.

If I then change my hand-written test DXF file so that the letter e's are encoded as Windows-1252 rather than UTF-8 ie as single bytes rather than multi-byte sequences, then my test DXF file displays correctly.

It just occurred to me that the Euro currency symbol might be a good test case. Characters 160 to 255 are the same in both Windows-1252 and Unicode, whereas the Euro currency symbol is 0x80 (128 decimal) in Windows-1252 but 0x20AC (8364 decimal) in Unicode.
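
As a rough sketch of that test (assuming the ByteConverter library is referenced and the runtime offers the "windows-1252" charset; the variable names are only for illustration), the Euro bytes can be decoded both ways and compared:

B4X:
Dim bc As ByteConverter
Dim EuroAs1252() As Byte = Array As Byte(0x80)              'Euro as a single Windows-1252 byte
Dim EuroAsUtf8() As Byte = Array As Byte(0xE2, 0x82, 0xAC)  'Euro as three UTF-8 bytes

'Both decodings should give the same Unicode code point, 8364 (0x20AC)
Log(Asc(bc.StringFromBytes(EuroAs1252, "windows-1252").CharAt(0)))
Log(Asc(bc.StringFromBytes(EuroAsUtf8, "UTF-8").CharAt(0)))

'A lone 0x80 byte is not valid UTF-8, so decoding it as UTF-8 should
'produce the replacement character Chr(65533) and this should log true
Log(bc.StringFromBytes(EuroAs1252, "UTF-8").Contains(Chr(65533)))

If a DXF file contains a Euro sign, the byte pattern around it tells the two encodings apart at a glance.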


 

GiovanniPolese

Hi. I don't use any third party code to read the dxf because, as far as I know, none exists. I posted code in the first messages, and the discussion should relate to that.

Things are rather simple: I open that dxf with my app and get the strings. (A dxf is a tag-oriented file: you always meet a label (tag) and then a data item, and the tag indicates the meaning of the data that follows. If you open the dxf with Notepad++ and look for tag 304, you will find the texts.) After getting the strings, I convert them to English chars, so the various accented A's are all converted to A. If a character doesn't appear on the tablet, it is because the code that I posted here did not understand it correctly. This means that, while you see an accented A in AutoCAD, in B4A with TextReader it is read as a "diamond" symbol or something else. My code tries to avoid this.

As I wrote in other posts here, the code page indicated at the beginning of the dxf may not be true. Assuming that it is true, it should be enough to use "Windows_1252" (if the code page is ANSI_1252) to get a correct conversion, but this doesn't happen. In this particular dxf, written by AutoCAD, the code page must be correct. In the very first rows of the dxf you will find CODEPAGE, followed by ANSI_1252. Then try to use a TextReader on the file, opened with Windows-1252: you will meet my problem, where some characters are read and others are not.

This is the problem. Once the strings are read correctly, forget the rest: I convert them in another sub. The focus in this example is: assuming that the strings are coded as Windows-1252 (or ANSI_1252, hoping that they are the same), can we read them correctly with B4A? Try it and forget everything else. Maybe I was wrong in my tests...
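
To illustrate the tag/data layout, here is a minimal sketch of looking up the declared code page. It assumes the header variable is named $DWGCODEPAGE and is followed by a group code line (3) and then the value, which is how AutoCAD-written files usually look. The sub name and the Lines list (the first lines of the file, already read as text) are hypothetical. The header entries involved are plain ASCII, so they should read the same under either encoding.

B4X:
'Returns the code page declared in the DXF header (e.g. "ANSI_1252"), or "" if not found.
'Lines is assumed to hold the first lines of the file, one entry per line.
Sub DeclaredCodePage(Lines As List) As String
    For i = 0 To Lines.Size - 3
        Dim Value As String = Lines.Get(i)
        If Value.Trim.EqualsIgnoreCase("$DWGCODEPAGE") Then
            'the next line is the group code (3), the one after it is the value
            Dim CodePage As String = Lines.Get(i + 2)
            Return CodePage.Trim
        End If
    Next
    Return ""
End Sub

As said above, the declared value may not match the bytes actually in the file, so it can only be a first guess.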
 

emexes

I am slogging through that paragraph, sentence by sentence. And I can tell that it's good, which is why I momentarily 😍'd it, then thought... perhaps I should read it all first.

Also, are you in Portugal or Italy? Not that it matters: I get the vibe that you speak at least three languages, which is pretty impressive too. I did French and then German at high school here, and now all school students have to do a language from year 0: my son did Greek for the first few years (because Melbourne is apparently the largest Greek city outside of Greece, or something like that: a third of the students at that school had Greek parents or grandparents), and then Italian, and then French. A lot of schools do Indonesian, but I suspect that's just as an excuse to have a school trip to Bali.

Sorry, I've gone off the rails. First question is:

I don't use any third party code to read the dxf

Does that mean you wrote the code that draws the yellow annotations on that tablet app? (and did you write the rest of the tablet app too?!?!)

I would have thought that implementing that AutoCad DXF rendering would be a massive job. Or do you only need to implement a few elements of it, to do with locating and labelling points on the underlying aerial imagery?

When you render text, is it always from a "proper" font (that has all the Portuguese and other common Unicode characters)? Or will you at some point have to also do the AutoCad-style plotter-pen text too? Not that it matters: if you're up to rendering DXF files, then organising a Unicode plotter-pen font should be easy.
 

emexes

look for tag 304, you will find the texts

I looked for space+e-acute+space, but... same result 👍 great minds think alike!

Also I went in with PSPad text editor in hex mode, could see that it was UTF-8.

After getting the strings, I convert them to english chars. So, the various A with accents, are all converted to A.

Why do that? Don't the fonts that you are using, have the accented characters?

Actually, now that I have a closer look at those lower-case e's ... they do look like they might be "manually" plotted, rather than a TrueType / OpenType font.

Are you plotting those characters yourself? Is it from a font file that you created? ie can you modify it? Does it have the accented characters etc in it?


emexes

If in the tablet a character doesn't appear, it is because it is not correctly understood by the code that I posted here. This means that, while you see an A with accent in Autocad, in B4A, with TextReader, it is read as a "diamond" symbol or something else.

Agreed. In the file, Chr(193) = "Á" is stored as the two UTF-8 bytes 0xC3 0x81. When those bytes are decoded as Windows-1252 instead, then according to:

https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout

character 0x81 is one of five unused character codes in Windows-1252, so it gets translated to the diamond character Chr(65533) which is described at:

https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
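
A small sketch to reproduce this, assuming the ByteConverter library is referenced (the variable names are made up for illustration): decode the same two bytes both ways and log the resulting code points.

B4X:
Dim bc As ByteConverter
Dim Raw() As Byte = Array As Byte(0xC3, 0x81)    '"Á" as stored in the UTF-8 DXF file

Dim AsUtf8 As String = bc.StringFromBytes(Raw, "UTF-8")
Dim As1252 As String = bc.StringFromBytes(Raw, "windows-1252")

'Decoded as UTF-8: one character, code point 193
Log(AsUtf8.Length & " char, code " & Asc(AsUtf8.CharAt(0)))

'Decoded as Windows-1252: two characters. The first is 195 ("Ã"); the second
'is 65533 on a runtime that treats byte 0x81 as unmapped, as described above
For i = 0 To As1252.Length - 1
    Log(Asc(As1252.CharAt(i)))
Next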


GiovanniPolese

Hi. Sorry, I wrote the answer but didn't send it, due to a little stress here. I am Italian and also speak Portuguese (but I learned it with the Parrot method). About the dxf: I did the massive job of reading the dxf, but it handles only geometrical entities; it is not a general-purpose function. It reads lines, polylines, hatches, splines and other entities; it also handles blocks, but I limit them to one level (I mean no block inside block inside block, etc.). Of course it is a nightmare. Then the entities are passed to the OpenGL engine, which has no fonts (at least in the library on B4A). So the fonts are hand-defined, not by me, and that is why I don't have the special chars.
 

emexes

Then try to use a TextReader on file opened with Windows-1252.: you will meet my problems

That is indeed the problem. Decode with UTF-8 rather than Windows-1252. It should mostly work, since Unicode characters 160 to 255 are also Windows-1252 characters 160 to 255.

But what about the Windows-1252 characters from 128 to 159? The Euro is the main one that stands out. In Windows-1252 it is character 128, but in Unicode it is code point 8364. I say, get the main bit working first, and then we can worry about that set of 32 mostly-obscure leftover characters.

I thought I read that DXF files were either UTF-8 or plain ASCII, but now I am finding conflicting information again.

Should be no problem, though - it's usually pretty easy to distinguish between Windows-1252 non-ASCII bytes, and UTF-8 non-ASCII bytes. But first, try reading the file as UTF-8, and not do any extra translation on that - I think you'll be pleasantly surprised.
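
One way to make that distinction at the byte level, as a simplified sketch rather than a complete validator: check whether every non-ASCII byte belongs to a well-formed UTF-8 sequence. The sub name LooksLikeUtf8 is hypothetical, and the check does not reject overlong forms or surrogates.

B4X:
'Returns True if Data is well-formed UTF-8 (ASCII-only input counts as valid).
Sub LooksLikeUtf8(Data() As Byte) As Boolean
    Dim i As Int = 0
    Do While i < Data.Length
        Dim b As Int = Bit.And(Data(i), 0xFF)    'bytes are signed in B4X, so mask to 0..255
        Dim Extra As Int                         'number of continuation bytes expected
        If b < 0x80 Then
            Extra = 0                            'plain ASCII
        Else If b >= 0xC2 And b <= 0xDF Then
            Extra = 1                            'lead byte of a 2-byte sequence
        Else If b >= 0xE0 And b <= 0xEF Then
            Extra = 2                            'lead byte of a 3-byte sequence
        Else If b >= 0xF0 And b <= 0xF4 Then
            Extra = 3                            'lead byte of a 4-byte sequence
        Else
            Return False                         'stray continuation byte or invalid lead byte
        End If
        If i + Extra >= Data.Length Then Return False    'sequence is cut off
        For k = 1 To Extra
            'every continuation byte must look like 10xxxxxx
            If Bit.And(Data(i + k), 0xC0) <> 0x80 Then Return False
        Next
        i = i + Extra + 1
    Loop
    Return True
End Sub

If it returns False for a line, the line cannot be UTF-8, so Windows-1252 is the likely fallback. A short Windows-1252 line can by coincidence also be well-formed UTF-8 (for example the two bytes of "Ã©"), so the test is a heuristic, not a proof.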
 

emexes

So fonts are hand defined,done not by me, and for this I haven't the special chars..

Are all the Windows-1252 characters defined?

And, for a future rainy-day project:

Is the font format able to handle more than 256 characters?

Is it in that AutoCAD SHX font format? Or is it a file of line segment coordinates or something easy like that?

I'm already dreaming up an automatic way to display all the characters from a free and comprehensive OpenType font, and construct line segment representations from that. Like with ASCII art, but in reverse.

Lol reminds me of when I used to print electrical diagrams on a daisy-wheel printer. I learned quick to use more than just the full stop character.
 

GiovanniPolese

Finally: I confirm that the code that I posted before solves the problem of reading dxf files with special characters (at least for the Portuguese language). After the string has been read correctly, its characters are remapped to English. A screenshot of what happens is attached. The remapping is the following (before, I was missing some entries):

B4X:
Dim PortugueseToEnglish As Map = CreateMap( "Ç":"C", _
                                                "é":"e", _
                                                "â":"a", _
                                                "ã":"a", _
                                                "á":"a", _
                                                "à":"a", _
                                                "ç":"c", _
                                                "ê":"e", _
                                                "Ê":"E", _
                                                "Í":"I", _
                                                "í":"i", _
                                                "Ô":"O", _
                                                "Ã":"A", _
                                                "Â":"A", _
                                                "Á":"A", _
                                                "À":"A", _
                                                "É":"E", _
                                                "ô":"o", _
                                                "õ":"o", _
                                                "Ú":"U", _
                                                "Õ":"O", _
                                                "Ó":"O", _
                                                "ó":"o", _
                                                "ú":"u", _
                                                "ñ":"n", _
                                                "Ñ":"N")
 

Attachments

  • Clipboard01.png

GiovanniPolese

In theory, it shouldn't matter whether the file is Windows-1252 or UTF-8. I use an InputStream for reading, so I don't specify the file encoding when opening it, and I handle the cases that I have met (with no guarantee that it will work in every case) at the line-reading level, which was the problem. In other words, simply using a TextReader initialized with a given encoding (which may also be wrong or not available) was not satisfactory. This is why I wrote that code. Unless I missed something, of course... Thanks for reading.
 

emexes

I use an InputStream for reading.

Lol so did I, but I was waiting to see whether we needed any adaptive code page handling. If your code works, then brilliant.

Given that it's already written, here's how I did it, in case it inspires any ideas.

Main difference is that I split the line reading up into two phases: first I read a line of bytes, then I convert those bytes to a string.

Oh, and it doesn't matter whether lines are terminated with CR or LF or both: whichever one it finds first in the file, it uses that as the line terminator, and ignores the other.

B4X:
Sub Txr_ReadLineBytes(InFile As InputStream) As Byte()
    Dim buffer(1) As Byte
 
    Dim bb As B4XBytesBuilder
    bb.Initialize
 
    Do While InFile.BytesAvailable <> 0
        InFile.ReadBytes(buffer, 0, 1)    'just checked to ensure at least 1 byte was available
     
        Dim Ch As Byte = buffer(0)
     
        If Ch = 13 Or Ch = 10 Then    'CR, LF 
            If LineTerminator = 0 Then
                LineTerminator = Ch    'whichever we hit first, that's our line terminator
            End If
         
            If Ch = LineTerminator Then
                Exit 'do
            End If
        Else
            bb.Append(buffer)
        End If
    Loop
 
    Return bb.ToArray
End Sub
B4X:
Sub Txr_ReadLine(InFile As InputStream) As String
    Dim buffer() As Byte = Txr_ReadLineBytes(InFile)
 
    If buffer.Length = 0 Then Return ""    'bypass needless processing below
 
    Dim bc As ByteConverter
    Dim TextLine As String = bc.StringFromBytes(buffer, "UTF-8")

    If TextLine.Contains(Chr(65533)) Then    'if Unicode replacement character, then wasn't valid UTF-8
        TextLine = bc.StringFromBytes(buffer, "windows-1252")
    End If
 
    Return TextLine
End Sub
B4X:
Sub Process_Globals
    Dim LineTerminator As Byte
End Sub

Sub AppStart (Args() As String)
    Dim InFile As InputStream = File.OpenInput("e:\dxf", "floorplan.dxf")
    LineTerminator = 0    'restart cr/lf line terminator discovery
    Dim NumLines As Int = 0
 
    Do While InFile.BytesAvailable <> 0
        Dim L As String = Txr_ReadLine(InFile)
        NumLines = NumLines + 1
        Log(NumLines & TAB & L)
    Loop

    InFile.Close
End Sub
 

emexes

The remapping is the following

Brilliant. I like that your method could also do a replacement like "ß" to "ss".

I usually do it... probably "inside out" to your method, and use String.Replace to do the substitutions, but working through the substitution table rather than working through the string being de-accented.
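
The two approaches side by side, as a minimal sketch: the sub names are hypothetical, and Table is assumed to be a Map of String keys to String values like the PortugueseToEnglish map posted above.

B4X:
'Approach 1: walk the string and look each character up in the table
Sub DeAccentByLookup(Text As String, Table As Map) As String
    Dim sb As StringBuilder
    sb.Initialize
    For i = 0 To Text.Length - 1
        Dim c As String = Text.CharAt(i)     'as a String, so it matches the String keys
        sb.Append(Table.GetDefault(c, c))    'unmapped characters pass through unchanged
    Next
    Return sb.ToString
End Sub

'Approach 2: walk the table and replace each entry throughout the string
Sub DeAccentByReplace(Text As String, Table As Map) As String
    For Each Accented As String In Table.Keys
        Text = Text.Replace(Accented, Table.Get(Accented))
    Next
    Return Text
End Sub

With that map, both subs should turn "São João" into "Sao Joao". Both cope with multi-character replacements such as "ß" to "ss"; only the second also copes with multi-character keys, should the table ever need them.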
 

emexes

Thanks for reading.

Thanks for sharing the puzzle. From this safe distance, it looks like an interesting app. And some challenges are fun. Although it was probably more fun for me than for you, because I don't have customers whining at me to "fix it, by yesterday if possible".
 

GiovanniPolese

Yes. I will look at your code more closely asap. My situation is rather frustrating, but not because of this topic. My main problem is that Android closes my app without any message, sometimes yes, sometimes no. It is not a problem that the debugger can solve. Moreover, with big text files or with OpenGL, the debugger is not usable... definitely not fun. And customers say: "Why does the tablet crash randomly? AutoCAD reads a 1 GB file with no problem. The tablet has 16 GB!!" Etc. Etc. Thanks. Bye.
 