Android Question Special chars and TextReader

GiovanniPolese

Well-Known Member
Licensed User
Longtime User
Hi to everybody here.
I am using TextReader with files whose encoding I don't know. As a matter of fact, it may be either "Windows-1252" or "UTF-8". The problem arises when these files contain special characters. Although, in theory, the encoding is declared in the file itself, in practice it is not respected. What I see is that some files are read successfully with TextReader when one encoding is used, while others, in the same conditions, are not read correctly. In other words, while all files are normally declared as "Windows-1252", very often it is necessary to initialize the TextReader with UTF8 to read them correctly.

Besides the explicit (often false) declaration inside the file, I also tried an analysis of the very first characters of the file, with no success. The only way to get a sure answer is to read the raw bytes and discover when and if the special characters appear. This creates new problems, because the files are often huge (over one GB) and cannot be entirely loaded in memory with File.ReadBytes; alternatively, maybe I must read byte by byte. I use TextReader because, due to the file size, I read the files line by line; moreover, the structure of these files requires line-by-line processing.

Is there any way to read raw bytes from a text file, line by line, without using TextReader? Thanks in advance for any suggestion.
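For reference, the try-one-encoding-then-fall-back approach can be sketched in Python (illustrative only; the function name is mine). Reading in binary mode keeps memory low even for GB-sized files, because only one line is held at a time:

```python
def read_lines_guessing(path):
    # Read raw bytes line by line (no full-file load), try strict UTF-8
    # first, and fall back to Windows-1252 when the bytes are invalid.
    with open(path, "rb") as f:          # binary mode: no decoding yet
        for raw in f:                     # iterates on b"\n", memory-light
            raw = raw.rstrip(b"\r\n")
            try:
                yield raw.decode("utf-8")            # strict: raises on bad bytes
            except UnicodeDecodeError:
                yield raw.decode("windows-1252")     # per-byte fallback
```

Decoding per line also means a file can mix encodings between lines, which decoding the whole file at once could not handle.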
 
GiovanniPolese

Well-Known Member
Licensed User
Longtime User
I post here the code that resolved my problem. It reads a file through an externally (globally) defined InputStream: a sub that reads one line of an ASCII file, byte by byte, and returns the decoded line. It works for Portuguese special chars in DXF files (sometimes with an improper encoding declaration inside). It first decodes the line as UTF-8 and, if special chars are empirically detected, redoes the conversion using Windows-1252. Any improvement is welcome; as a matter of fact I have doubts about my empirical detection with Asc(char)>256, although it works for my cases.

B4X:
private Sub Txr_ReadLine As String
#if TEXTR
    Return Txr.ReadLine ' if TEXTR is defined, use a global TextReader Txr. Of course this is not the case
#else
    Dim sb As StringBuilder
    sb.Initialize
    Dim buffer(1) As Byte
    Dim bytesRead As Int
    Dim eol As Boolean = False
    Dim rigaDecodificata As String ' the decoded line
    rigaDecodificata = ""

    Do While Not(eol)
        bytesRead = InXr.ReadBytes(buffer, 0, 1) ' global InputStream InXr, one byte at a time

        If bytesRead <= 0 Then ' end of file
            eol = True
        Else
            Dim byteVal As Int = Bit.And(buffer(0), 255)
            If byteVal = 10 Then ' LF detected: end of line
                eol = True
            Else If byteVal <> 13 Then ' ignore CR
                sb.Append(Chr(byteVal)) ' accumulate the raw byte as a char (0..255)
            End If
        End If
    Loop

    If sb.Length > 0 Then ' decode the line (also the last one, if the file lacks a final LF)
        ' ISO-8859-1 maps chars 0..255 back to the original raw bytes one-to-one
        Dim rigaBytes() As Byte = sb.ToString.GetBytes("ISO-8859-1")
        rigaDecodificata = BytesToString(rigaBytes, 0, rigaBytes.Length, "UTF-8")

        ' empirical check of the decoded chars: bytes that are not valid UTF-8
        ' come out as unexpected high code points, so redo with Windows-1252
        Dim redo As Boolean = False
        Dim i As Int
        For i = 0 To rigaDecodificata.Length - 1
            If Asc(rigaDecodificata.CharAt(i)) > 256 Then
                redo = True
                Exit
            End If
        Next

        If redo Then
            rigaDecodificata = BytesToString(rigaBytes, 0, rigaBytes.Length, "Windows-1252")
        End If
    End If

    Return rigaDecodificata
#end if
End Sub
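A note on the empirical Asc(char) > 256 test: lenient decoders (Java's String-from-bytes conversion, which B4X typically uses under the hood, behaves this way) substitute U+FFFD (decimal 65533) for invalid UTF-8 bytes, so testing for that exact replacement character is a tighter check than > 256, which would also flag legitimate characters such as € (code point 8364). A Python sketch of the idea (illustrative; the function name is mine):

```python
REPLACEMENT = "\ufffd"  # U+FFFD, what lenient decoders emit for bad bytes

def needs_windows_1252(raw: bytes) -> bool:
    # Decode leniently as UTF-8; any replacement character means the
    # bytes were not valid UTF-8, so Windows-1252 is the better guess.
    return REPLACEMENT in raw.decode("utf-8", errors="replace")
```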
 
Solution

emexes

Expert
Licensed User
If the end-of-line characters are ASCII control characters, then there should be no problem, because those characters are always single bytes in any sane 8-bit encoding.
 

emexes

Why do you not know whether the input is Windows-1252 or UTF-8? And how can you tell anyway? Nearly all UTF-8 text is also valid Windows-1252.

The only thing you can prove is that the input is not UTF-8, but you can only be sure of that if the input happens to contain Windows-1252 high-bit bytes or byte sequences that are invalid UTF-8 encodings.

But either way, it shouldn't affect the line parsing, eg UTF-8 multi-byte sequences shouldn't get split across lines, because multi-byte sequences have the high bit set in every byte, and thus can't include ASCII control characters like 13 (carriage return) and 10 (line feed).
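That multi-byte sequences can never collide with CR or LF is easy to check (Python, illustrative; the helper name is mine):

```python
def multibyte_never_contains_ascii(ch):
    # Every byte of a multi-byte UTF-8 sequence has its high bit set
    # (lead bytes are >= 0xC2, continuations are 0x80..0xBF), so the
    # line terminators 10 (LF) and 13 (CR) can never appear inside one.
    return all(b >= 0x80 for b in ch.encode("utf-8"))
```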
 

emexes

the files are often huge (over one Gb) and cannot be entirely loaded in memory with File.ReadBytes

Do the files exist on a local disk, or are you trying to decode them as they come in over the internet?

Are they split into lines? Is there a maximum line length?

Not that it matters, because there should be no problem going through the file in fixed-length chunks: the maximum size of a UTF-8 sequence is 4 bytes, which makes multibyte encodings split across a chunk boundary easy to handle.
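In Python, that chunk-boundary handling can be sketched with an incremental decoder, which buffers a multi-byte sequence split across chunks instead of failing mid-sequence (illustrative; the function name is mine):

```python
import codecs

def decode_in_chunks(path, chunk_size=65536):
    # Decode a file in fixed-length chunks; the incremental decoder
    # holds back a partial multi-byte sequence at a chunk boundary
    # and completes it when the next chunk arrives.
    dec = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield dec.decode(chunk)
        yield dec.decode(b"", final=True)  # flush any trailing state
```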

Can a single file have some sections encoded using UTF-8 and other sections encoded using Windows-1252, in the one file? Or is a file always entirely encoded with UTF-8 *OR* Windows-1252, ie one or the other but NOT both?

Can you point us to some samples on the internet? Or upload a Zip of some files here, to give an idea of what you're dealing with?
 

GiovanniPolese

Hi. My problems come from reading AutoCAD DXF files, coming from worldwide sources. A DXF file has a declaration in the header saying which encoding is used. Maybe this declaration is respected in files directly generated by AutoCAD, but there are many programs writing DXF files with no care for this aspect. I cannot send these files because they are not mine but client property. Anyway, I will attach an extract made with Notepad++. I don't know whether Notepad++ modified the encoding while saving this sample or during the copy/paste process. I attach Notepad and Notepad++ files.
The only new observation is that I also found an ANSI_1252 encoding. I don't know whether using ANSI_1252 may change the situation or not. Thanks for your observations.
 

Attachments

  • test_notepad.txt
    103 bytes · Views: 36
  • test_npas++.txt
    105 bytes · Views: 41

emexes

According to this:

https://forums.autodesk.com/t5/auto...f-dxf-file-with-properly-encoded/td-p/8320680

all DXF files use either UTF-8 or plain ASCII (byte values 0..127), and an ASCII file is a UTF-8 file. There should be no high-bit bytes (values 128..255) in a DXF file other than UTF-8 multibyte representations of non-ASCII characters, ie characters with code value aka code point > 127.

Have you found DXF files that contain high-bit characters (byte values 128..255) that are not UTF-8? Just because they're against the rules doesn't mean that some rogue software hasn't done it anyway. But if somebody did create such files, presumably AutoCAD wouldn't read them, and users of the rogue software would be hammering the software's author to fix that bug.

But... there can be a code page specified, and if there is, then it is used to translate high-bit characters that are encoded in ASCII and then interpreted "with the encoding set by the header variable $DWGCODEPAGE, which is ANSI_1252 by default".
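For what it's worth, reading that header variable is mechanical, since ASCII DXF alternates group-code lines and value lines; a Python sketch (illustrative; the function name is mine, and the layout assumed is the standard group-code/value pairing):

```python
def dxf_codepage(path, default="ANSI_1252"):
    # Look up $DWGCODEPAGE in an ASCII DXF header: the code-page name
    # is the value two lines after "$DWGCODEPAGE" (after group code 3).
    countdown = 0
    with open(path, "rb") as f:
        for line in f:
            text = line.strip().decode("ascii", errors="replace")
            if countdown > 0:
                countdown -= 1
                if countdown == 0:
                    return text          # e.g. "ANSI_1252"
            elif text == "$DWGCODEPAGE":
                countdown = 2            # skip the group-code line ("3")
            elif text == "ENDSEC":
                break                    # stop after the HEADER section
    return default                       # documented default if absent
```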

I am getting mixed messages about the escaping to ASCII, and about when code pages are used instead of Unicode, but the gist is here:

https://www.cadforum.cz/en/unicode-and-dxf-files-tip12069

I found a sample DXF file at:

https://github.com/jscad/sample-files/blob/master/dxf/dxf-parser/floorplan.dxf

but this file doesn't seem to contain any non-ASCII characters (although writing that made me realize that perhaps they're encoded as UTF-8)
 

emexes

So now that it looks like you don't have to - or at least, shouldn't have to - do any code-page translations of high-bit bytes (128..255) in the DXF, is the problem that:

1/ you are getting UTF-8 errors (which, from memory, show up as white-on-black diamonds or question marks)
2/ you want to translate the backslashed encodings of non-ASCII characters to Unicode (presumably UTF-8)

or something else again?

If you could find a sharable example DXF file that exhibits the problem, that'd be great.
 

GiovanniPolese

The text that I attached is extracted from DXF files. Maybe reading it with B4A will show what happens? With the B4A logger, I see strange chars or even wrong ones (one special char exchanged with another special char). But I didn't test the files that I attached, only the originals; I have doubts because of the copy/paste and saving operations. Anyway, my code resolves the problem, maybe in a very empirical way. All strings of the Portuguese language are detected (then I convert them to their English equivalents, in practice avoiding their nasty accents). So, finally, there must be a reason my code works; at my level this is enough. But, seeing your appreciable interest, I will make a DXF with Portuguese special chars and post it, maybe not today, because I must ask somebody here who has AutoCAD to do it. I don't have AutoCAD. After this, we can talk again. Wait for my next post. Thanks a lot.
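For the accent-avoiding conversion mentioned above, once a line is decoded correctly, Unicode normalization can strip the accents without any lookup table; a Python sketch (illustrative):

```python
import unicodedata

def strip_accents(text):
    # Decompose accented letters (NFD), then drop the combining marks,
    # leaving plain base letters: Ê -> E, ç -> c, ã -> a.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```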
 

emexes

a dxf with Portuguese special chars

That'd be great. I tried finding a sample at Brazilian universities, but no luck.

You shouldn't have to convert Portuguese to English just to avoid the accented characters. If we can replace them with their Unicode equivalents, encoded as UTF-8, such that they work with AutoCad 2007 onwards, will that be enough?

Or is there a follow-on problem that AutoCad can't render all necessary characters?
 

GiovanniPolese

The problem is that I must render the chars with OpenGL, and I don't have a character set that includes the special ones.
 

emexes

extract made with NotePad++

btw I looked at those samples, and they are definitely UTF-8 :

[screenshot: Notepad++ hex view of the attached sample]

the non-ASCII characters are the double-gray blocks. UTF-8 encodes non-ASCII characters as multibyte high-bit sequences, never as a single high-bit byte. So if you ever found a single high-bit byte by itself, then that file would (well, should) not be UTF-8.

The yellow cells are the CRLF (carriage return + line feed) line terminators.

The last non-ASCII character in that file is the C3 8D on the last line:

https://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=C3+8D&mode=bytes

which is a "LATIN CAPITAL LETTER I WITH ACUTE"
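The C3 8D decoding can be verified directly without the web tool, and the lead byte alone tells you the sequence length (Python, illustrative; utf8_seq_length is my helper):

```python
def utf8_seq_length(lead):
    # Length of a UTF-8 sequence, deduced from its lead byte alone.
    if lead < 0x80:
        return 1   # plain ASCII
    if lead < 0xC0:
        return 0   # continuation byte: never starts a sequence
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

# C3 is a two-byte lead, so C3 8D is one character: Í
assert utf8_seq_length(0xC3) == 2
assert bytes([0xC3, 0x8D]).decode("utf-8") == "\u00CD"  # LATIN CAPITAL LETTER I WITH ACUTE
```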
 

GiovanniPolese

btw I looked at those samples, and they are definitely UTF-8 :

Hi. Anyway, the problem is not knowing that there are special chars, but how to read them correctly.
 

emexes

Is this in the ballpark of what you're looking for from the sample Notepad++ file you provided? (assuming "INFLUʎCIA/" is meant to be "INFLUÊNCIA/")

[screenshot: decoded text from the sample file]
 

GiovanniPolese

Is this in the ballpark of what you're looking for from the sample Notepad++ file you provided? (assuming "INFLUʎCIA/" is meant to be "INFLUÊNCIA/")

I copied some texts from DXF files, not only from one, so they may be incoherent. The text is almost that, but IDENTFICA!O is wrong. Please note the apparently unimportant fact: the texts are "almost" all correct, but not all. This is what I meet: some cases work, some others don't. I even found a problem in my sub that I thought was correct. Wait for my file, not yet ready, which will be a "sum" of texts with special characters, as AutoCAD writes them. It will be a DXF file, which you have to examine with a text editor. To see it graphically, I use the free Autodesk "DWG TrueView 2023 - English". Then try to read it with B4A TextReader, because this is what I am doing: I don't use any tool to read the DXF, at least I didn't find one. Thanks.
 

GiovanniPolese

Hi. Here is a DXF file written by the official AutoCAD program. Inside there are plenty of texts. All the special chars were added inside a circle; many of them are attached to the "MULTILEADER" entity. Inside the circle I also put an image of the DWG TrueView output and what I get in my app. As you see, some chars are still fooling me... Thanks.
 

Attachments

  • caracteres_portugueses.zip
    50.1 KB · Views: 44