Unicode is a standard for text representation and encoding.
The days when 128 English characters (ASCII) were enough for computer programs are long gone.
The Unicode mapping has room for more than 1 million code points (~characters), numbered from 0 to 0x10FFFF.
Unicode covers many complexities involved with text representation. From a developer's perspective, however, most of these complexities are handled by the underlying OS as long as we follow best practices.
Let's start with a few terms:
- Code point – a number between 0 and 0x10FFFF (about 1.1 million). Most code points represent characters. Some code points represent other text features such as diacritic marks and joiners.
Log("אֶראֶל".Length) 'output is 6 (4 letters and 2 diacritic marks).
Dim s As String = "👩🏿‍🏫"
Dim b() As Byte = s.GetBytes("UTF-32LE")
For i = 0 To b.Length - 1 Step 4
    Log(BytesToString(b, i, 4, "UTF-32LE"))
Next
'output:
'👩
'🏿
'(zero width joiner, invisible)
'🏫
'The black woman teacher emoji is actually made of 4 code points: woman, dark skin tone, zero width joiner and school.
- Unicode encoding – the format in which the code point numbers are encoded as bytes. Remember that files are always made of bytes. You cannot safely read text from a file unless you know its encoding.
- UTF8 – A popular encoding and the default encoding in B4X tools. The first 128 code points are represented as single bytes, which makes it compatible with ASCII encoding. Higher values are represented as 2 to 4 bytes.
- UTF16 – Each code point is represented as 2 or 4 bytes. The importance of UTF16 stems from the fact that the underlying frameworks use it to store the code points, with the assumption that each character (code point) is made of 2 bytes. This works nicely until you meet code points with values larger than 65,535, which need 4 bytes. The 2-byte range does cover most human languages but not all of them.
Log("😀".Length) 'output is 2 although there is a single code point here.
- UTF32 – Fixed-width encoding. Each code point is encoded as 4 bytes. This encoding is useful when you want to deal with emojis and other high-value code points. There are several variants related to the endianness and BOM marking. I recommend using UTF-32LE, which explicitly sets the endianness and doesn’t have BOM marking.
- There are other less common encodings.
- BOM – Byte Order Mark, a mark that might appear at the beginning of the file. It is the encoded 0xFEFF code point. In UTF8 it is encoded as the three bytes 0xEF, 0xBB and 0xBF. Many UTF8 decoders do not treat this mark in a special way, so it ends up as the first character of the decoded text. Avoid it if you can.
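
To make the differences between the encodings concrete, here is a minimal sketch that logs how many bytes the same string takes in each one, and counts code points by going through UTF-32LE. The sub names LogEncodedSizes and CodePointCount are made up for this example, and it assumes that the "UTF-16LE" charset name is accepted on your platform in the same way as "UTF-32LE".

```b4x
'Hypothetical helpers, not part of any library.

'Logs the encoded size of a string in three Unicode encodings.
Sub LogEncodedSizes (s As String)
	For Each enc As String In Array("UTF8", "UTF-16LE", "UTF-32LE")
		Dim b() As Byte = s.GetBytes(enc)
		Log(enc & ": " & b.Length & " bytes")
	Next
End Sub

'Counts code points: UTF-32LE always uses 4 bytes per code point.
Sub CodePointCount (s As String) As Int
	Dim b() As Byte = s.GetBytes("UTF-32LE")
	Return b.Length / 4
End Sub
```

With these encodings, LogEncodedSizes("A") should log 1, 2 and 4 bytes, and LogEncodedSizes("א") should log 2, 2 and 4 bytes. For a string holding a single code point above 65,535, all three encodings take 4 bytes, and CodePointCount returns 1 even though Length returns 2.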
The most common mistake developers make with text files and encodings is using inadequate text editors such as Windows Notepad. As developers we need to be able to see the encoding and change it. I highly recommend using Notepad++.
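
If you still end up with files that were saved with a BOM, one possible workaround is to decode the bytes with an explicit encoding and drop the mark yourself. This is only a sketch: ReadUtf8WithoutBom is a made-up name, the file is assumed to be UTF8, and File.ReadBytes is assumed to be available in your B4X version.

```b4x
'Hypothetical helper, not part of any library.
'Reads a UTF8 text file and drops a leading BOM (code point 0xFEFF) if there is one.
Sub ReadUtf8WithoutBom (Dir As String, FileName As String) As String
	Dim b() As Byte = File.ReadBytes(Dir, FileName)
	'Decode with an explicit encoding instead of guessing.
	Dim s As String = BytesToString(b, 0, b.Length, "UTF8")
	'The decoded BOM is a single character at index 0.
	If s.StartsWith(Chr(0xFEFF)) Then s = s.SubString(1)
	Return s
End Sub
```

It could be called as ReadUtf8WithoutBom(File.DirAssets, "data.txt"), where data.txt is a placeholder file name. Fixing the file in the editor is still the better option when you control it.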