Strings are not a sequence of Bytes but a sequence of Chars. Each Char is a 16-bit value representing the Unicode code point of a character.
You can use String.CharAt(pos) to get a single character from a String and the Asc(char) keyword to get the code point of that character.
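For anyone new to those two calls, a minimal sketch (the string "Hi" is just an example value):
Dim S As String = "Hi"
Log(S.CharAt(1))        'i
Log(Asc(S.CharAt(1)))   '105, the Unicode code point of "i"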
I think it might (now?) be larger than 16 bits. Pretty sure I have used something like U+1F6A9 (triangular flag on post), but not sure in which of B4A/I/J.
No. The internal encoding for strings in Java (and Windows, C#, JavaScript etc.) is UTF-16, which is a sequence of 16-bit values. Do not confuse a Unicode code point with how it is encoded. There are about 1,112,064 valid Unicode code points, which can be encoded in various ways: UTF-8 uses sequences of 8-bit values, UTF-16 uses 16-bit values and UTF-32 uses 32-bit values. Where the value of a code point exceeds what a single coding unit can hold, it is encoded as multiple coding values. Wikipedia has several very good articles on Unicode itself and the various encodings.
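To make that concrete, here is a minimal B4X sketch, assuming String.GetBytes accepts these charset names on your platform ("UTF-32BE" in particular may not be available everywhere). The same single code point occupies a different number of bytes in each encoding:
Dim S As String = Chr(0xA9)                 'U+00A9, copyright sign - a single code point
Dim b8() As Byte = S.GetBytes("UTF-8")
Dim b16() As Byte = S.GetBytes("UTF-16BE")
Dim b32() As Byte = S.GetBytes("UTF-32BE")
Log(b8.Length)                              '2 - above 0x7F, so two 8-bit units in UTF-8
Log(b16.Length)                             '2 - one 16-bit unit
Log(b32.Length)                             '4 - always one 32-bit unit per code point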
Dim Str As String
Dim ii As Int
Str = Chr(0x00) & Chr(0x00) & Chr(0x00) & Chr(0xA9)
'just to make sure we're talking about the same thing
Log("Length = " & Str.Length)                        '4 characters
Log("Last character code = " & Asc(Str.CharAt(3)))   '0xA9 = 169 decimal
'if so, then this be what ye want
ii = Asc(Str.CharAt(3)) + 2
MyEditText.Text = ii                                 'Int value 171 will be cast to the 3-character string "171"
This is not just Java. UTF-16 is almost universally the standard encoding for the internal representation of strings in Windows, Linux, Android and macOS, and hence in most languages running on them.
Windows also supports 8-bit byte characters and code page definitions for backward compatibility. In fact just about every string API call in Win32 exists in both narrow and wide form, the difference being whether the code values are 8-bit code page values or 16-bit Unicode values. It also has conversion APIs to convert 8-bit characters to and from Unicode, which need a code page identifier to specify what glyphs the character values between 128 and 255 are meant to represent.
Dim S As String = Chr(0x1F6A9)
Log(S.Length)       'where S is comprised of Chars, and .Length is the number of Chars in the string
Dim C As Char = S.CharAt(0)
Log(Asc(C))
got this log:
B4X:
Program started.
1
63145
which is completely contrary to my recollection, but compatible with your explanation. Although I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components). And it is a bit disappointing that a Char doesn't hold Unicode characters > 0xFFFF. I'll pull on those threads a bit more, see what comes out.
I did think that UTF-16 was also supposed to be able to encode the entire Unicode range, by using more than one 16-bit component (like UTF-8 does with 8-bit components)
It does, in a similar manner, using surrogate pairs of values. I don't know for sure but String.Length may return the number of Unicode code points in the string, not the number of individual 16-bit coding elements. I tend to ignore the Unicode complexities until it hits me in the face, and I have never needed to use any characters that require more than a single 16-bit value when coded in UTF-16.
Note that 63145 is the decimal equivalent of 0xF6A9, so it looks like CharAt has returned the correct Unicode code point but, by assigning it to a Char, which is a 16-bit value, it has truncated it. Try Asc(S.CharAt(0)) into a Long.
55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.
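For anyone who does want to check the arithmetic, a quick sketch of the standard UTF-16 derivation (the names cp, v, hi and lo are just illustrative):
Dim cp As Int = 0x1F6A9
Dim v As Int = cp - 0x10000                       '0xF6A9 - a 20-bit value
Dim hi As Int = 0xD800 + Bit.ShiftRight(v, 10)    'top 10 bits -> high surrogate
Dim lo As Int = 0xDC00 + Bit.And(v, 0x3FF)        'low 10 bits -> low surrogate
Log(hi & " " & lo)                                '55357 57001, matching the values above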
Note that 63145 is the decimal equivalent of 0xF6A9, so it looks like CharAt has returned the correct Unicode code point but, by assigning it to a Char, which is a 16-bit value, it has truncated it.
In retrospect this is most likely wrong. The real reason is probably that Chr() masks any value passed to it down to a 16-bit value which fits in a single Char variable. In your second case, embedding the extended character in a string bypasses this limitation by generating a correctly encoded UTF-16 literal string.
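Presumably that second case looked something like this sketch, with the U+1F6A9 character pasted directly into the source as a string literal:
Dim S As String = "🚩"      'U+1F6A9 embedded as a string literal
Log(S.Length)               '2 - two 16-bit code units for one code point
Log(Asc(S.CharAt(0)))       '55357 = 0xD83D, high surrogate
Log(Asc(S.CharAt(1)))       '57001 = 0xDEA9, low surrogate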
I haven't either. The flag character was encountered whilst on a quest for somebody else, and my recollection there is that the new Unicode representation of flags is too new to have percolated down to being available for common usage, so we let go of that solution for the time being.
For us non-flag-waving English speakers, the 65536 codepoints of Unicode plane 0 may well be enough. But there are 16 other planes in Unicode too. About a week ago, there was a forum query along the lines of "my Arabic text strings display ok in the IDE, but go wonky when my program manipulates them" and now I am thinking: perhaps that issue was related to this half-baked handling of Unicode.
Note that even some plane 0 codepoints are going to be encoded as surrogate pairs in UTF-16 as some 0xD??? values are reserved for surrogate pair usage and so Unicode codepoints in this range need to be encoded as a surrogate pair.
EDIT: I'm wrong here. They are reserved for surrogate use only with no character allocations
55357 is 0xD83D which is a valid UTF-16 high surrogate pair value and 57001 is 0xDEA9 which is a valid UTF-16 low surrogate pair value so it looks correct to me but I can't be bothered to calculate them myself. The reason for the "?" is that CharAt is returning the individual surrogate pair values and the font used to log the values does not have a glyph defined for those codes.
Understood. And thank you for freeing me from a bad assumption. Or maybe: a good assumption, implemented badly. I am letting go of this topic, per your "until it hits me in the face" approach ;-)