B4J Question Bytes -> String -> File = Different Results B4J vs C#

tchart

Well-Known Member
Licensed User
Longtime User
I have this B4J code

B4X:
Dim b As ByteConverter
Dim Bytes() As Byte = File.ReadBytes("D:\Temp","somefile.zip")
Dim sfb As String = b.StringFromBytes(Bytes,"UTF-8")   
File.WriteString("D:\Temp","test_b4j.txt",sfb)

I'm looking to replicate this in C# with the code below; however, the output files are not the same (i.e. different file lengths).

C#:
byte[] bytes = File.ReadAllBytes("D:\Temp\somefile.zip");
string result = Encoding.UTF8.GetString(bytes);
File.WriteAllText("D:\Temp\test_c#.txt", result);

I'm not sure if this is a signed vs unsigned bytes issue or an encoding issue.

Any ideas what the problem could be?
 

Jeffrey Cameron

Well-Known Member
Licensed User
Longtime User
I may be incorrect as it's been a while, but I don't think B4J includes the BOM with .WriteString as the C# .WriteAllText method does. You might try adding an MD5 calculation on the string before the write to see if they match.

Edit - never mind, I double-checked, and the C# default is to NOT include the BOM. In both B4J and C#, you're taking the raw bytes of a file (somefile.zip) and converting them into a UTF-8 encoded string. A ZIP file (or other binary files) often contains bytes that do not map cleanly to valid UTF-8 characters. When you attempt to interpret binary data as UTF-8, certain bytes may be replaced, dropped, or misinterpreted, depending on the encoding implementation.

By treating the binary data of the ZIP file as a UTF-8 string, you risk corrupting the data. Any bytes that do not conform to the UTF-8 standard may be altered or removed during the conversion to a string.

Try changing both to write the raw bytes (File.WriteBytes in B4J, File.WriteAllBytes in C#) without converting to a string, and see if the files match then.
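A minimal sketch of that corruption, written in Python only because it is quick to run. The six sample bytes are made up for illustration, and `errors="replace"` is Python's own policy for invalid input; it is not claimed to match what B4J or C# do byte for byte.

```python
# Hypothetical sample: the 4-byte ZIP local-file-header signature "PK\x03\x04",
# then 0xFF (never valid in UTF-8) and a lone 0xC3 (a lead byte with no follower).
raw = bytes([0x50, 0x4B, 0x03, 0x04, 0xFF, 0xC3])

# bytes -> string: undecodable bytes become U+FFFD, the replacement character.
text = raw.decode("utf-8", errors="replace")

# string -> bytes: what writing the string out as a UTF-8 text file would store.
round_tripped = text.encode("utf-8")

print(len(raw), len(text), len(round_tripped))  # 6 6 10
print(round_tripped == raw)                     # False
```

Each U+FFFD takes three bytes in UTF-8, so the file written from the string is neither the same length nor the same content as the original.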
 

d3vc

Member
Hi, try this (the @ prefix makes each path a verbatim string, so the backslashes are not treated as escape sequences):

C#:
byte[] bytes = File.ReadAllBytes(@"D:\Temp\somefile.zip");
string result = Encoding.UTF8.GetString(bytes);
File.WriteAllText(@"D:\Temp\test_c#.txt", result);

Don't forget to import:

C#:
using System;
using System.IO;
using System.Text;

Good luck.
 

Daestrum

Expert
Licensed User
Longtime User
Maybe it's related to strings in C being 0x00-terminated; it may read a 0x00 and think it's the end of the string. (Although I think C# doesn't use 0x00 to end strings.)
 

emexes

Expert
Licensed User
I haven't had the morning's first coffee yet so I cannot explain why @Jeffrey Cameron is correct, but give this a burl:

edit: three sips into that coffee and I see that I've got the problem the wrong way around :rolleyes: give me a couple of seconds to sort it out

B4X:
Dim b As ByteConverter
Dim Bytes() As Byte = File.ReadBytes("D:\Temp","somefile.zip")

'''Dim sfb As String = b.StringFromBytes(Bytes,"UTF-8")
Dim sfb As String = b.StringFromBytes(Bytes,"ISO-8859-1")    'ISO-8859-1 encoding is 1:1
Dim stb() As Byte = b.StringToBytes(sfb, "UTF-8")

'''File.WriteString("D:\Temp","test_b4j.txt",sfb)
File.WriteBytes("D:\Temp", "test_b4j.txt", stb)
 

Attachments

  • somefile.zip
    69.5 KB · Views: 7
  • test_b4j.txt
    104.6 KB · Views: 8

emexes

Expert
Licensed User
When you said:

I'm looking to replicate this in C# with the code below

I read it as:

I'm looking to replicate this C# code below

The C# code looks correct - it is encoding a binary file of Unicode code points 0..255 into UTF-8, which is always possible.

The B4J code looks incorrect - it is decoding a binary file as though it is UTF-8, which is not always possible, because it probably contains some byte sequences that are not valid UTF-8.
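To make the decode/encode distinction concrete, here is a small sketch in Python (chosen for convenience; the three sample bytes are invented). Note that `StringFromBytes(Bytes, "UTF-8")` and `Encoding.UTF8.GetString(bytes)` both turn bytes into a string, so both are decoding calls in the sense used here.

```python
raw = bytes([0x41, 0x80, 0xFF])  # hypothetical bytes: "A" plus two non-UTF-8 bytes

# Decoding arbitrary bytes as strict UTF-8 is not always possible.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding as ISO-8859-1 always works: byte value n becomes code point n.
as_latin1 = raw.decode("iso-8859-1")

# Encoding those code points (0..255) as UTF-8 is always possible, and reversible.
as_utf8 = as_latin1.encode("utf-8")
restored = as_utf8.decode("utf-8").encode("iso-8859-1")

print(strict_ok)        # False
print(restored == raw)  # True
```

A lenient decoder does not raise on the invalid bytes; it substitutes or drops them, which is where the information loss comes from.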
 

tchart

Well-Known Member
Licensed User
Longtime User
I am mystified as to how your B4J program is not losing information in that transformation.
100%. The file isn't actually used for anything meaningful; it's used as input into a SHA-256 hash.

Unfortunately the function in B4J wasn't entirely correct (because of the information loss from bytes->string), but I've used it in several places, and fixing it will cause rework in several apps that I don't want to deal with right now :(

I have rewritten the functions since, and if I stick to bytes I get the same hash in B4J and C#.
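A rough illustration of why sticking to bytes gives a stable hash, sketched in Python with made-up data. The `"replace"` and `"ignore"` error modes below only stand in for two decoders that disagree about invalid input; they are not meant to reproduce Java's or .NET's exact behaviour.

```python
import hashlib

# Hypothetical stand-in for the contents of a binary file.
raw = bytes([0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE, 0x00, 0x41])

# Hashing the bytes directly: no decoder is involved, so any platform that
# reads the same file computes the same digest.
digest_bytes = hashlib.sha256(raw).hexdigest()

# Hashing after a bytes -> string -> bytes detour: the digest now depends on
# how the decoder treats invalid sequences.
digest_replace = hashlib.sha256(raw.decode("utf-8", "replace").encode("utf-8")).hexdigest()
digest_ignore = hashlib.sha256(raw.decode("utf-8", "ignore").encode("utf-8")).hexdigest()

print(digest_bytes == digest_replace)   # False
print(digest_replace == digest_ignore)  # False
```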
 

tchart

Well-Known Member
Licensed User
Longtime User
Some more info

In B4X

bytes length = 10673127 (matches)
string length = 10160610

In C#

bytes length = 10673127 (matches)
string length = 10148021

So even before writing to a text file the string is different.

As mentioned, the content of the string isn't important; I just need them to be consistent :(

PS: if I use a non-binary file then the lengths are the same, so it must be the encoding?
 

tchart

Well-Known Member
Licensed User
Longtime User
A bit more digging: I dumped out each character as an ASCII value, and the files start to deviate here.

1736209017312.png
 

emexes

Expert
Licensed User
PS if I use a non-binary file then the lengths are the same, so it must be the encoding?

I am going to predict that it has to do with differences in how B4J .StringFromBytes and C# .UTF8.GetString decoders handle invalid UTF-8 multibyte sequences.
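As a sketch of how that kind of difference could arise, the Python below decodes the same invalid input under three error policies and gets three different string lengths. The sample bytes and the `replace_each_byte` handler are hypothetical, and none of the three policies is claimed to be the one B4J or C# actually uses.

```python
import codecs

def replace_each_byte(err):
    """Hypothetical policy: emit one U+FFFD for every rejected byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    # err.start:err.end is the span of bytes the decoder rejected.
    return ("\ufffd" * (err.end - err.start), err.end)

codecs.register_error("replace_each_byte", replace_each_byte)

# Hypothetical sample: "A", a 3-byte sequence cut short after 2 bytes, then "B".
raw = b"A\xe2\x82B"

per_sequence = raw.decode("utf-8", errors="replace")        # Python's built-in policy
per_byte = raw.decode("utf-8", errors="replace_each_byte")  # custom policy above
dropped = raw.decode("utf-8", errors="ignore")              # discard invalid bytes

print(len(per_sequence), len(per_byte), len(dropped))  # 3 4 2
```

On valid UTF-8 input none of these policies is ever triggered, which fits the observation that non-binary files give identical lengths.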

Can you generate and post test_b4j.txt and test_c#.txt files produced from the innocuous and smaller somefile.zip that I attached above?

And their sizes, just in case the attachment process here somehow alters them.


edit: never mind, you beat me to where I was heading 🏆
 

Attachments

  • somefile.zip
    69.5 KB · Views: 6

tchart

Well-Known Member
Licensed User
Longtime User
any chance of seeing the corresponding bytes of somefile.zip ?

I am guessing that bytes with 5 or more leading 1 bits are "error noted" differently.
Indeed :D

306 corresponds to hex 12, where it looks like C# is dropping rather than replacing with 65533. I think my screenshot above is mislabelled.

1736210510889.png
 

emexes

Expert
Licensed User
Where are the two bytes at positions 303 and 304 of the text file dumps,

with values 76 and 20 = hex 4c and 14

in the somefile.zip dump?

Those two bytes shouldn't have changed, since UTF-8 leaves the 128 ASCII characters unchanged, so that ASCII text files are "automatically" also UTF-8 text files.
 

tchart

Well-Known Member
Licensed User
Longtime User
Actually I took it a step further and dumped out the byte values and the ASCII values.

B4J is showing negatives (since they are signed). So even though the bytes are the same length they are different.

I guess at this stage I'll just go down the hard road and fix my legacy function

Thanks for bouncing some ideas around, @emexes
1736212661941.png
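On the signed-versus-unsigned point: a negative number in a B4J dump and a number above 127 in a C# dump can be the same 8 bits printed two ways, since a B4J Byte is signed (-128..127) and a C# byte is unsigned (0..255). A small Python sketch with made-up bytes:

```python
raw = bytes([0x12, 0x7F, 0x80, 0xFF])  # hypothetical sample bytes

unsigned_view = list(raw)                               # 0..255, as an unsigned byte prints
signed_view = [b - 256 if b > 127 else b for b in raw]  # -128..127, as a signed byte prints

print(unsigned_view)  # [18, 127, 128, 255]
print(signed_view)    # [18, 127, -128, -1]

# Masking with 0xFF recovers the unsigned value, so the underlying bits match.
print(bytes(v & 0xFF for v in signed_view) == raw)  # True
```

So differing signs in the two dumps do not by themselves show that the byte arrays differ.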
 

emexes

Expert
Licensed User
These bytes should not have changed at all, because they are all ASCII characters, ie high bit 0, with values < 128

1736212700478.png


They should appear in the text files as:

0x67 = "g"
0x2c = ","
0x32 = "2"
0x2e = "."
0x77 = "w"
0x12 = control code DC2
0x28 = "("

ie the first five characters are printable characters (not control codes) shown as:

g,2.w
 

emexes

Expert
Licensed User
the byte values and the ascii value

If "xx : yy" is "<byte value> : <unicode code point>"

then the B4J translations are correct, and the C# translations are wrong, because any byte values outside the ASCII range 0..127 must form part of a multi-byte UTF-8 sequence, and not all such sequences are valid; hence this character:

Unicode character 65533 is known as the Replacement Character, represented by the symbol �. This character is used when there is an issue with encoding or when an incoming character's value is unknown or cannot be represented in Unicode. It acts as a placeholder for characters that could not be correctly decoded or are invalid within a particular text encoding scheme.

https://x.com/i/grok/share/jxjtDmadURGaLRRtVtRcm5OC5


fix my legacy function

🏆
 

emexes

Expert
Licensed User
fix my legacy function

You would expect a binary file containing approximately the same number of bytes with high bits set as with high bits clear (ie values 128..255 as values 0..127) to grow by 50% when its bytes are treated as code points 0..255 and encoded with UTF-8

because Unicode character numbers 0..127 are valid ASCII and encoded as 1 byte, and Unicode character numbers 128..255 are encoded as 2 bytes, the first with high bits "110" and the second (ie last) with high bits "10"

or as explained better by https://en.wikipedia.org/wiki/UTF-8 :

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the characters u to z are replaced by the bits of the code point, from the positions U+uvwxyz:

Code point ↔ UTF-8 conversion
First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0yyyzzzz
U+0080             U+07FF            110xxxyy   10yyzzzz
U+0800             U+FFFF            1110wwww   10xxxxyy   10yyzzzz
U+010000           U+10FFFF          11110uvv   10vvwwww   10xxxxyy   10yyzzzz
The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode,
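The 50% figure can be checked with a quick Python sketch (Python is used only for convenience). The input here is an idealised one containing every byte value exactly once, so half the bytes have the high bit set.

```python
# Every byte value once: 128 with the high bit clear, 128 with it set.
raw = bytes(range(256))

# Treat byte value n as code point n (ISO-8859-1), then encode as UTF-8.
encoded = raw.decode("iso-8859-1").encode("utf-8")

print(len(raw), len(encoded))   # 256 384
print(len(encoded) / len(raw))  # 1.5
```

The attachment sizes quoted earlier in the thread (69.5 KB for somefile.zip, 104.6 KB for test_b4j.txt) are in roughly the same 1.5 ratio.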
 

tchart

Well-Known Member
Licensed User
Longtime User
Doing some further reading it appears that Java UTF-8 is a modified version;


I’ve found a C# library that will apparently encode a string in the same way as Java. Will test and report back.
 

Jeffrey Cameron

Well-Known Member
Licensed User
Longtime User
In my experience, it is generally not worth the effort to attempt to appease the encoding. Leave it as a byte array if possible.

If you _really_ need a string, convert the byte array to a Base64-encoded string, send that, then decode it back into a byte array on the other end before working with it.
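A minimal sketch of that Base64 round trip, shown in Python for brevity with made-up sample bytes. B4J and C# each have their own Base64 helpers, which are not shown here.

```python
import base64

# Hypothetical binary data, including bytes that are not valid UTF-8.
raw = bytes([0x50, 0x4B, 0x03, 0x04, 0xFF, 0xFE, 0x00, 0x41])

# bytes -> ASCII-only string that is safe to store or send as text
as_text = base64.b64encode(raw).decode("ascii")

# ... transmit or store as_text ...

# string -> the original bytes, unchanged
restored = base64.b64decode(as_text)

print(as_text)
print(restored == raw)  # True
```

Because the Base64 alphabet is plain ASCII, the string survives any UTF-8 encode/decode step unchanged, at the cost of being about a third larger than the raw data.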
 