B4J Question Bytes -> String -> File = Different Results B4J vs C#

tchart · Monday at 9:21 PM

I have this B4J code

B4X:

Dim b As ByteConverter
Dim Bytes() As Byte = File.ReadBytes("D:\Temp","somefile.zip")
Dim sfb As String = b.StringFromBytes(Bytes,"UTF-8")   
File.WriteString("D:\Temp","test_b4j.txt",sfb)

I'm looking to replicate this in C# with the code below, however the output files are not the same (i.e. different file lengths).

C#:

byte[] bytes = File.ReadAllBytes("D:\Temp\somefile.zip");
string result = Encoding.UTF8.GetString(bytes);
File.WriteAllText("D:\Temp\test_c#.txt", result);

I'm not sure if this is a signed vs unsigned bytes issue or an encoding issue.

Any ideas what the problem could be?

Jeffrey Cameron · Monday at 9:31 PM

I may be incorrect as it's been a while, but I don't think B4J includes the BOM with .WriteString as the C# .WriteAllText method does. You might try adding an MD5 calculation on the string before the write to see if they match.

Edit - never mind, I double-checked, and the C# default is to NOT include the BOM. In both B4J and C#, you're taking the raw bytes of a file (somefile.zip) and converting them into a UTF-8 encoded string. A ZIP file (or other binary files) often contains bytes that do not map cleanly to valid UTF-8 characters. When you attempt to interpret binary data as UTF-8, certain bytes may be replaced, dropped, or misinterpreted, depending on the encoding implementation.

By treating the binary data of the ZIP file as a UTF-8 string, you risk corrupting the data. Any bytes that do not conform to the UTF-8 standard may be altered or removed during the conversion to a string.

Try changing both to WriteAllBytes without converting to a string and see if the files match then.
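The information loss described above can be sketched in a few lines. This is a Python illustration, not the B4J or .NET decoder, and the sample bytes are made up for the example: a handful of bytes that are not valid UTF-8 are decoded with replacement and then encoded again, and the result no longer matches the input.

```python
# Hypothetical sample: ZIP-like header bytes followed by bytes that are not valid UTF-8.
original = b"PK\x03\x04\xff\xfe\x80data"

# Decode as UTF-8, substituting U+FFFD for every undecodable byte.
text = original.decode("utf-8", errors="replace")

# Encode the string back to bytes, as writing it to a text file would.
round_tripped = text.encode("utf-8")

print(len(original), len(round_tripped))
print(round_tripped == original)  # False: the invalid bytes were replaced
```

Once a byte has been replaced there is no way to recover its original value from the string, which is why comparing the raw bytes is the reliable check.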

d3vc · Monday at 9:57 PM

Hi, try this:

C#:

byte[] bytes = File.ReadAllBytes(@"D:\Temp\somefile.zip");
string result = Encoding.UTF8.GetString(bytes);
File.WriteAllText(@"D:\Temp\test_c#.txt", result);

Don't forget to import:

C#:

using System;
using System.IO;
using System.Text;

good luck

Daestrum · Monday at 10:32 PM

Maybe it's related to strings in C being 0x00 terminated: it may read a 0x00 and think it's the end of the string. (Although I think C# doesn't use 0x00 to end strings.)

emexes · Monday at 10:45 PM

I haven't had the morning's first coffee yet so I cannot explain why @Jeffrey Cameron is correct, but give this a burl:

edit: three sips into that coffee and I see that I've got the problem the wrong way around; give me a couple of seconds to sort it out

B4X:

Dim b As ByteConverter
Dim Bytes() As Byte = File.ReadBytes("D:\Temp","somefile.zip")

'''Dim sfb As String = b.StringFromBytes(Bytes,"UTF-8")
Dim sfb As String = b.StringFromBytes(Bytes,"ISO-8859-1")    'ISO-8859-1 encoding is 1:1
Dim stb() As Byte = b.StringToBytes(sfb, "UTF-8")

'''File.WriteString("D:\Temp","test_b4j.txt",sfb)
File.WriteBytes("D:\Temp", "test_b4j.txt", stb)

emexes · Monday at 11:12 PM

emexes said:
give me a couple of seconds to sort it out

Like @Jeffrey Cameron , I am mystified as to how your B4J program is not losing information in that transformation.

Would you expect test_b4j.txt to be larger or smaller than somefile.zip ?

emexes · Monday at 11:25 PM

When you said:

I'm looking to replicate this in C# with the code below

I read it as:

I'm looking to replicate this C# code below

The C# code looks correct - it is encoding a binary file of Unicode code points 0..255 into UTF-8, which is always possible.

The B4J code looks incorrect - it is decoding a binary file as though it were UTF-8, which is not always possible, because the file probably contains some byte sequences that are not valid UTF-8.

tchart · Monday at 11:26 PM

emexes said:
I am mystified as to how your B4J program is not losing information in that transformation.

100%. The file isn't actually used for anything meaningful; it's actually used as input into a SHA-256 hash.

Unfortunately the function in B4J wasn't entirely correct (because of the information loss from bytes->string), but I've used it in several places and fixing it will cause rework in several apps that I don't want to deal with right now.

I have rewritten the functions since and if I stick to bytes I get the same hash between B4J and C#
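A minimal sketch of why hashing the bytes directly is stable, written in Python rather than B4J or C#. The helper name `sha256_hex` and the sample payload are invented for the example. It hashes the raw bytes, then hashes the bytes produced by a lossy UTF-8 decode and re-encode, and the two digests differ.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical payload containing bytes that are not valid UTF-8.
payload = b"PK\x03\x04\xff\xfe\x80data"

# Hash of the untouched bytes: depends only on the bytes themselves.
digest_of_bytes = sha256_hex(payload)

# Hash after a lossy bytes -> string -> bytes detour.
lossy = payload.decode("utf-8", errors="replace").encode("utf-8")
digest_of_lossy = sha256_hex(lossy)

print(digest_of_bytes)
print(digest_of_lossy)
print(digest_of_bytes == digest_of_lossy)  # False
```

The digest of the raw bytes involves no decoder, so any platform that reads the same file computes the same value.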

tchart · Monday at 11:31 PM

Some more info

In B4X

bytes length = 10673127 (matches)
string length = 10160610

In C#

bytes length = 10673127 (matches)
string length = 10148021

So even before writing to a text file the string is different.

As mentioned, the content of the string isn't important; I just need them to be consistent.

PS if I use a non-binary file then the lengths are the same, so it must be the encoding?

tchart · Tuesday at 12:18 AM

Bit more digging: I dumped out each character as an ASCII character, and the file starts to deviate here.

emexes · Tuesday at 12:19 AM

tchart said:
PS if I use a non-binary file then the lengths are the same, so it must be the encoding?

I am going to predict that it has to do with differences in how B4J .StringFromBytes and C# .UTF8.GetString decoders handle invalid UTF-8 multibyte sequences.

Can you generate and post test_b4j.txt and test_c#.txt files produced from the innocuous and smaller somefile.zip that I attached above?

And their sizes, just in case the attachment process here somehow alters them.

edit: never mind, you beat me to where I was heading

emexes · Tuesday at 12:27 AM

tchart said:
dumped out each character

any chance of seeing the corresponding bytes of somefile.zip ?

I am guessing that bytes with 5 or more leading 1 bits are "error noted" differently.

tchart · Tuesday at 12:45 AM

emexes said:
any chance of seeing the corresponding bytes of somefile.zip ?

I am guessing that bytes with 5 or more leading 1 bits are "error noted" differently.

Indeed

306 corresponds to HEX 12 which looks like C# is dropping rather than replacing with 65533 - I think my screen shot above is mislabelled
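To illustrate how dropping versus replacing changes the string length, here is a small Python sketch. It uses Python's own `replace` and `ignore` error handlers, which are stand-ins and not the actual B4J or .NET decoder behaviour; the sample bytes are borrowed from the dump discussed in this thread plus two invalid bytes added for the example.

```python
# ASCII bytes "g,2.w", control code 0x12, "(", then two bytes that can never
# appear in valid UTF-8 (0xFF and 0xFE have more than four leading 1 bits).
data = b"g,2.w\x12(\xff\xfe"

# Policy 1: substitute U+FFFD (decimal 65533) for each invalid byte.
replaced = data.decode("utf-8", errors="replace")

# Policy 2: silently drop each invalid byte.
dropped = data.decode("utf-8", errors="ignore")

print(len(data), len(replaced), len(dropped))
print([ord(ch) for ch in replaced])
print([ord(ch) for ch in dropped])
```

Both results are "UTF-8 decoding" of the same input, yet the strings have different lengths, so any file or hash derived from them will differ too.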

emexes · Tuesday at 1:13 AM

Where are the two bytes at positions 303 and 304 of the text file dumps,

with values 76 and 20 = hex 4c and 14

in the somefile.zip dump?

Those two bytes shouldn't have changed, since UTF-8 leaves the 128 ASCII characters unchanged, so that ASCII text files are "automatically" also UTF-8 text files.

tchart · Tuesday at 1:21 AM

Actually I took it a step further and dumped out the byte values and the ASCII value

B4J is showing negatives (since they are signed). So even though the bytes are the same length they are different.

I guess at this stage I'll just go down the hard road and fix my legacy function

Thanks for bouncing some ideas around @emexes

emexes · Tuesday at 1:25 AM

These bytes should not have changed at all, because they are all ASCII characters, ie high bit 0, with values < 128

They should appear in the text files as:

0x67 = "g"
0x2c = ","
0x32 = "2"
0x2e = "."
0x77 = "w"
0x12 = control code DC2
0x28 = "("

ie the first five characters are printable characters (not control codes) shown as:

g,2.w

emexes · Tuesday at 1:48 AM

tchart said:
the byte values and the ascii value

If "xx : yy" is "<byte value> : <unicode code point>"

then the B4J translations are correct, and the C# translations are wrong, because any byte values not within the ASCII range 0..127 are multi-byte UTF-8 sequences, but not all sequences are valid, thus this character:

Unicode character 65533 is known as the Replacement Character, represented by the symbol �. This character is used when there is an issue with encoding or when an incoming character's value is unknown or cannot be represented in Unicode. It acts as a placeholder for characters that could not be correctly decoded or are invalid within a particular text encoding scheme.

https://x.com/i/grok/share/jxjtDmadURGaLRRtVtRcm5OC5

emexes · Tuesday at 2:00 AM

tchart said:
fix my legacy function

You would expect a binary file containing approximately the same number of bytes with the high bit set as with the high bit clear (ie values 128..255 as values 0..127) to grow by 50% when encoded with UTF-8

because Unicode character numbers 0..127 are valid ASCII and encoded as 1 byte, and Unicode character numbers 128..255 are encoded as 2 bytes, the first with high bits "110" and the second (ie last) with high bits "10"

or as explained better by https://en.wikipedia.org/wiki/UTF-8 :

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the characters u to z are replaced by the bits of the code point, from the positions U+uvwxyz:

Code point ↔ UTF-8 conversion

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0yyyzzzz
U+0080             U+07FF            110xxxyy   10yyzzzz
U+0800             U+FFFF            1110wwww   10xxxxyy   10yyzzzz
U+010000           U+10FFFF          11110uvv   10vvwwww   10xxxxyy   10yyzzzz

The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode,
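The 50% growth can be checked with a short Python sketch (an illustration only, not B4J code). It takes every byte value 0..255 once, maps each byte 1:1 to a code point via ISO-8859-1, then encodes the result as UTF-8.

```python
# One of each byte value: 128 bytes below 0x80 and 128 bytes at or above 0x80.
data = bytes(range(256))

# ISO-8859-1 maps byte value N to code point U+00NN, so nothing is lost.
text = data.decode("iso-8859-1")

# UTF-8 uses 1 byte for U+0000..U+007F and 2 bytes for U+0080..U+07FF.
utf8 = text.encode("utf-8")

print(len(data), len(utf8))  # 256 384
print(text.encode("iso-8859-1") == data)  # True: the mapping is reversible
```

Here 128 one-byte characters plus 128 two-byte characters give 384 bytes, which is 1.5 times the input size.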

tchart · Tuesday at 6:26 AM

Doing some further reading, it appears that Java's UTF-8 is a modified version:

UTF-8 - Wikipedia (en.wikipedia.org)

I’ve found a C# library that will apparently encode a string in the same way as Java. Will test and report back.

Jeffrey Cameron · Tuesday at 3:53 PM

In my experience, it is generally not worth the effort to attempt to appease the encoding. Leave it as a byte array if possible.

If you _really_ need a string, convert the byte array to a base-64 encoded string, send that then decode it back into a byte-array on the other end before working with it.
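A minimal sketch of the base-64 route, in Python for illustration; the same idea applies in B4J and C# with their own base-64 helpers. The sample data is simply every byte value once.

```python
import base64

# Arbitrary binary data, including bytes that are not valid UTF-8.
raw = bytes(range(256))

# Bytes -> base-64 text. The result contains only ASCII characters,
# so every text encoding and platform will carry it unchanged.
encoded = base64.b64encode(raw).decode("ascii")

# Base-64 text -> the original bytes on the receiving side.
decoded = base64.b64decode(encoded)

print(len(raw), len(encoded))
print(decoded == raw)  # True: the round trip is lossless
```

The cost is size: base-64 emits 4 characters for every 3 input bytes, so the string is roughly a third larger than the data.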
