ISO-8859-1 and UTF-8 Hell 🤯

Adam Crowe · 28 Apr 2020, 12:16

Hello!

Due to unfortunate circumstances, a file that was created in C# was saved with ISO-8859-1 encoding but contained Unicode characters as well. For those familiar with C#, the method used was Encoding.Default.GetBytes(). The problem never came to light because on Windows, in the original app, the file decodes correctly via the same method. I am now porting the app to Qt and am having trouble decoding the Unicode characters in the file.

Take this character for example:

↕

It is saved as 8 bytes in the file instead of 3. The eight bytes are:

0xC3 0xA2 0xE2 0x80 0xA0 0xE2 0x80 0xA2

So it gets interpreted as:

â†•

I have managed to use two online tools to do the conversion manually but cannot find a way to replicate this process in Qt code. I can take the original three characters â†• and convert them with ISO-8859-1 decoding to their respective hex representations here:

https://www.rapidtables.com/convert/number/ascii-to-hex.html

This way â†• becomes:

0xE2 0x86 0x95

I can then paste these three hex bytes to the Hex to UTF-8 tool below and get the right character:

https://onlineutf8tools.com/convert-hexadecimal-to-utf8

I need to somehow take the original eight bytes, or the three characters they get interpreted as, and make one Unicode character out of them. I think I've tried every permutation of QTextCodec with ISO-8859-1 and UTF-8 codecs but I never get the right results.

Any help would be massively appreciated!

KroMignon · 28 Apr 2020, 12:16

@Adam-Crowe Just to be sure I understand your problem correctly: you want a way to import your ISO-8859-1 content from a file into a QString?
IMHO, the easiest way is to read the file content into a QByteArray and to use QString::fromLatin1().

Like this:

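#include <QFile>
#include <QString>

QFile latin1File(filePath);
QString fileData;
if (latin1File.open(QIODevice::ReadOnly))
{
    // Read the raw bytes first, then decode them as Latin-1 (ISO-8859-1).
    // The byte array needs its own name so it does not shadow fileData.
    const QByteArray rawData = latin1File.readAll();
    latin1File.close();
    fileData = QString::fromLatin1(rawData);
}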

Adam Crowe · 28 Apr 2020, 18:59

@KroMignon Thank you very much for this but sadly it doesn't work as expected. The characters that are meant to be unicode get read as:

Ã¢â\u0080 â\u0080\u009C

It's like the original ↕ character is encoded twice or something.

Pablo J. Rogina · 28 Apr 2020, 20:41

@Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

Take this character for example:
↕

It is saved as 8 bytes in the file instead of 3. The eight bytes are:
0xC3 0xA2 0xE2 0x80 0xA0 0xE2 0x80 0xA2

So it gets interpreted as:
â†•

Well, the interpretation seems to be OK. A UTF-8 character takes 1 to 4 bytes; here the bytes split into one 2-byte sequence and two 3-byte sequences, so you end up with 3 characters.

I pasted all those 8 hex values into the online tool you mentioned (https://onlineutf8tools.com/convert-hexadecimal-to-utf8) and I got... well, 3 characters.
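
For anyone who wants to check that split without the online tool, here is a small standalone sketch in plain C++ (no Qt). The helper name decodeUtf8 is made up for this example, and the code assumes well-formed input, so it does no validation at all.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Minimal sketch: assumes well-formed input and performs no validation.
std::vector<char32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        int extra = 0;       // number of continuation bytes that follow
        char32_t cp = lead;  // 1-byte (ASCII) case
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

Feeding it the bytes 0xC3 0xA2 0xE2 0x80 0xA0 0xE2 0x80 0xA2 should give U+00E2 (â), U+2020 (†) and U+2022 (•): three separate characters, none of which is the arrow.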

in the original app, the file decodes correctly via the same method.

Perhaps the original app was doing something else beyond interpreting Unicode chars, i.e. "composing" the arrow char from all those 3 resulting Unicode characters.

Adam Crowe · 28 Apr 2020, 20:52

@Pablo-J-Rogina That's exactly what I experienced. For whatever reason, in C# those 3 characters that you ended up with as well get decoded (encoded?) into one single Unicode arrow char. And it's breaking my head how that's happening and why it's working correctly in .NET.

Pablo J. Rogina · 28 Apr 2020, 20:52

@Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

how that's happening

Don't you have access to the C# source code to check?

why it's working correctly in .Net.

I wouldn't say "correctly" :-)
Getting 3 Unicode characters combined into just one doesn't sound good, unless the application is doing that on purpose.

KroMignon · 29 Apr 2020, 20:09

@Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

That's exactly what I experienced. For whatever reason, in C# those 3 characters that you ended up with as well get decoded (encoded?) into one single Unicode arrow char. And it's breaking my head how that's happening and why it's working correctly in .NET.

Hmm, that is very strange. Perhaps you can try it with some other converters:

fileData = QString::fromLocal8Bit(rawData);  // rawData: the QByteArray read from the file

Or try a different QTextCodec.
First find out the code page of your Windows system with chcp:

c:\>chcp
Active code page: 850

And then use the corresponding codec:

QTextCodec *codec = QTextCodec::codecForName("IBM850");
fileData = codec->toUnicode(rawData);  // rawData: the QByteArray read from the file

Adam Crowe · 29 Apr 2020, 20:09

@KroMignon Thank you very much but unfortunately I have tried this. The Windows encoding used is definitely ISO-8859-1. I can debug the encoder code and that's what it reports.

I tried QString::fromLocal8Bit and many variations of QTextCodec processing :(

I fear that I will have to fix this at the source with the Windows software. Bugger.

Christian Ehrlicher · 30 Apr 2020, 19:41

@Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

I fear that I will have to fix this at the source with the Windows software.

~~Correct since the bytes you give in your first post are neither valid latin1 nor utf-8 nor anything else.~~

Run iconv -f utf-8 -t cp1252 file.txt, then interpret the output as UTF-8 data.

/edit: got it

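// The eight bytes from the file: the arrow, UTF-8 encoded twice.
const char *str = "\xC3\xA2\xE2\x80\xA0\xE2\x80\xA2";
// First pass: decode the UTF-8 bytes, which gives the three characters â†•.
const QString utf8_1 = QString::fromUtf8(str);
// Map those characters back to single bytes via Windows-1252 (0xE2 0x86 0x95).
QTextCodec *tc = QTextCodec::codecForName("CP1252");
const QByteArray ba = tc->fromUnicode(utf8_1);
// Second pass: those three bytes are the real UTF-8 encoding of the arrow.
const QString utf8_2 = QString::fromUtf8(ba);
qDebug() << utf8_2.size() << utf8_2 << QString::number(utf8_2.at(0).unicode(), 16);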

-->
1 "↕" "2195"
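
Why this works, as far as I can tell: † and • do not exist in ISO-8859-1 but they do in Windows-1252 (bytes 0x86 and 0x95), so the original bytes 0xE2 0x86 0x95 appear to have been read as Windows-1252 and then written out again as UTF-8. The sketch below replays the repair in plain C++ without Qt, purely to make the two steps visible. The names decodeUtf8, toCp1252 and repairDoubleEncoded are made up for this example, and the handling of the five bytes that Windows-1252 leaves unassigned is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 bytes into code points (sketch: no validation of the input).
std::vector<char32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        int extra = 0;
        char32_t cp = lead;
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Windows-1252 differs from ISO-8859-1 only in bytes 0x80..0x9F.
// Entry i is the code point for byte 0x80 + i. The five unassigned bytes
// (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the control character with
// the same value; that is an assumption about how the file was written.
static const char32_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Encode code points as Windows-1252 bytes; unmappable characters become '?'.
std::string toCp1252(const std::vector<char32_t> &text)
{
    std::string out;
    for (char32_t cp : text) {
        unsigned char byte = '?';
        if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
            byte = static_cast<unsigned char>(cp);  // identical to ISO-8859-1
        } else {
            for (int i = 0; i < 32; ++i) {
                if (kCp1252High[i] == cp) {
                    byte = static_cast<unsigned char>(0x80 + i);
                    break;
                }
            }
        }
        out.push_back(static_cast<char>(byte));
    }
    return out;
}

// Undo one layer of "read as Windows-1252, written as UTF-8".
// The returned bytes are the properly encoded UTF-8 text.
std::string repairDoubleEncoded(const std::string &fileBytes)
{
    return toCp1252(decodeUtf8(fileBytes));
}
```

Passing the eight bytes from the first post to repairDoubleEncoded should return 0xE2 0x86 0x95, and decoding those once more gives the single code point U+2195, matching the qDebug() output above. Running it on text that was never double-encoded would damage it, so it only makes sense for files known to have this problem.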

Adam Crowe · 30 Apr 2020, 19:41

@Christian-Ehrlicher Holy Smokes Batman! It actually worked!!!!!!!!!!!!

How on Earth?!?!? I would have never thought of this solution. I'm really shocked. It's working. It's working perfectly.

Hats off and a massive thank you Sir!!!! 🙌