Convert QString into QByteArray as either UTF-8 or Latin1

jsiei97 · wrote on 13 Mar 2011, 11:11

Hi

I would like to convert a QString into either a UTF-8 or a Latin-1 QByteArray, but today I get everything as UTF-8.

I am testing this with a character from the upper half of Latin-1 (above 0x7f); the German ü is a good example.

If I do this:

@QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();
@

I get the following output.

@utf8 "ü" "c3bc"
Latin1 "ü" "c3bc"
ISO 8859-1 "ü" "c3bc" @

As you can see, I get the UTF-8 bytes 0xc3bc everywhere, whereas I would expect the Latin-1 byte 0xfc for steps 2 and 3.

What is going on here?

/Thanks

jsiei97 · wrote on 13 Mar 2011, 13:06

If I set the locale codec correctly, it works...

@QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));@

goetz · wrote on 13 Mar 2011, 17:20

Anything wrong with "QString::toUtf8() ":http://doc.qt.nokia.com/4.7/qstring.html#toUtf8 or "QString::toLatin1() ":http://doc.qt.nokia.com/4.7/qstring.html#toLatin1? :-)

jsiei97 · wrote on 13 Mar 2011, 19:13

It was QString::toLatin1 that did not work!

And the string was already utf8, so I could not convert it again using toUtf8.
QString and QByteArray just got the stupid idea that the data was latin1 and not utf8.

So the only remedy I could find was to make sure the system really knew what encoding the data was stored in to begin with. Hence the setCodecForLocale and setCodecForCStrings calls.

raulgd · wrote on 13 Mar 2011, 19:40

Hi, I had this issue before. You have to get the data as a char * and pass it, together with its length, to QString::fromUtf8(), so I did this:

@
// A UTF-8 encoded QByteArray (aString is presumably a std::string holding UTF-8 data)
QByteArray aByteArray = aString.c_str();

// From a UTF-8 encoded QByteArray to a QString
QString aQString = QString::fromUtf8(aByteArray.data(), aByteArray.size());
@

goetz · wrote on 13 Mar 2011, 21:54

It works with 0-terminated strings (standard char *) without knowing the length. The data length is computed with qstrlen() then. So, it's perfectly ok to write:

@
QString x = QString::fromLatin1("ü"); // assumes the literal's bytes are Latin-1, i.e. the source file is saved as Latin-1
@

@jsiei97: Your issue is not an output problem but an input problem. setCodecForLocale and setCodecForCStrings only affect the construction of QStrings. These settings are useful if you want to use native strings throughout your source code. If you need this only here and there, the static methods QString::fromLatin1() and QString::fromUtf8() could be enough.
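
To make the "input problem" concrete, here is a small sketch in plain C++ (no Qt) of what probably happened in the first post, assuming the ü literal reached QString as the UTF-8 bytes c3 bc and was then decoded as Latin-1. The function names (decodeLatin1, encodeLatin1, toHex, roundTripThroughLatin1) are made up for this illustration and are not Qt API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode bytes as Latin-1: each byte maps 1:1 to the code point of the same value.
std::vector<std::uint32_t> decodeLatin1(const std::string &bytes) {
    std::vector<std::uint32_t> codePoints;
    for (unsigned char b : bytes) codePoints.push_back(b);
    return codePoints;
}

// Encode code points as Latin-1; anything above U+00FF is replaced by '?'
// (an assumption of this sketch).
std::string encodeLatin1(const std::vector<std::uint32_t> &codePoints) {
    std::string out;
    for (std::uint32_t cp : codePoints) {
        out.push_back(cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp)) : '?');
    }
    return out;
}

// Lower-case hex dump of the raw bytes.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex.push_back(digits[b >> 4]);
        hex.push_back(digits[b & 0x0F]);
    }
    return hex;
}

// Wrong decode on input, "correct" Latin-1 encode on output.
std::string roundTripThroughLatin1(const std::string &inputBytes) {
    return toHex(encodeLatin1(decodeLatin1(inputBytes)));
}

void selfCheck() {
    // UTF-8 input misread as Latin-1 becomes two characters (U+00C3, U+00BC),
    // so the Latin-1 output is still c3bc, not fc.
    assert(decodeLatin1("\xc3\xbc").size() == 2);
    assert(roundTripThroughLatin1("\xc3\xbc") == "c3bc");
    // Only genuine Latin-1 input yields the expected single byte.
    assert(roundTripThroughLatin1("\xfc") == "fc");
}
```

The output side does exactly what it is asked to do. The damage is already done when the bytes are decoded with the wrong codec.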

jsiei97 · wrote on 14 Mar 2011, 06:27

In this case I must have the data as a QByteArray at the end (for other reasons).

But since my system is using UTF-8, I can't see any danger in telling it what it is using.

I know there could be some portability issues if we move this software to another platform.
Today this code will only run on a specific embedded Linux device, so this solution moves my trouble into the future.... (or not)

However, my current guess is that the correct solution is to find the misconfiguration in the system environment, so I don't need to hardcode the default locale in the code itself...

goetz · wrote on 14 Mar 2011, 11:18

This is no problem. You create a QString object, that does all the codec stuff and stores the string as unicode (utf-16 if I remember correctly) internally. After you have kind-of normalized the string this way, you can get a new byte representation in the encoding you want. It's an easy and convenient way to convert from latin1 to utf8, for example.
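
As a rough illustration of that decode-then-re-encode idea, here is a sketch in plain C++ without Qt. A vector of code points stands in for the QString in the middle, and the helper names (decodeLatin1, encodeUtf8, latin1ToUtf8) are invented for this example. It does not validate surrogates or out-of-range values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Step 1: decode. Latin-1 bytes map 1:1 to code points U+0000..U+00FF.
std::vector<std::uint32_t> decodeLatin1(const std::string &bytes) {
    std::vector<std::uint32_t> codePoints;
    for (unsigned char b : bytes) codePoints.push_back(b);
    return codePoints;
}

// Step 2: encode the code points as UTF-8 (1 to 4 bytes each).
std::string encodeUtf8(const std::vector<std::uint32_t> &codePoints) {
    std::string out;
    for (std::uint32_t cp : codePoints) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Latin-1 in, UTF-8 out, with the code points as the neutral form in between.
std::string latin1ToUtf8(const std::string &latin1) {
    return encodeUtf8(decodeLatin1(latin1));
}

void selfCheck() {
    assert(latin1ToUtf8("\xfc") == "\xc3\xbc"); // ü: fc -> c3 bc
    assert(latin1ToUtf8("abc") == "abc");       // ASCII is unchanged
}
```

In Qt terms this corresponds roughly to QString::fromLatin1(...) followed by toUtf8().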

Your code does nothing different; it only leaves the choice of codec to the system. With the static methods you can tell Qt directly.

Once you have compiled the sources, it is irrelevant which encoding the system you deploy on uses. The decision has already been made. The encoding handling is also platform independent from the compiler's point of view: it works on any platform. We use UTF-8 encoded sources (including German umlauts) on Windows, Mac and Linux without any problem. But be sure to tell all your editors to use the same encoding (including the default encoding when saving new files) :-)

dangelog · wrote on 14 Mar 2011, 12:20

[quote author="jsiei97" date="1300014687"]Hi

I would like to convert a QString into either a utf8 or a latin1 QByteArray, but today I get everything as utf8.[/quote]

There are some problems with your snippet...

QString(const char *) uses whatever codec was set by QTextCodec::setCodecForCStrings(), or if no codec was set, fromLatin1()
A \u escape sequence is not emitted in any fixed encoding; the resulting bytes depend on your compiler's execution charset (see -fexec-charset on gcc). For instance:
@
$ LC_ALL=C gcc -x c++ -o - -S - -fexec-charset=latin1 <<< 'const char *foo = "\u00fc";' | grep .string
.string "\374"
$ LC_ALL=C gcc -x c++ -o - -S - -fexec-charset=utf8 <<< 'const char *foo = "\u00fc";' | grep .string
.string "\303\274"
$ LC_ALL=C gcc -x c++ -o - -S - -fexec-charset=utf16 <<< 'const char *foo = "\u00fc";' | grep .string
.string "\377\376\374"
@

This means that what ends up in your char array that you pass to QString ctor is pretty much up to your compiler, may change for every translation unit and may be out of your control (load a plugin that changes the codec for the C strings => doom).
Therefore, stay on the safe side: don't use \u inside strings unless you are 100% sure of the WHOLE toolchain, the locale set by the user, etc.; use only ASCII characters in the source file; use the \x escape sequence instead. In any case, use QString::fromUtf8/Latin1/Utf16 inside your program, and if possible, disable all unsafe conversions from/to C strings by defining QT_NO_CAST_FROM_ASCII and QT_NO_CAST_TO_ASCII.
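
A minimal sketch of the \x advice in plain C++ (no Qt): with \x escapes the bytes are spelled out in the source, so the execution charset cannot change them, whereas the \u literal contains whatever the compiler decided. The constant names and the toHex helper are made up for this example.

```cpp
#include <cassert>
#include <string>

// Lower-case hex dump of the raw bytes in a string.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (unsigned char b : bytes) {
        hex.push_back(digits[b >> 4]);
        hex.push_back(digits[b & 0x0F]);
    }
    return hex;
}

// Explicit bytes: the encoding is decided by you, in the source file.
const std::string kUtf8UUmlaut = "\xc3\xbc"; // ü as UTF-8
const std::string kLatin1UUmlaut = "\xfc";   // ü as Latin-1

// \u escape: the bytes depend on the compiler's execution charset,
// so no particular value is assumed here.
const std::string kCompilerChosen = "\u00fc";

void selfCheck() {
    assert(toHex(kUtf8UUmlaut) == "c3bc");
    assert(toHex(kLatin1UUmlaut) == "fc");
    // Dump toHex(kCompilerChosen) on your own toolchain to see what you get.
}
```

With the explicit bytes in hand, the matching decoder (QString::fromUtf8 for the first constant, QString::fromLatin1 for the second) is unambiguous.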

QByteArray::append(QString) uses toAscii on the string, which again uses the codec for c strings, otherwise converts to latin1. If you want to convert to utf8, use toUtf8.
Watch out, qDebug() may not be Unicode-safe. Always check with toUtf8().toHex() what's really inside your strings.