Detecting Unicode encoding errors
-
Greetings all,
Is it possible to detect errors while encoding a QString from char*?
Here's the case: I get a string as a char* from an external library (over which I have no control).
This string can use either UTF-8 encoding or "local8bit" encoding, so it may vary from one user to another.
What I basically need is to be able to write something like this:
    char const* src = ...;
    bool encoding_failed = false;
    QString qs = QString::fromUtf8(src, &encoding_failed);
    if ( encoding_failed ) {
        qs = QString::fromLocal8Bit(src);
    }
I looked at the docs for QString and QTextCodec, but I couldn't find any error support.
Is there any (portable) way of achieving this?
Thanks in advance for any hint.
-
hi and welcome
I didn't find any error reporting functions, but I did stumble upon
    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    const QString text = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
    if (state.invalidChars > 0) {
        qDebug() << "Not a valid UTF-8 sequence.";
    }
I hope it might be useful.
-
@mrjj Hello and thank you mrjj,
I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents are not documented: QTextCodec::ConverterState
Therefore, is it safe to assume your (nice) code, which relies on the undocumented invalidChars data member, will still work in the future?
I ended up with something like this (slightly simplified):
    char const* src = ...;
    QString qs = QString::fromUtf8(src);
    QByteArray utf8 = qs.toUtf8();
    if ( utf8 != src )
        qs = QString::fromLocal8Bit(src);
...but that seems a bit overkill to me.
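A lighter alternative to the round trip might be to validate the bytes directly before choosing a decoder. The following is only a minimal sketch in plain C++ with no Qt dependency; isValidUtf8 is a hypothetical helper name, not part of Qt, and it assumes the input length is known (for a NUL-terminated char*, std::strlen would provide it).

```cpp
#include <cstddef>

// Hypothetical helper (not a Qt API): returns true if the first `len` bytes
// of `s` are well-formed UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates and code points above U+10FFFF.
bool isValidUtf8(const char* s, std::size_t len)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t i = 0;
    while (i < len) {
        const unsigned char c = p[i];
        if (c < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long min = 0;   // smallest code point allowed for this length
        unsigned long cp = 0;    // decoded code point
        if ((c & 0xE0) == 0xC0)      { extra = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = c & 0x07; }
        else return false;       // stray continuation byte or invalid lead byte

        if (len - i <= extra)    // sequence runs past the end of the buffer
            return false;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)   // not a continuation byte
                return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;        // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

With such a helper, the choice would reduce to calling QString::fromUtf8(src) when isValidUtf8(src, std::strlen(src)) returns true and QString::fromLocal8Bit(src) otherwise. Note that pure ASCII input passes the check, and that a local 8-bit string can occasionally happen to be valid UTF-8 as well, so this remains a heuristic.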
-
@ybailly71 said:
I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents is not documented: QTextCodec::ConverterState
It appears to be documented under the QTextCodec::convertToUnicode function:
If state is not 0, the codec should save the state after the conversion in state, and adjust the remainingChars and invalidChars members of the struct.
The members in question are public, as indicated by http://doc.qt.io/qt-5/qtextcodec-converterstate-members.html but not listed in http://doc.qt.io/qt-5/qtextcodec-converterstate.html as those members don't have any documentation that the engine (Doxygen?) is recognising.
I'd say this means it's safe to use them, but the doc formatting could be improved.
Cheers.
-
Hi
Just like @Paul-Colby, I do think it's safe to use.
They are not flagged for removal, and they are not in a private file/class as implementation details always are, so they should be OK.
Your solution with if ( utf8 != src ) is not bad, but it could be expensive with long strings. :)
-
@ybailly71
Nice day to you too :)