Unsolved How to check if QByteArray is valid UTF8?
-
How can I check if a QByteArray is valid UTF8? I tried QString::fromUtf8(byteArray), but according to the documentation this will just replace or suppress any invalid sequences silently, without telling me what happened.
-
Try
QTextStream stream(byteArray);
stream.setCodec("UTF-8");
stream.setAutoDetectUnicode(false);
stream.readAll();
if (stream.status() == QTextStream::Ok)
    qDebug("Array is valid UTF-8");
else
    qDebug("Array is not valid UTF-8");
-
@VRonin That didn't work, it always said that it is valid UTF-8 even when it's not.
-
Found the solution here: https://forum.qt.io/topic/55325/solved-how-to-know-if-qtextstream-could-not-encode-data-it-reads
#include <QDebug>
#include <QTextCodec>

int main(int argc, char **argv)
{
    QByteArray byteArray("test\xc3\xb1test");
    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");

    // Valid input: 0xC3 0xB1 is the two-byte encoding of U+00F1.
    const QString validText = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
    if (state.invalidChars == 0)
        qDebug("Array is valid UTF-8");
    else
        qDebug("Array is not valid UTF-8");

    // Invalid input: 0xC3 needs a continuation byte, and 0x28 is not one.
    byteArray = QByteArray("test\xc3\x28test");
    const QString invalidText = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
    if (state.invalidChars == 0)
        qDebug("Array is valid UTF-8");
    else
        qDebug("Array is not valid UTF-8");

    return 0;
}
-
@VRonin, ConverterState doesn't seem to be documented. It looks more like an implementation detail than the intended way to monitor conversion results; otherwise, why wouldn't each of the three overloads take this optional out parameter?
-
@Violet-Giraffe said in How to check if QByteArray is valid UTF8?:
> ConverterState doesn't seem to be documented
Agree, it should be.
> Looks more like implementation detail than intended way to monitor conversion results
It's explicitly exported, not hidden in the private implementation, so I don't think this is true.
> why wouldn't each of the three overloads take this optional out parameter?
The other overloads are just for convenience; internally they all call the three-argument method.
-
@VRonin, fair enough. Don't get me wrong, your answer is spot on and it is, in fact, the only proper way of checking UTF-8 for correctness I've found, but I worry it may break in the future.
-
@Violet-Giraffe said in How to check if QByteArray is valid UTF8?:
> but I worry it may break in the future
It can only be broken in a major release (Qt 6, Qt 7, etc.), so very infrequently.
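If depending on ConverterState is a concern, one fallback is to validate the raw bytes by hand. This is a different technique from the codec-based check above, not something taken from Qt. The sketch below is a minimal standard-C++ validator following the RFC 3629 rules; the function name isValidUtf8 is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a Qt API): returns true if the len bytes at data
// are well-formed UTF-8 per RFC 3629. Rejects overlong forms, surrogates,
// code points above U+10FFFF, and sequences truncated at the end of input.
bool isValidUtf8(const unsigned char *data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = data[i];
        std::size_t extra;   // continuation bytes expected after the lead byte
        std::uint32_t cp;    // code point being assembled
        std::uint32_t min;   // smallest code point allowed for this length

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;   // stray continuation byte, or 0xF8..0xFF

        if (len - i - 1 < extra)
            return false;    // sequence cut off at the end of the input

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80)
                return false;            // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range

        i += extra + 1;
    }
    return true;
}
```

With a QByteArray this would presumably be called as isValidUtf8(reinterpret_cast<const unsigned char *>(byteArray.constData()), byteArray.size()). Unlike a check on invalidChars alone, it also treats a multi-byte sequence that is cut off at the end of the buffer as invalid.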
-
Also, if you are creating your own codec (for example, because Qt uses ICU for the one you need and the OS you are on does not use ICU, so you are left on your own), your sub-class is supposed to adjust QTextCodec::ConverterState::invalidChars (and QTextCodec::ConverterState::remainingChars, if appropriate) AND honour the QTextCodec::ConverterState::flags enum. The two flags that matter most are ConvertInvalidToNull (if set, each invalid char in the input should be output as a null) and IgnoreHeader (if set, any BOM characters at the beginning of Unicode input should be skipped and none generated in the output). As it happens, it looks like Qt tries a character or string and looks for a zero invalid char count; that is the mechanism that makes the QTextCodec::canEncode(...) methods work for your sub-class.
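To make that bookkeeping concrete, here is a rough sketch in plain C++ of what the decoding side of such a codec might do. It deliberately does not use Qt: SketchState, SketchFlag, and decodeAscii are hypothetical stand-ins that only mirror the roles of ConverterState, its flags, and a codec's conversion routine. The toy "encoding" is 7-bit ASCII, and using U+FFFD as the default replacement is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins mirroring the roles of QTextCodec::ConverterState
// and its flags. These are NOT Qt types.
enum SketchFlag : unsigned {
    SketchDefault = 0,
    SketchInvalidToNull = 1u << 0   // plays the role of ConvertInvalidToNull
};

struct SketchState {
    unsigned flags = SketchDefault;
    int invalidChars = 0;    // input units that could not be converted
    int remainingChars = 0;  // bytes held back from an incomplete sequence
};

// Toy 7-bit ASCII decoder: any byte >= 0x80 counts as invalid.
std::u16string decodeAscii(const char *in, std::size_t len, SketchState *state)
{
    std::u16string out;
    out.reserve(len);
    const bool toNull = state && (state->flags & SketchInvalidToNull);

    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out.push_back(static_cast<char16_t>(b));
        } else {
            // Emit a null when the flag asks for it, else U+FFFD (assumption).
            out.push_back(toNull ? u'\0' : u'\uFFFD');
            if (state)
                ++state->invalidChars;   // report the failure to the caller
        }
    }

    // A single-byte encoding never leaves a partial sequence behind.
    if (state)
        state->remainingChars = 0;
    return out;
}
```

A caller can then run the conversion and compare invalidChars against zero, which is the same pattern the canEncode(...) mechanism described above relies on for the encoding direction.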