How to check if QByteArray is valid UTF8?

Donald Duck

How can I check whether a QByteArray contains valid UTF-8? I tried QString::fromUtf8(byteArray), but according to the documentation it silently replaces or suppresses invalid sequences without reporting that anything went wrong.

VRonin

Try

QTextStream stream(byteArray);
stream.setCodec("UTF-8");
stream.setAutoDetectUnicode(false);
stream.readAll();
if(stream.status()==QTextStream::Ok) qDebug("Array is valid UTF-8");
else qDebug("Array is not valid UTF-8");

Donald Duck

@VRonin That didn't work, it always said that it is valid UTF-8 even when it's not.

VRonin

Found the solution here: https://forum.qt.io/topic/55325/solved-how-to-know-if-qtextstream-could-not-encode-data-it-reads

#include <QByteArray>
#include <QDebug>
#include <QString>
#include <QTextCodec>

int main()
{
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");

    // Valid input: "\xc3\xb1" is the two-byte encoding of U+00F1.
    QByteArray byteArray("test\xc3\xb1test");
    QTextCodec::ConverterState validState;
    const QString validText = codec->toUnicode(byteArray.constData(), byteArray.size(), &validState);
    // remainingChars is non-zero when the input ends in the middle of a sequence.
    if (validState.invalidChars == 0 && validState.remainingChars == 0)
        qDebug("Array is valid UTF-8");
    else
        qDebug("Array is not valid UTF-8");

    // Invalid input: "\x28" is not a continuation byte, so "\xc3\x28" is malformed.
    // A fresh ConverterState keeps counts from the first call from carrying over.
    byteArray = QByteArray("test\xc3\x28test");
    QTextCodec::ConverterState invalidState;
    const QString invalidText = codec->toUnicode(byteArray.constData(), byteArray.size(), &invalidState);
    if (invalidState.invalidChars == 0 && invalidState.remainingChars == 0)
        qDebug("Array is valid UTF-8");
    else
        qDebug("Array is not valid UTF-8");

    return 0;
}

Violet Giraffe

@VRonin, ConverterState doesn't seem to be documented. It looks more like an implementation detail than the intended way to monitor conversion results. Otherwise, why wouldn't each of the three overloads take this optional out parameter?

VRonin

@Violet-Giraffe said in How to check if QByteArray is valid UTF8?:

ConverterState doesn't seem to be documented

Agreed, it should be.

Looks more like implementation detail than intended way to monitor conversion results

It's explicitly exported, not hidden in the private implementation, so I don't think this is true.

why wouldn't each of the three overloads take this optional out parameter?

The other overloads are just for convenience; internally they all call the three-argument method.

Violet Giraffe

@VRonin, fair enough. Don't get me wrong: your answer is spot on and is, in fact, the only proper way of checking UTF-8 for correctness that I've found, but I worry it may break in the future.

VRonin

@Violet-Giraffe said in How to check if QByteArray is valid UTF8?:

but I worry it may break in the future

It can only be broken in a major release (Qt 6, Qt 7, etc.), so very infrequently.
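
For anyone who shares the worry about depending on ConverterState, the check can also be done without Qt at all. The following is only a sketch of a hand-written validator, not something taken from Qt; the name isWellFormedUtf8 is made up for illustration. It rejects stray or missing continuation bytes, overlong encodings, surrogates, code points above U+10FFFF, and input that is cut off in the middle of a sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Qt): returns true if the `size` bytes
// at `data` are well-formed UTF-8.
bool isWellFormedUtf8(const char *data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(data);
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = bytes[i];
        if (lead < 0x80) {          // single-byte (ASCII) character
            ++i;
            continue;
        }
        std::size_t extra = 0;      // number of continuation bytes expected
        std::uint32_t cp = 0;       // code point being assembled
        std::uint32_t minCp = 0;    // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else return false;          // stray continuation byte or invalid lead byte
        if (size - i <= extra)
            return false;           // sequence is truncated at the end of the input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = bytes[i + k];
            if ((c & 0xC0) != 0x80)
                return false;       // expected a continuation byte (10xxxxxx)
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp)
            return false;           // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;           // UTF-16 surrogates are not valid in UTF-8
        if (cp > 0x10FFFF)
            return false;           // beyond the Unicode range
        i += extra + 1;
    }
    return true;
}

// Convenience overload for std::string (also hypothetical).
bool isWellFormedUtf8(const std::string &bytes)
{
    return isWellFormedUtf8(bytes.data(), bytes.size());
}
```

With a QByteArray this would be called as isWellFormedUtf8(byteArray.constData(), byteArray.size()). Unlike the codec approach it does not produce the decoded QString, so it is only useful when a yes/no answer is all that is needed.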

SlySven

Also, if you are creating your own codec (for example because Qt relies on ICU for that encoding and your platform does not provide ICU, so you are on your own), your subclass is supposed to update QTextCodec::ConverterState::invalidChars (and QTextCodec::ConverterState::remainingChars, if appropriate) and to honor the QTextCodec::ConverterState::flags values, in particular ConvertInvalidToNull (if set, each invalid character in the input should be output as a null) and IgnoreHeader (if set, any BOM at the beginning of Unicode input should be skipped and none generated in the output).

As it happens, Qt appears to implement the QTextCodec::canEncode(...) methods for your subclass by converting the character or string and checking that the invalid-character count is zero.
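
The bookkeeping described above can be sketched without Qt. This is only a toy model: State, ConvertInvalidToNull, toAscii and canEncode are hypothetical stand-ins that mimic the shape of QTextCodec::ConverterState and a codec's conversion function, with an arbitrary flag value, and the "codec" is plain 7-bit ASCII so that any code unit of 0x80 or above counts as invalid.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the conversion flags (the value is arbitrary).
enum ConversionFlag : unsigned {
    DefaultConversion = 0,
    ConvertInvalidToNull = 0x1
};

// Hypothetical stand-in for QTextCodec::ConverterState.
struct State {
    unsigned flags = DefaultConversion;
    int invalidChars = 0;
    int remainingChars = 0;
};

// Toy codec: encodes UTF-16 code units as 7-bit ASCII.
// Code units that cannot be represented are counted in state->invalidChars
// and written as '\0' if ConvertInvalidToNull is set, otherwise as '?'.
std::string toAscii(const std::u16string &in, State *state)
{
    const bool toNull = state && (state->flags & ConvertInvalidToNull);
    std::string out;
    out.reserve(in.size());
    int invalid = 0;
    for (char16_t ch : in) {
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else {
            ++invalid;
            out.push_back(toNull ? '\0' : '?');
        }
    }
    if (state) {
        state->invalidChars += invalid;  // report problems to the caller
        state->remainingChars = 0;       // this codec never leaves partial input pending
    }
    assert(out.size() == in.size());     // one output byte per input code unit
    return out;
}

// Mirrors the mechanism described above: convert with a fresh state
// and treat a zero invalid-character count as "can encode".
bool canEncode(const std::u16string &text)
{
    State state;
    state.flags = ConvertInvalidToNull;
    toAscii(text, &state);
    return state.invalidChars == 0;
}
```

In this model canEncode(u"test") is true, while any string containing a non-ASCII code unit makes it false. A real subclass would do the same accounting inside its own conversion functions.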