
How to check if QByteArray is valid UTF8?



  • How can I check whether a QByteArray is valid UTF-8? I tried QString::fromUtf8(byteArray), but according to the documentation it silently replaces or suppresses any invalid sequences without telling me what happened.



  • Try

    QTextStream stream(byteArray);
    stream.setCodec("UTF-8");
    stream.setAutoDetectUnicode(false);
    stream.readAll();
    if(stream.status()==QTextStream::Ok) qDebug("Array is valid UTF-8");
    else qDebug("Array is not valid UTF-8");
    


  • @VRonin That didn't work; it always reports valid UTF-8, even when the array is not.



  • Found the solution here: https://forum.qt.io/topic/55325/solved-how-to-know-if-qtextstream-could-not-encode-data-it-reads

    #include <QByteArray>
    #include <QDebug>
    #include <QString>
    #include <QTextCodec>

    int main() {
        QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    /////////////////////////////////////////////////////////////////////////////////////////////////////
        // Well-formed input: 0xC3 0xB1 is the two-byte encoding of U+00F1.
        QByteArray byteArray("test\xc3\xb1test");
        QTextCodec::ConverterState validState;
        const QString validText = codec->toUnicode(byteArray.constData(), byteArray.size(), &validState);
        if (validState.invalidChars == 0)
            qDebug("Array is valid UTF-8");
        else
            qDebug("Array is not valid UTF-8");
    /////////////////////////////////////////////////////////////////////////////////////////////////////
        // Malformed input: 0xC3 needs a continuation byte, but 0x28 is '('.
        byteArray = QByteArray("test\xc3\x28test");
        // Use a fresh state so counts from the previous conversion cannot carry over.
        QTextCodec::ConverterState invalidState;
        const QString invalidText = codec->toUnicode(byteArray.constData(), byteArray.size(), &invalidState);
        if (invalidState.invalidChars == 0)
            qDebug("Array is valid UTF-8");
        else
            qDebug("Array is not valid UTF-8");
    /////////////////////////////////////////////////////////////////////////////////////////////////////
        return 0;
    }
    


  • @VRonin, ConverterState doesn't seem to be documented. It looks more like an implementation detail than the intended way to monitor conversion results; otherwise, why wouldn't each of the three overloads take this optional out parameter?



  • @Violet-Giraffe said in How to check if QByteArray is valid UTF8?:

    ConverterState doesn't seem to be documented

    Agreed, it should be.

    Looks more like implementation detail than intended way to monitor conversion results

    It's explicitly exported, not hidden in the private implementation, so I don't think this is true.

    why wouldn't every of the three overloads take this optional out parameter?

    The other overloads are just for convenience; internally they all call the three-argument overload.



  • @VRonin, fair enough. Don't get me wrong, your answer is spot on, and it is in fact the only proper way of checking UTF-8 for correctness that I've found, but I worry it may break in the future.



  • @Violet-Giraffe said in How to check if QByteArray is valid UTF8?:

    but I worry it may break in the future

    It can only break in a major release (Qt 6, Qt 7, etc.), so very infrequently.
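
For anyone who shares the worry above about depending on ConverterState, one alternative is to validate the bytes by hand without going through a codec at all. The following is only a minimal sketch, not something proposed in this thread: the function name isValidUtf8 is made up for illustration, and it uses only the C++ standard library. It rejects stray or missing continuation bytes, overlong forms, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Qt API): returns true if the `size` bytes at
// `data` are well-formed UTF-8.
bool isValidUtf8(const char *data, std::size_t size)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(data);
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = p[i];
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected after the lead
        std::uint32_t cp = 0;       // code point being assembled
        std::uint32_t min = 0;      // smallest code point allowed at this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;          // stray continuation byte or invalid lead byte

        if (size - i <= extra)      // sequence is cut off at the end of the input
            return false;

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) // every continuation byte must look like 10xxxxxx
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond the Unicode range

        i += extra + 1;
    }
    return true;
}

// Convenience overload for std::string input.
bool isValidUtf8(const std::string &bytes)
{
    return isValidUtf8(bytes.data(), bytes.size());
}
```

With a QByteArray this would presumably be called as isValidUtf8(byteArray.constData(), byteArray.size()). The trade-off is that you maintain the validation rules yourself instead of relying on Qt's decoder.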



  • Also, if you are creating your own codec (for instance because Qt relies on ICU for that encoding and your OS does not use ICU, so you are left on your own), your subclass is supposed to update QTextCodec::ConverterState::invalidChars (and QTextCodec::ConverterState::remainingChars if appropriate) and to honour the QTextCodec::ConverterState::flags enum, particularly ConvertInvalidToNull (if set, each invalid character in the input should be output as a null) and IgnoreHeader (if set, any BOM characters at the beginning of Unicode input should be skipped and none generated in the output).

    As it happens, Qt appears to implement the QTextCodec::canEncode(...) methods for your subclass by converting a character or string and checking that the invalid-character count is zero.
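
The bookkeeping described in the post above can be illustrated without Qt. The sketch below is a toy, standard-library-only model: SketchState, toAsciiSketch and sketchCanEncode are hypothetical names, not Qt types or functions, and only the invalid-character count and a ConvertInvalidToNull-style flag are modelled (remainingChars and IgnoreHeader are left out). The '?' placeholder for unencodable characters is likewise an assumption of this sketch.

```cpp
#include <string>

// Hypothetical stand-in for QTextCodec::ConverterState; NOT the Qt type.
struct SketchState {
    bool convertInvalidToNull = false; // models the ConvertInvalidToNull flag
    int invalidChars = 0;              // characters the codec could not convert
};

// Toy "codec": encodes UTF-16 code units as 7-bit ASCII and records every
// character it could not represent in the state object, if one is supplied.
std::string toAsciiSketch(const std::u16string &in, SketchState *state)
{
    std::string out;
    out.reserve(in.size());
    for (char16_t ch : in) {
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
            continue;
        }
        // Unencodable character: count it, then emit a placeholder.
        if (state)
            ++state->invalidChars;
        const bool toNull = state && state->convertInvalidToNull;
        out.push_back(toNull ? '\0' : '?');
    }
    return out;
}

// A canEncode-style check built on top of the conversion: convert into a
// throwaway state and look for a zero invalid-character count.
bool sketchCanEncode(const std::u16string &in)
{
    SketchState state;
    toAsciiSketch(in, &state);
    return state.invalidChars == 0;
}
```

The point of the sketch is the division of labour: the conversion routine is the only place that knows what is invalid, and anything built on top of it, such as a canEncode-style check, just inspects the count afterwards.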

