Detecting unicode encoding errors



  • Greetings all,

    Is it possible to detect errors while encoding a QString from char*?

    Here's the case: I get a string as a char* from an external library (over which I have no control).
    This string may be encoded either as UTF-8 or in the "local 8-bit" encoding, so it can vary from one user to another.

    What I basically need is to be able to write something like this:

    char const* src = ...;
    bool encoding_failed = false;
    QString qs = QString::fromUtf8(src, &encoding_failed);
    if ( encoding_failed )
    {
      qs = QString::fromLocal8Bit(src);
    }
    

    I looked at the docs for QString and QTextCodec, but I couldn't find any error support.

    Is there any (portable) way of achieving this?

    Thanks in advance for any hint.


  • Qt Champions 2016

    hi and welcome
    I didn't find any error reporting functions but I did stumble upon

    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    const QString text = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
    if (state.invalidChars > 0) {
        qDebug() << "Not a valid UTF-8 sequence.";
    }

    I hope it might be useful.



  • @mrjj Hello and thank you mrjj,

    I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents are not documented: QTextCodec::ConverterState
    Therefore, is it safe to assume your (nice) code, which relies on the undocumented invalidChars data member, will still work in the future?
    I ended up with something like this (slightly simplified):

    char const* src = ...;
    QString qs = QString::fromUtf8(src);
    QByteArray utf8 = qs.toUtf8();
    if ( utf8 != src )
      qs = QString::fromLocal8Bit(src);  // round trip changed the bytes, so src was not valid UTF-8
    

    ...but that seems a bit overkill to me.



  • @ybailly71 said:

    I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents are not documented: QTextCodec::ConverterState

    It appears to be documented under the QTextCodec::convertToUnicode function:

    If state is not 0, the codec should save the state after the conversion in state, and adjust the remainingChars and invalidChars members of the struct.

    The members in question are public, as shown at http://doc.qt.io/qt-5/qtextcodec-converterstate-members.html, but they are not listed at http://doc.qt.io/qt-5/qtextcodec-converterstate.html because they have no documentation that the doc engine (Doxygen?) recognises.

    I'd say this means it's safe to use them, but the doc formatting could be improved.

    Cheers.


  • Qt Champions 2016

    Hi
    Just like @Paul-Colby, I think it's safe to use.
    The members are not flagged for removal, and they are not in
    a private file/class, as implementation details usually are,
    so they should be OK.

    Your solution with
    if ( utf8 != src )
    is not bad but could be expensive with long strings. :)
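
One way to avoid the cost of the round trip is to check the bytes for well-formed UTF-8 in a single pass, with no allocation, before deciding which decoder to call. The following is only a minimal sketch in plain C++ (no Qt): the helper name isValidUtf8 is hypothetical and not part of any Qt API, and it assumes the caller knows the length of the buffer.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Qt): returns true if the first `len`
// bytes at `s` are well-formed UTF-8 (no stray or missing continuation
// bytes, no overlong forms, no surrogates, nothing above U+10FFFF).
bool isValidUtf8(const char* s, std::size_t len)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = p[i];
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }

        std::size_t extra = 0;      // continuation bytes expected
        unsigned long minCp = 0;    // smallest code point legal at this length
        unsigned long cp = 0;       // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minCp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minCp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minCp = 0x10000; cp = lead & 0x07; }
        else return false;          // stray continuation byte or invalid lead byte

        if (len - i <= extra)       // sequence is cut off by the end of the buffer
            return false;

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) // every continuation byte must be 10xxxxxx
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

With a check like this, the caller could pick QString::fromUtf8(src) when it returns true and QString::fromLocal8Bit(src) otherwise. Note that this is still a heuristic: pure ASCII input, and any local 8-bit text that happens to form valid UTF-8, will be treated as UTF-8.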



  • Ok thanks all, I'll go with that then.

    Have a nice day :-)


  • Qt Champions 2016

    @ybailly71
    Nice day to you too :)

