Detecting Unicode encoding errors

ybailly71

Greetings all,

Is it possible to detect errors while encoding a QString from char*?

Here's the case: I get a string as a char* from an external library (over which I have no control).
This string can use either UTF-8 encoding or "local8bit" encoding, so it may vary from one user to another.

What I basically need is to be able to write something like this:

char const* src = ...;
bool encoding_failed = false;
QString qs = QString::fromUtf8(src, &encoding_failed);
if ( encoding_failed )
{
  qs = QString::fromLocal8Bit(src);
}

I looked at the docs for QString and QTextCodec, but I couldn't find any error support.

Is there any (portable) way of achieving this?

Thanks in advance for any hint.

mrjj

Hi and welcome.
I didn't find any error-reporting functions, but I did stumble upon this:

// byteArray holds the raw bytes to decode (e.g. a QByteArray).
QTextCodec::ConverterState state;
QTextCodec *codec = QTextCodec::codecForName("UTF-8");
const QString text = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
if (state.invalidChars > 0) {
    // At least one byte sequence could not be decoded as UTF-8.
    qDebug() << "Not a valid UTF-8 sequence.";
}

I hope it might be useful.

ybailly71

@mrjj Hello and thank you mrjj,

I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents are not documented: QTextCodec::ConverterState
Therefore, is it safe to assume your (nice) code, which relies on the undocumented invalidChars data member, will still work in the future?
I ended up with something like this (slightly simplified):

char const* src = ...;
QString qs = QString::fromUtf8(src);
QByteArray utf8 = qs.toUtf8();
// If the round trip does not reproduce the input, src was not valid UTF-8.
if ( utf8 != src )
  qs = QString::fromLocal8Bit(src);

...but that seems like a bit of overkill to me.

Paul Colby

@ybailly71 said:

I also stumbled on this QTextCodec::ConverterState class, but I did not go further as its contents is not documented: QTextCodec::ConverterState

It appears to be documented under the QTextCodec::convertToUnicode function:

If state is not 0, the codec should save the state after the conversion in state, and adjust the remainingChars and invalidChars members of the struct.

The members in question are public, as indicated by http://doc.qt.io/qt-5/qtextcodec-converterstate-members.html, but they are not listed in http://doc.qt.io/qt-5/qtextcodec-converterstate.html, as those members don't have any documentation that the engine (Doxygen?) is recognising.

I'd say this means it's safe to use them, but the doc formatting could be improved.

Cheers.

mrjj

Hi
Like @Paul-Colby, I do think it's safe to use.
They are not flagged for removal, and they are not in
a private file/class, as implementation details always are,
so it should be OK.

Your solution with
if ( utf8 != src )
is not bad but could be expensive with long strings. :)
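
If the cost of that round trip matters, one alternative is to check the bytes in a single pass before decoding, without building a second copy of the string. Below is a minimal sketch in plain standard C++ (no Qt involved). The name isValidUtf8 is hypothetical, not a Qt function, and the rules it checks (no overlong forms, no surrogates, nothing above U+10FFFF) are my reading of well-formed UTF-8, so treat it as an illustration rather than a reference implementation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of Qt): returns true if the len bytes
// starting at src form well-formed UTF-8.
bool isValidUtf8(const char* src, std::size_t len)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = p[i];
        if (lead < 0x80) {          // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t extra = 0;      // continuation bytes expected
        unsigned long minCp = 0;    // smallest code point allowed for this length
        unsigned long cp = 0;       // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; minCp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minCp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minCp = 0x10000; cp = lead & 0x07; }
        else return false;          // stray continuation byte or invalid lead byte
        if (len - i <= extra)
            return false;           // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80)
                return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp)
            return false;           // overlong encoding
        if (cp > 0x10FFFF)
            return false;           // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;           // UTF-16 surrogate, not allowed in UTF-8
        i += extra + 1;
    }
    return true;
}

// Convenience overload for null-terminated strings.
bool isValidUtf8(const char* src)
{
    return isValidUtf8(src, std::strlen(src));
}
```

With that in place, the decoding could be written as qs = isValidUtf8(src) ? QString::fromUtf8(src) : QString::fromLocal8Bit(src);. Keep in mind that this, like the other approaches in this thread, is a heuristic: pure ASCII input passes either way, and a local8bit string can occasionally happen to be valid UTF-8 too.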

ybailly71

Ok thanks all, I'll go with that then.

Have a nice day :-)

mrjj

@ybailly71
Nice day to you too :)