QChar::unicode() concern

Perdrix

QChar::unicode() is documented to return the character as a char16_t. This appears to be heading down a similar rabbit hole to the one that C++20 got itself into with UTF-8 support.

The problem is that (if I understand things correctly) a QString is stored internally as UTF-16, which means that for Basic Multilingual Plane characters a char16_t (or uint16_t) works just fine (ignoring the special range of U+D800 to U+DFFF). However, this all falls apart for any characters from the supplementary planes (U+10000 to U+10FFFF), which require TWO char16_t code units to encode.
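For reference, the split into two code units can be sketched in plain C++ without Qt. The helper name encodeUtf16 is hypothetical, and the sketch assumes its input is a valid Unicode scalar value; it only illustrates the UTF-16 arithmetic, not how QString does it internally.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not in U+D800..U+DFFF).
std::vector<char16_t> encodeUtf16(char32_t cp)
{
    if (cp < 0x10000) {
        // BMP: a single code unit holds the whole code point.
        return { static_cast<char16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    const char32_t offset = cp - 0x10000;
    return {
        static_cast<char16_t>(0xD800 + (offset >> 10)),   // high surrogate
        static_cast<char16_t>(0xDC00 + (offset & 0x3FF))  // low surrogate
    };
}
```

With this sketch, U+10000 becomes the pair 0xD800 0xDC00, while a BMP character such as U+0041 stays a single code unit.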

So if I were to iterate over a QString thus:

QString password{ emailPassword->text() };

//
// Obfuscate the password prior to saving it
//
for (auto& character : password)
{
	character = QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U);
}

will it work as "expected" only for BMP characters?

Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?

Thanks
David

ChrisW67

@Perdrix said in QChar::unicode() concern:

Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?

Here is the simple experiment:

#include <QCoreApplication>
#include <QDebug>

int main(int argc, char**argv) {
        QCoreApplication app(argc, argv);

        QString input("\U00010000 \U00010001 \U00010002 \U00010003 \U00010004");
        qDebug() << input;

        for (auto& character: input) {
                QChar modified(QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U));
                qDebug() << character << modified << modified.isHighSurrogate() << modified.isLowSurrogate();
        }

        return 0;
}

Results:

"𐀀 𐀁 𐀂 𐀃 𐀄"
'\ud800' '\ud882' true false
'\udc00' '\udc82' false true
' ' '\u00a2' false false
'\ud800' '\ud882' true false
'\udc01' '\udc83' false true
' ' '\u00a2' false false
'\ud800' '\ud882' true false
'\udc02' '\udc80' false true
' ' '\u00a2' false false
'\ud800' '\ud882' true false
'\udc03' '\udc81' false true
' ' '\u00a2' false false
'\ud800' '\ud882' true false
'\udc04' '\udc86' false true

Qt returns surrogate code points for characters not in the BMP, as expected.
Your code will toggle bits in the low byte of each 16-bit integer, as expected. This modifies both halves of the surrogate pair; the result is still a valid surrogate pair, though not necessarily an assigned character.
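The same behaviour can be checked without Qt. The following is a minimal sketch in plain C++, assuming a std::u16string stands in for the QString; the helper names (isHighSurrogate, isLowSurrogate, xorUnits) are hypothetical, and the range checks simply mirror what the QChar functions of the same name test for.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helpers using the standard UTF-16 surrogate ranges.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// XOR every UTF-16 code unit with a mask, as the loop in the question does.
std::u16string xorUnits(std::u16string text, std::uint16_t mask)
{
    for (char16_t& unit : text) {
        unit = static_cast<char16_t>(unit ^ mask);
    }
    return text;
}
```

Because 0x82 only touches bits in the low byte, the top six bits that mark a unit as a high or low surrogate are left alone, and applying xorUnits a second time with the same mask restores the original string.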

SGaist

Hi,

Looks like you should rather use QString::toUcs4.
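To illustrate what working on whole code points involves, here is a rough sketch in plain C++ of decoding UTF-16 code units into code points. This is not Qt's implementation of toUcs4; the helper name decodeUtf16 is hypothetical, and passing unpaired surrogates through unchanged is an assumption made for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-16 code units into code points.
// Unpaired surrogates are passed through unchanged (an assumption of this
// sketch; a real decoder might substitute U+FFFD instead).
std::vector<char32_t> decodeUtf16(const std::u16string& text)
{
    std::vector<char32_t> codePoints;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t unit = text[i];
        const bool high = unit >= 0xD800 && unit <= 0xDBFF;
        if (high && i + 1 < text.size()) {
            const char32_t next = text[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the two 10-bit halves and restore the 0x10000 offset.
                codePoints.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i; // consume the low surrogate as well
                continue;
            }
        }
        codePoints.push_back(unit);
    }
    return codePoints;
}
```

Each element of the result is then one character, regardless of plane, so a per-character transformation no longer sees the two halves of a pair separately.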

That said, obfuscating a password is not the correct way to handle that kind of data, you should encrypt it.

Perdrix

@SGaist Does Qt offer a cross-platform encryption API?

In this case it's not a "critical" password so obfuscation isn't really an issue ...

Outlook used to (and may still) encrypt its SMTP password using:

CryptProtectData(&blobClear, 0, 0, 0, 0, CRYPTPROTECT_UI_FORBIDDEN, &blobEncrypted);

which uses a private key specific to the current user.

But that of course isn't very portable ...

Perdrix

@SGaist The use of UCS4 isn't the issue. What is crucial here is what the unicode() member function does: if it correctly handles 1st Supplementary Plane characters, then that is fine. If it doesn't, the documentation needs to say very clearly that you shouldn't expect this to work with characters that aren't in the BMP.

JonB

@Perdrix said in QChar::unicode() concern:

@SGaist Does Qt offer a cross-platform encryption API?

What about QCryptographicHash Class?

Perdrix

@JonB That class only creates a hash of data, and AFAICT, there's no encryption capability. FWIW if a password is encrypted using CryptProtectData, then any code running in the user's Windows session can decrypt that data.

So using CryptProtectData() to protect password data is only as secure as the user's login password, or their malware detection code (to prevent bad actor's code running in the user's context).

David

SGaist

@Perdrix said in QChar::unicode() concern:

@SGaist The use of UCS4 isn't the issue. What is crucial here is what the unicode() member function does: if it correctly handles 1st Supplementary Plane characters, then that is fine. If it doesn't, the documentation needs to say very clearly that you shouldn't expect this to work with characters that aren't in the BMP.

I think that this is something that you should bring to the interest mailing list. You'll find there Qt's developers/maintainers.

Perdrix

@ChrisW67 I didn't know you could feed QString with \U00010000 etc.

Thanks for setting my mind at rest.

D.