QChar::unicode() concern
-
QChar::unicode() is documented to return the character as a char16_t. This appears to be heading down a similar rabbit hole to the one that C++20 got itself into with UTF-8 support.
The problem is that (if I understand things correctly) a QString is stored internally as UTF-16, which means that for Basic Multilingual Plane characters a single char16_t (or uint16_t) works just fine (ignoring the surrogate range U+D800 to U+DFFF). However, this all falls apart for characters in the supplementary planes (U+10000 to U+10FFFF), each of which requires two char16_t code units to encode.
So if I iterate over a QString thus:
    QString password{ emailPassword->text() };
    //
    // Obfuscate the password prior to saving it
    //
    for (auto& character : password)
    {
        character = QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U);
    }
am I right that it will work as "expected" only for BMP characters?
Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?
Thanks
David
-
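To make the one-unit versus two-unit distinction concrete, here is a minimal sketch in standard C++ (no Qt). The helper name `encodeUtf16` is made up for illustration, and it assumes its argument is a valid code point that is not itself a surrogate.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Assumes cp <= 0x10FFFF and cp is not in the surrogate range.
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // BMP: one 16-bit code unit holds the code point directly.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into two 10-bit halves.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this helper, U+0041 yields one code unit, while U+10000 yields the pair 0xD800, 0xDC00. A per-unit loop over UTF-16 text therefore visits each half of such a pair separately.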
@Perdrix said in QChar::unicode() concern:
Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?
Here is the simple experiment:
    #include <QCoreApplication>
    #include <QDebug>

    int main(int argc, char** argv)
    {
        QCoreApplication app(argc, argv);
        QString input("\U00010000 \U00010001 \U00010002 \U00010003 \U00010004");
        qDebug() << input;
        for (auto& character : input) {
            QChar modified(QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U));
            qDebug() << character << modified << modified.isHighSurrogate() << modified.isLowSurrogate();
        }
        return 0;
    }
Results:
    "𐀀 𐀁 𐀂 𐀃 𐀄"
    '\ud800' '\ud882' true false
    '\udc00' '\udc82' false true
    ' ' '\u00a2' false false
    '\ud800' '\ud882' true false
    '\udc01' '\udc83' false true
    ' ' '\u00a2' false false
    '\ud800' '\ud882' true false
    '\udc02' '\udc80' false true
    ' ' '\u00a2' false false
    '\ud800' '\ud882' true false
    '\udc03' '\udc81' false true
    ' ' '\u00a2' false false
    '\ud800' '\ud882' true false
    '\udc04' '\udc86' false true
Qt returns surrogate code points for characters not in the BMP, as expected.
Your code will toggle bits in the low byte of each 16-bit integer, as expected. This will modify both halves of the surrogate pair and result in a valid surrogate pair, though not necessarily one that encodes an assigned character.
-
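The same behaviour can be reproduced without Qt. The sketch below uses only standard C++, and the helper names `isHighSurrogate`, `isLowSurrogate` and `xorUnits` are made up for illustration. The surrogate class of a code unit is decided by its top six bits, so a mask confined to the low ten bits (such as 0x82) cannot move a unit into or out of either surrogate range.

```cpp
#include <string>

// Hypothetical helpers mirroring the range checks that QChar performs.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// XOR every UTF-16 code unit with the same mask, as the loop in the question does.
std::u16string xorUnits(std::u16string text, char16_t mask)
{
    for (char16_t& unit : text)
        unit = static_cast<char16_t>(unit ^ mask);
    return text;
}
```

Applying `xorUnits` to `u"\U00010001"` with mask 0x82 gives 0xD882, 0xDC83, matching the second pair in the results above. Applying it a second time restores the original string, so the obfuscation round-trips even for non-BMP input.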
Hi,
It looks like you should use QString::toUcs4 instead.
That said, obfuscating a password is not the correct way to handle that kind of data; you should encrypt it.
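For readers unfamiliar with what a conversion to UCS-4 involves, here is a rough standard C++ sketch of the idea. It is not Qt's implementation. The helper name `decodeUtf16` is hypothetical, and passing unpaired surrogates through unchanged is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: combine surrogate pairs into single 32-bit code points.
// Unpaired surrogates are copied through unchanged (an assumption of this sketch).
std::u32string decodeUtf16(const std::u16string& text)
{
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t unit = text[i];
        const bool high = unit >= 0xD800 && unit <= 0xDBFF;
        if (high && i + 1 < text.size()) {
            const char32_t next = text[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Rebuild the code point from the two 10-bit halves.
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(unit);
    }
    return out;
}
```

After decoding, each element is one whole code point, so a loop over the result never sees half of a character.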
-
@SGaist Does Qt offer a cross-platform encryption API?
In this case it's not a "critical" password so obfuscation isn't really an issue ...
Outlook used to (and may still) encrypt its SMTP password using:
CryptProtectData(&blobClear, 0, 0, 0, 0, CRYPTPROTECT_UI_FORBIDDEN, &blobEncrypted);
which uses a private key specific to the current user.
But that of course isn't very portable ...
-
@SGaist The use of UCS-4 isn't the issue. What the unicode() member function does is crucial here: if it correctly handles 1st Supplementary Plane characters then that is fine. If it doesn't, that needs to be documented very clearly, saying in effect that you shouldn't expect this to work with characters that aren't in the BMP.
-
@Perdrix said in QChar::unicode() concern:
@SGaist Does Qt offer a cross-platform encryption API?
What about the QCryptographicHash class?
-
@JonB That class only creates a hash of data and, AFAICT, has no encryption capability. FWIW, if a password is encrypted using CryptProtectData, then any code running in the user's Windows session can decrypt that data.
So using CryptProtectData() to protect password data is only as secure as the user's login password, or their malware detection software (which has to prevent a bad actor's code from running in the user's context).
David
-
@Perdrix said in QChar::unicode() concern:
@SGaist The use of UCS-4 isn't the issue. What the unicode() member function does is crucial here: if it correctly handles 1st Supplementary Plane characters then that is fine. If it doesn't, that needs to be documented very clearly, saying in effect that you shouldn't expect this to work with characters that aren't in the BMP.
I think that this is something that you should bring to the interest mailing list. You'll find there Qt's developers/maintainers.
-
Perdrix has marked this topic as solved