QChar::unicode() concern

  • Perdrix wrote (#1):

    QChar::unicode() is documented to return the character as a char16_t. This appears to be heading down a similar rabbit hole to the one that C++20 got itself into with UTF-8 support.

    The problem is that (if I understand things correctly) a QString is stored internally as UTF-16, which means that for Basic Multilingual Plane characters a char16_t (or uint16_t) works just fine (ignoring the special surrogate range of U+D800 to U+DFFF). However, this all falls apart for any characters from the supplementary planes (U+10000 to U+10FFFF), which require TWO char16_t code units to encode.

    So if I were to iterate over a QString thus:

    QString password{ emailPassword->text() };
    
    //
    // Obfuscate the password prior to saving it
    //
    for (auto& character : password)
    {
    	character = QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U);
    }
    

    will it work as "expected" only for BMP characters?

    Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?

    Thanks
    David

      ChrisW67 wrote (#8):

      @Perdrix said in QChar::unicode() concern:

      Or will it return 2 char16_t (one at a time) for any characters from the 1st Supplementary Plane?

      Here is the simple experiment:

      #include <QCoreApplication>
      #include <QDebug>
      
      int main(int argc, char**argv) {
              QCoreApplication app(argc, argv);
      
              QString input("\U00010000 \U00010001 \U00010002 \U00010003 \U00010004");
              qDebug() << input;
      
              for (auto& character: input) {
                      QChar modified(QChar(static_cast<uint16_t>(character.unicode()) ^ 0x82U));
                      qDebug() << character << modified << modified.isHighSurrogate() << modified.isLowSurrogate();
              }
      
              return 0;
      }
      

      Results:

      "𐀀 𐀁 𐀂 𐀃 𐀄"
      '\ud800' '\ud882' true false
      '\udc00' '\udc82' false true
      ' ' '\u00a2' false false
      '\ud800' '\ud882' true false
      '\udc01' '\udc83' false true
      ' ' '\u00a2' false false
      '\ud800' '\ud882' true false
      '\udc02' '\udc80' false true
      ' ' '\u00a2' false false
      '\ud800' '\ud882' true false
      '\udc03' '\udc81' false true
      ' ' '\u00a2' false false
      '\ud800' '\ud882' true false
      '\udc04' '\udc86' false true
      

      Qt returns surrogate code units for characters not in the BMP, as expected.
      Your code will toggle bits in the low byte of each 16-bit integer, as expected. This will modify both halves of the surrogate pair and result in a valid surrogate pair, though not necessarily an assigned character.
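
The same behaviour can be reproduced without Qt. Below is a minimal sketch in standard C++, assuming a std::u16string stands in for QString's UTF-16 storage. The helpers xorObfuscate, isHighSurrogate and isLowSurrogate are hypothetical names written for this illustration (QChar has member functions with the latter two names, as used in the experiment above). It shows why a mask confined to the low byte cannot move a code unit into or out of a surrogate range.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// High (leading) surrogates occupy U+D800..U+DBFF: top six bits are 110110.
constexpr bool isHighSurrogate(char16_t unit) { return (unit & 0xFC00) == 0xD800; }

// Low (trailing) surrogates occupy U+DC00..U+DFFF: top six bits are 110111.
constexpr bool isLowSurrogate(char16_t unit) { return (unit & 0xFC00) == 0xDC00; }

// XOR every UTF-16 code unit with the mask. The mask must fit in the low
// byte, so the top six bits that classify a unit as high surrogate, low
// surrogate or ordinary BMP character are never changed.
std::u16string xorObfuscate(std::u16string text, std::uint16_t mask = 0x82U) {
    assert(mask <= 0xFF);
    for (char16_t& unit : text) {
        unit = static_cast<char16_t>(unit ^ mask);
    }
    return text;
}
```

Applying xorObfuscate twice with the same mask restores the original string, which is all the obfuscation in the original question relies on. Note that the intermediate string may contain unassigned characters, so it should be stored as raw UTF-16 rather than passed through anything that normalises or validates text.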

        SGaist (Lifetime Qt Champion) wrote (#2):

        Hi,

        Looks like you should rather use QString::toUcs4.

        That said, obfuscating a password is not the correct way to handle that kind of data, you should encrypt it.

        Interested in AI ? www.idiap.ch
        Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct
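
QString::toUcs4, suggested above, returns the string as UCS-4 code points, so each supplementary-plane character arrives as one value instead of two. Below is a minimal sketch of that decoding step in standard C++, again assuming a std::u16string in place of a QString. The function toCodePoints is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch that may not match what Qt does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-16 code units into Unicode code points.
std::u32string toCodePoints(const std::u16string& text) {
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char32_t unit = text[i];
        const bool high = (unit & 0xFC00) == 0xD800;
        const bool low = (unit & 0xFC00) == 0xDC00;
        if (high && i + 1 < text.size() && (text[i + 1] & 0xFC00) == 0xDC00) {
            // Valid pair: combine 10 bits from each half, then add 0x10000.
            const char32_t lowUnit = text[++i];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (lowUnit - 0xDC00));
        } else if (high || low) {
            // Unpaired surrogate: substitute U+FFFD (assumption of this sketch).
            out.push_back(0xFFFD);
        } else {
            // Ordinary BMP character: the code unit is the code point.
            out.push_back(unit);
        }
    }
    // Decoding never produces more code points than there were code units.
    assert(out.size() <= text.size());
    return out;
}
```

Working on code points only matters if the transformation has to treat each character as a unit. For a reversible XOR over the raw storage, iterating over code units as in the original question gives the same round trip.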

          Perdrix wrote (#3):

          @SGaist Does Qt offer a cross-platform encryption API?

          In this case it's not a "critical" password so obfuscation isn't really an issue ...

          Outlook used to (and may still) encrypt its SMTP password using:

          CryptProtectData(&blobClear, 0, 0, 0, 0, CRYPTPROTECT_UI_FORBIDDEN, &blobEncrypted);

          which uses a private key specific to the current user.

          But that of course isn't very portable ...

            Perdrix wrote (#4):

            @SGaist The use of UCS4 isn't the issue - it is what the unicode() member function does that is crucial here. If it correctly handles 1st Supplementary Plane characters then that is fine. If it doesn't, that needs to be documented very clearly, saying in effect that you shouldn't expect this to work with characters that aren't in the BMP.

              JonB wrote (#5):

              @Perdrix said in QChar::unicode() concern:

              @SGaist Does Qt offer a cross-platform encryption API?

              What about QCryptographicHash Class?

                Perdrix wrote (#6):

                @JonB That class only creates a hash of data, and AFAICT, there's no encryption capability. FWIW, if a password is encrypted using CryptProtectData, then any code running in the user's Windows session can decrypt that data.

                So using CryptProtectData() to protect password data is only as secure as the user's login password, or their malware detection code (to prevent bad actor's code running in the user's context).

                David

                  SGaist (Lifetime Qt Champion) wrote (#7):

                  @Perdrix said in QChar::unicode() concern:

                  @SGaist The use of UCS4 isn't the issue - it is what the unicode() member function does that is crucial here. If it correctly handles 1st Supplementary Plane characters then that is fine. If it doesn't, that needs to be documented very clearly, saying in effect that you shouldn't expect this to work with characters that aren't in the BMP.

                  I think that this is something that you should bring to the interest mailing list. You'll find there Qt's developers/maintainers.

                      Perdrix wrote (#9):

                      @ChrisW67 I didn't know you could feed QString with \U00010000 etc.

                      Thanks for setting my mind at rest.

                      D.
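
As background on why each \U0001000x escape in the experiment shows up as two QChar values: a code point above U+FFFF is split into a high and a low surrogate when it is encoded as UTF-16. Below is a minimal sketch of that encoding step in standard C++. The function encodeUtf16 is a hypothetical helper for illustration, not a Qt function.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as one or two UTF-16 code units.
std::u16string encodeUtf16(char32_t codePoint) {
    // Surrogate code points are not scalar values; U+10FFFF is the upper limit.
    assert(codePoint <= 0x10FFFF && (codePoint < 0xD800 || codePoint > 0xDFFF));
    if (codePoint < 0x10000) {
        // BMP character: a single code unit holds the value directly.
        return std::u16string(1, static_cast<char16_t>(codePoint));
    }
    // Supplementary character: subtract 0x10000 to leave 20 bits, then put
    // the upper 10 bits in the high surrogate and the lower 10 in the low one.
    const char32_t offset = codePoint - 0x10000;
    std::u16string pair;
    pair.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
    pair.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    return pair;
}
```

For U+10004 this gives 0xD800 followed by 0xDC04, which are the two values printed for the last character in the results above.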

                      • Perdrix has marked this topic as solved
