QRegExp does not support unicode 16 bit character?

    king558 wrote (#1):
    auto someText = QString( "%1" ).arg( QChar( 0x2029 ) );
    QRegExp regexp1("\\u2029");
    // QRegExp regexp1("\\x{2029}");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    QRegularExpression regexp2("\\x{2029}"); // No complaints here
    auto f2 = regexp2.match(someText).hasMatch();
    

    f1 is always -1, whether the pattern is \u2029 or \x{2029}, but QRegularExpression works. Does that mean QRegExp does not support 16-bit characters?

      sierdzio (Moderators) wrote (#2):

      @king558 As the documentation clearly states, the syntax in QRegExp is without curly braces:

      \xhhhh Matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).

      https://doc.qt.io/qt-6/qregexp.html#characters-and-abbreviations-for-sets-of-characters

      Btw., if you can use QRegularExpression, use it. QRegExp has been deprecated since Qt 5 and should not be used.

        king558 wrote (#3):

        Thanks for your tips.

        auto someText = QString( "%1" ).arg( QChar( 0xD83D ) );
        QRegExp regexp1("\\xD83D");
         // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
        auto f1 = regexp1.indexIn(someText);
        auto f11 = regexp1.capturedTexts()[0];
        QRegularExpression regexp2("\\x{D83D}", QRegularExpression::CaseInsensitiveOption); // No complaints here
        auto f2 = regexp2.match(someText);
        if (f2.hasMatch()) {
            auto f22 = f2.captured(0); // the pattern has no capture groups; index 0 is the whole match
        }
        

        I will use QRegularExpression as suggested. But something strange happens: when I change the code point from 0x2029 to 0xD83D, QRegExp::indexIn returns 0, so it works, but QRegularExpression's hasMatch returns false. Could you or anybody tell me what I did wrong here?

          king558 wrote (#4):

          I am using Qt 5.15.2 LTS and Qt 6.15 LTS

            ChrisW67 wrote (#5):

            @king558 The code point U+D83D is a high surrogate (range U+D800 to U+DBFF), i.e. one half of a surrogate pair. It makes little sense without the following code unit (a low surrogate in the range U+DC00 to U+DFFF). The two surrogates together encode Unicode code points from U+10000 onward. You can use the PCRE \C in your pattern to handle 16-bit code units independently, with the potential for unusual behaviour.
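
To make the surrogate mechanism concrete, here is a minimal sketch in plain C++ (no Qt) of how a code point above U+FFFF is split into two 16-bit code units. The helper name `toSurrogatePair` is hypothetical and only for illustration; in Qt code, the static functions `QChar::highSurrogate()` and `QChar::lowSurrogate()` should give the same values.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not part of Qt): split a supplementary code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogate code units.
std::pair<char16_t, char16_t> toSurrogatePair(char32_t codePoint)
{
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);

    // Subtracting 0x10000 leaves a 20-bit value.
    const char32_t offset = codePoint - 0x10000;

    // The top 10 bits go into the high surrogate (U+D800..U+DBFF),
    // the bottom 10 bits into the low surrogate (U+DC00..U+DFFF).
    const auto high = static_cast<char16_t>(0xD800 + (offset >> 10));
    const auto low  = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
    return {high, low};
}
```

For U+1F4C9 this yields the pair 0xD83D, 0xDCC9, which is why a lone 0xD83D in a string is only half of a character.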

              king558 wrote (#6):

              Thank you for your hint. I corrected the code as follows, but matched is still false.

                  auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
                  QRegExp regexp1("\\xD83D\\xDCC9");
                  // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
                  auto f1 = regexp1.indexIn(someText);
                  auto f11 = regexp1.capturedTexts()[0];
                  QRegularExpression regexp2("\\xD83D\\xDCC9", QRegularExpression::CaseInsensitiveOption); // No complaints here
                  auto f2 = regexp2.match(someText);
                  auto matched = f2.hasMatch();
                  if ( matched ) {
                      auto f22 = f2.captured(0); // the pattern has no capture groups; index 0 is the whole match
                      auto i = 0;
                  }
              
                ChrisW67 wrote (#7):

                @king558 First you need a logically correct and valid regular expression.

                Your test string contains a single Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND. Internally this is built of two UTF-16 surrogates.

                Without braces, PCRE reads at most two hex digits after \x, so the RE \xD83D\xDCC9 matches the character 0xD8, char '3', char 'D', the character 0xDC, char 'C', char '9'. That does not match your string.

                The PCRE syntax for a hex code point with more than two digits is \x{hhhh}. This, however, is flagged as an invalid RE:

                QRegularExpression re("\\x{D83D}\\x{DCC9}", QRegularExpression::CaseInsensitiveOption);
                

                When matching Unicode characters you specify the Unicode code point, not the surrogates that it may be encoded in.

                #include <QCoreApplication>
                #include <QRegularExpression>
                #include <QDebug>
                
                int main(int argc, char **argv) {
                        QCoreApplication app(argc, argv);
                
                        auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
                        // More clearly:
                        // auto someText = QString( "\U0001f4c9" );
                
                        QRegularExpression re("\\x{1F4C9}", QRegularExpression::CaseInsensitiveOption); // No complaints here
                        auto f2 = re.match(someText);
                        auto matched = f2.hasMatch();
                        qDebug() << re;
                        qDebug() << f2;
                        qDebug() << matched;
                        return 0;
                }
                

                Output:

                QRegularExpression("\\x{1F4C9}", QRegularExpression::PatternOptions("CaseInsensitiveOption"))
                QRegularExpressionMatch(Valid, has match: 0:(0, 2, "📉"))
                true
                
                  king558 wrote (#8):

                  @ChrisW67 said in QRegExp does not support unicode 16 bit character?:

                  Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND

                  Thank you for your help. One last question: how do you convert 0xD83D 0xDCC9 to Unicode U+1F4C9?

                    ChrisW67 wrote (#9):

                    @king558 I use this little gem: r12a >> apps >> Unicode code converter. Paste in the text to convert (e.g. 📉) and out it pops, encoded in various ways and ready to paste elsewhere. Or provide the 16-bit words of a UTF-16 encoding and out pop the characters.

                    The Wikipedia page for UTF-16 describes how the surrogates encode the higher code points.
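
For reference, the conversion itself is short arithmetic. The following is a minimal sketch in plain C++ (no Qt), assuming the two inputs really are a high and a low surrogate; the function names are hypothetical and only for illustration. In Qt code, the static function `QChar::surrogateToUcs4()` should give the same result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of Qt) for classifying UTF-16 code units.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the code point it encodes:
//   codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
char32_t fromSurrogatePair(char16_t high, char16_t low)
{
    assert(isHighSurrogate(high) && isLowSurrogate(low));
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)  // top 10 bits
         + (static_cast<char32_t>(low) - 0xDC00);          // bottom 10 bits
}
```

With high = 0xD83D and low = 0xDCC9 this gives 0x10000 + (0x3D << 10) + 0xC9 = 0x1F4C9, the code point used in the working pattern \x{1F4C9}.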

                    king558 has marked this topic as solved.
