QRegExp does not support unicode 16 bit character?

    king558 wrote (#1):
    auto someText = QString( "%1" ).arg( QChar( 0x2029 ) );
    QRegExp regexp1("\\u2029");
    // QRegExp regexp1("\\x{2029}");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    QRegularExpression regexp2("\\x{2029}"); // No complaints here
    auto f2 = regexp2.match(someText).hasMatch();
    

    f1 is always -1, whether the pattern is \u2029 or \x{2029}, but QRegularExpression does work. Does that mean QRegExp does not support 16-bit characters?

      sierdzio (Moderators) wrote (#2):

      @king558 As the documentation clearly states, the syntax in QRegExp is without curly braces:

      \xhhhh Matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).

      https://doc.qt.io/qt-6/qregexp.html#characters-and-abbreviations-for-sets-of-characters

      Btw. if you can use QRegularExpression, use it. QRegExp is deprecated since Qt 5 and should not be used.

        king558 wrote (#3):

        Thanks for your tips.

        auto someText = QString( "%1" ).arg( QChar( 0xD83D ) );
        QRegExp regexp1("\\xD83D");
         // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
        auto f1 = regexp1.indexIn(someText);
        auto f11 = regexp1.capturedTexts()[0];
        QRegularExpression regexp2("\\x{D83D}", QRegularExpression::CaseInsensitiveOption); // No complaints here
        auto f2 = regexp2.match(someText);
        if (f2.hasMatch()) {
            auto f22 = f2.captured(0); // group 0 is the whole match; the pattern defines no group 1
        }
        

        I will use QRegularExpression as suggested. But something strange happens: when I change the code from 0x2029 to 0xD83D, QRegExp::indexIn returns 0, so it works, but QRegularExpression does not, and hasMatch returns false. Could you or anybody tell me what I did wrong here?

          king558 wrote (#4):

          I am using Qt 5.15.2 LTS and Qt 6.15 LTS

            ChrisW67 wrote (#5):

            @king558 U+D83D is a high surrogate (range U+D800 to U+DBFF), one half of a surrogate pair. It makes little sense without the following code unit, a low surrogate in the range U+DC00 to U+DFFF. The two surrogates together encode code points from U+10000 onward. You can use the PCRE \C escape in your pattern to handle 16-bit code units independently, with the potential for unusual behaviour.
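
To make the ranges above concrete, here is a minimal sketch in plain C++ (no Qt) that classifies a single 16-bit code unit. The names `Utf16Unit` and `classify` are made up for this illustration; they are not Qt or PCRE APIs. It shows why 0x2029 can be matched on its own while 0xD83D is only half of a character.

```cpp
#include <cstdint>

// Kind of a single UTF-16 code unit.
enum class Utf16Unit { Bmp, HighSurrogate, LowSurrogate };

// Hypothetical helper (not part of Qt): classify one 16-bit code unit
// using the ranges quoted in the post above.
constexpr Utf16Unit classify(std::uint16_t unit) {
    return (unit >= 0xD800 && unit <= 0xDBFF) ? Utf16Unit::HighSurrogate
         : (unit >= 0xDC00 && unit <= 0xDFFF) ? Utf16Unit::LowSurrogate
         : Utf16Unit::Bmp;
}

// Compile-time checks for the values used in this thread.
static_assert(classify(0x2029) == Utf16Unit::Bmp,
              "U+2029 is a complete character in one code unit");
static_assert(classify(0xD83D) == Utf16Unit::HighSurrogate,
              "0xD83D is only the first half of a pair");
static_assert(classify(0xDCC9) == Utf16Unit::LowSurrogate,
              "0xDCC9 is only the second half of a pair");
```

In Qt code, QChar::isHighSurrogate() and QChar::isLowSurrogate() should answer the same question for a QChar, so a hand-written helper like this is only needed outside Qt.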

              king558 wrote (#6):

              Thanks for your hint. I corrected the code as follows, but matched is still false.

                  auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
                  QRegExp regexp1("\\xD83D\\xDCC9");
                  // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
                  auto f1 = regexp1.indexIn(someText);
                  auto f11 = regexp1.capturedTexts()[0];
                  QRegularExpression regexp2("\\xD83D\\xDCC9", QRegularExpression::CaseInsensitiveOption); // No complaints here
                  auto f2 = regexp2.match(someText);
                  auto matched = f2.hasMatch();
                  if ( matched ) {
                      auto f22 = f2.captured(0); // group 0 is the whole match; the pattern defines no group 1
                      auto i = 0;
                  }
              
                ChrisW67 wrote (#7):

                @king558 First you need a logically correct and valid regular expression.

                Your test string contains a single Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND. Internally this is built of two UTF-16 surrogates.

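                This RE \xD83D\xDCC9 does not do what you expect. In PCRE, \x without braces takes at most two hex digits, so it matches character 0xD8, char '3', char 'D', character 0xDC, char 'C', char '9'. That does not match your string.

                The PCRE syntax for a hex code point with more than two digits is \x{hhhh}. This, however, is flagged as an invalid RE: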

                QRegularExpression re("\\x{D83D}\\x{DCC9}", QRegularExpression::CaseInsensitiveOption);
                

                When matching Unicode characters you specify the Unicode code point, not the surrogates that it may be encoded in.

                #include <QCoreApplication>
                #include <QRegularExpression>
                #include <QDebug>
                
                int main(int argc, char **argv) {
                        QCoreApplication app(argc, argv);
                
                        auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
                        // More clearly:
                        // auto someText = QString( "\U0001f4c9" );
                
                        QRegularExpression re("\\x{1F4C9}", QRegularExpression::CaseInsensitiveOption); // No complaints here
                        auto f2 = re.match(someText);
                        auto matched = f2.hasMatch();
                        qDebug() << re;
                        qDebug() << f2;
                        qDebug() << matched;
                        return 0;
                }
                

                Output:

                QRegularExpression("\\x{1F4C9}", QRegularExpression::PatternOptions("CaseInsensitiveOption"))
                QRegularExpressionMatch(Valid, has match: 0:(0, 2, "📉"))
                true
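
The `(0, 2, ...)` in the match output says the match spans two UTF-16 code units, even though the pattern names a single code point. A rough sketch of that relationship in plain C++ follows. `SurrogatePair` and `encodeSurrogatePair` are hypothetical names for this example, and the function assumes its input lies in the range U+10000 to U+10FFFF.

```cpp
#include <cstdint>

// The two 16-bit code units that UTF-16 uses for one supplementary code point.
struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Hypothetical helper (not part of Qt): split a code point in the range
// U+10000..U+10FFFF into its high and low surrogate.
// Assumption: the caller has already checked the range.
constexpr SurrogatePair encodeSurrogatePair(std::uint32_t codePoint) {
    return SurrogatePair{
        // Top 10 bits of (codePoint - 0x10000) go into the high surrogate.
        static_cast<std::uint16_t>(0xD800u + ((codePoint - 0x10000u) >> 10)),
        // Bottom 10 bits go into the low surrogate.
        static_cast<std::uint16_t>(0xDC00u + ((codePoint - 0x10000u) & 0x3FFu))};
}

// U+1F4C9 is stored as the two code units used to build the test string.
static_assert(encodeSurrogatePair(0x1F4C9).high == 0xD83D, "high surrogate");
static_assert(encodeSurrogatePair(0x1F4C9).low == 0xDCC9, "low surrogate");
```

So the pattern is written in terms of the code point (`\x{1F4C9}`), while the subject string stores the pair 0xD83D 0xDCC9. QChar::highSurrogate() and QChar::lowSurrogate() appear to provide the same split inside Qt.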
                
                  king558 wrote (#8):

                  @ChrisW67 said in QRegExp does not support unicode 16 bit character?:

                  Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND

                  Thanks for your help. One last question: how do you convert 0xD83D 0xDCC9 to Unicode U+1F4C9?

                    ChrisW67 wrote (#9):

                    @king558 I use this little gem: r12a >> apps >> Unicode code converter. Paste in the text to convert (e.g. 📉) and out it pops, encoded in various ways and ready to paste elsewhere. Or provide the 16-bit words of a UTF-16 encoding and out pop the characters.

                    The Wikipedia page for UTF-16 describes how the surrogates encode the higher code points.
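
For anyone who prefers to do the conversion by hand, the arithmetic is short. The following is a minimal sketch in plain C++, not a Qt API; `decodeSurrogatePair` is a name invented for this example, and it assumes the caller passes a valid high surrogate followed by a valid low surrogate.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Qt): combine a high surrogate
// (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF) into one code point.
// Assumption: both arguments are already known to be in those ranges.
constexpr std::uint32_t decodeSurrogatePair(std::uint16_t high, std::uint16_t low) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)  // upper 10 bits
         + (static_cast<std::uint32_t>(low) - 0xDC00u);          // lower 10 bits
}

// 0xD83D - 0xD800 = 0x3D, shifted left by 10 gives 0xF400.
// 0xDCC9 - 0xDC00 = 0xC9.
// 0x10000 + 0xF400 + 0xC9 = 0x1F4C9.
static_assert(decodeSurrogatePair(0xD83D, 0xDCC9) == 0x1F4C9,
              "the pair from this thread decodes to U+1F4C9");

// A different low surrogate gives a different code point.
static_assert(decodeSurrogatePair(0xD83D, 0xDCD9) == 0x1F4D9,
              "0xD83D 0xDCD9 decodes to U+1F4D9");
```

Within Qt, the static function QChar::surrogateToUcs4(high, low) should perform the same calculation, so the helper above is mainly useful for checking values by hand.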

                    king558 has marked this topic as solved.
