Qt Forum

    Solved Unicode handling by QRegularExpression

    • P
      panosk last edited by

      I migrated a few regular expressions used in my application from QRegExp to QRegularExpression. After all was done, I was getting the non-fatal message

        QRegularExpressionPrivate::doMatch(): called on an invalid QRegularExpression object
      

      and after a little debugging I found that it happens when the regular expression's setPattern() loads the regexp ("\u2029"). The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above. How can I fix this?

      Thanks in advance.

      • Paul Colby
        Paul Colby last edited by

        Hi @panosk,

        The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above.

        How are you loading the expressions from file, and then assigning them to QRegularExpression?

        • SGaist
          SGaist Lifetime Qt Champion last edited by

          Hi,

          Can you post a small code sample that triggers that ?

          Interested in AI ? www.idiap.ch
          Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

          • P
            panosk @Paul Colby last edited by

            @Paul-Colby said:

            Hi @panosk,

            The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above.

            How are you loading the expressions from file, and then assigning them to QRegularExpression?

            The code is correct and has been working fine for quite some time; there's nothing wrong with the way the regexps are assigned. However, I was examining the wrong file: the correct file doesn't contain as many ICU regexps as I thought, only two, so the problem is trivial because I can simply remove them. It would still be nice to know why this happens.

            @SGaist said:

            Hi,

            Can you post a small code sample that triggers that ?

            No special code is needed. You can try this and see the message:

            QRegExp regexp("\\u2029"); // No complaints here
            if (regexp.indexIn(someText) > -1)
                qDebug() << "Match";


            QRegularExpression regexp("\\u2029"); // It doesn't like this and the message appears
            QRegularExpressionMatch match = regexp.match(someText);
            if (match.hasMatch())
                qDebug() << "Match";
            
            • P
              panosk last edited by panosk

              So, I think I found the problem. I created a file and pasted the \u2029 character from the character selector utility. Neither QRegExp nor QRegularExpression finds it if I use double backslashes, but only QRegularExpression warns about the problem. That is,

              QRegularExpression regexp("\\u2029")
              

              doesn't work, while

              QRegularExpression regexp("\u2029")
              

              finds the match.

              The problem is that when I retrieve the (correctly, I think) formatted regexp string "\u2029" from my file that contains the regexps, the backslash is escaped automatically, hence the problem. Maybe QRegExp and QRegularExpression should recognize such cases and not escape them, or maybe I'm missing something :-).

              • SGaist
                SGaist Lifetime Qt Champion last edited by

                You did. You have a sequence that represents a unicode char in your file. So the string resulting from the load will have that backslash escaped to match the content of the file. If you want to load that char from a file, you must write it as is in that file in the first place.
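
To make the distinction concrete, here is a minimal standard C++ sketch (no Qt involved) of the two strings being discussed. The function names are made up for illustration, and the byte count in the first comment assumes the compiler encodes narrow string literals as UTF-8, which is the default for GCC and Clang.

```cpp
#include <string>

// Hypothetical helper: the compiler replaces \u2029 with the character
// U+2029 itself while compiling the literal. With UTF-8 narrow literals
// (an assumption here) the string holds the three bytes E2 80 A9.
std::string literalChar() { return "\u2029"; }

// Hypothetical helper: six plain ASCII characters (backslash, 'u', '2',
// '0', '2', '9'). This is the text a program ends up with when it reads
// \u2029 from a file, and it is what the regex engine would have to parse.
std::string escapeText() { return "\\u2029"; }
```

Only the second string contains a backslash for the regex engine to interpret; in the first one the translation already happened at compile time, so reading the six-character text from a file can never produce the same result on its own.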

                • P
                  panosk @SGaist last edited by

                  @SGaist said:

                  You did. You have a sequence that represents a unicode char in your file. So the string resulting from the load will have that backslash escaped to match the content of the file. If you want to load that char from a file, you must write it as is in that file in the first place.

                  I'm not so sure... The unicode representation is used in a regular expression in a file, for example [\u00A0\s], and I want to load that regular expression from the file into QRegularExpression at runtime. Currently, it seems there's no way to do that. It seems QRegularExpression recognizes the unicode sequence when you write it directly in the constructor or in the setPattern() call, but it doesn't recognize it when it is loaded from a file, and it wrongly escapes it.

                  • kshegunov
                    kshegunov Moderators @panosk last edited by kshegunov

                    @panosk

                    Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.

                    From here: http://www.regular-expressions.info/unicode.html

                    QRegularExpression is PCRE based, so try specifying the code point correctly in your file/testing string. For example, try like this:

                    QRegularExpression regexp("\\x{2029}"); // It should like this just fine
                    

                    PS.
                    This

                    QRegularExpression regexp("\u2029")
                    

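                    Works, because the compiler replaces \u2029 in the string literal with the character U+2029 itself (encoded as a sequence of bytes), so the engine receives that literal character rather than an escape sequence. Assuming the literal is UTF-8 encoded, it would be equivalent to:

                    const char rx[] = {'\xE2', '\x80', '\xA9', '\0'}; // U+2029 in UTF-8, null-terminated
                    QRegularExpression regexp(rx);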
                    

                    Kind regards.

                    Read and abide by the Qt Code of Conduct

                    • P
                      panosk @kshegunov last edited by

                      @kshegunov said:

                      QRegularExpression is PCRE based

                      Thank you very much for the clear explanation. So, it seems this is the issue. Apart from these unicode peculiarities, I don't think there are other major differences between the PCRE and the ICU standards so I can modify the few instances of these unicode representations to the appropriate format.
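
If editing the file by hand is not practical, the same rewrite could be done on each pattern after it is loaded. The following is only a sketch under stated assumptions: `icuToPcreUnicodeEscapes` is a hypothetical helper (it is not part of Qt, ICU, or PCRE), it handles only the four-digit `\uXXXX` form, and it leaves every other escape, including ICU's eight-digit `\UXXXXXXXX`, untouched.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites ICU-style \uXXXX escapes in a pattern
// string as the PCRE form \x{XXXX}. All other text is copied unchanged.
std::string icuToPcreUnicodeEscapes(const std::string &pattern) {
    std::string out;
    out.reserve(pattern.size() + 8);
    std::size_t i = 0;
    while (i < pattern.size()) {
        // Ordinary character, or a lone backslash at the very end.
        if (pattern[i] != '\\' || i + 1 >= pattern.size()) {
            out += pattern[i++];
            continue;
        }
        // Backslash plus 'u' plus exactly four hex digits?
        bool isUnicodeEscape = pattern[i + 1] == 'u' && i + 5 < pattern.size();
        for (std::size_t j = 2; isUnicodeEscape && j <= 5; ++j) {
            isUnicodeEscape =
                std::isxdigit(static_cast<unsigned char>(pattern[i + j])) != 0;
        }
        if (isUnicodeEscape) {
            out += "\\x{";
            out.append(pattern, i + 2, 4);  // the four hex digits
            out += '}';
            i += 6;
        } else {
            // Copy the backslash and the character it escapes as a pair,
            // so an escaped backslash followed by "u2029" is not rewritten.
            out += pattern[i];
            out += pattern[i + 1];
            i += 2;
        }
    }
    return out;
}
```

With this sketch, the text `[\u00A0\s]` read from the file becomes `[\x{00A0}\s]`. In a Qt program the result could then be converted with `QString::fromStdString()` and handed to `setPattern()`.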

                      • kshegunov
                        kshegunov Moderators @panosk last edited by

                        @panosk
                        I suggest trying it out in code first, and if everything goes smoothly, then yes, you can replace the code points in your file.

                        Good luck!

