Unicode handling by QRegularExpression



  • I migrated a few regular expressions used in my application from QRegExp to QRegularExpression. After all was done, I was getting the non-fatal message

      QRegularExpressionPrivate::doMatch(): called on an invalid QRegularExpression object
    

    and after a little debugging I found that it happens when the regular expression's setPattern() loads the regexp ("\u2029"). The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above. How can I fix this?

    Thanks in advance.



  • Hi @panosk,

    The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above.

    How are you loading the expressions from file, and then assigning them to QRegularExpression?


  • Lifetime Qt Champion

    Hi,

    Can you post a small code sample that triggers that?



  • @Paul-Colby said:

    Hi @panosk,

    The regexps are loaded at runtime from a file which contains many other ICU regexps like the one above.

    How are you loading the expressions from file, and then assigning them to QRegularExpression?

    The code is correct and has been working fine for quite some time; there's nothing wrong with the way the regexps are assigned. However, I was examining the wrong file. The correct file doesn't have as many ICU regexps as I thought, only 2, so the problem is trivial as I can remove these regexps. It would be nice, though, to know why this happens.

    @SGaist said:

    Hi,

    Can you post a small code sample that triggers that?

    No special code is needed. You can try this and see the message:

    QRegExp regexp("\\u2029"); // No complaints here
    if (regexp.indexIn(someText) > -1)
        qDebug() << "Match";
    
    
    QRegularExpression regexp("\\u2029"); // It doesn't like this and the message appears
    QRegularExpressionMatch match = regexp.match(someText);
    if (match.hasMatch())
        qDebug() << "Match";


  • So, I think I found the problem. I created a file and pasted the \u2029 character from the character selector utility. Neither QRegExp nor QRegularExpression finds it if I use a double backslash, but only QRegularExpression warns about the problem. That is,

    QRegularExpression regexp("\\u2029")
    

    doesn't work, while

    QRegularExpression regexp("\u2029")
    

    finds the match.

    The problem is that when I retrieve the (correctly, I think) formatted regexp string "\u2029" from my file that contains the regexps, the backslash is escaped automatically and hence the problem. Maybe QRegExp and QRegularExpression should recognize such cases and not escape them, or maybe I'm missing something :-).
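The difference between the two spellings can be checked without Qt. The following is a minimal sketch that uses std::u16string as a stand-in for QString (both hold UTF-16 code units); the function names escapedForm and literalForm are made up for illustration.

```cpp
#include <string>

// "\\u2029" in source: the compiler emits a backslash followed by the
// characters u, 2, 0, 2, 9. The regex engine receives an escape sequence
// and has to interpret it itself.
inline std::u16string escapedForm() { return u"\\u2029"; }

// "\u2029" in source: the compiler replaces the universal character name
// with the single code point U+2029 (PARAGRAPH SEPARATOR). The regex engine
// receives the character itself and never sees a backslash.
inline std::u16string literalForm() { return u"\u2029"; }
```

So the first form hands the engine six characters to parse as regex syntax, while the second hands it one literal character to match.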


  • Lifetime Qt Champion

    You did. You have a sequence that represents a Unicode char in your file. So the string resulting from the load will have that backslash escaped to match the content of the file. If you want to load that char from a file, you must write it as is in that file in the first place.



  • @SGaist said:

    You did. You have a sequence that represents a Unicode char in your file. So the string resulting from the load will have that backslash escaped to match the content of the file. If you want to load that char from a file, you must write it as is in that file in the first place.

    I'm not so sure... The Unicode representation is used in a regular expression in a file, for example [\u00A0\s], and I want to load that regular expression from the file into QRegularExpression at runtime. Currently, it seems there's no way to do that. It seems QRegularExpression recognizes the Unicode sequence when you write it directly in the constructor or in the setPattern() function, but it doesn't recognize it when it loads it from a file and it wrongly escapes it.
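As a rough illustration of what loading does, the sketch below reads a pattern line from a stream that stands in for the regexp file (readPattern is a hypothetical helper, not a Qt function). Nothing is escaped during the read: the text arrives character for character, and the doubled backslash only appears when the same text is written as a C++ string literal or shown in quoted debug output.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads the first line of a stream, standing in for one pattern line
// of the regexp file. The characters are returned exactly as stored.
inline std::string readPattern(std::istream& in) {
    std::string line;
    std::getline(in, line);
    return line;
}
```

A file line containing [\u00A0\s] therefore yields the same ten characters as the source literal "[\\u00A0\\s]", which is the form the regex engine has to understand on its own.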


  • Qt Champions 2016

    @panosk

    Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.

    From here: http://www.regular-expressions.info/unicode.html

    QRegularExpression is PCRE based, so try specifying the code point correctly in your file/testing string. For example, try like this:

    QRegularExpression regexp("\\x{2029}"); // It should like this just fine
    

    PS.
    This

    QRegularExpression regexp("\u2029")
    

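    works because the compiler replaces \u2029 in the string literal with the actual U+2029 character, so the engine receives the character itself rather than an escape sequence. Assuming the literal is encoded as UTF-8, it would be equivalent to:

    const char rx[] = "\xE2\x80\xA9"; // UTF-8 bytes of U+2029
    QRegularExpression regexp(QString::fromUtf8(rx));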
    

    Kind regards.



  • @kshegunov said:

    QRegularExpression is PCRE based

    Thank you very much for the clear explanation. So, it seems this is the issue. Apart from these Unicode peculiarities, I don't think there are other major differences between the PCRE and ICU regex flavors that affect my patterns, so I can change the few instances of these Unicode representations to the appropriate format.
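If editing the file by hand is not convenient, the rewrite from \uXXXX to \x{XXXX} could also be done on each pattern after loading. The following is only a sketch under stated assumptions: icuToPcreEscapes is a hypothetical helper, it works on std::string rather than QString, and it handles only four-digit \uXXXX escapes (no surrogate pairs, no \Q...\E quoting).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Rewrites ICU/Java-style \uXXXX escapes as PCRE-style \x{XXXX}.
// Every other backslash sequence is copied through unchanged.
inline std::string icuToPcreEscapes(const std::string& pattern) {
    std::string out;
    out.reserve(pattern.size() + 8);
    std::size_t i = 0;
    while (i < pattern.size()) {
        // Ordinary character, or a lone backslash at the very end.
        if (pattern[i] != '\\' || i + 1 >= pattern.size()) {
            out += pattern[i++];
            continue;
        }
        // Backslash followed by 'u' and exactly four hex digits?
        bool isUnicodeEscape = pattern[i + 1] == 'u' && i + 5 < pattern.size();
        for (std::size_t k = 2; isUnicodeEscape && k < 6; ++k) {
            isUnicodeEscape =
                std::isxdigit(static_cast<unsigned char>(pattern[i + k])) != 0;
        }
        if (isUnicodeEscape) {
            out += "\\x{";
            out.append(pattern, i + 2, 4); // the four hex digits
            out += '}';
            i += 6;
        } else {
            // Copy the backslash and the following character as a pair, so an
            // escaped backslash followed by "u2029" is not rewritten.
            out.append(pattern, i, 2);
            i += 2;
        }
    }
    return out;
}
```

With this, a loaded pattern such as [\u00A0\s] would become [\x{00A0}\s] before being passed to setPattern(); patterns without \u escapes pass through untouched.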


  • Qt Champions 2016

    @panosk
    I suggest trying it out in code first, and if everything goes smoothly, then yes, you can replace the code points in your file.

    Good luck!

