QRegExp does not support unicode 16 bit character?

king558

auto someText = QString( "%1" ).arg( QChar( 0x2029 ) );
QRegExp regexp1("\\u2029");
// QRegExp regexp1("\\x{2029}");
// regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
auto f1 = regexp1.indexIn(someText);
QRegularExpression regexp2("\\x{2029}"); // No complaints here
auto f2 = regexp2.match(someText).hasMatch();

f1 is always -1, whether the pattern is \u2029 or \x{2029}, but QRegularExpression does work. Does that mean QRegExp does not support 16-bit characters?

sierdzio

@king558 As the documentation clearly states, the syntax in QRegExp is without curly braces:

\xhhhh Matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).

https://doc.qt.io/qt-6/qregexp.html#characters-and-abbreviations-for-sets-of-characters

Btw. if you can use QRegularExpression, use it. QRegExp is deprecated since Qt 5 and should not be used.

king558

Thanks for your tips.

auto someText = QString( "%1" ).arg( QChar( 0xD83D ) );
QRegExp regexp1("\\xD83D");
 // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
auto f1 = regexp1.indexIn(someText);
auto f11 = regexp1.capturedTexts()[0];
QRegularExpression regexp2("\\x{D83D}", QRegularExpression::CaseInsensitiveOption); // No complaints here
auto f2 = regexp2.match(someText);
if (f2.hasMatch()) {
    auto f22 = f2.captured(1);
}

I will use QRegularExpression as suggested. But something strange happens: when I change the code from 0x2029 to 0xD83D, QRegExp::indexIn returns 0, so it works, but QRegularExpression's hasMatch returns false. Could anybody tell me what I did wrong here?

king558

I am using Qt 5.15.2 LTS and Qt 6.15 LTS

ChrisW67

@king558 U+D83D is a high surrogate (range U+D800 to U+DBFF), i.e. one half of a surrogate pair. It makes little sense without the following code unit (a low surrogate in the range U+DC00 to U+DFFF). The two surrogates together encode the Unicode code points from U+10000 onward. You can use the PCRE \C escape in your pattern to handle 16-bit code units independently, with the potential for unusual behaviour.
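
To make the surrogate arithmetic concrete, here is a minimal sketch in plain C++ (no Qt required) of how a code point above U+FFFF is split into its two UTF-16 surrogates. The names `SurrogatePair` and `toSurrogatePair` are hypothetical helpers made up for this illustration, not part of Qt or PCRE, and U+1F600 is just a sample code point whose high surrogate happens to be 0xD83D.

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch (not a Qt type).
struct SurrogatePair {
    std::uint16_t high; // expected range 0xD800..0xDBFF
    std::uint16_t low;  // expected range 0xDC00..0xDFFF
};

// Split a code point in the range U+10000..U+10FFFF into UTF-16 surrogates.
// Assumption: the caller has already checked the range of codePoint.
constexpr SurrogatePair toSurrogatePair(std::uint32_t codePoint)
{
    // Subtracting 0x10000 leaves a 20-bit value: the top 10 bits go into
    // the high surrogate, the bottom 10 bits into the low surrogate.
    return SurrogatePair{
        static_cast<std::uint16_t>(0xD800u + ((codePoint - 0x10000u) >> 10)),
        static_cast<std::uint16_t>(0xDC00u + ((codePoint - 0x10000u) & 0x3FFu))
    };
}

// U+1F600 is stored in UTF-16 as the two code units 0xD83D 0xDE00.
static_assert(toSurrogatePair(0x1F600).high == 0xD83D, "high surrogate");
static_assert(toSurrogatePair(0x1F600).low == 0xDE00, "low surrogate");
```

A lone 0xD83D therefore carries only the upper half of such a value, which is why it does not identify a character by itself.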

king558

Thank you for your hint. I corrected the code as follows, but matched is still false.

    auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
    QRegExp regexp1("\\xD83D\\xDCC9");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    auto f11 = regexp1.capturedTexts()[0];
    QRegularExpression regexp2("\\xD83D\\xDCC9", QRegularExpression::CaseInsensitiveOption); // No complaints here
    auto f2 = regexp2.match(someText);
    auto matched = f2.hasMatch();
    if ( matched ) {
        auto f22 = f2.captured(1);
        auto i = 0;
    }

ChrisW67

@king558 First you need a logically correct and valid regular expression.

Your test string contains a single Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND. Internally this is built of two UTF-16 surrogates.

This RE, \xD83D\xDCC9, is parsed as \xD8 (character U+00D8), char '3', char 'D', \xDC (character U+00DC), char 'C', char '9', because without braces \x consumes at most two hex digits. That does not match your string.

The pcre syntax for a multi-byte hex code point is \x{hhhh}. This, however, is flagged as an invalid RE:

QRegularExpression re("\\x{D83D}\\x{DCC9}", QRegularExpression::CaseInsensitiveOption);

When matching Unicode characters you specify the Unicode code point, not the surrogates that it may be encoded in.

#include <QCoreApplication>
#include <QRegularExpression>
#include <QDebug>

int main(int argc, char **argv) {
        QCoreApplication app(argc, argv);

        auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
        // More clearly:
        // auto someText = QString( "\U0001f4c9" );

        QRegularExpression re("\\x{1F4C9}", QRegularExpression::CaseInsensitiveOption); // No complaints here
        auto f2 = re.match(someText);
        auto matched = f2.hasMatch();
        qDebug() << re;
        qDebug() << f2;
        qDebug() << matched;
        return 0;
}

Output:

QRegularExpression("\\x{1F4C9}", QRegularExpression::PatternOptions("CaseInsensitiveOption"))
QRegularExpressionMatch(Valid, has match: 0:(0, 2, "📉"))
true

king558

@ChrisW67 said in QRegExp does not support unicode 16 bit character?:

Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND

Thank you for your help. One last question: how do you convert 0xD83D 0xDCC9 to Unicode U+1F4C9?

ChrisW67

@king558 I use this little gem: the Unicode code converter at r12a >> apps. Paste in the text to convert (e.g. 📉) and out it pops, encoded in various ways ready to paste elsewhere. Or provide the 16-bit words of a UTF-16 encoding and out pop the characters.

The Wikipedia page for UTF-16 describes how the surrogates encode the higher code points.
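
The conversion can also be done by hand. Below is a minimal sketch in plain C++ of the decoding direction, turning a surrogate pair back into a code point. The function names are hypothetical helpers written for this illustration, and returning 0 for an invalid pair is a choice made for the sketch only. In Qt itself, QChar::surrogateToUcs4() and QString::toUcs4() should give the same result.

```cpp
#include <cstdint>

// Hypothetical helpers for this sketch (not Qt APIs).
constexpr bool isHighSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

constexpr bool isLowSurrogate(std::uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a high and a low surrogate into the code point they encode.
// Returns 0 when the two units do not form a valid pair (sketch convention).
constexpr std::uint32_t decodeSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    return (isHighSurrogate(high) && isLowSurrogate(low))
        ? 0x10000u
              + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10) // upper 10 bits
              + (static_cast<std::uint32_t>(low) - 0xDC00u)          // lower 10 bits
        : 0u;
}

// 0xD83D 0xDCC9 is U+1F4C9 CHART WITH DOWNWARDS TREND.
static_assert(decodeSurrogatePair(0xD83D, 0xDCC9) == 0x1F4C9, "decodes to U+1F4C9");
```

The resulting value is what goes inside the braces of the pattern, i.e. \x{1F4C9}.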