QRegExp does not support unicode 16 bit character?
-
    auto someText = QString( "%1" ).arg( QChar( 0x2029 ) );
    QRegExp regexp1("\\u2029");
    // QRegExp regexp1("\\x{2029}");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    QRegularExpression regexp2("\\x{2029}"); // No complaints here
    auto f2 = regexp2.match(someText).hasMatch();
f1 is always -1, whether the pattern is \u2029 or \x{2029}, but QRegularExpression does work. Does that mean QRegExp does not support 16-bit characters?
-
@king558 As the documentation clearly states, the syntax in QRegExp is without curly braces:
\xhhhh Matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).
https://doc.qt.io/qt-6/qregexp.html#characters-and-abbreviations-for-sets-of-characters
Btw, if you can use QRegularExpression, use it. QRegExp has been deprecated since Qt 5 and should not be used.
-
Thank you for your tips.
    auto someText = QString( "%1" ).arg( QChar( 0xD83D ) );
    QRegExp regexp1("\\xD83D");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    auto f11 = regexp1.capturedTexts()[0];
    QRegularExpression regexp2("\\x{D83D}", QRegularExpression::CaseInsensitiveOption); // No complaints here
    auto f2 = regexp2.match(someText);
    if (f2.hasMatch()) {
        auto f22 = f2.captured(1);
    }
I will use QRegularExpression as suggested. But something strange happens: when I change the code from 0x2029 to 0xD83D, QRegExp's indexIn returns 0, so it works, but QRegularExpression's hasMatch returns false. Could you or anybody else help me see what I did wrong here?
I am using Qt 5.15.2 LTS and Qt 6.15 LTS
-
@king558 The code point U+D83D (in the range U+D800 to U+DBFF) is the first half of a surrogate pair, the high surrogate. It makes little sense without the following data unit, the low surrogate (in the range U+DC00 to U+DFFF). The two surrogates together encode Unicode code points from U+10000 onward. You can use the PCRE \C in your pattern to handle 16-bit data units independently, with the potential for unusual behaviour.
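To make the ranges above concrete, here is a minimal sketch in plain C++ (no Qt) that classifies a single 16-bit unit. The function names are made up for illustration; only the numeric ranges come from the post above.

```cpp
#include <cstdint>

// Hypothetical helpers, named for this example only.
// High (leading) surrogates occupy U+D800..U+DBFF.
constexpr bool isHighSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
constexpr bool isLowSurrogate(std::uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// A 16-bit unit is a complete character on its own only if it is
// neither kind of surrogate.
constexpr bool isStandaloneUnit(std::uint16_t unit)
{
    return !isHighSurrogate(unit) && !isLowSurrogate(unit);
}
```

With these, 0x2029 is a standalone unit, while 0xD83D is a high surrogate that needs a low surrogate after it. In Qt code, QChar::isHighSurrogate() and QChar::isLowSurrogate() perform the same checks.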
-
Thank you for your hint. I corrected the code as follows, but matched is still false.
    auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
    QRegExp regexp1("\\xD83D\\xDCC9");
    // regexp1.setPatternSyntax( QRegExp::W3CXmlSchema11 );// No complaints here
    auto f1 = regexp1.indexIn(someText);
    auto f11 = regexp1.capturedTexts()[0];
    QRegularExpression regexp2("\\xD83D\\xDCC9", QRegularExpression::CaseInsensitiveOption); // No complaints here
    auto f2 = regexp2.match(someText);
    auto matched = f2.hasMatch();
    if ( matched ) {
        auto f22 = f2.captured(1);
        auto i = 0;
    }
-
@king558 First you need a logically correct and valid regular expression.
Your test string contains a single Unicode code point, U+1F4C9 CHART WITH DOWNWARDS TREND. Internally it is stored as two UTF-16 surrogates.
This RE
    \xD83D\xDCC9
matches the character 0xD8, char '3', char 'D', the character 0xDC, char 'C', char '9', because \x without braces takes at most two hex digits. That does not match your string.
The PCRE syntax for a code point with more than two hex digits is
    \x{hhhh}
This, however, is flagged as an invalid RE:
    QRegularExpression re("\\x{D83D}\\x{DCC9}", QRegularExpression::CaseInsensitiveOption);
When matching Unicode characters you specify the Unicode code point, not the surrogates that it may be encoded in.
    #include <QCoreApplication>
    #include <QRegularExpression>
    #include <QDebug>

    int main(int argc, char **argv)
    {
        QCoreApplication app(argc, argv);

        auto someText = QString( "%1%2" ).arg( QChar( 0xD83D ) ).arg( QChar( 0xDCC9 ) );
        // More clearly:
        // auto someText = QString( "\U0001f4c9" );

        QRegularExpression re("\\x{1F4C9}", QRegularExpression::CaseInsensitiveOption); // No complaints here
        auto f2 = re.match(someText);
        auto matched = f2.hasMatch();

        qDebug() << re;
        qDebug() << f2;
        qDebug() << matched;
        return 0;
    }
Output:
    QRegularExpression("\\x{1F4C9}", QRegularExpression::PatternOptions("CaseInsensitiveOption"))
    QRegularExpressionMatch(Valid, has match: 0:(0, 2, "📉"))
    true
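The match above spans two 16-bit units (offsets 0 to 2) because U+1F4C9 lies above U+FFFF and is stored as a surrogate pair. As a rough sketch of that encoding step in plain C++ (the helper name toSurrogates is hypothetical, and it assumes the caller passes a code point in the range U+10000 to U+10FFFF):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a supplementary code point into its UTF-16
// high and low surrogate. Assumes 0x10000 <= codePoint <= 0x10FFFF;
// no range checking is done in this sketch.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(std::uint32_t codePoint)
{
    // Subtracting 0x10000 leaves a 20-bit value.
    const std::uint32_t offset = codePoint - 0x10000;
    // The top 10 bits go into the high surrogate ...
    const auto high = static_cast<std::uint16_t>(0xD800 + (offset >> 10));
    // ... and the bottom 10 bits into the low surrogate.
    const auto low = static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));
    return {high, low};
}
```

Calling toSurrogates(0x1F4C9) yields the pair 0xD83D, 0xDCC9, which are exactly the two QChar values used to build someText. The pattern still has to name the code point, not those two units.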
-
@ChrisW67 said in QRegExp does not support unicode 16 bit character?:
Unicode code point U+1F4C9 CHART WITH DOWNWARDS TREND
Thank you for your help. One last question: how do you convert 0xD83D 0xDCC9 to the Unicode code point U+1F4C9?
-
@king558 I use this little gem: r12a >> apps >> Unicode code converter. Paste in the text to convert (e.g. 📉) and out it pops, encoded in various ways and ready to paste in elsewhere. Or provide the 16-bit words of a UTF-16 encoding and out pop the characters.
The Wikipedia page for UTF-16 describes how the surrogates encode the higher code points.
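For anyone who prefers to do the conversion by hand, the arithmetic is short. The following is a minimal sketch in plain C++, not taken from the converter tool: the function name fromSurrogates is hypothetical, and it assumes the inputs really are a valid high/low surrogate pair.

```cpp
#include <cstdint>

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF;
// this sketch does not validate its inputs.
std::uint32_t fromSurrogates(std::uint16_t high, std::uint16_t low)
{
    // Each surrogate carries 10 bits of payload: the high surrogate holds
    // the upper 10 bits, the low surrogate the lower 10 bits. The result
    // is offset by 0x10000 because pairs only encode U+10000 and above.
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

Worked through for this thread's string: 0xD83D - 0xD800 = 0x3D, shifted left by 10 bits gives 0xF400; 0xDCC9 - 0xDC00 = 0xC9; 0x10000 + 0xF400 + 0xC9 = 0x1F4C9, the value to put in \x{1F4C9}.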
-