QRegularExpression and Arabic bare word search
-
I'm having a problem doing a bare word match for Arabic text. In the example below, the English match works ok, but the Arabic
one does not. Am I doing something wrong?
QString str("but only the inf. n., namely دَقَعَ , of the verb in this sense");
QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
Returns
true "\\bverb\\b" true
true "\\bدَقَعَ\\b" false
If I create a file with the same contents as 'str' and use grep -e, it matches fine:
$ grep -e '\bدَقَعَ\b' test.txt
but only the inf. n., namely دَقَعَ , of the verb in this sense
-
@GraemeA It may be a bug in the engine. This is interesting:
QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
QRegularExpression rx3(QString("\\b%1").arg("دَقَعَ"));
QRegularExpression rx4(QString("%1\\b").arg("دَقَعَ"));
QRegularExpression rx5(QString("%1").arg("دَقَعَ"));
QRegularExpression rx6(QString("\\b%1\\b").arg("ع"));
rx1.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
rx2.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
rx3.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
rx4.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
rx5.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
rx6.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
qDebug() << rx3.isValid() << rx3.pattern() << rx3.match(str).hasMatch();
qDebug() << rx4.isValid() << rx4.pattern() << rx4.match(str).hasMatch();
qDebug() << rx5.isValid() << rx5.pattern() << rx5.match(str).hasMatch();
qDebug() << rx6.isValid() << rx6.pattern() << rx6.match(str).hasMatch();
true "\bverb\b" true
true "\b??????\b" false
true "\b??????" true
true "??????\b" false
true "??????" true
true "\b?\b" true
On the other hand, Arabic text isn't WYSIWYG. Just try walking through the word with the arrow keys and deleting a letter with the Backspace or Delete key...
-
@Eeli-K you're right, your examples are interesting.
I think the problem is with the trailing fatha, which is why the leading \b works but the trailing one does not.
// this matches
str = " دقَع ";
rx1.setPattern(QString("\\b%1\\b").arg("دقَع"));
qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
// this does not
str = " دقَعَ ";
rx1.setPattern(QString("\\b%1\\b").arg("دقَعَ"));
qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
Gives
true "\\bدقَع\\b" true
true "\\bدقَعَ\\b" false
I tried a few other trailing characters (damma, kasra, sukun, etc.) and they all failed.
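One way to see why a trailing mark would defeat \b is the hand-rolled model below. It is only a sketch in standard C++, not QRegularExpression itself. It assumes that \w covers letters, digits and underscore but not combining marks: the Qt documentation describes \w under UseUnicodePropertiesOption as Unicode L or N plus underscore, and fatha (U+064E) is a nonspacing mark. The names isArabicMark, isWordChar, isBoundary and containsWholeWord, and the simplified code point ranges, are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Arabic combining marks (fatha U+064E, damma U+064F, kasra U+0650,
// sukun U+0652, ...). Simplified range: an assumption of this sketch.
bool isArabicMark(char32_t c) {
    return (c >= 0x064B && c <= 0x065F) || c == 0x0670;
}

// Rough stand-in for \w: ASCII alphanumerics, underscore, Arabic letters
// and Arabic-Indic digits. Marks count only when marksAreWordChars is true.
bool isWordChar(char32_t c, bool marksAreWordChars) {
    if ((c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
        (c >= U'a' && c <= U'z') || c == U'_')
        return true;
    if (c >= 0x0620 && c <= 0x064A) return true;  // Arabic letters
    if (c >= 0x0660 && c <= 0x0669) return true;  // Arabic-Indic digits
    return marksAreWordChars && isArabicMark(c);
}

// Models \b at index pos: true when exactly one neighbour is a word character.
bool isBoundary(const std::u32string& text, std::size_t pos,
                bool marksAreWordChars) {
    const bool before = pos > 0 && isWordChar(text[pos - 1], marksAreWordChars);
    const bool after =
        pos < text.size() && isWordChar(text[pos], marksAreWordChars);
    return before != after;
}

// Models \bword\b: looks for an occurrence of word with a boundary on both sides.
bool containsWholeWord(const std::u32string& text, const std::u32string& word,
                       bool marksAreWordChars) {
    if (word.empty()) return false;
    for (std::size_t pos = text.find(word); pos != std::u32string::npos;
         pos = text.find(word, pos + 1)) {
        if (isBoundary(text, pos, marksAreWordChars) &&
            isBoundary(text, pos + word.size(), marksAreWordChars))
            return true;
    }
    return false;
}
```

With marksAreWordChars set to false, containsWholeWord reproduces the two results above: the word ending in a bare letter is found, and the one ending in a fatha is not, because both the fatha and the following space count as non-word characters and so no boundary exists between them. With it set to true, both are found. If that is what the engine is doing, one thing to try in the real pattern would be replacing the trailing \b with a negative lookahead such as (?![\p{L}\p{M}\p{N}_]), though that is untested here.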
It does look like a bug to me.
-
@GraemeA See http://www.regular-expressions.info/wordboundaries.html and then http://www.regular-expressions.info/shorthand.html : "In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included." In my opinion it should recognize Arabic letters and parts of letters even though forming the letters and words is complicated. I think you could report this as a bug in https://bugreports.qt.io/.