QRegularExpression and Arabic bare word search



  • I'm having a problem doing a bare word match for Arabic text. In the example below, the English match works ok, but the Arabic
    one does not. Am I doing something wrong?

    QString str("but only the inf. n., namely دَقَعَ , of the verb in this sense");
    QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
    QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
    qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
    qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
    

    Returns

    true "\\bverb\\b" true
    true "\\bدَقَعَ\\b" false
    

    If I create a file with the same contents as 'str' and use grep -e, it matches:

    $ grep -e '\bدَقَعَ\b' test.txt
    but only the inf. n., namely دَقَعَ , of the verb in this sense
    
    


  • @GraemeA It may be a bug in the engine. This is interesting:

    QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
    QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
    QRegularExpression rx3(QString("\\b%1").arg("دَقَعَ"));
    QRegularExpression rx4(QString("%1\\b").arg("دَقَعَ"));
    QRegularExpression rx5(QString("%1").arg("دَقَعَ"));
    QRegularExpression rx6(QString("\\b%1\\b").arg("ع"));
    rx1.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    rx2.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    rx3.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    rx4.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    rx5.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    rx6.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
    qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
    qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
    qDebug() << rx3.isValid() << rx3.pattern() << rx3.match(str).hasMatch();
    qDebug() << rx4.isValid() << rx4.pattern() << rx4.match(str).hasMatch();
    qDebug() << rx5.isValid() << rx5.pattern() << rx5.match(str).hasMatch();
    qDebug() << rx6.isValid() << rx6.pattern() << rx6.match(str).hasMatch();
    

    true "\bverb\b" true
    true "\bدَقَعَ\b" false
    true "\bدَقَعَ" true
    true "دَقَعَ\b" false
    true "دَقَعَ" true
    true "\bع\b" true

    On the other hand, Arabic text isn't WYSIWYG. Just try walking through the word with the arrow keys and deleting a letter with the Backspace or Delete key...



  • @Eeli-K you're right, your examples are interesting.

    I think the problem is with the trailing fatha, which is why the leading \b works but the trailing one does not.

    // this matches
    str = "   دقَع   ";
    rx1.setPattern(QString("\\b%1\\b").arg("دقَع"));
    qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
    // this does not
    str = "   دقَعَ   ";
    rx1.setPattern(QString("\\b%1\\b").arg("دقَعَ"));
    qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
    
    

    Gives

    true "\\bدقَع\\b" true
    true "\\bدقَعَ\\b" false
    
    

    I tried a few other trailing characters (damma, kasra, sukun, etc.) and they all failed.
    It does look like a bug to me.
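
    One possible explanation, which nothing in this thread confirms: fatha and the other harakat are combining marks (Unicode category Mn), not letters. If the engine's \w covers only letters, digits and underscore, a trailing fatha followed by a space has a non-word character on both sides, so \b cannot match there. The sketch below models that rule in plain C++ without Qt. All function names are hypothetical, the Arabic ranges are simplified, and the marksCount flag is an assumption used only to compare the two interpretations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for "\w": ASCII alphanumerics, underscore, and the
// Arabic letter range U+0621..U+064A (an assumption, not the full Unicode set).
bool isLetterLike(char32_t c) {
    return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z') || c == U'_' ||
           (c >= 0x0621 && c <= 0x064A);
}

// Arabic combining marks fathatan..sukun (U+064B..U+0652), category Mn.
bool isArabicMark(char32_t c) {
    return c >= 0x064B && c <= 0x0652;
}

// marksCount == false models a \w that excludes combining marks;
// marksCount == true models a \w that includes them.
bool isWordChar(char32_t c, bool marksCount) {
    return isLetterLike(c) || (marksCount && isArabicMark(c));
}

// True if a \b-style boundary lies between text[pos - 1] and text[pos].
// Positions outside the string count as non-word characters.
bool isBoundary(const std::u32string& text, std::size_t pos, bool marksCount) {
    const bool before = pos > 0 && isWordChar(text[pos - 1], marksCount);
    const bool after = pos < text.size() && isWordChar(text[pos], marksCount);
    return before != after;
}

// True if word occurs in text with a boundary on both sides.
bool containsBareWord(const std::u32string& text, const std::u32string& word,
                      bool marksCount) {
    assert(!word.empty());
    for (std::size_t at = text.find(word); at != std::u32string::npos;
         at = text.find(word, at + 1)) {
        if (isBoundary(text, at, marksCount) &&
            isBoundary(text, at + word.size(), marksCount)) {
            return true;
        }
    }
    return false;
}
```

    With marksCount set to false, this model finds the word that ends in a letter but not the one that ends in a fatha, which is the same pattern as the results above. With marksCount set to true, both are found.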



  • @GraemeA See http://www.regular-expressions.info/wordboundaries.html and then http://www.regular-expressions.info/shorthand.html : "In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included." In my opinion it should recognize Arabic letters and parts of letters, even though forming the letters and words is complicated. I think you could report this as a bug at https://bugreports.qt.io/.



  • @GraemeA And by the way, don't forget to use QRegularExpression::UseUnicodePropertiesOption because the results are a bit different without that.
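
    Until the engine behaviour is settled, one workaround that may be worth trying is to replace \b with explicit lookarounds that also treat combining marks (\p{M}) as part of a word. The sketch below only builds the pattern text. The name bareWordPattern is hypothetical, std::string is used so the example carries no Qt dependency, and the pattern has not been tested against the Qt versions discussed in this thread.

```cpp
#include <string>

// Wraps an already-escaped word in lookarounds that reject a neighbouring
// letter, mark, digit or underscore on either side.
std::string bareWordPattern(const std::string& escapedWord) {
    static const std::string wordClass = R"re([\p{L}\p{M}\p{N}_])re";
    return "(?<!" + wordClass + ")" + escapedWord + "(?!" + wordClass + ")";
}
```

    The result could be handed to QRegularExpression through QString::fromStdString, with the search word escaped beforehand, for example by QRegularExpression::escape. Whether this gives the intended matches for every combination of marks is an assumption to check against real data.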

