QRegularExpression and Arabic bare word search

    GraemeA wrote (#1):

    I'm having a problem doing a bare word (whole-word) match on Arabic text. In the example below, the English match works,
    but the Arabic one does not. Am I doing something wrong?

    QString str("but only the inf. n., namely دَقَعَ , of the verb in this sense");
    QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
    QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
    qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
    qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
    

    Returns

    true "\\bverb\\b" true
    true "\\bدَقَعَ\\b" false
    

    If I create a file with the same contents as 'str' and use grep -e, it matches:

    $ grep -e '\bدَقَعَ\b' test.txt
    but only the inf. n., namely دَقَعَ , of the verb in this sense
    
    
      Eeli K wrote (#2, last edited by Eeli K):

      @GraemeA It may be a bug in the engine. This is interesting:

      // str is the same test string as in the first post.
      QRegularExpression rx1(QString("\\b%1\\b").arg("verb"));
      QRegularExpression rx2(QString("\\b%1\\b").arg("دَقَعَ"));
      QRegularExpression rx3(QString("\\b%1").arg("دَقَعَ"));
      QRegularExpression rx4(QString("%1\\b").arg("دَقَعَ"));
      QRegularExpression rx5(QString("%1").arg("دَقَعَ"));
      QRegularExpression rx6(QString("\\b%1\\b").arg("ع"));
      rx1.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      rx2.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      rx3.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      rx4.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      rx5.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      rx6.setPatternOptions(QRegularExpression::UseUnicodePropertiesOption);
      qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
      qDebug() << rx2.isValid() << rx2.pattern() << rx2.match(str).hasMatch();
      qDebug() << rx3.isValid() << rx3.pattern() << rx3.match(str).hasMatch();
      qDebug() << rx4.isValid() << rx4.pattern() << rx4.match(str).hasMatch();
      qDebug() << rx5.isValid() << rx5.pattern() << rx5.match(str).hasMatch();
      qDebug() << rx6.isValid() << rx6.pattern() << rx6.match(str).hasMatch();
      

      true "\bverb\b" true
      true "\b??????\b" false
      true "\b??????" true
      true "??????\b" false
      true "??????" true
      true "\b?\b" true

      On the other hand, Arabic text isn't WYSIWYG: what looks like a single letter can be a base letter followed by a separate combining mark. Just try walking through the word with the arrow keys and deleting a letter with the Backspace or Delete key...

        GraemeA wrote (#3):

        @Eeli-K you're right, your examples are interesting.

        I think the problem is the trailing fatha, which would explain why the leading \b works but the trailing one does not.

            // This matches: the word ends in a letter.
            str = "   دقَع   ";
            rx1.setPattern(QString("\\b%1\\b").arg("دقَع"));
            qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
            // This does not: the word ends in a fatha.
            str = "   دقَعَ   ";
            rx1.setPattern(QString("\\b%1\\b").arg("دقَعَ"));
            qDebug() << rx1.isValid() << rx1.pattern() << rx1.match(str).hasMatch();
        
        

        Gives

        true "\\bدقَع\\b" true
        true "\\bدقَعَ\\b" false
        
        

        I tried a few other trailing marks (damma, kasra, sukun, etc.) and they all failed.
        It does look like a bug to me.
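
A possible explanation for these results, offered as a sketch rather than a confirmed diagnosis: the Qt documentation describes \w under UseUnicodePropertiesOption as matching characters with the Unicode L (letter) or N (number) property, plus underscore. A fatha (U+064E) is a nonspacing mark (category Mn), so under that rule it is not a word character, and \b only matches where a word character meets a non-word character. The plain C++ model below mimics that rule for a handful of code points. `classify`, `isWord` and `boundaryAt` are hypothetical helpers written for this illustration, and the code point ranges are a deliberate simplification, not the engine's real tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified, hand-written classes covering only ASCII and basic Arabic.
enum class CharClass { Word, Mark, Other };

CharClass classify(char32_t c) {
    if ((c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
        (c >= U'0' && c <= U'9') || c == U'_')
        return CharClass::Word;                             // ASCII \w
    if (c >= 0x0620 && c <= 0x064A) return CharClass::Word; // Arabic letters
    if (c >= 0x064B && c <= 0x065F) return CharClass::Mark; // fatha, damma, kasra, sukun, ...
    return CharClass::Other;                                // spaces, punctuation, everything else
}

// Model of \w. marksAreWord selects whether combining marks count as
// word characters (false models the "L or N, plus underscore" rule).
bool isWord(char32_t c, bool marksAreWord) {
    const CharClass k = classify(c);
    return k == CharClass::Word || (marksAreWord && k == CharClass::Mark);
}

// Model of \b: true if exactly one of s[pos - 1] and s[pos] is a word
// character. Positions before the start and past the end count as non-word.
bool boundaryAt(const std::u32string& s, std::size_t pos, bool marksAreWord) {
    assert(pos <= s.size());
    const bool before = pos > 0 && isWord(s[pos - 1], marksAreWord);
    const bool after = pos < s.size() && isWord(s[pos], marksAreWord);
    return before != after;
}
```

Take the failing string written with escapes, `U"   \u062F\u0642\u064E\u0639\u064E   "` (three spaces, dal, qaf, fatha, ain, fatha, three spaces). In this model `boundaryAt(s, 3, false)` is true (space, then dal), while `boundaryAt(s, 8, false)` is false, because the final fatha and the space after it are both non-word. That is consistent with the leading \b matching and the trailing \b failing. Passing `true` for `marksAreWord` restores the trailing boundary; whether a particular engine version counts marks as word characters is worth checking against its own documentation.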

          Eeli K wrote (#4):

          @GraemeA See http://www.regular-expressions.info/wordboundaries.html and then http://www.regular-expressions.info/shorthand.html : "In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included." In my opinion it should recognize Arabic letters and parts of letters even though forming the letters and words is complicated. I think you could report this as a bug in https://bugreports.qt.io/.
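
Since \w coverage varies between engines, one workaround is to avoid \b altogether. The sketch below assumes the engine supports \p{...} classes and lookarounds, which PCRE-style engines generally do. It wraps the word in a negative lookbehind and a negative lookahead that reject letters, marks, numbers and underscore on either side. `bareWordPattern` is a hypothetical helper: it builds the pattern as a UTF-8 std::string and expects the word to be regex-escaped already (in Qt, QRegularExpression::escape can do that). It has not been tested against the Qt version discussed in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: whole-word pattern that does not rely on \b.
// escapedWord must be non-empty and already regex-escaped.
std::string bareWordPattern(const std::string& escapedWord) {
    assert(!escapedWord.empty());
    // Characters treated as "part of a word": letters, combining marks
    // (so a trailing fatha stays attached), numbers and underscore.
    const std::string wordChar = R"rx([\p{L}\p{M}\p{N}_])rx";
    // (?<!X) means "not preceded by X"; (?!X) means "not followed by X".
    return "(?<!" + wordChar + ")" + escapedWord + "(?!" + wordChar + ")";
}
```

In Qt the result could be handed over with something like `QRegularExpression(QString::fromStdString(bareWordPattern(word)))`. One difference from \b is worth knowing: because marks count as part of the word here, a search for دقَع would no longer match inside دقَعَ.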

            Eeli K wrote (#5):

            @GraemeA And by the way, don't forget to set QRegularExpression::UseUnicodePropertiesOption. The results differ without it, because by default \w (and therefore \b) only treats ASCII letters, digits and underscore as word characters.
