
Need help with QRegularExpression for strings and comments

14 Posts 4 Posters 4.1k Views 3 Watching
    Sikarjan wrote:

    I already tried that, but it looks like the syntax here is a bit different. Regex101 suggests this pattern:

    ((?<!\\)([\"']))|(\/\*)|(\*\/)
    

    But this one is not working either.

    mrjj (Lifetime Qt Champion) wrote (#4):

    @Sikarjan
    And are you escaping it CORRECTLY?

      Sikarjan wrote (#5):

      I believe so.

      QRegExp("/\\*")
      

      Is a valid expression.
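
A side note on the escaping question, for anyone following along: the pattern passes through two layers. The C++ compiler processes the string literal first, and the regex engine only sees what is left. The sketch below uses plain std::string (no Qt), and the function names are made up for illustration; it only shows which characters reach the engine.

```cpp
#include <cassert>
#include <string>

// The literal "/\\*" reaches the regex engine as three characters: / \ *
// (the compiler collapses the doubled backslash into a single one).
inline std::string commentStartPattern() { return "/\\*"; }

// The same pattern as a raw string literal, where no doubling is needed.
inline std::string commentStartPatternRaw() { return R"(/\*)"; }

// A regex that matches one literal backslash needs two backslashes in the
// pattern, which means four of them in an ordinary C++ literal.
inline std::string literalBackslashPattern() { return "\\\\"; }

// Hypothetical sanity check: both spellings give the same pattern text.
inline void checkPatterns() {
    assert(commentStartPattern() == commentStartPatternRaw());
    assert(commentStartPattern().size() == 3);
    assert(literalBackslashPattern().size() == 2);
}
```

So a pattern copied from Regex101 has to have every backslash doubled (and every double quote escaped) before it goes into an ordinary literal, or it can be pasted unchanged into a raw string.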

        kshegunov (Moderators) wrote (#6):

        Here's one for comments: QRegularExpression("\\/\\*+\\s*(?!<\\*)(.*?)(?:\\**)\\s*\\*\\/").
        Here's one for strings: QRegularExpression("\"(.+?)(?:\"\\s*)??(?<!(?<!\\\\)\\\\)\"")

        However, this is a really iffy use of regular expressions: they can't cover all possible cases. For example, the proposed string-matching expression doesn't handle:

        "string"  "concatenated string"
        

        It will also fail to properly match strings containing \\\" . The "real" solution is to have a proper parser.
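
The \\\" case tends to be easier to get right with a few lines of code than with lookbehinds. A minimal sketch, assuming a backslash always escapes the character after it; findClosingQuote is a hypothetical helper written in plain C++ (no Qt), not something from the thread:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the index of the quote that closes the string literal opening at
// `open`, or std::string::npos if the literal is not closed within `text`.
// An escaped quote does not close the string, but a quote that follows an
// escaped backslash does.
inline std::size_t findClosingQuote(const std::string &text, std::size_t open) {
    assert(open < text.size());
    const char quote = text[open];           // either " or '
    for (std::size_t i = open + 1; i < text.size(); ++i) {
        if (text[i] == '\\') {
            ++i;                             // skip the escaped character
        } else if (text[i] == quote) {
            return i;
        }
    }
    return std::string::npos;
}
```

Because the opening quote character is remembered, the other kind of quote inside the string is simply skipped.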

        Read and abide by the Qt Code of Conduct

          Sikarjan wrote (#7):

          @kshegunov

          Maybe I need to give you some more background. I am working on a code highlighter for PHP. I started with the highlighter example and redid the multi-line section. In PHP, a string can span several lines.

          "I am some text
          in a multi line string";
          

          It could be in single or double quotes. If the string starts with one kind of quote, the other kind will not end it. This is why your suggestions would not work in my case.

          "<a href='../test.php'>see the \"Test\" page</a>";
          

          This would be one string and should all be highlighted in green (in my case).

          I had everything above working with the code below and this regex:

          quoteExpressions = QRegularExpression(R"**(?<!\\)([\"']**");
          

          But there is another case. Something like

          glob('images/*.jpg');
          

          is also possible. If I handle the quotes and the comments in two separate sections, the code above is first interpreted as the beginning of a string and then changed to a comment. Therefore I tried to combine all multi-line cases in one "function", see below. I believe my code should work if I get the regex to work. Unfortunately, I do not understand the regex with the R"**... . There is probably a better way to do what I want, but this is the best I could come up with.

             multiLineCommentFormat.setForeground(Qt::gray);
             multiLineQuoteFormat.setForeground(Qt::darkGreen);
             quoteExpressions = QRegularExpression(R"**(?<!\\)([\"']|(/\\*)|(\\*/))**");
          }
          
          void Highlighter::highlightBlock(const QString &text)
          {
              setCurrentBlockState(0);
          
              if (previousBlockState() <= 0){
                  QRegularExpressionMatchIterator quoteMatch = quoteExpressions.globalMatch(text);
          
                  while(quoteMatch.hasNext()){
                      QRegularExpressionMatch match = quoteMatch.next();
                      int quoteStart = match.capturedStart();
                      int quoteLength = 0;
                      bool foundNextQuote = false;
                      QString lastQuote = match.captured();
                      int blockState = 3;
                      if(lastQuote == "'"){
                          blockState = 2;
                      }else if(lastQuote == "/*"){
                          blockState = 1;
                          lastQuote = "*/";
                      }
          
                      while(quoteMatch.hasNext()){
                          match = quoteMatch.next();
                          if(match.captured() == lastQuote){
                              quoteLength = match.capturedStart() - quoteStart;
                              foundNextQuote = true;
                              break;
                          }
                      }
          
                      if(!foundNextQuote){
                          setCurrentBlockState(blockState);
                          quoteLength = text.length() - quoteStart;
                      }
                      setFormat(quoteStart, quoteLength+1, blockState == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
                  }
              }else{
                  QRegularExpressionMatchIterator quoteMatch = quoteExpressions.globalMatch(text);
                  QString lastQuote = "\"";
                  if(previousBlockState() == 1)
                      lastQuote = "*/";
                  else if(previousBlockState() == 2)
                      lastQuote = "'";
          
                  bool foundNextQuote = false;
                  while(quoteMatch.hasNext()){
                      QRegularExpressionMatch match = quoteMatch.next();
                      if(match.captured() == lastQuote){
                          setFormat(0, match.capturedStart()+1, previousBlockState() == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
                          foundNextQuote = true;
                          break;
                      }
                  }
          
                  if(!quoteMatch.hasNext() && !foundNextQuote){
                      setCurrentBlockState(previousBlockState());
                      setFormat(0, text.length(), previousBlockState() == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
                  }
          
                  while(quoteMatch.hasNext()){
                      QRegularExpressionMatch match = quoteMatch.next();
                      int quoteStart = match.capturedStart();
                      int quoteLength = 0;
                      bool foundNextQuote = false;
                      QString lastQuote = match.captured();
                      int blockState = 3;
                      if(lastQuote == "'"){
                          blockState = 2;
                      }else if(lastQuote == "/*"){
                          blockState = 1;
                          lastQuote = "*/";
                      }
          
                      while(quoteMatch.hasNext()){
                          match = quoteMatch.next();
                          if(match.captured() == lastQuote){
                              quoteLength = match.capturedStart() - quoteStart;
                              foundNextQuote = true;
                              break;
                          }
                      }
          
                      if(!foundNextQuote){
                          setCurrentBlockState(blockState);
                          quoteLength = text.length() - quoteStart;
                      }
                      setFormat(quoteStart, quoteLength+1, blockState == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
                  }
              }
          }
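
On the R"**... question: a C++11 raw string literal has the form R"delim(content)delim", and the compiler does not process any escapes inside content. Assuming the ** in the post is meant as that delimiter (and is not leftover forum markup), the first ( after it belongs to the literal syntax rather than to the pattern, and doubled backslashes stay doubled. A small sketch in plain C++, with hypothetical function names, that compares three spellings:

```cpp
#include <cassert>
#include <string>

// Ordinary literal: every regex backslash is doubled, quotes are escaped.
inline std::string quotePatternEscaped() {
    return "(?<!\\\\)([\"'])|(/\\*)|(\\*/)";
}

// Raw literal with the delimiter **: the pattern is written as the regex
// engine should see it, wrapped in R"**( and )**".
inline std::string quotePatternRaw() {
    return R"**((?<!\\)(["'])|(/\*)|(\*/))**";
}

// The literal as posted above: the leading ( is consumed by the raw string
// syntax, so the pattern text starts with a question mark.
inline std::string quotePatternAsPosted() {
    return R"**(?<!\\)([\"']|(/\\*)|(\\*/))**";
}

// Hypothetical sanity check of the three spellings.
inline void checkRawStrings() {
    assert(quotePatternEscaped() == quotePatternRaw());
    assert(quotePatternAsPosted().front() == '?');
}
```

If that assumption holds, the posted literal hands the engine a pattern that begins with a bare ?, which a regex engine will normally reject as a quantifier with nothing to repeat.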
          
            kshegunov (Moderators) wrote (#8):

            @Sikarjan said in Need help with QRegularExpression for strings and comments:

            It could be in single or double quotes.

            This is rather irrelevant, the regex can be trivially modified to allow for single quotes.

            I am working on code highlighter for PHP.

            Sorry to break it to you, but then you're definitely on a slippery slope. You need a proper parser (or rather a tokenizer); you won't be able to make it work reliably with regular expressions alone. It should be a simple matter, as you can also use PHP's own language API to get the tokenization directly. If that is not an option, you can write your own; it's not a very hard thing to do.
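
To give an idea of what writing your own could look like, here is a minimal sketch of a hand-written scanner that walks one line character by character and carries its state over to the next line, much like the block state in the highlighter code earlier in the thread. It is plain C++ without Qt; State, Span, LineResult and scanLine are names made up for this example, and it only knows /* */ comments and the two quote styles (no //, # or heredoc).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Numeric values mirror the block states used earlier in the thread.
enum class State { Code = 0, Comment = 1, SingleQuoted = 2, DoubleQuoted = 3 };

struct Span { std::size_t start; std::size_t length; State kind; };

struct LineResult { State endState; std::vector<Span> spans; };

// Scans one line, starting in the state the previous line ended in.
// Returns the spans to highlight and the state at the end of the line.
inline LineResult scanLine(const std::string &line, State state) {
    LineResult result{state, {}};
    std::size_t spanStart = 0;   // a span carried over starts at column 0
    std::size_t i = 0;
    while (i < line.size()) {
        const char c = line[i];
        if (state == State::Code) {
            if (c == '/' && i + 1 < line.size() && line[i + 1] == '*') {
                state = State::Comment;
                spanStart = i;
                i += 2;
            } else if (c == '\'' || c == '"') {
                state = (c == '\'') ? State::SingleQuoted : State::DoubleQuoted;
                spanStart = i;
                ++i;
            } else {
                ++i;
            }
        } else if (state == State::Comment) {
            if (c == '*' && i + 1 < line.size() && line[i + 1] == '/') {
                i += 2;
                result.spans.push_back({spanStart, i - spanStart, state});
                state = State::Code;
            } else {
                ++i;
            }
        } else {                 // inside a quoted string
            const char quote = (state == State::SingleQuoted) ? '\'' : '"';
            if (c == '\\') {
                i += 2;          // skip the escaped character
            } else if (c == quote) {
                ++i;
                result.spans.push_back({spanStart, i - spanStart, state});
                state = State::Code;
            } else {
                ++i;
            }
        }
    }
    // An unfinished comment or string runs to the end of the line.
    if (state != State::Code && spanStart < line.size())
        result.spans.push_back({spanStart, line.size() - spanStart, state});
    result.endState = state;
    return result;
}

// Hypothetical check: the quote keeps /* inside glob() from opening a comment.
inline void checkGlobExample() {
    const LineResult r = scanLine("glob('images/*.jpg');", State::Code);
    assert(r.endState == State::Code);
    assert(r.spans.size() == 1 && r.spans[0].kind == State::SingleQuoted);
}
```

In a QSyntaxHighlighter subclass, the idea would be to pass previousBlockState() in, call setFormat() for each span, and hand endState to setCurrentBlockState().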

              VRonin wrote (#9):

              It has a steep learning curve, but boost::spirit can be an option for a proper parser.

              "La mort n'est rien, mais vivre vaincu et sans gloire, c'est mourir tous les jours"
              ~Napoleon Bonaparte

              On a crusade to banish setIndexWidget() from the holy land of Qt

                Sikarjan wrote (#10):

                @kshegunov said in Need help with QRegularExpression for strings and comments:

                It should be a simple matter as you can also directly use PHP's own language API to get the tokenization directly.

                That sounds very simple indeed, but I don't understand a word. Do you happen to have a link that is a good entry point for that topic? I only have some PHP background and I am not a programmer by training, so my skills are very, very limited.

                Thanks for the help, though!

                  Sikarjan wrote (#11):

                  Hi,

                  I got my code working with the following regex

                  quoteExpressions = QRegularExpression("(?<!\\\\)([\"'])|(\\/\\*)|(\\*\\/)");
                  

                  Thanks @mrjj for making me recheck it again.

                  I am still interested in a parser solution as well, but so far I have not been able to find anything that would help me understand the two posts about it.

                    kshegunov (Moderators) wrote (#12):

                    To tokenize something basically means to split it into some kind of atomic units, e.g. string literals, identifiers, number literals, parentheses and so on. Start with Wikipedia. Also, as I said, you already have that in PHP:
                    http://php.net/manual/en/function.token-get-all.php
                    http://php.net/manual/en/function.token-name.php
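
To make "atomic units" a bit more concrete, here is a rough sketch of a tiny tokenizer in plain C++ (no Qt). TokenKind, Token and tokenize are hypothetical names, and the four token categories are a deliberate simplification; PHP's own token_get_all() distinguishes far more.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, String, Punctuation };

struct Token { TokenKind kind; std::string text; };

// Splits source text into tokens; whitespace between tokens is dropped.
inline std::vector<Token> tokenize(const std::string &src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        const std::size_t start = i;
        TokenKind kind = TokenKind::Punctuation;
        if (std::isalpha(c) || c == '_' || c == '$') {
            // Names and $variables: a start character, then letters/digits/_
            kind = TokenKind::Identifier;
            ++i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
        } else if (std::isdigit(c)) {
            kind = TokenKind::Number;        // integers only in this sketch
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
        } else if (c == '"' || c == '\'') {
            kind = TokenKind::String;
            const char quote = src[i];
            ++i;
            while (i < src.size() && src[i] != quote)
                i += (src[i] == '\\') ? 2 : 1;   // skip escaped characters
            i = std::min(i + 1, src.size());     // take the closing quote, if any
        } else {
            ++i;                             // any other single character
        }
        tokens.push_back({kind, src.substr(start, i - start)});
    }
    return tokens;
}

// Hypothetical check: the whole quoted argument comes out as one token.
inline void checkTokenize() {
    const std::vector<Token> t = tokenize("glob('images/*.jpg');");
    assert(t.size() == 5 && t[2].kind == TokenKind::String);
}
```

A highlighter would then pick a format per TokenKind instead of searching for quote and comment markers separately.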

                      Sikarjan wrote (#13):

                      @kshegunov I believe I get the idea. What I am unsure about is how the parser would work. How would I call it? Would it rescan the entire file with every keystroke?
                      The problem with a PHP file is that it can contain HTML, CSS and JavaScript parts, which should have their own highlighting and auto-completion.

                        kshegunov (Moderators) wrote (#14):

                        That's not a problem for PHP (from its point of view). If you look at the list of tokens, you see that it doesn't care about any HTML, JavaScript or CSS. It just reads the stuff outside <?php and ?> and prints it to the standard stream (the T_INLINE_HTML token); it does not care what it contains. So for highlighting any one of those languages you will need another tokenizer that recognizes it.
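
As a rough illustration of that split, the sketch below cuts a source text into PHP regions and everything else, so that each part could be handed to its own tokenizer. It is plain C++ with made-up names (Region, splitRegions) and rests on simplifying assumptions: only the <?php opening tag is recognized, and a ?> inside a PHP string would wrongly end the region.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Region {
    bool isPhp;          // true for <?php ... ?>, false for inline HTML
    std::string text;
};

// Splits `src` into alternating inline-HTML and PHP regions.
inline std::vector<Region> splitRegions(const std::string &src) {
    std::vector<Region> regions;
    std::size_t pos = 0;
    while (pos < src.size()) {
        const std::size_t open = src.find("<?php", pos);
        if (open == std::string::npos) {
            regions.push_back({false, src.substr(pos)});   // trailing HTML
            break;
        }
        if (open > pos)
            regions.push_back({false, src.substr(pos, open - pos)});
        const std::size_t close = src.find("?>", open);
        // An unclosed PHP block runs to the end of the text.
        const std::size_t end =
            (close == std::string::npos) ? src.size() : close + 2;
        regions.push_back({true, src.substr(open, end - open)});
        pos = end;
    }
    return regions;
}

// Hypothetical check: HTML, PHP, HTML.
inline void checkSplit() {
    const std::vector<Region> r = splitRegions("<p>Hi</p><?php echo 1; ?><b>x</b>");
    assert(r.size() == 3 && !r[0].isPhp && r[1].isPhp && !r[2].isPhp);
}
```

The non-PHP regions correspond to what PHP reports as T_INLINE_HTML; telling HTML, CSS and JavaScript apart inside them would need further tokenizers.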
