Need help with QRegularExpression for strings and comments



  • Hi,

    I need some help with a regex. Here is what I am working with, but it has an issue:

    quoteExpressions = QRegularExpression(R"**((?<!\\)([\"']))|(/\\*)|(\\*/)**");
    

    This part worked:

    quoteExpressions = QRegularExpression(R"**((?<!\\)([\"']))**");
    

    And this is what I added:

    |(/\\*)|(\\*/)
    

    Basically I want to extend the regex to find /* and */ substrings.

    The regex app does not help, since it cannot deal with the ** and the R. What is the R for, and why do I need the ** for the quote part?

    Thanks in advance!


  • Qt Champions 2016

    @Sikarjan
    Hi
    Maybe this tool can help?
    https://regex101.com/



  • I already tried that, but it looks like the syntax here is a bit different. Regex101 would suggest this string:

    ((?<!\\)([\"']))|(\/\*)|(\*\/)
    

    But this one is not working either.


  • Qt Champions 2016

    @Sikarjan
    And you do escape it CORRECTLY?



  • I believe so.

    QRegExp("/\\*")
    

    Is a valid expression.


  • Qt Champions 2016

    Here's one for comments: QRegularExpression("\\/\\*+\\s*(?!<\\*)(.*?)(?:\\**)\\s*\\*\\/")
    Here's one for strings: QRegularExpression("\"(.+?)(?:\"\\s*)??(?<!(?<!\\\\)\\\\)\"")

    However, this is a really iffy use of regular expressions; they can't cover all the possible cases. For example, the proposed string-matching expression doesn't handle:

    "string"  "concatenated string"
    

    It will also fail to properly match strings containing \\\". The "real" solution is to have a proper parser.
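
    To make the \\\" problem concrete: a lookbehind such as (?<!\\) only looks at the one character in front of the quote, so it cannot tell a backslash that escapes the quote from a backslash that is itself escaped. A tokenizer would instead count the backslashes in front of the quote. A minimal sketch, assuming a made-up helper name and plain std::string so that it builds without Qt:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: reports whether the character at `pos` is escaped,
// i.e. preceded by an odd number of consecutive backslashes.
bool isEscaped(const std::string &text, std::size_t pos)
{
    std::size_t backslashes = 0;
    while (pos > 0 && text[pos - 1] == '\\') {
        ++backslashes;
        --pos;
    }
    return backslashes % 2 == 1;
}
```

    With this, the quote in a\"b counts as escaped, while the quote in a\\"b does not, because there the two backslashes pair up with each other.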



  • @kshegunov

    Maybe I need to give you some more background. I am working on a code highlighter for PHP. I started with the highlighter example and redid the multi-line section. In PHP, a string can span several lines:

    "I am some text
    in a multi line string";
    

    It could be in single or double quotes. If the string is started with either one, the other will not end the string. This is why your suggestions would not work in my case.

    "<a href='../test.php'>see the \"Test\" page</a>";
    

    This would be one string and should all be highlighted in green (in my case).

    I had everything above working with the code below and this regex:

    quoteExpressions = QRegularExpression(R"**((?<!\\)([\"']))**");
    

    But there is another case. Something like

    glob('images/*.jpg');
    

    is also possible. If I handle the quotes and the comments in two separate passes, the code above will first be interpreted as the beginning of a string and then be changed to a comment. Therefore I tried to combine all multi-line cases in one "function", see below. I believe my code should work if I get the regex to work. Unfortunately, I do not understand the regex with the R"**... syntax. There is probably a better way to do what I want, but this is the best I could come up with.

       multiLineCommentFormat.setForeground(Qt::gray);
       multiLineQuoteFormat.setForeground(Qt::darkGreen);
       quoteExpressions = QRegularExpression(R"**(?<!\\)([\"']|(/\\*)|(\\*/))**");
    }
    
    void Highlighter::highlightBlock(const QString &text)
    {
        setCurrentBlockState(0);
    
        if (previousBlockState() <= 0){
            QRegularExpressionMatchIterator quoteMatch = quoteExpressions.globalMatch(text);
    
            while(quoteMatch.hasNext()){
                QRegularExpressionMatch match = quoteMatch.next();
                int quoteStart = match.capturedStart();
                int quoteLength = 0;
                bool foundNextQuote = false;
                QString lastQuote = match.captured();
                int blockState = 3;
                if(lastQuote == "'"){
                    blockState = 2;
                }else if(lastQuote == "/*"){
                    blockState = 1;
                    lastQuote = "*/";
                }
    
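                // Skip ahead to the delimiter that closes the current string or comment.
                while(quoteMatch.hasNext()){
                    match = quoteMatch.next();
                    if(match.captured() == lastQuote){
                        // capturedEnd() covers the whole closing delimiter, including the two-character "*/".
                        quoteLength = match.capturedEnd() - quoteStart;
                        foundNextQuote = true;
                        break;
                    }
                }
    
                if(!foundNextQuote){
                    // Not closed on this line: remember the state and format up to the end of the line.
                    setCurrentBlockState(blockState);
                    quoteLength = text.length() - quoteStart;
                }
                setFormat(quoteStart, quoteLength, blockState == 1 ? multiLineCommentFormat:multiLineQuoteFormat);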
            }
        }else{
            QRegularExpressionMatchIterator quoteMatch = quoteExpressions.globalMatch(text);
            QString lastQuote = "\"";
            if(previousBlockState() == 1)
                lastQuote = "*/";
            else if(previousBlockState() == 2)
                lastQuote = "'";
    
            bool foundNextQuote = false;
            while(quoteMatch.hasNext()){
                QRegularExpressionMatch match = quoteMatch.next();
                if(match.captured() == lastQuote){
                    // capturedEnd() also covers the second character of a closing "*/".
                    setFormat(0, match.capturedEnd(), previousBlockState() == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
                    foundNextQuote = true;
                    break;
                }
            }
    
            if(!quoteMatch.hasNext() && !foundNextQuote){
                setCurrentBlockState(previousBlockState());
                setFormat(0, text.length(), previousBlockState() == 1 ? multiLineCommentFormat:multiLineQuoteFormat);
            }
    
            while(quoteMatch.hasNext()){
                QRegularExpressionMatch match = quoteMatch.next();
                int quoteStart = match.capturedStart();
                int quoteLength = 0;
                bool foundNextQuote = false;
                QString lastQuote = match.captured();
                int blockState = 3;
                if(lastQuote == "'"){
                    blockState = 2;
                }else if(lastQuote == "/*"){
                    blockState = 1;
                    lastQuote = "*/";
                }
    
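                // Skip ahead to the delimiter that closes the current string or comment.
                while(quoteMatch.hasNext()){
                    match = quoteMatch.next();
                    if(match.captured() == lastQuote){
                        // capturedEnd() covers the whole closing delimiter, including the two-character "*/".
                        quoteLength = match.capturedEnd() - quoteStart;
                        foundNextQuote = true;
                        break;
                    }
                }
    
                if(!foundNextQuote){
                    // Not closed on this line: remember the state and format up to the end of the line.
                    setCurrentBlockState(blockState);
                    quoteLength = text.length() - quoteStart;
                }
                setFormat(quoteStart, quoteLength, blockState == 1 ? multiLineCommentFormat:multiLineQuoteFormat);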
            }
        }
    }
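
    The same bookkeeping can also be written as a small state machine that walks the line once and carries the state over from the previous line, which is the role previousBlockState() and setCurrentBlockState() play in the code above. The sketch below is only an illustration with made-up names (State, Span, scanLine) and plain std::string instead of Qt types. It reuses the state numbers 1 to 3 from above and collects the spans that would be passed to setFormat():

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Same numbering as the block states in the code above.
enum State { Normal = 0, InComment = 1, InSingleQuote = 2, InDoubleQuote = 3 };

struct Span {
    std::size_t start;
    std::size_t length;
    State state;  // what the span is: comment, single- or double-quoted string
};

// Scans one line, starting in the state the previous line ended in.
// Appends the spans to format and returns the state at the end of the line.
State scanLine(const std::string &line, State state, std::vector<Span> &spans)
{
    const std::size_t n = line.size();
    std::size_t i = 0;
    std::size_t spanStart = 0;  // a span continued from the previous line starts at 0

    while (i < n) {
        if (state == Normal) {
            if (line[i] == '"' || line[i] == '\'') {
                state = (line[i] == '"') ? InDoubleQuote : InSingleQuote;
                spanStart = i++;
            } else if (line.compare(i, 2, "/*") == 0) {
                state = InComment;
                spanStart = i;
                i += 2;
            } else {
                ++i;
            }
        } else if (state == InComment) {
            if (line.compare(i, 2, "*/") == 0) {
                i += 2;
                spans.push_back({spanStart, i - spanStart, InComment});
                state = Normal;
            } else {
                ++i;
            }
        } else {
            const char quote = (state == InDoubleQuote) ? '"' : '\'';
            if (line[i] == '\\') {
                i += 2;  // skip the backslash and the character it escapes
            } else if (line[i] == quote) {
                ++i;
                spans.push_back({spanStart, i - spanStart, state});
                state = Normal;
            } else {
                ++i;
            }
        }
    }
    if (state != Normal)
        spans.push_back({spanStart, n - spanStart, state});  // still open at end of line
    return state;
}
```

    Because the line is read strictly left to right, the /* inside glob('images/*.jpg') is already part of a string when it is reached and never starts a comment.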
    

  • Qt Champions 2016

    @Sikarjan said in Need help with QRegularExpression for strings and comments:

    It could be in single or double quotes.

    This is rather irrelevant, the regex can be trivially modified to allow for single quotes.

    I am working on code highlighter for PHP.

    Sorry to break it to you, but then you're definitely on a slippery slope. You need a proper parser (or rather a tokenizer); you won't be able to make it work reliably with regular expressions alone. It should be a simple matter, as you can use PHP's own language API to get the tokenization directly. If that's not an option, you can write your own; it's not a very hard thing to do.



  • It has a steep learning curve, but boost::spirit can be an option for a proper parser.



  • @kshegunov said in Need help with QRegularExpression for strings and comments:

    It should be a simple matter as you can also directly use PHP's own language API to get the tokenization directly.

    That sounds very simple indeed, but I don't understand a word. Do you happen to have a link that is a good entry point for that topic? I only have some PHP background and I am not a programmer by training, so my skills are very, very limited.

    Thanks for the help, though!



  • Hi,

    I got my code working with the following regex:

    quoteExpressions = QRegularExpression("(?<!\\\\)([\"'])|(\\/\\*)|(\\*\\/)");
    

    Thanks @mrjj for making me recheck it again.
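
    For anyone puzzled by the R"**( ... )**" syntax from the first post: as far as I understand it, this is a C++11 raw string literal. The R switches off backslash escaping inside the literal, and the ** is just an arbitrary delimiter, which is only needed when the pattern itself contains )". Inside a raw literal the backslashes must therefore not be doubled. A minimal sketch of the difference, using plain std::string so that it compiles without Qt (the variable names are made up for illustration):

```cpp
#include <cassert>
#include <string>

// Ordinary literal: each regex backslash is doubled for the compiler and the
// double quote is escaped. This is the pattern that works in the post above.
const std::string escapedPattern = "(?<!\\\\)([\"'])|(\\/\\*)|(\\*\\/)";

// Raw literal: everything between R"**( and )**" is taken verbatim, so the
// pattern is written exactly as the regex engine should see it.
const std::string rawPattern = R"**((?<!\\)(["'])|(\/\*)|(\*\/))**";

// The literal from the first post, copied as posted. Its content is
//   (?<!\\)([\"']))|(/\\*)|(\\*/
// so there is a stray ) after the quote group, and the regex engine reads
// \\* as "zero or more backslashes" instead of an escaped star.
const std::string firstAttempt = R"**((?<!\\)([\"']))|(/\\*)|(\\*/)**";
```

    Here escapedPattern and rawPattern hold exactly the same characters, so either spelling should be usable as the pattern argument, while firstAttempt holds a different and invalid pattern.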

    I am still interested in a parser solution as well, but so far I have not been able to find anything that would help me understand the two posts about it.


  • Qt Champions 2016

    To tokenize something basically means to split it into some kind of atomic units, e.g. string literals, identifiers, number literals, parentheses and so on. Start with Wikipedia. Also, as I said, you already have that in PHP:
    http://php.net/manual/en/function.token-get-all.php
    http://php.net/manual/en/function.token-name.php
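
    If you would rather write your own, the idea can be sketched in a few lines of C++. This is not PHP's tokenizer, just a toy with made-up names (Kind, Token, tokenize) that knows four kinds of tokens and skips whitespace. It ignores comments, heredocs, operators longer than one character and everything else a real one has to deal with:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; PHP's own list is much longer.
enum class Kind { Identifier, Number, String, Punctuation };

struct Token {
    Kind kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string &src)
{
    std::vector<Token> tokens;
    const std::size_t n = src.size();
    std::size_t i = 0;

    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        const std::size_t start = i;

        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens but is not a token here
            continue;
        }
        Kind kind;
        if (std::isalpha(c) || c == '_' || c == '$') {
            // Identifier or PHP variable: letters, digits and underscores.
            ++i;
            while (i < n && (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            kind = Kind::Identifier;
        } else if (std::isdigit(c)) {
            while (i < n && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
            kind = Kind::Number;
        } else if (c == '"' || c == '\'') {
            // String literal: runs to the matching quote; a backslash
            // escapes the next character, whatever it is.
            ++i;
            while (i < n && src[i] != static_cast<char>(c))
                i += (src[i] == '\\' && i + 1 < n) ? 2 : 1;
            if (i < n)
                ++i;  // include the closing quote
            kind = Kind::String;
        } else {
            ++i;  // any other single character
            kind = Kind::Punctuation;
        }
        tokens.push_back({kind, src.substr(start, i - start)});
    }
    return tokens;
}
```

    Fed with $x = glob('images/*.jpg'); this yields seven tokens, and the whole 'images/*.jpg' ends up in a single String token. A highlighter then only has to map each token kind to a format.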



  • @kshegunov I believe I get the idea. What I am unsure about is how the parser would work. How would I call it? Would it rescan the entire file with every keystroke?
    The problem with a PHP file is that it could contain HTML, CSS and JavaScript parts, which should have their own highlighting and auto-completion.


  • Qt Champions 2016

    That's not a problem for PHP (from its point of view). If you look at the list of tokens, you see that it doesn't care about any HTML, JavaScript or CSS. It just reads the stuff outside <?php and ?> and prints it to the standard output (the T_INLINE_HTML token); it does not care what it contains. So for highlighting any one of those languages you will need another tokenizer that recognizes them.
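
    As a rough illustration of that split, here is a sketch that cuts a file into inline HTML and PHP segments by searching for the tags, so that each segment can be handed to the matching tokenizer. It is deliberately naive and all names are made up: it does not know about short tags such as <?= and it would be fooled by a ?> inside a PHP string literal, which the real tokenizer handles correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Segment {
    bool isPhp;        // false: inline HTML, which PHP passes through untouched
    std::string text;  // segment contents; PHP segments include their tags
};

std::vector<Segment> splitPhpSegments(const std::string &source)
{
    static const std::string openTag = "<?php";
    static const std::string closeTag = "?>";
    std::vector<Segment> segments;
    std::size_t pos = 0;

    while (pos < source.size()) {
        const std::size_t open = source.find(openTag, pos);
        if (open == std::string::npos) {
            // No more PHP: the rest of the file is inline HTML.
            segments.push_back({false, source.substr(pos)});
            break;
        }
        if (open > pos)
            segments.push_back({false, source.substr(pos, open - pos)});

        // An unclosed PHP block runs to the end of the file.
        const std::size_t close = source.find(closeTag, open + openTag.size());
        const std::size_t end =
            (close == std::string::npos) ? source.size() : close + closeTag.size();
        segments.push_back({true, source.substr(open, end - open)});
        pos = end;
    }
    return segments;
}
```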

