
Using regular expression wrong

    Anonymous_Banned275
    wrote on last edited by Anonymous_Banned275
    #1

    I am trying to learn and use regular expressions to remove control characters from a QString.
    I am obviously using it wrong because it works in reverse: it removes all the valid ASCII characters.

    Any help would be appreciated.

    Code:

    qDebug() << "stream raw line  \n " << line;
    // apply QReg expression
    line.remove(QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*"));
    qDebug() << "QRegularExpression applied  \n " << line;
    

    output / result

    stream raw line  
      "\u0001\u001B[1;39m\u0002export                                            \u0001\u001B[0m\u0002Print environment variables"
    QRegularExpression applied  
      "\u0001\u001B[1;39\u0002                                            \u0001\u001B[0\u0002  "
    
      JonB
      wrote on last edited by
      #2

      @AnneRanch
      When you asked about this a long time ago, I warned you that you would still be left with bits of the "control sequences" used for ANSI terminals. So even when you get this right you will end up with e.g.

      139mexport  
      

      Is that going to be acceptable to you?

        Chris Kawa
        Lifetime Qt Champion
        wrote on last edited by
        #3

        Yes, you've got it reversed. remove() doesn't take an expression describing what you want as a result. It removes everything that matches, so you have to provide an expression that describes all that is to be removed, not all that is to stay.

        To remove everything but letters, digits and spaces you could use e.g. "[^\\w\\d ]+".
        [^...] means "any character except the ones listed"
        \w is "any word character" (letters, digits and underscore)
        \d is "any digit"
        + means "one or more times"
        Note that you need to double each \ because it's an escape character in C++ string literals.
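To make the point above concrete, here is a minimal sketch of the "describe what to remove" approach. It uses the standard library's std::regex instead of QRegularExpression so that it builds without Qt (an assumption made for this sketch only), and keepWordsAndSpaces is a hypothetical helper name. The pattern leaves out \d because \w already includes digits.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: delete every run of characters that is NOT a word
// character (letter, digit, underscore) or a space.
std::string keepWordsAndSpaces(const std::string &line)
{
    // "\\w" in the C++ source becomes \w for the regex engine.
    static const std::regex unwanted("[^\\w ]+");
    return std::regex_replace(line, unwanted, "");
}
```

Fed the start of the sample line (byte 0x01, ESC, "[1;39m", byte 0x02, "export"), this should leave "139mexport": the control bytes and punctuation go, but the digits and the trailing m of the escape sequence survive. With Qt, the same pattern string could be handed to QRegularExpression and QString::remove().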

          Anonymous_Banned275
          wrote on last edited by
          #4

          @Chris-Kawa Thanks. As mentioned by JonB, it still leaves "some stuff", which is not desirable.

          How is this for a crazy idea:

          remove all ASCII, as at present
          exclusive-OR the original with the removed result
          that should give the original ASCII only

          I am not sure whether it would work, i.e. copy the original ASCII where the "zeroes" are valid.

          Perhaps some additional "conversion" would be needed.

          Is there a QString "exclusive or" function?

            Chris Kawa
            Lifetime Qt Champion
            wrote on last edited by
            #5

            But wouldn't that be doing the work twice? It's easier to just enhance the expression to match the unwanted stuff. I don't know the format of those control characters but I'm sure you can define them as a regexp e.g. if you want to remove \u0001 and the likes it would be something like "\\\\u[\\d]{4}" ( \ followed by letter u followed by 4 digits).

              JonB
              wrote on last edited by
              #6

              @AnneRanch

              \u0001\u001B[1;39m\u0002export
              \u0001\u001B[0m\u0002Print environment variables
              

              In the two examples you gave it appears the "ANSI escape sequence" is enclosed in \u0001 ... \u0002 in both cases. If this is always the case then it's very easy, something like:

              line.remove(QRegularExpression("\\001[^\\002]*\\002"));
              

              ought to do it.

              However, if that is not always the case you would have to write a regular expression to match (so as to remove) all these "ANSI escape sequences". Which are something like:

              <ESC> [ ... <letter>
              

              at least in the cases you show. But you would have to go through and find lots of examples of these in the output you want to parse, as I believe there may be a variety of sequences other than the two you show so far.
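Both ideas from this post can be sketched as follows. This is an illustration under stated assumptions, not a tested Qt solution: it uses std::regex rather than QRegularExpression so it compiles without Qt, the helper names stripDelimited and stripCsi are made up, and the second pattern only covers the "ESC [ parameters letter" shape seen in the two samples, not every ANSI sequence.

```cpp
#include <regex>
#include <string>

// Variant 1: drop everything from byte 0x01 up to the next byte 0x02.
// The control bytes are written as C++ octal escapes, so the regex engine
// receives the raw characters themselves, not a backslash sequence.
std::string stripDelimited(const std::string &line)
{
    static const std::regex wrapped("\001[^\002]*\002");
    return std::regex_replace(line, wrapped, "");
}

// Variant 2: drop sequences shaped like ESC [ digits-and-semicolons letter.
// \033 is the ESC byte; "\\[" reaches the engine as \[ (a literal bracket).
std::string stripCsi(const std::string &line)
{
    static const std::regex csi("\033\\[[0-9;]*[A-Za-z]");
    return std::regex_replace(line, csi, "");
}
```

On the sample data the first variant should leave just the readable text, while the second leaves the 0x01 and 0x02 wrapper bytes behind, so which one fits depends on whether the wrapper is always present. In Qt the patterns could also be written with regex-level escapes, e.g. "\\x01[^\\x02]*\\x02".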

                Anonymous_Banned275
                wrote on last edited by
                #7

                @Chris-Kawa ...doing it twice is OK, and using "exclusive or" would eliminate the need to know the control codes or to figure out the expression (I am basically too lazy to do that...)

                  VRonin
                  wrote on last edited by VRonin
                  #8

                  Try this

                  qDebug() <<"stream raw line  \n " << line ;
                  QString sanitisedLine;
                  // Note: a range-based for over globalMatch() requires Qt 6.0 or later.
                  for (const QRegularExpressionMatch &match : QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*").globalMatch(line))
                      sanitisedLine.append(match.captured(0));
                  qDebug() <<"QRegularExpression applied  \n " << sanitisedLine;
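What the snippet above does is the opposite of remove(): it walks over every match of the pattern and glues the matched pieces together, so the pattern describes what to keep. A rough standard-library equivalent, assuming std::regex in place of QRegularExpression and a made-up function name keepIdentifiers, might look like this:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: keep only the pieces of `line` that match the
// identifier pattern, concatenated in order of appearance.
std::string keepIdentifiers(const std::string &line)
{
    static const std::regex identifier("[a-zA-Z_][a-zA-Z_0-9]*");
    std::string kept;
    // A default-constructed sregex_iterator marks the end of the matches.
    for (std::sregex_iterator it(line.begin(), line.end(), identifier), end;
         it != end; ++it) {
        kept += it->str();  // the whole match, like captured(0) in Qt
    }
    return kept;
}
```

Note that on the sample line this keeps the stray m that ends each escape sequence and drops the spaces between words, so the pattern would still need tuning.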
                  

                    Chris Kawa
                    Lifetime Qt Champion
                    wrote on last edited by
                    #9

                    @JonB With a small caveat that \ is an escape character both in C++ and in regexp, so to have an actual \ character matched you need 4 of those, so "\\\\0001[^\\\\0002]*\\\\0002". Yeah, the trouble we make for ourselves as an industry :P

                      JonB
                      wrote on last edited by JonB
                      #10

                      @Chris-Kawa
                      I'm intending to pass \001 & \002 like that to the regular expression, then let it handle them. I think it will treat each as a character given by its number. Now that you make me think about it, I'm wondering where I got that idea from....?

                      You are going to pass \\0001. What do you think that is going to do/be parsed as in reg exp?

                      Let's be clear: the OP's output like:

                      \u0001\u001B
                      

                      is representing ASCII-char-1 and ASCII-char-27 (i.e. "Escape") bytes in that output, are we agreed?

                      Maybe modern reg exps even accept \u0001 as a (Unicode??) character entity, I don't know?

                        Chris Kawa
                        Lifetime Qt Champion
                        wrote on last edited by
                        #11

                        @JonB Ah, fair enough. I thought \u0001 is an actual string (6 characters) and not a single character.

                          JonB
                          wrote on last edited by JonB
                          #12

                          @Chris-Kawa
                          No, these are byte representations. Like:

                          \u0001\u001B[1;39m\u0002export
                          

                          From past threads, the OP is obtaining this from something like the output of a program running, or intended to run, in a terminal.

                          I happen to know that there is an ANSI terminal escape sequence like:

                          Esc [ parameter ; parameter m

                          which is "Select Graphic Rendition": it sets text attributes and colours rather than moving the cursor (1 is bold, 39 is the default foreground colour, 0 resets everything). \u001B == 27 decimal == Escape char.

                          All this stuff can be found in table at https://en.wikipedia.org/wiki/ANSI_escape_code#CSIsection

                            JonB
                            wrote on last edited by JonB
                            #13

                            @Chris-Kawa
                            You raise a good question though. I'm not sure whether QRegularExpression will interpret my \001 as I intended.

                            How would you write the QRegularExpression to match characters like ASCII-1 or ASCII-27? I haven't kept up with how to represent that in reg exps nowadays. Maybe it's actually \u0001 & \u001B; is that a single (Unicode?) character escape recognised by QRegularExpression?

                            UPDATE
                            I just looked on https://regex101.com/ and it does say

                            \ddd

                            Matches the 8-bit character with the given octal value.

                            so I think my original dim recollection for using \001 & \002 may have been right/OK after all :)
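The two layers of escaping discussed in the last few posts can be checked with plain C++ strings, no regex engine needed. In this small sketch (the function names are invented for illustration) the first literal is turned into a single control byte by the compiler, while the second one hands a backslash and three digits to whatever consumes the string, which is what a regex engine would then read as the octal escape quoted above.

```cpp
#include <string>

// One character: the compiler itself converts the octal escape \001
// into the single byte 0x01 before any regex engine sees the string.
inline std::string rawControlChar()
{
    return "\001";
}

// Four characters: backslash, '0', '0', '1'. The doubled backslash is a
// C++ escape for one literal backslash, so a regex engine would receive
// the text \001 and interpret it according to its own rules.
inline std::string escapedForRegex()
{
    return "\\001";
}
```

Either form can end up matching the byte 0x01; the difference is only which layer does the translation.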

                              Anonymous_Banned275
                              wrote on last edited by
                              #14

                              @VRonin

                              I am missing something here; I do not understand the error.

                              [screenshot: 6ec658f0-4a0b-4ee7-8125-28777a12747f-image.png]

                              I need to read up on QRegularExpressionMatch, but I think you are on the right track...

                              Would you kindly explain in a few words what the code is doing?
                              I think that would help me...

                                JonB
                                wrote on last edited by JonB
                                #15

                                @AnneRanch

                                I am missing something here , I do not understand the error .

                                https://doc.qt.io/qt-6/qregularexpressionmatchiterator.html#details

                                Starting with Qt 6.0, it is also possible to simply use the result of QRegularExpression::globalMatch in a range-based for loop, for instance like this:
                                ...
                                for (const QRegularExpressionMatch &match : re.globalMatch(subject)) {

                                Are you using Qt6 or Qt5?

                                  Anonymous_Banned275
                                  wrote on last edited by Anonymous_Banned275
                                  #16

                                  I hope this post does not distract from the discussion.

                                  1. I believe the whole concept of "searching for individual ASCII characters" was misleading. I have been there before, and using "words" (\w) should have made more sense from the start.

                                  2. The code snippet is a work in progress, hence it has some stuff not really needed at this point.

                                  3. As seen, I can retrieve a "word" list, but I am stumped on how to get a QString, not a list:

                                  SOLVED
                                  QString test = match.captured();
                                  qDebug() <<"match name from ( list ) " << test;

                                  Code

                                  line = stream.readLine();
                                  qDebug() << "stream raw line  \n " << line;

                                  // extract the words
                                  QRegularExpression re("(\\w+)");
                                  QString subject(line);
                                  QRegularExpressionMatchIterator i = re.globalMatch(subject);
                                  while (i.hasNext()) {
                                      QRegularExpressionMatch match = i.next();
                                      qDebug() << "match     " << match;

                                      // THIS SORT OF WORKS (but gives a list: whole match plus capture group 1)
                                      qDebug() << "match   list  " << match.capturedTexts();

                                      // HOW TO GET AN INDIVIDUAL QSTRING HERE?
                                      // (see SOLVED above: match.captured())
                                  }
                                  
                                  
                                  

                                  Output

                                  Stream file 
                                  Stream file ArrayIndex  0
                                  stream raw line  
                                    "\u0001\u001B[1;39m\u0002Menu main:\u0001\u001B[0m\u0002"
                                  match      QRegularExpressionMatch(Valid, has match: 0:(3, 4, "1"), 1:(3, 4, "1"))
                                  match   list   ("1", "1")
                                  match      QRegularExpressionMatch(Valid, has match: 0:(5, 8, "39m"), 1:(5, 8, "39m"))
                                  match   list   ("39m", "39m")
                                  match      QRegularExpressionMatch(Valid, has match: 0:(9, 13, "Menu"), 1:(9, 13, "Menu"))
                                  match   list   ("Menu", "Menu")
                                  match      QRegularExpressionMatch(Valid, has match: 0:(14, 18, "main"), 1:(14, 18, "main"))
                                  match   list   ("main", "main")
                                  match      QRegularExpressionMatch(Valid, has match: 0:(22, 24, "0m"), 1:(22, 24, "0m"))
                                  match   list   ("0m", "0m")
                                  QRegularExpression remove ascii applied  
                                    "\u0001\u001B[1;39\u0002 :\u0001\u001B[0\u0002"
                                  single character DONE 
                                  
                                    Anonymous_Banned275
                                    wrote on last edited by
                                    #17

                                    I am trying to simplify the process.

                                    This regular expression works and removes all the control codes:

                                    QString result = inString.remove(QRegularExpression("[^\\w\\d ]+"));
                                    qDebug() << "QRegularExpression remove ascii applied \n " << result;

                                    This regular expression DOES NOT WORK.
                                    I get a run-time error:

                                    QString::replace: invalid QRegularExpression object

                                    It is supposed to remove all the control codes:

                                    result = inString.remove(QRegularExpression("[^\\u0000-\\u007F]+"));
                                    qDebug() << "QRegularExpression remove ascii applied  \n " << result;

                                    return result;

                                      Christian Ehrlicher
                                      Lifetime Qt Champion
                                      wrote on last edited by
                                      #18

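                                      @AnneRanch said in "Using regular expression wrong":

                                      This regular expression DOES NOT WORK

                                      Because \u0000 and \u007F are not valid in PCRE, the engine behind QRegularExpression: it has no \uXXXX escape, and a code point is written as \x{007F} instead -> https://www.regular-expressions.info/unicode.html#codepoint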


                                        JonB
                                        wrote on last edited by JonB
                                        #19

                                        @AnneRanch
                                        As @Christian-Ehrlicher has said.

                                        That should be QRegularExpression("[^\\000-\\177]+")

                                        However it will not do what you intend. It will remove all ASCII characters, as the comment said, and return an empty string.

                                        I suspect you are wanting to try:

                                        result  = inString.remove(QRegularExpression("[^\\000-\\037]+"));
                                        

                                        which will remove just the characters you have which are non-ASCII-printable control characters.
                                        Your \u0001\u001B[1;39m\u0002export should result in [1;39mexport.

                                          Anonymous_Banned275
                                          wrote on last edited by
                                          #20

                                          I am not sure linking to other forums is OK, but here is a part of it.

                                          I am trying to port the C# code to C++, and this reference claims that
                                          the "control characters" are identified as "[^\u0000-\u007F]"

                                          and that is my objective: "remove" all control characters.

                                          And this removes ASCII, not control characters:

                                          QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));

                                          and that has been my issue since I started this: removing control characters using this expression, "[^\\000-\\037]+".

                                          I think I am not using "remove" and plain "match the expression" correctly.

                                          https://stackoverflow.com/questions/24229262/match-non-printable-non-ascii-characters-and-remove-from-text
                                          public static string RemoveTroublesomeCharacters(string inString)
                                          {
                                          if (inString == null)
                                          {
                                          return null;
                                          }

                                          else
                                          {
                                              char ch;
                                              Regex regex = new Regex(@"[^\u0000-\u007F]", RegexOptions.IgnoreCase);
                                              Match charMatch = regex.Match(inString);
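The behaviour described in this post comes down to the leading ^ inside the brackets, which negates the class. A class such as [^\000-\037] matches everything except the control characters, so removing its matches deletes the readable text; without the ^ it matches the control characters themselves. Likewise, [^\u0000-\u007F] in the C# snippet matches non-ASCII characters, not control characters. The sketch below illustrates the difference with std::regex instead of QRegularExpression (so it builds without Qt); the function names are invented, and the range starts at 0x01 rather than 0x00 because an embedded NUL would cut the C string literal short.

```cpp
#include <regex>
#include <string>

// Matches runs of control characters 0x01..0x1F and deletes them.
std::string removeControlChars(const std::string &text)
{
    static const std::regex controls("[\001-\037]+");
    return std::regex_replace(text, controls, "");
}

// Same range but negated with ^: matches everything EXCEPT the control
// characters, so the readable text is what gets deleted.
std::string removeAllButControlChars(const std::string &text)
{
    static const std::regex notControls("[^\001-\037]+");
    return std::regex_replace(text, notControls, "");
}
```

For the sample prefix (0x01, ESC, "[1;39m", 0x02, "export") the first function should give "[1;39mexport" and the second only the three control bytes. Getting rid of the leftover "[1;39m" as well would need a pattern for the escape sequence itself.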
                                          