using regular expression wrong

Anonymous_Banned275

I am trying to learn and use regular expressions to remove control characters from a QString.
I am obviously using them wrong because it works in "reverse": it removes all the valid ASCII characters.

Any help would be appreciated.

code

                qDebug() << "stream raw line  \n " << line;
                // apply the regular expression
                line.remove(QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*"));
                qDebug() << "QRegularExpression applied  \n " << line;

output / result

stream raw line  
  "\u0001\u001B[1;39m\u0002export                                            \u0001\u001B[0m\u0002Print environment variables"
QRegularExpression applied  
  "\u0001\u001B[1;39\u0002                                            \u0001\u001B[0\u0002  "

JonB

@AnneRanch
When you asked about this a long time ago, I warned you that you would still be left with bits of the "control sequences" used for ANSI terminals. So even when you have this right you will end up with e.g.

139mexport

Is that going to be acceptable to you?

Chris Kawa

Yes, you've got it reversed. remove doesn't take an expression that you want as a result. It removes everything that matches, so you have to provide an expression that describes all that is to be removed, not all that is to stay.

To remove everything but letters, digits and spaces you could use e.g. "[^\\w\\d ]+".
^ means "everything but"
\w is "any word character"
\d is "any digit"
+ means "one or more times"
Note that you need to use double \ because it's an escape character in C++ strings.
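
A minimal sketch of the "describe what is to be removed" idea, written with the standard library's std::regex so it compiles without Qt (an assumption made for portability; in Qt the equivalent call is line.remove(QRegularExpression(...)) with the same pattern). The function name keepWordsAndSpaces is hypothetical. Since \w already covers digits and the underscore, the sketch leaves \d out of the class.

```cpp
#include <regex>
#include <string>

// Delete every run of characters that is NOT a word character or a space.
// std::regex_replace with an empty replacement stands in for QString::remove.
std::string keepWordsAndSpaces(const std::string &line)
{
    // "\\w" in the C++ literal becomes \w in the regular expression.
    static const std::regex unwanted("[^\\w ]+");
    return std::regex_replace(line, unwanted, "");
}
```

Applied to the line from the first post, the digits and the final letter of each escape sequence survive, which is the leftover JonB warned about.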

Anonymous_Banned275

@Chris-Kawa Thanks. As mentioned by JonB, it still leaves "some stuff", which is not desirable.

How is this for a crazy idea:

remove all ASCII, as happens at present
exclusive-OR the original with the result of the removal
that should give the original ASCII only

Not sure if it would work / copy the original ASCII where the "zeroes" are valid.

Perhaps some additional "conversion" would be needed.

Is there a QString "exclusive or" function?

Chris Kawa

But wouldn't that be doing the work twice? It's easier to just enhance the expression to match the unwanted stuff. I don't know the format of those control characters but I'm sure you can define them as a regexp e.g. if you want to remove \u0001 and the likes it would be something like "\\\\u[\\d]{4}" ( \ followed by letter u followed by 4 digits).

JonB

@AnneRanch

\u0001\u001B[1;39m\u0002export
\u0001\u001B[0m\u0002Print environment variables

In the two examples you gave it appears the "ANSI escape sequence" is enclosed in \u0001 ... \u0002 in both cases. If this is always the case then it's very easy, something like:

line.remove(QRegularExpression("\\001[^\\002]*\\002"));

ought to do it.
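
A hedged sketch of that \001 ... \002 approach, using std::regex instead of QRegularExpression so that it builds without Qt (an assumption; the function name removeBracketedSequences is hypothetical). Here the bytes 1 and 2 are placed in the pattern directly through C++ octal escapes, so the question of how the regex engine itself parses \001 does not arise.

```cpp
#include <regex>
#include <string>

// Remove everything from a byte 1 up to and including the next byte 2.
std::string removeBracketedSequences(const std::string &line)
{
    // "\001" and "\002" are C++ octal escapes: the pattern text holds the raw
    // bytes 1 and 2, and the regex engine treats them as ordinary characters.
    static const std::regex bracketed("\001[^\002]*\002");
    return std::regex_replace(line, bracketed, "");
}
```

This only works as long as every escape sequence in the input really is wrapped in that pair of marker bytes.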

However, if that is not always the case you would have to write a regular expression to match (so as to remove) all these "ANSI escape sequences". Which are something like:

<ESC> [ ... <letter>

at least in the cases you show. But you would have to go through and find lots of examples of these in the output you want to parse, as I believe there may be a variety of sequences other than the two you show so far.
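
For the more general case, here is a sketch that matches the "<ESC> [ ... <letter>" shape directly. It assumes the input only contains CSI sequences (ESC, '[', parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, one final byte 0x40-0x7E); other kinds of escape sequence would need their own patterns. It uses std::regex as a stand-in for QRegularExpression, and stripAnsi is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Strip CSI escape sequences, then any control characters that are left over.
std::string stripAnsi(const std::string &line)
{
    // ESC, a literal '[', parameter bytes, intermediate bytes, one final byte.
    static const std::regex csi("\033\\[[0-?]*[ -/]*[@-~]");
    // Remaining C0 control characters (such as the 1 / 2 markers) and DEL.
    static const std::regex controls("[\001-\037\177]");
    return std::regex_replace(std::regex_replace(line, csi, ""), controls, "");
}
```

Removing the whole sequence first avoids the leftover "[1;39m" text that appears when only the control bytes are deleted.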

Anonymous_Banned275

@Chris-Kawa ...doing it twice is OK, and using "exclusive or" would eliminate the need to know the control codes or to figure out the expression (I am basically too lazy to do that...)

VRonin

Try this

qDebug() <<"stream raw line  \n " << line ;
QString sanitisedLine;
for (const QRegularExpressionMatch &match : QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*").globalMatch(line))
    sanitisedLine.append(match.captured(0));
qDebug() <<"QRegularExpression applied  \n " << sanitisedLine;
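
The loop above takes the opposite route to remove(): it keeps the original pattern and collects whatever it matches. A rough equivalent with std::regex (an assumption, so that it compiles without Qt; collectIdentifiers is a hypothetical name) looks like this:

```cpp
#include <regex>
#include <string>

// Concatenate every identifier-like match found in the line.
std::string collectIdentifiers(const std::string &line)
{
    static const std::regex identifier("[a-zA-Z_][a-zA-Z_0-9]*");
    std::string kept;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(line.begin(), line.end(), identifier), end; it != end; ++it)
        kept += it->str();  // str() is the whole match, like captured(0)
    return kept;
}
```

With this particular pattern the spaces between words are not matched, so they are dropped, and the final "m" of each escape sequence counts as an identifier and is kept.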

Chris Kawa

@JonB With a small caveat that \ is an escape sequence both in C++ and in regexp, so to have an actual \ character matched you need 4 of those, so "\\\\0001[^\\\\0002]*\\\\0002". Yeah, the trouble we make for ourselves as an industry :P

JonB

@Chris-Kawa
I'm intending to pass \001 & \002 like that to regular expression. Then let it handle it. Which I think it will treat as number-character. Now that you make me think about that I'm wondering where I got that idea from....?

You are going to pass \\0001. What do you think that is going to do/be parsed as in reg exp?

Let's be clear: the OP's output like:

\u0001\u001B

is representing ASCII-char-1 and ASCII-char-27 (i.e. "Escape") bytes in that output, are we agreed?

Maybe modern reg exps even accept \u0001 as a (Unicode??) character entity, I don't know?

Chris Kawa

@JonB Ah, fair enough. I thought \u0001 is an actual string (6 characters) and not a single character.

JonB

@Chris-Kawa
No, these are byte representations. Like:

\u0001\u001B[1;39m\u0002export

From the past, the OP is obtaining from something like the output of a program running, or intended to run, in a terminal.

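I happen to know that there is an ANSI terminal escape sequence like:

Esc [ number ; number m

which sets text attributes and colours ("Select Graphic Rendition": 1 is bold, 39 is the default foreground colour, 0 resets everything); moving the cursor to a row and column uses the same shape with a final H instead of m. \u001B == 27 decimal == Escape char.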

All this stuff can be found in table at https://en.wikipedia.org/wiki/ANSI_escape_code#CSIsection

JonB

@Chris-Kawa
You raise a good question though. I'm not sure whether QRegularExpression will interpret my \001 as I intended.

How would you write the QRegularExpression to include matching characters like ASCII-1 or ASCII-27? I haven't kept up with how to represent that in reg exps nowadays. Maybe it's actually \u0001 & \u001B; is that a single (Unicode?) char sequence recognised in QRegularExpression?

UPDATE
I just looked on https://regex101.com/ and it does say

\ddd

Matches the 8-bit character with the given octal value.

so I think my original dim recollection for using \001 & \002 may have been right/OK after all :)

Anonymous_Banned275

@VRonin

I am missing something here , I do not understand the error .

I need to read-up on QRegularExpressionMatch - but I think you are on right track...

Would you kindly explain in few words what the code is doing ?
I think that would help me...

JonB

@AnneRanch

I am missing something here , I do not understand the error .

https://doc.qt.io/qt-6/qregularexpressionmatchiterator.html#details

Starting with Qt 6.0, it is also possible to simply use the result of QRegularExpression::globalMatch in a range-based for loop, for instance like this:
...
for (const QRegularExpressionMatch &match : re.globalMatch(subject)) {

Are you using Qt6 or Qt5?

Anonymous_Banned275

I hope this post does not distract from the discussion.

I believe the whole concept of "searching for individual ASCII characters" was misleading. I have been there before, and using "words" ("\w") should have made more sense from the start.
The code snippet is a work in progress, hence has some stuff not really needed at this point.
As seen, I can retrieve the "word" list, but I am stumped on how to get a QString, not a "list":

SOLVED
QString test = match.captured();
qDebug() <<"match name from ( list ) " << test;

Code

line = stream.readLine();
//qDebug() << "Stream raw line  ";
qDebug() << "stream raw line  \n " << line;

// extract the words
QRegularExpression re("(\\w+)");
QString subject(line);
QString *capture_name;  // never initialised, so *capture_name below cannot be used
QRegularExpressionMatchIterator i = re.globalMatch(subject);
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    //  qDebug() << "match (next)     " << i.next();
    qDebug() << "match     " << match;

    // THIS SORT OF WORKS
    qDebug() << "match   list  " << match.capturedTexts();

    // HOW TO GET INDIVIDUAL QSTRING HERE ?????
    // qDebug() << "match  name ( from  list )  " << match.captured(*capture_name);
}

Output

Stream file 
Stream file ArrayIndex  0
stream raw line  
  "\u0001\u001B[1;39m\u0002Menu main:\u0001\u001B[0m\u0002"
match      QRegularExpressionMatch(Valid, has match: 0:(3, 4, "1"), 1:(3, 4, "1"))
match   list   ("1", "1")
match      QRegularExpressionMatch(Valid, has match: 0:(5, 8, "39m"), 1:(5, 8, "39m"))
match   list   ("39m", "39m")
match      QRegularExpressionMatch(Valid, has match: 0:(9, 13, "Menu"), 1:(9, 13, "Menu"))
**match   list   ("Menu", "Menu")**
match      QRegularExpressionMatch(Valid, has match: 0:(14, 18, "main"), 1:(14, 18, "main"))
**match   list   ("main", "main")**
match      QRegularExpressionMatch(Valid, has match: 0:(22, 24, "0m"), 1:(22, 24, "0m"))
match   list   ("0m", "0m")
QRegularExpression remove ascii applied  
  "\u0001\u001B[1;39\u0002 :\u0001\u001B[0\u0002"
single character DONE
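
Each list in the output above has two identical entries because index 0 is the whole match and index 1 is the single parenthesised group in "(\\w+)", which here covers the whole match. A small sketch of the same extraction with std::regex (used as a stand-in so it builds without Qt; extractWords is a hypothetical name) picks out one string per match:

```cpp
#include <regex>
#include <string>
#include <vector>

// Return each word found in the line as its own string.
std::vector<std::string> extractWords(const std::string &line)
{
    static const std::regex word("(\\w+)");
    std::vector<std::string> words;
    for (std::sregex_iterator it(line.begin(), line.end(), word), end; it != end; ++it) {
        const std::smatch &m = *it;
        // m[0] is the whole match and m[1] the first group; identical here.
        words.push_back(m[1].str());
    }
    return words;
}
```

In the Qt code the corresponding call is match.captured(), as in the SOLVED snippet.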

Anonymous_Banned275

I am trying to simplify the process

This regular expression works and removes all control codes:

QString result = inString.remove(QRegularExpression("[^\\w\\d ]+"));
qDebug() <<"QRegularExpression remove ascii applied \n " << result;

This regular expression DOES NOT WORK.
I get a run-time error:

QString::replace: invalid QRegularExpression object

It supposedly removes all control codes:

result  = inString.remove(QRegularExpression("[^\\u0000-\\u007F]+"));
qDebug() <<"QRegularExpression remove ascii applied  \n " << result;

return result;

Christian Ehrlicher

@AnneRanch said in using regular expression wrong:

This regular expression DOES NOT WORK

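Because \u0000 and \u007F are not valid escapes in PCRE, the engine behind QRegularExpression; it writes code points as \x{0000} and \x{007F} instead -> https://www.regular-expressions.info/unicode.html#codepoint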

JonB

@AnneRanch
As @Christian-Ehrlicher has said.

That should be QRegularExpression("[^\\000-\\177]+")

However it will not do what you intend. Because of the ^ it removes every character outside the ASCII range, and your control characters (\u0001, \u001B, \u0002) are inside that range, so they will all stay.

I suspect you are wanting to try:

result  = inString.remove(QRegularExpression("[\\000-\\037]+"));

which, with no ^ inside the brackets, will remove just the characters you have which are non-printable ASCII control characters.
Your \u0001\u001B[1;39m\u0002export should result in [1;39mexport.
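
To make the role of the ^ concrete, here is a hedged sketch of the same character class with and without it, using std::regex as a stand-in for QRegularExpression (both function names are hypothetical). The range starts at \001 rather than \000 only because a NUL byte would end the C string that holds the pattern.

```cpp
#include <regex>
#include <string>

// Class without ^: matches the control characters themselves, so they are removed.
std::string removeControls(const std::string &s)
{
    static const std::regex controls("[\001-\037]+");
    return std::regex_replace(s, controls, "");
}

// Class with ^: matches everything EXCEPT control characters, so only they remain.
std::string removeAllButControls(const std::string &s)
{
    static const std::regex notControls("[^\001-\037]+");
    return std::regex_replace(s, notControls, "");
}
```

The first variant turns the sample into [1;39mexport; the second leaves nothing but the three control bytes, which is the "reverse" behaviour described at the start of the thread.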

Anonymous_Banned275

I am not sure linking to other forums is OK, but here is a part of it.

I am trying to port the C# code to C++, and this reference claims that
the "control characters" are identified as "[^\u0000-\u007F]"

and that is my objective: "remove" all control characters.

And this removes ASCII, not control characters:

QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));

and that has been my issue since I started this: removing control characters using the expression "[^\000-\037]+".

I think I am not using "remove" and plain "match the expression" correctly.

https://stackoverflow.com/questions/24229262/match-non-printable-non-ascii-characters-and-remove-from-text
public static string RemoveTroublesomeCharacters(string inString)
{
if (inString == null)
{
return null;
}

else
{
    char ch;
    Regex regex = new Regex(@"[^\u0000-\u007F]", RegexOptions.IgnoreCase);
    Match charMatch = regex.Match(inString);