Using regular expression wrong
-
I am trying to learn and use regular expressions to remove control characters from a QString.
I am obviously using it wrong because it works in "reverse": it removes all valid ASCII characters. Any help would be appreciated.
code
qDebug() << "stream raw line \n " << line;
// apply the regular expression
line.remove(QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*"));
qDebug() << "QRegularExpression applied \n " << line;
output / result
stream raw line
"\u0001\u001B[1;39m\u0002export \u0001\u001B[0m\u0002Print environment variables"
QRegularExpression applied
"\u0001\u001B[1;39\u0002 \u0001\u001B[0\u0002 "
-
@AnneRanch
When you asked about this a long time ago, I warned you that you would still be left with bits of the "control sequences" used for ANSI terminals. So even when you have this right, you will end up with e.g. 139mexport
Is that going to be acceptable to you?
-
Yes, you've got it reversed. remove doesn't take an expression describing what you want as a result. It removes everything that matches, so you have to provide an expression that describes everything that is to be removed, not everything that is to stay.
To remove everything but letters, digits and spaces you could use e.g. "[^\\w\\d ]+".
^ (at the start of a character class) means "everything but"
\w is "any word character"
\d is "any digit"
+ means "one or more times"
Note that you need to double each \ because it's an escape character in C++ strings.
-
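To make the "remove what matches" point concrete, here is a minimal sketch. It uses std::regex from the C++ standard library as a stand-in for QRegularExpression so that it compiles without Qt; that substitution, and the helper name removeMatches, are assumptions of this sketch rather than anything from the posts above. The two engines differ in details, but the character classes used here should behave the same way.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: delete every substring of `text` that matches `pattern`.
// This mirrors the idea behind QString::remove(QRegularExpression): the
// pattern describes what goes away, not what stays.
std::string removeMatches(const std::string &text, const std::string &pattern)
{
    return std::regex_replace(text, std::regex(pattern), "");
}

// The sample line from the question, written with octal escapes
// (\001 = char 1, \033 = Escape, \002 = char 2).
const std::string kSampleLine =
    "\001\033[1;39m\002export \001\033[0m\002Print environment variables";

// Original attempt: the pattern matches identifiers, so identifiers are removed.
std::string originalAttempt()
{
    return removeMatches(kSampleLine, R"([a-zA-Z_][a-zA-Z_0-9]*)");
}

// Negated class: removes everything that is NOT a word character, digit or space.
std::string negatedClass()
{
    return removeMatches(kSampleLine, R"([^\w\d ]+)");
}
```

With the sample line, negatedClass() keeps the text but also keeps the digits and the trailing m of each escape sequence, which is the leftover "139mexport" mentioned earlier in the thread.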
@Chris-Kawa Thanks. As mentioned by JonB, it still leaves "some stuff". That is not desirable.
How is this for a crazy idea:
remove all ASCII, as at present
exclusive-OR the original with the removed result
that should give the original ASCII only
Not sure if it would work / copy the original ASCII where "zeroes" are valid.
Perhaps some additional "conversion" would be needed.
Is there a QString "exclusive or" function?
-
But wouldn't that be doing the work twice? It's easier to just enhance the expression to match the unwanted stuff. I don't know the format of those control characters, but I'm sure you can define them as a regexp. E.g. if you want to remove \u0001 and the like, it would be something like "\\\\u[\\d]{4}" (a \ followed by the letter u followed by 4 digits).
-
@AnneRanch
\u0001\u001B[1;39m\u0002export \u0001\u001B[0m\u0002Print environment variables
In the two examples you gave, it appears the "ANSI escape sequence" is enclosed in \u0001 ... \u0002 in both cases. If this is always the case then it's very easy. Something like:
line.remove(QRegularExpression("\\001[^\\002]*\\002"));
ought to do it.
However, if that is not always the case, you would have to write a regular expression to match (so as to remove) all these "ANSI escape sequences", which are something like:
<ESC> [ ... <letter>
at least in the cases you show. But you would have to go through and find lots of examples of these in the output you want to parse, as I believe there may be a variety of sequences other than the two you show so far.
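The two approaches described above can be sketched roughly as follows. This is only an illustration under stated assumptions: std::regex replaces QRegularExpression so the code builds without Qt, the function names are made up, and the second pattern only covers the simple "ESC [ digits and semicolons, then one letter" form seen in the two examples. Real terminal output may contain other sequences.

```cpp
#include <regex>
#include <string>

// Approach 1: drop everything from char 1 up to and including the next char 2.
// Single backslashes here: the C++ compiler turns \001 and \002 into the raw
// control characters before the regex engine ever sees the pattern.
std::string stripWrappedSequences(const std::string &text)
{
    static const std::regex wrapped("\001[^\002]*\002");
    return std::regex_replace(text, wrapped, "");
}

// Approach 2: drop "ESC [ parameters letter" wherever it occurs.
// \033 is the raw Escape character; "\\[" reaches the regex engine as \[,
// i.e. a literal opening bracket.
std::string stripCsiSequences(const std::string &text)
{
    static const std::regex csi("\033\\[[0-9;]*[A-Za-z]");
    return std::regex_replace(text, csi, "");
}
```

On the sample line, the first function leaves only the readable text, while the second leaves the char 1 / char 2 markers in place, so it would need a further pass if those markers are unwanted.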
-
@Chris-Kawa ...doing it twice is OK, and using "exclusive or" would eliminate the need to know the control codes or to figure out the expression (I am basically too lazy to do that...)
-
Try this
qDebug() << "stream raw line \n " << line;
QString sanitisedLine;
for (const QRegularExpressionMatch &match : QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*").globalMatch(line))
    sanitisedLine.append(match.captured(0));
qDebug() << "QRegularExpression applied \n " << sanitisedLine;
-
@JonB With a small caveat that \ is an escape character both in C++ and in regexp, so to have an actual \ character matched you need 4 of those, so "\\\\0001[^\\\\0002]*\\\\0002". Yeah, the trouble we make for ourselves as an industry :P
-
@Chris-Kawa
I'm intending to pass \001 & \002 like that to the regular expression, then let it handle them, which I think it will treat as character numbers. Now that you make me think about it, I'm wondering where I got that idea from...?
You are going to pass \\0001. What do you think that is going to do/be parsed as in a reg exp?
Let's be clear: the OP's output like \u0001\u001B is representing ASCII-char-1 and ASCII-char-27 (i.e. "Escape") bytes in that output, are we agreed?
Maybe modern reg exps even accept \u0001 as a (Unicode??) character entity, I don't know?
-
@JonB Ah, fair enough. I thought \u0001 was an actual string (6 characters) and not a single character.
-
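The misunderstanding above comes down to two layers of escaping: the C++ compiler processes the string literal first, and the regex engine then processes whatever is left. As far as I know, qDebug() also prints non-printable characters inside a QString as \uXXXX, which is why a single control character shows up as six visible characters in the output. A small sketch in standard C++ (std::regex instead of QRegularExpression, and hypothetical function names) of what each spelling turns into:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// "\001"  : the compiler produces ONE character with code 1.
// "\\001" : the compiler produces FOUR characters: backslash, '0', '0', '1'.
//           A regex engine then has to interpret that escape itself.
std::size_t lengthOfSingleEscape() { return std::string("\001").size(); }
std::size_t lengthOfDoubleEscape() { return std::string("\\001").size(); }

// True if `text` contains the six visible characters of a \uXXXX spelling
// (a backslash, the letter u and four hex digits). The four backslashes in
// the literal reach the regex engine as two, meaning "a literal backslash".
bool containsLiteralBackslashU(const std::string &text)
{
    static const std::regex literalForm("\\\\u[0-9A-Fa-f]{4}");
    return std::regex_search(text, literalForm);
}

// True if `text` contains the single control character with code 1.
// The regex engine receives \x01 and resolves the escape itself.
bool containsControlCharOne(const std::string &text)
{
    static const std::regex controlChar("\\x01");
    return std::regex_search(text, controlChar);
}
```

So a pattern with four backslashes only helps when the text really contains a visible backslash followed by u, which is not the case for the line in the question.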
@Chris-Kawa
No, these are byte representations. Like:
\u0001\u001B[1;39m\u0002export
From past threads, the OP is obtaining this from something like the output of a program running, or intended to run, in a terminal.
I happen to know that there is an ANSI terminal escape sequence like:
Esc [ parameter ; parameter m
which sets text attributes such as bold and colour (cursor movement to row-col uses a final H instead of m). \u001B == 27 decimal == Escape char.
All this stuff can be found in the table at https://en.wikipedia.org/wiki/ANSI_escape_code#CSIsection
-
@Chris-Kawa
You raise a good question though. I'm not sure whether QRegularExpression will interpret my \001 as I intended.
How would you write the QRegularExpression to include matching characters like ASCII-1 or ASCII-27? I haven't kept up with how to represent that in reg exps nowadays. Maybe it's actually \u0001 & \u001B; is that a single (Unicode?) char sequence recognised in QRegularExpression??
UPDATE
I just looked on https://regex101.com/ and it does say:
\ddd
Matches the 8-bit character with the given octal value.
so I think my original dim recollection for using \001 & \002 may have been right/OK after all :)
-
I am missing something here; I do not understand the error.
I need to read up on QRegularExpressionMatch, but I think you are on the right track...
Would you kindly explain in a few words what the code is doing?
I think that would help me...
-
@AnneRanch
I am missing something here , I do not understand the error .
https://doc.qt.io/qt-6/qregularexpressionmatchiterator.html#details
Starting with Qt 6.0, it is also possible to simply use the result of QRegularExpression::globalMatch in a range-based for loop, for instance like this:
...
for (const QRegularExpressionMatch &match : re.globalMatch(subject)) {
Are you using Qt6 or Qt5?
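As for what the "Try this" code is doing: instead of removing what matches, it walks over every match and appends the matched text to a new string, so only the matches survive. A rough equivalent in standard C++ is sketched below. std::sregex_iterator plays the role of the Qt match iterator here, and the function name is hypothetical. With Qt 5, the same loop would presumably be written with QRegularExpressionMatchIterator and hasNext()/next() instead of the range-based for.

```cpp
#include <regex>
#include <string>

// Keep only the identifier-like pieces of `line`: visit every match of the
// pattern and append the matched text to a fresh string. Everything between
// the matches (control characters, digits, punctuation, spaces) is dropped.
std::string keepIdentifiers(const std::string &line)
{
    static const std::regex identifier("[a-zA-Z_][a-zA-Z_0-9]*");
    std::string sanitised;
    for (std::sregex_iterator it(line.begin(), line.end(), identifier), end;
         it != end; ++it) {
        sanitised += it->str(0); // group 0 is the whole match
    }
    return sanitised;
}
```

Note that this approach also drops the spaces between words and still keeps the m that ends each escape sequence, so on its own it does not give a clean line either.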
-
I hope this post does not distract from the discussion.
-
I believe the whole concept of "searching for individual ASCII characters" was misleading. I have been there before, and using "words" (\w) should have made more sense from the start.
-
The code snippet is a work in progress, hence it has some stuff not really needed at this point.
-
As seen, I can retrieve the "word" list, but I am stumped on how to get a QString, not a list.
SOLVED
QString test = match.captured();
qDebug() << "match name from ( list ) " << test;
Code
line = stream.readLine();
qDebug() << "stream raw line \n " << line;
// extract the words
QRegularExpression re("(\\w+)");
QString subject(line);
QString *capture_name; // = " "; (only used by the commented-out line below)
QRegularExpressionMatchIterator i = re.globalMatch(subject);
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    qDebug() << "match " << match;                       // THIS SORT OF WORKS
    qDebug() << "match list " << match.capturedTexts();
    // HOW TO GET INDIVIDUAL QSTRING HERE ?????
    // qDebug() << "match name ( from list ) " << match.captured(*capture_name);
}
Output
Stream file
Stream file
ArrayIndex 0
stream raw line
"\u0001\u001B[1;39m\u0002Menu main:\u0001\u001B[0m\u0002"
match QRegularExpressionMatch(Valid, has match: 0:(3, 4, "1"), 1:(3, 4, "1"))
match list match.captured( ("1", "1")
match QRegularExpressionMatch(Valid, has match: 0:(5, 8, "39m"), 1:(5, 8, "39m"))
match list ("39m", "39m")
match QRegularExpressionMatch(Valid, has match: 0:(9, 13, "Menu"), 1:(9, 13, "Menu"))
**match list ("Menu", "Menu")**
match QRegularExpressionMatch(Valid, has match: 0:(14, 18, "main"), 1:(14, 18, "main"))
**match list ("main", "main")**
match QRegularExpressionMatch(Valid, has match: 0:(22, 24, "0m"), 1:(22, 24, "0m"))
match list ("0m", "0m")
QRegularExpression remove ascii applied
"\u0001\u001B[1;39\u0002 :\u0001\u001B[0\u0002"
single character DONE
-
I am trying to simplify the process.
This regular expression works and removes all control codes:
QString result = inString.remove(QRegularExpression("[^\\w\\d ]+"));
qDebug() << "QRegularExpression remove ascii applied \n " << result;
This regular expression DOES NOT WORK.
I get the run-time error
QString::replace: invalid QRegularExpression object
It supposedly removes all control codes:
result = inString.remove(QRegularExpression("[^\\u0000-\\u007F]+"));
qDebug() << "QRegularExpression remove ascii applied \n " << result;
return result;
-
@AnneRanch said in Using regular expression wrong:
This regular expression DOES NOT WORK
Because \u0000 and \u007F are not valid in PCRE -> https://www.regular-expressions.info/unicode.html#codepoint
-
@AnneRanch
As @Christian-Ehrlicher has said. That should be
QRegularExpression("[^\\000-\\177]+")
However it will not do what you intend. It will remove all ASCII characters, as the comment said, and return an empty string.
I suspect you are wanting to try:
result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
which will remove just the characters you have which are non-printable ASCII control characters.
Your \u0001\u001B[1;39m\u0002export should result in [1;39mexport.
-
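One thing worth checking in the patterns above is the leading ^ inside the brackets. As explained earlier in the thread, it means "everything but", so a class like [^\000-\037] matches everything that is not a control character, and removing those matches leaves only the control characters. The expected result [1;39mexport corresponds to the same class without the ^. The sketch below shows the difference. It is an illustration only: std::regex stands in for QRegularExpression, hex escapes (\x00, \x1F, \x7F) are used in place of the octal ones, and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Negated class: matches runs of characters that are NOT in 0x00..0x1F,
// so removing the matches leaves only the control characters behind.
std::string removeNonControl(const std::string &text)
{
    static const std::regex notControl(R"([^\x00-\x1F]+)");
    return std::regex_replace(text, notControl, "");
}

// Plain class: matches runs of control characters 0x00..0x1F,
// so removing the matches leaves the printable text behind.
std::string removeControl(const std::string &text)
{
    static const std::regex control(R"([\x00-\x1F]+)");
    return std::regex_replace(text, control, "");
}

// Negated ASCII range: matches only characters outside 0x00..0x7F,
// so it finds nothing to remove in a pure-ASCII line.
std::string removeNonAscii(const std::string &text)
{
    static const std::regex notAscii(R"([^\x00-\x7F]+)");
    return std::regex_replace(text, notAscii, "");
}
```

Even the plain class still leaves the [1;39m part behind, because those are ordinary printable characters. To get rid of the whole escape sequence, the sequence itself has to be matched, as in the "<ESC> [ ... <letter>" description earlier in the thread.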
I am not sure linking to other forums is OK, but here is a part of it.
I am trying to port the Java code to C++, and this reference claims that the "control characters" are identified as "[^\u0000-\u007F]", and that is my objective: "remove" all control characters.
And this removes ASCII, not control characters:
QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
and that has been my issue since I started this: removing control characters using this expression, "[^\\000-\\037]+".
I think I am not using "remove" and plain "match the expression" correctly.
https://stackoverflow.com/questions/24229262/match-non-printable-non-ascii-characters-and-remove-from-text
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null)
    {
        return null;
    }
    else
    {
        char ch;
        Regex regex = new Regex(@"[^\u0000-\u007F]", RegexOptions.IgnoreCase);
        Match charMatch = regex.Match(inString);