Using regular expression wrong
-
@JonB Ah, fair enough. I thought \u0001 is an actual string (6 characters) and not a single character.
@Chris-Kawa
No, these are byte representations. Like: \u0001\u001B[1;39m\u0002export
From the past, the OP is obtaining this from something like the output of a program running, or intended to run, in a terminal.
I happen to know that there is an ANSI terminal escape sequence like:
Esc [ number ; number m
which I think is "set text attributes" (bold, colours and so on; the "move cursor to row-col" sequence ends in H rather than m), and
\u001B == 27 decimal == Escape char.
All this stuff can be found in the table at https://en.wikipedia.org/wiki/ANSI_escape_code#CSIsection
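To make the shape of such a sequence concrete, here is a minimal sketch that strips simple "Esc [ parameters letter" sequences. It uses std::regex from the C++ standard library rather than QRegularExpression, purely so that it compiles without Qt; the function name stripCsiSequences is hypothetical, and it assumes the parameters are only digits and semicolons.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Remove ANSI CSI sequences of the form: ESC '[' digits/semicolons letter,
// for example "\033[1;39m" or "\033[0m".
// Assumption: no other kinds of escape sequence occur in the input.
std::string stripCsiSequences(const std::string &input)
{
    // \x1B is the Escape character (27 decimal); \[ is a literal '['.
    static const std::regex csi("\\x1B\\[[0-9;]*[A-Za-z]");
    return std::regex_replace(input, csi, "");
}
```

Applied to the sample line from this thread, this leaves the \u0001 and \u0002 markers in place, so they would still need removing separately.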
-
@Chris-Kawa
You raise a good question though. I'm not sure whether QRegularExpression will interpret my \001 as I intended.
How would you write the QRegularExpression to include matching characters like ASCII-1 or ASCII-27? I haven't kept up with how to represent that in reg exps nowadays. Maybe it's actually \u0001 & \u001B, is that a single (Unicode?) char sequence recognised in QRegularExpression?
UPDATE
I just looked on https://regex101.com/ and it does say
\ddd Matches the 8-bit character with the given octal value.
so I think my original dim recollection for using \001 & \002 may have been right/OK after all :)
-
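Part of the confusion around \001 is that two layers interpret the backslash: first the C++ compiler, then the regex engine. The sketch below uses only std::string (the function names are hypothetical) to show what each spelling of the literal actually contains.

```cpp
#include <cassert>
#include <string>

// "\001": the compiler resolves the octal escape itself, so the string
// holds ONE character with code 1.
std::string compilerResolved() { return "\001"; }

// "\\001": the compiler only collapses "\\" into one backslash, so FOUR
// characters (backslash, '0', '0', '1') are handed to the regex engine,
// which then applies its own octal rule.
std::string engineResolved() { return "\\001"; }

// A raw string literal yields the same four characters without doubling.
std::string rawLiteral() { return R"(\001)"; }
```

This is why the patterns in this thread are written with doubled backslashes when they appear inside C++ source.
-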
Try this
qDebug() << "stream raw line \n " << line;
QString sanitisedLine;
for (const QRegularExpressionMatch &match : QRegularExpression("[a-zA-Z_][a-zA-Z_0-9]*").globalMatch(line))
    sanitisedLine.append(match.captured(0));
qDebug() << "QRegularExpression applied \n " << sanitisedLine;
I am missing something here; I do not understand the error.
I need to read up on QRegularExpressionMatch, but I think you are on the right track...
Would you kindly explain in a few words what the code is doing?
I think that would help me...
-
@AnneRanch
I am missing something here; I do not understand the error.
https://doc.qt.io/qt-6/qregularexpressionmatchiterator.html#details
Starting with Qt 6.0, it is also possible to simply use the result of QRegularExpression::globalMatch in a range-based for loop, for instance like this:
...
for (const QRegularExpressionMatch &match : re.globalMatch(subject)) {
Are you using Qt6 or Qt5?
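For anyone who wants to see what the "keep only identifier-like words" loop does without a Qt build, here is a rough equivalent using std::sregex_iterator from the standard library (a stand-in for globalMatch, not the Qt API; keepIdentifiers is a hypothetical name).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Concatenate every identifier-like token found in `line`,
// dropping everything between the tokens.
std::string keepIdentifiers(const std::string &line)
{
    static const std::regex word("[a-zA-Z_][a-zA-Z_0-9]*");
    std::string result;
    for (std::sregex_iterator it(line.begin(), line.end(), word), end; it != end; ++it)
        result += it->str(0); // index 0 is the whole match
    return result;
}
```

Note that on the sample line the "m" that ends the colour sequence also counts as an identifier, so this approach leaves it behind.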
-
- I hope this post does not distract from the discussion.
- I believe the whole concept of "search for individual ASCII characters" was misleading. I have been there before, and using "words" ("\w") should have made more sense from the start.
- The code snippet is "work in progress", hence it has some stuff not really needed at this point.
- As seen, I can retrieve the "word" list, but I am stumped on how to get a QString, not a "list":
SOLVED
QString test = match.captured();
qDebug() << "match name from ( list ) " << test;
Code
line = stream.readLine();
//qDebug() << "Stream raw line ";
qDebug() << "stream raw line \n " << line;
// extracts the words
QRegularExpression re("(\\w+)");
QString subject(line);
QString *capture_name; // = " ";  (unused, work in progress)
QRegularExpressionMatchIterator i = re.globalMatch(subject);
while (i.hasNext()) {
    QRegularExpressionMatch match = i.next();
    // qDebug() << "match (next) " << i.next();
    qDebug() << "match " << match;
    // THIS SORT OF WORKS
    qDebug() << "match list " << match.capturedTexts();
    // HOW TO GET INDIVIDUAL QSTRING HERE ?????
    // qDebug() << "match name ( from list ) " << match.captured(*capture_name);
}
Output
Stream file
Stream file
ArrayIndex 0
stream raw line
 "\u0001\u001B[1;39m\u0002Menu main:\u0001\u001B[0m\u0002"
match QRegularExpressionMatch(Valid, has match: 0:(3, 4, "1"), 1:(3, 4, "1"))
match list ("1", "1")
match QRegularExpressionMatch(Valid, has match: 0:(5, 8, "39m"), 1:(5, 8, "39m"))
match list ("39m", "39m")
match QRegularExpressionMatch(Valid, has match: 0:(9, 13, "Menu"), 1:(9, 13, "Menu"))
**match list ("Menu", "Menu")**
match QRegularExpressionMatch(Valid, has match: 0:(14, 18, "main"), 1:(14, 18, "main"))
**match list ("main", "main")**
match QRegularExpressionMatch(Valid, has match: 0:(22, 24, "0m"), 1:(22, 24, "0m"))
match list ("0m", "0m")
QRegularExpression remove ascii applied
 "\u0001\u001B[1;39\u0002 :\u0001\u001B[0\u0002"
single character DONE
-
-
I am trying to simplify the process.
This regular expression works and removes all control codes:
QString result = inString.remove(QRegularExpression("[^\\w\\d ]+"));
qDebug() << "QRegularExpression remove ascii applied \n " << result;
This regular expression DOES NOT WORK.
I get the run-time error
QString::replace: invalid QRegularExpression object
It supposedly removes all control codes:
result = inString.remove(QRegularExpression("[^\\u0000-\\u007F]+"));
qDebug() << "QRegularExpression remove ascii applied \n " << result;
return result;
-
@AnneRanch said in using regular expression wrong:
This regular expression DOES NOT WORK
Because \u0000 and \u007F are not valid for PCRE (it writes a code point as \x{007F} instead) -> https://www.regular-expressions.info/unicode.html#codepoint
-
@AnneRanch
As @Christian-Ehrlicher has said. That should be
QRegularExpression("[^\\000-\\177]+")
However it will not do what you intend. It will remove all ASCII characters, as the comment said, and return an empty string.
I suspect you are wanting to try:
result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
which will remove just the characters you have which are non-ASCII-printable control characters.
Your \u0001\u001B[1;39m\u0002export should result in [1;39mexport.
-
I am not sure linking to other forums is OK, but here is a part of it.
I am trying to port the Java code to C++, and this reference claims that the "control characters" are identified as "[^\u0000-\u007F]", and that is my objective: "remove" all control characters.
And this removes ASCII, not control characters:
QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
and that has been my issue since I started this: remove control characters using this expression "[^\\000-\\037]+".
I think I am not using "remove" and plain "match the expression" correctly.
https://stackoverflow.com/questions/24229262/match-non-printable-non-ascii-characters-and-remove-from-text
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null)
    {
        return null;
    }
    else
    {
        char ch;
        Regex regex = new Regex(@"[^\u0000-\u007F]", RegexOptions.IgnoreCase);
        Match charMatch = regex.Match(inString);
-
@AnneRanch
That code you are trying to use is for regular expressions understood by .NET. They are not identical to those used by Qt.
And this removes ascii , not control characters>
QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
Just remove the ^ I wrote (I forgot you were removing rather than retaining). Should be:
QString result = inString.remove(QRegularExpression("[\\000-\\037]+"));
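The effect of the leading ^ is easy to check side by side. This is a sketch with std::regex instead of QRegularExpression (an assumption, so that it builds without Qt). The default std::regex grammar has no \ddd octal escape, so the range is written as \x01-\x1F here, and NUL is left out of the range to keep the test strings simple. Both function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Negated class: matches runs of characters that are NOT control codes 1..31.
// Removing those matches keeps ONLY the control characters.
std::string removeNegatedClass(const std::string &s)
{
    static const std::regex re("[^\\x01-\\x1F]+");
    return std::regex_replace(s, re, "");
}

// Plain class: matches runs of control codes 1..31.
// Removing those matches keeps everything else.
std::string removeControlClass(const std::string &s)
{
    static const std::regex re("[\\x01-\\x1F]+");
    return std::regex_replace(s, re, "");
}
```

With the sample line, the first function returns just the three control characters, while the second returns "[1;39mexport", which is the result being aimed for at this point in the thread.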
-
@AnneRanch said in using regular expression wrong:
THIS SORT OF WORKS
qDebug() << "match list " << match.capturedTexts();
HOW TO GET INDIVIDUAL QSTRING HERE
match.captured(0);
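The reason the list in the output above shows each word twice ("Menu", "Menu") is that index 0 is always the whole match and index 1 is the first parenthesised group, which in (\w+) happens to cover the same text. A small sketch of the same idea with std::smatch (standard library, not the Qt classes; both function names are hypothetical):

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Every captured text of the first match: [0] is the whole match,
// [1] is the first parenthesised group.
std::vector<std::string> capturedTexts(const std::string &subject)
{
    static const std::regex re("(\\w+)");
    std::smatch m;
    std::vector<std::string> texts;
    if (std::regex_search(subject, m, re))
        for (std::size_t i = 0; i < m.size(); ++i)
            texts.push_back(m.str(i));
    return texts;
}

// Only the whole match as one string (empty if nothing matched).
std::string firstWord(const std::string &subject)
{
    static const std::regex re("(\\w+)");
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}
```

Dropping the parentheses from the pattern would leave only index 0, so the list would then hold a single entry per match.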
-
-
@VRonin
If the OP ever returns to look at the answers to this question, it would be a shame if she did not first try the simple
QString result = inString.remove(QRegularExpression("[\\000-\\037]+"));
at least to see if that is acceptable to her, compared to other more complex regular expression solutions....
[I have said that none proposed so far will be perfect, she would have to deal properly with removing just the ANSI escape sequences if she wants it to be really right.]
-
@AnneRanch said in using regular expression wrong:
I am trying to port the Java code to C++ and this reference claims that
the "controls characters " are identified as "[^\u0000-\u007F]"
Well, that reference is wrong. This is the Unicode basic Latin page, covering code points from 0 through 127 decimal, which were specifically designed to be identical to ASCII codes. You will see that only the first 32 code points (0x0000 through 0x001F) and the last code point (0x007F, Del) are non-printables: the remainder are printable characters. There are other non-printables outside this range also.
and that is my objective "remove" all control characters.
And this removes ascii , not control characters>
QString result = inString.remove(QRegularExpression("[^\\000-\\037]+"));
and that has been my issue since I started this - remove control characters using this expression "[^\\000-\\037]+"));
The regular expression matches any run of characters that is not in the range 0 to 31 decimal. You ask Qt to remove any character that the pattern matches: it does, leaving only those things in the control character block. You want the opposite of that.
It turns out that the documented regular expression dialect allows the POSIX character classes which can make life easier:
#include <QCoreApplication>
#include <QString>
#include <QRegularExpression>
#include <QDebug>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QString testString("ABC\tabc\177DEF-def\n\007");

    // following removes all the ASCII printables (i.e. your broken result)
    QString temp(testString);
    temp.remove(QRegularExpression("[^\\000-\\037]+"));
    qDebug() << testString << "==>" << temp;

    // following removes all except the ASCII printables
    temp = testString;
    temp.remove(QRegularExpression("[\\000-\\037\\177]+"));
    qDebug() << testString << "==>" << temp;

    // Following uses a POSIX character class to remove control characters
    // (which include TAB and NL).
    temp = testString;
    temp.remove(QRegularExpression("[[:cntrl:]]+"));
    qDebug() << testString << "==>" << temp;

    return 0;
}
Output:
"ABC\tabc\u007FDEF-def\n\u0007" ==> "\t\n\u0007"
"ABC\tabc\u007FDEF-def\n\u0007" ==> "ABCabcDEF-def"
"ABC\tabc\u007FDEF-def\n\u0007" ==> "ABCabcDEF-def"
-
@ChrisW67 said in using regular expression wrong:
You want the opposite of that.
I did reply earlier:
Just remove the ^ I wrote (I forgot you were removing rather than retaining). Should be:
QString result = inString.remove(QRegularExpression("[\\000-\\037]+"));
-
1. JonB, please get off your horse. This is a discussion and we all have differences of opinion, which is what discussions are for.
(You remind me of a "study group" I had years ago, where certain cultures insisted on "we all have to have the same opinion and agree... then we can go home".)
2. I did state I am porting from Java, hence the source (I used) is different...
(I realize things get missed, misread, etc.)
3. There are two concepts (to get the job done) so far:
identify all ASCII characters
remove all control characters
Here is the code:
#ifdef BYPASS
QRegularExpression re("[^\\w\\d (:/<>) ]+");
QString result = inString.remove(re); // keep all ascii plus some
qDebug() << "remove all controls \n " << result;
return result;
#endif
QString result = inString.remove(QRegularExpression("[\\000-\\037]+"));
qDebug() << "remove all controls \n " << result;
return result;
They both leave some unwanted characters. Those are easy to remove after the "regular expression" is done.
4. Looks as if "match" is OK but too complex to accomplish what I want.
5. As the original title said, I was using the concept wrong: I did not pay attention to the actual expression, identifying or deleting stuff.
I really appreciate everybody's input; it has been educational.
Cheers
-
@AnneRanch said in using regular expression wrong:
JonB please get off your horse - this is a discussions and we all have difference of opinions - which is what discussions are for.
What are you talking about? I gave you the code you need to remove all the control chars. That's all. And as usual got abuse back. I know you are rude to everybody, but any reason to single me out? :) Oh, and I just saw you use what I suggested and still are cross with me!
-
@JonB OK, let's get serious. Your posts are great technically, but you just cannot say it without making comments such as "if he comes back..." or "I told you so...".
I realize that each of us has a different way to express stuff, and that is perfectly OK.
My gut feeling is that, as I am not a native English speaker, I am not used to this sentence structure: "...YOU can do it this way, I ALREADY TOLD YOU SO."
In my native language I would say "...do it this way."
Cheers
-
SOLVED
use QString "replace" instead...
I need more help making the actual expression
QRegularExpression re("[\\000-\\037[1;139m]+")
This works BUT deletes EVERY occurrence of "m".
I would like to delete ONLY this string "[1;139m"
PS
Can anybody recommend "use regular expression examples in C++"?
I am getting too many "tutorials" and would like to know the group's recommendation. This one does not really explain stuff, just looks pretty (IMHO)...
-
@AnneRanch
It gets harder to write the regular expression for that case.
In all the examples you have shown so far, like
stream raw line "\u0001\u001B[1;39m\u0002export \u0001\u001B[0m\u0002Print environment variables"
they all look like \u0001...\u0002
That means they have an ASCII-1 at the start and an ASCII-2 at the end. If all your cases look like this, then:
line.remove(QRegularExpression("\\001[^\\002]*\\002"));
should get rid of just what you want, and leave no "artefact bits".
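Every "m" was being deleted because everything between [ and ] is a set of single characters, so "[1;139m" inside the brackets adds '[', '1', ';', '3', '9' and 'm' to the set instead of matching that text as a sequence. The "start marker, anything that is not the end marker, end marker" pattern above avoids this. Here is a sketch of it using std::regex rather than QRegularExpression (an assumption, so it runs without Qt; the octal \001 and \002 are written as \x01 and \x02 because the default std::regex grammar has no octal escapes, and stripMarkedSpans is a hypothetical name).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Remove every span that starts with ASCII-1 and ends with ASCII-2,
// including the two markers themselves.
// Assumption: each escape sequence in the input is wrapped this way.
std::string stripMarkedSpans(const std::string &line)
{
    static const std::regex marked("\\x01[^\\x02]*\\x02");
    return std::regex_replace(line, marked, "");
}
```

Ordinary text outside the markers, including any letter "m", is left untouched.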