Regular expression for *not* a *sequence* of characters
-
I actually do all my reg exing using Python's `re` module. But I imagine Qt's reg exs are compatible enough that I can treat an answer for them as one for Python's, if someone would care to answer.... I could have asked this question on stackoverflow, but you guys are so friendly & helpful that I thought I'd try here :) Not to mention, it's often so difficult to think of a title for these reg ex questions!
I wish to search (and capture group) for: text started and ended with (i.e. enclosed inside) a pair of characters, `** something **`:
abc ** this is matching text ** def
abc ** this is matching * PLUS this bit **
abc ** this is matching ** but not this ** and this is a new one ** def
It's easy to exclude all `*`s in the middle:
`\*\*([^*]*)\*\*`
But what about allowing lone `*`s while excluding the multi-character sequence `**`?
I imagine it's something to do with "zero width lookahead". But reading the explanations of that makes my head hurt. Would someone care to tell me what I need for this, kindly?
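To make the problem concrete, here is a minimal Python sketch (using the `re` module mentioned in the post) that runs the `[^*]*` pattern over the sample text. Splitting the sample into three separate lines and the variable names are assumptions made for illustration.

```python
import re

# Pattern from the post: the captured text may not contain any '*' at all.
NO_STAR = re.compile(r"\*\*([^*]*)\*\*")

lines = [
    "abc ** this is matching text ** def",
    "abc ** this is matching * PLUS this bit **",
    "abc ** this is matching ** but not this ** and this is a new one ** def",
]

results = [NO_STAR.findall(line) for line in lines]
for line, found in zip(lines, results):
    print(f"{line!r} -> {found}")

# The second line yields no match: the lone '*' stops [^*]* before the
# closing '**' is reached, which is exactly the limitation being asked about.
```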
-
Something like this? https://regex101.com/r/gBqL4Y/3 or did I misunderstand the question?
-
@VRonin
Hmmm, I don't get how it looks like it works on my case #3. How greedy is that `.+?` in the middle? I don't want it to match a `**`, so why doesn't it take the longest match and eat up the `**`s in the middle of a line all the way till it matches the final `**` against the end of the reg exp, giving only one match group?
OOhhh. There's an explanation on the right!
`+?` Quantifier: Matches between one and unlimited times, as few times as possible, expanding as needed
OK, why as few times as possible?? What's happened to regular expressions, since when...? :(
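The behaviour that the regex101 note describes can be seen in a short Python sketch comparing a greedy `+` with a lazy `+?`. The pattern `\*\*(.+?)\*\*` is assumed here for illustration; the linked example may use something slightly different.

```python
import re

line = "abc ** this is matching ** but not this ** and this is a new one ** def"

# Greedy: .+ grabs as much as it can, then backs off to the LAST '**'.
greedy = re.findall(r"\*\*(.+)\*\*", line)

# Lazy: .+? grabs as little as it can, so it stops at the NEXT '**'.
lazy = re.findall(r"\*\*(.+?)\*\*", line)

print(greedy)  # one long capture spanning every '**' in between
print(lazy)    # two short captures, one per '**' pair
```

The greedy form gives the single long match expected from older regex habits; the lazy form splits the line into one capture per pair.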
-
@JonB said in Regular expression for *not* a *sequence* of characters:
> OK, why as few times as possible?
That's the effect of the `?` after `+`. It makes the match non-greedy. If you remove the question mark it will behave as you are expecting.
-
Python uses:
> regular expression matching operations similar to those found in Perl
just as `QRegularExpression` does.
https://regex101.com actually has an explicit Python simulator.
-
@VRonin
Yes, I do realize Python/Perl & others now use more advanced regular expressions than `sed` did. In my day we didn't even yet have the `+` operator, not sure about `?`, but certainly not `+?` being something special. So I simply did not know about it. Being able to match fewest is really useful, of course.
-
@VRonin
As an exercise, in terms of what I had had in mind without knowing about `+?`, how would you write, say, a matcher which wanted "2 asterisks followed by anything to end which is not another 2 asterisks"? That's what I thought we would need. So something like:
abc ** this is a * match
abc ** this does not match ** but I guess this * bit does
?
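One possible reading of this exercise, sketched in Python. It relies on the negative lookahead `(?!...)` that comes up later in the thread, and the pattern and names are assumptions for illustration, not necessarily what any linked example does.

```python
import re

# '**', then characters up to end of line, none of which may start another '**'.
TAIL = re.compile(r"\*\*((?:(?!\*\*).)*)$")

samples = [
    "abc ** this is a * match",
    "abc ** this does not match ** but I guess this * bit does",
]

tails = []
for s in samples:
    m = TAIL.search(s)
    tails.append(m.group(1) if m else None)
    print(f"{s!r} -> {tails[-1]!r}")
```

In the second sample the first `**` cannot start a match, because another `**` follows before the end of the line; the search then succeeds from the second `**`.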
-
Something like https://regex101.com/r/VF5zir/1 ?
-
@VRonin
Yep. I see how you've done that one; again, I didn't think of doing it that way.
Let me try one more time: what I really want to know is just how you write "whole line [say] must not include a multi-char sequence"?
I know how to do "not a single char": `[^abc]`. How do you do "not a sequence of chars"? Sort of like `^(this sequence)`, which I know does not work. Hence the original title of this thread.
-
@JonB said in Regular expression for *not* a *sequence* of characters:
> How do you do "not a sequence of chars"?
RegExp does not have (and probably never will) this construct. The argument is that it can easily be inverted from the calling code, i.e. write the regex that matches the sequence and then instead of `if(regexp.match())` you'd use `if(!regexp.match())`.
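The "invert it in the calling code" idea might look like this in Python. The helper name `lacks_double_star` is hypothetical, and `re.search` stands in for the `regexp.match()` shorthand used in the post.

```python
import re

# Matches the sequence we want to forbid.
DOUBLE_STAR = re.compile(r"\*\*")

def lacks_double_star(text: str) -> bool:
    """Return True when the text contains no '**' anywhere."""
    # Negate the result of the positive search instead of negating in the regex.
    return DOUBLE_STAR.search(text) is None

print(lacks_double_star("a lone * is fine"))   # no '**' present
print(lacks_double_star("a ** is not fine"))   # '**' present
```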
-
@VRonin
Ah, now we're getting somewhere; that might explain why I don't know how to do it! I thought it could be done using one of these new-fangled "negative lookahead/behind" constructs, but no? You've set me a challenge now... :)
It seems strange to me that reg exs can cope with "not one character" but not with "not multiple characters".
I know I can do it "in code" as you have shown. But Qt has various places which allow a reg ex filter/matcher, e.g. a `QLineEdit` validator, which I think has to match for the validation to succeed. I could use `[^*]*` to reject any line with `*` in it. But to reject lines which have `**` in them, you're saying I cannot use a plain reg ex validator string and have to go write some kind of code (I think the Qt validators allow for that, but that's not my point)?
EDIT
`(?<!foo)` Negative Lookbehind: Asserts that what immediately precedes the current position in the string is not foo
This is probably what I was thinking about. So, for example, I presume:
`^.*(?<!\*\*)$`
rejects lines which end with `**`, which is "rejecting by a sequence of characters"? [Yep, tested.] Can we expand on this to implement the "not" in-line instead?
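A quick Python check of the lookbehind pattern from the EDIT above. It is a sketch only, and the sample strings are made up for illustration.

```python
import re

# Whole line, provided the two characters just before the end are not '**'.
NOT_ENDING_IN_DOUBLE_STAR = re.compile(r"^.*(?<!\*\*)$")

samples = ["plain text", "ends with one *", "ends with two **", "has ** in the middle"]
verdicts = {s: NOT_ENDING_IN_DOUBLE_STAR.match(s) is not None for s in samples}

for s, ok in verdicts.items():
    print(f"{s!r}: {'accepted' if ok else 'rejected'}")
```

Only the line that ends with `**` is rejected; a `**` in the middle still gets through, which is why the lookbehind alone does not answer the "anywhere in the line" question.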
-
@kshegunov That works because of `^`/`$`; you can't match
abc ** this is matching ** but not this ** and this is a new one ** def
where the sequence to exclude is `**`.
-
@kshegunov, @VRonin
The following is probably what you're both saying. But it is possible to "only match a complete line which does not contain `**` anywhere in it" (e.g. for a `QLineEdit` validator) with the pattern below (https://stackoverflow.com/a/406408/489865, also an example at https://www.regextester.com/15, where they call it "Match string not containing string"):
`^((?!\*\*).)*$`
Which I certainly never knew!
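A minimal Python sketch of that pattern in use. The helper name `is_acceptable` and the sample strings are assumptions for illustration; only the regex itself comes from the post.

```python
import re

# Each character is consumed only if '**' does not start at that position,
# so the whole line matches only when '**' appears nowhere in it.
NO_DOUBLE_STAR = re.compile(r"^((?!\*\*).)*$")

def is_acceptable(line: str) -> bool:
    return NO_DOUBLE_STAR.match(line) is not None

samples = ["plain text", "a lone * is fine", "a ** is rejected", "ends with **", ""]
for s in samples:
    print(f"{s!r}: {is_acceptable(s)}")
```

Because the negation lives inside the pattern, the same string could be used wherever only a "must match" regex is accepted, which is the validator situation described earlier in the thread.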
@VRonin
I don't know what you mean by your last post (yes, the reg ex does include `^`/`$`), would you care to clarify? I suspect it's to do with "group capturing as opposed to whole match", but not at all sure.
-
I haven't tried to. As far as I understood the question, it is to match lines that do not contain the sequence.
@JonB
Pretty much the same idea as what I used.
-
@kshegunov
Yes, it is what you used (though your example really confused me with its `[^t]|t` in it; did you just complicate it to test me out? ;-) )
There is something in @VRonin's final statement where he accepts use of `^`/`$` but then says "you can't match..." where I do not know what he is trying to convey...
-
@JonB said in Regular expression for *not* a *sequence* of characters:
> Yes it is what you used (though your example really confused me with its `[^t]|t` in it, did you just complicate it to test me out? ;-) )
Surely not. It just seemed more natural to me: match anything but `t`, OR a `t` that's not followed by "[t]his thing" ... seemed like kind of the human way of doing it ;P
> There is something in @VRonin's final statement where he accepts use of `^`/`$` but then says "you can't match..." where I do not know what he is trying to convey...
I think he just misunderstood the question and wants to match stuff that's between `**` pairs ...
-
@kshegunov
Surely. Have you heard of "KISS"? :-; When trying to illustrate your use of `((?!.....).)*`, which is what I needed to learn as the solution, do you think adding the extra stuff would make it easy for me to understand which bit was the principle? :)
I always respect what @VRonin writes. But when he said:
> RegExp does not have (and probably never will) this construct.
it now seems to me that it does have such a construct, unless he explains just what he meant...
-
@JonB said in Regular expression for *not* a *sequence* of characters:
> it now seems to me that it does have such a construct
It does not have a generic way. It has a "line does not contain" or "document does not contain". Say you want to capture stuff inside `**` (so `\*\*(.+?)\*\*`) but exclude the capture if `.+?` matches `foo`. I don't think that is possible.
Forget what I said.