Why cling onto RegEx?

SGaist

Hi,

Well, your simple validator is already too simple. It doesn't take into account that there should be at most 4 numbers; with more than that, the address is already invalid.

As for the regular expression for IP addresses, this is one of the commonly known expressions.

It does not require a type conversion that can fail, as you show in your loop, but it does require a regular expression engine.

The regular expression might be cryptic, but it often saves you from writing complex parsing code.
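
To make the comparison concrete, here is a minimal sketch of both approaches side by side. Python is used purely for illustration, the function names are hypothetical, and the pattern is one commonly seen form, not necessarily the exact expression referred to above. Both versions assume plain dotted-decimal notation with exactly four octets and no leading zeros.

```python
import re

# One octet in the range 0-255, without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
# Exactly four octets separated by dots.
IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_ipv4_regex(text: str) -> bool:
    """Regex version: the whole string must match the pattern."""
    return IPV4_RE.fullmatch(text) is not None


def is_ipv4_manual(text: str) -> bool:
    """Hand-written version: split, count the segments, check each one."""
    parts = text.split(".")
    if len(parts) != 4:  # rejects both too few and too many segments
        return False
    for part in parts:
        # Only 1 to 3 ASCII digits are allowed in a segment.
        if not (1 <= len(part) <= 3 and part.isascii() and part.isdigit()):
            return False
        # Mirror the regex: no leading zeros such as "01".
        if len(part) > 1 and part[0] == "0":
            return False
        if int(part) > 255:
            return False
    return True
```

Note how many separate checks the hand-written version needs before the conversion to an integer is safe. Each of them is a special case that is easy to forget.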

candy76041820

@SGaist
1a. Well it's just a makeshift & demonstrative snippet to convey my idea of using something instead of regex, so don't be so picky.
1b. Also, for this particular bug (>4 segments for IPv4 addrs) you mentioned, isn't a simple for-loop more debuggable than a cryptic regex? A mere browser F12 would help find the cause.
2. Regex engines compile patterns to automata, and I doubt they run as fast as number string parsing functions. (Heck, the latter may even be written and optimized in assembly instructions.)
3. For C++, including another header might slow down compiling and linking.
4. What if the platform (e.g. preinstalled tailored Qt in some Linux distributions, or browser support in the case of JS) lacks advanced features like regex? And what if the dialects they support differ?
5. The time an average programmer needs to get a regex right, and the time needed to write such a makeshift validator, differ by a ratio of at least 10:1 (per my observations on those around me). Not to mention that most of them have never written any regex at all - they just copy a snippet from search engine results, then blindly fiddle around hoping it'll somehow work.

Paul Colby

Hi @candy76041820,

Why cling onto RegEx?

I disagree with your premise here. I know plenty of people (like myself) that use regex a lot, but I don't "cling" to it. I use lots of forms of string processing / pattern matching / string replacement etc. Sometimes regex, sometimes not. The right tool for the job. Personally, I'm quite good with regex - largely because I do lots of processing with CLI tools like sed, where knowing regex is pretty essential, and doing that sort of processing would be very hard without regex. How would you define string search-and-replace syntax for the command line? You'd have to switch to a tool like awk, which I'm also quite good at - again, the right tool for the job. Don't get me wrong, if someone was forcing me to use regex I'd be pretty resistant, but it's a great option to have.

Regex engines compile patterns to automata, and I doubt they run as fast as number string parsing functions. (Heck, the latter may even be written and optimized in assembly instructions.)

This I would call premature optimisation. For anything but some very simple cases, regex will immediately be faster than anything the average dev can hand-code, until the dev spends a lot of time optimising. Sometimes that's appropriate, sometimes not. Personally, I would not waste time writing a function to validate IPv4 addresses by hand when there are such well-known regex patterns already available; doing so would usually be a misuse of dev resources. But, for example, if I was working on a process that needed to handle billions of IP addresses in bulk, or run on an extremely constrained device, then I would absolutely consider hand-coding it, but even then, only after doing some profiling to check that it really is the most important bottleneck.

For C++, including another header might slow down compiling and linking.

Yep, if compilation time is the most important thing you're optimising for, then there's lots of things you might want to consider not using, including regex. Having the choice is great.

What if the platform (e.g. preinstalled tailored Qt in some Linux distributions, or browser support in the case of JS) lacks advanced features like regex? And what if the dialects they support differ?

Again, if that's your scenario, then avoid regex :)

The time an average programmer needs to get a regex right, and the time needed to write such a makeshift validator, differ by a ratio of at least 10:1 (per my observations on those around me).

My experience is literally the opposite, at least for any significantly advanced patterns. But again, if runtime performance is super (super) critical, then I might invest that 10x effort to hand-code it; otherwise regex is much quicker (both to write and to run) for me.

Not to mention that most of them have never written any regex at all - they just copy a snippet from search engine results, then blindly fiddle around hoping it'll somehow work.

Now, I think you finally hint at the real problem. It's not a case of why people in general insist on regex, but actually why those people around you do so - and that I can't say, since it's not true to my experience. Maybe they're trying to learn? Maybe they're under the impression that regex is always best, and haven't been challenged on it? Maybe they believe that they're good at it, and don't realise their limitations? Maybe they're not good at the manual string handling and/or debugging that you find so easy, so they settle on regex as seeming safer? Maybe they think their project leaders or management are more impressed with them if they use regex?

If you have the right relationships, then I'd suggest you have a good, open conversation with them to understand their motivations, and explain why it's important to you that they don't always cling to regex.

In short: no-one should be "clinging" to regex (in my opinion), just as they shouldn't be avoiding it either. Regex can be pretty damn useful and efficient a lot of the time, but should be used pragmatically, just like all other tools, patterns, etc.

But that's all just my opinion (as always) :D

Cheers.

JonB

@candy76041820
My observation: while it is true reg exs can be arcane, judging by "a large number" of the posts in this forum there are a lot of people programming who haven't got a clue and will struggle even more to write the lines of code you are talking about to do their own programmatic parsing of anything. At least (mostly) getting a reg ex wrong doesn't do any harm other than it's incorrect, compared to crashing (or worse) with bad coding.

candy76041820

@Paul-Colby

sed and awk

That's the problem here. My fellows seldom use any kind of regex tooling, including vim's s//g, grep or pgrep, yet they always tend to (try to) use regex for string validation, which is what made me post this.

premature optimisation

Off topic: (my) practice suggests that no optimization is premature, because otherwise there won't be any room for optimization at all later on. And so are designs.

misuse of dev resources

One good way to make yourself occupied in bosses' eyes, though. *Winks*

well known regex patterns already available

You reminded me. One needs to take extra care, because the regex strings you find online might be in unescaped form, and escaping backslashes - usually tons of them in a regex! - is painful in C-influenced languages. Not every language has raw/unescaped strings; Java, for one, doesn't. (java.util.regex.Pattern.quote seems too ugly and clumsy to me.)
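
The escaping hazard can be illustrated with a small sketch. Python is used here only because it has both raw and ordinary string literals, so the two forms can be compared directly; the helper name `escape_for_plain_literal` is hypothetical and only handles backslashes, not quotes.

```python
import re

# The same pattern written twice: once as a raw string, once with every
# backslash doubled, as required in a language without raw string literals.
raw_pattern = r"\d+\.\d+"
escaped_pattern = "\\d+\\.\\d+"
same_pattern = raw_pattern == escaped_pattern  # both spell the identical regex


def escape_for_plain_literal(pattern: str) -> str:
    """Double each backslash so the text can be pasted into a non-raw literal."""
    return pattern.replace("\\", "\\\\")


# Forgetting to escape can fail silently: in an ordinary Python string,
# "\b" is a backspace character, not the regex word-boundary token.
word_boundary_ok = re.search(r"\bcat\b", "a cat") is not None
word_boundary_broken = re.search("\bcat\b", "a cat") is not None
```

The last two lines show why a pasted pattern needs a test: the broken version compiles without complaint and simply never matches.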

compilation time

Come to think of it, I forgot to mention that I tend to write header-only libraries (and templates, you know the best place for them is headers). Including too many implementation headers pollutes namespaces too much. But that isn't a problem for non-C/C++.

significantly advanced patterns

Come to think of it (again), I've been haunted by an exercise problem since I learned regex in university: how do you write a regex that recognizes binary literals divisible by 5 (0b101)? The simplest handcrafted state machine would solve the problem effortlessly, but I just struggled with my one-A4-paper-length regex back then. Maybe I was being stupid, designing a DFA and then converting it to a regex.
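
For reference, the handcrafted state machine mentioned above can be sketched in a few lines. This is an illustration in Python under stated assumptions: the input is a bare string of '0' and '1' characters with no "0b" prefix, and the function name is hypothetical. The state is the remainder modulo 5 of the bits read so far, so there are only five states.

```python
def divisible_by_5(bits: str) -> bool:
    """Return True if the binary numeral `bits` denotes a multiple of 5."""
    if not bits or any(c not in "01" for c in bits):
        return False  # empty input or non-binary characters are rejected

    state = 0  # remainder (mod 5) of the prefix consumed so far
    for c in bits:
        # Appending a bit doubles the value and adds the bit, so the
        # transition is: new_state = (2 * state + bit) mod 5.
        state = (2 * state + int(c)) % 5
    return state == 0  # accepting state: remainder zero
```

The equivalent regex exists, because the automaton is finite, but eliminating the five states by hand is exactly what produces the page-long expression.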

Maybe they're trying to learn?

Nope, or so I observe. They are just getting work done by asking me to help write the regex, then completely forget about that.
Another silly instance is that someone copied a regex I wrote for another guy, thinking it'd fulfill his needs - which were actually different from the ones I wrote the regex for. Good thing is I saw the code looking familiar and asked about the matter, reminding him that the functions are different. (Bad thing? I had to help them both write another one, because there had been an inconsistency between parts of the project that they never noticed.)

understand their motivations

Well that's what I always do after helping them: talking them out of blindly messing with regex they don't actually care about. What can you do with those who can't even remember to distinguish between * and +?

@JonB

getting a reg ex wrong doesn't do any harm other than it's incorrect

Quite the contrary. I've been bitten by a wrong validator regex that produces false positives. And crashing early in the development stages is, I believe, helpful to discover bugs.
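
As a purely hypothetical illustration (not the actual bug described above), here are two common ways a validator regex ends up producing false positives: confusing * with +, and searching for the pattern anywhere in the input instead of matching the whole string. The function names are made up for this sketch.

```python
import re


def looks_numeric_star(text: str) -> bool:
    """Buggy: '*' means zero or more, so the empty string is accepted."""
    return re.fullmatch(r"[0-9]*", text) is not None


def looks_numeric_unanchored(text: str) -> bool:
    """Buggy: search() succeeds if digits appear anywhere in the text."""
    return re.search(r"[0-9]+", text) is not None


def is_numeric(text: str) -> bool:
    """Intended behaviour: one or more digits, and nothing else."""
    return re.fullmatch(r"[0-9]+", text) is not None
```

All three accept "123", so a quick manual check with a valid value would not reveal the difference; only negative test cases such as "" or "abc123def" do.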

JonB

@candy76041820 said in Why cling onto RegEx?:

Quite the contrary. I've been bitten by a wrong validator regex that produces false positives.

But not when someone has coded it instead of reg exs?

Your validator code is fine. But you're liable to end up writing an awful lot of dedicated code for validation if you have a lot of different things to validate.

writing a longer - but much more understandable (don't you agree?) - snippet?

Just saying: I don't agree :) A reg ex can be a lot more readable than someone's procedural code --- depends on the task. You can obtain a global consensus on definition of a reg ex for, say, a host address, usable across languages; try getting that for code. I like reg exs., and the various places they crop up. If you don't, don't use them.

candy76041820

@JonB said in Why cling onto RegEx?:

But not when someone has coded it instead of reg exs?

I mean, a dedicated code segment needs more debugging than a copy-pasted regex, so the bug would have been found earlier.

an awful lot of dedicated code for validation

Then there would be a similarly awful lot of dedicated regex, too.

usable across languages

Only if the backslashes get carefully escaped/unescaped when copying them across language boundaries, and differences in supported dialects are covered as well. Those, especially the latter, are as laborious as handcrafting dedicated segments.

Kent-Dorfman

Regex is something that any good dev should be made to understand. The only issue with regex parsing is that the engine will be more heavyweight than a hand-made parser... but the payoff is code quality. You WILL miss special cases if you attempt to parse more generic input streams manually. I could not ever imagine trying to write any sort of parser without a good lexical analyzer and, for everything but the simplest examples, an equally capable parser generator. Now if you'll excuse me, I need to yak all over my lex.

JonB

@Kent-Dorfman said in Why cling onto RegEx?:

I need to yak all over my lex

It's yacc not yak! :) Plus I thought these days you would use bison.

J.Hilk

@Paul-Colby said in Why cling onto RegEx?:

just as they shouldn't be avoiding it either

I agree with your whole post with one exception, one should avoid std::regex. Using literally anything else, including a hand crafted parser, is better :P (or was better, the last time I checked it out)

Kent-Dorfman

@JonB said in Why cling onto RegEx?:

It's yacc not yak! :) Plus I thought these days you would use bison.

Usually when I write yak it is in relation to kayaking...so sue me...

Bob64

@candy76041820 said in Why cling onto RegEx?:

1a. Well it's just a makeshift & demonstrative snippet to convey my idea of using something instead of regex, so don't be so picky.

Your point was to demonstrate that a simpler solution was available. However, if you can't demonstrate that the simpler solution does everything required then it isn't a solution and you haven't demonstrated anything. It's hardly being "picky" to point this out!