Regular expression pattern to handle the tail end of a command line

Slurms_MacKenzie · November 1, 2022, 4:28pm

Hello all,

I’m trying to put together a regular expression to validate the command line arguments passed to my program. I’ve come up with something that seems to work to my eyes but I have a feeling that one element in particular could be done better. The regex is as follows:

^(?:(?:--force|--flash|--dump-only|--serial[= ]{1}(?P<SERIAL>\w+)|--file[= ]{1}(?P<FILE>[-/\.\w]+))(?:\s|$))+$

Looking at the final non-capturing group (?:\s|$), you can see that it matches either a white-space character or the end of the string. This allows for the final command line argument, which will not be followed by white-space like its predecessors (and if it were, that would likely be a typo). In my head it’s equivalent to [\s$], or it would be if the $ meta character retained its special significance inside the class.
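
To make the point about the character class concrete, here is a minimal sketch using Python's re module. The two pattern names are made up for illustration, and only the --force option is used to keep it short. Inside [...] the $ is an ordinary character, so the class demands one more character, while the alternation accepts the end of the string.

```python
import re

# Inside a character class, "$" is a literal dollar sign, not an anchor.
in_class = re.compile(r"--force[\s$]")
# The alternation keeps "$" as the end-of-string anchor.
alternation = re.compile(r"--force(?:\s|$)")

print(bool(in_class.match("--force$")))     # True: a literal "$" is consumed
print(bool(in_class.match("--force")))      # False: the class needs a character
print(bool(alternation.match("--force")))   # True: end of string satisfies "$"
print(bool(alternation.match("--force ")))  # True: the space satisfies "\s"
```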

The regex works fine, but I felt it would be more appropriate to find a way of expressing that each command line argument must be followed by a white-space character unless that argument is at the end of the line; as it stands, my expression merely dictates that each argument must be followed by either a white-space character OR the end of the line. I played around with look-ahead assertions but I couldn’t work out logically how it would be done.

It might seem pedantic but to make the code more explicit, I was wondering if anyone could come up with a better way of expressing my intention?
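
One possible way to state that intention, offered as a sketch rather than a definitive answer, is the common "item, then zero or more (separator + item)" shape. The names ARG and COMMAND_LINE are hypothetical. The named groups are dropped here because Python's re refuses a pattern that defines the same group name twice, and the argument sub-pattern appears twice; the values would have to be extracted in a separate step.

```python
import re

# One argument, without the named groups from the original pattern.
ARG = r"(?:--force|--flash|--dump-only|--serial[= ]\w+|--file[= ][-/.\w]+)"

# An argument, then zero or more "whitespace + argument" pairs. Whitespace can
# only appear between two arguments, so a trailing space is rejected.
COMMAND_LINE = re.compile(rf"{ARG}(?:\s{ARG})*")

for line in ["--force", "--force --serial ABC --file=a/b.bin", "--force ", "--bogus"]:
    print(repr(line), bool(COMMAND_LINE.fullmatch(line)))
```

With fullmatch the ^ and $ anchors are not needed. Unlike the original expression, this version rejects a command line that ends in white-space.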

Rosuav · November 1, 2022, 5:44pm

Have you considered using argparse instead?

Slurms_MacKenzie · November 1, 2022, 7:10pm

Not really, simply because regular expressions are more or less universal and I’d rather learn their syntax than use a Python library. But thanks for the suggestion, I’ll keep it in mind for more complex command lines.

cameron · November 1, 2022, 10:31pm

JWZ comes to mind once again:

 Some people, when confronted with a problem, think "I know, I'll use
 regular expressions." Now they have two problems.
 - Jamie Zawinski, in alt.religion.emacs

Regexps are cryptic and error prone (not that they do not behave
predictably, but that they are hard to write correctly). Even written
verbosely they’re tricky.

It’s worth learning their syntax, as they are widely used, but
choosing them as your primary method of validation is usually a poor
choice - they have their sweet spots, but are by no means suitable in
all circumstances. They’re almost never my first choice.

Cheers,
Cameron Simpson cs@cskk.id.au

Rosuav · November 1, 2022, 10:54pm

I recently fell into the trap of setting VERBOSE when I didn’t need to, and then suddenly realising that a basic sequence of characters no longer matched itself - because, while normally "spam spam spam" matches "spam spam spam", in verbose mode, you need "spam\sspam\sspam" to match that.

Regexes are great for certain tasks (like categorizing text strings for further analysis), but they tend to be fairly inflexible. For user-level validation of anything more than a single logical “unit of data”, it’s often better to go for something with better parsing options, like argparse; one of the reasons for that is that a regex simply fails to match, so the best you can really do is reject the entire unit of data.
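
For comparison, a minimal argparse sketch for the same five options might look like the following. The program name "tool" and the sample argument values are placeholders, and the option set is taken from the regex in the first post. argparse accepts both the --serial=VALUE and the --serial VALUE spelling, and it reports which argument it could not understand instead of rejecting the whole line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "tool" is a placeholder program name.
    parser = argparse.ArgumentParser(prog="tool")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--flash", action="store_true")
    parser.add_argument("--dump-only", action="store_true")
    parser.add_argument("--serial")
    parser.add_argument("--file")
    return parser


# An explicit list stands in for sys.argv[1:] so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--force", "--serial=ABC123", "--file", "out/image.bin"]
)
print(args.force, args.dump_only, args.serial, args.file)
```

An unrecognised option makes parse_args print a usage message naming the offending argument and exit.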

The advantage of using a regex for command-line argument validation is that your app can be ported to another language while still being just as inflexible and unhelpful when something goes wrong.

vbrozik · November 1, 2022, 11:34pm

Attention! r"spam\sspam\sspam" is not the same:

>>> import re
>>> text = "spam\tspam\tspam"
>>> bool(re.fullmatch(r"spam spam spam", text))
False
>>> bool(re.fullmatch(r"(?x)spam\sspam\sspam", text))
True

This is the verbose equivalent:

>>> bool(re.fullmatch(r"(?x)spam\ spam\ spam", text))
False

Rosuav · November 2, 2022, 12:05am

That’s true, but for my purposes, matching other whitespace isn’t a problem. Here’s the regex in question:

r"^(return)?\s*setStatus\('Click\s*to\s*enlarge\s*picture.'\)$"

It’s part of classifying a bunch of links like <a href="some_image.jpg" onmouseover="setStatus('Click to enlarge picture.')" onmouseout="setStatus('')"><img src="some_thumb.jpg"></a> - in the highly unlikely case that there’s other forms of whitespace in there, no big deal, but failing to match meant that it failed to classify.

Fun aside for web devs with some years of experience: it’s calling this function:

function setStatus(msg){
  status = msg
  return true
  }

Younger web devs will probably not know why older web devs just facepalmed hard.

Slurms_MacKenzie · November 2, 2022, 3:29pm

When you say they are hard to write correctly (with which my limited experience of them leads me to agree) do you mean in the sense that even if you manage to write a regex that matches strings of your desired format, there may be a string or strings that also match the regex - ones which you haven’t considered - that are not valid in the broader context of the program, if that makes sense?

In other words, would you say that it is easier to come up with a pattern that matches strings of the desired format, but harder to come up with one that does that plus does not match any string of an undesired format? Owing to the fact that the programmer probably can’t conceive of all possible permutations of a string that could match a regex, given a pattern of sufficient complexity.

cameron · November 2, 2022, 10:17pm

Cameron Simpson:

Regexps are cryptic and error prone (not that they do not behave
predictably, but that they are hard to write correctly). Even written
verbosely they’re tricky.

when you say they are hard to write correctly (with which my limited experience of them leads me to agree) do you mean in the sense that even if you manage to write a regex that matches strings of your desired format, there may be a string or strings that also match the regex - ones which you haven’t considered - that are not valid in the broader context of the program, if that makes sense?

That can be the case. With a trivial expression:

 ^foo$

that will clearly only match one string. Once you start allowing
optional items things become harder to think about, and repetitions with
optional components, maybe with some guards? The permutations become
many, and being sure you (a) match everything that is allowed and (b) do
not match anything not allowed can be hard to reason about.
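
A small wrinkle on the trivial case, specific to Python's re module as far as this sketch goes: without the MULTILINE flag, $ matches at the end of the string and also just before a single trailing newline, so even ^foo$ accepts one string more than it appears to. re.fullmatch or the \Z anchor avoids that.

```python
import re

print(bool(re.search(r"^foo$", "foo")))     # True: the intended string
print(bool(re.search(r"^foo$", "foo\n")))   # True: "$" tolerates a final newline
print(bool(re.fullmatch(r"foo", "foo\n")))  # False: fullmatch must consume it all
print(bool(re.search(r"^foo\Z", "foo\n")))  # False: \Z is the very end only
```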

For example, there was a recent thread with a regexp for simple
arithmetic expressions which had optional open and close brackets:

 \(? .. stuff here ...\)?

and then tried further regexp based tests to validate whether there were
correct brackets. In the context of the larger problem that was getting
pretty tricky, and was probably infeasible using purely regexp tests.
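
A cut-down illustration of why the optional brackets are troublesome, with \d+ standing in for the elided "stuff" (an assumption made only to get something runnable): each bracket is optional on its own, so unbalanced input matches just as happily as balanced input.

```python
import re

# "\d+" is a stand-in for the elided middle of the pattern.
pattern = re.compile(r"\(?\d+\)?")

for text in ["(42)", "42", "(42", "42)"]:
    # All four match, including the two with unbalanced brackets.
    print(text, bool(pattern.fullmatch(text)))
```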

Look at your regexp here:

 ^(?:(?:--force|--flash|--dump-only|--serial[= ]{1}(?P<SERIAL>\w+)|--file[= ]{1}(?P<FILE>[-/\.\w]+))(?:\s|$))+$

You’ve got nested noncapturing groups, multiple alternatives,
“whitespace or end of string” inside another noncapturing group, etc
etc. And that’s not even trying to do something very complicated. I
don’t even know why you’ve got noncapturing groups in there, instead of
ordinary groups.

(Aside: it looks like you’re trying to parse a whole command line.
Usually one would parse the command line into distinct “words” (a single
option string), then validate each option individually with a separate
test.)
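
A rough sketch of that word-by-word approach, under some assumptions: the helper name bad_words and the three small patterns are invented for illustration, only the --name=value spelling is handled, and shlex.split stands in for whatever splitting the shell has already done. The payoff is that the caller learns which word was wrong.

```python
import re
import shlex

# Hypothetical per-word checks; the option names come from the regex above.
FLAG = re.compile(r"--(?:force|flash|dump-only)")
SERIAL = re.compile(r"--serial=\w+")
FILE = re.compile(r"--file=[-/.\w]+")


def bad_words(command_line: str) -> list[str]:
    """Return the words that none of the per-word patterns accepts."""
    words = shlex.split(command_line)
    return [
        word
        for word in words
        if not any(p.fullmatch(word) for p in (FLAG, SERIAL, FILE))
    ]


print(bad_words("--force --serial=ABC123 --file=out/image.bin"))  # []
print(bad_words("--force --serail=ABC123"))  # ['--serail=ABC123']
```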

in other words, would you say that it is easier to come up with a
pattern that matches strings of the desired format, but harder to come
up with one that does that plus does not match any string of an
undesired format?

Yes, if the alternatives are fiddly. You can accept rigidly a lot of
alternatives. For example:

 --(?P<opt>this|that|something)(=(?P<value>\S+))?

is easy to reason about and would remain so even with many --thing
choices. But I’d have gone:

 --(?P<opt>\w+)(=(?P<value>\S+))?

and then added additional if-style (or dict-lookup) tests on the string
in the “opt” group; after all, each option may want special handling
anyway based on that.

If nothing else, the regexp is smaller and you don’t have to tediously
eyeball it for every allowed option name.
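
As a sketch of the generic-pattern-plus-lookup idea: the VALIDATORS table and the check function are hypothetical names, and the option name is widened from \w+ to [\w-]+ because \w does not match the hyphen in "dump-only". That widening is an adaptation for this example, not something from the post.

```python
import re

# The generic pattern, with the name widened to allow hyphens (an adaptation).
OPTION = re.compile(r"--(?P<opt>[\w-]+)(=(?P<value>\S+))?")

# Hypothetical table: option name -> check applied to its value (None = absent).
VALIDATORS = {
    "force": lambda v: v is None,
    "flash": lambda v: v is None,
    "dump-only": lambda v: v is None,
    "serial": lambda v: v is not None and re.fullmatch(r"\w+", v) is not None,
    "file": lambda v: v is not None and re.fullmatch(r"[-/.\w]+", v) is not None,
}


def check(word: str) -> str:
    """Classify one command line word and say what is wrong with it, if anything."""
    m = OPTION.fullmatch(word)
    if m is None:
        return f"{word!r}: not an option"
    check_value = VALIDATORS.get(m["opt"])
    if check_value is None:
        return f"{word!r}: unknown option --{m['opt']}"
    if not check_value(m["value"]):
        return f"{word!r}: bad or missing value"
    return "ok"


for word in ["--force", "--serial=ABC123", "--bogus=1", "--serial"]:
    print(check(word))
```

Adding an option then means adding one dictionary entry rather than editing the regexp.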

Owing to the fact that the programmer probably can’t conceive of all
possible permutations of a string that could match a regex, given a
pattern of sufficient complexity.

Well, one doesn’t have to enumerate them all (an infinite space for some
things), but you do need to keep the complexity to a level you can debug
and understand. Trying to put everything into a single regexp is a path
to unmanageable complexity. (So are some other things, but regexps are
so… compact and composable that they get there pretty quickly.)

Cheers,
Cameron Simpson cs@cskk.id.au

vovavili · November 3, 2022, 3:56am

Save yourself some time and just use Typer callbacks to validate command line arguments. No one validates command line arguments with nothing but raw regular expressions.