when you say they are hard to write correctly (with which my limited experience of them leads me to agree) do you mean in the sense that even if you manage to write a regex that matches strings of your desired format, there may be a string or strings that also match the regex - ones which you haven’t considered - that are not valid in the broader context of the program, if that makes sense?
That can be the case. With a trivial expression:
^foo$
that will clearly only match one string. Once you start allowing
optional items things become harder to think about, and repetitions with
optional components, maybe with some guards? The permutations become
many, and being sure you (a) match everything that is allowed and (b) do
not match anything not allowed can be hard to reason about.
For example, there was a recent thread with a regexp for simple
arithmetic expressions which had optional open and close brackets:
\(? .. stuff here ...\)?
and then tried further regexp based tests to validate whether there were
correct brackets. In the context of the larger problem that was getting
pretty tricky, and was probably infeasible using purely regexp tests.
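To make the bracket problem concrete, here is a minimal sketch. The pattern is my own cut-down stand-in (an optional open bracket, some digits, an optional close bracket), not the regexp from that thread:

```python
import re

# Hypothetical simplification: optional "(", one or more digits, optional ")".
# The two brackets are optional independently of each other.
pattern = re.compile(r"\(?\d+\)?")

candidates = ["42", "(42)", "(42", "42)"]

# fullmatch() requires the whole string to match the pattern.
accepted = [s for s in candidates if pattern.fullmatch(s)]
print(accepted)
```

All four candidates are accepted, including the two with unbalanced brackets. The pattern matches everything that is allowed, but also things that are not.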
Look at your regexp here:
^(?:(?:--force|--flash|--dump-only|--serial[=]{1}(?P<SERIAL>\w+)|--file[=]{1}(?P<FILE>[-/\.\w]+))(?:\s|$))+$
You’ve got nested noncapturing groups, multiple alternatives,
“whitespace or end of string” inside another noncapturing group, etc
etc. And that’s not even trying to do something very complicated. I
don’t even know why you’ve got noncapturing groups in there, instead of
ordinary groups.
(Aside: it looks like you’re trying to parse a whole command line.
Usually one would parse the command line into distinct “words” (a single
option string), then validate each option individually with a separate
test.)
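A rough sketch of that approach, assuming the command line arrives as a single string (in a real program sys.argv is usually already split for you). The function name parse_command_line and the OPTION_PATTERNS list are made up for illustration; the small patterns are lifted from the alternatives in your regexp, with [=]{1} written as a plain = and \. as . inside the character class:

```python
import re
import shlex

# One small pattern per option, each easy to read on its own.
OPTION_PATTERNS = [
    re.compile(r"--force"),
    re.compile(r"--flash"),
    re.compile(r"--dump-only"),
    re.compile(r"--serial=(?P<SERIAL>\w+)"),
    re.compile(r"--file=(?P<FILE>[-/.\w]+)"),
]


def parse_command_line(line):
    """Split line into words and validate each word separately.

    Returns a list of (word, named_groups) pairs.
    Raises ValueError on the first word that matches no pattern.
    """
    parsed = []
    for word in shlex.split(line):
        for pattern in OPTION_PATTERNS:
            m = pattern.fullmatch(word)
            if m:
                parsed.append((word, m.groupdict()))
                break
        else:
            # No pattern matched this word.
            raise ValueError(f"unrecognised option: {word!r}")
    return parsed


print(parse_command_line("--force --serial=abc123 --file=/tmp/fw.bin"))
```

Each pattern only has to describe one option, so there is no "whitespace or end of string" bookkeeping and no outer repetition to reason about.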
in other words, would you say that it is easier to come up with a
pattern that matches strings of the desired format, but harder to come
up with one that does that plus does not match any string of an
undesired format?
Yes, if the alternatives are fiddly. You can rigidly accept a lot of
alternatives. For example:
--(?P<opt>this|that|something)(=(?P<value>\S+))?
is easy to reason about and would remain so even with many --thing
choices. But I’d have gone:
--(?P<opt>\w+)(=(?P<value>\S+))?
and then added additional if-style (or dict-lookup) tests on the string
in the “opt” group; after all, each option may want special handling
anyway based on that.
If nothing else, the regexp is smaller and you don’t have to tediously
eyeball it for every allowed option name.
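Something like the following sketch shows the dict-lookup flavour. The KNOWN_OPTIONS table, its "takes a value" flags and the check_option name are all invented for illustration; only the regexp itself comes from above:

```python
import re

# Generic shape of an option: --name or --name=value.
OPTION_RE = re.compile(r"--(?P<opt>\w+)(=(?P<value>\S+))?")

# Hypothetical table: option name -> does it require a value?
KNOWN_OPTIONS = {
    "this": False,
    "that": False,
    "something": True,
}


def check_option(word):
    """Validate one option word; return (opt, value) or raise ValueError."""
    m = OPTION_RE.fullmatch(word)
    if not m:
        raise ValueError(f"not an option: {word!r}")
    opt = m.group("opt")
    value = m.group("value")  # None when there was no "=value" part
    if opt not in KNOWN_OPTIONS:
        raise ValueError(f"unknown option: --{opt}")
    needs_value = KNOWN_OPTIONS[opt]
    if needs_value and value is None:
        raise ValueError(f"--{opt} requires a value")
    if not needs_value and value is not None:
        raise ValueError(f"--{opt} does not take a value")
    return opt, value


print(check_option("--something=42"))
print(check_option("--this"))
```

Adding an option is then one more dict entry rather than another alternative in the pattern. One caveat: \w does not match a hyphen, so a name like dump-only would need something like [-\w]+ in the "opt" group.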
Owing to the fact that the programmer probably can’t conceive of all
possible permutations of a string that could match a regex, given a
pattern of sufficient complexity.
Well, one doesn’t have to enumerate them all (an infinite space for some
things), but you do need to keep the complexity to a level you can debug
and understand. Trying to put everything into a single regexp is a path
to unmanageable complexity. (So are some other things, but regexps are
so… compact and composable that they get there pretty quickly.)
Cheers,
Cameron Simpson cs@cskk.id.au