Optimise non-backtracking regex patterns with non-deterministic finite automata

Back in 2007, Russ Cox noted in "Regular Expression Matching Can Be Simple And Fast" (https://swtch.com/~rsc/regexp/regexp1.html) that for pathological regular expressions, Python (among other languages) uses an exponential-time backtracking regex algorithm, when it could instead use a linear-time simulation of a non-deterministic finite automaton (NFA).

Cox gives the example of the following string and regular expression:

string: aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
regex: a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa

You can test the performance in Python as follows:

❯ time python -c "import re; re.match('a?'*29 + 'a'*29, 'a'*29)"
python -c "import re; re.match('a?'*29 + 'a'*29, 'a'*29)"  61.61s user 0.48s system 98% cpu 1:02.78 total

In Python 3.10, this completes in about 50-60 seconds.
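
To see the exponential behaviour directly, you can time the match for increasing pattern sizes. A quick sketch using only the standard library (exact timings will vary by machine and Python version):

import re
import time

# The backtracking engine tries every way of dividing the input between
# the optional 'a?'s and the required 'a's before giving up on a branch,
# so each extra 'a' roughly doubles the running time.
for n in range(18, 24):
    pattern = 'a?' * n + 'a' * n
    start = time.perf_counter()
    re.match(pattern, 'a' * n)
    print(n, f'{time.perf_counter() - start:.3f}s')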

For comparison, here is egrep, which avoids backtracking by using the automaton-based approach Cox highlights:

❯ time echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaa | egrep 'a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaa  0.00s user 0.00s system 45% cpu 0.002 total
egrep   0.00s user 0.00s system 83% cpu 0.004 total

This completes in basically no time at all.
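
For the curious, here is a minimal sketch of the technique Cox describes, restricted to patterns built from literal characters each optionally followed by '?' (enough for this example). The names parse and nfa_match are mine, purely for illustration. Instead of exploring one alternative at a time, it advances a set of NFA states per input character, so the running time is linear in the input:

def parse(pattern):
    # Tokenise a tiny regex subset: a literal character, optionally
    # followed by '?'. A real engine would build a Thompson NFA from
    # the full syntax instead.
    tokens = []
    i = 0
    while i < len(pattern):
        optional = i + 1 < len(pattern) and pattern[i + 1] == '?'
        tokens.append((pattern[i], optional))
        i += 2 if optional else 1
    return tokens

def nfa_match(pattern, text):
    # Anchored (full) match via set-of-states simulation: state s means
    # "the first s tokens have been consumed".
    tokens = parse(pattern)

    def closure(states):
        # Epsilon closure: an optional token may be skipped for free.
        out, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            if s < len(tokens) and tokens[s][1] and s + 1 not in out:
                out.add(s + 1)
                stack.append(s + 1)
        return out

    current = closure({0})
    for ch in text:
        moved = {s + 1 for s in current
                 if s < len(tokens) and tokens[s][0] == ch}
        current = closure(moved)
        if not current:
            return False
    return len(tokens) in current

With this, nfa_match('a?' * 29 + 'a' * 29, 'a' * 29) returns True in well under a millisecond, because each input character touches at most len(tokens) states. (Note this is a full match, like re.fullmatch; for this example that makes no difference.)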

My proposal is simply to do what Cox suggested: add a fast path that serves as the default regex engine, falling back to the current backtracking algorithm whenever the pattern contains constructs that require backtracking, such as backreferences. (Those constructs cannot be handled by a plain NFA simulation.)
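
Concretely, the dispatch could look something like this. This is only a sketch: the detection regex below is illustrative rather than exhaustive, and nfa_match stands in for a real linear-time engine (such as the simulation sketched earlier):

import re

# Constructs that force the backtracking engine: numeric backreferences
# (\1-\9), named backreferences ((?P=name)) and lookaround. Illustrative,
# not exhaustive.
_NEEDS_BACKTRACKING = re.compile(r'\\[1-9]|\(\?P=|\(\?<?[=!]')

def smart_match(pattern, string):
    # Route plain patterns to the hypothetical linear-time engine and
    # everything else to the existing engine.
    if _NEEDS_BACKTRACKING.search(pattern):
        return re.match(pattern, string)   # current backtracking engine
    return nfa_match(pattern, string)      # linear NFA simulation

A real implementation would of course live inside the engine itself and return proper match objects; this only shows the routing decision.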


You’re not the first to have suggested this. Are you intending to write the code to do this?


I had a look at the issue tracker and this discussion board and couldn’t find the previous suggestions. Perhaps I used the wrong search terms.

If there is interest, I can add it to my list of projects to maybe one day attempt. I don’t have a great deal of spare time, so no guarantees that I’ll get to it.

Great optimisation, David - really interesting that it goes back to 2007.

So many of us have “two problems” already. Good on you for being willing to write some code eventually.

I’m almost certainly missing the point. But for the record, my view is that your considerable talents could be better spent elsewhere, on things that would add much more value for all users.

My request is: please can someone provide a motivating example to make the case for this, one that doesn’t rely on an awful, obviously inefficient regex that the user should’ve thought twice or even thrice about before committing, let alone writing in the first place?

It’s far beyond my station to tell the core team what to do. But there are numerous new issues on GitHub every day. So I’m guessing:

The core team don’t want to spend their valuable time writing micro-optimisations that will only ever optimise shit code.

With my Pygments maintainer hat on: we run into this all the time.

https://pygments.org/docs/lexerdevelopment/#common-pitfalls-and-best-practices

https://github.com/search?q=repo%3Apygments%2Fpygments+catastrophic+backtracking&type=commits
