Structural Pattern Matching Should Permit Regex String Matches

macintacos · January 11, 2023, 3:58pm

I want to be able to match cases in a match statement against different regular expression strings. Essentially, I would like to be able to do something like what is described in this Stack Overflow question about Python:

match regex_in(some_string):
    case r'\d+':
        print('Digits')
    case r'\s+':
        print('Whitespaces')
    case _:
        print('Something else')

The above syntax is just an example taken from the SO post - I don’t have strong feelings about what exactly it should look like, I would just like it to fit cleanly into the language.

While working today, I did what I normally do when I don’t know whether a feature exists in Python: I wrote it the way I found most intuitive, planning to fall back on the documentation if I couldn’t quite get it right. I had assumed that you could pass an arbitrary regex as an r"" string to a case clause to get some rudimentary matching of the strings passed in, or perhaps use the re library to accomplish something like the above without a lot of hacking. Unfortunately that didn’t work, and I found out that this is basically not supported and you have to roll your own solution.

I don’t necessarily think that every feature of regular expressions needs to be supported (for example, the SO post goes into extracting capture groups from the cases that potentially match), but I think some rudimentary pattern matching should be supported. This at least feels intuitive when writing match statements, and the fact that it’s not possible feels odd to me.

NeilGirdhar · January 11, 2023, 5:24pm

I wish Python had a really powerful parsing library. This would be kind of a partial solution to a bigger problem.

Rosuav · January 11, 2023, 6:02pm

How powerful? For instance, should Python’s standard library include a full LR(1) grammar parser? I’m sure that’d be useful at some point, but I’ll be quite honest, it’s not something I’ve often wanted, so PyPI would be a better place for it. Personally, I think the language would benefit from a built-in sscanf style tokenizer (way simpler than a regex, and can never hit quadratic parse time), but even that isn’t all THAT common a need.

So let’s stick to what we have: regular expressions. And the question was asked about using them in a match statement.

The first problem with the example as given is that r'\d+' is NOT a regular expression literal. It is simply a string literal, identical to "\\d+" or '''\\d+''' or any other syntax for a string. But the second problem kinda solves the first one for us, because the standard for match/case in Python is that you simply match on the value itself, not saying what kind of matching you’re doing, and then the pattern is what defines everything. So… what happens if we do this?
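A quick check of the claim above, as a minimal sketch (the variable names are purely illustrative): the r prefix only changes how the source text is read, and the resulting object is an ordinary str.

```python
# A raw string literal only changes how backslashes are read in source code;
# the object it produces is a plain str, not any kind of regex object.
raw = r'\d+'
escaped = "\\d+"
triple = '''\\d+'''

same_value = raw == escaped == triple   # all three spell the same 3-character string
literal_type = type(raw)

print(same_value, literal_type.__name__, len(raw))
```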

match some_string:
    case re.compile(r"\d+"): print("Digits")
    case re.compile(r"\s+"): print("Whitespace")
    case _: print("Something else")

This currently doesn’t work, but would be at least plausible. The trouble is that custom pattern matching is generally on the basis of a type (see examples of using a dataclass), whereas a regex is an instance that needs to know its pattern string.

Also notable: You can do a single regex match that has multiple named capturing groups, and then see which group got a value. This may work out as a better way to do the switching.

All that said, though, I’ve almost never been in a position of needing to check whether a value matches any of a series of regexps. It just doesn’t seem all that common a situation.

NeilGirdhar · January 11, 2023, 6:31pm

I didn’t mean in the standard library. I wouldn’t want it in the standard library.

I don’t know how common it would be, but I wanted to process some tex files once, and realized that it was practically impossible without writing my own parser.

If we had a parser, we could write this kind of matching much more elegantly and it would support all of the bells and whistles that parsers have like backtracking, “actions”, saving sub-patterns, matching using seen sub-patterns, etc.

Rosuav · January 11, 2023, 6:46pm

Well, it probably exists on PyPI already. I did a quick look and there are several parsing libraries. So what is your wish?

Ah, so you want a full grammar parser. That is DEFINITELY a useful tool, or perhaps more accurately, a family of tools, but to try to write a tex parser, you’d probably want something that already exists.

Never had to parse tex, but I’ve had occasion to parse JavaScript, and for that I use esprima. It’s on PyPI, right where it belongs.

NeilGirdhar · January 11, 2023, 7:08pm

Unfortunately, last time I checked, it did not exist. I want a parser that supports what are sometimes called “parser actions”, which is arbitrary code that runs before every potential match (possibly rejecting the match), and after every match (possibly updating some parser state).

Concretely I want to parse tex files, which means having the ability to parse things like arbitrary \begin{abc}....\end{abc} where abc is not known a priori. That means that upon matching the begin, "abc" has to be added to a stack, and then upon matching the end, it needs to be popped and report an error if it doesn’t match.
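To make the stack requirement concrete, here is a deliberately naive sketch in plain Python. It is not a tex parser: it ignores comments, verbatim blocks, and macro definitions, and the function name and error messages are invented for illustration. It only shows the push-on-begin, pop-and-compare-on-end bookkeeping described above.

```python
import re

# Matches \begin{name} or \end{name}; the environment name is not known in advance.
ENV = re.compile(r"\\(begin|end)\{([^}]*)\}")

def check_environments(source):
    """Return True if every \\begin{x} is closed by a matching \\end{x}."""
    stack = []
    for m in ENV.finditer(source):
        kind, name = m.groups()
        if kind == "begin":
            stack.append(name)            # remember which environment is open
        elif not stack:
            raise ValueError(f"end of {name!r} at offset {m.start()} without a begin")
        else:
            expected = stack.pop()        # innermost open environment
            if expected != name:
                raise ValueError(
                    f"expected end of {expected!r} but found {name!r} at offset {m.start()}"
                )
    if stack:
        raise ValueError(f"environment {stack[-1]!r} is never closed")
    return True

print(check_environments(r"\begin{abc} text \begin{xyz} more \end{xyz} \end{abc}"))
```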

I looked for a while, and couldn’t find anything except basic parsers.

Exactly.

Anyway, regarding the idea suggested here, I think a real parser would subsume all these simpler parsing needs with an elegant framework.

Rosuav · January 11, 2023, 7:16pm

Maybe. But that parser has to be implemented in something, and that something could benefit from match/case simplicity. Layers upon layers.

(Unless the parser’s implemented in C for performance, of course. Which it quite possibly will be.)

daniele · January 11, 2023, 8:00pm

Raymond Hettinger describes an elegant solution that does not require embedding regular expressions in the Python syntax in his talk “Structural Pattern Matching in the Real World” (on YouTube). From quickly scanning the video, I think it is introduced around minute 14.

tjreedy · January 11, 2023, 11:26pm

Julian, you were on the right track! When a case pattern is a string literal, matching against target evaluates target == string. For regex patterns, the solution is to wrap the real target string so that the wrapper compares equal to any regex that matches it. Raymond’s re.fullmatch solution from the video mentioned above, renamed here, works for your example. (@daniele, thank you for finding this!)

import re

class REqual(str):
    "Override str.__eq__ to match a regex pattern."
    def __eq__(self, pattern):
        # Equal when the whole string matches the pattern; return a real bool.
        return re.fullmatch(pattern, self) is not None

def try_match(s):
    match REqual(s):
        case r'\d+':
            print('Digits')
        case r'\s+':
            print('Whitespaces')
        case _:
            print('Something else')

try_match('345')
try_match('  ')
try_match('abc')

prints

Digits
Whitespaces
Something else

Variations are possible if one does not require a full match. A similar idea would potentially work for other types of patterns.

(Added comment: people have noted that subclassing builtins is in general awkward at best. But when, as here, the subclass instance is used once, with one known method, it works great.)

NeilGirdhar · January 12, 2023, 12:44am

That’s an interesting idea, but I don’t think it would be implemented using regular expressions or match statements.