Allowing structural pattern matching to find things in the middle of lists

roblevy · August 12, 2024, 8:18pm

Inspired by this discussion on StackOverflow:

Boto3 lists buckets unhelpfully like this:

{'Buckets': [{'CreationDate': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzlocal()),
              'Name': 'foo'},
             {'CreationDate': datetime.datetime(1, 1, 1, 0, 0, tzinfo=tzlocal()),
              'Name': 'bar'},
             {'CreationDate': datetime.datetime(2024, 7, 30, 15, 4, 59, 150000, tzinfo=tzlocal()),
              'Name': 'baz'}],
 'Owner': {...}}

If I want to know whether a bucket foo is at the start of that list, I can write this:

match s3_response:
    case {"Buckets": [{"Name": "foo", "CreationDate": _}, *_]}:
        print("Bucket 'foo' found")
    case _:
        print("Bucket 'foo' not found")

And I can think of a similar solution if the bucket is known to be at the end.

But what about finding bucket bar which is in the middle of the list of buckets? You might hope you could do this:

match s3_response:
    case {"Buckets": [*_, {"Name": "bar", "CreationDate": _}, *_]}:
        print("Bucket 'bar' found")
    case _:
        print("Bucket 'bar' not found")

But that gives:

SyntaxError: multiple starred names in sequence pattern

Of course I could do things the boring old way, but where’s the fun in that:

'bar' in [bucket.get("Name", "") for bucket in s3_response.get("Buckets", [])]

As SO user jsbueno suggested, the “two *_ argument solution” looks kind of reasonable.

(Interestingly, ChatGPT actually offers this Syntax Error as a solution already, so clearly it thinks it’s a good idea!)

Another possible, but unattractive, way to write this would be:

match s3_response:
    case {"Buckets": [*buckets]}:
        for bucket in buckets:
            match bucket:
                case {"Name": "bar", "CreationDate": _}:
                    return True  # or whatever
    case _:
        return False

loic-simon · August 12, 2024, 10:00pm

What would the semantics be if two items matching the patters appear in the list? An exception at runtime?

Wombat · August 13, 2024, 2:16am

I can’t think of a workable pattern syntax to express being in the middle of a variable length sequence. It may be best to just use a guard expression.

Melendowski · August 13, 2024, 2:29am

I akin to this as, iterate through the sequence and if you find anything at all matching the pattern then stop.

blhsing · August 13, 2024, 2:31am

I’ve always hated the boto3 API making everything a list of dicts with names as values under the key Name rather than a simple dict with names as keys. /rant

Normally, unpacking into a sequence with two starred expressions is ambiguous and is therefore disallowed, but in this case of structural pattern matching where values matched by _ are to be discarded, [*_, {"Name": "bar", "CreationDate": _}, *_] does make sense. So +1 from me.

roblevy · August 13, 2024, 9:22am

The advantage of doing “pick the first match you find”, would be that you could then specify multiple matches as follows:

buckets = [
    {"name": "foo", "length": 10},
    {"name": "bar", "length": 20},
    {"name": "baz", "length": 30},
    {"name": "bar", "length": 40}
]

case buckets:
    match [*_, {"name": "bar", "length": length}, *_]:
        print(length)  # 20

case buckets:
    match [*_, {"name": "bar", "length": length1}, *_, {"name": "bar", "length": length2}, *_]:
        print(length1, length2)  # 20 30

This obviously wouldn’t work with an unordered sequence, like a set.

To match any number of “bar” buckets, you could do:

case buckets:
    match [*_, *{"name": "bar", "length": length}, *_]:
        print(length)  # [10, 20]

which could return {10, 20} if buckets was unordered?

zuo · August 13, 2024, 12:01pm

This obviously wouldn’t work with an unordered sequence, like a set.
[snip]
which could return {10, 20} if buckets was unordered?

Please note that:

sequences are always ordered;
sets are not sequences (generally).

And, when it comes to the case/match syntax, the [...] pattern matches sequences (except str/bytes/bytearray), but (generally) it does not match sets.

PS I say “generally”, as you could have a set which is also a Sequence.

brandtbucher · August 13, 2024, 11:57pm

Just my two cents… we deliberately disallowed this in the original design because it opens up a whole can of worms that requires lots of careful thought: backtracking.

Currently, patterns don’t ever backtrack, meaning that they’ll never check a particular subpattern more than once (each subpattern is compiled at most once, and there are no backward jumps). This keeps the time and/or space complexity of the pattern from exploding from simple patterns.

For instance:

case [*_, "foo", _] and case [_, "foo", *_] only check one element.
case [*_, "foo", *_] potentially checks N elements, where N is the length of the subject.
case [*_, "foo", *_, "bar", *_] potentially checks about N**2 elements.
case [*_, "foo", *_, "bar", *_, "baz", *_] potentially checks about N**3 elements.

I have a hunch that if you’re writing patterns like this (“regex for objects”), you may be working with the wrong data structures. Chances are that you don’t actually want a linear search for each element to happen behind the scenes. If you do, then a separate loop is probably more explicit.

And that’s ignoring the additional issues of:

Whether patterns search left-to-right, right-to-left, or in an undefined way.
Whether patterns like case [*_, "foo" | "bar", "baz", *_]" match ["foo", "bar", "baz"] (more backtracking fun).
Whether patterns like case [*_, a, b, *_] if a == b" match ["foo", "bar", "bar"] (even more backtracking fun!). The current behavior is that case [a, b, _] | [_, a, b] if a == b doesn’t match ["foo", "bar", "bar"]…

Anyways, consider me -0.

blhsing · August 14, 2024, 5:41am

Excellent points, but we sometimes have no choice but to work with the “wrong data structures” when dealing with certain popular APIs such as boto3 unless we are willing to bear the overhead cost of always transforming the wrong data structures into the right ones first.

I think by just allowing at most two *_s in a matching sequence we can satisfy the vast majority of real-world use cases while keeping the complexity sane and linear.

Like many other built-in operations that perform searches (such as list.index, list.remove, etc.) it can be simply documented that a pattern search is performed from left to right and stops at the first match, a behavior familiar to all Python users.

Backtracking at worst increases the time complexity from O(n) to O(n x m) where n is the size of the container and m is the number of items between the two *_s, which is usually insignificant compared to n.

So again, I think we just need an additional rule of allowing at most two *_s in a matching sequence. This feature will improve both readability and performance (because we otherwise would have to settle with the slower bytecode performance of uglier custom Python search code).

pf_moore · August 14, 2024, 7:52am

I don’t think language features should be motivated by bad API designs in 3rd party libraries. The correct fix here is to get boto3 to switch to a better API. The fact that it’s easier to get a discussion going on a change to Python says more about the support of boto3 than about the advisability of changing Python…

Rosuav · August 14, 2024, 7:59am

The way you’ve written it, “transforming… into the right [data structures]” is a last resort. I’d flip that on its head. If you first preprocess your data (usually an O(n) operation), you are then able to use far more efficient lookup techniques.

Yes, sometimes it may be handy to inline that (“I only need to do one lookup”), but that’s an alternative to doing the preprocessing.

blhsing · August 14, 2024, 8:04am

I’ve personally had many past use cases of a one-lookup query of a single name from a boto3 response so I empathize with the OP’s needs. Transforming the entire data structure into the right one is costly for a one-time lookup when an iterative lookup can short-circuit upon a match.

Asking the boto3 team to change their API is certainly a possibility, but irrespective of boto3 I see the [*_, foo, *_] pattern as an enhanced version of list.index–which everyone knows is inefficient to use in a loop, but which everyone knows is the best choice when the application needs only a one-time search. If list.index deserves a place as a Python built-in, so should the proposed pattern IMHO.

Rosuav · August 14, 2024, 9:37am

I agree, it’s definitely a use-case; I’m just flipping the goal around here, and saying that the goal is to simplify the one-off uses rather than to make lookups into messy data structures the norm.

But when you look at it in that way (as per your last sentence here regarding short-circuiting), you’ll often end up seeing a different and better way to do it - but often something that’s specific to your exact environment, and thus not really something for the language to support.

blhsing · August 14, 2024, 10:27am

I wouldn’t say those are one-off use cases. Oftentimes these one-lookup queries are needed in scheduled jobs that target buckets of specific names for example. Such needs are quite recurring.

Not sure why it has anything to do with my exact environment. I don’t think there is a better way to perform such one-lookup queries given the response from the given API. It’s like saying Python shouldn’t have offered list.index because we should always change the data structure first so we can perform an “efficient” one-time lookup in a dict instead. What’s “better” is really situational. A language should offer different tools to support different “better” approaches given different situations.

Rosuav · August 14, 2024, 10:48am

One-off for any given dataset. Otherwise, it’d be worth preprocessing it.