Make `re.Match` a well-rounded `Sequence` type

vberlier · May 7, 2025, 3:47am

It would be nice if the following worked as expected:

import re
from collections.abc import Sequence

m = re.match(r"(a)(b)(c)", "abc")

assert isinstance(m, Sequence)
assert len(m) == 4
assert list(m) == ["abc", "a", "b", "c"]

abc, a, b, c = m
assert abc == "abc" and a == "a" and b == "b" and c == "c"

match re.match(r"(\d+)-(\d+)-(\d+)", "2025-05-07"):
    case [_, year, month, day]:
        assert year == "2025" and month == "05" and day == "07"

If you also work with JavaScript, this will feel very familiar:

let m = "abc".match(/(a)(b)(c)/)

console.log(m instanceof Array) // true
console.log(m.length) // 4
console.log(Array.from(m)) // [ 'abc', 'a', 'b', 'c' ]

let [abc, a, b, c] = m
console.log(abc) // abc
console.log(a) // a
console.log(b) // b
console.log(c) // c

Back in 2016, the re.Match object API was expanded to include __getitem__ as a shortcut for .group(...).

The goal was to improve usability and approachability by making re.Match objects fit a bit more seamlessly into Python’s core data model. Accessing groups via subscripting is now intuitive, but because re.Match objects define __getitem__ without __len__, they can’t be used as a proper Sequence type.
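To make the gap concrete, here is a small sketch of how subscripting behaves alongside len() and the Sequence check. The pattern and the group name `first` are made up for illustration, and the try/except is there because the outcome of len() depends on whether the interpreter includes the change proposed in this thread.

```python
import re
from collections.abc import Sequence

m = re.match(r"(?P<first>a)(b)(c)", "abc")

# Subscripting is a shortcut for .group(...), by index or by name.
assert m[0] == m.group(0) == "abc"
assert m[1] == m.group(1) == "a"
assert m["first"] == m.group("first") == "a"

# Without __len__, len() is expected to fail on interpreters that
# do not include the proposed change.
try:
    size = len(m)
except TypeError:
    size = None

print("len(m) ->", size)
print("isinstance(m, Sequence) ->", isinstance(m, Sequence))
```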

To me, this always felt a bit awkward. After digging up the original discussion, it seems like the reason why __len__ didn’t make it was that it was still undecided whether the returned value should take into account group 0 or not.

Almost a decade later, as a user, I see the __getitem__ implementation we’re now used to as suggesting a regular Sequence type that also happens to transparently translate group names, when given as subscripts, to their corresponding group index. In fact, this is how it works in the underlying C code.

With this in mind, we can simply define __len__ taking into account group 0, and we’ll finally be able to enjoy coherent re.Match objects that behave as proper Sequence types.
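As a rough illustration of the intended behavior, here is a pure-Python sketch that wraps a match object. `MatchSequence` is a hypothetical name used only for this example. It is not how the pull request implements the change, which happens in the C code of re.Match itself.

```python
import re
from collections.abc import Sequence


class MatchSequence(Sequence):
    """Hypothetical read-only wrapper; not part of the re module."""

    def __init__(self, match):
        self._match = match

    def __len__(self):
        # Pattern.groups counts capturing groups; +1 accounts for group 0.
        return self._match.re.groups + 1

    def __getitem__(self, key):
        # Delegates to .group(), which accepts indices and group names and
        # raises IndexError for unknown groups. Slices and negative indices
        # are not handled in this sketch.
        return self._match.group(key)


m = MatchSequence(re.match(r"(\d+)-(\d+)-(\d+)", "2025-05-07"))

assert isinstance(m, Sequence)
assert len(m) == 4
assert list(m) == ["2025-05-07", "2025", "05", "07"]

_, year, month, day = m
assert (year, month, day) == ("2025", "05", "07")
```

Inheriting from Sequence supplies __iter__, __contains__, index() and count() on top of the two methods defined here, which is the kind of coherence the proposal is after.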

I have a working pull request, feel free to check it out:

gh-133546: Make `re.Match` a well-rounded `Sequence` type by vberlier · Pull Request #133549 · python/cpython · GitHub

How would named groups work here?

The re.Match object is a collection of associated matched groups, which all have an index. The fact that some of them can additionally be referenced by name doesn’t change the number of matched groups associated with the match object.

Named groups are a non-exhaustive mapping: only some groups have names. This is why I’m suggesting that __len__ should return the total number of matched groups. Integer indices, by contrast, form a complete bijection with the groups, which is what the Sequence protocol relies on. The fact that the __getitem__ implementation also works with group names is just a nice convenience on top.
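A short sketch of that distinction, using a made-up pattern in which only two of the three groups are named:

```python
import re

m = re.match(r"(?P<year>\d+)-(\d+)-(?P<day>\d+)", "2025-05-07")

# Every capturing group has an index, named or not.
assert m.re.groups == 3
assert m.groups() == ("2025", "05", "07")

# Only some groups have names, so the name mapping is partial.
assert m.groupdict() == {"year": "2025", "day": "07"}
assert dict(m.re.groupindex) == {"year": 1, "day": 3}

# A name is just an alias for the group's index.
assert m["year"] == m[1] == "2025"
assert m["day"] == m[3] == "07"
```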

tjreedy · May 7, 2025, 5:57am

m[0] is ‘abc’, so len(m) must be 4. Adding __len__ would make m iterable by the older iteration protocol. To clear up possible confusion, the doc could just say that “m[0] is the entire match, and it and all groups, named or not, are included in len(m) and yielded upon iteration.” Lacking a different reason for the omission, I think yes.
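For readers unfamiliar with the older iteration protocol mentioned here, a minimal sketch with a hypothetical Python class: when a class has no __iter__, iter() falls back to calling __getitem__ with 0, 1, 2, ... until IndexError is raised. How this carries over to a C-implemented type like re.Match depends on how that type is written, so treat this only as an illustration of the protocol itself.

```python
class Groups:
    """Hypothetical stand-in for illustration; not part of the re module."""

    def __init__(self, *items):
        self._items = items

    def __getitem__(self, index):
        # Indexing the tuple past its end raises IndexError,
        # which is what stops legacy iteration.
        return self._items[index]

    def __len__(self):
        return len(self._items)


g = Groups("abc", "a", "b", "c")

assert len(g) == 4
assert list(g) == ["abc", "a", "b", "c"]  # works without __iter__

whole, a, b, c = g
assert whole == "abc"
```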

storchaka · May 7, 2025, 7:49am

This is a can of worms I warned about when __getitem__ was added for re.Match. It was only added with the implication that __iter__ and __len__ would never be added.

This is because the semantics are confusing. Most users do not expect group 0 (the whole matched text) to appear when they want to “unpack” the match object. Use the groups() method to get all captured groups as a sequence.
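For reference, a short sketch of the groups() approach, reusing the date pattern from the first post:

```python
import re

m = re.match(r"(\d+)-(\d+)-(\d+)", "2025-05-07")

# groups() returns only the capturing groups, without group 0.
year, month, day = m.groups()
assert (year, month, day) == ("2025", "05", "07")

# The whole matched text stays available separately.
assert m.group(0) == "2025-05-07"
```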

See also Make re match object iterable · Issue #53738 · python/cpython · GitHub.

vberlier · May 7, 2025, 12:42pm

I think that trying to hide group 0 is actually a historical wart. By trying to safeguard users against it by making the groups() method omit group 0, we dug our own pit of ambiguity. JavaScript doesn’t try to be clever about it. Group 0 being the whole matched text is the common expectation across languages and regex libraries, so we don’t have to let the inconsistency of the groups() method permeate the rest of the API.

In the thread you linked to, it seems like it was still undecided how subscripting should be defined:

m[x] == m.group(x) == m.groups()[x - 1]
m[x] == m.group(x + 1) == m.groups()[x]

Without a proper answer to this question, of course the semantics of len() would be confusing. But in 2016 the first option prevailed, and today m[x] behaves as m.group(x), just like in the regex module. Unfortunately, things were left a little bit half-baked for a long time, but we can correct that. The regex module already has a len() implementation that takes into account group 0.

$ uv run --with regex python -c 'import regex; print(len(regex.match(r"(a)(b)(c)", "abc")))'
4

Edit: I just want to bring up this nice motivating example from the old thread where this enables unpacking when looping over re.finditer.

for s, k, v in re.finditer(r"(\w+):(\w+)", "abc:123"):
    assert s == "abc:123"
    assert k == "abc"
    assert v == "123"

This would work exactly the same as in JavaScript. I think this further disproves the idea that no programmer would expect group 0 to be the whole matched text when unpacking.

for (let [s, k, v] of "abc:123".matchAll(/(\w+):(\w+)/g)) {
    console.log(s) // abc:123
    console.log(k) // abc
    console.log(v) // 123
}
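For comparison, the Python loop above can be approximated without the proposed change by asking group() for several groups at once. This is only a sketch of a workaround, and the input string is extended with a second pair for illustration.

```python
import re

rows = []
for m in re.finditer(r"(\w+):(\w+)", "abc:123 def:456"):
    # group() with several arguments returns a tuple, group 0 included.
    s, k, v = m.group(0, 1, 2)
    rows.append((s, k, v))

assert rows == [("abc:123", "abc", "123"), ("def:456", "def", "456")]
```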

BrenBarn · May 7, 2025, 6:47pm

Well, they could read the docs. . .

tjreedy · May 7, 2025, 9:22pm

The last equality fails for x=0 as m.groups()[x-1] is the last subgroup instead of the full match. I otherwise agree.
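A quick check of the x=0 case, using the pattern from the first post:

```python
import re

m = re.match(r"(a)(b)(c)", "abc")

# For x >= 1 the three spellings agree.
for x in range(1, m.re.groups + 1):
    assert m[x] == m.group(x) == m.groups()[x - 1]

# For x == 0 the index into groups() becomes -1, which wraps around
# to the last subgroup instead of the whole match.
assert m[0] == m.group(0) == "abc"
assert m.groups()[0 - 1] == "c"
```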

vberlier · May 7, 2025, 9:33pm

Yeah, I was just trying to quickly convey how the two possible implementations of __getitem__ would relate to both m.group and m.groups.

tjreedy · May 7, 2025, 9:35pm

I definitely expect iter(m), if it worked, to initially yield m[0] == m.group(0), ‘the whole matched text’. If I do not want that,

Right.