It would be nice if the following worked as expected:
import re
from collections.abc import Sequence

m = re.match(r"(a)(b)(c)", "abc")
assert isinstance(m, Sequence)
assert len(m) == 4
assert list(m) == ["abc", "a", "b", "c"]
abc, a, b, c = m
assert abc == "abc" and a == "a" and b == "b" and c == "c"
match re.match(r"(\d+)-(\d+)-(\d+)", "2025-05-07"):
    case [_, year, month, day]:
        assert year == "2025" and month == "05" and day == "07"
If you also work with JavaScript, this will feel very familiar:
let m = "abc".match(/(a)(b)(c)/)
console.log(m instanceof Array) // true
console.log(m.length) // 4
console.log(Array.from(m)) // [ 'abc', 'a', 'b', 'c' ]
let [abc, a, b, c] = m
console.log(abc) // abc
console.log(a) // a
console.log(b) // b
console.log(c) // c
Back in 2016, the re.Match object API was expanded to include __getitem__ as a shortcut for .group(...).
The goal was to improve usability and approachability by making re.Match objects fit a bit more seamlessly into Python’s core data model. Accessing groups via subscripting is now intuitive, but because re.Match objects only have a __getitem__ and no __len__, they can’t be used as a proper Sequence type.
To me, this always felt a bit awkward. After digging up the original discussion, it seems __len__ didn’t make it because it was still undecided whether the returned value should take group 0 into account or not.
Almost a decade later, as a user, the way I see it is that the __getitem__ implementation we’re now used to suggests a regular Sequence type that also happens to transparently translate group names provided as subscripts to their corresponding group index. In fact, this is how it works in the underlying C code.
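To illustrate that name-to-index translation from the Python side, here is a small sketch. The date pattern and the variable names are just made up for the example; Pattern.groupindex is the documented mapping from group names to group numbers.

```python
import re

# Two named groups and one unnamed group (illustrative pattern).
m = re.match(r"(?P<year>\d+)-(?P<month>\d+)-(\d+)", "2025-05-07")
assert m is not None

# Pattern.groupindex maps each group name to its numeric index.
name_to_index = dict(m.re.groupindex)

# Subscripting by name returns the same value as subscripting by that index.
for name, index in name_to_index.items():
    assert m[name] == m[index]

by_name = m["year"]
by_index = m[1]
```

Seen this way, a name is only an alias for a position, and the position is what identifies the group.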
With this in mind, we can simply define __len__ taking into account group 0, and we’ll finally be able to enjoy coherent re.Match objects that behave as proper Sequence types.
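As a rough approximation of the idea in pure Python, here is a sketch of a wrapper class. MatchSequence is a hypothetical name that I'm using only for illustration; it is not part of the re module and not how the change is implemented in C. It assumes the length should be Pattern.groups plus one for group 0.

```python
import re
from collections.abc import Sequence


class MatchSequence(Sequence):
    """Hypothetical wrapper exposing a match object as a Sequence."""

    def __init__(self, match):
        self._match = match

    def __len__(self):
        # Pattern.groups counts the capturing groups; add one for group 0.
        return self._match.re.groups + 1

    def __getitem__(self, index):
        if isinstance(index, str):
            # Group names are forwarded unchanged to the match object.
            return self._match[index]
        if isinstance(index, slice):
            return [self._match[i] for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("no such group")
        return self._match[index]


wrapped = MatchSequence(re.match(r"(a)(b)(c)", "abc"))
assert isinstance(wrapped, Sequence)
assert len(wrapped) == 4
assert list(wrapped) == ["abc", "a", "b", "c"]
abc, a, b, c = wrapped
```

The Sequence base class supplies iteration, containment checks, index() and count() on top of these two methods, which is why defining __len__ is the only missing piece.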
I have a working pull request, feel free to check it out:
How would named groups work here?
The re.Match object is a collection of associated matched groups, which all have an index. The fact that some of them can additionally be referenced by name doesn’t change the number of matched groups associated with the match object.
Named groups are a non-exhaustive mapping: not every group has a name. This is why I’m suggesting that __len__ should return the total number of groups, including group 0. Indexing through the Sequence protocol is guaranteed to reach every group exactly once. The fact that the __getitem__ implementation also works with group names is just a nice convenience on top.
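A quick sketch of why names alone can't define the length, using a made-up pattern where only some groups are named. The variable names are just for illustration; total_groups reflects the length I'm proposing, not something the match object reports today.

```python
import re

# The month group is deliberately left unnamed (illustrative pattern).
m = re.match(r"(?P<year>\d+)-(\d+)-(?P<day>\d+)", "2025-05-07")
assert m is not None

# groupdict() only knows about the named groups.
named = m.groupdict()

# Proposed length: every capturing group plus group 0.
total_groups = m.re.groups + 1

# Every group, named or not, is reachable by index.
all_by_index = [m[i] for i in range(total_groups)]
```

Here the names cover two groups while the indices cover all four, so the index range is the only view that is complete.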