Add a `__len__` and `__getitem__` ABC to `collections.abc`

Checking the length and then accessing items is a fairly common pattern for collections in Python. Currently, typeshed’s custom SupportsLenAndGetItem protocol is used around 30 times in typeshed, although it is likely that more potential uses are hidden in other type annotations, especially Sequence.

Currently, the best way to annotate these items is to use either _typeshed.SupportsLenAndGetItem – which doesn’t exist at runtime, with all the problems that this brings – or collections.abc.Sequence. The latter has a much broader interface than SupportsLenAndGetItem and is not considered a protocol for typing purposes (as opposed to most other ABCs in collections.abc).

I suggest we add another ABC to collections.abc, which will get treated as a protocol in typeshed:

class Assortment(Sized, metaclass=ABCMeta):  # inherits __len__
    @abstractmethod
    def __getitem__(self, index): ...

This would also be added as a mixin to Sequence and Mapping.
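As a rough runtime sketch of the ABC proposed above: Assortment is the name from the proposal, while Deck, Broken and last_item are hypothetical names used only for illustration. It assumes nothing beyond what collections.abc.Sized and abc.abstractmethod already provide.

```python
from abc import abstractmethod
from collections.abc import Sized


class Assortment(Sized):
    """Sketch of the proposed ABC: __len__ comes from Sized, __getitem__ is added."""

    @abstractmethod
    def __getitem__(self, index): ...


class Deck(Assortment):  # hypothetical concrete subclass
    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, index):
        return self._cards[index]


class Broken(Assortment):  # hypothetical: forgets to implement __getitem__
    def __len__(self):
        return 0


def last_item(items: Assortment):
    # Uses only the two methods the ABC promises.
    return items[len(items) - 1]


deck = Deck(["ace", "king", "queen"])
last = last_item(deck)

try:
    Broken()
    broken_is_instantiable = True
except TypeError:
    # The abstract __getitem__ blocks instantiation.
    broken_is_instantiable = False
```

Note that in this sketch list and dict are not instances of Assortment; they would only become so once the mixin is added to Sequence and Mapping, as suggested above.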

The typeshed equivalent would look like this:

_GetItemIndexT = TypeVar("_GetItemIndexT", int, slice, int | slice, default=int)

@runtime_checkable
class Assortment(Sized, Protocol[_T_co, _GetItemIndexT], metaclass=ABCMeta):
    def __len__(self) -> int: ...
    def __getitem__(self, k: _GetItemIndexT, /) -> _T_co: ...

One of the key differences between ABCs and protocols is that whereas both make promises about method signatures, ABCs also make promises about the meanings of the methods.

For example, I noticed that bisect_left uses SupportsLenAndGetItem. But it’s not just counting on having access to the methods __len__ and __getitem__. It’s also counting on the interface invariant whereby __getitem__ can be called with an integer (or integer-like) index and returns the item at that index.
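To make that invariant concrete, here is a minimal sketch with a hypothetical Squares class that is not a Sequence but still honours "integer index in, item at that index out", which is all bisect_left needs:

```python
import bisect


class Squares:
    """Hypothetical lazy collection: item i is i * i, for 0 <= i < n."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return i * i


# Items are 0, 1, 4, 9, 16, 25, 36, ...; the first one >= 30 sits at index 6.
position = bisect.bisect_left(Squares(10), 30)
```

The call only works because the items are sorted and addressed by position. Neither property is expressed by the method signatures alone.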

Typically, you would annotate something like that as Sequence. Is there a reason why Sequence is not acceptable here?

Similarly, numpy.concatenate is annotated to accept SupportsLenAndGetItem despite having the comment: # NOTE: Allow any sequence of array-like objects.

I don’t think that’s true. On the type-checking side there is no difference between the two, except that ABCs are less flexible because they require a subclass relationship. On the implementation side, an ABC could enforce some invariants, but that’s not guaranteed. I don’t think the collections.abc ABCs enforce any invariants. On the documentation side, both protocols and ABCs can be documented with invariants.

You’re absolutely right that interface behavior promises are not enforced (they can’t be).

There are however implicit promises made by ABCs. In my opinion, collections.abc.Sequence is implicitly promising that __getitem__(self, i) returns the ith element. This is exactly what it provides when you use its generated mixin method __iter__: the elements in order.

It seems that the two users of SupportsLenAndGetItem are in fact depending on the promise that __getitem__(self, i) returns the ith element. Therefore, why not use a Sequence?

Sequence requires a bunch of other methods besides __len__ and __getitem__
What if someone wants to pass to numpy.concatenate something that doesn’t implement index, and count, and __reversed__?

If we use it for Mapping then we can’t restrict it to int, slice, int | slice

Inheriting from Sequence only requires __len__ and __getitem__. The Sequence mixin provides the other methods that you mention.

If they inherit from Sequence, they get all of that stuff for free. Therefore, you are hypothesizing an object that doesn’t want Sequence to mix in the additional methods using its knowledge of how “sequences” work, right? So, the object is not a sequence.
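A small sketch of that point, using a hypothetical Countdown class: it defines only __len__ and __getitem__, and the Sequence mixin derives the rest on the assumption that successive integer indexes yield the elements in order.

```python
from collections.abc import Sequence


class Countdown(Sequence):
    """Hypothetical sequence start, start - 1, ..., 1 (integer indexes only)."""

    def __init__(self, start):
        self._start = start

    def __len__(self):
        return self._start

    def __getitem__(self, i):
        if not isinstance(i, int):
            raise TypeError("this sketch supports integer indexes only")
        if not 0 <= i < self._start:
            raise IndexError(i)
        return self._start - i


c = Countdown(3)

# None of these methods is defined on Countdown; they all come from the mixin.
forward = list(c)             # __iter__ walks indexes 0, 1, 2, ... until IndexError
backward = list(reversed(c))  # __reversed__ walks indexes len - 1 down to 0
where_is_two = c.index(2)
how_many_threes = c.count(3)
has_one = 1 in c              # __contains__
```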

Now, consider some function accepting SupportsLenAndGetItem. Presumably it is going to use the object without making any assumptions of it being a sequence. Are there any examples of such functions?

numpy.concatenate is not a good example because it is assuming that it is receiving an actual sequence (in the conceptual sense: successive indexes in the range [0, n-1] give the entire sequence).

I don’t think adding more ABCs or runtime-checkable protocols is a good idea for this. I’d be fine with a public, non-runtime-checkable protocol that’s supported in typing for this, or with waiting for intersections so people can just compose something like Indexable[int, slice, int | slice] & Sized.

The fact that someone doesn’t want to inherit from Sequence doesn’t mean that the object is not a sequence.
The reason for not wanting to inherit it could be because of concerns about import timing or something like that.

What do you mean by import timing? You mean time to import collections?

Also, I want to raise another issue with using SupportsLenAndGetItem for numpy.concatenate. It means that something like this passes type checking:

x = {3: np.ones(10), 5: np.zeros(10)}
np.concatenate(x)  # Passes!!

This reduces some of the benefits of type checking.
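The same effect can be reproduced without numpy. The sketch below defines a local protocol in the spirit of SupportsLenAndGetItem (the real one lives in _typeshed and does not exist at runtime, so this copy is an assumption for illustration) and a hypothetical first_and_last function. A dict with integer keys has both methods, so it satisfies the structural check, yet positional indexing fails.

```python
from typing import Protocol, Tuple, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)


@runtime_checkable
class SupportsLenAndGetItem(Protocol[T_co]):  # local stand-in, not the typeshed class
    def __len__(self) -> int: ...
    def __getitem__(self, k: int, /) -> T_co: ...


def first_and_last(items: SupportsLenAndGetItem[T_co]) -> Tuple[T_co, T_co]:
    # Relies on positional indexing, which the signatures alone do not promise.
    return items[0], items[len(items) - 1]


by_key = {3: "a", 5: "b"}

# The structural check only looks for the two methods, so a dict passes it.
dict_matches = isinstance(by_key, SupportsLenAndGetItem)

ok = first_and_last(["a", "b"])

try:
    first_and_last(by_key)
    dict_failed = False
except KeyError:
    # 0 is treated as a key, not as a position.
    dict_failed = True
```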

I would guess that 99% of uses of np.concatenate already use list or tuple. And the future of numpy (the Array API) only accepts list and tuple (and nothing else). The benefit of tighter annotations is that more errors are caught.

numpy is probably actually the best example here, because np.ndarray is not a Sequence, specifically because they don’t want the “full” Sequence semantics

This exact proposed idea was actually a suggestion in that thread 12 years ago

It’s good to have a “best example” for any proposal. But, the situation with Numpy arrays is extremely complicated. If you want to see an elegant breakdown of the protocols that apply to them, check out optype. You’ll see that __len__ does apply to some Numpy arrays.

What this proposal boils down to is adding optype’s CanSequence protocol. If the purpose is to better type Numpy arrays, I suggest we leave this to optype.

Also, the new Array API arrays don’t expose __len__ at all, so they are not instances of CanSequence.

I haven’t heard of optype before, thank you for that reference!

Definitely agree that the Numpy arrays (and the array API) are a complicated situation (and I’ve also just realized that you actually participated in the discussion on that issue I linked!), especially considering the same class can work like a scalar, a “sequence”, and a “nested sequence” [1] as well as being generic on both dtype and shape

I don’t think I fully agree with your distinction earlier where a Protocol doesn’t make a promise about the “meaning” of a method.

My own view is essentially: Protocols are used to “promise” that something “walks like a duck and quacks like a duck” but the mechanism is only actually checking that it can walk and quack. Using a Protocol means that you are accepting the risk that you might get a false positive of something that can walk and quack, but not like a duck.

I think then the key difference between our views is what we mean by “promise”, where you are taking the stricter stance that “if you aren’t checking something, then you can’t promise it” (which is more correct, but less convenient for a developer).

I don’t want to misinterpret or misrepresent your view though, so please correct me if there’s anything above that you don’t agree with.


[1] with those quotes doing some heavy work

The reason it doesn’t make a promise in general is that when you check whether a class is a subclass of a protocol, the only thing that’s checked is whether the methods exist. This is in contrast to nominal subtyping, which is explicitly opt-in, and so can come with promises. (You can say in the base class: “all subclasses must behave as follows…”) Protocols can’t do that, except I guess for magic methods, which are supposed to be reserved and so they can make behavioural promises.

That’s roughly it. If you really want the behavioral promises, my view is that you should just use nominal subtyping.

Consider what would happen if you tried to use CanSequence with np.ndarray: the false positive can cause runtime exceptions when you try to actually use the array as a sequence, which I think is a bad situation.

Or perhaps you’re imagining that you both check CanSequence and catch the potential exception. I think that would be an extremely rare pattern in Python: usually, you either LBYL or you EAFP. I’ve rarely seen them combined to check the same thing.

Moreover, having false positives with CanSequence would break pattern matching.