Minimizing requirements for children of `collections.abc.Sequence`

Alex-Wasowicz · June 13, 2024, 1:40pm

I personally find it weird, that classes implementing Sequence need to implement full __getitem__ functionality on their own (if they want it to work, obviously). Negative index is simple enough to be required, but still it shouldn’t be responsibility of child class. What is worse is requirement to implement slicing on its own. There is no simple slice iterator over a sequence in standard library, including itertools.islice which doesn’t detect sequences. Plus, some stub files (at least PyCharm-s) require Sequence as a return, not any Iterable. So minimal solution is bloated into:.

def __getitem__(self, key):
    length = len(self)
    if not isinstance(key, int): #unlikely
        if isinstance(key, slice):
            return tuple(self[i] for i in range(
                key.start if key.start is not None else 0,
                key.stop if key.stop is not None else length,
                key.step if key.step is not None else 1
            ) if -length <= i < length)
        key = int(key) #may raise TypeError
    if key < 0:
        key += length
    # Minimal sequence ``__getitem__`` here, for example:.
    if 0 <= key < length:
        return None
    else:
        raise IndexError

I think it would be better, if __getitem__ was a “mixin” method, that calls some non-mandatory magic name method. The only problem I see with this is additional work to be done on type checking system.

oscarbenjamin · June 13, 2024, 2:19pm

You can use the slice.indices() method:

    if isinstance(key, slice):
        return tuple(self[i] for i in range(*key.indices(length)))

Also it is better to convert the index with operator.index(key) rather then int(key) so that it does not accept floats or other non-integer types:

def __getitem__(self, key):
    length = len(self)
    if isinstance(key, slice):
        return tuple(self._getitem(i) for i in range(*key.indices(length)))
    key = operator.index(key)
    if key < 0:
        key += length
    if not (0 <= key < length):
        raise IndexError
    return self._getitem(key)

The reason that this logic needs to be implemented yourself is because other types like dict behave very differently.

What might be worth doing is adding a second argument to operator.index like:

def index(a, length=None):
    i = a.__index__()
    if length is not None:
        if i < 0:
            i += length
        if not (0 <= i < length):
            raise IndexError
    return i

Given the expected use for operator.index it seems reasonable that it could implement this standard logic around sequence length.

ncoghlan · June 14, 2024, 8:54am

I do like the idea of making the following addition to the operator module (there’s no need to make operator.index slower by adding an optional argument to it when a separate function will serve the purpose even more effectively):

def sequence_index(a, length):
    i = a.__index__()
    if i < 0:
        i += length
    if not (0 <= i < length):
        raise IndexError(f"index {a!r} out of range")
    return i

Beyond that, the challenge lies in making Sequence (and MutableSequence) subclasses meaningfully simpler to implement without causing other problems (such as introducing new ways to implement them incorrectly).

One clear weakness in the status quo is that the Sequence and __getitem__ documentation are both currently lacking is a description of how to correctly implement subscripting support in a way that mimics the behaviour of standard library sequences.

The __getitem__ documentation just says:

For sequence types, the accepted keys should be integers. Optionally, they may support slice objects as well. Negative index support is also optional.

Even Oscar’s sample code above contains an error, as it always returns a tuple for slices, when the type returned should be the same type as the instance being sliced (I’ve also inverted the sense of the type check, so only the one-liner simple case gets indented):

def _sequence_index(a, length):
    i = a.__index__()
    if i < 0:
        i += length
    if not (0 <= i < length):
        raise IndexError(f"index {a!r} out of range")
    return i

def __getitem__(self, key):
    length = len(self)
    if not isinstance(key, slice):
        return self._get_sequence_item(_sequence_index(key, length))
    # Return same type as this instance for slices
    cls = type(self)
    return cls(self._get_sequence_item(i) for i in range(*key.indices(length)))

At the very least, we should be providing that template code in the docs somewhere (probably as part of the Sequence ABC docs, with a specific mention in the __getitem__ docs).

Similarly for __setitem__ and __delitem__, as getting those right can be even fiddlier:

def __setitem__(self, key, value):
    length = len(self)
    if not isinstance(key, slice):
        self._set_sequence_item(_sequence_index(key, length), value)
        return
    # This is based on the way that slice assignment to Python lists works
    # (but accepts all Sequence instances without copying, not just list and tuple)
    # It's complicated enough that it would need testing to show it is correct as written
    start, stop, step = key.indices(length) # Check for a valid slice before anything else
    value_needs_conversion = value is self or not isinstance(value, Sequence)
    value_as_seq = list(value) if value_needs_conversion else value
    slice_indices = range(start, stop, step)
    slice_length = len(slice_indices)
    value_length = len(value_as_seq)
    if step != 1:
        # Non-contiguous and reversed slices can't change the sequence length
        if slice_length != value_length:
            f"attempt to assign iterable of size {value_length} to extended slice of size {slice_length}"
            raise ValueError(msg)
        for i, item in zip(slice_indices, value_as_seq):
            self._set_sequence_item(i, item)
        return
    # Contiguous forward slices can change the sequence length by adding or removing items
    # Existing entries are replaced up to the common length, then removed if the existing
    # slice is longer than the assigned value, or added if the assigned value has more entries
    if value_length == 0:
        self.clear()
        return
    for i, item in zip(slice_indices, value_as_seq):
        self._set_sequence_item(i, item)
    if slice_length > value_length:
        del_start = start + value_length
        del_stop = start + slice_length
        for i in reversed(range(del_start, del_stop)):
            self._del_sequence_item(i)
    elif slice_length < value_length:
        for offset in range(slice_length, value_length):
            self.insert(start+offset, value_as_seq[offset])

def __delitem__(self, key):
    length = len(self)
    if not isinstance(key, slice):
        self._del_sequence_item(_sequence_index(key, length))
        return
    # Reverse iteration to avoid incorrect indexing as entries are removed
    for i in reversed(range(*key.indices(length)):
        self._del_sequence_item(i)
    return

If something were to be added to collections.abc, then my suggestion would be to leave Sequence and MutableSequence alone, and instead find another way to add implementation support. Many custom container implementations are thin wrappers around an existing container type, so delegating the index and slice interpretation functionality to the wrapped container may make more sense than reimplementing it in terms of acces to the individual entries, so I don’t think we want to lose the existing ABC warnings about the required abstract methods.

The first option that occurred to me is to add SequenceMixin and MutableSequenceMixin classes that provide __getitem__, __setitem__, and __delitem__ support along the above lines.

The other option is to instead provide standalone helper functions that accept an already bound method for the individual item access, along the following lines:

def sequence_getitem(seq, key, get_single_item=None):
    if get_single_item is None:
        get_single_item = seq._get_single_item
    length = len(seq)
    if not isinstance(key, slice):
        return get_single_item(_sequence_index(key, length))
    # Return same type as this instance for slices
    cls = type(self)
    return cls(get_single_item(i) for i in range(*key.indices(length)))

# Use like:
class MySequence(Sequence):

    def _get_sequence_item(self, key):
        ...

    def __getitem__(self, key):
        return sequence_getitem(self, key, self._get_sequence_item)

The main benefit of the latter approach is that it’s more flexible when it comes to integrating it into existing code, since it doesn’t require adding a new mixin class to the inheritance heirarchy.

storchaka · June 14, 2024, 9:17am

I think it is OK to add an optional second argument for operator.index(). There are precedences with iter(), next() and others. This will not make it significantly slower.

As for returning the same type, we have the problem that constructor is not the part of the Sequence interface. Set has the same problem, and solves it by introducing the _from_iterable() method, which is not officially the part of the Set interface, but must be overridden in every Set subclass with unusual constructor. So, in the end, we will need to add a new internal method designed to be overridden.

oscarbenjamin · June 14, 2024, 3:25pm

I don’t have a strong opinion about sequence_index vs a second argument to index. Either is good and will also be faster compared to a pure Python implementation of the indexing logic.

Having that as a separate function rather than doing something with the Sequence ABC is useful because it is also needed for things that are not sequences e.g. for a matrix/array type:

def __getitem__(self, keys):
    key1, key2 = keys
    m, n = self.shape
    i = index(key1, m)
    j = index(key2, m)
    return self.items[i*n + j]

A full implementation here gets quite complicated because you can have mixed slice/integer and also it is common to support sequences as indices (outer indexing).

It would also be nice if there was a better way to handle index detecting an invalid type: I would prefer it to return None rather than raise TypeError so it is like

    i = index(key1, m)
    j = index(key2, m)
    if i is not None and j is not None:
        return self.items[i*n + j]

rather than

    try:
        i = index(key1, m)
    except TypeError:
        i = None
    try:
        j = index(key2, m)
    except TypeError:
        j = None
    if i is not None and j is not None:
        return self.items[i*n + j]

Typical cases are that each key might be

int-like (has __index__)
slice
sequence of int-like

Checking all three cases is awkward if duck-typing is allowed for the sequence part and usually you want to handle int-like up front as the fast path. That means that you are going to call index on an invalid object (slice or sequence) and need to handle that case.

Dutcho · June 15, 2024, 5:37am

If you’d rather LBYL than EAFP, can’t you match on SupportsIndex?

Alex-Wasowicz · June 15, 2024, 5:28pm

if isinstance(key1, SupportsIndex) and isinstance(key2, SupportsIndex):
    return self.items[index(key1, m)*n + index(key2, m)]

Just write what U mean. This avoids confusion. I still don’t know why U wrote “match”, instead of “match”.

oscarbenjamin · June 15, 2024, 6:57pm

I’ve never used isinstance checks with typing protocols before so I’m not familiar with how exactly it works. My guess is that this is a slow way of checking hasattr(key1, '__index__') which is itself a slower way than just calling index(key1).

Timings are:

In [17]: f = 2

In [18]: %timeit index(f)
33.7 ns ± 1.09 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [19]: %timeit hasattr(f, "__index__")
251 ns ± 2.72 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [20]: %timeit isinstance(f, SupportsIndex)
1.16 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Presumably the match/case statement can also handle checking against SupportsIndex if you don’t mind the check being 30x slower than actually calling index.

Alex-Wasowicz · June 16, 2024, 9:39am

It was wired to write, but just I wanted to know how code suggested by Dutcho would look like.

And if U want to write match statement with only one case. So, maybe Dutcho wanted to say something different, but most likely they just didn’t think about it and just typed.

oscarbenjamin · June 16, 2024, 12:45pm

A full implementation would have more cases. Something like:

from operator import index as _index

def index(a, length=None):
    i = _index(a)
    if length is not None:
        if i < 0:
            i += length
        if not (0 <= i < length):
            raise IndexError
    return i

class matrix:
    def __init__(self, shape, items):
        self.shape = shape
        self.items = list(items)

    def __repr__(self):
        m, n = self.shape
        rows = [self.items[i*n:(i+1)*n] for i in range(m)]
        return '\n'.join(map(str, rows))

    def _getitem(self, i, j):
        m, n = self.shape
        return self.items[i*n + j]

    def _getslice(self, si, sj):
        m, n = self.shape
        items = self.items
        i_indices = range(*si.indices(m))
        new_items = []
        for i in i_indices:
            new_items.extend(items[i*n:(i+1)*n][sj])
        new_m = len(i_indices)
        new_n = len(new_items) // new_m
        return matrix((new_m, new_n), new_items)

    def _getseq(self, indices_i, indices_j):
        items = self.items
        new_m = len(indices_i)
        new_n = len(indices_j)
        new_items = [items[i*n+j] for i in indices_i for j in indices_j]
        return matrix((new_m, new_n), new_items)

    def __getitem__(self, keys):
        key_i, key_j = keys
        m, n = self.shape

        try:
            i = index(key_i, m)
        except TypeError:
            i = None
        try:
            j = index(key_j, n)
        except TypeError:
            j = None

        if i is not None and j is not None:
            return self._getitem(i, j)
        elif isinstance(key_i, slice) and isinstance(key_j, slice):
            return self._getslice(key_i, key_j)

        if i is not None:
            indices_i = [i]
        elif isinstance(key_i, slice):
            indices_i = range(*key_i.indices(m))
        else:
            indices_i = [index(i, m) for i in key_i]

        if j is not None:
            indices_j = [j]
        elif isinstance(key_j, slice):
            indices_j = range(*key_j.indices(n))
        else:
            indices_j = [index(j, n) for j in key_j]

        return self._getseq(indices_i, indices_j)

I don’t know whether match/case would improve any of this (I’ve never actually used match/case).