Deprecate "old-style iteration protocol"?

I don’t think it is really unexpected that

for x in foo:
  do_something(x)

is similar to

i = 0
while True:
  x = foo[i]
  do_something(x)
  i += 1

with the extra detail that an IndexError ends the iteration instead of being raised.
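Spelled out, the fallback behaves roughly like this (an illustrative sketch, not the exact implementation):

i = 0
while True:
    try:
        x = foo[i]
    except IndexError:
        break  # the IndexError is swallowed and ends the loop
    do_something(x)
    i += 1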

In fact, many newcomers coming from C-like languages write the loop-with-indexing equivalent (for i in range(len(foo))) and use indexing, and they learn later that Python has a nicer way of doing it.

The stuff about __iter__ is an extra layer on top of that to customize it when you can’t / don’t want to use indexing for iteration.
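For reference, a minimal sketch of a class that relies on that old protocol (Squares is just a made-up example):

class Squares:
    # No __iter__ at all; for-loops fall back to __getitem__
    # with indexes 0, 1, 2, ... until IndexError.
    def __getitem__(self, index):
        if index >= 4:
            raise IndexError(index)
        return index * index

for x in Squares():
    print(x)  # prints 0, 1, 4, 9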

They’re not broken, so don’t need fixing.

I don’t understand the desire many people have to break other people’s code and make more work for other people, especially when the feature they want to remove doesn’t affect them personally.

It’s not whether the fix is four lines or four hundred lines, but that code that works in Python 3.0 through 3.11 suddenly breaks when you try to run it in 3.whatever, and the person trying to run the code has to work out why, then fix it. If they are even capable of that (maybe they are using a sourceless, byte-code-only app, or maybe they’re an end user with no programming skill).

Legacy code that works is not broken, and we should only break it if we have a really good reason.

The time to have removed this, if it needed removal, was in 3.0, when we removed or changed a bunch of other things for aesthetic reasons (e.g. old-style classes). We didn’t remove it then. That should tell us something.

1 Like

I have never, not once, seen a beginner ask a question about iteration in Python that was confused about the existence of the old sequence protocol, and I have spent a lot of time helping beginners on various forums.

Or if I have, it was so long ago, and so minor, that I have completely forgotten it.

But I have seen a lot of people, beginners and experienced coders alike, including some true Pythonista gurus, get confused about the iterator protocol and what it takes for an object to be an iterator (as opposed to what it takes for an object to be iterable).

Even without the sequence protocol, the iterator protocol is complex:

  • Objects with __iter__ and __next__ methods are iterators.
  • The __iter__ method must return self.
  • Objects with only an __iter__ method which doesn’t return self are very common, but they aren’t iterators and don’t seem to have a name apart from “iterable”.
  • But “iterable” also includes iterators.
  • If the __next__ method raises StopIteration, it must forever afterwards raise StopIteration. Otherwise it is officially broken.
  • People think that range() objects are iterators; they are not.

Compared to that, the sequence protocol is simple and straightforward! :wink:

1 Like

Lemme clarify a bit.

  • Objects with an __iter__ method are iterable. This method should return an iterator.
  • Objects with __iter__ and __next__ methods, where __iter__ returns self, are iterators.
  • If the __next__ method raises StopIteration, it must forever afterwards raise StopIteration. Otherwise it is officially broken. But broken iterators do happen.

(And your comment about range objects is part of that distinction: a range object is iterable, but it is not an iterator.)
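A quick interactive demonstration of that distinction:

>>> nums = range(3)   # iterable, but not an iterator
>>> it = iter(nums)   # __iter__ returns a range_iterator
>>> iter(it) is it    # an iterator's __iter__ returns self
True
>>> next(it)
0
>>> next(nums)
Traceback (most recent call last):
  ...
TypeError: 'range' object is not an iterator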

2 Likes

It’s not really “broken” if an iterator provides a way to rewind, advance, or otherwise modify the iteration. It just has to be used responsibly. For example, an io file object is an iterator over the lines of a file, and it also supports seek() to the beginning or end, or to a byte offset (or an opaque tell() value if it’s text I/O):

>>> import io, os
>>> f = io.StringIO('1\n2\n3\n')
>>> next(f)
'1\n'
>>> offset = f.tell()
>>> list(f)
['2\n', '3\n']
>>> f.seek(offset)
2
>>> list(f)
['2\n', '3\n']

>>> f.seek(0)
0
>>> next(f)
'1\n'
>>> f.seek(0, os.SEEK_END)
6
>>> list(f)
[]
2 Likes

According to the documentation, it’s still broken. Broken things can still be useful, but you can expect bizarre behaviour from them around their brokenness.

https://docs.python.org/3/glossary.html#term-iterator

I know of the “deemed broken” wording, but I don’t like that phrasing. I think an iterator is only strictly broken when __next__() fails to keep raising StopIteration even though nothing else has intentionally modified the iteration state. I’d have no misgivings if the docs stated that such cases are “undefined behavior” in the iteration protocol. For example, a dependent iterator probably won’t or can’t reset its state appropriately for a source iterator that has been resurrected like this. The contract is that once an iterator raises StopIteration, its consumer(s) can throw it away as exhausted. It’s a simple use-once-and-discard mentality. Anything more complex requires coupling between the producer and consumer.
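For illustration, here is a sketch of the kind of iterator being discussed (Resettable is a hypothetical class): it obeys the protocol right up until someone deliberately calls reset(), after which exhaustion is no longer permanent.

class Resettable:
    """An iterator whose exhaustion can be undone: "broken" per
    the glossary, or arguably just "use responsibly"."""

    def __init__(self, data):
        self.data = list(data)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.data):
            raise StopIteration
        value = self.data[self.pos]
        self.pos += 1
        return value

    def reset(self):
        self.pos = 0

it = Resettable([1, 2])
print(list(it))  # [1, 2], and it is now exhausted
it.reset()
print(list(it))  # [1, 2] again, surprising any consumer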

3 Likes

Well, okay. Change the wording from “broken” to “undefined behaviour”. Actually, that would be quite entertaining - it’ll set Steve D’Aprano off on one of his rants.

But either way, an external user of an iterator can’t know whether anything has modified the iteration state, so the idea that an iterator can be exhausted and then have more data is independent of any call to seek() etc. As I understand it, file objects have been broken in this way basically forever, and it hasn’t stopped them from being useful; but people shouldn’t be surprised if code like this fails:

def mutate(it):
    for thing in it:
        yield thing.upper()

with open("somefile") as f:
    lines = mutate(f)
    for line in lines: print(line)
    f.seek(0)
    for line in lines: print(line)

If you know how the file object works, you can see a potential fix: just reinitialize the mutator each time. But the mutator isn’t required to cope with broken iterators, and I don’t think that it’s a problem to call the file object broken in this way.
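Concretely, that fix might look like this (a sketch, reusing the same mutate generator and file as above):

with open("somefile") as f:
    for line in mutate(f):
        print(line)
    f.seek(0)
    # A fresh generator picks up the rewound file correctly.
    for line in mutate(f):
        print(line)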

Why not simply say that if an iterator raises StopIteration, clients are allowed to assume that it will continue to raise StopIteration on future calls to __next__()? That captures the key point here, while allowing other behaviour (iterators can offer a “reset” mechanism, and callers don’t have to make the assumption if they know better).

That’s pretty much what the docs say already:

Once an iterator’s __next__() method raises StopIteration, it must continue to do so on subsequent calls. Implementations that do not obey this property are deemed broken.

1 Like

My point is that the docs say that the iterator is “broken” if it violates that assumption. My version avoids making that judgement, and simply notes that clients can assume what happens next without checking. It’s of little consequence in terms of how people write code, but it might stop some of the arguments about whether something is a “proper” iterator in cases where it makes no practical difference.

But I’m not about to make a PR for the docs, so I don’t actually care that much.

If it walks like a duck, and quacks like a duck, it’s probably a duck. Even if it occasionally honks when you’re not looking :slightly_smiling_face:

2 Likes

A duck that honks is a Citroën 2CV :wink:

1 Like

Clients don’t need permission to make other assumptions about iterators after they have raised StopIteration. If you want to iterate over an iterator like this, there are no Python Police to stop you (although your peers may laugh at you behind your back):

it = iter(some_iterable)
for i in range(100):
    for obj in it:
        process(obj)

For most iterators, the last 99 attempts to iterate over it will be empty loops, but you never know when an exhausted iterator will suddenly recover and stop being exhausted. Right?

The risk is actually the other way. Here is a legitimate idiom that will fail if the iterator suddenly unexhausts itself:

words = iter(words)
# Process words before "STOP" in one way, and words afterwards
# in another way.
for word in words:
    if word == "STOP":
        break
    process_before_stop(word)

do_some_more_stuff()
# Now process words after "STOP"
for word in words:
    process_after_stop(word)

We should be able to assume that if the first loop exhausts the iterator (i.e. the sentinel “STOP” either doesn’t exist or is the very last word), the iterator will remain exhausted forever and the second loop will do nothing.

If iterators can be reset, then we don’t know if do_some_more_stuff() may have reset the iterator and broken our expectations about the iterator being exhausted.

And that is why, technically, file iterators are broken.

But then file I/O is a very grubby case. Errors can be transient; files can be modified by other processes even in the middle of a read. Reading a file is not idempotent: there is no guarantee that two reads of the same file from the same position will give the same data, even if you are reading from read-only media. Computing would be so much cleaner and simpler if there was no I/O :slight_smile:

Describing an iterator as “broken” is a provocative thing to say. But this is Python, and if you want to shoot yourself in the foot, you can. Broken things can be useful. If you want to give your iterators a reset mechanism, you can, but then don’t be surprised if that breaks people’s expectations about iteration.

1 Like

My proposal is to deprecate this fallback mechanism, emit a DeprecationWarning at first and eventually raise a TypeError unless __iter__ is explicitly implemented.

I really like this proposal (and judging by the hearts, others do too), but I find your example unmotivating.

I think your point about collections.abc.Iterable not considering the sequence protocol is motivating. This also means that getting type annotations to work with a class that implements it is awkward.

So, the tradeoff of deprecation would be breaking some code unnecessarily. At what point do purity and simplicity (one way to implement an iterable class) justify breaking code? Maybe never, unfortunately. Or maybe it would be enough to have an exceptionally long deprecation period?

3 Likes

In Python 3.0.

“Simplicity”, I get. But please justify the “purity” part – what is special about iteration that it is more “pure” for there to be only one implementation of it? I mean, there isn’t even “only one” way to iterate over a data structure.

Does this sense of “purity” mean that you will also deprecate non-iterator iterables, like lists, sets, dicts, dict views etc? Or only keep them, and deprecate iterator iterables, like generators?

“Don’t be stupid Steve, you know that’s not what I mean!” – no I don’t actually, which is why I ask. There are at least two ways to make something iterable using the iterator protocol, so if we insist on “one way” we have to get rid of one:

  1. Iterables with __iter__ returning self and __next__ being a stateful function;
  2. Iterables with no __next__ at all, and __iter__ that returns an object that implements 1 above.

A long deprecation period has costs of its own.

My use of the word “pure” is based on my interpretation of the sequence protocol as a practical way to implement sequences without having to bother to implement __iter__. This is in contrast to what I imagine to be the pure way of doing that: to inherit from collections.abc.Sequence and to implement the abstract methods.

My guess is that the major motivation for the sequence protocol was its practicality in defining sequence classes without inheriting from Sequence or implementing __iter__. However, this works poorly with static type checkers and linters, which can’t tell that such a class is a sequence or even iterable.

It also works poorly with code that dynamically checks types. A lot of modern Python code uses LBYL because LBYL plays better with static type checking. And very modern code uses structural pattern matching, which is LBYL by nature. Trying to match on collections.abc.Iterable fails for classes that don’t implement __iter__. This was fine when EAFP was more common.
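For instance, assuming a minimal __getitem__-only class named Seq:

from collections.abc import Iterable

class Seq:
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError
        return index

s = Seq()
print(list(s))                  # [0, 1, 2], via the sequence protocol
print(isinstance(s, Iterable))  # False, so "case Iterable():" won't match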

I understand the point about inducing a maintenance burden on users, however, not deprecating the sequence protocol induces a development burden on type checkers and linters (if they want to support such sequence types).

Does this sense of “purity” mean that you will also deprecate non-iterator iterables, like lists, sets, dicts, dict views etc? Or only keep them, and deprecate iterator iterables, like generators?

Sorry, I think I was unintentionally unclear. What I meant to argue for is to deprecate the sequence protocol only.

There are at least two ways to make something iterable using the iterator protocol, so if we insist on “one way” we have to get rid of one:

Sorry for the confusion, but I’m not suggesting deprecating iterables or iterators (which as you know are different concepts). Just the sequence protocol (which I would describe as a way of implementing the iterability of sequences).

What do you think?

1 Like

There could be a mixin in collections.abc like this:

from abc import ABCMeta, abstractmethod
from itertools import count

class ContainerAsIterator(metaclass=ABCMeta):
    @abstractmethod
    def __getitem__(self, index):
        raise IndexError

    def __iter__(self):
        try:
            for i in count():
                yield self[i]
        except IndexError:
            return

So that you only need to inherit from that to get the old behavior.
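Hypothetical usage, assuming the mixin sketched above:

from collections.abc import Iterable

class Digits(ContainerAsIterator):
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError
        return index

print(list(Digits()))                  # [0, 1, 2]
print(isinstance(Digits(), Iterable))  # True: __iter__ is inherited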

Backward compatibility is not achieved by saying “hey, we’ve made a breaking change, but here’s something you can do to reinstate the old way of doing things (which won’t work on older versions of Python)”.

2 Likes

See Many functions that accept Iterable should actually accept Iterable | SupportsGetItem · Issue #7813 · python/typeshed · GitHub for some previous discussion of this topic in the context of typing.

4 Likes

You can always use Iterable.register().
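For example (Seq being any class that implements only the sequence protocol):

from collections.abc import Iterable

class Seq:
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError
        return index

Iterable.register(Seq)
print(isinstance(Seq(), Iterable))  # True after registration

Note that registration only helps runtime isinstance() checks; static type checkers generally don’t pick up register() calls.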