Deprecate "old-style iteration protocol"?

Currently, custom classes that implement __getitem__, but don’t implement __iter__ get a “default” __iter__ implementation. The official documentation seems to call it “old sequence iteration protocol” or “old-style iteration protocol”.

In practice, this “default” implementation assumes that the object is list-like and tries to iterate over its items (o[0], o[1], o[2], ... until an IndexError is raised). Unfortunately, this behaviour has some surprising side effects when the object is dict-like instead of list-like.
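For anyone who hasn't run into it, the fallback can be approximated in pure Python roughly like this. It is only a sketch: `legacy_iter` and `Squares` are names made up for illustration, and the real logic lives in C inside iter().

```python
def legacy_iter(obj):
    """Approximate what iter(obj) does when obj only defines __getitem__."""
    index = 0
    while True:
        try:
            item = obj[index]   # always probes 0, 1, 2, ...
        except IndexError:      # only IndexError ends the loop cleanly
            return
        yield item
        index += 1


class Squares:
    """Hypothetical list-like class with __getitem__ but no __iter__."""

    def __getitem__(self, index):
        if index >= 5:
            raise IndexError(index)
        return index * index


print(list(legacy_iter(Squares())))  # [0, 1, 4, 9, 16]
print(list(Squares()))               # same result via the built-in fallback
```

Any exception other than IndexError simply propagates out of the loop, which is where the odd KeyError tracebacks for dict-like classes come from.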

Consider the following examples:

class identity_dict:
    def __getitem__(self, key):
        print(key)
        return key

d = identity_dict()
yes = d["yes"] # works as expected

# Each of the following lines runs forever, printing numbers from 0 to infinity
v = list(d)
"yes" in d

class faulty_delegate_dict:
    __getitem__ = {0: "zero", 1: "one", "foo": "bar"}.__getitem__

f = faulty_delegate_dict()
zero = f[0] # works as expected

# Each of the following lines raises an exception
# with a completely useless traceback:
#     Traceback (most recent call last):
#       File "foo.py", line X, in <module>
#         list(f)
#     KeyError: 2
v = list(f)
"bar" in f

# But this returns True
"one" in f

# And this prints "zero" and "one" before raising the exception
for x in f:
    print(x)

IMHO, trying to iterate over an object that doesn’t implement __iter__, but does implement __getitem__ should not work since the ability to get an item is not sufficient to iterate over items. Instead, it should raise a TypeError just like in the base case:

class not_iterable:
    pass

n = not_iterable()

"what" in n # TypeError: argument of type 'not_iterable' is not iterable
list(n) # TypeError: 'not_iterable' object is not iterable

My proposal is to deprecate this fallback mechanism, emit a DeprecationWarning at first and eventually raise a TypeError unless __iter__ is explicitly implemented.
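To make the first stage of the proposal concrete, here is a rough sketch of what the warning could look like if it were emulated in user code. `checked_iter` and `OldStyle` are hypothetical names, and the real change would of course live inside iter() itself rather than in a wrapper.

```python
import warnings


def checked_iter(obj):
    """Hypothetical stand-in for iter() during the deprecation period."""
    cls = type(obj)
    # Warn only when iteration would succeed purely through __getitem__.
    if not hasattr(cls, "__iter__") and hasattr(cls, "__getitem__"):
        warnings.warn(
            f"iterating over {cls.__name__!r} via __getitem__ is deprecated; "
            "define __iter__ explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
    return iter(obj)


class OldStyle:
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return index


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    items = list(checked_iter(OldStyle()))   # warns, still iterates
    checked_iter([1, 2, 3])                  # has __iter__, no warning

print(items, [w.category.__name__ for w in caught])
```

Classes that set `__iter__ = None` are not warned about here, since they have already opted out explicitly.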

P.S. Interestingly, collections.abc.Iterable currently doesn’t consider these “old-style” iterables to be valid Iterables, so deprecating this fallback would also make the typing a bit clearer.
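A quick way to see the mismatch between the ABC and iter(). The class name `OnlyGetitem` is made up for this example:

```python
from collections.abc import Iterable


class OnlyGetitem:
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)
        return index


obj = OnlyGetitem()
print(isinstance(obj, Iterable))  # False
print(list(obj))                  # [0, 1, 2]
```

So the isinstance check says "not iterable" while iteration works fine in practice.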

Let’s not break code that relies on the sequence protocol just to cover up a side-effect of writing an unusual, and potentially dangerous, class.

The problem here is that your “identity dict” behaves as an infinite lazy sequence. There is nothing wrong with infinite lazy sequences, but you can’t call list() on them.

from itertools import count  # Another infinite lazy sequence.

c = count()
list(c)  # Iterates forever.

Be warned that iterating over count() seems to be uninterruptible, at least when I tried it, and so you may need to kill the terminal it is running in.

A better idea, in my opinion, would be a protocol by which iterators and lazy sequences can signal to consumers that they are infinite. We already have __length_hint__; perhaps we could use that. That would still allow iteration in for-loops, but prevent “full steam ahead” iteration to exhaustion into a list, tuple, set etc.
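For reference, this is roughly how the existing hint mechanism is consumed today. `Countdown` and `NoHint` are hypothetical classes. As currently documented, a hint has to be a non-negative integer or NotImplemented, so signalling "infinite" would need some extension that does not exist yet.

```python
import operator


class Countdown:
    """Hypothetical iterator that advertises how many items remain."""

    def __init__(self, start):
        self.remaining = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.remaining <= 0:
            raise StopIteration
        self.remaining -= 1
        return self.remaining

    def __length_hint__(self):
        return self.remaining


class NoHint:
    """Endless iterator that gives consumers no size information at all."""

    def __iter__(self):
        return self

    def __next__(self):
        return 0


print(operator.length_hint(Countdown(3)))  # 3
print(operator.length_hint(NoHint()))      # 0, the default when nothing is known
```

Note that a consumer cannot tell "unknown" from "infinite" with what is available now, which is the gap the suggestion above would have to fill.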

The solution for identity_dict is to give it an __iter__ method. You can’t meaningfully iterate over the keys, so let’s prevent it:

class identity_dict:
    def __getitem__(self, key):
        return key

    __iter__ = None  # Prevent iteration.

The bottom line here is that there is nothing wrong with iteration using the sequence protocol, nothing wrong with looping over a lazy infinite sequence or iterator, but there is a problem when you try to instantiate such an infinite sequence of values into a list, set or other finite collection. We should determine a way that such finite collections can identify such infinite iterators up front.

I understand why this is happening, but I still think that this behaviour violates the principle of least astonishment. Besides backwards compatibility/legacy reasons, I see no justification for assuming that types implementing __getitem__ but not __iter__ are “infinite lazy sequences” by default.

I am not suggesting that lazy infinite sequences are bad, just that they shouldn’t be the implicit default.

To be clear, I don’t have a problem that list(itertools.count()) doesn’t terminate. My problem is that iter(object_that_only_implements_getitem) works at all. The faulty_delegate_dict example shows the broken behaviour a bit more clearly, IMHO. ("one" in f returns True)

__iter__ is a method that corresponds to iter(obj) in the python data model.
__getitem__ should be a method that corresponds to obj[key] in the python data model.

In practice, __getitem__ also “automagically” creates a default __iter__ implementation that assumes that this class is a sequence with integer indices (and not a collection with arbitrary keys or something else entirely). This is wrong, IMHO.

I understand that raising TypeError on iter(object_that_only_implements_getitem) would be a breaking change; however, it’s not like Python never has any breaking changes. I am proposing adding a DeprecationWarning right now and changing it to a TypeError “in a later version™”.

Also, I am not sure how often the current implicit behaviour is even used (i.e. how often people implement __getitem__ without implementing __iter__ and then actively rely on it iterating over the integers, compared to how often people just implement __getitem__ and forget about __iter__ because they only need the obj[key] behaviour).

I would argue, the current behaviour contradicts 4 out of 19 lines (21%) from the Zen of Python. :laughing:

Explicit is better than implicit.
Special cases aren't special enough to break the rules.
Errors should never pass silently.
In the face of ambiguity, refuse the temptation to guess.

Note “types implementing __getitem__ but not __iter__ are” NOT “assumed to be “infinite lazy sequences” by default.”

class FiniteGetItemExample:
    def __getitem__(self, key):
        if key in range(10):
            return key
        else:
            raise IndexError

list(FiniteGetItemExample())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

About __getitem__ / __iter__ connection: many people coming from other languages (Java and its kin) code iteration unpythonically as a loop over getting items. The default implementation eases that path (no idea if that’s the reason it was included in paleo-Python).

Infinite lazy sequences are not the default. You have to give your class an __getitem__ method, and that method has to violate the expected behaviour of a finite sequence or mapping.

Your class does that, and the consequence is that it behaves as an infinite lazy sequence. If you don’t want it to behave as an infinite lazy sequence, then don’t program it to be an infinite lazy sequence.

The Principle Of Least Astonishment correctly applies to user interfaces, not APIs, and refers to what ordinary users expect, not skilled programmers. A widget that looks like a push button should behave as a push button. A hyperlink stating that it takes you to a glossary of terms should actually go to a glossary of terms.

We can extend the principle to APIs, but then the problem becomes that we no longer have anything even remotely close to “an ordinary user”. Programmers differ far more in what they consider “astonishing”, and that astonishment changes greatly as they become more experienced.

You might be surprised that your “identity_dict” class behaves as an infinite lazy sequence, but I wasn’t. I recognised it as soon as I saw the __getitem__ method, and to me there is no surprise that it iterates forever.

To me, it would be astonishing if it didn’t iterate forever.

So the POLA is a very weak argument when it comes to software APIs. Surprising to whom?

Don’t just dismiss backwards compatibility like that. The sequence protocol goes back to Python 1.x days, it is much, much older than the iteration protocol, and the Python language takes backwards compatibility very, very seriously.

Given the choice between

  1. breaking some unknown number of third-party scripts and applications,

  2. and requiring people who don’t want their subscriptable classes to be iterable to add one extra line of code to their class

we’re going to choose 2 unless there is some really, really, big and important reason to break people’s code.

“Some people, who haven’t learned about the sequence protocol, might be surprised; other people just don’t like it” is not a big important reason.

You should. That means that any time you call list(obj) on some unknown iterator, you have no idea whether it is going to terminate or loop forever.

And depending on your OS and the version of Python, that could simply raise MemoryError after some indefinite time (which your program probably doesn’t handle), or in the worst case it could lock up your computer to the point it needs a hard reset, or cause the OOM Killer to start randomly killing processes.

I’ve had both happen to me, although fortunately not on production servers!

Given how easy it is to write iterators that run forever, and how useful it is, I’m actually surprised that there aren’t more problems in practice with list not terminating.
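One defensive pattern for untrusted iterables is to cap how many items you are willing to materialise. A minimal sketch, assuming a hypothetical helper called `bounded_list` (not something in the standard library):

```python
from itertools import count, islice


def bounded_list(iterable, limit):
    """Hypothetical helper: list(iterable), but fail fast past `limit` items."""
    # Pull at most limit + 1 items so an overflow can be detected cheaply.
    items = list(islice(iterable, limit + 1))
    if len(items) > limit:
        raise ValueError(f"iterable produced more than {limit} items")
    return items


print(bounded_list(range(5), limit=10))   # [0, 1, 2, 3, 4]

try:
    bounded_list(count(), limit=1000)
except ValueError as exc:
    print(exc)
```

This doesn't detect infinite iterators as such, it only bounds the damage, and it works the same for objects iterated through the sequence protocol.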

Correct. It is a method, not the method. Iteration in the Python data model corresponds to a pair of protocols, the iteration protocol and the sequence protocol.

No, that’s not what happens. Your “identity_dict” class has no __iter__ method.

What actually happens is that the iter() builtin accepts objects that follow the sequence protocol as well as the iteration protocol.

A side-effect of this is that any code that tests whether an object is iterable by looking only for __iter__ is wrong.
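A more reliable check is to ask iter() directly. The sketch below uses a hypothetical helper `is_iterable` and two made-up classes to show hasattr() getting it wrong in both directions:

```python
def is_iterable(obj):
    """Return True if iter(obj) succeeds, whichever protocol makes it work."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


class OnlyGetitem:
    def __getitem__(self, index):
        raise IndexError(index)


class Blocked:
    def __getitem__(self, key):
        return key

    __iter__ = None  # explicit opt-out


print(hasattr(OnlyGetitem, "__iter__"), is_iterable(OnlyGetitem()))  # False True
print(hasattr(Blocked, "__iter__"), is_iterable(Blocked()))          # True False
```

Calling iter() here does not consume anything from a sequence-protocol object, since no lookup happens until the first next().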

Perhaps it was a mistake to unify sequence indexing and mapping subscripting with a single dunder method, but given that they both use the same syntax obj[x] it is hard to see how they could use different dunders. Because Python is dynamically typed, the compiler cannot tell whether obj[x] is a sequence that expects an index or a mapping that expects a key. So that’s a language limitation we have to live with.

(Although for code written using the C stable ABI, there actually is such a method to distinguish the two.)

Which doesn’t solve your problem right now. Right now, you still need to prevent your identity_dict object from being iterated. You can’t afford to wait until Python 3.15 or 3.16 or even some far distant 4.0 version in another 30 years.

So deprecation doesn’t help you, it just inconveniences those whose working code will start raising annoying warnings and then eventually stop working.

Deprecating working language features is not a step we take lightly. Obviously there have been changes to the language, but they are mostly additions, and relatively few subtractions. Some code, written for Python 1.4 or 1.5 and perhaps even older, is still capable of running under Python 3.10 or 3.11.

If you want something to be dict-like and NOT iterable, you can simply block iteration:

>>> class Identidict:
...     def __getitem__(self, key):
...             print(key)
...             return key
...     __iter__ = None
... 
>>> d = Identidict()
>>> next(iter(d))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Identidict' object is not iterable

I think that it was pretty clear what I meant. Types that implement __getitem__ are assumed to be “infinite lazy sequences” by default. If you want a finite sequence, you must raise IndexError. If you want a non-lazy sequence, you must define a __len__. If you want the collection to be a non-sequence, you have to implement __iter__ or set it to None.

An object that provides access to its elements via a key or index doesn’t need to be a Sequence, so the fact that python implicitly assumes so is wrong.

I am sorry, but this is just wrong. Quoting directly from the wikipedia article you linked:

In more abstract settings like an API, the expectation that function or method names intuitively match their behavior is another example.

Programmers are users. They are users of the API. “Skilled programmers” are just as human as “normal” users. If we follow your logic, we wouldn’t need error checking and intuitive APIs. “Skilled programmers” would just never make mistakes and would be able to instantly debug any problem in their code (even without useful error messages).

Yes, POLA is a subjective metric. That doesn’t make it a useless metric. This behaviour is a “gotcha”. Just because you are aware of this gotcha doesn’t make it any less surprising to people who haven’t stepped on this proverbial rake before.

Again, quoting directly from the wikipedia article you linked:

in particular a programmer should try to think of the behavior that will least surprise someone who uses the program, rather than that behavior that is natural from knowing the inner workings of the program.

The reason why this behaviour is not surprising to you is that you are familiar with it, because you know “the inner workings” of this API. Without this knowledge, this behaviour ceases to be “obvious”.

The way __getitem__ automatically implies a “default” sequence-like __iter__ implementation does not match the rest of the python data model.

  1. You don’t magically get a __mul__ implementation for integers after defining __add__. You would be surprised if x*3 automatically evaluated x+x+x when x.__add__ existed but x.__mul__ did not, right?
  2. This behaviour is surprising, because it doesn’t intuitively follow from “defining a method that gets an item”.
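To illustrate the first point: defining only __add__ does not make multiplication work. `Meters` is a hypothetical class used purely for the demonstration.

```python
class Meters:
    """Hypothetical value type that supports + but defines no __mul__."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other.value)


x = Meters(2)
print((x + x + x).value)  # 6

try:
    x * 3
except TypeError as exc:
    print("TypeError:", exc)  # no fallback from * to repeated +
```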

Again, I think that it’s pretty clear what I meant here. I meant that whether itertools.count() terminates is irrelevant to the problem described in my original post. Let’s not get sidetracked; this thread is about the fact that implementing only __getitem__ implicitly makes your collection a Sequence.

Yes, but this is a distinction without a difference as far as I am concerned. Again, I think that my point is perfectly understandable here:

  1. The programmer didn’t create an __iter__ method
  2. Python saw that there is a __getitem__ method and assumed that the collection is a Sequence
  3. Code calling iter or iterating over the collection or checking for element membership behaves as if a “default” __iter__ implementation was present, due to faulty assumption (2)

Yes, hasattr(foo, "__iter__") is False, but at this point you seem to be nitpicking the details. If it acts like a duck, walks like a duck and quacks like a duck… These types act like there is a default __iter__ implementation.

As I mentioned in my original post, collections.abc.Iterable from the standard library is also “wrong” in this way (yes, it is a documented behaviour, but an inconsistent documented behaviour).

My problem right now is that (non-sequence) collection types that define __getitem__ and not __iter__ behave “surprisingly”. Often, this surprising behaviour results in infinitely hanging code, exceptions in unrelated places and sometimes even silently incorrect code. DeprecationWarnings absolutely would make the situation better.

And some code written in Python 3.9 is not capable of running under Python 3.10. Also, Python 2 was EOL two years ago. People who still run code that was written for Python 1 and never updated are unlikely to be using Python 3.10 right now anyway.

I am not saying that we should “move fast and break things”, but let’s not pretend that any change that breaks backwards compatibility should be instantly rejected just because some Python 1 code might no longer work.

Two different people answered you the same way, so if you were misunderstood, what you meant wasn’t “pretty clear”.

But I don’t think you were misunderstood. I think you were, and still are, just incorrect in your claim that classes (with or without __getitem__) are “assumed to be infinite lazy sequences by default”.

The interpreter makes no assumption about your class being lazy or infinite. The fact that your class behaves as a lazy infinite sequence is because you programmed it to behave as a lazy infinite sequence (as well as an infinite mapping).

If the user of your identity class were to write this:


import random

obj = identity_dict()  # Your class.
key = random.random()
while True:
    try:
        print(obj[key])
    except LookupError:
        break
    key = random.random()

it would still loop forever and never terminate. And the nature of the identity dict is such that there is nothing you can do to prevent this infinite loop except tell your users “Don’t do that!” and make them aware that it is a lazy, infinite mapping.

At least with iteration (for-loops or list) you can disable that completely by setting the __iter__ dunder to None.

It isn’t lazy and infinite “by default”, it is lazy and infinite because you programmed it to be that way.

Well yeah.

How else are you going to signal that the sequence (or mapping) doesn’t have an index/key if you don’t raise an exception?

For real collections (sequences or mappings) you have some actual data structure with a finite size, so this issue doesn’t come up. When you run out of data, you get a LookupError (IndexError or KeyError).

Your mapping doesn’t have actual data. It lazily simulates fake data, and does so without terminating. So what did you expect to happen if the caller repeatedly looks up indexes/keys over and over again?

If every key always succeeds then that implies that it is infinite and lazy.

If it was not your intention to write an infinite, lazy collection then it is your code that is buggy, not the language. If it was your intention, then congratulations, you succeeded, and the Python interpreter did exactly what you told it to do, which was to loop forever.

Right. That’s because iteration in Python can use two different protocols, with __iter__ taking precedence.

Your class is not considered a Sequence. isinstance(identity_dict(), collections.abc.Sequence) returns False.

But your class behaves sufficiently like a sequence in this regard, because you programmed it to behave like a sequence.

In other words, your class might not swim or fly like a sequence, but it quacks like a sequence. For a task like iteration that only requires quacking, your class might as well be a sequence.

This sort of duck-typing is built deep in the Python execution model. If you don’t want it, I’m afraid you are using the wrong language :frowning:

Dunders are used by the interpreter to implement certain behaviours. If you write the dunder, you are responsible for that behaviour. The interpreter isn’t assuming anything – you have explicitly written the dunder to cause your class to behave in the way that it then behaves.

As far as your issues with the POLA, we’re going to simply have to disagree on this one.

You wrote a class with a dunder used to define sequence behaviour, and then were surprised that your class behaved like a sequence. I was not.

Right. And that is why arguments from the POLA are very weak when it comes to APIs, beyond such obvious and trivial examples that (let’s say) a function called “print” should print.

APIs consist of much more than just function names. Knowledge of protocols can be surprising if you don’t know the protocol, but that is not a violation of POLA. That’s just lack of knowledge.

This is just wrong. __getitem__ does not imply a default __iter__ implementation.

This is the second time you have made that wrong claim about a default __iter__, please stop repeating that misinformation. Your class has no __iter__ method, the interpreter does not add one, and iteration using __iter__ is only one of two ways that iteration is defined in Python.

Iteration falling back on __getitem__ is no different from the str() builtin falling back on __repr__ when __str__ doesn’t exist, the not-equal operator falling back on __eq__, or most operators having a reversed __rop__ method. You cannot assume that operations in Python only use a single dunder.
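A small demonstration of those three fallbacks, using a hypothetical `Point` class that defines only __repr__, __eq__ and __radd__:

```python
class Point:
    """Hypothetical class defining only __repr__, __eq__ and __radd__."""

    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"Point({self.x})"

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x

    def __radd__(self, other):
        return Point(other + self.x)


p = Point(1)
print(str(p))         # Point(1): str() falls back on __repr__
print(p != Point(2))  # True: != inverts the result of __eq__
print(1 + p)          # Point(2): int.__add__ gives up, so __radd__ runs
```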

No, but then I am something of a mathematician of sorts, so it wouldn’t surprise me for multiplication to fall back on addition. That is the most natural thing in the world.

I beg to differ. It is not surprising, because it does intuitively follow from getting an item.

The most simple, natural, intuitive form of iteration is:

  • get the first item (index 0 in Python);

  • get the second item;

  • get the third item;

  • get the fourth item;

etc, halting when there are no more items.

It is a critical distinction with obvious consequences, not the least of which is that testing for iteration by looking only for an __iter__ dunder is not sufficient.

Iteration is not controlled only by the presence of an __iter__ dunder. If you thought it was, you were wrong. It is as simple as that.

You lacked knowledge about the design of Python. Your lack of knowledge doesn’t mean that the interpreter is wrong, or that we should break working code to bring Python back into line with your incorrect assumption about iteration. It just means you lacked knowledge.

Now you know better. Congratulations. You are a more knowledgeable Python programmer today than you were two days ago.

In a practical sense, it is highly unlikely that deprecation would make it into 3.11, so you shouldn’t expect it before 3.12. At which point it will likely be a silent deprecation. Unless you run Python with all silent warnings enabled, which hardly anyone does, you probably wouldn’t see the warning until 3.13 or 3.14.

Point of note: collections.Counter() behaves as if every possible element is in it, yet it isn’t infinitely iterable. This isn’t a problem, since it has a different definition of iterability, but it goes to show that synthesizing results in response to __getitem__ doesn’t mean they have to be iterated over.
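In code, the Counter behaviour looks like this:

```python
from collections import Counter

c = Counter("aab")
print(c["a"])          # 2
print(c["missing"])    # 0: any key can be looked up
print(list(c))         # ['a', 'b']: only stored keys are iterated
print("missing" in c)  # False: the lookup above did not insert the key
```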

Steve, I know how you adore a good argument, but just because you can catch the OP on a few technicalities (like talking about an implicit __iter__ function, when really what’s under discussion is iterability in general - yes, congratulations, you found a technical error), don’t assume that the argument’s merits do not exist. It IS surprising that some forms of __getitem__ will make an object iterable and others will not. For instance, this one will not:

>>> class X:
...     def __getitem__(self, item):
...             if isinstance(item, float): return item
...             raise KeyError
... 

But this one will:

>>> class X:
...     def __getitem__(self, item):
...             if isinstance(item, int): return item
...             raise KeyError
... 

Yes, it’s documented. It doesn’t mean it won’t be surprising.
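For the record, here is how the two variants behave when actually iterated, as far as I can tell. They are renamed `FloatKeys` and `IntKeys` (my names) so both fit in one snippet, and islice is used so the second one stops.

```python
from itertools import islice


class FloatKeys:
    def __getitem__(self, item):
        if isinstance(item, float):
            return item
        raise KeyError(item)


class IntKeys:
    def __getitem__(self, item):
        if isinstance(item, int):
            return item
        raise KeyError(item)


# iter() accepts both objects; the difference shows up on the first lookup.
try:
    list(FloatKeys())
except KeyError as exc:
    print("KeyError:", exc)        # the probe with index 0 fails

print(list(islice(IntKeys(), 3)))  # [0, 1, 2], and it would never stop
```

So one class blows up on the first item and the other iterates forever, purely because of which key type __getitem__ happens to accept.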

(And yes. Programmers most certainly ARE users, and the Principle of Least Astonishment absolutely DOES apply. I have had the unpleasant experience of working with a number of highly surprising APIs, and it is not something to wish on one’s worst enemy.)

(Unless your worst enemy is a self-aggrandized Wordpress “expert” who charges exorbitant rates for minimal work, in which case (a) they deserve everything that PHP can throw at them, and (b) you deserve a better enemy.)

Correct, this class is both lazy and infinite. However, this class is not a sequence. It is an infinite, lazy collection. You can not derive the ability to iterate over a collection from only the ability to get a value from a key/index.

The __getitem__ method intuitively seems like a method that ought to only provide the ability to get a value from a key/index, but in practice it also provides the ability to iterate over the collection by assuming that it is a sequence (unless you explicitly opt out by setting __iter__ manually).

You can tell that it assumes a sequence, because it iterates only over non-negative integers and checks for IndexError (instead of iterating over every possible value (which is impossible of course) and checking for LookupError).

Right. And I am arguing that one of these two protocols ought to be deprecated. That would make the concept of “iterable” correspond 1-to-1 with the __iter__ method.

It behaves like a sequence despite not being a Sequence, because that’s what the current language semantics entail (aka the current data model semantics assume that this class is a (lowercase s) sequence). And I am arguing that these semantics are bad and unintuitive.

Iteration is not controlled only by the presence of an __iter__ dunder. I knew that it wasn’t. I thought that it shouldn’t. “This is how it currently is” is not a good argument against changing things.

I find your approach to this discussion to be needlessly condescending. My argument is “this is how it currently is and I think that it is unintuitive and should be changed”, not “I don’t understand what is happening here, please help”.

See prior discussions regarding and related to this subject:

OK. This is a plausible suggestion. I suggest that we focus on this as the rest of the discussion is getting heated and frankly neither productive nor particularly interesting.

The fallback to __getitem__ with an index for iteration was from the days before the iterator protocol, and was included when that protocol was added for backward compatibility. However, it was not, and never has been, deprecated, so it is still a valid way of creating a class that you can iterate over (I’m not using the word “iterable” here because I don’t want to get sucked into the nitpicking debate).

It could be deprecated, but the benefit is small (it’s not that confusing, and there aren’t that many cases where it causes an issue, despite your comments - all your examples are based round a single class that you wrote, you haven’t provided any evidence that this is a widespread problem). And we have little or no evidence that the downside would be similarly small - no-one has surveyed how many classes still rely on this fallback, or how widely used they are.

So if you really want to push for deprecation, I think you need to focus on those practical points - cost and benefit - rather than debating theory and opinion, which as far as I can tell is what this discussion is tending towards.

But frankly, I think it’s a waste of time. You can fix your class with a one-line addition (__iter__ = None). If others have this issue, they can do so too. The time spent already on this discussion is far greater than the time it would take to simply fix the problem in your class. If it bothers you that much, add a comment to the class explaining how you wish you didn’t have to do this, but you must in order to work with (at least) Python 3.11 and earlier, and there’s no sign yet that Python is likely to change…

Fair point. Do we have a representative collection of python code that could be scraped for statistics like this? I could probably hack something together with BigQuery or the like, but I would prefer to avoid that, if we already have some tooling for this. I think I remember seeing somebody gathering similar stats on the bug tracker/discuss previously.

I actually already did that. :laughing:

Please remember that not all Python code is open source or publicly available.

The best we can do is find a lower bound on classes that use the sequence protocol.

It may or may not be representative of proprietary code, but there’s grep.app to search GitHub, which can be limited to Python files—of course, with any of this, you have to be able to form a query that will capture what you are looking for, which may not be entirely obvious here.

If it’s any solace, type checkers don’t recognise old-style iterables as compatible with Iterable. And in all the years of mypy, this missing support has only come up a couple of times, especially so once people mostly stopped using Python 2.

I’m aware of one popular library (torch.utils.data.Dataset) that relies on old style iteration for providing iterability. I tried to get them to use __iter__, but they claimed some users had use cases where it wasn’t actually an iterable. Of course, most code assumes that it is iterable (including plenty of code in torch), so I wasn’t sympathetic to that concern. But I didn’t feel like arguing the point :slight_smile:

Put me on team “deprecate old-style iteration protocol”, especially now that Python 2 is fully retired. We would not have added this today if we started with __iter__.

For those projects that depend on the old-style iterator, is copy-pasting this enough to fix them? Edit: Nope, see below

    def __iter__(self):
        from itertools import count
        for i in count():
            yield self[i]

You would also have to put the loop in a try: except: block and turn IndexError into StopIteration I think, so it has to be a bit longer.

If the current behavior stays then opting out of it is much easier (__iter__ = None as explained above). So current behavior is more convenient to the users too.

I think the boilerplate is this, because since Python 3.7 generators are supposed to return instead of raising StopIteration.

    def __iter__(self):
        from itertools import count
        try:
            for i in count():
                yield self[i]
        except IndexError:
            return
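Dropped into a class, that method could look like the sketch below. `SequenceIterMixin` and `Letters` are hypothetical names; the mixin is just one way to share the boilerplate across classes.

```python
from collections.abc import Iterable
from itertools import count


class SequenceIterMixin:
    """Hypothetical mixin: explicit __iter__ equivalent to the old fallback."""

    def __iter__(self):
        try:
            for i in count():
                yield self[i]
        except IndexError:
            return


class Letters(SequenceIterMixin):
    def __getitem__(self, index):
        return "abc"[index]   # raises IndexError past the end


letters = Letters()
print(list(letters))                  # ['a', 'b', 'c']
print("b" in letters)                 # True
print(isinstance(letters, Iterable))  # True, now that __iter__ is explicit
```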

I rather see the current implementation of the iteration protocol as unnecessarily complex, unexpected and confusing. I think that deprecation and later removal of the old protocol would contribute to making Python more accessible.
