Probably not a PEP: mutating a sequence during iteration is fuzzy

The docs no longer tell the truth about what happens when a mutable sequence changes length during iteration, and what actually happens now varies across the specific mutable sequence types.

There’s no guessing what “for the most part” might mean, but bytearray acts closest to what the docs probably intend:

>>> xs = bytearray([1, 2])
>>> xs.extend(xs)
>>> xs
bytearray(b'\x01\x02\x01\x02')
>>> xs = bytearray([1, 2])
>>> xs.extend(iter(xs))
>>> xs
bytearray(b'\x01\x02\x01\x02')

So in both cases the length of the bytearray is doubled. It’s very different if you use a list (or array.array) instead: extending a list with itself again doubles its length, but extending with iter(xs) goes on and on until you run out of RAM.
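The runaway list case can’t be run as-is (it never terminates), but a sketch with an artificial cap on the number of appends shows what’s going on:

```python
# Extending a list with itself doubles it, same as bytearray:
xs = [1, 2]
xs.extend(xs)
assert xs == [1, 2, 1, 2]

# But list.extend(iter(xs)) never ends: the iterator keeps finding
# the elements just appended. Simulating it by hand, with a cap:
xs = [1, 2]
it = iter(xs)
appended = 0
for elt in it:
    xs.append(elt)
    appended += 1
    if appended >= 10:  # without this cap, the loop runs forever
        break
print(len(xs))  # 12: the two originals plus ten appended copies
```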

That’s surprising by itself, and then it’s doubly surprising that bytearray acts differently than the original mutable sequence types (list and array.array).

The list behavior was certainly intended (I was there at the time :wink:), but the brief text about it (in the Reference manual’s explanation of the “for” statement) appears to have been purged when the iterator protocol was introduced, replaced with fuzzy hand-waving.

If it seems surprising that code could rely on length-changing mutations during iteration, this is a quite common way to do a breadth-first search:

for node in frontier:
    for s in successors_of(node):
        if is_interesting(s):
            frontier.append(s)

It inherently relies on iteration reliably discovering elements appended during iteration. Nothing wrong with it. And this spelling works regardless of whether “frontier” is a list or a bytearray. Capturing the length just once at the start is (seemingly) unique to bytearray.extend().
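A runnable version of the pattern, using a hypothetical toy graph and stand-ins for the successors_of and is_interesting names in the snippet above:

```python
# Toy adjacency dict; the names are stand-ins for the snippet above.
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [], 5: []}

def successors_of(node):
    return graph[node]

seen = {1}

def is_interesting(s):
    return s not in seen  # visit each node at most once

frontier = [1]
for node in frontier:  # the loop sees nodes appended below
    for s in successors_of(node):
        if is_interesting(s):
            seen.add(s)
            frontier.append(s)

print(frontier)  # [1, 2, 3, 4, 5] - breadth-first order
```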

For a concrete real example where it matters, this is an elegant way to construct all the divisors of an int n, given its prime factorization as a dict mapping primes to their exponents:

n = ∏ pᵢ^eᵢ
def divisors(pe):
    import math
    from itertools import islice
    # the number of divisors is the product of (exponent + 1)
    expect = math.prod(e + 1 for e in pe.values())
    ds = [1]
    for p, e in pe.items():
        # extend() consumes the generator lazily, so the islice also
        # sees divisors appended earlier in this same extend() call
        ds.extend(d * p for d in islice(ds, e * len(ds)))
    assert len(ds) == expect
    return ds

Then, e.g.,

>>> divisors({2: 2, 5: 2}) # divisors of 100
[1, 2, 4, 5, 10, 20, 25, 50, 100]

This is worth studying: it’s a remarkably easy way to construct the result, with no shenanigans involving exponentiation and/or itertools.product. Just one multiply per final divisor.

Initialize ds to bytearray([1]) instead, and it dies with an assertion error. Comment that out, and it constructs only the divisors 1, 2, 5, and 10. The islice argument is materialized before extend begins appending, not at all the intent.
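To see the contrast directly, here’s a sketch with a hypothetical seq_type parameter bolted onto divisors() (not part of the original function) so the two container types can be compared side by side:

```python
import math
from itertools import islice

def divisors(pe, seq_type=list):
    # seq_type is a hypothetical knob added just for this demo
    ds = seq_type([1])
    for p, e in pe.items():
        # list consumes this generator lazily; bytearray.extend()
        # materializes it in full before appending anything
        ds.extend(d * p for d in islice(ds, e * len(ds)))
    return list(ds)

print(divisors({2: 2, 5: 2}))             # [1, 2, 4, 5, 10, 20, 25, 50, 100]
print(divisors({2: 2, 5: 2}, bytearray))  # [1, 2, 5, 10] - most divisors lost
```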

Which is akin to how I stumbled into it, when changing some very old code to use bytearrays “to save RAM”.

I don’t believe any of this can be deduced from the current docs. But I don’t think anything about behavior can be changed now.

  • bytearray has acted this way since its introduction. I consider it a design error, but it’s too late to change now. Or do you think it could be changed?
  • list has acted this way forever, and mounds of code relies on it.
6 Likes

If an Implementation Detail is too cryptic, can we borrow that useful term from C, and officially note that mutating a sequence during iteration results in Undefined Behaviour?

3 Likes

I think it’s decades too late for that. New container types can do anything they like, but are obligated to document it (e.g., dicts document that keys should neither be added nor deleted during iteration, but that changing the value associated with a key is fine).

The behavior of the original mutable sequence types was wholly defined from the start, but the docs got lost.

There’s far too much code relying on that explanation to even dream of saying “na, just kidding - it’s actually undefined” now :wink:

It’s the source of, e.g., this ubiquitous advice (repeated all over the web):

Iterating in reverse is the classic and efficient way to modify a list in place using indices, as changes to the list length and element positions do not affect the indices of the remaining items you intend to process. 

That pattern is common as mud. It was intended behavior from the start. And used to be documented.
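A minimal instance of that advice:

```python
# Delete items in place while scanning indices in reverse; deletions
# never shift the positions we haven't examined yet.
xs = [3, -1, 4, -1, 5, -9, 2]
for i in reversed(range(len(xs))):
    if xs[i] < 0:
        del xs[i]
print(xs)  # [3, 4, 5, 2]
```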

5 Likes

FWIW, I think it’s a good idea to special case a.extend(iter(a)) for lists + list iterator because this is a C-level infinite loop which is quite annoying to interrupt (i.e. CTRL+C doesn’t work), and I don’t see a reason to ever execute it.

Then why don’t you just create a PR to update the docs? That seems reasonable to me, no matter what if any changes are done later on.


In general, by posting this in PEPs you are implying that you are making some kind of “Enhancement Proposal”, but I can’t read out of your post what you want. If this is just some discussion of the documentation, Documentation would be a better category. If this is just some musing on how Python works, Python Help is the correct category currently (although it’s misnamed and I wish the mods would change the status quo there).

1 Like

I don’t see how that could work. .extend() is passed an iterator in this case, and iterators in general support no way to know in advance how many objects they’ll deliver, or from where they came.

While bytearray is immune in this specific case, it’s dead easy to provoke its .extend() into uninterruptible consume-all-RAM behavior too:

>>> import itertools
>>> b = bytearray()
>>> b.extend(itertools.repeat(42)) # never completes

There is no “perfect” category, and I picked what I thought was best. You’re free to disagree, of course, but I won’t argue about it. Things that informed my decision:

  • I have no specific outcome I’m advocating for. It’s instead a space to explore.

  • A PEP may or may not come of it, depending on how it goes. My best guess about that was revealed in the topic title.

  • While doc changes are certainly needed (regardless of outcome), it’s premature before knowing how people land on all the intimately related issues (relative values of various kinds of backward compatibility and cross-type consistency).

  • Nothing here is “musing on how python works”. I know how it works, and that’s the problem: the docs don’t match its actual behavior. In such cases, either or both may need to change. Ordinarily, “original intent” weighs in too, but in this case bytearray is already well established - it deviated from “original intent” long ago, yet left the general mutable sequence docs in a state that seems to apply only to it in these cases. This isn’t a no-brainer to me.

2 Likes

That’s true in the general case, but it does seem solvable specifically for a given type.

Make a list iterator a special type, which indicates the list from which it was built, and then special case that type in extend.

I have no idea if the benefit is worth the cost in this case (including not only complexity but runtime cost in extend), but it does seem very much doable.
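As a rough pure-Python sketch of the idea (an assumption about one possible approach, not how CPython would actually do it - a real C-level check could compare the iterator’s underlying list to self, which Python code can’t see, so this coarser version just materializes any list iterator first):

```python
_list_iter = type(iter([]))

class GuardedList(list):
    """List whose extend() can't loop forever on iter(self)."""
    def extend(self, iterable):
        if isinstance(iterable, _list_iter):
            # Materializing the iterator doesn't mutate its source,
            # so this terminates even when it's iter(self). Note it
            # also changes mutation-visibility semantics for
            # iterators over *other* lists - part of the cost.
            iterable = list(iterable)
        super().extend(iterable)

g = GuardedList([1, 2])
g.extend(iter(g))   # with a plain list, this would never return
print(g)            # [1, 2, 1, 2]
```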

Wouldn’t it be fine to make this the general doc, but without changing any of the real and sometimes-documented behaviors of builtins?

Any code which recklessly modifies a set, for example, during iteration is asking for trouble.

As far as I’m concerned any language level doc that says that “in general, modifying a container type during iteration results in undefined behavior” merely documents the status quo. That specific types may document ways in which they define that behavior doesn’t change the fact that it’s not well defined for types like Sequence.


I favor moving this topic to Documentation. I think it makes more sense there, since this seems much more likely to focus on docs than implementation changes.

1 Like

Let’s please drop this distraction. It doesn’t matter at all to the topic. Yes, you could special-case list, and array.array, and any other type for which

x.extend(iter(x))

blows up. But nobody writes code like that. I gave it just to illustrate the discrepancy between how list and bytearray behave in the simplest possible way.

There are many ways to fool any mutable sequence type’s .extend() method into the same “uninterruptible consume-all-RAM” behavior - even bytearray.extend() (and I already showed code doing so). Any unbounded iterator of any kind suffices, and there is no general solution to that.

I don’t object to cautioning against mutating a thing during iteration. I object to the docs failing to note that this is not the case for list and array.array: their behavior under mutation is defined, was intentional, used to be documented, and is relied on by a whole lot of real-life code. I also object to bytearray having broken that model (but only in some cases), though I can live with the fact that it does. Probably too late to change.

Except that’s not a mutable sequence type, so isn’t relevant here. If they try anyway, this is most likely:

>>> s = set(range(3))
>>> for x in s:
...     s.remove(x)
Traceback (most recent call last):
    ...
    for x in s:
             ^
RuntimeError: Set changed size during iteration

It’s possible that will always raise (and similarly for dicts and deques), but things have changed here over the years and they’re not the topic (so I’m not going to burn time looking into their current states). I have no objection to them raising.

Lots of container types try to complain about mutation during iteration. But there’s a reason the mutable sequence types don’t: their behavior is defined, and in use (although the docs no longer admit to it - “why not?” I can’t guess). But if bytearray intentionally doesn’t want to play along, it too should raise if it detects mutation during iteration. It’s in a world of its own now.

Yes, but with respect to container types that aren’t mutable sequence types. List is the original such thing, and nothing about it was accidental. It’s no accident that, e.g., the divisors() function I showed before works the same way under PyPy.

2 Likes

It appears you got confused by me using the word “musing”, but yes, that is exactly what you are doing: aimless discussion.

I am going to bow out, and unless Tim actually decides to take a position here I don’t expect anything to come of this discussion which is also off topic for this category. I don’t care enough about this topic to discuss it with Tim.

Since you explicitly refer to C: Undefined Behavior is a property of the whole program. Already the compiler is allowed to do essentially anything: it can reject the code, the program can crash, or the compiler can remove code paths based on detecting UB. This may lead to one of those backwards-in-time bugs.

For a high-level language like Python, a more appropriate concept would be what C calls “unspecified behavior” (“behavior upon which this document provides two or more possibilities and imposes no further requirements on which is chosen in any instance”). Specifically, an iterator might return a reference to an object or raise an exception, like StopIteration.

If someone uses documented behavior, I would strongly oppose the term “reckless”.

I would even say that such changes to the documentation are breaking changes that need a news entry since - in essence - they deprecate documented behavior (deprecation without scheduling a removal).

2 Likes

I wouldn’t say lost, rather moved/rewritten:

Improve SeqIter documentation

1 Like

Thanks! Missed that, and so did bots. To spare readers clicks:

That’s a (too) telegraphic account of how things actually work in a “for” loop, with some holes. Notably, a collections.deque claims to be an instance of collections.abc.Sequence, but raises RuntimeError if a loop changes its length while iterating over it. deque objects also don’t support slice notation (which the table in that section claims Sequence objects support).

But that’s not the focus here. bytearrays too appear to follow the model in “for” loops.

My problem popped up in .extend(iterator), and I’m not sure the behavior of that has ever been documented properly. The docs today appear to tell the truth only about bytearray (the iterator is consumed in whole before appending begins), while for other mutable sequence types that support size-changing mutation during iteration, the iterator sees changes made by appending to the object as things go along.

IOW, seq.extend(iterator) works like so for the classic types:

for elt in iterator:
    seq.append(elt)

and so by the docs you found it follows that the newly appended elements are visible to the iterator. Since that’s “the obvious” meaning, I expect I just assumed it was guaranteed. Regardless, real code does rely on it.

But for bytearray it acts like this (a formulation that appears more than once in the docs, regardless of the specific sequence type):

seq[len(seq) : len(seq)] = iterable

Complicating it further, the word iterator above was intentional. For a sequence-type iterable, as shown at the start, the two differ: for a list L, L.extend(L) and L.extend(iter(L)) have radically different behavior (in effect, the former appends a copy, while the latter keeps on appending copies until memory is exhausted).

3 Likes