Copy a generator with state

About a week ago I asked in the help section of this forum how to copy a generator, hoping there might be some trick, shortcut, or hack to do it even if not "officially supported". There was no response, so I guess there is really nothing that can be done just yet.

I think this is a feature that could be very useful and has been requested, directly or indirectly, multiple times; there was even a deferred PEP in this direction more than 20 years ago (PEP 323 – Copyable Iterators | peps.python.org).

The lack of this feature produces multiple issues that may look unrelated but affect many Python users. For instance, generators can't be pickled and therefore cannot be stored or passed to a multiprocessing worker or any function that requires pickling. And of course there are also legitimate use cases for forking/bifurcating a generator that cannot be solved with itertools.tee.
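For concreteness, a minimal sketch of the failure on current CPython:

import pickle

def gen():
    yield 1

g = gen()
pickle.dumps(g)   # TypeError: cannot pickle 'generator' object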

There are a bunch of questions on this topic on Stack Overflow and other forums, with tens of thousands of views, which shows that there would be at least some demand for this feature.

Of course the implementation may be tricky, but I wanted to ask whether there would be some interest in supporting this feature.

Edit 1:

Simple yet realistic example where this feature is needed without involving multiprocessing:

def avg():
    # online average generator
    # in general this generator may come from a package (hard to modify)
    i = 1
    new = yield None
    while True:
        new += yield new / i
        i += 1

# keeps the current average of online given data without storing the data
t = avg()
next(t)
t.send(0) # 0/1 =0
t.send(100) # (0+100)/2 = 50

# Imagine that I got a provisional value and want a provisional result; this is where the need for a bifurcation arises.
t_copy = copy(t) # NOT SUPPORTED

# provisional value
t_copy.send(100)   # (100 + 100)/3 = 66.66666666666667

# after the real data arrives I need the definitive result.
# definitive
t.send(80)   # (100 + 80)/3 = 60; NOT happening today, as copy(t) is not supported

So far I haven't found any convincing solution to this problem that can be generalized even to a small subset of generators.

Edit 2:

As noted by @yoavdw, this feature already exists in Stackless Python, where you can copy a generator by pickling and unpickling it. It also used to exist in PyPy; the functionality was lost in a version change, but one of the maintainers recently wrote that they may look into supporting it again. Let's hope it is not a very complicated task and they succeed.

There are many issues with copying complex iterators; we are even considering deprecating the copying of some itertools iterators.
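For illustration: on CPython versions where itertools iterators still support copying via __reduce__ (a behavior deprecated since 3.12), the copy is an independent iterator:

import copy
import itertools

c1 = itertools.count(5)
next(c1)                    # 5
c2 = copy.copy(c1)          # snapshots the counter's current state
print(next(c1), next(c2))   # 6 6 -- advancing one does not affect the other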

Consider a generator created by the following generator function:

def g(a):
    for x in a:
        yield x

It is equivalent to the following code:

def g(a):
    _it = iter(a)
    while True:
        try:
            x = next(_it)
        except StopIteration:
            break
        yield x

The state of the generator is determined by the instruction counter, the frame stack, the value stack, and the local variables. How would you copy the value of _it? If it is the same object in the generator copy, the copy will share state with the original: advancing the iterator in one generator will affect the behavior of the other. If you make a shallow copy of _it, you should also make a shallow copy of a, because to the interpreter there is no difference between _it and a. And this can break the generator, because it may expect to iterate over the original a. For example, the following code depends on iterating the original a:

def f(x):  # stand-in; any function of x would do for this illustration
    return x - 1

a = [1]
for x in g(a):
    if x:
        a.append(f(x))
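In CPython you can see part of this state, but only part: gi_frame.f_locals exposes the named locals of a suspended generator, while the for loop's hidden iterator lives on the frame's value stack and is not reachable from Python code at all:

gen = g([1, 2, 3])
next(gen)                     # 1
print(gen.gi_frame.f_locals)  # {'a': [1, 2, 3], 'x': 1}; the hidden iterator is not visible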
6 Likes

Copying goes beyond what generators are designed for.

If something beyond calling the generator function twice, and beyond itertools.tee, is needed, make a class that implements the full iterator protocol (with __next__ and __iter__) and deal with copying instances of it, sub-iterators, and all the other edge cases as best suits your application.
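For instance, applied to the avg() generator from the opening post, such a class might look like this (a sketch with explicit state; __next__ is undefined until the first value arrives):

import copy

class RunningAvg:
    # iterator-protocol version of avg(): the state is two plain attributes,
    # so copying an instance is trivially safe
    def __init__(self):
        self.total = 0.0
        self.count = 0
    def __iter__(self):
        return self
    def __next__(self):
        return self.total / self.count  # current average so far
    def send(self, value):
        self.total += value
        self.count += 1
        return self.total / self.count

t = RunningAvg()
t.send(0)              # 0.0
t.send(100)            # 50.0
t_copy = copy.copy(t)  # the bifurcation the opening post asked for
t_copy.send(100)       # 66.66666666666667 (provisional)
t.send(80)             # 60.0 (definitive)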

5 Likes

If you want a generator that can be copied, you need to write your own class for it; the built-in simplified syntax is not capable of dealing with this and never will be, for various reasons.

Not everything is picklable/copyable in a reasonable way. The most famous examples of this are probably sockets and open files, where the state depends on external factors in such a way that copying it without being aware of what kind of object it is doesn't make sense.

Generators, in the sense of the iterators returned by generator functions, are in a similar boat: it is really problematic to copy one without being fully aware that it is a generator. So if you really need this, write your own code for it, solving the edge cases that apply to your situation.

2 Likes

Just for reference, I'm pretty sure PyPy and Stackless Python both support this feature.

(Not suggesting this makes it easy or trivial to implement in CPython, it’s probably a very different story, just thought it could be useful to someone reading this)

1 Like

PyPy cannot pickle generators:

>>>> import sys
>>>> sys.implementation.name
'pypy'
>>>> import pickle
>>>> def g():
....     yield from range(10)
....
>>>> gen = g()
>>>> next(gen)
0
>>>> next(gen)
1
>>>> pickled = pickle.dumps(gen)
pickle.PicklingError: Can't pickle <class 'generator'>: it's not found as builtins.generator
>>>>

I've removed everything but the actual exception raised from the traceback, as it's not relevant. The issues above are extremely pertinent: generators are stateful, and you can't really make a copy of them safely without making assumptions one way or another that aren't appropriate as general assumptions.

2 Likes

Just checked, and as mentioned by @yoavdw, Stackless Python does the job quite well. Of course, when I said copying a generator with state, I meant a deep copy, i.e. recursive, so ranges or other nested generators should be included.

Stackless Python has no issue here, as you can see in this script, which bifurcates a generator without using a deque the way itertools.tee does:

import stackless
import pickle

a = (i for i in range(10))
print("Before bifurcation:", next(a))

pic = pickle.dumps(a)
bifurcation_1 = pickle.loads(pic)
print("bifurcation_1:", next(bifurcation_1))

bifurcation_2 = pickle.loads(pic)
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))

print("bifurcation_1:", next(bifurcation_1))

You can check it here: Try It Online. The trick is to import stackless; otherwise it won't work.

Also, I found a partial implementation that claims to be a way to implement this feature in normal Python (python - Copy a generator - Stack Overflow), but I could not evaluate how far it is from being a good solution.

I unfortunately don't yet have the knowledge to understand the limitations of the current interpreter compared with the Stackless one, but I think other questions, like how deep the deep copy should be, can be answered just fine for most if not all cases.

I think it should work with some interpreter build options; I will try it later and post an update.

@yoavdw I think you were right about PyPy as well; did you mean this flag, maybe?

https://doc.pypy.org/en/latest/config/objspace.usemodules._pickle_support.html

Having it work in PyPy would already be a major step forward, as it has somewhat better adoption than Stackless Python, which looks almost abandoned.

That’s the exact flag, yeah!

After a few days I finally got around to it and got it to work!

Unfortunately, I only managed to get it to work on PyPy for 2.7, and not on the 3.10 version (pickling generators didn't work there, even with the flag).

Built on branch main with the command: pypy ../../rpython/bin/rpython --opt=2 targetpypystandalone.py --withmod-_pickle_support

@mikeshardmind Tried out the closest thing I could to your example in Python 2.7, and it works:

>>>> import platform
>>>> platform.python_implementation()
'PyPy'
>>>> import pickle
>>>> def g():
....     for i in xrange(10):
....         yield i
....
>>>> gen = g()
>>>> next(gen)
0
>>>> next(gen)
1
>>>> pickled = pickle.dumps(gen)
>>>> new_gen = pickle.loads(pickled)
>>>> next(new_gen)
2
>>>> next(gen)
2
>>>> next(new_gen)
3
>>>> next(new_gen)
4
>>>> next(gen)
3
>>>> another_new_gen = pickle.loads(pickled)
>>>> next(another_new_gen)
2
>>>>

That’s very cool to me!

1 Like

It doesn't seem to work on newer versions of PyPy, and these lines right here are an example of what I wouldn't want to happen. I shouldn't need to go out of my way to ensure a generator can't be copied, but anyone who wants to can implement an iterator that is picklable, when it is actually safe to copy:

class Foo:
    def __init__(self, ...):
        ...
    def __next__(self) -> ...:
        ...
    def __iter__(self) -> ...:
        return self
    def __getstate__(self) -> ...:
        ...
    def __setstate__(self) -> ...:
        ...

Those are the things that must be implemented.
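Filled in concretely, a minimal hypothetical example of that shape:

import pickle

class Countdown:
    # explicit state makes pickling (and therefore copying) well defined
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n
    def __getstate__(self):
        return {"n": self.n}
    def __setstate__(self, state):
        self.n = state["n"]

it = Countdown(5)
next(it)                                # 4
clone = pickle.loads(pickle.dumps(it))  # an independent bifurcation
next(clone)                             # 3
next(it)                                # 3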

1 Like

For me, these lines are precisely the ones I would like to see working; otherwise the copy won't be a bifurcation of the generator and isn't a deep copy.

This is especially important for generators driven with send() that may receive different input data after the bifurcation happens.

A possibility that would satisfy both is that copy would not copy internal generators while deepcopy would; pickle, in my opinion, should copy them, since it allows continuing the run elsewhere (even on another machine, with multiprocessing, etc.).

The problem with defining an Iterator that is "picklable" as you show is that it is just a different thing.
How does this work when your custom generator depends on pre-existing normal generators, e.g. when you use yield from, and so on?

Now try to copy a generator from my example, which iterates an external list.

Sure, it should just copy the list as well, as is already the case in Stackless. Feel free to check it out here:

https://tio.run/##zVBLDoIwEN3PKSasaFJNAHVB4sZrGEMqFG2EtqElgdNjGwQhcaE7Z/fevM9kdG/vSiYbY1n@qLgxwyBqrRqLMwMvQguPARge8RzRmCZ0R/f0cAEoeIm3kJEU0E2pGuxQSGQj9tMLXhXYAViVXUXZNjmz3AV5F4BuhLRhcOLOynHaCyXTgKLknQ1XNuItInf28aZt0dbarDQEFilZ9JZWihUmdIBMrSvhXLhifeGSiL/Niz/mxeRvxL89YRie

import stackless
import pickle

a = [1,2,3,4,5,6]

def g(a):
    for x in a:
        yield x

to_bifurcate = g(a)

print("Before bifurcation:", next(to_bifurcate ))

pic = pickle.dumps(to_bifurcate)
bifurcation_1 = pickle.loads(pic)
print("bifurcation_1:", next(bifurcation_1))

bifurcation_2 = pickle.loads(pic)
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))
print("bifurcation_2:", next(bifurcation_2))

print("bifurcation_1:", next(bifurcation_1))

And when you modify the original list, it will only affect the original generator. This may not be what you need.

That is exactly what I would expect, and it is how it works in Stackless and PyPy.

The pickled and later unpickled generators can be executed in another process or even on a different machine. When doing a deepcopy/bifurcation of the generator, I don't expect them to keep any uncontrollable hidden connection to the original.

If that connection were somehow there, it would only persist if you unpickled the generator in the same process, not in another process or on another machine, which would lead to inconsistent behavior.

There are a lot of footguns in the idea of making all generators copyable. A lot of code is written with the assumption that generators can't be copied in Python; this includes things like generators that hold the state of a PRNG. There are also cases of generators that yield lines from an open file descriptor accessible only to one user on one machine, which inherently prevents pickle from being used to pass them between machines. Even if you think these semantics are better for you, it's not generalizable the way you want, and it is a recipe for problems for other existing code. You really should just pass what you need to create the generator with the appropriate state to whatever thread needs it.
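As a sketch of that last point, the avg() logic could be restructured so that the state (i, new), rather than the generator, is what gets passed to the worker (a hypothetical avg_from, not the thread's unmodifiable avg):

def avg_from(i=1, new=None):
    # same logic as avg(), but resumable from explicitly passed state
    if new is None:
        new = yield None
    while True:
        new += yield new / i
        i += 1

# state (i=2, new=100) travels to the worker as two plain numbers:
resumed = avg_from(2, 100)
next(resumed)        # 50.0 -- re-yields the current average
resumed.send(100)    # 66.66666666666667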

Something of note: the only reason Stackless supported this at all is its use of generators as coroutines and the pickling of their tasklets. Bifurcating generators was never the intended use in Stackless; moving them was.

In their own docs, they actually kill tasklets when pickling them before sending, so that it isn't a bifurcation.

We want to show that the old code cannot be resumed, and in order to do so, we destroy the tasklet it was running within.

You can also see that, even within Stackless's intended use, sometimes you'd still need to teach it how to pickle some state:

It should be possible to pickle any tasklets that you might want to. However, not all tasklets can be unpickled. One of the cases in which this is true, is where not all the functions called by the code within the tasklet are Python® functions. The Stackless-Python pickling mechanism has no ability to deal with C functions that may have been called.

and

If you pickle a tasklet, its Context won’t be pickled, because Context objects can’t be pickled. See PEP 567 for an explanation.

It is sometimes possible to enable pickling of Context objects in an application specific way (see for instance: copyreg.pickle() or pickle.Pickler.dispatch_table or pickle.Pickler.persistent_id). Such an application can set the pickle flag PICKLEFLAGS_PICKLE_CONTEXT to include the context in the pickled state of a tasklet.

Another option is to subclass tasklet and overload the methods tasklet.__reduce_ex__() and tasklet.__setstate__() to pickle the values of particular ContextVar objects together with the tasklet.

Reading the whole thing, what you mention is a comment in an example:

Destroy the tasklet:

t1.kill()

We want to show that the old code cannot be resumed, and in order to do so, we destroy the tasklet it was running within.

Which means that using t1.kill() shows that the old code cannot be resumed, not that tasklets cannot be resumed, which is the whole point of a tasklet… they can be resumed once, twice, many times, in this process, now, later, and whenever.

Also the first two paragraphs:

One of the most impressive features of Stackless-Python, is the ability to pickle tasklets. This allows you to take a tasklet mid-execution, serialise it to a chunk of data and then unserialise that data at a later point, creating a new tasklet from it that resumes where the last left off.

What makes this particularly impressive is the fact that the Python® pickle structure is platform independent. Code can for instance initially be run on a x86 Windows machine, then interrupted, pickled and sent over the network to be resumed on an ARM Linux machine.

This is also from the docs you sent, and it directly contradicts what you said here.

Of course, if the generator contains/uses a non-picklable item, it should not be picklable; that is not a surprise. The whole proposal is to make generators picklable, not every possible Python object. There is nothing new here: Python will fail to pickle any object containing an open socket.

I think the text you sent is clear: the context can't be pickled. We may want to pickle that as well, but that is another topic.

Also, I'm totally unaware of anything that relies on generators not being picklable, even a random number generator. In that case, if you pickled it and it relies on some entropy source, it would be fine, as the bifurcation would not give the same results. If it doesn't have an entropy source, then this feature may even be interesting, e.g. for producing the same random numbers across different instances, maybe for testing purposes.

At this point I don't see any footgun here; it is a feature that can be implemented, was successfully implemented in two implementations of Python, and would have many possible applications.

Realistic use case

I can see you thinking: why not just pass the arguments to the process? To be honest, that depends on how easy it is to reconstruct the generator from its state, and for my use case I don't know how to do it.

Just imagine one of the simplest possible examples: calculating an average.

def avg():
    # online average generator
    # in general this generator may come from a package (hard to modify)
    i = 1
    new = yield None
    while True:
        new += yield new / i
        i += 1

# keeps the current average of online given data without storing the data
t = avg()
next(t)
t.send(0) # 0/1 =0
t.send(100) # (0+100)/2 = 50

# Imagine that I got a provisional value and want a provisional result; this is where the need for a bifurcation arises.

# provisional value
t.send(100)   # (100 + 100)/3 = 66.66666666666667

# after the real data arrives I need the definitive result.
# definitive
t.send(80)   # (100 + 80)/3 = 60   NOT happening, as t is not bifurcated

Now, of course, I can come up with my own implementation of avg(), or manually extract the locals and compute it, etc., but that is precisely what I want to avoid, since avg() could be any function you can imagine from an external package, and I'm looking for a solution that can be generalized.

Maybe some of you have solved this issue before without copying the generator, but to be honest I had no clue how to do it, and I don't see why copying a generator like this one is remotely problematic.
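For reference, this is roughly what the manual extraction looks like for this one generator: CPython-specific (it reads gi_frame.f_locals) and tied to avg()'s internals, which is exactly the kind of per-generator code I'd like to avoid:

def clone_avg(gen):
    # snapshot the suspended generator's locals (a CPython detail)
    state = gen.gi_frame.f_locals
    i, new = state["i"], state["new"]
    # rebuild a fresh avg() and fast-forward it to the same state
    g2 = avg()
    next(g2)
    g2.send(new)             # sets the running sum; i is still 1
    for _ in range(i - 1):
        g2.send(0)           # advances i without changing the sum
    return g2

t = avg()
next(t)
t.send(0)            # 0.0
t.send(100)          # 50.0
t_copy = clone_avg(t)
t_copy.send(100)     # 66.66666666666667 (provisional)
t.send(80)           # 60.0 (definitive)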

They can be resumed any number of times, but the intended use in Stackless is moving the tasklets; this is why they are only copied while interrupted, and you are supposed to destroy the original when doing so, i.e. NOT bifurcation.

It does not contradict what was said, quoted, or the docs in their entirety.

You should take a look at how most randomness works in computing. There's generally initial seeding, possibly from a secure entropy source, and then a state machine that often receives no further entropy. If you want the same state each time for tests, a lot of random generators allow setting a seed; however, allowing copying can create unintentional issues. Some are relatively harmless, as where people have observed that all the characters' idle animations in some games are in sync. Some are much more harmful.
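With the stdlib random module, that duplicated-state hazard looks like this:

import random

r1 = random.Random(42)
r2 = random.Random()
r2.setstate(r1.getstate())   # duplicate the PRNG's internal state machine

print([r1.random() for _ in range(3)])
print([r2.random() for _ in range(3)])   # identical: the two streams are in sync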

There are also things like the copying of frame locals being unsafe with unique-ID generation methods that rely on a monotonically increasing number per time interval.

I'm not going to keep going with more examples; there are plenty more, and it's not something I care to argue any further.

You will simply get an error that the file descriptor is unpicklable. I don't see the problem; not every generator needs to be picklable, just any generator that can be.

Isn't this always the case when pickling? A deep copy. Affecting both generators would be a shallow copy. The same can be said about pickling a class that has a reference to an external list: the list would get pickled too.
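A quick illustration:

import pickle

class Holder:
    def __init__(self, data):
        self.data = data

shared = [1, 2, 3]
h = Holder(shared)
h2 = pickle.loads(pickle.dumps(h))   # the round trip duplicates the list too
h2.data.append(4)
print(shared)    # [1, 2, 3] -- the original list is untouched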

1 Like