There is a shared object that I need to pass to multiprocessing workers in a Pool. It has two properties: (1) it cannot be serialized, so it has to be initialized from scratch in each worker separately, and (2) its instantiation takes a very long time. I would like to instantiate the object once per worker and share it across all tasks that worker processes. I would like to minimize the intervention (i.e. not subclass Pool, etc.). What would be the best way to do it?
You can pass the `Pool` an `initializer` function and `initargs`, which will run at the beginning of the process. In the initializer you can create your object as a global, and then the worker functions can rely on it being available.
It’s a little clunky, and of course you have to remake the object for each new process. But that’s better than remaking it for each task!
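For concreteness, here is a minimal sketch of that approach; the `ExpensiveObject` class and its `config` argument are made-up placeholders for whatever the real slow-to-build, unpicklable object is:

```python
from multiprocessing import Pool

class ExpensiveObject:
    """Hypothetical stand-in for the slow, unpicklable object."""
    def __init__(self, config):
        self.config = config  # imagine a long setup happening here

    def compute(self, x):
        return x * 2

_obj = None  # module global, filled in once per worker process

def init_worker(config):
    # Runs once per worker process (not per task), via initializer=.
    global _obj
    _obj = ExpensiveObject(config)

def work(x):
    # Every task handled by this worker reuses the same _obj.
    return _obj.compute(x)

if __name__ == "__main__":
    with Pool(processes=4, initializer=init_worker, initargs=("some config",)) as pool:
        print(pool.map(work, range(10)))
```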
If you feel really adventurous, you could try the free-threaded build of Python 3.13 instead…then you wouldn’t need to serialize the object.
Where would I store the object that I created with `initializer`? I mean: the closest I got to resolving this is indeed declaring `global obj` and then initializing it into this variable, but I am not sure this even works (I did not try). There is an obvious ambiguity regarding whether both `initializer` and `worker` are de-serialized and reconstructed with the same globals. Is this documented?
Unfortunately, the thing does not work with threads, which is why I need processes.
It’s a module global. You build your object that way, and nothing more needs to be done to “store” it.
Start trying! This is easier to get working than it is to explain.
Your mental model seems off base here. This has nothing to do with serialization or “reconstructing”. Each worker process builds its own global object(s), from scratch, via the initialization function called once per worker process (not per task!) by the `Pool()` constructor. The module globals bound by the initialization function retain their bindings across tasks, for the lifetime of the worker process.
Probably not to the level of detail you want.
Alternatively, on a system that supports `fork()` (not Windows), you can build objects once in the main process before invoking `Pool()`, and worker processes will inherit copy-on-write clones of the module globals at the time `Pool()` is invoked.
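A sketch of that fork-based variant, again with a made-up `ExpensiveObject`; it relies on the “fork” start method, which is the default on Linux:

```python
from multiprocessing import Pool

class ExpensiveObject:
    """Hypothetical stand-in for the slow-to-build, unpicklable object."""
    def compute(self, x):
        return x * 2

# Built once in the main process, before Pool() is created.
OBJ = ExpensiveObject()

def work(x):
    # Under fork, each worker inherits a copy-on-write clone of OBJ.
    return OBJ.compute(x)

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(work, range(10)))
```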
I ended up with this one, which works as expected and does not use `initializer` at all.
```python
from multiprocessing import Pool
from functools import cache

@cache
def init():
    print("init()")
    return 1

def worker(b):
    return init() * b

with Pool(3) as pool:
    print(sum(pool.map(worker, [1] * 42)))
```
I guess `initializer` is not needed if you store it in globals anyway…
Yes, that’s the solution that @tim.one suggests in the post above. Note that it will only work on an OS that uses `fork` to make new processes (e.g. Linux).
Because the OP is missing an `if __name__ == "__main__":` guard, it will die horribly on Windows (each worker process will attempt to create 3 new workers of its own, etc.).
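For reference, this is what the snippet looks like with the guard in place; with it, the script also starts cleanly under the “spawn” start method (e.g. on Windows), because the re-imported module skips the Pool-creating block in each worker:

```python
from multiprocessing import Pool
from functools import cache

@cache
def init():
    print("init()")
    return 1

def worker(b):
    return init() * b

if __name__ == "__main__":
    # Only the main process reaches this block; spawned workers merely
    # import the module, so they don't try to create pools of their own.
    with Pool(3) as pool:
        print(sum(pool.map(worker, [1] * 42)))
```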
So, ya, they’re not running on Windows.
But what they’re intending to do doesn’t actually depend on `fork()` at all. On Windows, each worker will create its own `init()` function with an empty cache, and will populate the cache the first time the worker calls it.
That will work fine, but is needlessly convoluted. They could just as well have done, e.g.,
```python
init = 1

def worker(b):
    return init * b
```
Which is another approach: the module can just compute what it needs at the time it’s imported. That works on Windows too, and if that’s all they want, there’s no need to tell `Pool()` to call an initializer.