Pattern for sharing large (Python) data structures between processes

I have a library which provides a whole set of callables implemented as classes for processing data. The user can combine those callables into a pipeline so that data gets processed by one callable after the other.

Some of these callables implement things like huge lookup-dictionaries, tries or other datastructures (implemented in pure Python).

The aim is now to have several processes in a pool run such a pipeline in parallel.
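For concreteness, here is a minimal sketch of the kind of setup I mean (all names are invented, not the real library):

```python
# All names here are invented, just to show the shape of the library.

class LookupStep:
    """A callable step that carries a (potentially huge) lookup table."""

    def __init__(self):
        self.table = {"a": 1, "b": 2}  # imagine many gigabytes of dict/trie here

    def __call__(self, item):
        return self.table.get(item, item)

class Pipeline:
    """Runs each item through the steps, one callable after the other."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, item):
        for step in self.steps:
            item = step(item)
        return item

pipe = Pipeline(LookupStep())
print(pipe("a"))  # -> 1
```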

My questions: what would be the best approach to make sure that those huge datastructures used by some of the callables are not duplicated but shared between all those processes? Most of these datastructures are read-only, so a solution that works for read-only datastructures (portably on all OS) would already be useful.

I am coming from a Java background where this is trivial, because Java multiprocessing can easily share Java data between processes portably.

With Python this seems to be much more complex. On the one hand, process data gets shared when the fork method of creating processes is used, but only on *NIX, so this is not portable. On the other hand, there are things like multiprocessing.Value or SharedMemory, which do not work for arbitrary Python data structures. And there are still other options, like using a Manager.
As I understand it, these also require different code depending on whether we are in a multiprocessing situation or not, whereas in Java it does not matter: as long as data is only read, it can always be read with the same code as in a single-process situation, no matter what the data structure is.

So I would like to learn about some kind of pattern or approach that is maybe used with Python in such cases, such that I do not have to use some tailor-made different code for each callable but can have a generic way for how to do this for all callables that need multiprocessing access to those data structures.


I have a library which provides a whole set of callables implemented as
classes for processing data. The user can combine those callables into
a pipeline so that data gets processed by one callable after the other.

Some of these callables implement things like huge lookup-dictionaries, tries or other datastructures (implemented in pure Python).

The aim is now to have several processes in a pool run such a pipeline in parallel.

My questions: what would be the best approach to make sure that those huge datastructures used by some of the callables are not duplicated but shared between all those processes? Most of these datastructures are read-only, so a solution that works for read-only datastructures (portably on all OS) would already be useful.

I would use Threads. See the threading stdlib module. While that serves
your purposes, no funny stuff is needed because you’re all in the same
process anyway. You just need to orchestrate access to avoid races.

I am coming from a Java background where this is trivial, because Java multiprocessing can easily share Java data between processes portably.

And Java also has threads as standard.

With Python this seems to be much more complex. On the one hand, process data gets shared when the fork method of creating processes is used, but only on *NIX, so this is not portable. On the other hand, there are things like multiprocessing.Value or SharedMemory, which do not work for arbitrary Python data structures. And there are still other options, like using a Manager.

Once you start making multiple processes you need to either copy data
between them or have some shared files or memory. fork() is a lousy way
to arrange shared memory, even on UNIX.

Pickle is the usual way people package data structures for
serialisation/deserialisation, but that inherently copies stuff. Ok for
passing things through pipelines, but no good for big shared data
structures.

You can mmap files and put things in a shared file, also. Then attach as
needed to the file from different processes. That is supported on
Windows and UNIX (see the mmap stdlib module).

BUT…

You don’t put Python objects in there, because you’ve got distinct
interpreters, which manage references to the objects independently.

But you can store plain data in the files (eg dbm hash files, or big
arrays of numbers, what have you).
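A tiny sketch of that idea: one process writes fixed-width records to a file, and any process can then map it read-only and index into it without copying the whole thing into its own heap (the path and record layout are just for the demo):

```python
import mmap
import os
import struct

# One process writes fixed-width records to a file once...
PATH = "shared.dat"  # hypothetical path, just for the demo
values = [10, 20, 30]
with open(PATH, "wb") as f:
    for v in values:
        f.write(struct.pack("<q", v))  # little-endian 64-bit ints

# ...then any process can map the file read-only and read records in
# place; the OS shares the mapped pages between all attached processes.
with open(PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (second,) = struct.unpack_from("<q", mm, 1 * 8)  # record #1
    mm.close()
os.remove(PATH)
print(second)  # -> 20
```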

As I understand, these also require different code depending on whether we are in a multiprocessing situation or not whereas in Java, it does not matter as long is data as only read and it can always be read with the same code as in a single process situation, no matter what data structure this is.

So I would like to learn about some kind of pattern or approach that is maybe used with Python in such cases, such that I do not have to use some tailor-made different code for each callable but can have a generic way for how to do this for all callables that need multiprocessing access to those data structures.

Can you sketch how this might look in Java?

Cheers,
Cameron Simpson cs@cskk.id.au

There’s the standard library multiprocessing.Manager which allows you to create the most common container types to be shared between processes with minimal code. You should be able to back your data structures with dicts and lists created by these managers. I don’t think they’re particularly optimised for multiple readers, but they should be a good start.


I would use Threads. See the threading stdlib module. While that serves
your purposes,
I do not think threads serve my purpose, as threading only uses a single CPU. As far as I understand, threading would help me if one process needs to wait for several events (e.g. file IO) that would otherwise block the CPU from doing useful work.

My requirements are the opposite: I already have all the data (in memory, that big data structure I mentioned) and now I want to do a lot of processing distributed over all CPUs in the machine, where I need read-only access to that data structure. This can only be done with multiprocessing, not threading.

In Java, “threads” can utilize multiple cores and still share all the data structures in the VM so Java “threading” is really multiprocessing and entirely different from Python threads.

You don’t put python objects in there, because you’ve got distinct
interpreters, which manage references to the objects independently.

Yes, but I only need to read those objects. To make this clear: my datastructure can be e.g. a hash tree that needs 20G memory and I am running on a machine with 32 cores. The machine has 128G in total of which only about 6 are free. If each of the 32 cores would need its own copy of the 20G that would just not work.

This can be achieved without any problem at all in Java so I had been hoping there is some way to share the datastructure between Python processes as well.

There’s the standard library multiprocessing.Manager which allows you to create the most common container types to be shared between processes with minimal code. You should be able to back your data structures with dicts and lists created by these managers. I don’t think they’re particularly optimised for multiple readers, but they should be a good start.

Thank you, this sounds good! I have to confess I do not really understand how this works under the hood but it seems that the supported proxy objects can actually be nested.

My datastructure is so far implemented as a class, but for the purpose of creating an object shared via proxies, I guess it should be sufficient to treat the class instance like a Namespace instance and essentially have a local copy of the class methods operate on the proxy objects instead of on the original class attributes?

I’m not familiar with multiprocessing.Manager, but if this is the sort of thing you want to do and Manager turns out not to be the right tool for the job, you might consider Pyro5. This might also not be suitable; all this talk of proxy objects just reminded me of the package.

As for “patterns”: with this amount of data, consider using a database server of some kind instead of in-memory Python objects.

In Java, “threads” can utilize multiple cores and still share all the data
structures in the VM so Java “threading” is really multiprocessing and
entirely different from Python threads.

Multithreading: threads have shared access to the memory address space.
Multiprocessing: process boundaries segregate memory.

Due to the GIL, you are right in saying that multithreading doesn’t do what you
want in terms of maximising core utilisation. multiprocessing might.

Yes, but I only need to read those objects. To make this clear: my
datastructure can be e.g. a hash tree that needs 20G memory and I am running
on a machine with 32 cores. The machine has 128G in total of which only about
6 are free. If each of the 32 cores would need its own copy of the 20G that
would just not work.
This can be achieved without any problem at all in Java so I had been hoping
there is some way to share the datastructure between Python processes as
well.

I wonder whether one could leverage the fact that a fork() under Linux gives
you Copy-on-Write (COW) memory pages and if you are just reading from your data
structures in subprocesses, you could potentially “share data” this way.

I haven’t tried it locally as I am sure others will know off the top of their
heads: given X swap capacity, when would the kernel start OOM killing?

Threads in Python are real system threads, the OS will distribute out threads across multiple CPU cores when it needs to. The issue is that only one thread (in a given process) at a time can “run Python code” (due to the global interpreter lock, aka GIL), which means C code (eg NumPy, TensorFlow, carefully-written Cython) and file and network IO will allow other threads to run.

I don’t know specifically about Java, but in general you can have multiple threads of execution (process-managed (eg greenlet) or OS-managed (eg threading)) in a given process, and you can have multiple processes spawned/forked from the same original process.

CPython specifically, but also Python in general, is mostly designed around simple usage of data structures in a single thread. The concurrency and parallelism in the standard library are either based on C libraries or recently added.

Python is mostly built around functional interfaces. You can completely re-implement dict with your own class, but have it send values over pipes (local or network) on store/get; this is essentially what multiprocessing.Manager.dict is.

You could replace your attributes with property getters and setters which set values in a multiprocessing.Manager.dict

This is a known caveat of Python. I’ve encountered similar problems in the past, where I needed to parallelize the processing of many protobuf objects. Since this is CPU-intensive, I wanted to use multiprocessing. I looked at all the options, but essentially there’s no good way for this type of usage – in Python, if you want to share objects between processes, these objects have to be picklable. This includes the aforementioned multiprocessing.Manager. Since many types are not picklable, there’s no solution that is general enough for all use cases.

If you know that your data is always going to be picklable, then those solutions should apply. But still, you suffer from the overhead of pickling/unpickling and passing objects around. Overall, parallelizing the processing of in-memory data is a known hard problem in Python (if at all possible). Probably the best option is to design the whole program around parallelism from the start. For example, if you can split the input at the beginning and let the pieces be handled in different processes, it’s going to be much more efficient than dealing with an existing large list after those objects have been generated. Again, this requires extra-careful design that is not needed in other programming languages like C++/Java. Since CPython chooses to have a GIL, this is the price we have to pay.