Pattern for sharing large (Python) data structures between processes

I have a library which provides a whole set of callables implemented as classes for processing data. The user can combine those callables into a pipeline so that data gets processed by one callable after the other.

Some of these callables implement things like huge lookup-dictionaries, tries or other datastructures (implemented in pure Python).

The aim is now to have several processes in a pool run such a pipeline in parallel.
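For concreteness, here is a minimal sketch of the kind of setup I mean (all names are invented, not the real library):

```python
# All names here are invented, just to show the shape of the library.

class LookupStep:
    """A callable step that carries a (potentially huge) lookup table."""

    def __init__(self):
        self.table = {"a": 1, "b": 2}  # imagine many gigabytes of dict/trie here

    def __call__(self, item):
        return self.table.get(item, item)

class Pipeline:
    """Runs each item through the steps, one callable after the other."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, item):
        for step in self.steps:
            item = step(item)
        return item

pipe = Pipeline(LookupStep())
print(pipe("a"))  # -> 1
```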

My questions: what would be the best approach to make sure that those huge datastructures used by some of the callables are not duplicated but shared between all those processes? Most of these datastructures are read-only, so a solution that works for read-only datastructures (portably on all OS) would already be useful.

I am coming from a Java background where this is trivial, because Java multiprocessing can easily share Java data between processes portably.

With Python this seems to be much more complex. On the one hand, process data gets shared when the fork method of creating processes is used, but only on *NIX, so this is not portable. On the other hand, there are things like multiprocessing.Value or SharedMemory, which do not work for arbitrary Python data structures. And there are still other options, like using a Manager.
As I understand it, these also require different code depending on whether we are in a multiprocessing situation or not, whereas in Java it does not matter: as long as data is only read, it can always be read with the same code as in a single-process situation, no matter what the data structure is.

So I would like to learn about some kind of pattern or approach that is maybe used with Python in such cases, such that I do not have to use some tailor-made different code for each callable but can have a generic way for how to do this for all callables that need multiprocessing access to those data structures.


I have a library which provides a whole set of callables implemented as
classes for processing data. The user can combine those callables into
a pipeline so that data gets processed by one callable after the other.

Some of these callables implement things like huge lookup-dictionaries, tries or other datastructures (implemented in pure Python).

The aim is now to have several processes in a pool run such a pipeline in parallel.

My questions: what would be the best approach to make sure that those huge datastructures used by some of the callables are not duplicated but shared between all those processes? Most of these datastructures are read-only, so a solution that works for read-only datastructures (portably on all OS) would already be useful.

I would use Threads. See the threading stdlib module. While that serves
your purposes, no funny stuff is needed because you’re all in the same
process anyway. You just need to orchestrate access to avoid races.

I am coming from a Java background where this is trivial, because Java multiprocessing can easily share Java data between processes portably.

And Java also has threads as standard.

With Python this seems to be much more complex. On the one hand, process data gets shared when the fork method of creating processes is used, but only on *NIX, so this is not portable. On the other hand, there are things like multiprocessing.Value or SharedMemory, which do not work for arbitrary Python data structures. And there are still other options, like using a Manager.

Once you start making multiple processes you need to either copy data
between them or have some shared files or memory. fork() is a lousy way
to arrange shared memory, even on UNIX.

Pickle is the usual way people package data structures for
serialisation/deserialisation, but that inherently copies stuff. Ok for
passing things through pipelines, but no good for big shared data
structures.

You can mmap files and put things in a shared file, also. Then attach as
needed to the file from different processes. That is supported on
Windows and UNIX (see the mmap stdlib module).

BUT…

You don’t put Python objects in there, because you’ve got distinct
interpreters, which manage references to the objects independently.

But you can store plain data in the files (eg dbm hash files, or big
arrays of numbers, what have you).
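A tiny sketch of that idea: one process writes fixed-width records to a file, and any process can then map it read-only and index into it without copying the whole thing into its own heap (the path and record layout are just for the demo):

```python
import mmap
import os
import struct

# One process writes fixed-width records to a file once...
PATH = "shared.dat"  # hypothetical path, just for the demo
values = [10, 20, 30]
with open(PATH, "wb") as f:
    for v in values:
        f.write(struct.pack("<q", v))  # little-endian 64-bit ints

# ...then any process can map the file read-only and read records in
# place; the OS shares the mapped pages between all attached processes.
with open(PATH, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (second,) = struct.unpack_from("<q", mm, 1 * 8)  # record #1
    mm.close()
os.remove(PATH)
print(second)  # -> 20
```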

As I understand, these also require different code depending on whether we are in a multiprocessing situation or not whereas in Java, it does not matter as long is data as only read and it can always be read with the same code as in a single process situation, no matter what data structure this is.

So I would like to learn about some kind of pattern or approach that is maybe used with Python in such cases, such that I do not have to use some tailor-made different code for each callable but can have a generic way for how to do this for all callables that need multiprocessing access to those data structures.

Can you sketch how this might look in Java?

Cheers,
Cameron Simpson cs@cskk.id.au

There’s the standard library multiprocessing.Manager which allows you to create the most common container types to be shared between processes with minimal code. You should be able to back your data structures with dicts and lists created by these managers. I don’t think they’re particularly optimised for multiple readers, but they should be a good start.


I would use Threads. See the threading stdlib module. While that serves
your purposes,
I do not think threads serve my purpose, as threading only uses a single CPU. As far as I understand, threading would help me if one process needs to wait for several events (e.g. file IO) that would otherwise block the CPU from doing useful work.

My requirements are the opposite: I already have all the data (in memory, that big data structure I mentioned) and now I want to do a lot of processing distributed over all CPUs in the machine, where I need read-only access to that data structure. This can only be done with multiprocessing, not threading.

In Java, “threads” can utilize multiple cores and still share all the data structures in the VM so Java “threading” is really multiprocessing and entirely different from Python threads.

You don’t put python objects in there, because you’ve got distinct
interpreters, which manage references to the objects independently.

Yes, but I only need to read those objects. To make this clear: my datastructure can be e.g. a hash tree that needs 20G memory and I am running on a machine with 32 cores. The machine has 128G in total of which only about 6 are free. If each of the 32 cores would need its own copy of the 20G that would just not work.

This can be achieved without any problem at all in Java so I had been hoping there is some way to share the datastructure between Python processes as well.

There’s the standard library multiprocessing.Manager which allows you to create the most common container types to be shared between processes with minimal code. You should be able to back your data structures with dicts and lists created by these managers. I don’t think they’re particularly optimised for multiple readers, but they should be a good start.

Thank you, this sounds good! I have to confess I do not really understand how this works under the hood but it seems that the supported proxy objects can actually be nested.

My datastructure is so far implemented as a class, but for the purpose of creating an object shared via proxies, I guess it should be sufficient to treat the class instance like a Namespace instance and essentially have a local copy of the class methods operate on the proxy objects instead of on the original class attributes?

I’m not familiar with multiprocessing.Manager, but if this is the sort of thing you want to do and Manager turns out not to be the right tool for the job, you might consider Pyro5. This might also not be suitable; all this talk of proxy objects just reminded me of the package.

As for “patterns”: with this amount of data, consider using a database server of some kind instead of in-memory Python objects.

In Java, “threads” can utilize multiple cores and still share all the data
structures in the VM so Java “threading” is really multiprocessing and
entirely different from Python threads.

Multithreading: threads have shared access to the memory address space.
Multiprocessing: process boundaries segregate memory.

Due to the GIL, you are right in saying that multithreading doesn’t do what you
want in terms of maximising core utilisation. multiprocessing might.

Yes, but I only need to read those objects. To make this clear: my
datastructure can be e.g. a hash tree that needs 20G memory and I am running
on a machine with 32 cores. The machine has 128G in total of which only about
6 are free. If each of the 32 cores would need its own copy of the 20G that
would just not work.
This can be achieved without any problem at all in Java so I had been hoping
there is some way to share the datastructure between Python processes as
well.

I wonder whether one could leverage the fact that a fork() under Linux gives
you Copy-on-Write (COW) memory pages and if you are just reading from your data
structures in subprocesses, you could potentially “share data” this way.

I haven’t tried it locally as I am sure others will know off the top of their
heads: given X swap capacity, when would the kernel start OOM killing?

Threads in Python are real system threads, the OS will distribute out threads across multiple CPU cores when it needs to. The issue is that only one thread (in a given process) at a time can “run Python code” (due to the global interpreter lock, aka GIL), which means C code (eg NumPy, TensorFlow, carefully-written Cython) and file and network IO will allow other threads to run.

I don’t know specifically about Java, but in general you can have multiple threads of execution (process-managed (eg greenlet) or OS-managed (eg threading)) in a given process, and you can have multiple processes spawned/forked from the same original process.

CPython specifically, but also Python in general, is mostly designed around simple usage of data structures in a single thread. The concurrency and parallelism in the standard library are either based on C libraries or recently added.

Python is mostly built around functional interfaces. You can completely re-implement dict with your own class, but have it send values over pipes (local or network) on store/get; this is essentially what multiprocessing.Manager.dict is.

You could replace your attributes with property getters and setters which set values in a multiprocessing.Manager.dict

This is a known caveat of Python. I’ve encountered similar problems in the past, where I needed to parallelize the processing of many protobuf objects. Since this is CPU-intensive, I wanted to use multiprocessing. I looked at all the options, but essentially there’s no good way for this type of usage – in Python, if you want to share objects between processes, these objects have to be picklable. This includes the aforementioned multiprocessing.Manager. Since many types are not picklable, there’s no solution that is general enough for all use cases.

If you know that your data is always going to be picklable, then those solutions should apply. But still, you suffer from the overhead of pickling/unpickling and passing objects around. Overall, parallelizing the processing of in-memory data is a known hard problem in Python (if at all possible). Probably the best option is to design the whole program around parallelism from the start. For example, if you can split the input at the beginning and let the pieces be handled in different processes, it’s going to be much more efficient than dealing with an existing large list after those objects have been generated. Again, this requires extra-careful design that is not needed in other programming languages like C++/Java. Since CPython chooses to have a GIL, this is the price we have to pay.