I have a library which provides a whole set of callables, implemented as classes, for processing data. The user can combine these callables into a pipeline so that data gets processed by one callable after the other.
Some of these callables use things like huge lookup dictionaries, tries, or other data structures (implemented in pure Python).
The aim is now to have several processes in a pool run such a pipeline in parallel.
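To make the setup concrete, here is a minimal sketch of what I mean (all class and function names are made up for illustration, not actual library code):

```python
from multiprocessing import Pool

class TokenLookup:
    """A callable holding a huge read-only lookup dictionary."""
    def __init__(self):
        # In the real library this table can be many gigabytes.
        self.table = {"a": 1, "b": 2}

    def __call__(self, item):
        return self.table.get(item, 0)

class Pipeline:
    """Feeds each item through one callable after the other."""
    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, item):
        for step in self.steps:
            item = step(item)
        return item

pipeline = Pipeline(TokenLookup())
sequential = [pipeline(x) for x in ["a", "b", "c"]]  # [1, 2, 0]

if __name__ == "__main__":
    # This is exactly where the problem arises: pool.map pickles the
    # pipeline, so every worker gets its own copy of the huge table.
    with Pool(2) as pool:
        parallel = pool.map(pipeline, ["a", "b", "c"])
    print(parallel)
```

So the naive approach duplicates the table once per worker process, which is what I want to avoid.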
My question: what would be the best approach to make sure that those huge data structures used by some of the callables are shared between all those processes rather than duplicated? Most of these data structures are read-only, so a solution that only works for read-only data structures (but portably across all operating systems) would already be useful.
I am coming from a Java background, where this is trivial: Java parallelism is thread-based, and threads of the same JVM portably share the same in-memory data.
With Python this seems to be much more complex. On the one hand, process data is shared when processes are created with the fork start method, but that method only exists on POSIX systems, so it is not portable. On the other hand, there are things like multiprocessing.Value or multiprocessing.shared_memory.SharedMemory, which do not work for arbitrary Python data structures. And there are still other ways, such as using a multiprocessing.Manager.
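For comparison, the fork-based pattern I have seen suggested looks roughly like this (a sketch with made-up names; it relies on copy-on-write inheritance of a module-level global, so it does not work on Windows):

```python
import multiprocessing as mp

_BIG_TABLE = None  # set in the parent before the pool is created

def lookup(key):
    # Each forked worker sees the parent's _BIG_TABLE without any
    # pickling: the pages are inherited copy-on-write from the parent.
    return _BIG_TABLE.get(key, 0)

def main():
    global _BIG_TABLE
    _BIG_TABLE = {"a": 1, "b": 2}  # imagine gigabytes here
    ctx = mp.get_context("fork")   # "fork" does not exist on Windows
    with ctx.Pool(2) as pool:
        return pool.map(lookup, ["a", "b", "c"])

if __name__ == "__main__":
    print(main())  # [1, 0 or 2, ...] -> here: [1, 2, 0]
```

This works nicely on Linux, but it is exactly the kind of platform-specific solution I would like to avoid.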
As I understand it, these approaches also require different code depending on whether we are in a multiprocessing situation or not, whereas in Java it does not matter: as long as data is only read, it can always be accessed with the same code as in a single-process situation, no matter what the data structure is.
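To illustrate the "different code" point with the Manager approach (again a sketch; lookup and demo are made-up names): the callable has to go through a proxy object instead of a plain dict, so the code for the multiprocessing case is not the same as for the single-process case.

```python
from multiprocessing import Manager, Pool

def lookup(args):
    proxy, key = args
    # Every access goes through IPC to the manager's server process:
    # different code (and much slower) than reading a plain dict.
    return proxy.get(key, 0)

def demo():
    with Manager() as mgr:
        shared = mgr.dict({"a": 1, "b": 2})  # a DictProxy, not a real dict
        with Pool(2) as pool:
            return pool.map(lookup, [(shared, "a"), (shared, "c")])

if __name__ == "__main__":
    print(demo())  # [1, 0]
```
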
So I would like to learn about some kind of pattern or approach that is perhaps used with Python in such cases, so that I do not have to write tailor-made code for each callable but can handle, in one generic way, all callables that need multiprocessing access to those data structures.