A lot of data science is being done in Python these days, some of it very compute-heavy, so I’m curious how people get a good REPL experience when the code needs to run on multiple machines. I see there are a bunch of frameworks like dask, ray, and spark, but based on the APIs Python provides, I don’t understand how they could support this very robustly. If you have a run_on_cluster(foo) call, a framework may be smart enough to serialize the code for foo (I understand pickle won’t do this, but dill and cloudpickle do), but I don’t see how it could reliably infer everything foo uses and make sure that state is also replicated. Is this just a limitation people live with, or is there some magic to figure out function dependencies?
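For concreteness, something like this is what I mean (run_on_cluster is made up here, just a local stand-in, and the numpy array plays the role of REPL state that foo depends on):

```python
import numpy as np

def run_on_cluster(fn, items):
    # local stand-in: a real framework would serialize fn, ship it to the
    # workers, and somehow also ship whatever state fn refers to
    return [fn(item) for item in items]

model = np.random.rand(4, 4)       # state built up interactively in the REPL

def foo(batch):
    return batch @ model           # foo silently depends on the global `model`

batches = [np.random.rand(4) for _ in range(3)]
results = run_on_cluster(foo, batches)
# On a real cluster, how would the framework know `model` has to travel with foo?
```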
If I want to be able to run a sequence of commands more than once, I write a script…
Instead of distributing “REPL state”, why not distribute a Jupyter notebook? They’re hugely popular with data scientists.
There isn’t really a difference between the REPL and the interpreter running a script. Serializers will inspect the function or object being serialized and construct global and local dictionaries for the objects it references. This is why having things like database cursors inside serialized functions doesn’t work.
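You can see both halves of that in a few lines, assuming cloudpickle is installed (the sqlite cursor just stands in for any live, process-local resource):

```python
import sqlite3
import cloudpickle

SCALE = 10                                   # plain global: picklable
cur = sqlite3.connect(":memory:").cursor()   # live resource: not picklable

def scaled(x):
    return x * SCALE

def query(x):
    return cur.execute("SELECT ?", (x,)).fetchone()

# cloudpickle captures the globals the function refers to, so SCALE travels
# with the function and it still works after a round trip
restored = cloudpickle.loads(cloudpickle.dumps(scaled))
print(restored(3))                           # -> 30

# the cursor is also a referenced global, but it can't be serialized
try:
    cloudpickle.dumps(query)
except Exception as e:                       # typically a TypeError
    print("can't ship query:", e)
```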
To learn more about how to inspect objects and functions for their dependencies, see: inspect — Inspect live objects — Python 3.13.0 documentation
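As a rough illustration using only the standard library, a function object already exposes most of the raw material a serializer has to work with:

```python
import inspect

BASE = 100

def make_adder(offset):
    def add(x):
        return x + offset + BASE
    return add

add5 = make_adder(5)

print(add5.__code__.co_names)        # global names the code refers to: ('BASE',)
print(inspect.getclosurevars(add5))  # nonlocals={'offset': 5}, globals={'BASE': 100}
```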
For example, here is dill’s detection module, which inspects functions to find the state they need from the interpreter: dill/dill/detect.py at master · uqfoundation/dill · GitHub
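A quick sketch of what those helpers report, assuming dill is installed:

```python
from dill import detect

FACTOR = 2

def double(x):
    return x * FACTOR

print(detect.globalvars(double))  # {'FACTOR': 2}
print(detect.freevars(double))    # {} (no closure variables)
```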
I’m not sure I follow exactly what you’re asking. Can you give an example of what you’d like to do?