User story to consider for dependency specifications: reproducible science

I had an interesting opportunity today to chat with some colleagues I don’t often see, and I ran some questions past them to see how PEP 722, PEP 723, and “packages which don’t produce a wheel” (packages which are always distributed and run as source trees) would play.
I’m still digesting their answers to those questions. But they shared something completely different which has some interesting interplay with these other areas of concern.

I’ll call this use case “reproducible scientific python”, and it goes something like this:

  • the user is a scientist/researcher of some kind, capable of writing python but by no means an expert in the language or packaging
  • their working environment is not a text file but rather a notebook (Jupyter, Google Colab, etc)
  • their dependencies are accumulated over the course of some research project by running !pip install ... inside the notebook (! shells out in Jupyter, so that’s just running an arbitrary pip command)
  • at a late stage in their project lifecycle, once their code is working on toy datasets and small examples, they wish to run their notebook “at scale” and potentially across a matrix of datasets [1]

At this relatively late point in a working python project, the user is suddenly exposed to a very different paradigm for talking and thinking about packages. Now packages aren’t things that are manually installed – they are version-specified, they can be listed in X, Y, or Z place, and they need to be written down all at once. That their python version could differ from the version on the HPC cluster where they want to run is usually news to these users, and not of the welcome variety.


My first reaction, which I’m betting some readers of this post will share, was that this is something for the notebook software to support.

However, here’s the natural follow up which leads me to think that it relates to core python packaging discussions:
What should the notebook software do?

I can come up with answers like “run pip freeze and call it a day”, but I lack conviction that this would let a notebook written on a macOS laptop lift and shift to a Linux cluster with any reasonable expectation of success.

Mostly, I wanted to share a use case which was not on my radar. It feels different from the examples and use cases I usually see used on this forum. Hopefully this helps to expand the view of what python users need, and therefore what kinds of solutions are appropriate. Or if not that, I hope I at least spun this user story into a decent yarn and held everyone’s interest for a minute. :wink:


  1. I have actually modified this part of the use case slightly to be easier to explain. The real project, which is what these colleagues of mine work on, is specific to ML models and has to do with building a library of models which can be shared. I’m not clear on absolutely all of the details. ↩︎

4 Likes

Take a look at conda-store.

It’s for exactly this purpose, providing a framework to keep track of (fully-specified, pinned) conda environments over time so that you can always know what packages you were and are working with.

It’s still under pretty active development. As far as I know, the user story for individuals working on their own machines isn’t fully fleshed out yet (the focus thus far has been on JupyterHub-like contexts), but it’s coming.

If I had to hazard a guess, the individual story will be mostly told by the JupyterLab extension.

It would represent a shift in toolbox for someone accustomed to managing packages with pip, but to my mind the benefits would be worth that learning curve.

2 Likes

I wasn’t aware of conda-store – I’ll read a bit and pass along that reference for sure! Thanks!

I believe that when I asked about whether or not these users were using conda, the answer was a very glib “if we’re lucky!” So I’m not sure if conda-store provides a complete answer. I’ll have to read about it to better understand.

1 Like

I’ve run into this issue (not doing ML though, so the need to interact with specific hardware may change what options are available to them). There are two options for them (I’ve done both with varying success):

  1. Use their favourite cloud provider. This gives (relatively) complete control over your environment (so you can set up the cloud to match your local system), but you need to be able to pay for it.
  2. Conform to what the HPC system provides, and change your development setup to match it as much as possible. Singularity/Apptainer is a possible middle ground (being HPC-focused container tooling), but you’ll want to make friends with the sysadmins/support staff so they can help you work out how best to align your requirements with their system.
1 Like

To me, PEP 722 seems like it would be very useful in the notebook context, and would feel familiar and natural.
Switching from !pip install ... commands to a simple declarative comment is a small step with no downside from that perspective.

Notebooks already have buttons like “Run Cell”, “Run Below” etc. above each cell (a line or small block of code).
I imagine they would quickly add an “Install Script Dependencies” button above PEP 722 comments (or just install them automatically on “Run Cell” if required), and maybe even things like “Lock Versions” that update the comment with the current versions frozen etc.
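
For instance, here is what a hypothetical “Lock Versions” action might do to a PEP 722 block in place (the pinned versions are made up for illustration). Before locking:

# Script Dependencies:
#     matplotlib
#     scikit-image

After clicking “Lock Versions”:

# Script Dependencies:
#     matplotlib==3.7.2
#     scikit-image==0.21.0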

None of this seems difficult (to implement, learn, …) since it is basically how things already work (via non-standard magic notebook commands). Standardizing it would make this notebook convenience more useful long-term and outside the initial notebook context. With e.g. VS Code the “notebook” can be a normal Python text file with # %% comments to denote cells. It’s amazing.

You can copy this snippet of Python code:

# %%
# Script Dependencies:
#     matplotlib
#     scikit-image
import matplotlib.pyplot as plt
import skimage
def show(*args): plt.axis('off') ; plt.imshow(*args)
# %%
cat = skimage.data.chelsea()
rocket = skimage.data.rocket()
show(cat)
# %% 
nightcat = skimage.exposure.match_histograms(cat, rocket, channel_axis=-1) ; show(nightcat)
# %%
magiccat = cat[:,:,1] > cat[:,:,0]
mask = skimage.morphology.isotropic_opening(magiccat, 1)
mask = skimage.morphology.isotropic_dilation(mask, 20)
mask = skimage.segmentation.chan_vese(skimage.img_as_float(cat[:,:,0] - cat[:,:,1]*0.5), init_level_set=mask, max_num_iter=20, lambda1=100)
mask = skimage.morphology.isotropic_dilation(mask, 2)
magiccat = cat.copy() ; magiccat[mask,2] = 1 - nightcat[mask,0] ; show(magiccat)

and paste it in your text editor and click “Run Below” and get the same view … if you have the same versions of the dependencies installed. With the frozen inline script dependencies there would be no “if”. :smile_cat:

1 Like

If the user depends on a library that compiles on one OS but not another, yet wants to run their Python code on both, then no packaging solution is going to fix that – unless I’m missing something?

It seems to me that packaging solutions can only address the situation where a user is trying to run in two different environments that are close enough to run the exact same high-level requirements, and can share a superset of constraints between the two environments.

I have come across similar situations in both my own work and assisting researchers in commercial spaces. This has led me to implement the following workflow (a concrete command sketch follows the list):

  1. Create a minimal set of requirements for your environments, but be specific about the versions
  2. Use pip freeze to create a lock-like file that will be fed to pip as a constraints file
  3. Create your environment on the other OS using the minimal requirements, with the lock-like file as constraints
  4. If that succeeds, run pip freeze to generate a lock-like file for this additional platform; going forward, use both lock-like files as constraints
  5. If it fails to resolve or doesn’t pass tests, identify the conflict and start over from step 1 with a constraints file that precludes the conflict
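
A minimal sketch of those steps as notebook commands (the file names are hypothetical; pip accepts -c/--constraint more than once):

# steps 1–2, on the original machine (e.g. the macOS laptop):
!pip install -r requirements.txt        # minimal, version-specific requirements
!pip freeze > constraints-macos.txt     # lock-like snapshot of what actually resolved

# steps 3–4, in a fresh environment on the other OS:
!pip install -r requirements.txt -c constraints-macos.txt
!pip freeze > constraints-linux.txt

# going forward, install with both lock-like files as constraints:
!pip install -r requirements.txt -c constraints-macos.txt -c constraints-linux.txt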

Building this machinery is, in general, beyond the capability of the user you described – and even with it, I just don’t think it’s feasible 100% of the time for a user to maintain reproducible environments consistently across multiple OSes.

I think some tooling can help with this, as previously mentioned, and I am working on a project myself (in very early stages) which I think can simplify these steps. But I think tooling beyond packaging is ultimately needed to solve this for any given project – for example, cloud tooling which will run your code on multiple OSes and resolve a minimal set of requirements.

4 Likes

Peter’s reply above is in line with my thoughts.

I greatly appreciate all of the thoughts and input on this topic, to the point that I’m going to send my coworkers a link to this thread for them to read, but I’d like not to get too caught up in trying to solve the use case in this discussion – except insofar as solutions are relevant to other packaging discussions.

What I’m thinking about is whether the current packaging landscape supports the tools which support such users.
What does Jupyter need or want in order to support environment and dependency management? How do we make sure that lessons learned in that context translate to raw python files and vice versa?

I think PEP 722 aligns with these kinds of users writing manual dependency lists. However, neither 722 nor 723 is particularly good at supporting a notebook which maintains that list for you, driven by a clicky installer of some kind.

I wonder if someone has looked at adding dependencies to the underlying JSON notebook format? Once that exists, then transforming between PEP 722 and the notebook could be a feature of tools which transform notebooks into scripts.
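
For illustration only – the dependencies key below is invented, not part of the current notebook format – such notebook-level metadata might look like:

{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "dependencies": ["matplotlib", "scikit-image"]
  },
  "cells": []
}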

1 Like

This user story is quite relevant to my own workflow, with the difference that I’m more comfortable with packaging and so I tend to make a package for my own code that I can reuse in the notebooks (which may or may not be versioned, depending on intended audience).

I don’t think PEPs 722/723 are relevant here except that they’re about dependencies–the specifics are all about formatting schemes that don’t make sense in a notebook setting. Of the two, 723 makes a little more sense because you could just have a separate cell that was TOML formatted, with metadata in it[1]. But I don’t know if I’d call that “PEP 723” or just a separate thing for notebooks.

I do sometimes need to share notebooks with others, and what I’ve done lately is to paste YAML with the conda env into a comment at the top. So embedding a block of TOML would be essentially the same, and if jupyter could actually create and install from that block that’s even better. Although honestly I’d still prefer a conda env because the requirements can be difficult to install with pip alone.
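
Concretely, something like this at the top of the notebook (the package list is just an example):

# name: analysis
# channels:
#   - conda-forge
# dependencies:
#   - python=3.11
#   - matplotlib
#   - scikit-image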


  1. this isn’t something it can do now, but it knows how to format TOML and could support this if it had a use ↩︎

2 Likes

For posterity, Conda used to support defining specifications in the notebook itself:

This was removed, although I could see a future in which custom environment resolvers could be added via plugins. I don’t know if anyone’s attempted that, though it probably would be a fair amount of work.

1 Like

FYI for those who haven’t seen, PEP 723 now defines metadata comment blocks that could in theory be supported by other tools. Currently I say that the types are standardized, but if anybody here would find it useful I don’t mind changing the text to allow for arbitrary block types.
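
For reference, a block of the standardized script type looks like this (the contents, including the Python version, just echo the earlier notebook example):

# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "matplotlib",
#     "scikit-image",
# ]
# ///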

1 Like

I really liked seeing your block type solution in the final version of 723. IMO we can work with the PEP as written if accepted, and it’s easy to open it up with more types, arbitrary types, or some extension space (X-, tool., etc) in the future if there’s demand.

I’m eager for us to have some embedded metadata spec – 723 or 722 – so that we can start seeing tools pick it up and run with it!

Some users operate very far from the space of package maintainers, but then need a kind of bridge into the ecosystem built for them, usually by dedicated engineers. Aligning the user’s data closer to something those engineers can consume easily will make this process faster, easier, and more reliable.
Eventually we may see an end state in which that engineering time isn’t needed, but I’m doubtful about that. (Docker is still not present on every desktop, etc.) In a way, this user story is equal parts about the end user and their supporting engineers.

1 Like

This is an interesting and important use case, and is similar to some situations I’ve been in myself.

It relates to an issue I mentioned in another thread, which is that the current Python packaging setup more or less requires thinking about packaging matters at a fairly early stage. It is not so easy to take a “bundle of code” (be that a notebook or a collection of scripts) and just make it distributable. Instead the code has to be organized in a specific way from the get-go, and if it isn’t, you have to go back and switch things around later, which can be a hassle.

This is certainly true. At a minimum, a required library may not be available on a different platform, which would preclude reproducing the whole environment. But what I think of as a good goal is if the user can at least be clearly told what’s going wrong. So if I go to “install” someone’s notebook or replicate their environment, and I get a message saying “this code requires blahlib, but no version for your platform was found”, that’s still a win. What we don’t want is a giant screen of cascading and confusing errors (e.g., because when it couldn’t find a version it tried to compile its own on the fly).

That’s an interesting idea. I’d think this could be connected to the Jupyter UI, which would also help the “non-developer-Python-user” community. For instance, Jupyter could have some kind of GUI that lets people search for and checkbox the libraries they need, recording them in JSON metadata kept in sync with a cell at the top of the notebook that imports everything. This wouldn’t handle every possible case (e.g., conditional imports), but might be helpful.

I think it sort of is and sort of isn’t.

As the original post in this thread mentioned, one of the issues with this notebook workflow is the gradual accumulation of dependencies, which only later are (or aren’t!) reviewed to get an understanding of what all is needed to run the code. My intuition is to say that the best way to make that easier is with code-analysis tools that actually parse the code and tell you which libraries are needed. This relieves the programmer/scientist of the burden of keeping some dependency metadata in sync with what’s actually imported.
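
As a sketch of what such a tool could start from with only the standard library (the genuinely hard part this skips is mapping import names like skimage to distribution names like scikit-image):

import ast

def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by the given source code."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules

print(imported_modules("import matplotlib.pyplot as plt\nimport skimage"))
# -> {'matplotlib', 'skimage'}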

I also still think that some of the problems here come from the Python import mechanism itself, in particular the difficulty of simply dropping a directory tree somewhere and saying “I want to be able to access everything in here with relative paths (including relative imports)”. This would make it easier for people to just send zip files around without having to ensure that everything is packaged up with a specific nice directory structure.
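
For what it’s worth, the common workaround today looks something like this at the top of an entry-point script (the helpers module is hypothetical):

import sys
from pathlib import Path

# Put this file's directory at the front of the import path, so sibling
# modules and subpackages in the unzipped tree can be imported directly.
sys.path.insert(0, str(Path(__file__).resolve().parent))

# import helpers  # hypothetical sibling module, now importable

It works, but it’s exactly the kind of incantation the user in this story shouldn’t need to know about.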