Structured, Exchangeable lock file format (requirements.txt 2.0?)

Maybe we need a zipapp that includes the interpreter itself, but how does one do that cross-platform?

Exactly. And doing it cross-platform is the heart of the problem. Briefcase is the best I’m aware of, but even that has limitations and problems, at least on Windows, where I don’t think they can be solved by simple bugfixes. (Aside: BeeWare already has a project categorization kind of like what I’ve been talking about.)

(I have many more thoughts on this topic, which I’ll save for another thread if/when we start one here. bpo-22213 is where we’re at right now.)

I feel that not all apps are created equal. On one hand we have tools like Black etc. (even pip) that are only useful as executables, and PyPI isn’t really the best way to distribute them, but on the other there are tools that sit in the middle, e.g. Pytest, that need to work both as a command and as a library (I assume most Pytest users import pytest; at least I do). Those are still best managed by the project-level lock file IMO.

Would it be a good idea to split standalone app discussions into a dedicated thread? I feel that it is an important topic (and want to participate in it as well), but not mutually exclusive with the lock file.

Yeah, definitely a separate topic to go deeper into. The relevance here was identifying what I consider a non-problem for lock files (having one for each app used by a project): in generalizing a format across tools, I don’t think we need to account for that.

I think there is a clear need that isn’t being met by requirements files alone, and I think that is demonstrated by tool adoption in Python and in other languages. Dependency resolution and environment reproduction, whether application-specific or used by a library developer to check that the specified constraints are even valid (or need to be narrowed), is obviously meeting some important needs, like avoiding conflicts and broken environments.

IMO the question is whether it’s important to standardize on lock files specifically, which I’m not personally convinced of, I guess. I don’t really have a problem with it, but I don’t see a pressing need either.

I think we’re talking past each other here. Let me expand a bit, to explain how I see these different pieces fitting together. (This is partly inspired by these notes that some of you have seen before.)

Let’s assume we have some concept of an “isolated environment”. You know, the kind of thing you can install stuff into, and then later run the stuff, and it doesn’t interfere with the rest of your system. Maybe it’s a virtualenv, maybe it’s a conda environment, maybe it’s, I don’t know, a docker container using apt to manage packages. Whatever. But let’s say we have a system for describing environments like this, what’s installed into them (packages and versions), commands to run using these environments, and ways to store these descriptions alongside a project, and a smooth user interface for all of that.

This is really useful to a whole set of different users:

  • It gives beginners a simple way to run their scripts, or the Python REPL, or Jupyter, in an environment that they can control, and where it’s easy to install third-party libraries like requests without the problems caused by sudo pip.
  • It gives application developers a way to describe their dev and production environments, share them across the team, share them with deployment services like Heroku, etc.
  • It gives library developers like me a way to describe different test environments, associated services like RTD, tooling that new contributors need, etc.

Notice that everything I’ve said is true regardless of how the applications are packaged – if I can download Black as a single-file standalone binary, then that’s great for a lot of reasons, but in this context I still want a tool that can pick the correct version of that standalone binary and drop it into an isolated environment. Also, everything I said so far applies the same regardless of what kind of environments we’re talking about, whether it’s virtualenvs or conda or whatever. Installing a specific version of gcc into an isolated environment? Sure, conceptually it makes total sense. (And conda users actually do this all the time.)

But then on top of this core idea of course any particular implementation has to make some choices, and these add extra constraints and complications.

Digression: The core difference between pip and conda is that pip knows how to talk about PyPI packages, and conda knows how to talk about conda packages. This sounds inane when I write it, but it’s actually a deep and subtle issue. They have two different namespaces; they use the same words to mean different things. To pip, the string "numpy" means “the package at https://pypi.org/project/numpy, and the equivalent in other channels that share the pypi namespace”. To conda, that same string "numpy" means “the package in the conda channels under the name numpy”. Which in this case is the same software, but our tooling doesn’t know that. Another example: to pip, the string "gulp" means “a decorator to make debugging easier”, and to conda it means a JavaScript build system. These incommensurable namespaces are why there’s no way for wheels to declare dependencies on conda packages, or vice versa, and why using pip and conda in the same environment screws everything up. Both sides find the resulting environment literally impossible to describe; they’re each missing some of the vocabulary they’d need.
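To make the collision concrete, here is a tiny illustration (the descriptions are paraphrased; nothing here is real index metadata):

```python
# Two namespaces that happen to reuse the same strings.
pypi_index = {
    "gulp": "a decorator to make debugging easier",
    "numpy": "the package at https://pypi.org/project/numpy",
}
conda_index = {
    "gulp": "a javascript build system",
    "numpy": "the conda channels' own build of the same software",
}

# A flat environment description drops the namespace, so it is ambiguous:
environment = ["numpy", "gulp"]  # which "gulp"? Neither tool can say.
```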

So back to the core idea of pinning and project-specific environments. One way to implement it would be to make our isolated environments virtualenvs. That’s the natural thing if your environment descriptions are written using the PyPI package namespace. And if you’re, say, developing a library to be uploaded to PyPI, then this is a very convenient namespace to use, because (1) your project’s own dependencies have to be expressed in this namespace, and you want to re-use them in the environment descriptions, and (2) it means you can easily talk about all the versions of all the packages on PyPI.

Another way to implement the core idea would be to make the isolated environments be conda environments. This would be super awkward for me, since I write libraries that get uploaded to PyPI, and so I’d have to hand-maintain some mapping between my PyPI-namespace dependencies and my conda-namespace dependencies. For our other hypothetical users though – the beginners, the application developers – it’s really going to depend on the specific user whether a virtualenv-based or conda-based approach is more useful. They have different sets of packages available, so it just depends on whether the particular packages that you happen to use are better supported by virtualenv or conda.

Now, the folks working on the tools that use the pypi namespace mostly don’t talk to the folks working on the tools that use the conda namespace. Which is unsurprising: in a very literal sense, the two sides don’t have a common language. So, by default, Conway’s law will kick in: the pypi namespace folks will implement a pinning/environment manager that uses the pypi namespace to describe environments, and that will certainly be a thing that helps a lot of us solve our problems. And the conda namespace folks will do whatever they decide to do, which will probably also help people solve slightly different problems. And that’s not a terrible outcome. More things to help people solve problems are good!

But… there’s also a third possibility we might want to think about. The “original sin” that causes all these problems is that PyPI and conda use different namespaces. What if we invented a new meta-namespace that included both? So e.g. "pypi:gulp" would mean “what pypi calls gulp”, and "conda:gulp" would mean “what conda calls gulp”, and now we can use both vocabularies at the same time without namespace collisions. And then:

  • We could describe the state of hybrid environments, on disk or in lock files: “the packages in this environment are: pypi:requests == 2.19.1, conda:python == 3.7.2, …”

  • A sufficiently clever package manager could do things like: when someone requests to install pypi:scikit-learn from the package source https://pypi.org, it downloads the wheel and discovers that it has metadata saying Install-Requires: numpy. Since this is in a wheel, our package manager knows that this really means pypi:numpy. Next it checks its package database, and sees that it already has a package called conda:numpy installed, and the conda:numpy package has some metadata saying that it Provides: pypi:numpy. Therefore, it concludes, conda:numpy can satisfy this wheel’s dependency. (A rough sketch of this lookup follows the list.)

  • We could add wheel platform tags for conda, e.g. cp37-cp37m-conda_linux_x86_64. And then since we know this wheel only applies to conda, it would be fine if its metadata included direct dependencies on packages in the conda: namespace, like Install-Requires: conda:openssl==1.1.1.
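Here’s a minimal sketch of how that Provides lookup could work, assuming a hypothetical "ns:name" requirement syntax and an invented installed-package database; none of this is an existing pip or conda API:

```python
# Hypothetical sketch of namespace-aware dependency satisfaction.
# The metadata fields and the "ns:name" syntax are invented here.

def parse(requirement):
    """Split "pypi:numpy" into ("pypi", "numpy"); bare names default to pypi."""
    ns, _, name = requirement.rpartition(":")
    return (ns or "pypi", name)

# Installed packages, keyed by (namespace, name).
installed = {
    ("conda", "numpy"): {"version": "1.16.2", "provides": [("pypi", "numpy")]},
    ("conda", "python"): {"version": "3.7.2", "provides": []},
}

def is_satisfied(requirement):
    """Satisfied by an exact namespaced match, or by any installed
    package whose Provides metadata lists the wanted name."""
    wanted = parse(requirement)
    return wanted in installed or any(
        wanted in meta["provides"] for meta in installed.values()
    )

# A wheel's dependency on "numpy" really means pypi:numpy, and
# conda:numpy provides it, so nothing new needs to be downloaded:
assert is_satisfied("numpy")
assert not is_satisfied("pypi:scikit-learn")
```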

CC: @pzwang

Fedora does that with RPM. I put some details in a new topic:

FYI I have not forgotten about this topic. It is pinned in a browser tab :slight_smile: Thank you very much for starting the discussion, and I owe you a thoughtful reply; but I am desperately firefighting a few things right now.

Did this ever happen? I feel like it did, but can’t find the thread now.

I’ve just hit this again, hard. VS Code has the immensely annoying default of expecting you to install pylint, black, etc. in every environment that you want to run code in. I know you can specify an explicit path to the tools, but to do that you need an exe somewhere, and that’s exactly the “standalone app discussions” issue that I’d like to discuss further. At the moment, I need to set up and manage some sort of “tools” virtualenv (or in practice a standalone copy of Python), and there’s nothing in the ecosystem to encourage the authors of tools like pylint and black to offer anything more user-friendly :frowning:

For what it’s worth, conda environment files are documented here

As someone who uses conda for both python and R dependencies I think it desperately needs the concept of namespaces.

There’s a proposal to add namespaces, but it hasn’t yet been a priority.

I don’t think the proposal there would implement the concept of a meta-namespace as you describe, but it might allow for the meta-namespace concept to be layered on top.

Hey Peter, any updates?

I’ve been dabbling with the namespace idea (not implementing it, but leaving room for future support in the lock file format), and noticed that Conda seems to lack documentation on exactly how packages are specified, how environment files are consumed, etc. There is documentation on how to use them, and Conda is effectively the reference implementation, but without proper documentation it’s really difficult for people trying to interoperate with Conda :frowning:

Was something like this what you were looking for?

https://conda.io/projects/conda-build/en/latest/concepts/package-anatomy.html

Or were you looking more for the way that individual requirements are specified?

https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html#build-version-spec

Env files are definitely under-documented. We’re very interested in improving that, and maybe this topic will produce a good standard to unify on. We’ll either be adopting whatever comes out of this discussion, or otherwise trying to unify our 3 (!) ways of specifying environments (conda’s lists of specs, conda-env’s YAML files, and anaconda-project’s YAML files).

I was looking for an overview of what exactly can go into an environment.yml file, something similar to this:

The best I can find currently is:

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually

But that only covers the general cases, and I can’t find, for example, what top-level sections are possible other than name, channels, and dependencies.

The version spec you linked was also quite helpful for understanding what can go into dependencies though, thanks. (Are there any other special dependency entries besides pip that can have nested requirements? Or can channels define that themselves, say to require an NPM install?)
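In case it helps anyone else, here is my current working understanding of the file, expressed as a parse-and-check sketch. The key list is only what I could piece together from the docs and from `conda env export` output, so treat it as an assumption rather than a spec:

```python
import yaml  # PyYAML

# Top-level keys I have been able to confirm from docs and examples;
# "prefix" shows up in `conda env export` output. A guess, not a spec.
KNOWN_KEYS = {"name", "channels", "dependencies", "prefix"}

ENV_FILE = """
name: stats
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - pip
  - pip:
      - requests==2.19.1
"""

env = yaml.safe_load(ENV_FILE)
unknown = set(env) - KNOWN_KEYS
if unknown:
    print("top-level keys I can't find documentation for:", unknown)
```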

I’ve been treading water with this lock file format (with accompanying JSON schema):

There are still several holes (e.g. VCS support and editable installs, though I don’t want to spec the latter until editable installs themselves are spec’d), but I feel the general structure is quite serviceable.

Some important characteristics (IMO), with a rough illustrative sketch after the list:

  • JSON format (more easily verified than requirements.txt, but still expressive enough, and parsable with built-in tools).
  • Dependency keys are decoupled from package names. This makes the structure much simpler, and solves a difficult problem in Pipfile (and Poetry too, if I’m not mistaken), where you cannot specify a package’s version based on environment markers (e.g. v1 on Windows, v2 otherwise).
  • Package information is not tied to a dependency entry. This (together with the previous point) leaves room for potential support for other package managers (e.g. Conda).
  • Keeps dependency information as a graph (so tools can install each package with --no-deps, but still know what depends on what without resolving).
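To illustrate what I mean by the decoupling, here is a hand-written sketch in the spirit of the draft; every field name below is made up for illustration and is not the actual schema:

```python
import json

# Illustrative only: dependency keys are arbitrary strings, decoupled
# from package names, so one package can appear twice under different
# environment markers, and package metadata lives in its own table.
lock = {
    "dependencies": {
        "foo-win": {
            "package": "foo",
            "version": "1.0",
            "marker": "sys_platform == 'win32'",
            "dependencies": [],          # edges of the dependency graph
        },
        "foo-posix": {
            "package": "foo",
            "version": "2.0",
            "marker": "sys_platform != 'win32'",
            "dependencies": [],
        },
        "myapp": {
            "package": "myapp",
            "version": "0.1",
            "dependencies": ["foo-win", "foo-posix"],
        },
    },
    # Separate package table, leaving room for non-PyPI sources:
    "packages": {
        "foo": {"source": "pypi"},
        "myapp": {"source": "pypi"},
    },
}
print(json.dumps(lock, indent=2))
```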

I hope this could be helpful if the topic is brought up during the mini-summit :slight_smile:

Apologies for not keeping up with the threads on here; it turns out that Discourse is one too many things to stay on top of. I look forward to discussing this at PyCon with some of you. @uranusjr, if there is anything in particular you want me to bring up, shoot me an email or a message and I can make sure to try and cover it.

If you are using pipx, pointing to the pipx-installed tools would probably be a good solution here.

One thing to bring up to make this a bit more topical: when Victor removed Trollius from PyPI he broke some projects, but if we had a lock file format that recorded where the files were on PyPI, then I believe people with a lock file could have continued to download the files and simply never noticed that the project was removed from the index.
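To sketch why: if the lock file records the exact artifact URL and hash, an installer can fetch and verify the file without consulting the index at all. This rests on the same assumption as above, that the artifact itself stays retrievable at its recorded URL; the URL and hash below are placeholders, not the real Trollius files:

```python
import hashlib
import urllib.request

# Placeholder entry; a real lock file would record the actual
# files.pythonhosted.org URL and sha256 digest for each artifact.
locked = {
    "url": "https://files.pythonhosted.org/packages/.../trollius-2.2.tar.gz",
    "sha256": "<recorded-at-lock-time>",
}

def fetch_locked(entry):
    """Download straight from the recorded URL and verify the digest.
    No index lookup happens, so index-side removal goes unnoticed."""
    data = urllib.request.urlopen(entry["url"]).read()
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise RuntimeError("hash mismatch; refusing to install")
    return data
```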

I am kind of torn on this. Recording the URL has obvious benefits to users, but as a maintainer I kind of want to keep the possibility of removing an artifact after it’s uploaded. Say I botched a wheel for one platform: now I can immediately kill it to limit the damage (and let users resort to installing from sdist). I’d be out of options if URLs were recorded in the lock file (releasing a new version doesn’t matter, since the package version is locked either way).

It’s probably possible to build some index features to fix this situation, but then we’d be better off just fixing the Trollius problem on the index side in the first place.
