Structured, Exchangeable lock file format (requirements.txt 2.0?)

Maybe we need a zipapp that includes the interpreter itself, but how does one do that cross-platform?

Exactly. And doing it cross-platform is the heart of the problem. Briefcase is the best I’m aware of, but even that has limitations and problems, at least on Windows, where I don’t think they can be solved by simple bugfixes. (Aside: BeeWare already has a project categorization kind of like what I’ve been talking about.)

(I have many more thoughts on this topic, which I’ll save for another thread if/when we start one here. bpo-22213 is where we’re at right now.)

I feel that not all apps are created equal. On one hand we have tools like Black etc. (even pip) that are only useful as executables, and PyPI isn’t really the best way to distribute them, but on the other there are tools that sit in the middle, e.g. Pytest, that need to work both as a command and as a library (I assume most Pytest users import pytest; at least I do). Those are still best managed by the project-level lock file IMO.

Would it be a good idea to split standalone app discussions into a dedicated thread? I feel that it is an important topic (and want to participate in it as well), but not mutually exclusive with the lock file.

Yeah, definitely a separate topic to go deeper into. The relevance here was identifying what I consider a non-problem for lock files (having one for each app used by a project): in generalizing a format across tools, I don’t think we need to account for that.

I think there is a clear need that isn’t being met by requirements files alone, and I think that is demonstrated by tool adoption in Python and in other languages. Dependency resolution and environment reproduction, whether application-specific or used by a library developer to check that the specified constraints are even valid (or need to be narrowed), is obviously meeting some important needs, like avoiding conflicts and broken environments.

IMO the question is whether it’s important to standardize on lock files specifically, which I’m not personally convinced of, I guess. I don’t really have a problem with it, but I don’t see a pressing need either.

I think we’re talking past each other here. Let me expand a bit, to explain how I see these different pieces fitting together. (This is partly inspired by these notes that some of you have seen before.)

Let’s assume we have some concept of an “isolated environment”. You know, the kind of thing you can install stuff into, and then later run the stuff, and it doesn’t interfere with the rest of your system. Maybe it’s a virtualenv, maybe it’s a conda environment, maybe it’s, I don’t know, a docker container using apt to manage packages. Whatever. But let’s say we have a system for describing environments like this, what’s installed into them (packages and versions), commands to run using these environments, and ways to store these descriptions alongside a project, and a smooth user interface for all of that.

This is really useful to a whole set of different users:

  • It gives beginners a simple way to run their scripts, or the Python REPL, or Jupyter, in an environment that they can control, and where it’s easy to install third-party libraries like requests without the problems caused by sudo pip.
  • It gives application developers a way to describe their dev and production environments, share them across the team, share them with deployment services like Heroku, etc.
  • It gives library developers like me a way to describe different test environments, associated services like RTD, tooling that new contributors need, etc.

Notice that everything I’ve said is true regardless of how the applications are packaged – if I can download Black as a single-file standalone binary, then that’s great for a lot of reasons, but in this context I still want a tool that can pick the correct version of that standalone binary and drop it into an isolated environment. Also, everything I said so far applies the same regardless of what kind of environments we’re talking about, whether it’s virtualenvs or conda or whatever. Installing a specific version of gcc into an isolated environment? Sure, conceptually it makes total sense. (And conda users actually do this all the time.)

But then on top of this core idea of course any particular implementation has to make some choices, and these add extra constraints and complications.

Digression: The core difference between pip and conda is that pip knows how to talk about PyPI packages, and conda knows how to talk about conda packages. This sounds inane when I write it, but it’s actually a deep and subtle issue. They have two different namespaces; they use the same words to mean different things. To pip, the string "numpy" means “the package at https://pypi.org/project/numpy, and the equivalent in other channels that share the pypi namespace”. To conda, that same string "numpy" means “the package in the conda channels under the name numpy”. Which in this case is the same software, but our tooling doesn’t know that. Another example: to pip, the string "gulp" means “a decorator to make debugging easier”, and to conda it means a JavaScript build system. These incommensurable namespaces are why there’s no way for wheels to declare dependencies on conda packages, or vice versa, and why using pip and conda in the same environment screws everything up. Both sides find the resulting environment literally impossible to describe; they’re each missing some of the vocabulary they’d need.
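To make the collision concrete, here is a tiny illustration (the descriptions are paraphrased; nothing here is real index metadata):

```python
# Two namespaces that happen to reuse the same strings.
pypi_index = {
    "gulp": "a decorator to make debugging easier",
    "numpy": "the package at https://pypi.org/project/numpy",
}
conda_index = {
    "gulp": "a javascript build system",
    "numpy": "the conda channels' own build of the same software",
}

# A flat environment description drops the namespace, so it is ambiguous:
environment = ["numpy", "gulp"]  # which "gulp"? Neither tool can say.
```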

So back to the core idea of pinning and project-specific environments. One way to implement it would be to make our isolated environments virtualenvs. That’s the natural thing if your environment descriptions are written using the PyPI package namespace. And if you’re, say, developing a library to be uploaded to PyPI, then this is a very convenient namespace to use, because (1) your project’s own dependencies have to be expressed in this namespace, and you want to re-use them in the environment descriptions, and (2) it means you can easily talk about all the versions of all the packages on PyPI.

Another way to implement the core idea would be to make the isolated environments be conda environments. This would be super awkward for me, since I write libraries that get uploaded to PyPI, and so I’d have to hand-maintain some mapping between my PyPI-namespace dependencies and my conda-namespace dependencies. For our other hypothetical users though – the beginners, the application developers – it’s really going to depend on the specific user whether a virtualenv-based or conda-based approach is more useful. They have different sets of packages available, so it just depends on whether the particular packages that you happen to use are better supported by virtualenv or conda.

Now, the folks working on the tools that use the pypi namespace mostly don’t talk to the folks working on the tools that use the conda namespace. Which is unsurprising: in a very literal sense, the two sides don’t have a common language. So, by default, Conway’s law will kick in: the pypi namespace folks will implement a pinning/environment manager that uses the pypi namespace to describe environments, and that will certainly be a thing that helps a lot of us solve our problems. And the conda namespace folks will do whatever they decide to do, which will probably also help people solve slightly different problems. And that’s not a terrible outcome. More things to help people solve problems are good!

But… there’s also a third possibility we might want to think about. The “original sin” that causes all these problems is that PyPI and conda use different namespaces. What if we invented a new meta-namespace that included both? So e.g. "pypi:gulp" would mean “what pypi calls gulp”, and "conda:gulp" would mean “what conda calls gulp”, and now we can use both vocabularies at the same time without namespace collisions. And then:

  • We could describe the state of hybrid environments, on disk or in lock files: “the packages in this environment are: pypi:requests == 2.19.1, conda:python == 3.7.2, …”

  • A sufficiently clever package manager could do things like: when someone requests to install pypi:scikit-learn from the package source https://pypi.org, it downloads the wheel and discovers that it has metadata saying Install-Requires: numpy. Since this is in a wheel, our package manager knows that this really means pypi:numpy. Next it checks its package database, and sees that it already has a package called conda:numpy installed, and the conda:numpy package has some metadata saying that it Provides: pypi:numpy. Therefore, it concludes, conda:numpy can satisfy this wheel’s dependency. (A rough sketch of this lookup follows the list.)

  • We could add wheel platform tags for conda, e.g. cp37-cp37m-conda_linux_x86_64. And then since we know this wheel only applies to conda, it would be fine if its metadata included direct dependencies on packages in the conda: namespace, like Install-Requires: conda:openssl==1.1.1.
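Here’s a minimal sketch of how that Provides lookup could work, assuming a hypothetical "ns:name" requirement syntax and an invented installed-package database; none of this is an existing pip or conda API:

```python
# Hypothetical sketch of namespace-aware dependency satisfaction.
# The metadata fields and the "ns:name" syntax are invented here.

def parse(requirement):
    """Split "pypi:numpy" into ("pypi", "numpy"); bare names default to pypi."""
    ns, _, name = requirement.rpartition(":")
    return (ns or "pypi", name)

# Installed packages, keyed by (namespace, name).
installed = {
    ("conda", "numpy"): {"version": "1.16.2", "provides": [("pypi", "numpy")]},
    ("conda", "python"): {"version": "3.7.2", "provides": []},
}

def is_satisfied(requirement):
    """Satisfied by an exact namespaced match, or by any installed
    package whose Provides metadata lists the wanted name."""
    wanted = parse(requirement)
    return wanted in installed or any(
        wanted in meta["provides"] for meta in installed.values()
    )

# A wheel's dependency on "numpy" really means pypi:numpy, and
# conda:numpy provides it, so nothing new needs to be downloaded:
assert is_satisfied("numpy")
assert not is_satisfied("pypi:scikit-learn")
```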

CC: @pzwang

Fedora does that with RPM. I put some details in a new topic:

FYI I have not forgotten about this topic. It is pinned in a browser tab :slight_smile: Thank you very much for starting the discussion, and I owe you a thoughtful reply; but I am desperately firefighting a few things right now.

Did this ever happen? I feel like it did, but can’t find the thread now.

I’ve just hit this again, hard. VS Code has the immensely annoying default of expecting you to install pylint, black, etc. in every environment that you want to run code in. I know you can specify an explicit path to the tools, but to do that you need an exe somewhere, and that’s exactly the “standalone app discussions” issue that I’d like to discuss further. At the moment, I need to set up and manage some sort of “tools” virtualenv (or in practice a standalone copy of Python), and there’s nothing in the ecosystem to encourage the authors of tools like pylint and black to offer anything more user-friendly :frowning:

For what it’s worth, conda environment files are documented here

As someone who uses conda for both python and R dependencies I think it desperately needs the concept of namespaces.

There’s a proposal to add namespaces, but it hasn’t yet been a priority.

I don’t think the proposal there would implement the concept of a meta-namespace as you describe, but it might allow for the meta-namespace concept to be layered on top.

Hey Peter, any updates?

I’ve been dabbling with the namespace idea (not implementing it, but leaving room for future support in the lock file format), and noticed that Conda seems to lack documentation on exactly how packages are specified, how environment files are consumed, etc. There is documentation on how to use them, and Conda is effectively the reference implementation, but without proper documentation it’s really difficult for people trying to interoperate with Conda :frowning:

Was something like this what you were looking for?

https://conda.io/projects/conda-build/en/latest/concepts/package-anatomy.html

Or were you looking more for the way that individual requirements are specified?

https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html#build-version-spec

Env files are definitely under-documented. We’re very interested in improving that, and maybe this topic will produce a good standard to unify on. We’ll either be adopting whatever comes out of this discussion, or otherwise trying to unify our 3 (!) ways of specifying environments (conda’s lists of specs, conda-env’s YAML files, and anaconda-project’s YAML files).

I was looking for an overview of what exactly can go into an environment.yml file, something similar to this:

The best I can find currently is:

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually

But that only covers the general cases, and I can’t find, for example, what top-level sections are possible other than name, channels, and dependencies.

The version spec you linked was also quite helpful for understanding what can go into dependencies though, thanks. (Are there any other special dependency entries besides pip that can have nested requirements? Or can channels define that themselves, say to require an NPM install?)
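In case it helps anyone else, here is my current working understanding of the file, expressed as a parse-and-check sketch. The key list is only what I could piece together from the docs and from `conda env export` output, so treat it as an assumption rather than a spec:

```python
import yaml  # PyYAML

# Top-level keys I have been able to confirm from docs and examples;
# "prefix" shows up in `conda env export` output. A guess, not a spec.
KNOWN_KEYS = {"name", "channels", "dependencies", "prefix"}

ENV_FILE = """
name: stats
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - pip
  - pip:
      - requests==2.19.1
"""

env = yaml.safe_load(ENV_FILE)
unknown = set(env) - KNOWN_KEYS
if unknown:
    print("top-level keys I can't find documentation for:", unknown)
```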

I’ve been treading water with this lock file format (with accompanying JSON schema):

There are still several holes (e.g. VCS support and editable installs, though I don’t want to spec the latter until editable installs themselves are spec’d), but I feel the general structure is quite serviceable.

Some important characteristics (IMO), with a rough illustrative sketch after the list:

  • JSON format (more easily verified than requirements.txt, but still expressive enough, and parsable with built-in tools).
  • Dependency keys are decoupled from package names. This makes the structure much simpler, and solves a difficult problem in Pipfile (and Poetry too, if I’m not mistaken), where you cannot specify a package’s version based on environment markers (e.g. v1 on Windows, v2 otherwise).
  • Package information is not tied to a dependency entry. This (together with the previous point) leaves room for potential support for other package managers (e.g. Conda).
  • Keeps dependency information as a graph (so tools can install each package with --no-deps, but still know what depends on what without resolving).
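To illustrate what I mean by the decoupling, here is a hand-written sketch in the spirit of the draft; every field name below is made up for illustration and is not the actual schema:

```python
import json

# Illustrative only: dependency keys are arbitrary strings, decoupled
# from package names, so one package can appear twice under different
# environment markers, and package metadata lives in its own table.
lock = {
    "dependencies": {
        "foo-win": {
            "package": "foo",
            "version": "1.0",
            "marker": "sys_platform == 'win32'",
            "dependencies": [],          # edges of the dependency graph
        },
        "foo-posix": {
            "package": "foo",
            "version": "2.0",
            "marker": "sys_platform != 'win32'",
            "dependencies": [],
        },
        "myapp": {
            "package": "myapp",
            "version": "0.1",
            "dependencies": ["foo-win", "foo-posix"],
        },
    },
    # Separate package table, leaving room for non-PyPI sources:
    "packages": {
        "foo": {"source": "pypi"},
        "myapp": {"source": "pypi"},
    },
}
print(json.dumps(lock, indent=2))
```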

I hope this could be helpful if the topic is brought up during the mini-summit :slight_smile:

Apologies for not keeping up with the threads on here; it turns out that Discourse is one too many things to stay on top of. I look forward to discussing this at PyCon with some of you. @uranusjr, if there is anything in particular you want me to bring up, shoot me an email or a message and I can make sure to try and cover it.

If you are using pipx, pointing to the pipx-installed tools would probably be a good solution here.

One thing to bring up to make this a bit more topical: when Victor removed Trollius from PyPI he broke some projects, but if we had a lock file format that recorded where the files were on PyPI, then I believe people with a lock file could have continued to download the files and simply never noticed that the project was removed from the index.
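To sketch why: if the lock file records the exact artifact URL and hash, an installer can fetch and verify the file without consulting the index at all. This rests on the same assumption as above, that the artifact itself stays retrievable at its recorded URL; the URL and hash below are placeholders, not the real Trollius files:

```python
import hashlib
import urllib.request

# Placeholder entry; a real lock file would record the actual
# files.pythonhosted.org URL and sha256 digest for each artifact.
locked = {
    "url": "https://files.pythonhosted.org/packages/.../trollius-2.2.tar.gz",
    "sha256": "<recorded-at-lock-time>",
}

def fetch_locked(entry):
    """Download straight from the recorded URL and verify the digest.
    No index lookup happens, so index-side removal goes unnoticed."""
    data = urllib.request.urlopen(entry["url"]).read()
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise RuntimeError("hash mismatch; refusing to install")
    return data
```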

I am kind of torn on this. Recording the URL has obvious benefits to users, but as a maintainer I kind of want to keep the possibility of removing an artifact after it’s uploaded. Say I botched a wheel for one platform: now I can immediately kill it to limit the damage (and let users resort to installing from sdist). I’d be out of options if URLs were recorded in the lock file (releasing a new version doesn’t matter, since the package version is locked either way).

It’s probably possible to build some index features to fix this situation, but then we’d be better off just fixing the Trollius problem on the index side in the first place.
