Structured, Exchangeable lock file format (requirements.txt 2.0?)

I’ve been dabbling with the namespace idea (not implementing it, but leaving room for future support in the lock file format), and noticed that Conda seems to lack documentation on exactly how packages are specified, how environment files are consumed, etc. There is documentation on how to use them, and Conda is effectively the reference implementation, but without proper documents, things are really difficult for people trying to interoperate with Conda :frowning:

Was something like this what you were looking for?

https://conda.io/projects/conda-build/en/latest/concepts/package-anatomy.html

Or were you looking more for the way that individual requirements are specified?

https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html#build-version-spec

Env files are definitely under-documented. We’re very interested in improving that, and maybe this topic will produce a good standard to unify on. We’ll either adopt whatever comes out of this discussion, or otherwise try to unify our 3 (!) ways of specifying environments (conda’s lists of specs, conda-env’s YAML files, and anaconda-project’s YAML files).

I was looking for an overview of what exactly can go into an environment.yml file.

The best I can find currently is:

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually

But that only covers the general case, and I can’t find, e.g., what top-level sections are possible other than name, channels, and dependencies.

The version spec you linked was also quite helpful for understanding what can go into dependencies, though, thanks. (Are there any other special dependency entries besides pip that can have nested requirements? Or can channels define that themselves, say to require an NPM install?)
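For reference, a minimal sketch of the environment.yml layout as I understand it from the pages above, parsed with PyYAML; the keys shown (name, channels, dependencies) are the documented ones, and the nested pip entry is the only nested form I know of, so anything beyond that is an assumption:

```python
# A minimal sketch of a documented environment.yml, parsed with PyYAML
# (pip install pyyaml). Only the documented top-level keys are shown;
# whether others exist is exactly the open question above.
import yaml

ENV_FILE = """
name: example
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8          # a conda match spec (see the build-version-spec link)
  - numpy>=1.17
  - pip                 # pip itself, so the nested entry below can run
  - pip:                # the only nested dependency entry documented today
      - requests==2.23.0
"""

env = yaml.safe_load(ENV_FILE)
for dep in env["dependencies"]:
    # Nested pip requirements arrive as a one-key mapping; plain specs as strings.
    if isinstance(dep, dict):
        print("pip:", dep["pip"])
    else:
        print("conda:", dep)
```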

I’ve been tinkering with this lock file format (with an accompanying JSON schema):

There are still several holes (e.g. VCS support and editable installs, though I don’t want to spec the latter at least until editable installs themselves are spec-ed), but I feel the general structure is quite serviceable.

Some important characteristics (IMO), with an illustrative sketch after the list:

  • JSON format (more easily verified than requirements.txt, but still expressive enough, and parsable with built-in tools).
  • Dependency keys are decoupled from package names. This makes the structure much simpler, and solves a difficult problem in Pipfile (and Poetry too, if I’m not mistaken), where you cannot specify a package’s version based on environment markers (e.g. v1 on Windows, v2 otherwise).
  • Package information is not tied to a dependency entry. This (with the previous point) leaves room for potential support for other package managers (e.g. Conda).
  • Keeps dependency information as a graph (so tools can install each package with --no-deps, but still know what depends on what without resolving).
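To make those points concrete, a hypothetical entry might look like the following. This is a sketch only, not the actual schema:

```python
# A hypothetical sketch, NOT the actual proposed schema. Keys like
# "foo-win" are arbitrary labels decoupled from package names, so one
# package can appear twice under different environment markers, and
# "dependencies" records graph edges by key so an installer can use
# --no-deps without re-resolving.
import json

LOCK = {
    "dependencies": {
        "foo-win": {
            "package": {"name": "foo", "version": "1.0"},
            "marker": "sys_platform == 'win32'",
            "dependencies": ["bar"],
        },
        "foo": {
            "package": {"name": "foo", "version": "2.0"},
            "marker": "sys_platform != 'win32'",
            "dependencies": ["bar"],
        },
        "bar": {
            "package": {"name": "bar", "version": "0.3"},
            "dependencies": [],
        },
    },
}

print(json.dumps(LOCK, indent=2))
```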

I hope this could be helpful if the topic is brought up during the mini-summit :slight_smile:


Apologies for not keeping up with the threads on here; it turns out that Discourse is one too many things to stay on top of. I look forward to discussing this at PyCon with some of you. @uranusjr, if there is anything in particular you want me to bring up, shoot me an email or a message and I can make sure to try and cover it.

If you are using pipx, pointing to the pipx-installed tools would probably be a good solution here.

One thing to bring up to make this a bit more topical: when Victor removed Trollius from PyPI he broke some projects, but if we had a lock file that recorded where the files were on PyPI, then I believe people could have continued to download the files and simply never noticed that the project was removed from the index.

I am kind of torn on this. Recording the URL has obvious benefits for users, but as a maintainer I kind of want to keep the possibility of removing an artifact after it’s uploaded. Say I botched a wheel for one platform; now I can immediately kill it to limit the damage (and let users resort to installing from sdist). I’d be out of options if URLs were recorded in the lock file (releasing a new version does not matter, since the package version is locked either way).

It’s probably possible to build some index features to fix this situation, but then we’d be better off just fixing the Trollius problem on the index side in the first place.


Not necessarily. It would really depend on what the fallback logic was when a URL returned a 404. You could potentially have an option to update/re-lock any dependency that is now missing, automatically grab the sdist if a wheel file is missing, etc.
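A sketch of what such fallback logic could look like; the policy flag and the re-lock signal are invented names here, not existing pip behaviour:

```python
# A sketch of the fallback options above, under assumed names; the
# allow_relock flag and NeedsRelock signal are hypothetical.
import urllib.error
import urllib.request

class NeedsRelock(Exception):
    """Hypothetical signal that the caller should re-resolve this dependency."""

def fetch_locked(url, fallback_urls=(), allow_relock=False):
    # Try the locked URL first, then any fallbacks (e.g. the sdist).
    for candidate in (url, *fallback_urls):
        try:
            with urllib.request.urlopen(candidate) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code != 404:
                raise  # only a missing file triggers the fallback chain
    if allow_relock:
        raise NeedsRelock(url)  # caller may re-lock this one dependency
    raise FileNotFoundError(f"locked artifact gone: {url}")
```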

Auto-relocking is generally not the intended behaviour (some would go as far as to argue it defeats the purpose of the lock file completely), especially on CI, where it generally happens without supervision.

Auto-fallback to sdist is a possibility, but that would require other changes to how things are defined. As of now, the sdist has no special meaning during dist selection.

Again, it’s entirely possible to fix any of these, but I feel that locking URLs would then cause more problems than it solves, and doesn’t really fit well into the current design of packaging. Unless you want to argue there’s something wrong with the underlying design (and I might not disagree if you do :stuck_out_tongue:)

The repository format doesn’t guarantee stable URLs, and there are a number of reasons why a repository might decide to have non-stable URLs (such as one-time authentication tokens embedded in the URL).

Specific to PyPI, we are unlikely to ever have URLs not be stable. However, the fact that files stay available after deletion is an implementation detail of PyPI. The mirrors generally do not exhibit the same behavior, and we’ve talked about adding a garbage-collection process to PyPI that would remove these no-longer-referenced files after a period of time.


In Nix we have the concept of “fixed-output derivations”. These are essentially functions that can have network access, but at the end the output they generate (a file or folder) has to match a hash. We generally use these for fetching source code: a simple function takes a single URL or a list of URLs that offer the file in question, plus the hash.

When processing further, the file is identified by the hash. This way the handling does not depend on URLs that may break; as long as you can provide a file with the same hash, you’re fine.
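In Python terms, the idea reduces to something like this (a minimal analogue, assuming nothing about Nix internals):

```python
# A minimal fixed-output-style fetch: try mirrors in order, but accept a
# result only if its sha256 matches the recorded hash, so the URL itself
# is never trusted.
import hashlib
import urllib.request

def fetch_fixed_output(urls, sha256):
    for url in urls:
        try:
            with urllib.request.urlopen(url) as resp:
                data = resp.read()
        except OSError:
            continue  # dead mirror; the hash, not the URL, identifies the file
        if hashlib.sha256(data).hexdigest() == sha256:
            return data
    raise RuntimeError(f"no mirror produced a file with sha256 {sha256}")
```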

I just want to point out one issue we’re having with Poetry and Pipfile that prevents us from using them as input. When they lock, they record the hashes of the files that match the requirements, but not what is actually used or going to be used. So instead of a single hash per package there can be multiple, e.g. one for a wheel and another for an sdist. This gives the user flexibility, because they may choose e.g. to use wheels if these have been uploaded in the meantime, but it limits reproducibility.

(There is a way we could potentially use it, but it’s messy: it basically means checking against the PyPI API to find out which artifacts the hashes correspond to (file type, OS, Python version), and then filtering. Of course, this also only works for packages on PyPI.)
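For the record, the messy workaround would look roughly like this, using PyPI’s JSON API, which is exactly the part that ties it to PyPI:

```python
# A sketch of mapping a locked hash back to its artifact via PyPI's JSON
# API (https://pypi.org/pypi/<name>/<version>/json); filename and
# packagetype then reveal wheel vs sdist, platform, etc.
import json
import urllib.request

def artifact_for_hash(name, version, sha256):
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url) as resp:
        release = json.load(resp)
    for file_info in release["urls"]:
        if file_info["digests"]["sha256"] == sha256:
            return file_info["filename"], file_info["packagetype"]
    return None  # hash not found in this release
```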

(I am not particularly familiar with Nix; sorry if my ignorance shows.)

Wouldn’t that also restrict the platforms the lock file can be used on? I feel the flexibility is needed for Pipfile.lock and poetry.lock, because otherwise the lock file would only be useful if everyone used the same development platform, which is not practical for most Python OSS projects (and really does not fit well with the rest of Python packaging).

The extra PyPI round trip could be avoided if we also included the file name with the hash (so instead of a list of hashes, it’s a mapping of file names to hashes). Wheel names by design contain all the information necessary for dist selection. I think this is a great idea! Especially now with hash-based dist selection implemented in pip :smile:
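A sketch of how a tool could select from such a mapping locally; the lock entry shape is made up, but parse_wheel_filename and sys_tags are real APIs from the packaging library:

```python
# Selecting an artifact from a hypothetical name-to-hash mapping without
# any index round trip, using the packaging library (pip install packaging).
from packaging.tags import sys_tags
from packaging.utils import parse_wheel_filename

locked_files = {  # hypothetical lock entry: file name -> hash
    "foo-1.0-py3-none-any.whl": "sha256:aaaa...",
    "foo-1.0-cp38-cp38-win_amd64.whl": "sha256:bbbb...",
    "foo-1.0.tar.gz": "sha256:cccc...",
}

supported = set(sys_tags())  # tags the running interpreter accepts
for filename, digest in locked_files.items():
    if not filename.endswith(".whl"):
        continue  # the sdist needs no tag check
    _name, _version, _build, tags = parse_wheel_filename(filename)
    if tags & supported:
        print("would install", filename, "verified against", digest)
```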

Correct, although…

That’s understandable. In principle, various OS versions and Python versions can be supported; however, offering multiple artifacts per platform/version (so both sdist and wheel) means choice, and thus no longer reproducibility. Of course, that leads to the key question: what level of reproducibility should a lock file make possible? Offering choice limits reproducibility; the tool that interprets the lock file (e.g. Nix) could choose to always accept only sdists, but then that restriction is not recorded in the lock file.

Indeed.

I also want to point out that we’re slowly moving more towards using VCS sources as input, not PyPI. Packages on PyPI don’t always include tests, artifacts are not always uploaded, and it happens that what is uploaded contains faults (e.g. a wrong version).

Therefore, I think there is a need for an index that records where the canonical source is. Clearly, a lock file would then have to support source code from various VCSs as input as well.


Correct, but that’s by design. Historically, the Python community has favoured cross-platform support over reproducibility. But if the tools were updated to allow an opt-in flag to lock only the files actually used, would that give you the reproducibility you’re after?

We do, and it’s called PyPI. :wink: Having PyPI link to external files used to be supported, but it was more trouble than it was worth. IOW, I wouldn’t want to go the Go route of package management.

Unfortunately, what is uploaded there are built artifacts, even sdists. I’m of the opinion that the index should record references to the source in a VCS, if one exists. But I suppose there are different views on what the purpose of an index is.

I feel it is somewhat sensible for PyPI (or the files hosted on PyPI) to contain such information. Both wheels and sdists already have similar metadata (e.g. Project-URL, but for the abstract idea of a project, not the concrete source code). In fact, a Source-URL field was once proposed, but failed to be included.

AFAIU, the information does not need to be actually understood by the index, pip, or even the lock file; they merely need to know where to find it (where to download that metadata), and let the user interpret the value obtained.

Of course, there has to be some agreement between the two “ends” of the process (the package author and the consumer of the field). It’s certainly true that we could add a mechanism to include (more or less) generic data in package metadata, but without some level of agreement from package authors to include that data, it’s not going to be of much practical use. My concern here is that there could easily be an unwarranted assumption that the “canonical” source is available independently of the sdist, or that projects can provide a “source file” rather than (say) a VCS URL, or … And I don’t think that the PyPA wants to get into the business of mandating how projects provide their sources.

Maybe the core metadata specification could add “user-specified” metadata - for example, saying that any metadata field with a name starting with “X-” is allowed, optional, and assumed to be a multiple-use string. Then consumers like Nix could start by proposing an informal standard using an “X-” field (X-Nix-Canonical-Source?), and if it gains wide enough adoption, propose “promoting” it to an officially standardised metadata field?
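Because core metadata uses an email-header syntax, such a field would already round-trip with stdlib tooling; the field name below is of course invented:

```python
# Core metadata is email-header syntax, so the stdlib email module can
# already read a hypothetical "X-" field; X-Nix-Canonical-Source is an
# invented name, not a real field.
from email import message_from_string

METADATA = """\
Metadata-Version: 2.1
Name: foo
Version: 1.0
X-Nix-Canonical-Source: https://example.org/foo.git
"""

msg = message_from_string(METADATA)
print(msg.get_all("X-Nix-Canonical-Source"))  # ['https://example.org/foo.git']
```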


We already have Project-URL, which is a mapping of “some user-defined string” to URL.

If Nix wants to encourage users to specify this metadata, they can do so today; they’d just need to choose the name of the “Project-URL” to use (“source repository” or “sources” or what have you). In addition to not needing a metadata update, this information gets exposed on the PyPI project page, and Warehouse (pypi.org) can fetch statistics if it’s a public GitHub repo (and maybe GitLab too?).
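For example, a consumer could read those entries from any installed distribution today with importlib.metadata; the labels checked below are just conventions a project would have to choose, not anything the spec mandates:

```python
# Reading Project-URL entries with the real importlib.metadata API
# (Python 3.8+); the accepted labels are a guessed convention.
from importlib.metadata import metadata

meta = metadata("pip")  # any installed distribution works here
for entry in meta.get_all("Project-URL") or []:
    label, _, url = entry.partition(",")  # entries look like "Label, URL"
    if label.strip().lower() in {"source", "sources", "source repository"}:
        print("canonical source:", url.strip())
```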
