A file format to list Python dependencies of an application without strict reproducibility guarantees

Since PEP 665 is moving forward with the decision to only support wheels, I am creating a new proposal that’s separate from, but dependent on, PEP 665 to provide sdist and source tree support, as this is evidently desired from the discussion in Supporting sdists and source trees in PEP 665.

Speaking as an author of PEP 665, my personal motivation for limiting the PEP’s scope to wheels comes mainly from my impression, based on discussions around the PEP, that a lock file without strict reproducibility guarantees is unlikely to be accepted by the community: if “locking down” is defined to mean “guaranteeing reproducibility”, such a file does not actually provide locked-down dependencies. This, however, does leave a use case unsatisfied. In certain workflows, users do not want to actually lock their dependencies down to strict reproducibility in a file. Instead, they assume the file’s consumer is able to somehow maintain reproducibility.

Where PEP 665 produces this flow:

     ┌──────┐   ┌─────────────┐
     │ file ├──►│ environment │
     └──────┘   └─────────────┘

so in order for the environment to be reproducible, the file must fully describe reproducibility.

While this alternative workflow described in the previous paragraph produces this:


  external        ┌─────────────┐
   inputs  ──────►│             │
                  │ environment │
 ┌──────┐ ┌──────►│             │
 │ file ├─┘       └─────────────┘
 └──────┘

and wants a file that, assuming external inputs are reproducible, guarantees reproducibility, while considering this reproducibility of external inputs out of scope.

This leads me to two natural conclusions:

  1. Like how we don’t cram sdist and wheel into one single format, the two flows need two different formats.
  2. Like how we defined wheels first and (slowly) apply usable parts of wheels to sdists, the file format for the latter flow can and should reuse syntaxes defined in PEP 665, while being kept semantically distinct.

In other words, PEP 665 is analogous to wheel for application dependency specification, and we now need an sdist equivalent for the same purpose.

So I’m proposing a new format that will

  1. Have a filename suffix distinct from PEP 665’s.
  2. Reuse all PEP 665 syntaxes and semantics where possible.
  3. Give certain reused fields additional or alternative semantics to support sdist and source tree.

Top-level fields; [tool] and [metadata] sections

These all have exactly the same syntaxes and semantics as proposed in PEP 665.

[package._name_._version_] sections

The section name scheme uses exactly the same syntaxes and semantics as PEP 665.

Fields direct, requires-python, and requires use exactly the same syntaxes and semantics as PEP 665. This also implies that a package must either define metadata statically, or be able to generate consistent metadata dynamically in a reproducible build environment.

The url field, in addition to a wheel file, may specify a URL to an sdist or source tree. As in PEP 665, there are no special restrictions on the URL format, and the installer is free to not use the URL for actual artifact retrieval.

The following additional rules are introduced:

  • If url specifies a file…
    • filename must be the name to use for the downloaded file.
    • hashes must contain at least one hash for the file.
  • If url specifies a VCS location…
    • filename must be the name to download the directory into.
    • hashes must contain a field named commit_id, with the value being the fully resolved revision identifier, as defined in PEP 610.
  • If url specifies a (non-VCS) directory…
    • filename must be an empty string.
    • hashes must contain a field named editable, with the value being a boolean.
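
The rules above could be checked roughly as follows. This is only a sketch: the function name check_source_entry and the kind argument are hypothetical, and how a tool decides whether a URL refers to a file, a VCS location, or a directory is left open by the proposal, so the caller passes that decision in.

```python
def check_source_entry(kind, entry):
    """Return a list of rule violations for one package table.

    `kind` is "file", "vcs" or "directory" (hypothetical labels); the
    proposal does not say how a tool derives this from the URL.
    """
    errors = []
    filename = entry.get("filename")
    hashes = entry.get("hashes", {})
    if kind == "file":
        if not filename:
            errors.append("filename must name the downloaded file")
        if not hashes:
            errors.append("hashes must contain at least one hash")
    elif kind == "vcs":
        if not filename:
            errors.append("filename must name the checkout directory")
        if not isinstance(hashes.get("commit_id"), str):
            errors.append("hashes.commit_id must be the resolved revision")
    elif kind == "directory":
        if filename != "":
            errors.append("filename must be an empty string")
        if not isinstance(hashes.get("editable"), bool):
            errors.append("hashes.editable must be a boolean")
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return errors
```

An empty list means the entry satisfies the rules for its kind.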

VCS example

[package.packaging."20.3"]
url = "git+https://github.com/pypa/packaging.git@20.3"
filename = "packaging"
hashes.commit_id = "b10b90a7d1c3afac6852462fa2548db231810adf"

specifies cloning the repository into a directory named packaging, and the clone must point to commit b10b90a7d1c3afac6852462fa2548db231810adf.

Installation

The next natural conclusion I drew is that, like how we install an sdist by turning it into a wheel first, this new format should also be installed by being turned into a PEP 665 lock file (plus additional information), so we can reuse how a PEP 665 lock file can be installed into an environment.

All source trees and sdists included in the new format must be buildable locally into a wheel with PEP 517 or PEP 660, or the legacy bdist_wheel. We explicitly exclude support for things requiring setup.py install and setup.py develop. This means, I believe, we can turn every specification in this new format into a combination of zero or more wheels built from the non-wheel entries in the file (we might need to be able to formally prove this? I think it’s possible), and one PEP 665 lock file that references those built wheels. Installation is therefore achieved by installing this converted PEP 665 lock file.
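
To make the conversion idea a bit more concrete, here is a rough sketch. Everything in it is an assumption for illustration: the function name convert_pin_to_lock, the nested name/version dict layout, the ".whl" filename test for "already a wheel", and the build_wheel callable, which stands in for whatever build front end a tool would actually use.

```python
def convert_pin_to_lock(packages, build_wheel):
    """Turn pin-file entries into lock-file entries plus built wheels.

    `packages` maps name -> version -> entry (hypothetical in-memory form).
    `build_wheel(name, version, entry)` is a caller-supplied stand-in that
    builds the source and returns (wheel_filename, sha256_hex).
    """
    lock = {}
    built = []
    for name, versions in packages.items():
        for version, entry in versions.items():
            new_entry = dict(entry)  # keep requires, requires-python, etc.
            if not entry.get("filename", "").endswith(".whl"):
                wheel_name, sha256 = build_wheel(name, version, entry)
                # Assumption: the converted entry points at the local wheel.
                new_entry["url"] = wheel_name
                new_entry["filename"] = wheel_name
                new_entry["hashes"] = {"sha256": sha256}
                built.append(wheel_name)
            lock.setdefault(name, {})[version] = new_entry
    return lock, built
```

Entries that are already wheels pass through unchanged, so a pin file with no source entries converts to itself and nothing gets built.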

Since this new format will have a filename scheme distinct from PEP 665’s, such a file can always be unambiguously identified by tools. This means that an installer, if it wishes to, can seamlessly take either file format and perform the necessary conversion behind the scenes, like how pip can install either an sdist or a wheel with the same PEP 508 format.

Proposed format name and filename suffix

I’m proposing the name pin file and the .pypin.toml suffix, since this kind of dependency specifying is commonly called “pinning versions”. Alternative proposals are welcomed.
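
As a small illustration of how a tool might tell the two formats apart by name alone, assuming the .pypin.toml suffix proposed here and the .pylock.toml suffix used for PEP 665 lock files elsewhere in this discussion (the function name detect_format is made up):

```python
from pathlib import PurePath


def detect_format(path):
    """Classify a file as "pin", "lock", or None from its filename suffix."""
    name = PurePath(path).name
    if name.endswith(".pypin.toml"):
        return "pin"
    if name.endswith(".pylock.toml"):
        return "lock"
    return None
```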


This covers everything, I think? I’d be very interested in hearing thoughts about this.


I totally understand the rationale to separate the two formats, but the first thought that came to my mind is: if .pypin.toml is to exist, then .pylock.toml no longer needs to keep multiple solutions in a single file (AKA blind installer); Poetry and PDM (those who proposed the smart installer) can stick to .pypin.toml to provide that information.

If so, the two file formats play different roles in these package managers:

  • .pypin.toml is to replace pdm.lock and poetry.lock as a standard pin file.
  • .pylock.toml can be generated by an export command to provide a single solution for the given environment.

That’s perfectly fine to me. I imagine people who use pylock would be those who really want reproducibility, and most tools would emit pypin by default.

A locker can also have a “strict mode” that emits pylock instead (which is trivial because the two formats are exactly the same, the locker just needs to check if there are any non-wheel entries in the dependency graph). Another idea is to provide an “export” command that builds all non-wheels in a pypin and emits a pylock + those built wheels.
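
The strict-mode check could look something like this. It is a sketch only: choose_output_format and the name/version dict layout are hypothetical, and treating "filename ends in .whl" as the test for a wheel entry is my assumption.

```python
def choose_output_format(packages, strict=False):
    """Pick the format a locker would emit for a resolved dependency graph.

    `packages` maps name -> version -> entry (hypothetical in-memory form).
    """
    if not strict:
        return "pypin"  # default behaviour as described above
    non_wheels = [
        (name, version)
        for name, versions in packages.items()
        for version, entry in versions.items()
        if not entry.get("filename", "").endswith(".whl")
    ]
    if non_wheels:
        raise ValueError(f"strict mode: non-wheel entries found: {non_wheels}")
    return "pylock"
```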

My opinion is that this should go into PEP 665 because it changes its scope quite substantially if we’re gonna be using pypins for development, and bartering pypins for pylocks for deployment. This all needs to form a coherent whole.

P.S. When I suggested that locking be done in two steps in the sdist thread, Brett pointed out that:

But then what’s the initial point of the lock file? If you have to recompute things then that waters down the supply chain security by dynamically adding things in an unsupervised manner. Even if you make it only additive to the initial lock file, what the sdists suddenly require as dependencies will lead to supply chain attacks.

… which seems to be applicable here as well and would be good to address.

I think this is where I have questions, too. If “most tools” will emit pypin, does that imply that you see the pylock format as being for niche and internal use only? I think we need to establish (but I’ll do it on the pylock thread) whether there’s still the critical mass of use cases for the pylock format on its own merit, now that it’s wheel-only.

I think @layday has a very good point here, the way this proposal is presented feels to me like we’ve only just decided on a wheel-only pylock format and we’re already moving on and discussing the extended version which handles sdists, before the ink is dry on the first proposal (indeed, before it’s even been approved!). If the PEP authors are already exploring an sdist format, maybe it does need to form a coherent whole, rather than being split up into parts.

I know I pushed for the “define sdist behaviour properly or exclude them” position, but now that we have excluded them, I’d hope that we could focus on finalising the wheel-only PEP. And that means, for example, revisiting the justification to confirm that it still holds if sdists aren’t supported, and updating the rationale to explain the motivation for making reproducibility a core feature - I don’t recall seeing a lot of requests for pip to support reproducible installs, for example.

I’m also worried that if the community doesn’t put its energy behind making the pylock file a success, but instead gets immediately distracted by follow-up proposals for sdist support, PEP 665 will ultimately fail and work on extending it to sdists will be wasted anyway :slightly_frowning_face:


Except for those situations where you truly don’t want sdists to begin with but you want to support multiple platforms without having to guess what those platforms are ahead of time. Sdists do fill a usability gap with .pylock.toml, but that doesn’t necessarily eliminate the usefulness of the multi-platform story either.

I think the question comes down to whether we think people wanting reproducible builds and security are going to want to support potentially arbitrary platforms, or only platforms they know ahead of time?

To be clear, my comment was in reference to deployment, not development; you wouldn’t want to ship your .pypin.toml file in your Docker container, but using it on your dev machine is up to you based on your auditing practices for building wheels. So my comment is pertinent here only from the perspective that there is still a use for lock files even if pin files came into existence.

What’s “internal use” to you? I would personally still want to audit the lock file and any changes in a PR as that’s what I would deploy to production and it would still get committed to my VCS, so it isn’t an implementation detail to me.

My mistake. I think I was somehow assuming that lock files would be used to produce pin files, but honestly I can’t work out what I meant any more :slightly_frowning_face:

My key point though is that if @uranusjr expects “most tools” to produce pin files, does that mean that he sees lock files as being the rare case? That seems an odd position for a co-author of PEP 665 to take…

I don’t interpret his quote the same way you do. Making the default be pin files just means the expectation is more than 50% of users will want/need that, not that it will be a “rare case” that people will want a lock file. Honestly, my assumption is if we had pin files that tools would emit lock files when possible, and fall back to pin files when necessary. That still provides a secure-by-default posture for users while allowing for those situations where users need sdists and are okay with the security risks. Otherwise they get better security for free when it happens to work out that they can simply use wheels.

For deployment there could be a tool that takes the pin file and produces a lock file along with locally built wheel files that get pushed to deployment (or even all the wheel files; requiring the hashes in PEP 665 means you can help verify no tampering when pushing to deployment). But all of that still benefits from a lock file format standard as proposed by PEP 665 as the final artifact that gets consumed and used.
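
For what it’s worth, the tamper check on the deployment side is cheap. A minimal sketch using only hashlib, assuming hashes is a mapping of algorithm name to hex digest (the function name and the set of accepted algorithms are my own choices):

```python
import hashlib

# Assumption: only these algorithms are accepted by this sketch.
SUPPORTED = ("sha256", "sha384", "sha512")


def verify_hashes(data, hashes):
    """Return True if every supported hash matches and at least one was checked."""
    checked = False
    for algorithm, expected in hashes.items():
        if algorithm not in SUPPORTED:
            continue  # ignore entries this sketch does not understand
        if hashlib.new(algorithm, data).hexdigest() != expected:
            return False
        checked = True
    return checked
```

Returning False when nothing could be checked is deliberate: an entry with no usable hash should not count as verified.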

A quick note about this example: this URL format is pip specific (and the direct URL format part of PEP 508 is non normative). So I think this should rather use, or draw inspiration from, the PEP 610 data structure which splits components in a tool-neutral way.
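
To illustrate, the pip-style URL from the VCS example could be split into a PEP 610 style structure along these lines. The url, vcs_info, vcs, requested_revision and commit_id keys come from PEP 610; the function name and the parsing rule (revision after the last “@” in the path) are assumptions for this sketch, not a full implementation of pip’s URL handling.

```python
from urllib.parse import urlsplit, urlunsplit


def to_pep610(pip_url, commit_id):
    """Split a "vcs+url@rev" string into a PEP 610 style dict."""
    vcs, sep, plain_url = pip_url.partition("+")
    if not sep:
        raise ValueError(f"not a VCS URL: {pip_url!r}")
    parts = urlsplit(plain_url)
    # Only look for "@rev" in the path, so "user@host" is left alone.
    path, at, revision = parts.path.rpartition("@")
    if not at:
        path, revision = parts.path, None
    vcs_info = {"vcs": vcs, "commit_id": commit_id}
    if revision:
        vcs_info["requested_revision"] = revision
    return {"url": urlunsplit(parts._replace(path=path)), "vcs_info": vcs_info}
```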

Also, why is filename necessary for VCS references? I would think it is not, since the checkout and build will typically occur in a temporary directory.

Reproducibility is a relative concept that depends on the boundaries users set around what they care about. Even with wheels, the user is still responsible for ensuring “reproducibility” of the wheels’ dependencies (starting with libc, the OS version, the Python patch version, etc.). So when allowing artifacts that need to be built, the boundary simply moves (compilers and build tools now also fall under the user’s responsibility).

In the end what you are locking in both cases is artifacts that you are going to download to install (and possibly build).

So to me, making this distinction between pin files and lock files is not necessary, and will be confusing for users. I would rather let installers and lockers determine their capabilities (i.e. whether to support building or not) or have options that prevent users from locking or installing source artifacts if they are concerned about running code during installation.

In the end, why would Python need different file formats for lock files and pin files? Are there other ecosystems that do that?

Yes, but where to stop is always going to be somewhat arbitrary (e.g. does the CPU maker have to be considered due to bugs at the silicon level?). In terms of locking we are talking about the Python packaging level.

That is another way to approach the problem. I think that comes down to how much you want to know upfront and how much effort you need to put in to answer certain questions about what would be installed. For instance, are people okay to find out at install time whether they can recreate an environment with the provided inputs because their installer doesn’t support something specified in the input file? Or is it okay to have to skim the file to find out whether there’s something like an sdist in the list of requirements which they can’t use? Or is the fact that the file is a lock file, and so communicates the expectations upfront, more what people want?

Or put another way, is it easier to explain to people (or enforce at the company level) what installer someone has to use or what file format is required?

I don’t think there’s a blatantly obvious answer and it’s all subjective.

I don’t think most ecosystems have the level of packaging flexibility that we have to contend with (let alone size and reach), so direct comparisons are a tough one to make here (e.g. how many have the back-end compilation flexibility we have to support such a wide range of FFI languages? Or do most only have a wheel-equivalent concept and an sdist build isn’t even a thing, or vice-versa?).

And none of this is a “need”, it’s at most a desire/perspective.