PEP 665, take 2 -- A file format to list Python dependencies for reproducibility of an application

I’ve not kept up with the pip resolve discussion; I’d just be leery of introducing a temporal variable into lockfile generation. That’s why I was assuming part of the lockfile generation might be to leverage the local site-packages. If I have a data scientist who comes up with some whiz-bang algorithm for some Hard Problem, I’d like to be able to reproduce his Python environment as closely as possible. To me, that’s best done by looking at what is actually installed in his environment, not what the resolver would choose if run fresh. In other scenarios a “clean” run of the resolver could be totally appropriate. To me, both are valid depending on the context.

1 Like

Unfortunately, it is impossible to generate a lock file solely from site-packages contents. A lock file requires artifact hashes, but those archives are not stored after installation (think of the Python tarball you download from python.org versus the Python you get after installation), and it’s impossible to rebuild the exact same archive from only what’s in site-packages. A lock file generation process can make use of what’s currently available in a site-packages directory, but it requires outside input to generate something useful.
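To make that concrete, the hash a locker records is computed over the archive’s bytes, which the installer discards after unpacking. A minimal sketch (the function name and wheel filename are made up for illustration):

```python
import hashlib
from pathlib import Path


def artifact_hash(archive: Path) -> str:
    """Return the sha256 digest a locker would record for a downloaded artifact.

    This only works while the original archive (e.g. the wheel) still
    exists; the unpacked site-packages contents are not enough to
    recompute it, which is why lock file generation needs outside input.
    """
    return hashlib.sha256(archive.read_bytes()).hexdigest()
```

Once the archive is gone, nothing in site-packages lets you reproduce this digest.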


I’m starting to think maybe we should just make filename a required field. The semantic information this field holds is always needed by the locker (since hashes is required, a locker must somehow have access to a file to calculate a hash), and it’s probably easier for everyone if we just require the locker to always record filename, even if it means some minor duplication when the same information is also available in url.

1 Like

I was actually thinking yesterday if url wasn’t specified that we should probably require filename, but I’m also fine with just always requiring it regardless for simplicity.

Update incoming…

1 Like

This scenario may require the data scientist to use a tool which handles both installing and recording what was installed in a single command.

2 Likes

And landed in PEP 665: make `filename` required · python/peps@7d90a86 · GitHub .

2 Likes

To help us drive towards completion on this PEP so we can request a pronouncement, here are the currently open issues.

This is being discussed in Supporting sdists and source trees in PEP 665 .

With my “Python extension for VS Code” hat on, while I can add support for installing from a lock file, how do I help update or generate one? Specifically, how do I know when someone adds a new dependency that they want recorded? Since there is no standardized input file for a lock file, I really don’t have a way to do that. And so while users ask me for a way to install a package into their environment, how do I do that while making sure they do the right thing and record that installation as well?

So this open issue is asking whether there should be something about this in the PEP? For instance, should this PEP update PEP 621 to say you can have a [project] table with just dependencies and a lock file MAY/SHOULD/MUST be derived from that?
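For illustration, such a dependencies-only input might look like the fragment below (the package names are made up, and whether a `[project]` table may omit the otherwise-required fields like `name` is exactly the open question):

```toml
[project]
dependencies = [
    "spam>=1.0",
    "eggs",
]
```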

Platform compatibility tags - Python Packaging User Guide does not define what a “best-fitting wheel file” is. Unfortunately this is necessary if we are going to allow the lock file to create a dependency graph that doesn’t fully eliminate possibilities for a package down to a single option on any platform.

As such, I see three options:

  1. Force lockers to construct a dependency graph that leads to only a single wheel file on any resolution of the graph.
  2. Define what a “best-fitting wheel file” is.
  3. Have a required rank key which the locker uses to record the assumption it made, when generating the lock file, as to which wheel file is expected to be used when multiple options are available.
1 Like

I happened to have a brief conversation with some Julia folks during PackagingCon about how they select binaries, and they said they currently do it in an “arbitrary but deterministic” way, i.e. more or less the same as what we have now, and they don’t really see a problem with that. So I’m now kind of wondering whether this level of specification is strictly necessary; we may be able to get away with requiring tools to do it predictably (and most of them would probably rely on packaging anyway).

2 Likes

That’s how I am leaning as well: say that this should be done in a deterministic manner, and note that the expectation (not a requirement) is that all tools will end up using packaging.tags and picking the first matching tag from the list of compatible tags it provides.
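A stdlib-only sketch of that deterministic rule (real tools would get the ordered tag list from packaging.tags rather than hard-coding it; the tags, filenames, and helper name here are made up for illustration):

```python
def pick_wheel(compatible_tags: list[str], wheels: dict[str, str]) -> str:
    """Pick the wheel whose tag appears earliest in the environment's
    priority-ordered list of compatible tags.

    ``compatible_tags`` stands in for what packaging.tags would yield;
    ``wheels`` maps a wheel filename to its (single) compressed tag.
    """
    rank = {tag: i for i, tag in enumerate(compatible_tags)}
    return min(
        (name for name, tag in wheels.items() if tag in rank),
        key=lambda name: rank[wheels[name]],
    )


# Environment-specific wheels outrank the pure-Python fallback because
# they appear earlier in the compatible-tags list.
tags = ["cp39-cp39-manylinux1_x86_64", "cp39-abi3-manylinux1_x86_64", "py3-none-any"]
wheels = {
    "spam-1.0-py3-none-any.whl": "py3-none-any",
    "spam-1.0-cp39-cp39-manylinux1_x86_64.whl": "cp39-cp39-manylinux1_x86_64",
}
print(pick_wheel(tags, wheels))  # → spam-1.0-cp39-cp39-manylinux1_x86_64.whl
```

Because the tag list’s order is fixed for a given environment, every tool applying this rule selects the same wheel.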

1 Like

A few more questions about markers and tags.

Something itches me with the idea of using the filename or URL to determine if a package entry is applicable. It kind of closes the door to sdists (for which the filename does not include compatibility tags) and VCS references (for which there is no filename at all).

Since there is the notion of marker and tag in the global metadata section, would it make sense to have these fields at the package level too?

Related to this, a question about requires-python (which is present in both the global metadata section and at the package level). Since the required Python version is also expressible as a PEP 508 marker (where it is named python_version), could it be considered redundant with the marker metadata field? Or could a marker field at the package level replace it while providing additional flexibility at the same time?

Finally, is there a particular reason for using [[package._name_._version_]] instead of [[package._name_]] with version as a field? Since installers do not need to consider the version when determining which package to install, the additional nesting may not be necessary. I actually think this makes the format harder to understand: not being a TOML expert, I had to actually load the TOML example and print it as a dict to be sure what it meant.

1 Like

It sounds reasonable to use PEP 621 project.dependencies as input.

Why change PEP 621 at this stage, though?

By the way, is it correct to say that PEP 621 project.dependencies would be copied verbatim to PEP 665 metadata.requires? Or would there be other considerations, such as things that can be expressed in PEP 621 dependencies that would not be suitable for PEP 665 requires?

1 Like

I don’t quite follow. Sources, as they are defined now, should always work (assuming source compatibility, which is covered by requires-python), do they not? So if they are requested, they should always be installed; why do we need markers and tags?

Theoretically yes, but in reality, based on feedback from pipenv (i.e. my personal experience) and pdm, converting Requires-Python to python_version markers is extremely messy, and I don’t want anyone (and especially myself) to need to do that.

Oh but they may need to. This is a legitimate feature request from tool authors because people ask very strongly for this.

1 Like

One might imagine, say, to lock one commit for linux and another one for windows, because each needs different patches.

Ah, OK. But even then, what are the benefits of keying by version? And what would be the drawback of having a flat list per package?

1 Like

This is why supporting sdists and source trees is a separate concern that may end up with its own PEP (as well as why the file format is versioned); as of right now there’s no reason to layer on extra things to tell something is a wheel. We don’t need to prematurely optimize for something that isn’t in the PEP yet and for which adding support in the future will definitely require additional support by all tools involved.

Organizational. If your lock file pulls in 2 or 3 different versions of a package to meet various platform requirements, do you want to have to scan the body of each array entry to see which version it is, or would you rather have it clear from the section? Put another way, semantically the various files for the same version are related, but files from different versions have no relation to each other beyond the fact that the same project released them. Grouping by release is (at least) how I view things on PyPI, and a release is what you typically lock by, not the overall project.

1 Like

Well, that’s enough co-authors to accept that as the solution. :smiley:

Because there are a bunch of fields that PEP 621 mandates due to core metadata that an app simply doesn’t need to care about (e.g. you don’t really need to give your app a name).

It’s up to the locker to decide what to put into metadata.requires. This is on purpose so the resulting dependency graph can be whatever it needs to be to resolve appropriately.

1 Like

Updated in the PEP in PEP 665: close out the open issue about "best-fitting wheel" · python/peps@95fe2fc · GitHub .

1 Like

IMO this needs an entire PEP to discuss. It is not unreasonable to reuse project.dependencies as input, but there are still a bunch of unanswered questions. The input, for example, must allow the user to specify an alternative index from which to fetch packages (and allow index fallback, etc.), and once you do that you’ll also need to discuss how the format can be successfully taught with minimal confusion, given that the index specification is only effective if you use pyproject.toml directly, not when you package it into a distribution (i.e. you can’t specify indexes, publish the package to PyPI, and expect pip to fetch the package’s dependencies from your own index instead of PyPI). This is even arguably enough rationale against reusing project.dependencies as input to the lock file.

2 Likes

I figured this is where this is going to end up, but I figured I would at least ask to see if there was a chance people actually agreed on this. :wink:

Interesting difference in opinion as my brain always thinks of that as something to specify on the command-line rather than in the configuration file. But I do know that requirements files support this, so maybe its usefulness is broader than I realize/think.

OK, I will remove this as an open issue, leaving us just with the sdist discussion to resolve before the PEP is ready for pronouncement!

2 Likes

There’s a partial PEP in progress somewhere to define a shareable format for this kind of configuration.

I don’t think this necessarily has to be locked, though it’s definitely convenient to have the source listed provided installers allow it to be overridden (e.g. I should be able to use an internal PyPI mirror/cache rather than being forced to go via the internet).

1 Like

I just added the following to try and make it as clear as possible that the PEP is flexible around anything it doesn’t specify on purpose:


As Flexible as Possible

Realizing that workflows vary greatly between companies, projects, and
even people, this PEP tries to be strict where it’s important and
undefined/flexible everywhere else. As such, this PEP is strict where
it is important for reproducibility and compatibility, but does not
specify any restrictions in other situations not covered by this PEP;
if the PEP does not specifically state something then it is assumed to
be up to the locker or installer to decide what is best.

2 Likes

Note that I specifically said “input” without specifying where the input should come from :wink:

The input can come from the lock file used for installation, but it can also come from additional user inputs (e.g. command line options, configuration files, or environment variables) specified to the locker and installer, or even from information that is only available in the original application manifest and never locked.

Also note that which indexes were used to generate the lock file is inherently not meaningful knowledge to the installer, since the lock file already provides enough information for the installer to find the exact artifacts without index knowledge. So the installer only needs to allow two use cases:

  1. No index override, where the installer simply uses the artifact URLs provided by the lock file.
  2. An explicit index override, where the installer ignores all the artifact URLs and finds artifacts based on versions, filenames, hashes, etc. instead.
3 Likes