PEP 665, take 2 -- A file format to list Python dependencies for reproducibility of an application

pradyunsg · November 9, 2021, 6:11am

Works for me too!

These should only be alternative schemes, so… I don’t think we wanna make this as generic as uri. Keeping this as url is what I’d lean toward.

brettcannon · November 9, 2021, 7:13pm

Just updated the PEP to say an installer may support whatever schemes it wants for the URL.

pf_moore · November 9, 2021, 9:19pm

Thinking about this some more, the PEP talks about how to install what’s specified in the lock file, but it doesn’t say anything about the environment it’s being installed into. My assumptions are (roughly):

The expected (but not required by the PEP) use case is that the lockfile is installed into an otherwise-empty environment. (The obvious exception is that the installer is likely to be installed in the environment). I assume installers are allowed to only support this scenario, and reject attempts to install into an already-populated environment, if that’s all they want to handle.
Because the lockfile specifies exact packages and versions, any incompatibilities with already-installed packages can’t be fixed. I assume that the installer should abort the install if there’s an incompatibility (naive installers could blindly install, but that would create a broken environment, so I assume that is not recommended).
What about reinstalling a lockfile? Should installers have a mode where they uninstall all packages that are specified in the lockfile if the currently-installed version is different from the version specified in the lockfile? I assume this is not required, but installers have the option to have such a mode (and maybe an “uninstall all packages named in this lockfile” mode).
What if the target environment contains A 1.0 which depends on B==1.0, which is already installed. And the lockfile specifies B 2.0, but doesn’t mention A. Should installing the lockfile upgrade B, creating an inconsistent environment? Or fail, to avoid the inconsistency? My assumption here is that users shouldn’t do that, and installers are allowed to make a mess of the environment if they do (because it’s a hard problem - see below - and I don’t think we should force installers to be that clever).

This is another round of the “how intelligent must the installer be?” question. And to an extent, it’s 100% fair to say “it’s up to the installer”. But I think the PEP should state clearly what the minimum level of behaviour is that users can rely on. Even if that’s just the bare minimum “installers must do the right thing when installing a lockfile into an empty environment, but if there are already packages other than the installer in the environment, behaviour is at the option of the installer”.

For context, pip doesn’t fully handle this sort of situation for “normal” installs (see this issue for an extended discussion of the complexities) so I don’t think there are any simple answers here. Pip has pip check to detect broken environments after the fact, precisely because we can’t catch all problems at install time. I would imagine installing lockfiles into already-populated environments would be a good example of a case where pip check would be advisable after the install…
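As a rough illustration of the A/B scenario above, here is a minimal sketch of the kind of check an installer could run before touching a populated environment. It is deliberately simplified: the `Installed` model, the `find_conflicts` name, and the restriction to exact `==` pins are all assumptions made for this example, not anything the PEP or pip defines.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: each installed package maps to its version
# and to the exact pins ("==") it declares on other packages.
Installed = Dict[str, Tuple[str, Dict[str, str]]]


def find_conflicts(installed: Installed, locked: Dict[str, str]) -> List[str]:
    """Return messages for installed pins that the lock file would break."""
    conflicts = []
    for name, (version, pins) in installed.items():
        if name in locked:
            # The lock file replaces this package, so its old pins no longer apply.
            continue
        for dep, pinned in pins.items():
            if dep in locked and locked[dep] != pinned:
                conflicts.append(
                    f"{name} {version} requires {dep}=={pinned}, "
                    f"but the lock file specifies {dep} {locked[dep]}"
                )
    return conflicts


# A 1.0 depends on B==1.0; the lock file wants B 2.0 and does not mention A.
installed = {"A": ("1.0", {"B": "1.0"}), "B": ("1.0", {})}
locked = {"B": "2.0"}
for message in find_conflicts(installed, locked):
    print(message)
```

A real check would have to evaluate full version specifiers, extras and markers, which is where the complexity mentioned above comes from.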

brettcannon · November 10, 2021, 12:26am

Updated to say installer must support installing into an empty environment, but installing into an environment with packages already there is optional.
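For an installer that only wants to support the required case, a minimal sketch of an "is this environment empty?" check could look like the following. The function name `unexpected_packages` and the default allow-list containing only `pip` are assumptions for illustration; the PEP does not prescribe how emptiness is detected.

```python
import importlib.metadata
import re
from typing import AbstractSet, Iterable, List


def _normalize(name: str) -> str:
    """Normalize a project name so that e.g. Foo_Bar and foo-bar compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unexpected_packages(
    site_dirs: Iterable[str],
    allowed: AbstractSet[str] = frozenset({"pip"}),  # assumed: the installer itself
) -> List[str]:
    """List installed distributions in site_dirs that are not in the allow-list."""
    found = set()
    for dist in importlib.metadata.distributions(path=list(site_dirs)):
        name = dist.metadata["Name"]
        if name and _normalize(name) not in allowed:
            found.add(_normalize(name))
    return sorted(found)
```

An installer could call this with the target environment's site-packages directory and refuse to proceed if the returned list is non-empty.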

dhashby · November 10, 2021, 10:16pm

I’d like to caution against the inclusion of the package._name_._version_.url field in this PEP, as it (inappropriately IMO) couples the “what” and the “where from”.

For context, I routinely work in high security environments that are prohibited (by both policy and network configuration) from pulling “directly” from pypi, instead having to pull from locally-maintained mirrors. Having this field in the spec creates possible confusion from both the perspective of the locker and the installer. From the locker perspective, it opens the door for possible “leakage” of internal mirror endpoints that really should not be exposed outside the organization boundaries (even if the package is intended to be shared outside the organization.) A hash should be more than sufficient to validate that what’s being installed is what is intended.

If you keep the field in the spec, I’d encourage specifying that installers SHOULD NOT use the URL value if local configurations specify a non-default pypi endpoint.

brettcannon · November 10, 2021, 10:59pm

I also agree with this sentiment. If people requesting sdist support cannot think of something that isn’t transformative to the PEP then I say we punt on it to another PEP (the file format is versioned for this exact sort of reason). As such, I’m going to be specifically selective about what I reply to below to purposefully try to cut short additive discussions here and push to have them in another topic (sorry, @stewartmiles ; still happy to answer all of your questions elsewhere).

The coupling is loose; installers are explicitly allowed by the PEP to ignore the url key if they choose to.

This is why the hash is required and the URL is optional. So you could have your locker leave out URLs entirely or you could post-process your lock file and strip out the URLs (although I would advise filling in the filename key if you do).
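The post-processing step mentioned here could be sketched roughly as below. It assumes the lock file has already been parsed into a dict with the `package.<name>.<version>` layout (each version mapping to a list of entry tables); the helper name `strip_urls` is made up for this example.

```python
import posixpath
from typing import Any, Dict
from urllib.parse import unquote, urlsplit


def strip_urls(lock: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a parsed lock file with every url key removed.

    Assumed layout: lock["package"][name][version] is a list of entry dicts.
    If an entry has a url but no filename, the filename is filled in from
    the last path segment of the URL before the URL is dropped.
    """
    stripped: Dict[str, Any] = {}
    for name, versions in lock.get("package", {}).items():
        stripped[name] = {}
        for version, entries in versions.items():
            new_entries = []
            for entry in entries:
                entry = dict(entry)  # do not mutate the caller's data
                url = entry.pop("url", None)
                if url and "filename" not in entry:
                    path = urlsplit(url).path
                    entry["filename"] = unquote(posixpath.basename(path))
                new_entries.append(entry)
            stripped[name][version] = new_entries
    result = dict(lock)
    result["package"] = stripped
    return result
```

The hashes are left untouched, so an installer can still verify whatever file it obtains from a mirror.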

The PEP says you can use whatever mechanism you want to get a file as long as the file you end up with matches the hash.
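In code, "any mechanism, as long as the hash matches" might look like the sketch below. It assumes an entry's hashes are a mapping from a `hashlib` algorithm name to a hex digest; the function name and the policy of requiring at least one checkable algorithm are choices made for this example.

```python
import hashlib
from typing import Dict


def matches_hashes(data: bytes, hashes: Dict[str, str]) -> bool:
    """Check downloaded bytes against the hashes recorded for a lock entry.

    Every algorithm this Python supports must match, and at least one
    algorithm must have been checked for the file to be accepted.
    """
    checked = False
    for algorithm, expected in hashes.items():
        if algorithm not in hashlib.algorithms_available:
            continue  # unknown algorithm: cannot verify, so skip it
        if hashlib.new(algorithm, data).hexdigest() != expected.lower():
            return False
        checked = True
    return checked
```

Where the bytes came from (the `url` hint, a private mirror, a local cache) does not enter into the check at all.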

Consider the URL a hint on where to get the file. We purposefully put the url key there so that if you want to download from a known location (e.g. PyPI), you don’t have to use the simple index to do extra network fetches and file parsing to figure out what the actual URL is to download from.

Since the PEP explicitly says you don’t need to honour the URL I don’t think there’s anything to really add (unless I’m misunderstanding your concern).

brettcannon · November 10, 2021, 11:19pm

I have moved the sdist discussion to Supporting sdists and source trees in PEP 665 where I will more fully reply to @stewartmiles .

dhashby · November 10, 2021, 11:29pm

I work almost exclusively with end-users of python - analysts who happen to be using python to do their job, not folks building python tooling. I’m not expecting to have “my locker”…I expect we’ll be trying out whatever community solutions come along that implement the spec. I’m just trying to preempt future foot-guns that could develop by tool devs making inappropriate assumptions about what their users have available. Having private mirrors is an increasingly common use-case given all the software supply chain hoopla, so I’d hope that use-case would be readily addressed by any related tooling.

Agree that the filename key is needed in any event.

I did poke around in dist-info a bit to see if the “source URL” was currently captured someplace…I may have missed it, but didn’t find it in any of the files. I guess you’re being intentionally vague on how the locker gets its data, but unless I’m missing something the hashes and filenames could be gathered locally at any time, while the URL would need to be captured at install-time. But I guess that’s a problem for whoever’s writing the locker to figure out.

I’m definitely in favor of having better mechanisms to tighten up the package supply chain, just leery of creating side effects that drive my users to have weird workarounds (or worse, disabling the security features altogether).

pf_moore · November 10, 2021, 11:43pm

I think it would be useful to hear from people who would be developing lockers as to whether this is likely to be an issue.

Assuming pip ends up gaining “locker” functionality in the form of a “pip resolve” command that wrote a lockfile from a set of requirements, we’d be able to get the source URL easily, as it would be wherever the finder located the wheel while resolving the requirements.

You seem to be thinking more of a locker that reads a set of installed files and creates a lockfile from them. I don’t know if anyone is planning on writing something like that, but yes I imagine there would be problems to work around with that sort of approach.

dhashby · November 11, 2021, 12:13am

I’ve not kept up with the pip resolve discussion, just would be leery of introducing a temporal variable into the lockfile generation…that’s why I was assuming part of the lockfile generation might be to leverage the local site-packages. If I have a data scientist that comes up with some whiz-bang algorithm for some Hard Problem, I’d like to be able to reproduce his python environment as closely as possible. To me, that’s best done by looking at what is actually installed in his environment and not what the resolver would choose if run fresh. In other scenarios a “clean” run of the resolver could be totally appropriate. To me, both are valid depending on the context.

uranusjr · November 11, 2021, 4:42am

Unfortunately, it is impossible to generate a lock file solely from site-packages contents. A lock file requires artifact hashes, but those archives are not stored after installation (think the Python tarball you download from python.org, and the resulting Python you get after installation), and it’s impossible to rebuild the exact same archive only from things in site-packages. A lock file generation process can make use of what’s currently available in a site-packages directory, but must require outside input to generate something useful.

I’m starting to think maybe we should just make filename a required field. The semantic information this field holds is always needed by the locker (since hashes is required and a locker must somehow have access to a file to calculate a hash), and it’s probably easier for everyone if we just require the locker to always record filename even if it means some minor duplication when the same information is also available in url.

brettcannon · November 12, 2021, 3:26am

I was actually thinking yesterday if url wasn’t specified that we should probably require filename, but I’m also fine with just always requiring it regardless for simplicity.

Update incoming…

brettcannon · November 12, 2021, 3:33am

This scenario may require the data scientist to use a tool which handles both installing and recording what was installed in a single command.

brettcannon · November 12, 2021, 3:37am

And landed in PEP 665: make `filename` required · python/peps@7d90a86 · GitHub .

brettcannon · November 12, 2021, 3:47am

To help us drive towards completion on this PEP so we can request a pronouncement, here are the currently open issues.

This is being discussed in Supporting sdists and source trees in PEP 665 .

With my “Python extension for VS Code” hat on, while I can add support for installing from a lock file, how do I help update or generate one? Specifically, how do I know when someone adds a new dependency that they want recorded? Since there is no standardized input file for a lock file, I really don’t have a way to do that. And so while users ask me for a way to install a package into their environment, how do I do that and make sure they do the right thing and write down that installation as well?

So this open issue is asking whether there should be something about this in the PEP? For instance, should this PEP update PEP 621 to say you can have a [project] table with just dependencies and a lock file MAY/SHOULD/MUST be derived from that?

Platform compatibility tags - Python Packaging User Guide does not define what a “best-fitting wheel file” is. Unfortunately this is necessary if we are going to allow the lock file to create a dependency graph that doesn’t fully eliminate possibilities for a package down to a single option on any platform.

As such, I see three options:

Force lockers to construct a dependency graph that leads to only a single wheel file on any resolution of the graph.
Define what a “best-fitting wheel file” is.
Having a required rank key which the locker uses to specify the assumption it had when generating the lock file as to what wheel file is expected to be used when multiple options are available.

uranusjr · November 12, 2021, 7:20am

I happened to have had a brief conversation with some Julia folks during PackagingCon on how they select binaries, and they said they currently do it in an “arbitrary but deterministic” way i.e. more or less the same as what we have now and don’t really see a problem with that. So I’m now kind of wondering maybe this level of specification is not strictly necessary, and we may be able to get away by requiring tools to do it predictably (and most of them would probably rely on packaging anyway).

pradyunsg · November 12, 2021, 7:45am

That’s how I am leaning as well - say that this should be done in a deterministic manner, and note that the expectation (not requirement) is that all the tools will end up using packaging.tags and pick the first matching tag from the list of compatible tags it provides.
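A minimal sketch of "pick the first matching tag" follows. To stay within the standard library it takes the ordered list of supported tags as plain strings; in practice that list would presumably come from `packaging.tags.sys_tags()`, which yields tags from most to least preferred. The helper names and the alphabetical tie-break are assumptions for this example.

```python
from typing import Optional, Sequence, Set


def wheel_tags(filename: str) -> Set[str]:
    """Expand the compressed tag set in a wheel filename into individual tags."""
    if not filename.endswith(".whl"):
        return set()
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are python tag, abi tag, platform tag.
    python, abi, platform = stem.split("-")[-3:]
    return {
        f"{py}-{a}-{plat}"
        for py in python.split(".")
        for a in abi.split(".")
        for plat in platform.split(".")
    }


def pick_wheel(filenames: Sequence[str], supported: Sequence[str]) -> Optional[str]:
    """Return the wheel matching the earliest (most preferred) supported tag."""
    for tag in supported:
        for filename in sorted(filenames):  # sorted so ties break deterministically
            if tag in wheel_tags(filename):
                return filename
    return None


# Made-up inputs for illustration only.
supported = [
    "cp310-cp310-manylinux_2_17_x86_64",
    "cp310-abi3-manylinux_2_17_x86_64",
    "py3-none-any",
]
candidates = [
    "spam-1.0-py3-none-any.whl",
    "spam-1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]
print(pick_wheel(candidates, supported))
```

Because the outcome depends only on the order of `supported` and on the candidate filenames, two tools given the same inputs would make the same choice, which is the "arbitrary but deterministic" property discussed above.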

sbidoul · November 12, 2021, 11:03am

A few more questions about markers and tags.

Something itches me with the idea of using the filename or URL to determine if a package entry is applicable. It kind of closes the door to sdists (for which the filename does not include compatibility tags) and VCS references (for which there is no filename at all).

Since there is the notion of marker and tag in the global metadata section, would it make sense to have these fields at the package level too ?

Related to this, a question about requires-python (which is present in both the global metadata section and at the package level). Since the required python version is also expressible as a PEP 508 marker (where it is named python_version) could it be considered redundant with the marker metadata field ? Or could a marker field at the package level replace it while providing additional flexibility at the same time ?

Finally, is there a particular reason for using [[package._name_._version_]] instead of [[package._name_]] with version as a field ? Since installers do not need to consider the version when determining which package to install the additional nesting may not be necessary. I actually think this makes the format harder to understand - not being a toml expert, I had to actually load the toml example and print it as a dict to be sure what it meant.

sbidoul · November 12, 2021, 11:39am

It sounds reasonable to use PEP 621 project.dependencies as input.

Why change PEP 621 at this stage, though?

By the way, is it correct to say that PEP 621 project.dependencies would be copied verbatim to PEP 665 metadata.requires ? Or would there be other considerations such as things that can be expressed in PEP 621 dependencies that would not be suitable for PEP 665 requires ?

uranusjr · November 12, 2021, 12:48pm

I don’t quite follow. Sources, as they are defined now, should always work (assuming source compatibility, which is covered by requires-python), do they not? So if they are requested, they should always be installed; why do we need markers and tags?

Theoretically yes, but in reality, based on feedback from pipenv (i.e. my personal experience) and pdm, converting Requires-Python to python_version markers is extremely messy, and I don’t want anyone (and especially myself) to need to do that.
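One small piece of that messiness can be shown directly. This is offered as an illustration, not as the specific problems pipenv or pdm ran into: the PEP 508 `python_version` marker carries only the major and minor components, so a Requires-Python bound with a patch level cannot be expressed with it and needs `python_full_version` instead. The helper name below is made up for this example.

```python
def marker_python_version(full_version: str) -> str:
    """Mimic PEP 508's python_version: keep only the major.minor components."""
    return ".".join(full_version.split(".")[:2])


# A package declaring Requires-Python ">=3.6.1" accepts 3.6.1 but not 3.6.0,
# yet both interpreters report the same python_version marker value.
for full in ("3.6.0", "3.6.1"):
    print(full, "->", marker_python_version(full))
```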

Oh but they may need to. This is a legitimate feature request from tool authors because people ask very strongly for this.