PEP 665, take 2 -- A file format to list Python dependencies for reproducibility of an application

I misread the description of tag and naively assumed the locker would combine all the wheel tags from the package array into tag. Could you explain how this field might be used?

It’s like tags on a wheel, but the lock file is that wheel (and only has dependencies). If a platform matches the tags, the lock file can be installed on it, otherwise the installer should reject it.
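
A minimal sketch of that accept/reject check, assuming the lock file's tags and the platform's supported tags have both already been expanded into individual tag strings. The function name `lock_file_applies` is hypothetical and not from the PEP.

```python
def lock_file_applies(lock_tags, supported_tags):
    """Return True if at least one tag in the lock file is supported here."""
    # An installer would reject the lock file when this returns False.
    return not set(lock_tags).isdisjoint(supported_tags)


# Hypothetical tag values, for illustration only.
platform_tags = ["cp310-cp310-linux_x86_64", "py3-none-any"]
print(lock_file_applies(["py3-none-any"], platform_tags))
print(lock_file_applies(["cp39-cp39-win_amd64"], platform_tags))
```

In practice the supported tags could come from something like `packaging.tags.sys_tags()`, but that choice is an assumption here, not something the PEP mandates.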

Absolute paths are useful in build pipelines that place artifacts in a shared artifact directory.

I’m also OK with this. @pradyunsg , what do you think?

And if we change this, should we keep the key name url or change it to uri?

I just pushed a small update that lists Pyflow as supportive of the PEP.

Yeah, I agree with @pf_moore here: file:// AFAIK doesn’t support relative paths. E.g. the Wikipedia article on the file URI scheme calls out a couple of examples showing that the path after the host is an absolute path, not a relative one, on Unix / POSIX systems.
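
To illustrate the point, here is a small sketch using the standard library's `urllib.parse.urlsplit`. The file names are made up. An attempt to write a "relative" file:// URL ends up with its first path segment parsed as the host, and the remaining path is still absolute.

```python
from urllib.parse import urlsplit


def host_and_path(url):
    """Split a URL and return its (host, path) pair."""
    parts = urlsplit(url)
    return parts.netloc, parts.path


# Absolute path, empty host.
print(host_and_path("file:///shared/artifacts/demo-1.0-py3-none-any.whl"))
# Intended as relative "dist/...", but "dist" is parsed as the host.
print(host_and_path("file://dist/demo-1.0-py3-none-any.whl"))
```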

Separating out local paths for editable installs from the URI / URL would be great.

I updated the PEP to state lockers must respect SOURCE_DATE_EPOCH.
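
A rough sketch of what respecting SOURCE_DATE_EPOCH could look like inside a locker, assuming the locker writes a UTC timestamp somewhere in its output. The function name `lock_timestamp` and the fallback to the current time are assumptions for illustration.

```python
import os
import time
from datetime import datetime, timezone


def lock_timestamp(environ=None):
    """Return the timestamp a locker would record, honouring SOURCE_DATE_EPOCH."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SOURCE_DATE_EPOCH")
    # SOURCE_DATE_EPOCH is an integer number of seconds since the Unix epoch.
    seconds = int(raw) if raw is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(lock_timestamp({"SOURCE_DATE_EPOCH": "0"}).isoformat())
```

With the variable set, repeated runs produce the same timestamp, which is the property reproducible builds need.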

Thank you!

Works for me too!

These should only be alternative schemes, so… I don’t think we wanna make this as generic as uri. Keeping this as url is what I’d lean toward.

Just updated the PEP to say an installer may support whatever schemes it wants for the URL.

Thinking about this some more, the PEP talks about how to install what’s specified in the lock file, but it doesn’t say anything about the environment it’s being installed into. My assumptions are (roughly):

  1. The expected (but not required by the PEP) use case is that the lockfile is installed into an otherwise-empty environment. (The obvious exception is that the installer is likely to be installed in the environment). I assume installers are allowed to only support this scenario, and reject attempts to install into an already-populated environment, if that’s all they want to handle.
  2. Because the lockfile specifies exact packages and versions, any incompatibilities with already-installed packages can’t be fixed. I assume that the installer should abort the install if there’s an incompatibility (naive installers could blindly install, but that would create a broken environment, so I assume that is not recommended).
  3. What about reinstalling a lockfile? Should installers have a mode where they uninstall all packages that are specified in the lockfile if the currently-installed version is different from the version specified in the lockfile? I assume this is not required, but installers have the option to have such a mode (and maybe an “uninstall all packages named in this lockfile” mode).
  4. What if the target environment contains A 1.0 which depends on B==1.0, which is already installed. And the lockfile specifies B 2.0, but doesn’t mention A. Should installing the lockfile upgrade B, creating an inconsistent environment? Or fail, to avoid the inconsistency? My assumption here is that users shouldn’t do that, and installers are allowed to make a mess of the environment if they do (because it’s a hard problem - see below - and I don’t think we should force installers to be that clever).

This is another round of the “how intelligent must the installer be?” question. And to an extent, it’s 100% fair to say “it’s up to the installer”. But I think the PEP should state clearly what the minimum level of behaviour is that users can rely on. Even if that’s just the bare minimum “installers must do the right thing when installing a lockfile into an empty environment, but if there are already packages other than the installer in the environment, behaviour is at the option of the installer”.

For context, pip doesn’t fully handle this sort of situation for “normal” installs (see this issue for an extended discussion of the complexities) so I don’t think there are any simple answers here. Pip has pip check to detect broken environments after the fact, precisely because we can’t catch all problems at install time. I would imagine installing lockfiles into already-populated environments would be a good example of a case where pip check would be advisable after the install…

Updated to say installer must support installing into an empty environment, but installing into an environment with packages already there is optional.
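
For installers that only want to handle the empty-environment case, a check along these lines might be enough. This is a sketch, assuming the installer treats a small allow-list (here just pip) as "not counting"; the function names and the allow-list are hypothetical.

```python
import re
from importlib.metadata import distributions


def normalize(name):
    """Normalize a project name in the style of PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def unexpected_distributions(allowed=("pip",), path=None):
    """Return normalized names of installed distributions outside the allow-list."""
    allowed_names = {normalize(name) for name in allowed}
    # path=None searches sys.path; a list restricts the search to those directories.
    found = distributions() if path is None else distributions(path=path)
    names = set()
    for dist in found:
        name = dist.metadata.get("Name")
        if name and normalize(name) not in allowed_names:
            names.add(normalize(name))
    return names
```

An installer could refuse to proceed when the returned set is non-empty, or just warn and leave the decision to the user.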

I’d like to caution against the inclusion of the package._name_._version_.url field in this PEP, as it (inappropriately IMO) couples the “what” and the “where from”.

For context, I routinely work in high security environments that are prohibited (by both policy and network configuration) from pulling “directly” from pypi, instead having to pull from locally-maintained mirrors. Having this field in the spec creates possible confusion from both the perspective of the locker and the installer. From the locker perspective, it opens the door for possible “leakage” of internal mirror endpoints that really should not be exposed outside the organization boundaries (even if the package is intended to be shared outside the organization.) A hash should be more than sufficient to validate that what’s being installed is what is intended.

If you keep the field in the spec, I’d encourage specifying that installers SHOULD NOT use the URL value if local configurations specify a non-default pypi endpoint.

I also agree with this sentiment. If people requesting sdist support cannot think of something that isn’t transformative to the PEP, then I say we punt on it to another PEP (the file format is versioned for this exact sort of reason). As such, I’m going to be deliberately selective about what I reply to below, to cut short additive discussions here and push them to another topic (sorry, @stewartmiles ; still happy to answer all of your questions elsewhere).

The coupling is loose; installers are explicitly allowed by the PEP to ignore the url key if they choose to.

This is why the hash is required and the URL is optional. So you could have your locker leave out URLs entirely or you could post-process your lock file and strip out the URLs (although I would advise filling in the filename key if you do).

The PEP says you can use whatever mechanism you want to get a file as long as the file you end up with matches the hash.
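
As a sketch of what "matches the hash" means for an installer, regardless of whether the bytes came from PyPI, a mirror, or a local cache: the function name and the algorithm/hex-digest pair are assumptions about how an installer might hold the lock file's hash data.

```python
import hashlib


def matches_hash(data, algorithm, expected_hex):
    """Return True if the bytes hash to the expected hex digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.lower()


wheel_bytes = b"not a real wheel"  # stand-in for a downloaded file
expected = hashlib.sha256(wheel_bytes).hexdigest()
print(matches_hash(wheel_bytes, "sha256", expected))
print(matches_hash(b"tampered", "sha256", expected))
```

Where the file was fetched from never enters the check, which is why a mirror can stand in for the recorded URL.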

Consider the URL a hint on where to get the file. We purposefully put the url key there so that if you want to download from a known location (e.g. PyPI), you don’t have to use the simple index to do extra network fetches and file parsing to figure out what the actual URL is to download from.

Since the PEP explicitly says you don’t need to honour the URL I don’t think there’s anything to really add (unless I’m misunderstanding your concern).

I have moved the sdist discussion to Supporting sdists and source trees in PEP 665 where I will more fully reply to @stewartmiles .

I work almost exclusively with end-users of python - analysts who happen to be using python to do their job, not folks building python tooling. I’m not expecting to have “my locker”…I expect we’ll be trying out whatever community solutions come along that implement the spec. I’m just trying to preempt future foot-guns that could develop by tool devs making inappropriate assumptions about what their users have available. Having private mirrors is an increasingly common use-case given all the software supply chain hoopla, so I’d hope that use-case would be readily addressed by any related tooling.

Agree that the filename key is needed in any event.

I did poke around in dist-info a bit to see if the “source URL” was currently captured someplace…I may have missed it, but didn’t find it in any of the files. I guess you’re being intentionally vague on how the locker gets its data, but unless I’m missing something, the hashes and filenames could be gathered locally at any time, while the URL would need to be captured at install time. But I guess that’s a problem for whoever’s writing the locker to figure out.

I’m definitely in favor of having better mechanisms to tighten up the package supply chain, just leery of creating side effects that drive my users to have weird workarounds (or worse, disabling the security features altogether).

I think it would be useful to hear from people who would be developing lockers as to whether this is likely to be an issue.

Assuming pip ends up gaining “locker” functionality in the form of a “pip resolve” command that wrote a lockfile from a set of requirements, we’d be able to get the source URL easily, as it would be wherever the finder located the wheel while resolving the requirements.

You seem to be thinking more of a locker that reads a set of installed files and creates a lockfile from them. I don’t know if anyone is planning on writing something like that, but yes I imagine there would be problems to work around with that sort of approach.

I’ve not kept up with the pip resolve discussion, just would be leery of introducing a temporal variable into the lockfile generation…that’s why I was assuming part of the lockfile generation might be to leverage the local site-packages. If I have a data scientist that comes up with some whiz-bang algorithm for some Hard Problem, I’d like to be able to reproduce his python environment as closely as possible. To me, that’s best done by looking at what is actually installed in his environment and not what the resolver would choose if run fresh. In other scenarios a “clean” run of the resolver could be totally appropriate. To me, both are valid depending on the context.

Unfortunately, it is impossible to generate a lock file solely from site-packages contents. A lock file requires artifact hashes, but those archives are not stored after installation (think the Python tarball you download from python.org, and the resulting Python you get after installation), and it’s impossible to rebuild the exact same archive only from things in site-packages. A lock file generation process can make use of what’s currently available in a site-packages directory, but must require outside input to generate something useful.


I’m starting to think maybe we should just make filename a required field. The information this field holds is always available to the locker (since hashes is required and a locker must somehow have access to a file to calculate a hash), and it’s probably easier for everyone if we just require the locker to always record filename, even if it means some minor duplication when the same information is also available in url.
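
To make the duplication concrete, here is a sketch of what an installer has to do when filename is optional: fall back to the last path segment of url. The entry layout (a dict with optional "filename" and "url" keys) and the function name are assumptions for illustration, and the URL is made up.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def effective_filename(entry):
    """Return the recorded filename, or derive one from the URL if absent."""
    if "filename" in entry:
        return entry["filename"]
    if "url" in entry:
        # urlsplit drops any query string or fragment from the path.
        return posixpath.basename(unquote(urlsplit(entry["url"]).path))
    raise ValueError("entry has neither 'filename' nor 'url'")


entry = {"url": "https://files.example.invalid/packages/demo-1.0-py3-none-any.whl#sha256=abc"}
print(effective_filename(entry))
```

Always recording filename removes the fallback branch, and the error case with it.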

I was actually thinking yesterday that if url wasn’t specified we should probably require filename, but I’m also fine with just always requiring it regardless, for simplicity.

Update incoming…