PEP 751: one last time

Correct. It’s just to get a different URL for a file. All other requirements about a file (e.g. hashes, etc.) would still stand. So I would imagine installers would either find a URL already set for a file and it 404s or no URL specified, in which case they look for the URL via the index instead. At that point the flow for the file would be the same even if the URL recorded in the lock file worked (e.g. if hash checking failed then just stop, don’t go hunting for another location).
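To make the flow above concrete, here is a minimal sketch of that acquisition logic. Everything in it is an assumption for illustration: `acquire_file`, `HashMismatch`, and the two injected callables (`fetch`, which returns `None` for a 404, and `lookup_url_via_index`) are hypothetical names, not anything the PEP or any installer defines.

```python
import hashlib
from typing import Callable, Optional


class HashMismatch(Exception):
    """Raised when downloaded bytes do not match the locked hash."""


def acquire_file(
    locked_url: Optional[str],
    expected_sha256: str,
    fetch: Callable[[str], Optional[bytes]],
    lookup_url_via_index: Callable[[], Optional[str]],
) -> bytes:
    """Return the file's bytes, using the index only to find another URL.

    `fetch` is assumed to return None when the URL is missing (e.g. HTTP 404).
    """
    data = fetch(locked_url) if locked_url is not None else None
    if data is None:
        # No URL recorded, or it 404s: ask the index for a different URL.
        fallback_url = lookup_url_via_index()
        if fallback_url is None:
            raise LookupError("file not found at locked URL or via the index")
        data = fetch(fallback_url)
        if data is None:
            raise LookupError("index-provided URL did not yield a file")
    # Same check regardless of where the bytes came from; on failure, stop
    # rather than hunting for another location.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise HashMismatch("hash check failed; not searching other locations")
    return data
```

The point of the sketch is only that the fallback changes where the bytes come from, never what is accepted.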

Which is why it’s an open issue. :sweat_smile: As Charlie pointed out, it is more work for installers to support and I don’t know how critical it really is at the end of the day. I know @EpicWink would like this feature, but I don’t know how much farther this goes. How often will a URL for a file be an HTTP 404 and not an HTTP 307/308 redirect? At that point do we say the lock file is broken and people regenerate the lock file instead of having this feature in the PEP?

Does anyone know how stable package download URLs are considered to be on PyPI? When was the last time those URLs changed, and if they had redirects, how long were the redirects kept working?

Yeah, I don’t want to tackle that in the PEP. If requirements files don’t need it and it doesn’t make installs inherently more secure or the lock file easier to audit, I would rather keep that can of worms sealed. :wink:

Hence why I don’t want to try and pioneer how to record auth details.

I would expect to make it a “MAY” feature anyway, so it would have to be acceptable that not all tools support the fallback.

To me that says you don’t like the feature, in which case pip and uv aren’t fans while Poetry is neutral/fine with it. That pretty much kills it even if Frost and PDM said they are in favour as that doesn’t get us 2 tools who support the idea. So short of a groundswell of user support I don’t see this idea making it into the PEP.


Thanks Brett, here are my preferences for some topics:

  • Drop recording the package version: I am also against it
  • File size, and hashes: I strongly suggest that at least the hash be retained.
  • File URL or file name: They are useful in different scenarios. With the file URL recorded, you can confidently reproduce the whole environment with the same package files (with hash validation). However, in some cases users may need to switch to a mirror index which provides exactly the same set of package files with the same file hashes. Index URL plus file name is equivalent to the file URL, so either one should be used, not both.
  • Restrict the [tool] table to data that is disposable: this is also acceptable to me (and PDM).
  • Recording dependencies and/or dependents: I support recording only dependencies, because they are easier to retrieve than dependents, and tools can build a graph from that information to derive the reverse (the dependents).
  • Recording extras and groups: I have a question: since extras and groups serve similar functions in a lock file, should we distinguish between extras and groups or merge them into a single field?
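As a rough illustration of the dependencies point in the list above, a tool that only has recorded dependencies can invert them to obtain dependents. This is a sketch under assumed inputs: the mapping shape and the function name `dependents_from_dependencies` are hypothetical, not part of any lock file format.

```python
def dependents_from_dependencies(
    dependencies: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Invert a package -> dependencies mapping into package -> dependents."""
    reverse: dict[str, list[str]] = {name: [] for name in dependencies}
    for package, deps in dependencies.items():
        for dep in deps:
            # setdefault covers dependencies that have no entry of their own.
            reverse.setdefault(dep, []).append(package)
    return {name: sorted(parents) for name, parents in reverse.items()}


graph = {
    "flask": ["werkzeug", "jinja2"],
    "werkzeug": ["markupsafe"],
    "jinja2": ["markupsafe"],
    "markupsafe": [],
}
dependents = dependents_from_dependencies(graph)
```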

As the PEP is right now, if PyPI were to switch to a different domain (from files.pythonhosted.org) or URL path structure for files, all existing lock files would become unusable and would have to be recreated.

I think I agree with this, meaning the spec shouldn’t say anything about falling back to the simple repo API if the request to the provided URL in the lock file fails. It provides more confidence in the file as a totality.

That would mean the only option remaining is to make the file URL optional (although I suggest wording it as lockers SHOULD include it, to keep things secure by default, expecting lockers to optionally have a command-line option to drop the URLs).

Or even switch between mirrors, eg when switching between private index providers, or if different indexes are used in different environments.

This is not true for the general case, as the file download URL can be anything - even on PyPI, files are eg https://files.pythonhosted.org/packages/a6/ab/7e5f53c3b9d14972843a647d8d7a853969a58aecc7559cb3267302c94774/tzdata-2024.2-py2.py3-none-any.whl, not https://pypi.org/simple/tzdata/tzdata-2024.2-py2.py3-none-any.whl.

Perhaps I misunderstand, and you mean that you can use the simple repo API to get the file download URL given the index URL and the wheel filename (and computing the package name from the filename).
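A sketch of that reading, assuming the index serves the JSON form of the simple repository API (a top-level `files` list whose entries carry `filename` and `url`). The helper names are hypothetical, and no network request is made here; the project page is passed in as an already-decoded dict.

```python
import re
from typing import Optional


def project_name_from_wheel(filename: str) -> str:
    """Derive the normalized project name from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    # The distribution name is the first dash-separated component.
    distribution = filename.split("-", 1)[0]
    # Normalize as the simple repository API does for project URLs.
    return re.sub(r"[-_.]+", "-", distribution).lower()


def project_page_url(index_url: str, filename: str) -> str:
    """Build the project page URL for the file's project on a given index."""
    return f"{index_url.rstrip('/')}/{project_name_from_wheel(filename)}/"


def find_file_url(project_page: dict, filename: str) -> Optional[str]:
    """Look up the download URL for `filename` in a decoded project page."""
    for entry in project_page.get("files", []):
        if entry["filename"] == filename:
            return entry["url"]
    return None
```

The download URL comes from the index response, so it does not have to share a host or path with the index itself.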


How would you see this interacting with pip’s wheel cache? Would the cache be ignored, or would pip modify the cache to try to handle different files coming from different indices?

There are also tools like devpi (which is something I install on all my systems, because it means that you can work offline/with a flakey connection to PyPI), which as things currently stand would no longer work if the URL is always looked up.

I think if pip were to enforce URLs (which is not an unreasonable thing to do if they’re stored, though I think it makes the lock files less portable/reusable), then it would be nice to be able to have an allowlist of URLs to act as a safety feature to ensure that the URLs are the expected values (I’d use this to make sure all queries go through my local devpi instance, and I’d expect security package index server vendors would tell users to only allow the company index), and that the user would need to run some tool to fix them (so the hash wouldn’t change, just the URL).
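One way such an allowlist check could look, purely as a sketch: `is_allowed` and the devpi-style base URL are hypothetical, and pip has no such option today as far as this thread indicates. Scheme and host are compared exactly so that a lookalike host cannot pass a plain string-prefix test.

```python
from urllib.parse import urlsplit


def is_allowed(url: str, allowed_bases: list[str]) -> bool:
    """Return True if `url` lives under one of the allowed base URLs."""
    target = urlsplit(url)
    for base in allowed_bases:
        allowed = urlsplit(base)
        # Require a path-segment boundary, not just a string prefix.
        base_path = allowed.path.rstrip("/") + "/"
        same_origin = (target.scheme, target.netloc) == (allowed.scheme, allowed.netloc)
        if same_origin and target.path.startswith(base_path):
            return True
    return False


# Hypothetical local devpi base URL used only for illustration.
allowlist = ["http://localhost:3141/root/pypi"]
```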

I hadn’t even considered the wheel cache. It’s built based on the principle that any wheel with the same filename (name, version and tags) is functionally identical. That principle doesn’t really seem to apply for a lockfile (where an exact content match is required), so my instinct is that we’d have to bypass the wheel cache.

But this does bring up a more fundamental question. Is a lockfile based on the idea that the installer will fetch the exact same files as the locker used, and install from those? Or is it based on the idea that the installer can use any file that has the same hash value (i.e., content, as long as we ignore the risk of hash collision attacks) as the original?

In practical terms, there’s no difference, as hashes ensure that the files are identical anyway. But in terms of file acquisition, it’s important - if the specified URL fails, should the installer continue looking elsewhere for a file with the same hash? Is it OK to use a wheel (or sdist) cache to try to find a file with the same hash? Conversely, in terms of audit, is it OK to use a different URL if that means traceability is lost?

Personally, I don’t have an opinion on which is the right answer here. I’ve assumed the “fetch the exact same files” model, because that’s easier to reason about - you just grab a file from a URL and check it meets the conditions (hash, length, etc). But you could argue for either approach.

The problem with leaving this open to interpretation is that tools will be put in the position of having to make their own decision. Pip, for example, could get people arguing that we should use the cache for speed, but also that we shouldn’t use the cache because it’s not clear where the wheel in the cache came from. Both arguments are valid, and the pip maintainers don’t have the context to determine which is right. And if uv does the opposite, will people complain about inconsistency?

@brettcannon - maybe this should be another open question? Put me down as mildly in favour of the “fetch the exact same files” model, but I’m open to changing my mind.

:+1: So choose one, but not both in a single lock file.

So that puts uv and PDM (and @sethmlarson ) in favour of a disposable [tool] table and Poetry as neutral from an export format perspective. That seems like enough to go with the disposable approach!

@charliermarsh didn’t express an opinion, @radoering said they would put in [tool], so it’s kind of hazy as to whether to put this in or not (and @sethmlarson liked having it).

It first depends if we are even going to try and record them. :sweat_smile: But after that I think it’s a question of UX if you are handed a lock file and nothing else. Do people care about distinguishing between extras and dependency groups? Previous discussions suggested people did.

Correct.

Yes. With the index server’s URL you can get the project name from packages.name and tack that on. You can then make the request and then find the file you’re after.

I personally don’t see why the cache would care. You got the file you wanted and I believe pip hasn’t cared about the hash once the file is on disk when the requirements file contains hashes. I don’t see why this would be different.

Because you connect via IP address and it changes every time? So you could never rely on anything being written down in the PEP for where to find something because your IP constantly changes?

I think it depends on how thorough you want to be. If you wrote down the hash(es) that some wheel file has so you can validate before using it from a cache then you could do that to check if things match. But I’m also fine to make it explicit in the PEP that all the security mechanisms are about file acquisition and once you have it on disk then it’s up to your tool to decide if its cached version of the wheel file meets your needs or not (e.g. you can bypass the cache or use it based on name alone).

The latter for me (and if you include file size then hash collision attacks go way down). I think I brought this up way back when PEP 751 was first proposed, but I personally view the lock file as saying what bits you want to get installed into an environment. I think getting pedantic to the point that it has to come from a specific URL instead of relying on hashes and file size to make sure you’re getting what you want isn’t beneficial for the user. If I’m remembering correctly I considered the URL a hint of where to look, but I think you didn’t like the URL being viewed as that and wanted it to be canonical.
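Under this “content” view, acceptance depends only on the bytes. A minimal sketch, assuming the lock file supplies a size and a mapping of hash algorithm names to hex digests (the function name and parameter shapes are hypothetical):

```python
import hashlib


def content_matches(
    data: bytes, expected_size: int, expected_hashes: dict[str, str]
) -> bool:
    """Accept bytes from any location if size and every recorded hash match."""
    if not expected_hashes:
        # Without at least one hash there is nothing to verify against.
        return False
    if len(data) != expected_size:
        # Cheap check first; it also narrows the room for collision attacks.
        return False
    return all(
        hashlib.new(algorithm, data).hexdigest() == digest.lower()
        for algorithm, digest in expected_hashes.items()
    )
```

The same check applies whether the bytes came from the recorded URL, a mirror, or a local cache.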

I’m old enough to remember download mirrors and BitTorrent (for anyone unfamiliar with mirrors, think of them like CDNs where you had to manually choose which location to download from). In both situations you weren’t downloading a specific file from a specific place on the internet, you were downloading specific bits from wherever you could get them the fastest and you could verify you got what you expected by checking the hash and file size (if you weren’t so lazy as to skip that step which I know I almost always did :sweat_smile:). In that case you only cared about the content.

Sure, but I think we need to be very clear about how far we would be taking this. If we say the lock file fundamentally cares about recording what file contents are expected when downloading a file, then the URL and index are more for auditing purposes as well as a quick way to know where to consider looking for a file. But we could also say installers could in fact have users provide alternative locations to look as well if they so desired (e.g. an internal mirror or CDN). Think of the weekend hobbyist wanting to just grab the files from PyPI and the corporate user who has an internal mirror of PyPI, both wanting to use the same lock file. Do we tell e.g. the corporate user to recreate the lock file because the URL would be different, or just say to check the hashes and file sizes to make sure you’re getting the same thing as the weekend hobbyist from the internal mirror?

If we take a “fetch the exact same file” view and it’s all about the URL on the internet, then it has to be the exact URL or the index, but not both. And it also means users can’t use their own mirrors.

I’m personally for the “content” view over the “location” view. @charliermarsh @radoering @frostming do you have an opinion on this one? Would you ever support letting users download files from places other than what’s in a lock file as long as the hash and file size match?


For what it’s worth, we don’t record any credentials in uv.lock; they all get redacted (from index URLs, Git URLs, etc.), and we require that users provide them at install-time.

(This is about recording dependencies and dependents.) We do this in uv.lock since we record the entire graph. In the context of this spec, though, that’s less relevant. It is useful for things like uv tree where we show you why each package is included (like, the path of dependencies that led to it being required). I think it’s useful to be allowed to record it, but it doesn’t seem like a requirement.

By the way: are we referring to recording the requirements, or the resolved dependencies? Like flask>=1, or a “reference” to flask==1.0.5 in the lockfile?

We don’t support this, but I think the closest thing is this issue: Request for `uv.lock` to support different index urls across different developer machines and CI environments · Issue #6349 · astral-sh/uv · GitHub. The user says that in their setup, every developer has a different index URL, which is a proxy pointing to the registry. So they want to be able to “swap out” the index URL at install time. I think this is roughly what you’re describing? I had proposed adding some API on our end that lets users declare a URL as a proxy for another URL, like:

[[tool.uv.index]]
name = "private"
url = "https://private.org/simple"
proxy = "http://<omitted>/pypi/simple"

Then, at install time, we’d basically just replace references to https://private.org/simple with http://<omitted>/pypi/simple. But we haven’t implemented it yet.
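The substitution itself could be as small as the following sketch. It is not uv’s implementation (the post says this is not implemented); `apply_index_proxies` and the proxy host are hypothetical stand-ins, with the real proxy URL omitted in the original.

```python
def apply_index_proxies(url: str, proxies: dict[str, str]) -> str:
    """Rewrite `url` if it falls under an index that has a declared proxy."""
    for original, proxy in proxies.items():
        base = original.rstrip("/")
        # Match the index URL itself or anything below it, on a path boundary.
        if url == original or url == base or url.startswith(base + "/"):
            return proxy.rstrip("/") + url[len(base):]
    return url


# Hypothetical proxy host; the thread's example elides the real one.
proxies = {"https://private.org/simple": "http://proxy.internal.example/pypi/simple"}
rewritten = apply_index_proxies("https://private.org/simple/flask/", proxies)
```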

P.S. Separately, this is sort of pedantic, but is it actually a requirement in the Simple API that no two distributions have the same name? It sounds like a silly question, but I’ve genuinely considered having a Simple API return a concatenated list of distributions from two other registries, which would lead to all URLs being unique but some filenames overlapping. I don’t think that’s a spec violation but I’m guessing I’m wrong. (I think if we allowed users to provide just an index plus a filename, we’d effectively be encoding this requirement.)

I was thinking the latter and to include enough details to know which [[packages]] entry is a dependency.

I think what I’m proposing would allow what you’re talking about: the important thing is the file contents and not the final place you actually download the file from. So what’s in the lock file for where to download is a suggestion, but not a requirement, and can be thought of logging what URL and such was used.

I don’t think so, but there’s also nothing saying how to resolve conflicting entries either.

I don’t think it’s any more restricting than having to resolve on the same index returning multiple entries for the same file; which entry do you choose if they have differing details like hash or upload time which the spec doesn’t comment on either? And that’s assuming you don’t check the files to see what metadata they have in case that somehow differs. The spec doesn’t call any of this out so I think it’s a gap. It should probably either say, “only one entry per file name”, or, “you can have multiple entries, but they must all be for the same file contents and the only difference is upload time and URL; tools can choose whichever entry they want however they want”.
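The second wording (“multiple entries, same file contents”) is something a tool could check mechanically. A sketch, assuming entries shaped like the JSON simple API’s `files` items with a `sha256` hash present; the function name is hypothetical:

```python
from collections import defaultdict


def conflicting_filenames(files: list[dict]) -> dict[str, set[str]]:
    """Return filenames listed more than once with differing sha256 digests."""
    digests: dict[str, set[str]] = defaultdict(set)
    for entry in files:
        digests[entry["filename"]].add(entry["hashes"]["sha256"])
    # Repeated entries are tolerated only when every digest agrees.
    return {name: found for name, found in digests.items() if len(found) > 1}
```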

But if we allow for downloading from wherever then I will probably require a URL as it’s just data about where the file was found and not a requirement to use that URL.

First off, thank you so much Brett for all of your work on lock files, I really believe it to be one of the most valuable things to standardize in packaging today. I also think the design of the lock file format is excellent and very readable.

As to the open questions I have opinions on:

Drop requiring file sizes and hashes

I agree with everyone before saying the hash should be required. I think it would be good to suggest that tools should provide an escape hatch at install time, e.g. --ignore-hashes, to allow users with changing files to work around the hashes. That is a much cleaner solution in my mind.

Drop the requirement to specify the location of an sdist and/or wheels

I think it is fine to keep this as required if the below escape hatch is included.

Support installing files via a package index

I think this should be a fallback. It would solve the issue of unstable URLs (they would just get thrown away), and if we already check hash and size then I think there is little risk to this. I think if it is included it should be required.

Make packaging.wheels a table

Purely opinion but I think this doesn’t read as well, so I would keep the status quo.

List the requirement inputs for the file

I think this practically won’t be bullet proof, so I don’t think we should include it lest it bite users.

But it may help in auditing and any recreation of the file if the original requirements were somehow recorded. This could be a single string or an array of strings if multiple requirements were used with the file.

Since the file likely won’t record all the information that is fed into the resolver/locker (repo state, PEP 708 info, etc.), I don’t think this can always re-create the lock file, and I don’t want users to have the expectation it could. I think this is a great use for the tool area because the information feeding into the resolver (as well as the resolver behavior) is likely tool specific.

Including index-hosted attestations

I’m leaning towards supporting outlining how to translate the JSON to TOML.

I think this would be great for reviewing changes to a lock file.


Changing pace, I’m a bit concerned that index information for packages being optional could lead to a dependency confusion issue.

If a tool is locking a package that exists on PyPI and a third party index, if the locker omits the index, a user could end up with a different package from what is intended. I think the PEP should require that if a wheel or sdist comes from an index, that index is recorded to avoid dependency confusion attacks. I expect in practice all lockers will follow this behavior, but I think it would be good to enumerate it.

I realized that this is safe due to requiring hashes. Sorry for the noise :slight_smile:


I really hate to bikeshed (sorry!), and I wouldn’t bring this up if I didn’t think this was confusing, but I think metadata-version should be named something that doesn’t clash with other packaging standardized names. Perhaps meta-version, lock-version, or format-version?

Given your time constraints I would be fine if you flat out rejected this, but I thought I would bring it up since it briefly confused me.


Think of both wanting to contribute to the same project - in the sense of updating the lock file. Both will create semantically equivalent but still different lock files with a lot of noise due to the different URLs. I assume that is an issue we cannot resolve?

Currently, Poetry does not support installing from a different index[1] than the locked one, but there have already been requests from users to support this. I assume we would be open to something similar to what Charlie describes in PEP 751: one last time - #27 by charliermarsh but it has not been a priority yet.


  1. Actually, we are just locking the index URL and not the file URL. ↩︎

There’s unfortunately no good answer as we have no consistent naming on this.

But maybe the inconsistency is the answer and thus lock-version makes the most sense.

Not without some metadata in the file to restrict domains in future updates. I guess tools could have some ability to block this, or a linting tool to warn against it. Otherwise probably PR reviews are the best we’ve got.

I’m totally happy to say that using an index server if the URL doesn’t work is a “MAY” feature, as is any other way the installing tool can somehow find the file.


Thanks for all the work here. I’m quite excited about having a standardized lockfile for Python.

Even if there’s a static IP (such as 127.0.0.1), I imagine this is still useful in case you want to collaborate with other people who don’t have something like devpi installed locally. Being able to “swap out the index URL at install time” (as @charliermarsh mentioned above) seems like a good solution. I personally think it would be nice if this were standardized, but totally understand if this is something that’s tool specific for now.

If it’s useful to hear another opinion: I also think of a lockfile as recording specific blobs of content (as identified by their hash), along with “hints” about where to find that content. I think it’s important for tools consuming the lockfile to be free to look in other places for that content, but I’d expect those tools to enforce the hash requirement.

This really confused me for a while. Taken out of context, this sentence feels quite incorrect to me: I suppose it’s technically possible to install a package (i.e., unpack a wheel into a site-packages directory) without knowing its dependencies, but there’s not much useful I can do with that package if its dependencies aren’t installed as well.

I have since read PEP 751, and followed the discussion here, and now I think I know what this means. My rephrasing:

  • PEP 751 expects locking tools to record a full “closure” of package dependencies. Not just the immediate packages you depend on, but everything they depend on, and everything their dependencies depend on, etc (“all transitive dependencies”).
  • PEP 751 only cares about recording a “flat set” of locked packages, because that flat set is all you need to know to install a complete working environment for those packages. It does not care to record the “shape” of the dependencies between packages.
    • Knowing the shape is useful for auditing purposes (“draw me a graph of the dependencies so I can figure out why this was included”, or “why do I have this old version of package foo?”).
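To illustrate the “closure” versus “shape” distinction in the rephrasing above: the flat set an installer needs is just every package reachable from the top-level ones. A sketch with a hypothetical mapping and function name:

```python
def transitive_closure(
    roots: list[str], dependencies: dict[str, list[str]]
) -> set[str]:
    """Collect every package reachable from `roots` via dependencies."""
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # Already visited; also guards against cycles.
        seen.add(name)
        stack.extend(dependencies.get(name, []))
    return seen


shape = {
    "flask": ["werkzeug", "jinja2"],
    "werkzeug": ["markupsafe"],
    "jinja2": ["markupsafe"],
    "markupsafe": [],
}
flat_set = transitive_closure(["flask"], shape)
```

Installing needs only `flat_set`; answering “why is markupsafe here?” needs `shape`.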

Am I understanding this correctly? If I am, may I suggest rewording this a bit? If I am not, please help me understand things =)

nit: Is there a typo here? Assuming you meant to say “dependents”? This sentence also exists in the PEP, so probably worth fixing if it is a typo.


Yep!

Sure, although as you noticed, if you read the PEP the meaning is way clearer, and I want people reading the PEP before commenting.

Yes, thanks to copy-and-paste. :sweat_smile:

Only if people come out in support of the feature, otherwise it will get deleted and turned into a rejected idea, so the typo will get fixed somehow. :wink:


I’m fine with that! I think it matches the other names well:
Metadata-Version for metadata
Wheel-Version for wheels
api-version for the simple API
and now
lock[file?]-version for lockfiles!


I updated the PEP in PEP 751: close open issues (#4251) · python/peps@b55b27d · GitHub to close all the open issues. The outcome was …

Accepted

Rejected

Misc

  • I tightened the file name details by making the regex more restrictive by not allowing . in the optional name part and made the sample code more resilient.

I’m going to work on updating my PoC to match the PEP as I’m assuming there are no major changes to the PEP at this point (this is not a challenge to prove me wrong :wink:). Once the PoC is updated and assuming nothing comes up here I will then put the PEP forward for pronouncement.


It is very regrettable to see recording extras and groups entering the rejected ideas, because I didn’t see any strong objections against this; please point me to them if there are any.

If this is the case, PDM may not be able to use PEP 751 as the sole dependency lock format; otherwise we would have to put the information into the [tool] table, which violates the disposable rule.


I’d like to make a general point about naming here. This isn’t a comment on the PEP as such, and I don’t expect anything in the PEP to need changing as a result of this. It’s more of an implication of the fact that we are standardising the idea of a “lockfile” in Python, and what that means for tools.

Assuming that PEP 751 is approved, we will finally have a clear definition of what a “lockfile” in Python actually is. I fully expect that people will start referring to pyproject.lock as a “lockfile”, and tools like dependabot to gain support for “Python lockfiles”. This will be a huge improvement for the ecosystem, but it does mean that if tools retain their own lockfile formats for internal use[1], then continuing to call those files “lockfiles” will probably result in a certain amount of user confusion.

I don’t know how we deal with that. I think that PEP 751 is entitled to use the term “lockfile” - the discussions leading up to this point have been going on for so long that I think it can reasonably be assumed to have established a right to use that term. But equally, changing established terminology in a user tool isn’t practical. I think the best that we can do is to make an effort to qualify what type of lockfile is being referred to whenever possible. That’s going to be largely on tools, to be clear when they are talking about their own lockfile and not the standard form, so it’s not something a PEP can dictate (maybe the upcoming packaging council will be able to make statements on matters of terminology like this, but I don’t think it falls within the PEP process).

I’d say that the terms “standard lockfile” and “Python lockfile” (especially in a multi-language context) should be understood to mean a PEP 751 lockfile. Whenever the term “lockfile” is used unqualified, the default should be to assume “standard lockfile”. Tools will need to come up with their own terms, but “<tool> internal lockfile” (which is a bit verbose, so “<tool> lockfile” could be a common contraction) or “tool-specific lockfile” seems reasonable to me.

Tools should also be explicit when an operation they describe as “locking” produces a standard lockfile, and when it generates a tool-specific lockfile. The filename will act as an indication, as well - pyproject.lock is a standard lockfile, anything else is tool-specific.

I repeat - I don’t think any of this is a problem with the PEP (it might be worth a mention in “How to teach this”, although I’m happy if Brett doesn’t want another round of editing - I can simply mention it in my final decision on the PEP). It’s just food for thought for tool authors as we get closer to the end of this very long journey :slightly_smiling_face:


  1. which seems likely to be the case for uv, may be necessary for Poetry, and based on @frostming’s comment, will be true for PDM unless extras/groups are added ↩︎


I implemented a PoC pip lock command ([PoC] PEP 751 `pip lock` command by sbidoul · Pull Request #13213 · pypa/pip · GitHub) and wrote some notes on the PR. It is of course limited in scope, due to pip not being really capable of cross-platform resolution, but should otherwise be usable for single platform locking use cases.

It was quite easy to add on the basis of pip install --dry-run --ignore-installed.

One assumption I made is that vcs, directory, and archive are applicable only when direct is set, and that sdist and wheels are applicable only when direct is not set. Is that the correct interpretation? Is it worth mentioning in the PEP?
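That interpretation can be written down as a small check. This encodes only the assumption stated above, not confirmed PEP text; the key names come from the post, while `check_package_sources` and the dict shape of a package entry are hypothetical.

```python
DIRECT_ONLY = ("vcs", "directory", "archive")
INDEX_ONLY = ("sdist", "wheels")


def check_package_sources(package: dict) -> list[str]:
    """List keys that should not appear given the value of 'direct'."""
    direct = bool(package.get("direct", False))
    # Under the assumed reading, each group is valid for one case only.
    forbidden = INDEX_ONLY if direct else DIRECT_ONLY
    state = "set" if direct else "not set"
    return [
        f"'{key}' is not applicable when 'direct' is {state}"
        for key in forbidden
        if key in package
    ]
```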

Thanks Brett for this effort!


Here’s the discussion I started about that last quarter: Determining cross-platform resolution strategy · Issue #13111 · pypa/pip · GitHub

I would call uv.lock “the uv lockfile” and poetry.lock “the poetry lockfile”.
That suggests that “lockfile” is a generic term which the PEP should feel free to use. But we also need a more specific term for the times when we want to disambiguate.

My preference would be to say that this PEP standardizes “pylock files, a type of lockfile”.

I think that the disambiguation belongs in How to Teach This, and we just need to choose a name.

I hesitate about calling it “a standard lockfile” because I think that for users not on this forum, the implication will be unclear. “Is uv.lock using the standard? Is it a standard lockfile? I thought this tool supported the standard!” It’s easy to see how that terminology could be misread as descriptive, rather than disambiguating. I’d rather give the new thing a name, and let “lockfile” describe it.
