PEP 751: lock files (again)

File locks. I haven’t really come across anyone who does package locking like we have.

Give me a minute. :wink:

The problem is unless I create a tool that people are willing to use that implements this PEP then I won’t get that “real-world stress testing”. And the other issue is the pre-existing lock files weren’t proposed as a PEP, along with their creators admitting there were design choices they would re-evaluate.

As for examples, the code at GitHub - brettcannon/mousebender at pep should work for per-file locking if you want to try out other examples. Admittedly, I had to write my own resolver to make that PoC work, so I can’t promise it to be bug-free. :sweat_smile:

So are you hinting you want to propose it as a PEP once you have released it and gotten real-world experience with it?

It’s based on PDM which is based on Poetry, so I don’t think it’s that radical. Really the biggest difference is the flattening of the package/file list instead of following a graph so reading diffs is more straight-forward. And it has been changed based on feedback from e.g., Randy, so while it might be different it isn’t without input from pre-existing implementers.

I was actually planning on updating the PEP to say entries need to differ by name, version, and marker requirement to be as flexible as possible about selecting the right scenario.

The PEP says you have to choose one, but basically it’s there because previous feedback related to needing sdist support flowed into, “I need all possibilities supported as I don’t know what it will take for a user’s platform to get a working wheel”.

Change that to past tense and you’re right. :wink:

Give me a minute and I have an idea that may simplify things a bit.

I’m not sure what you mean by a “revision” compared to a “Git SHA”. What’s the difference?

And the PEP asks for the “commit ID” which “MUST be immutable”, which means Git SHA commit.

Yes.


So, I’ve been thinking about how to simplify per-file and per-package locking and melding them together more as well as supporting multiple groups of dependencies, i.e. locking for all extra combinations for a pyproject.toml in a single file (although some are asking for more files; I’ll never win).

An idea – and I stress this is an idea and I won’t move forward with it unless people show appropriate interest – is the following:

  • Drop [[file-locks]] and [[package-locks]] (and related other bits)
  • Introduce [[supported-environments]] to represent what environments the lock file explicitly supports
    • It’s optional, so that if you leave it out it’s best-effort support for your environment
    • Finding the appropriate thing to install follows per-package guidance in the PEP
    • This is effectively assuming per-file locking is simple enough to not need to be as much of the focus of the PEP and is essentially a simplified scenario of per-package locking
  • Drop dependencies and introduce [dependency-groups]
    • Much like PEP 735, table of named groups of dependencies
    • The default group is the empty string as it sorts to the top no matter what
    • project.dependencies plus project.optional-dependencies named with brackets, extra names are normalized, and extras are sorted, but otherwise list the complete list of dependency specifiers, effectively flattening the extras with the required runtime dependencies
    • If PEP 735 comes to fruition then they just slot right in based on the dependency group name
    • Can either follow Alyssa’s advice and normalize with markers by adding a group marker variable, or introduce a groups/part-of/applies-to/bikeshed-name-later array in [[packages]] of dependency group names if the package version applies to only a select set of groups
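To make the flattening in the [dependency-groups] bullets a bit more concrete, here is a rough Python sketch of how a locker might build the table. The function names and the exact key spelling for extras (sorted names, comma-separated, wrapped in brackets) are assumptions made for illustration; the idea above only says the groups are "named with brackets".

```python
import re
from itertools import combinations


def normalize_extra(name: str) -> str:
    """Normalize an extra name: collapse runs of -, _ and . to '-' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def build_dependency_groups(dependencies, optional_dependencies):
    """Flatten runtime dependencies plus every combination of extras into named groups."""
    extras = {normalize_extra(k): v for k, v in optional_dependencies.items()}
    # The default group is the empty string.
    groups = {"": sorted(dependencies)}
    names = sorted(extras)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            specifiers = set(dependencies)
            for extra in combo:
                specifiers.update(extras[extra])
            # Hypothetical key spelling, e.g. "[docs,test-tools]".
            groups["[" + ",".join(combo) + "]"] = sorted(specifiers)
    return groups


groups = build_dependency_groups(
    ["cattrs", "numpy"],
    {"Test_Tools": ["pytest"], "docs": ["sphinx"]},
)
for name, specifiers in groups.items():
    print(repr(name), specifiers)
```

Each group lists the complete set of dependency specifiers, so a reader never has to combine an extra with the default group by hand.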

And here’s the PEP example converted (do note you will need to scroll the text box to see the whole thing):

version = '1.0'
hash-algorithm = 'sha256'


[[supported-environments]]
name = 'CPython 3.12 on manylinux 2.17 x86-64'
marker = ""
wheel-tags = ['cp312-cp312-manylinux_2_17_x86_64', 'py3-none-any']

[[supported-environments]]
name = 'CPython 3.12 on Windows x64'
marker = ""
wheel-tags = ['cp312-cp312-win_amd64', 'py3-none-any']


[dependency-groups]
'' = ['cattrs', 'numpy']


[[packages]]
name = 'attrs'
version = '23.2.0'
multiple-entries = false
description = 'Classes Without Boilerplate'
requires-python = '>=3.7'
dependents = ['cattrs']
dependencies = []
direct = false
files = [
    {name = 'attrs-23.2.0-py3-none-any.whl', origin = 'https://files.pythonhosted.org/packages/e0/44/827b2a91a5816512fcaf3cc4ebc465ccd5d598c45cefa6703fcf4a79018f/attrs-23.2.0-py3-none-any.whl', hash = '99b87a485a5820b23b879f04c2305b44b951b502fd64be915879d77a7e8fc6f1'}
]

[[packages]]
name = 'cattrs'
version = '23.2.3'
multiple-entries = false
description = 'Composable complex class support for attrs and dataclasses.'
requires-python = '>=3.8'
dependents = []
dependencies = ['attrs']
direct = false
files = [
    {name = 'cattrs-23.2.3-py3-none-any.whl', origin = 'https://files.pythonhosted.org/packages/b3/0d/cd4a4071c7f38385dc5ba91286723b4d1090b87815db48216212c6c6c30e/cattrs-23.2.3-py3-none-any.whl', hash = '0341994d94971052e9ee70662542699a3162ea1e0c62f7ce1b4a57f563685108'}
]

[[packages]]
name = 'numpy'
version = '2.0.1'
multiple-entries = false
description = 'Fundamental package for array computing in Python'
requires-python = '>=3.9'
dependents = []
dependencies = []
direct = false
files = [
    {name = 'numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', origin = 'https://files.pythonhosted.org/packages/2c/f3/61eeef119beb37decb58e7cb29940f19a1464b8608f2cab8a8616aba75fd/numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', hash = '6790654cb13eab303d8402354fabd47472b24635700f631f041bd0b65e37298a'},
    {name = 'numpy-2.0.1-cp312-cp312-win_amd64.whl', origin = 'https://files.pythonhosted.org/packages/b5/59/f6ad30785a6578ad85ed9c2785f271b39c3e5b6412c66e810d2c60934c9f/numpy-2.0.1-cp312-cp312-win_amd64.whl', hash = 'bb2124fdc6e62baae159ebcfa368708867eb56806804d005860b6007388df171'}
]

Tomorrow is BC Day so I won’t be reading any responses, and I have stuff going on Tuesday and Wednesday, so there’s a chance I might not reply for a few days.

I will wait to see if this idea gets more or less support than the PEP as it stands. At that point I will take stock and decide what next steps are to unblock progress towards a conclusion (potentially with a poll).


Thanks Brett. I haven’t had a chance to read through your idea yet but I will when I can :slight_smile: Just responding to some other points…

In Cargo, they support “features” that can enable extra dependencies anywhere in the tree (like Python “extras”, but global), and they also support platform-conditional dependencies. It has some similarities to Package Locking. (There are also differences though: they read Cargo.toml alongside the Cargo.lock when building, and they also support installing multiple versions of a package at once, like Node.)

It hasn’t been an explicit goal of mine. Do you think I should be considering it? (Asking genuinely.)

I think including the marker requirement would unblock the scenarios I described above. It will make the lockfile a bit harder to validate though, since you now need a robust method to test for marker disjointness. (I think marker algebra in general is an underratedly hard problem… I wrote a whole post about it but decided not to flood the thread with it. As one example, testing disjointness requires supporting range-like semantic operators for version specifiers, so that you can tell whether two version specifiers might intersect.)

For example, if the user has a dependency on flask @ git+https://github.com/pallets/flask@3.0.3, it’s useful for us to track that the user requested 3.0.3, even though that isn’t the locked commit. In my previous post, 3.0.3 was the “requested revision”.

This then requires some explicit handling for what happens when a git version tag gets forcefully updated. This is not recommended and is against “git philosophy”, but unfortunately several projects do in fact do that, so it will be hit.

Yeah, if people are force-pushing, then it’s natural that they’re going to run into issues with a lockfile that resolves to precise commits. I think what I’m describing is fairly normal though. Imagine your requirements are flask @ git+https://github.com/pallets/flask@main and anyio, and you have an existing lockfile, and you want to upgrade anyio but not flask, despite the fact that main has some newer commits on it. I don’t believe you can support that operation (at least not reliably) without recording the requested revision (main) in the lockfile.
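As a rough illustration of why the requested revision matters, here is a small Python sketch. Everything in it is hypothetical (the VcsLock record, the relock helper, the fake resolver and the placeholder commit IDs); it only shows that keeping the requested revision next to the locked commit lets a tool leave flask alone while something else is upgraded.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VcsLock:
    name: str
    url: str
    requested_revision: str  # what the user wrote, e.g. "main" or "3.0.3"
    commit_id: str  # the immutable commit the lock resolved to


def relock(entry, requirement_revision, resolve, upgrade=False):
    """Keep the pinned commit unless the request changed or an upgrade was asked for."""
    if not upgrade and entry.requested_revision == requirement_revision:
        return entry
    return replace(
        entry,
        requested_revision=requirement_revision,
        commit_id=resolve(entry.url, requirement_revision),
    )


def fake_resolve(url, revision):
    """Stand-in for asking the remote what a revision points at today."""
    return {"main": "b" * 40, "3.0.3": "c" * 40}[revision]


flask = VcsLock("flask", "https://github.com/pallets/flask", "main", "a" * 40)

# Upgrading anyio only: the requirement still says "main", so flask stays put.
unchanged = relock(flask, "main", fake_resolve)
# The user edits the requirement to 3.0.3: the entry has to be re-resolved.
moved = relock(flask, "3.0.3", fake_resolve)
```

Without requested_revision in the record, the tool could not tell "the requirement still says main" apart from "the requirement changed", so it would have to re-resolve or guess.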

I could be wrong, but isn’t the intention that the direct dependencies are captured in the “dependencies” array in the PEP751 text?

In the “idea” I think it would go into a dependency group.

Tiny nit in the “idea” example @brettcannon

I think the default dependency group should contain “cattrs” and not “attrs”? The PEP751 text has it correct I think.

Lay user comments

First of all, thank you @brettcannon for tackling this problem. A solution is sorely needed. There’s a WG at my company which is trying to come up with a much more focused, internal solution, and the best that we could agree on so far was to require every team to cook up requirements.txt with == versions using their tools of choice, like pip-compile or poetry export.

Kind of locking

The “per-file” and “package” locking nomenclature is quite confusing. Would it be possible to come up with more “human”, end-user-centric words for these concepts?

I would maybe call the former “reproducible installation”, kind of in the same vein as npm ci.

I can’t think of a good name for the latter… superset comes to mind, but wouldn’t be understood either. Multi-platform lock file? Though I’m not sure that captures the essence of what it is.

Per-file (ci) example

What is the point of listing individual packages’ requires-python in this scenario? The way I read it, the installer should either install all packages or give up, and if it installs all, then Python must match requirements of every package. Would it not be better to compile the final list of allowed Pythons? Arguably, while that’s trivial in that example, it could get pretty complex if packages require weird python combos, like >3.2; <3.12.0rc3.

Per-package example

What are the semantics of packages.files.lock? These are described for file-locks, but don’t make apparent sense for package locks, or am I totally confused here? The way I see it, file-locks and package-lock are mutually exclusive, and there’s only one bullet point item that’s not file-locks-specific, namely that it’s an array of strings.

Exp for installers

  • Under per-file locking, if what to install is ambiguous then the installer MUST raise an error.

Could be rephrased as “Installer must raise an error, if there’s ambiguity what [or which artifact?] to install when operating under per-file locking [paradigm?]”.

I do have a slight issue with the entire requirement though. Arguably, a lock file that cannot be installed should not be generated in the first place. What are the circumstances for this clause?

Rejected:: only package

I wonder at what point a package lock file effectively becomes a file lock file.

Is it about environment markers? Or availability of binary artefacts? Or something else?

Perhaps the spec could state something to this end.

In the PEP, how is “URI” defined, and specifically how do the strings look for packages.files.origin and packages.directory.path for absolute and relative paths?

Potentially if the push-back on the per-package approach I’m taking isn’t winning people over.

At this point I’m waiting to see how people react to my new idea. If it doesn’t seem to garner more support, I will ask people if they want me to go back to just handling per-file locking and leave the per-package locking to a separate PEP or keep going with what I already have (I only added per-package locking because people wanted me to try to make it work because they saw each case devolving into the other; I’m okay dropping it and going back to my original lock file approach of files only). And if I do that then we can consider a separate PEP based on what you, PDM, Poetry, or I propose that’s entirely per-package and nothing per-file. And whether that PEP is by me, I’m the PEP delegate, or I’m not even involved at all doesn’t matter so long as we can come up with some standard for that use-case (which admittedly has different requirements compared to per-file locking, e.g., auditing isn’t as critical in my opinion since what could get installed is so open-ended).

I actually went through the PEP and didn’t find anything requiring that multiple entries only differ by name and version beyond details around sorting, so I will clarify that.

You could argue that’s covered by packages.vcs.origin, but the hash is still required.

I don’t know, you tell me. :wink: Seriously, if people have better terminology that would be great to hear, but what I used in the PEP is the best I could come up with.

Documentation. It’s convenient to know which packages require you to have at least some version of Python installed.

They don’t apply in that case. I’ll state that explicitly in the PEP.

If a locker messed up and the installer detected it; security in depth since it could be two separate tools.

It’s not since I thought URIs were already defined.

https://packaging.python.org/en/latest/specifications/direct-url-data-structure/#examples

Thanks! Fixed.


As another lay-user, I think the names are actually really clear and so are the descriptions in the PEP. It might be worth a few words in the “How to Teach This” section if we really need to go further. If anyone has time to look at all the previous confusion surrounding the different “flavors” of lock files, you’ll see all kinds of contortions of language attempting to distil the essence of these 2 different locking scenarios.

Naming is hard. I would suggest that bike-shedding on this aspect of the PEP can be handled later, because the contributors discussing the merits of the PEP understand the terms right now. Once the rest of the details of the PEP are agreed, then it can be decided whether there is any merit to bike-shedding on the terms or descriptions. It becomes a find-replace, rather than a distraction.


I think give it more time for people to confirm they’ve read it or not and let it sink in.

I’ve read the idea and, as a lay person, it looks good to me.

I like that it’s clear that if there is supported-environments then it feels intuitive that if I try to use the lockfile and see an installation error (“Error: environment not supported. Please do XYZ”) I can open the file and it will be obvious that my target environment isn’t supported. At that point I’ll then realise I need to investigate how to either ignore the environment restriction (if my installer supports that feature), or I’ll need to work out how to produce a lockfile for my target environment. That would address a lot of the open-source scenarios I find myself in day to day. It would also work for me in my workplace environment, where supported-environments, particularly for local, dev, staging, prod are very standardized and enumerated to a small target set (aarch64-linux, x86_64-linux, aarch64-darwin). So I like this part.

For the dependency-groups part. I really like that it supports this. Dependency groups are very useful and having first-class treatment of them seems necessary. I find the default dependency group definition '' = ['foo', 'bar'] a bit jarring. But I appreciate the challenges and mess created by giving it a real name. I don’t think it’s really that difficult to teach or document this either and it becomes a lot easier to understand when it sits amongst other named dependency groups. I think if in future, somebody wanted to bike-shed on this empty string and give it a name (e.g. _default) or something, that can easily be added as a follow-up PEP or amendment to document convention. So I like this part too.

Folks at my company want to create a single lock file that is usable on any supported architecture.

I think this means the superset (per-package) flavour in this spec.

In our case, it’s always CPython, 64-bit, Linux, but processor architectures vary.

A couple of caveats have been brought up:

  • some PyPI packages ship wheels for some architectures but not others
  • rarely, a different dep version is installed, e.g. amd64 → 2.14.3 and arm64 → 2.14.2

Most importantly, we want to avoid behaviour like pip-tools (?) where our build system needs to run on e.g. RISC-V in order to produce a lock file that works on RISC-V.

In simpler terms, I want to cook up lock files on my laptop.

Context: individual packages’ requires-python for CI (per-file) scenario.

I disagree. The lock file is built for a concrete application, which declares specific supported range of Python versions.

It feels like gold-plating. I would argue that the lock file should only contain what’s needed to install packages. I think that if there’s a field that can be changed without affecting the final outcome, that is installation(s), that field should be omitted. IMHO, the lock file should be stamped with python version spec at the top only.

There’s a precedent, of course: poetry.lock contains a python-versions spec for each dependency. Then again, poetry’s approach is a bit redundant, when the lock file has a hash of the “essence” of pyproject.toml and yet copies the top-level python-versions field. And, apparently, poetry performs some dependency resolution at install time, doesn’t it?

Anyway, the field is optional in the PEP, it’s up to lockers to choose to emit it.

All in all, my gut tells me that the lock files are not really human-readable, not for large projects. Rather, they are machine-readable and human-inspectable.


I’m asking because URIs do not define relative paths: the RFCs I looked at don’t, and support in URL libraries varies. In the direct URL structure, only absolute paths are recorded.


Just noting that I misread the spec page when I posted this comment. The risk is mentioned and handled by these three paragraphs:

When persisted, url MUST be stripped of any sensitive authentication information, for security reasons.
The user:password section of the URL MAY however be composed of environment variables, matching the following regular expression:

\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?

Additionally, the user:password section of the URL MAY be a well-known, non security sensitive string. A typical example is git in the case of a URL such as ssh://git@gitlab.com/user/repo.
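A quick Python sketch of how a tool might apply those three paragraphs before persisting a URL. The regular expression is the one quoted above; the helper name and the set of "well-known" strings are assumptions for illustration (the spec text only gives git as a typical example).

```python
import re
from urllib.parse import urlsplit

# Regular expression quoted from the spec text above.
ENV_VARS = re.compile(r"\$\{[A-Za-z0-9-_]+\}(:\$\{[A-Za-z0-9-_]+\})?")

# Assumption: which strings count as "well-known, non security sensitive".
WELL_KNOWN = {"git"}


def userinfo_is_safe(url: str) -> bool:
    """Return True if the URL's user:password section may be persisted as-is."""
    userinfo, at, _host = urlsplit(url).netloc.rpartition("@")
    if not at:
        return True  # no user:password section at all
    return userinfo in WELL_KNOWN or ENV_VARS.fullmatch(userinfo) is not None


print(userinfo_is_safe("ssh://git@gitlab.com/user/repo"))
print(userinfo_is_safe("https://${USER}:${TOKEN}@example.com/repo.git"))
print(userinfo_is_safe("https://alice:hunter2@example.com/repo.git"))
```

A URL that fails the check would need its credentials stripped before being written to a file.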

I’ve posted a PR to give these three paragraphs their own subheading: Add Direct URL security heading by ncoghlan · Pull Request #1585 · pypa/packaging.python.org · GitHub

Aside: Are you aware you can do this with uv today? You can produce lock files across platforms with --python-platform and --python-version command line arguments. This will give you lock-time confirmation if a solution to the direct dependencies exists. Might save you some time by avoiding having to build something yourself.


I’ve been thinking of it as: package lock → lockfile, file lock → “environment locked” lockfile or “target locked” lockfile.

Again, I might be misunderstanding something fundamental about the difference, but to me a “file lock” is just a “package lock” that’s frozen to a specific installation command. e.g. poetry export file-lock --group foo --extra bar --environment <whatever>.


As for the new idea. Presumably the point of “supported environments” is to be file-locks. It seems like you’re losing the distinction in o.g. file-locks that froze the set of dependency groups. Is that on purpose? If that was a valuable aspect, could [dependency-groups] be a sub-table of [[supported-environments]]?

An o.g. “file lock” would then just have no dependency groups. An o.g. “package lock” would just have a [[supported-environments]] with no requirements. And they could then be mixed and matched in the same file even.

And this seems to explicitly not support e.g. a dependency that declares a different version based on environment marker? I can’t recall off the top of my head if this was allowed before, but that’s something e.g. poetry supports today.

I definitely like the idea of having “named environments” rather than named locks. Assuming I understand the gist of your idea correctly:

  • lock files would always define a package lock
  • lock files MAY contain a list of specifically supported environments where the artifacts to install are precalculated and noted in the lock file
  • when locking tools aren’t emitting universal lock files they SHOULD populate a supported-environments entry to record the environment markers that were relevant when generating the lock file (they MAY record all defined lock file markers rather than only the markers relevant to the dependency set)

My main question with this idea would be how would a “lock in multiple environments and then deduplicate the results to create a merged lock file” algorithm work with the combined format?

With file locks, you could merge the files without needing to do anything with environment markers, since you were just populating lock names in the relevant arrays.

I think it would be feasible to preserve that property by keeping a packages.files.environments array as part of the specification with the following definition:

an array of supported environment names that allows files for the named environments to be installed without requiring any dependency resolution at installation time

Essentially, the dependency resolution for the named environments would be executed at lockfile generation time and the results recorded in the environments array for each artifact. Installation tools with dependency resolution capabilities MAY ignore the named environments and evaluate the markers directly.
Installation tools without dependency resolution capabilities MAY fail outright if no named environment is found matching the target installation environment.

Merging files would then be a matter of including all the named environments and their respective array entries, while leaving the package level marker fields unchanged.
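To sketch what that merge could look like, here is a small Python example. The input shape (one mapping of package name to file names per named environment) and the function names are assumptions for illustration, not anything defined in the PEP; the point is only that merging touches names and never has to reason about markers.

```python
def merge_locks(per_environment):
    """Merge {env_name: {package: [file, ...]}} into {package: {file: [env_name, ...]}}."""
    merged = {}
    for env_name, packages in per_environment.items():
        for package, files in packages.items():
            for file_name in files:
                environments = merged.setdefault(package, {}).setdefault(file_name, [])
                if env_name not in environments:
                    environments.append(env_name)
    return merged


def files_for(merged, env_name):
    """What a resolver-less installer would install for a named environment."""
    return sorted(
        file_name
        for files in merged.values()
        for file_name, environments in files.items()
        if env_name in environments
    )


merged = merge_locks(
    {
        "linux": {
            "attrs": ["attrs-23.2.0-py3-none-any.whl"],
            "numpy": ["numpy-2.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"],
        },
        "windows": {
            "attrs": ["attrs-23.2.0-py3-none-any.whl"],
            "numpy": ["numpy-2.0.1-cp312-cp312-win_amd64.whl"],
        },
    }
)
print(files_for(merged, "windows"))
```

The shared attrs wheel ends up listed once with both environment names, which is the deduplication step from the question above.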

This feels orthogonal to adjusting the way file locks are defined, but does seem reasonable.

Yes.

That’s doable as a per-file locking scenario since you have a finite set of environments to support.

Correct, hence the support for sdists and locking the build requirements.

Correct, and that’s supported.

Correct. Some people want maximum information to understand why something was included or what influence something had on a lock, while others like yourself don’t. I made non-critical info optional for that reason.

And my PoC can lock for multiple platforms simultaneously, so it’s doable and part of the design that it’s possible.

Yes, because there was an ask to see if they could be closer together to alleviate differences.

That seems right.

I’m still thinking through the best way to represent that situation. It’s probably keying on one and then listing what applies to a file for the second dimension.

The problem I realized last night is the envs ✕ groups matrix; you can’t guarantee that if a file (or package version) makes sense for an environment and dependency group pair that it holds for all groups under that environment (and vice-versa); metadata could vary between files such that it doesn’t hold. So it probably requires two levels to specify the env and group (there’s a reason the PEP currently doesn’t try to tackle dependency groups and expects people to create different lock files per dependency group in that case).
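A tiny Python illustration of that matrix problem, using made-up environment and group names. If a file carried two independent arrays (environments and groups), the arrays would imply every combination, including pairs the file was never valid for; keying on one dimension and listing the other underneath avoids that.

```python
# Pairs for which one hypothetical file is actually valid.
valid_pairs = {("linux", ""), ("windows", "[docs]")}

# Flat representation: two independent arrays on the file.
flat_envs = sorted({env for env, _ in valid_pairs})
flat_groups = sorted({group for _, group in valid_pairs})
implied_by_flat = {(env, group) for env in flat_envs for group in flat_groups}

# Two-level representation: key on the environment, list groups underneath.
two_level = {}
for env, group in sorted(valid_pairs):
    two_level.setdefault(env, []).append(group)
implied_by_two_level = {
    (env, group) for env, groups in two_level.items() for group in groups
}

# Pairs the flat arrays wrongly claim the file is valid for.
print(sorted(implied_by_flat - valid_pairs))
print(implied_by_two_level == valid_pairs)
```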


One thing I want to try and clarify is why there’s this distinction between per-file and per-package locking, and it comes down to why you’re locking stuff to begin with.

I think there are two reasons why people typically want reproducible environments: consistency and security. Now, if you just want consistency (e.g., everyone has the same package versions when they build the docs), then you just need a file format that can encode that information. Whether it’s especially readable or not isn’t a major concern as long as the results of using that lock file lead to a consistent outcome. This also lends itself to trying to be open-ended about which environments you support. This is very much a “get stuff done” side of things.

But for security, you need to be able to audit what’s going on in the file. Now you could use tools to help with that and thus continue to not care about how the file format looks, but we all know it’s easy to just say, “eh, the diff looks good” and approve/merge a pull request w/o running some tool. But if you lower the barrier of understanding what’s going on in a lock file, I hope it at least increases how much people would pay attention to what’s going on. This is the “every detail matters” side.

And those two goals of consistency and security line up with per-package and per-file locking, respectively. And since I’m coming at this from a desire for security, it’s why the PEP talks about readable diffs, trying to keep information together so you don’t have to scroll around a file to understand things (and thus miss key details or get lazy about checking something by keeping the cognitive load down), simple install semantics, requiring hash verification, including details to create SBOMs, etc.

My hope is we can somehow come up w/ a format that meets the goal of both reasons people want reproducible environments. I’m giving it this week to see if that’s possible. But if it’s not then I am willing to strip my PEP down to just security-focused, per-file locking like I originally proposed before my parental leave and having a separate PEP for just consistency, open-ended environment support, per-package locking (whether I’m involved in that PEP is an open question). I will probably poll this group next week if there isn’t obvious consensus as to what direction to take by then.


If we stick with the idea of named environments as installation commands that are pre-resolved at lock time so installers can be told to read them out of the file, would we lose anything we care about by having to pre-define the active groups for that lock?

It seems fine to me if lock files declaring defined environments are forced to say things like “dev server has the default & dev dependencies installed, CI has the default and testing dependencies, staging and production only have the default dependencies” rather than allowing the active groups to vary freely (installers would instead need to use the open package lock and resolve at installation time for free selection of groups).

If we take that approach, I think “[[resolved-environments]]” might be a better name for the array of named environments: the significant thing about them is that they’re known at lock time so the locking tool is able to run a hypothetical installation against the declared package markers at locking time and tag the exact set of artifacts that would be installed into that target environment, rather than leaving it up to the installer to work out the details at installation time.

I’d also still be in favour of being able to define resolved environments that differ solely based on OS environment variables (specifically for the dev/CI/staging/production use case where the environment markers are all otherwise the same).
