Lock files, again (but this time w/ sdists!)

ncoghlan · March 2, 2024, 3:53pm

For my own use cases, if partial locks are an available option, I’d only need two targets for most work projects (the exceptions would be endpoint projects with multiple deployment environments):

“deploy”: comprehensive lock for CI and production deployments
“dev”: partial lock that only specifies the Python version to use, without locking the platform

Now, in a work context, I don’t think it would be a major problem to have to spell out comprehensive locks for Windows/macOS/Linux dev environments, but for open source projects, being able to define partial locks would be a lot more convenient than having to exactly specify the supported dev platforms (as several folks pointed out earlier in the thread)

With the proposed format, even a partial lock will still provide the “index snapshot” locking behaviour that Poetry & PDM offer today, where new versions and artifacts published will only be included when the lock file is next updated, rather than potentially being picked up as soon as they are published. Partial locks just don’t provide the “No resolution needed at install time” benefit that comprehensive locks give.

So yeah, I think having a fallback partial lock that specifies either a minimum or exact Python version would end up being pretty common if the option is available, even for projects that specify one or more comprehensive locks for CI and production deployments.

pf_moore · March 2, 2024, 4:27pm

Just as a heads up here, I’m expecting to require at least one implementation of a locker and one implementation of an installer before I’ll accept a lock file PR. I don’t want another standard that’s approved but then doesn’t get implemented for months, or even longer…

If we do add the version-pinning / “partial lock” style of locking into the proposal, we’ll need to decide which tool to use for the reference implementation. PDM and Poetry would have backward compatibility concerns. Maybe pip-tools? Do they do cross-platform locking? Unlike environment pinning, I don’t think it would fit well in pip (although the installation side would be OK, as pip has a resolver so the requirement for a resolver when installing isn’t a problem there).

ncoghlan · March 2, 2024, 5:00pm

The naive reference implementation I had in mind would be based on running multiple locking passes with mousebender or uv for different environments and then combining them, rather than the smart environment merging that Poetry & PDM implement. (using pip-tools or pipenv could be done, but would only generate partial locks since both of them record the hashes of all artifacts, not just the one that would be installed in the current environment)

For example, lock for Python 3.12 on 64-bit Linux & Windows.

Define 3 environment IDs in the lock file: win64, linux64, py312

Populate the “unconditional” section with the distribution versions that are the same in both files, tagging each artifact with the matching linux64 or win64 environment ID, as well as the common “py312” fallback partial lock.

Populate the “conditional” section with the versions that differ, again tagging them based on which locking pass generated that result.

Result: file with comprehensive locks for Linux & Windows 64-bit, and a partial lock which will probably work on other POSIX-based platforms.
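
A minimal sketch of the merge step described above, not code from any existing tool. It assumes each locking pass yields a plain mapping of distribution name to pinned version; the function name, the in-memory shapes, and the example pins are all hypothetical, and per-artifact tagging (including the "py312" fallback) is left out for brevity.

```python
def merge_locking_passes(passes):
    """Combine per-environment pins into unconditional/conditional sections.

    `passes` maps a hypothetical environment ID (e.g. "linux64") to a dict of
    {distribution name: pinned version} produced by one locking pass.
    """
    unconditional = {}
    conditional = {}
    all_names = sorted({name for pins in passes.values() for name in pins})
    for name in all_names:
        # Record which environments resolved to which version.
        by_version = {}
        for env_id, pins in passes.items():
            if name in pins:
                by_version.setdefault(pins[name], []).append(env_id)
        present_everywhere = all(name in pins for pins in passes.values())
        if len(by_version) == 1 and present_everywhere:
            # Same version in every pass: goes in the shared section.
            (version,) = by_version
            unconditional[name] = version
        else:
            # Versions differ, or the distribution only appears in some passes.
            conditional[name] = {v: sorted(envs) for v, envs in by_version.items()}
    return unconditional, conditional


# Illustrative pins only; not taken from a real locking run.
passes = {
    "linux64": {"requests": "2.31.0", "uvloop": "0.19.0"},
    "win64": {"requests": "2.31.0", "colorama": "0.4.6"},
}
unconditional, conditional = merge_locking_passes(passes)
print(unconditional)
print(conditional)
```

Distributions that appear in only one pass are treated as conditional here; that is a design choice of the sketch rather than something the thread settles.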

I also think it would be reasonable for some lockers that don’t have native multi-environment support to only support generating locks for the current environment, and I also expect some lockers will only generate comprehensive locks without offering the ability to merge the results into partial locks (e.g. if pip were to generate lock files, I’d expect it to have both those limitations).

pf_moore · March 2, 2024, 7:20pm

I’m really concerned about approving this just on the basis of a naive implementation. At a minimum, I think what’s needed is something that is usable in a production environment. The implementation can be naive, as long as the user interface is production-quality.

I’m not quite sure what you mean by “probably work” here. The install will either work or fail, there’s no “probably” involved (installers should never create a broken environment). If by “probably”, you mean “the install might fail with an error” then that’s fine, but it’s not what I thought you were after.

Also, when you say “other POSIX-based platforms” - if the wheel tags are incompatible, things won’t work at all. I thought one of the points with the poetry/PDM form was that they included all wheels, not just ones for the known target platforms. Your approach doesn’t do that.

All of this is very much why I want to see something that people can actually use in their production environment, before we accept a PEP. The devil is very much in the details. With a good enough reference implementation, there’s at least a chance for us to get some real-world feedback. And it’s not as if we won’t need real-world implementations once the proposal gets accepted, so you’re not wasting anything by doing the implementation up front.

Agreed. And yes, that’s the only form I can imagine pip getting a locking capability. I’d be quite happy for pip to be able to install from more general lockfiles, but we’re not going to create them…

I understand that producing a production quality reference implementation is a lot of work. But the point is that at some point, someone has to do the work of implementing the capabilities we’re talking about here, in tools that users can actually use in their production environments. As a community, we’ve become very unwilling to standardise existing implementations, preferring to use the standards process as a form of “design by committee”. And with no-one being willing to implement features until the standard has been finalised, we’re leaving ourselves open to all of the problems and flaws associated with up front design, with no practical way to modify the design during the implementation phase. I don’t think that’s how our standards should work, and I’d rather not continue down that route.

Anyway, if I try to specify what I want from the reference implementation in too much detail, I’ll just be committing the same mistake as I’m describing above. So I’ll leave it there for now. When this discussion gets turned into a PEP, it needs a “reference implementation” section. I’ll look at the code provided and give my comments on whether I think it’s acceptable, and if it’s not, then we can iterate on it until it is.

Also, can I point out that I’m not insisting that @brettcannon does the implementation work. I’d be more than happy for someone else to create a PR adding a prototype implementation of the spec to an existing tool, and for that to be the reference implementation^[1].

[1] In fact, the whole point of my prototype using pip’s installation report was that it could be made into just such a reference implementation (and I would have done that once the questions raised by the prototype were resolved). But that’s only sufficient if the spec remains as it is now, limited to “environment-locking” lockfiles.

groodt · March 2, 2024, 8:00pm

+1

We could ask Poetry and PDM for their level of interest in changing their lockfile format and adopting a standard? If there isn’t enough commitment right now, it could be left for a future PEP? It does seem like there is a reasonable idea that shows it would be possible to expand the file to support open-ended / unconditional / unspecified target environments in future.

From the very long thread so far, I think the levels of soft commitment look like this:

original proposal - environment lockfile for specified target environments:

pip - yes
hatch - yes
uv - awaiting response? @charliermarsh
pip-tools - awaiting response?
poetry - yes, as export format only
PDM - yes, as export format only (is that correct, @frostming?)

discussed idea - partial / constraints lockfile for unspecified target environments:

pip - no
hatch - awaiting response?
uv - awaiting response?
pip-tools - awaiting response?
poetry - awaiting response?
PDM - awaiting response?

ofek · March 2, 2024, 8:19pm

Highly unlikely as I think the environment locking would serve the same purpose if the defaults are broad. Not a no, but I would doubt Hatch support.

groodt · March 2, 2024, 8:34pm

On the topic of the “pylock.toml” filename:

Should there be consideration for pyproject.lock?

One thing that is becoming clearer to me is that it seems there might be some desire for “replacing” or “removing” tool specific configuration state / caches / files.

In the hypothetical scenario where this proposal does decide to take on the scope of standardising constraints/boundary locking, is it understood that poetry.lock and pdm.lock would still exist and there would be an additional pylock.toml file? I’m not sure what the tooling UX looks like to keep all files in sync.

The existing poetry.lock and pdm.lock files presently store other (small) items of metadata, and I’m not sure they won’t have more in future…

I hope it’s not confusing for users if the standard gets adopted but users still need to keep their poetry.lock and pdm.lock files?

We may need to ask the people requesting “poetry style” lockfiles whether they also expect to remove their poetry.lock files. That would be something for the Poetry maintainers to consider: if they wanted or needed something new in their file, where would they keep it?

It might be that if there was a “tool extensible” section in a pyproject.lock (or I guess pylock.toml) that tools can continue to support their existing structures or use standardised structures where appropriate.

This all does seem way out of scope and starts drifting into the “workflow manager” tarpit. But that is what Poetry and PDM are, even though the focus is mainly on their locking approach in this discussion.

It also puts a small nail in the idea of multiple pylock.toml files if we did go this route.

pf_moore · March 2, 2024, 8:46pm

I have a feeling that @ncoghlan is interested in the “partial lock” format instead of Poetry or PDM lockfiles. Which begs the question - @ncoghlan given that Poetry and PDM already exist and provide the sort of lockfile you’re interested in, why are you not using them?

More generally, can the people arguing for “partial locking” (or “version pinning”, or whatever we end up calling it) say where they are currently getting that functionality from? Or is it just something they think they might try if it existed in their current tool, but they aren’t interested enough to switch tools for?

charliermarsh · March 2, 2024, 10:45pm

My current feeling on the proposal (in response to the at-mention above):

If something like the current proposal were standardized, our intent would be to support and implement it in uv, both because it’s a standard and because I’d hope it would be an improvement over the way that we use requirements.in / requirements.txt for locking (by way of: (1) being standardized, (2) being a more complete format, and (3) allowing iterative re-locking on new platforms and a few other nice things). I don’t know if we’d use it as our “primary” format forever – perhaps eventually we would only “export” to it, or only allow installing from it but not resolving to it – but still, I’d expect to support it.
If the proposal were amended to be more of a “universal lockfile” or a “partial lockfile”, it would probably be impossible for us to commit to supporting it right now, regardless of how good the proposal was, since we just haven’t done the work on our side to understand what we need out of such a format and explore the design space. (I say this without fully understanding the “partial lockfile” proposal – how it would work, what the motivations are, etc. – likely due to lack of effort on my part to internalize the most recent rounds of comments.)

charliermarsh · March 2, 2024, 10:50pm

If I ignore any debate around the goals and scope of the proposal (i.e., accept them as-written), then I’m left with two primary concerns:

I felt that the proposal (as originally presented) suffered from correctness issues, which I tried to illustrate in my prior comments. A “Python platform” is not just Linux vs. macOS vs. Windows, even in very common cases. I think Brett made some improvements and I trust the issues can be resolved, but broadly, the proposal should guarantee that you are never able to install the “wrong” set of dependencies on any platform. I don’t think this was true in the first iteration.
I’m still unsure how lockers are supposed to generate the markers and tags for a given lock entry, nor how installers are supposed to choose the “right” entry. I’d personally prefer that it’s standardized in some way. This might be a failing on my part but I thought I’d call it out.

EpicWink · March 3, 2024, 12:09am

There has always been a [tool] section in this proposal. I might be misunderstanding, so let me know how that doesn’t support your idea, but you may have simply missed it.

groodt · March 3, 2024, 12:35am

Thanks. Missed it!

Well, then we need to understand if poetry and/or pdm would be motivated / happy to support their tools reading and writing into this new tool area for all of their purposes instead of their existing files.

That seems like it may be at odds with multiple pylock.toml files, but I could be wrong.

The topic of providing “tool” sections and workflow tooling configuration file (or directory) handling seem related but could also indeed be considered a broader topic for some hypothetical pyproject.lock.

It seems like Brett has intentionally (wisely in my opinion) distanced the proposal from tackling the entire workflow/project configuration space and is only focused on environment lockfiles. It might be that he was intending the tool section in pylock.toml to be used so the locker could write a signature or recipe of how the lockfile was produced (tool name, version, command etc), but I could be wrong.

frostming · March 3, 2024, 9:57am

Sorry for not giving a comprehensive opinion on behalf of PDM regarding this PEP.

I don’t have much objection to PDM adopting a standardized lockfile format, especially since this proposal addresses the issue with sdist. I even tend to support it as the lockfile format, rather than an export format only.

While PDM’s lockfile is designed for various platforms and Python versions, I agree that most users will not need to install it on all the platforms listed. I can’t keep up with this thread completely and may have missed something, but there seems to be some controversy regarding markers and tags. Basically, I prefer to have a protocol in the proposal on how lockfile consumers (installers) interpret the markers and tags, and pick a lock entry based on that. Ideally there can be some cascading rules across the lock entries. For example, some packages exist in a lock entry with less specific tags and others in a more platform-specific lock entry. That also means we need to explicitly document how to decide the degree of specificity of a particular tag and marker. Since there should be only one entry for a specific package, the cascading shouldn’t be a problem. In this way we can avoid repeating packages among lock entries.

Of course, this is just my rough idea and there may be things I haven’t considered. At least it may provide some inspiration. Anyway, if this PEP is adopted, PDM will make an effort to use this as the (sole) lockfile format.
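
To make the cascading idea more concrete, here is one possible rule, purely as an illustration. The scoring function, the entry shape, and the names `specificity` and `pick_entry` are assumptions of this sketch; nothing in the proposal defines how specificity should be measured.

```python
def specificity(tag):
    """Count how many of the three tag fields restrict the target.

    This scoring rule is an illustrative assumption, not part of any proposal.
    """
    python_tag, abi_tag, platform_tag = tag.split("-")
    return (
        (platform_tag != "any")
        + (abi_tag != "none")
        + (not python_tag.startswith("py"))
    )


def pick_entry(entries, supported_tags):
    """Pick the most specific lock entry whose tag the installer supports."""
    usable = [e for e in entries if e["tag"] in supported_tags]
    if not usable:
        raise LookupError("no lock entry matches this environment")
    return max(usable, key=lambda e: specificity(e["tag"]))


# Hypothetical lock entries: one generic, one platform-specific.
entries = [
    {"name": "generic", "tag": "py3-none-any"},
    {"name": "linux64", "tag": "cp312-cp312-manylinux_2_17_x86_64"},
]
linux_tags = {"cp312-cp312-manylinux_2_17_x86_64", "py3-none-any"}
windows_tags = {"cp312-cp312-win_amd64", "py3-none-any"}

print(pick_entry(entries, linux_tags)["name"])
print(pick_entry(entries, windows_tags)["name"])
```

A real protocol would also need to account for markers and for ties between equally specific entries, which this sketch ignores.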

ncoghlan · March 3, 2024, 11:39am

The next time I’m setting up an organisation’s Python workflow from scratch, I probably will (while the annoyances at previous employers were real, they weren’t annoying enough to justify investing in a wholesale toolchain migration).

The way the topic came up in this thread was:

@brettcannon proposed dropping “multi-target” lock files completely
@alicederyn and others pointed out various undesirable consequences of that
@steve.dower suggested that the single file format be viewed as an information de-duplication exercise for the multi-file format
I sketched out a rough idea for a format that could provide the de-duplication that Steve suggested (as well as a way to choose a specific artifact set at installation time)
an aspect of that proposal is that it allows for partial locks, where a single environment ID ends up listed against multiple artifacts or even versions for a single distribution if the environment markers associated with that ID aren’t prescriptive enough to resolve to exactly one artifact

The superficially simplest way to handle that case at the spec level is to simply disallow it and require that each environment ID be listed against at most one artifact for a given distribution (i.e. comprehensive locks only).

Supporting comprehensive locks is sufficient for the format to be useful, and some tool authors have already indicated that they only plan to emit comprehensive locks, so they wouldn’t be hindered if partial locks were disallowed.

My position is that it makes more sense to allow partial locks at the format spec level, since there are existing tools that work that way:

Poetry & PDM lock files are already designed around partial locks that cover multiple environments rather than comprehensive locks that target a single environment
pip-tools and pipenv also only implement partial locks, since they capture all artifact hashes for a given version rather than limiting themselves to an exact artifact match

(The main tool that actually generates a comprehensive lock rather than a partial one right now is “pip freeze”. Everything else at least captures multiple artifact hashes for each version, even if matching multiple versions is disallowed)

Instead of banning partial locks at the spec level, artifact installers that don’t do any dependency resolution would error out if they’re only given partial locks to work with.
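
A rough sketch of that installer-side check, assuming a hypothetical in-memory lock where every artifact lists the environment IDs it is tagged with. The data shape, the exception name, and the example filenames are made up for illustration.

```python
class PartialLockError(Exception):
    """Raised when an environment ID does not pin exactly one artifact."""


def select_artifacts(lock, env_id):
    """Return {distribution: artifact filename} for `env_id`, or raise.

    `lock` is a hypothetical structure: {distribution: [artifact, ...]},
    where each artifact is a dict with "filename" and "environments" keys.
    """
    selected = {}
    for dist, artifacts in lock.items():
        matches = [a["filename"] for a in artifacts if env_id in a["environments"]]
        if len(matches) > 1:
            # More than one candidate means a resolver would have to choose.
            raise PartialLockError(
                f"{dist}: {len(matches)} candidates for {env_id!r}; a resolver is needed"
            )
        if matches:
            selected[dist] = matches[0]
    return selected


lock = {
    "example-dist": [
        {
            "filename": "example_dist-1.0-cp312-cp312-manylinux_2_17_x86_64.whl",
            "environments": ["linux64", "py312"],
        },
        {
            "filename": "example_dist-1.0-cp312-cp312-win_amd64.whl",
            "environments": ["win64", "py312"],
        },
    ]
}

print(select_artifacts(lock, "linux64"))  # comprehensive lock: one artifact
try:
    select_artifacts(lock, "py312")  # partial lock: two artifacts match
except PartialLockError as exc:
    print("error:", exc)
```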

I’m also genuinely skeptical of new formats that can’t be readily integrated into existing work flows without requiring major changes to those workflows. Hence my interest in whether a proposed lock file format can be correctly and usefully populated by reformatting existing tooling output.

In this case, while my first naive implementation approach wouldn’t work, I think the following approach would be conceptually viable (as in, the command inputs and the output details would provide the necessary information to populate the proposed format):

use pip-compile to generate a partial lock on an exact Python version (single version of each distribution, but listing every artifact published at the time of locking)
use pip-sync and pip freeze to generate comprehensive locks for selected target environments based on the partial lock generated by pip-compile

There are dependency trees where this wouldn’t be enough, as the first step can definitely miss things that vary by platform, but if a project is affected by those issues, the sync step should pick them up.

This would be a super slow way of doing things, and in practice it would be possible to avoid installing anything at all while generating the lock file (even when it includes comprehensive locks in addition to partial ones), but this is still the conceptual heart of what the proposed format aims to represent: comprehensive locks for specific target environments, together with partial locks that describe supersets of those comprehensive locks (and hence may also cover additional environments that aren’t being specifically targeted)

alicederyn · March 3, 2024, 11:47am

I was mainly interested in clarifying why Poetry could not use the original proposal, as only very vague terms were used; aside from personal curiosity, I hoped that might be useful input. I don’t have a horse in this race, though.

ncoghlan · March 3, 2024, 11:54am

Your points were compelling enough to change my own opinion, so just giving credit where it’s due.

pf_moore · March 3, 2024, 12:11pm

I’m sorry if I’m being dense, but I still don’t really understand what you mean by “partial locks” here.

My understanding is that Brett’s proposal specifies an exact set of “download this file from this URL and install it” instructions that are designed to reproduce a particular environment on one or more targets, with the valid targets being described (somehow - as yet unclear) by a combination of markers and tags. Conversely, the Poetry/PDM form specifies an explicit list of files (identified by filename+index and hash, or maybe by URL, I’m not 100% clear on this) that should be used by a full resolving installer as the only installation candidates for an install, to build an environment, the intention being to protect the resolver from changes in the available candidates on the underlying package indexes.

These seem to me to be two very different mechanisms, and although they are essentially addressing the same (or at least very similar) underlying use case, they do so in fundamentally different ways, and make different trade-offs in doing so.

I don’t honestly see where your “partial lock” fits into this. Is it taking Brett’s “list of URLs to install” approach, then throwing some additional files into the mix, and saying that a sufficiently sophisticated installer (one with a resolver, for a start) could then potentially use that list of files as the input for a Poetry-style “resolve using this view of the world” approach? That would imply that you’d need both types of installer, plus a mechanism for deciding which to use in a particular case, in order to consume such a lockfile.

And yes, I’m deliberately stressing the differences between the “environment locking” and “version pinning” approaches here. You can view them as closer in approach than I’m suggesting, but given the amount of confusion we’ve had over technical details because we don’t have precise terminology, I think emphasising differences is more productive than glossing over them.

Also, I’ll point out that Brett’s proposal doesn’t have to result in highly specific lockfiles. A lockfile that just said “install pip 24.0 from https://files.pythonhosted.org/packages/8a/6a/19e9fe04fca059ccf770861c7d5721ab4c2aebc539889e97c7977528a53b/pip-24.0-py3-none-any.whl” would work on all platforms, for any version of Python >= 3.8. That’s pretty broad, for a format that’s being criticised for needing too many specific environment locks…
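
The breadth of that example is visible in the wheel filename itself. A small sketch using only the standard library (the helper names are hypothetical; real tools would typically use the `packaging` library instead):

```python
def wheel_tags(filename):
    """Split a wheel filename into its (python, abi, platform) tag fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # The last three dash-separated fields are always the compatibility tags;
    # an optional build tag may sit between the version and the Python tag.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def is_platform_independent(filename):
    """True when the wheel declares no ABI and no platform restriction."""
    _, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"


print(wheel_tags("pip-24.0-py3-none-any.whl"))
print(is_platform_independent("pip-24.0-py3-none-any.whl"))
```

Note that the supported Python version range is not encoded in the filename; it comes from the wheel's `Requires-Python` metadata, so a locker would need to read that as well.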

alicederyn · March 3, 2024, 12:27pm

I think this is what is being called a “partial” lock, as the resolver has discretion at installation time to pick from among the candidates, so theoretically could pick a different set of wheels/sdists even on the exact same machine, due to differences in algorithm.

For instance, if two sdist versions are selected for the same project, perhaps because the most recent one doesn’t support one target environment any more, when used on a platform that is supported by both, an installer could prefer the most recent one or the oldest one. Its choices are constrained but not fully locked.
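
A tiny illustration of that point, assuming simple dotted numeric versions and two made-up selection policies; real installers use full version parsing and their own resolution strategies.

```python
def version_key(version):
    """Sort key for simple dotted numeric versions such as "2.1.0"."""
    return tuple(int(part) for part in version.split("."))


def pick(candidates, prefer="newest"):
    """Choose one version from the locked candidates under a given policy."""
    ordered = sorted(candidates, key=version_key)
    return ordered[-1] if prefer == "newest" else ordered[0]


# Both versions are allowed by the (hypothetical) lock on this platform.
locked_candidates = ["1.9.0", "2.0.0"]
print(pick(locked_candidates, prefer="newest"))
print(pick(locked_candidates, prefer="oldest"))
```

Both results satisfy the lock, yet the resulting environments differ, which is the sense in which the lock is only partial.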

mikeshardmind · March 3, 2024, 12:37pm

That’s also been my interpretation, but there’s an interesting bit of potentially confusable semantics. This would be constraining less strictly than to a single version (These are the allowed dependencies, and here’s a set of constraints that partially constrain installation candidates, but don’t reduce to a single solution) but someone could hear that term and reasonably think that only some dependencies were being locked (what the application cares about) and that other things could be installed into the environment by other install commands as long as compatible. (partially specifying the environment, unspecified deps not explicitly incompatible, but not solved for by this install.)

pf_moore · March 3, 2024, 1:54pm

Agreed, and I think this is why I struggle to understand what people mean when they talk about this type of “partial” lock. Because it’s a continuum - you can restrict what files a full resolver is allowed to look at from no restriction at all, down to allowing no files (the latter would clearly fail to resolve). And nobody’s really explaining how the locker should pick what files to include in the limited set. One extremely reasonable approach would be to say “only consider files that were present on the index at the time this lockfile was created” - that’s not something that can be realistically specified as a list of files^[1] but it would give the same sort of reproducibility that this type of locking is intended to achieve. So the problem with proposals using this approach is that how to decide what’s in the lockfile is not part of the spec, meaning that the proposal is just about how to define a set of constraints, and not really about “locking” at all…
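
As a sketch of the “files present on the index at lock time” rule: assuming the index exposes an upload timestamp per file (an assumption of this example, not something the proposal specifies), a resolver could filter candidates against a cutoff rather than an explicit file list. The record shape and filenames below are hypothetical.

```python
from datetime import datetime, timezone


def snapshot(files, locked_at):
    """Keep only files uploaded at or before the moment the lock was created.

    `files` is a list of dicts with "filename" and "upload_time" (an ISO 8601
    string with a UTC offset); this shape is assumed for the sketch.
    """
    return [
        f["filename"]
        for f in files
        if datetime.fromisoformat(f["upload_time"]) <= locked_at
    ]


files = [
    {"filename": "example_dist-1.0-py3-none-any.whl",
     "upload_time": "2024-01-15T10:00:00+00:00"},
    {"filename": "example_dist-1.1-py3-none-any.whl",
     "upload_time": "2024-03-05T09:30:00+00:00"},
]
locked_at = datetime(2024, 3, 2, tzinfo=timezone.utc)
print(snapshot(files, locked_at))
```

Under this rule the lockfile would only need to record the cutoff time, which is why it cannot be expressed as a plain list of files.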

Conversely, the approach of saying “install exactly these files” doesn’t have this issue - it works out in advance what to install and just records that. That’s definitely “locking” in any sense that I can think of. The issue here is working out how to describe what systems the given list of files applies to. A naive approach would be to say “only something that’s exactly the same as the source system”, but that’s pretty restrictive. Loosening the definition requires creating some form of “minimum definition” based on the metadata in the chosen files. It’s this aspect that Brett’s proposal punts on at the moment.

So for me, we still have two approaches:

Constrain what the installer is allowed to consider

Pro - the resulting file is potentially applicable in more situations
Pro - there is prior art in the form of Poetry and PDM
Con - the installer must include a full resolver
Con - there’s no clear spec for how to choose what files to allow
Con - what’s happening is arguably not “locking”

Provide an exact list of files to install

Pro - the installer does not need a resolver
Pro - the resulting installation is completely deterministic^[2]
Pro - the format is easier to audit because the result is deterministic
Con - the lockfile potentially targets a more restrictive set of platforms
Con - determining precisely what platforms a given set of files will work on is non-trivial^[3]
Con - this is a new approach, with no existing implementations, so it’s less clear whether it actually addresses users’ actual problems

For me, the defining difference is whether the lockfile can be used with an installer that doesn’t have a resolver. I know that for all practical purposes there’s currently no Python installer that doesn’t include a resolver, but in principle, this is the distinguishing feature between the two proposals.

I like the idea of not needing a resolver at install time. It makes installing from a lockfile easier to reason about, easier to audit, and just fundamentally simpler. But it does imply that the locker needs to be a lot more intelligent about working out “what platforms does this solve apply to” if we’re to avoid overspecified targets. And as far as I know, no-one’s even looked at that problem yet.

Here’s a thought - if we had a corpus of “typical lock specifications”, that might help us understand better whether the relatively complex cases we’re worrying about here actually happen in practice. I tried to search for requirements.in files in Python projects on GitHub (on the assumption that while that’s not the only way people lock at the moment, it’s probably a reasonable start) but my attempts failed. Does anyone know how to do that?

[1] not without other restrictions, at least
[2] yeah, sdists, I know
[3] and a naive approach is likely to be needlessly, possibly even unusably, restrictive