Lock files, again (but this time w/ sdists!)

cemici · February 24, 2024, 12:34pm

Can I suggest that this proposal would benefit from spelling out (some of) the use cases that it is intended to address?

In a previous round, PEP 665 named various tools that have introduced lock files, offering that as evidence that a standard was desirable.

However it is becoming clear that those existing lock files vary not only in their format but also - more significantly - in their meaning and purpose.

I think this proposal no longer aims at standardizing those formats, or even playing in the same space as some (most?) of them.

Rather it has the more modest goal of defining a better requirements.txt (when that file provides exact versions).

Is that about right?

It seems to me that the key point about a lockfile is that it specifies the exact file to download …

I think none of the existing formats typically specify exact files? (They probably could, by providing only a single valid hash for each package, but they usually don’t.)

i.e. what you consider to be the key point about a lock file is something that none of the other things calling themselves lock files are actually doing!

To my mind this:

confirms that there is danger of folk talking past one another, if we all are carrying different ideas of what lock files are and what they are even for
shows that there is a gap here that could be filled
maybe casts doubt on the value of filling that gap: how come so much existing tooling has not found this necessary or worthwhile?

alicederyn · February 24, 2024, 1:33pm

The alternative can be that everyone gets an ancient version of a dependency just because it dropped support for one platform of ten that you need to run on, or one old version of Python you still want to support for that org in your company that hasn’t been able to migrate their 100 million lines of code yet. Which you prefer may be situation dependent. Or maybe you want both: everything at the same version except that one frustrating edge case?

It sounds as if this lockfile format can support keeping everything at the same version or allowing them to be different, at the discretion of the lockfile generator (and the flags the user passes it)? I think that’s a good thing.

sfoster1 · February 24, 2024, 2:00pm

It sounds as if this lockfile format can support keeping everything at the same version or allowing them to be different, at the discretion of the lockfile generator (and the flags the user passes it)? I think that’s a good thing.

That makes sense to me, and I think that the way to implement that from a lockfile specification perspective is to have all of

(1) specifiers that define “under which situations is this lockfile valid”
(2) specifiers for individual lockfile elements that define “under which situations ought this package be installed”
(3) the ability to specify multiple dists with non-overlapping specifier sets in a specific lockfile
(4) a strong preference or maybe outright “shall” requirement that tools that consume these lock files allow consumers to specify which lockfile to use (as is possible with requirements.txt files, and unlike e.g. Pipenv and Poetry)

This allows tools to pick the specific points in the configuration space where they want to live while staying broadly compatible, and lets people choose which side they want to be on. For instance,

This covers the Poetry folks’ “universal lock” approach as I understand it: they would put an extremely broad (or perhaps empty) set of specifiers in (1), echo each dependency’s dist specifiers onto its dependency line via (2), and ideally detect places where that would leave a platform the user cares about broken, emitting additional copies of the dist requirement via (3) for those situations.
The mousebender approach would emit multiple lockfiles with tight specifiers in (1), without really emitting per-dist specifiers in (2) or multiple dists in (3), and would rely on environment creators to implement (4), either explicitly or by autodetection.
Something like Pipenv splits the difference: its lockfiles specify a Python requirement at the top, as in (1), and then platform requirements for individual dists throughout, as in (2).
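As a rough illustration of how items (1) to (3) could fit together, here is a minimal sketch. Every name in it (`Condition`, `LockedDist`, `Lockfile`, `select`) is hypothetical and not taken from the proposal, and the "specifiers" are simplified to exact matches on marker values rather than real environment markers. Item (4) corresponds to the caller deciding which `Lockfile` object to hand to the installer.

```python
from dataclasses import dataclass, field

# Hypothetical model: an environment is a plain dict of marker values.
Environment = dict[str, str]


@dataclass
class Condition:
    """Required marker values; an empty condition matches every environment."""
    required: dict[str, str] = field(default_factory=dict)

    def matches(self, env: Environment) -> bool:
        return all(env.get(key) == value for key, value in self.required.items())


@dataclass
class LockedDist:
    name: str
    version: str
    filename: str
    when: Condition = field(default_factory=Condition)  # item (2)


@dataclass
class Lockfile:
    valid_for: Condition      # item (1)
    dists: list[LockedDist]   # item (3): the same name may appear more than once

    def select(self, env: Environment) -> list[LockedDist]:
        """Return the dists to install for env, refusing unsupported environments."""
        if not self.valid_for.matches(env):
            raise ValueError("lockfile is not valid for this environment")
        return [dist for dist in self.dists if dist.when.matches(env)]


lock = Lockfile(
    valid_for=Condition({"implementation_name": "cpython"}),
    dists=[
        LockedDist("bar", "1.2.3", "bar-1.2.3-py3-none-any.whl",
                   Condition({"sys_platform": "linux"})),
        LockedDist("bar", "0.9.9", "bar-0.9.9-py3-none-any.whl",
                   Condition({"sys_platform": "win32"})),
        LockedDist("foo", "2.0", "foo-2.0-py3-none-any.whl"),
    ],
)
chosen = lock.select({"implementation_name": "cpython", "sys_platform": "linux"})
```

A "universal lock" in this model is one `Lockfile` with a broad `valid_for` and many conditional dists; the multiple-lockfile approach is several `Lockfile` objects with tight `valid_for` conditions and unconditional dists.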

This also gives space via (3) to implement something that people often request of poetry, pipenv, and other environment creators of having a local-edit mode for specific dependencies, which is quite useful in a monorepo environment. Say you have a monorepo in which multiple packages are developed independently (in python terms - separate packages, separate pyproject.tomls) but depend on each other. When you distribute package A, you want its metadata to require some version-specifier of package B. When you’re doing local development, you really want the environment you use to have B installed editably with whatever version you see. Here, with the addition of some supplemental marker, you’d be able to have your cake and eat it too.

charliermarsh · February 24, 2024, 2:00pm

I understand this distinction but I think it lacks some formalism, maybe? Like, even in this proposal, you’re doing “a resolution” in some sense, since you need to select which lock entry to install, and the files will of course vary across entries but the versions could too.

I’m definitely not advocating for Poetry’s lock file format, but is it so dissimilar? It’s effectively an enumeration of all packages that might be installed, and at install-time, it selects the appropriate subset.

Similarly, I think you could argue that Cargo also does a “resolution” at build time, since features and conditional dependencies aren’t encoded in the Cargo.lock at all.

charliermarsh · February 24, 2024, 2:17pm

I agree that some articulation of use-cases and motivations would be helpful here (though wasn’t sure whether that was typically reserved for the PEP itself).

Yeah, I think my mental model has shifted here over the course of the conversation. I’d been thinking about lockfiles from the perspective of a package manager, but the proposal (in my reading) is centered on installers / reproducible installs, and (in my head, at least) could be framed as: requirements.txt (in the manner that pip-compile uses them), but with (1) standardization, (2) the ability to generate entries for multiple platforms, and (3) an encoding of the inputs to the resolution (which isn’t necessary for reproducible installs but does enable other things). So the primary focus is on creating a reproducible install, and not on the broader workflows and user experiences around package management that would intersect with a lockfile.

(For example: should these be published, to enable workflows like cargo install --locked? Should they be checked-in to source control? Are there other inputs that resolvers would need, that aren’t relevant to installers? What’s the intended workflow for generating and updating these?)

This to me is a plausible outcome (and I agree with much of the preceding text in that post).

pf_moore · February 24, 2024, 2:52pm

To formalise a bit, I mean “dependency resolution”. Maybe given the context here I should have said “installing a lock entry should be possible without needing to consider artefacts that aren’t explicitly listed in the entry”.

To be clear here, I’m assuming that a lockfile is more than just a list of projects with pinned versions. We don’t need a new standard for that. The difference, IMO, is that a lockfile references specific downloadable files, and those files can be checked for validity using hashes included in the lockfile. I’m not demanding reproducible installs here (that’s what killed the last proposal, as sdist builds aren’t reproducible) but I do think that being able to validate in advance what artefacts will be installed is a key point.
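To make the "validate in advance" point concrete, here is a minimal sketch of checking a downloaded artefact against a hash recorded in a lock entry. The entry layout (`filename`, `sha256`) and the helper name are assumptions for illustration, not part of the proposal; only `hashlib` from the standard library is used.

```python
import hashlib


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True if the downloaded bytes match the hash recorded in the lock entry."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Hypothetical lock entry: one specific file plus the hash it must have.
payload = b"pretend this is the content of a wheel"
entry = {
    "filename": "example-1.0-py3-none-any.whl",
    "sha256": hashlib.sha256(payload).hexdigest(),
}

ok = verify_artifact(payload, entry["sha256"])
tampered_ok = verify_artifact(payload + b"tampered", entry["sha256"])
```

Note that this only verifies which artefact was fetched. For an sdist it says nothing about what the build step produces, which is the reproducibility gap mentioned above.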

I haven’t looked at Poetry’s format, but from that description, how is Poetry’s format different from a list of pinned versions? “An enumeration of all packages that might be installed” sounds very broad, and doesn’t sound like it gives any auditability…

To be clear here, I don’t personally have any need for anything more locked down than a “list of pinned versions”. For me, a suitable pip constraint file is perfectly sufficient to get “the same as I got last time” from an install - my days of needing auditable installations are behind me now.

I’ve said it before, but can someone tell me what Poetry, or PDM, lock files provide that isn’t possible with this proposal, and also isn’t possible by supplying a (fully pinned) constraint file and a list of packages to install? I’ll happily ensure that the PEP, when it’s written, addresses any such use case - but I can’t do that unless I know what the use case is!

I couldn’t find documentation for either of the Poetry or PDM lock file formats (and I don’t intend to go reading the source - I’m looking for the design, not the implementation) but I did find an example PDM lock file. From what I can see it looks like essentially just a list of files and their hashes, with no “where to find this” data at all. So as far as I can interpret that, it would be used just to filter the content of a full package index to make only the files recorded in the lockfile visible. I guess that’s a useful thing to be able to do but it’s not what I’d call locking.

cemici · February 24, 2024, 3:05pm

Do constraints files support markers? If yes: then yes, it probably is possible to use that format as a lock file in a way that is approximately analogous to poetry’s or pdm’s lock files.

no “where to find this” data at all

This is not so, at least in Poetry (and almost certainly also in PDM: I’m pretty sure that the PDM lock file is one of the things that it inherited from Poetry and hasn’t much changed). Package sources default to PyPI and are omitted from the lockfile in that case, but other sources are recorded explicitly.

ofek · February 24, 2024, 3:21pm

Is there a technical reason why UV, Poetry or other tools cannot simply have good defaults if the concern is contributors on different platforms? That is also my concern but as I said it seems like an easy thing to support with good defaults so I’m confused why there is a distinction between the different lock file approaches.

charliermarsh · February 24, 2024, 3:29pm

Just to clarify, my assumption was that in Poetry’s lockfile, you don’t need to “go back to the registry” or whatnot to perform an install – that it does record package sources. (If I’m wrong on that, my apologies.) By “An enumeration of all packages that might be installed”, I meant “An enumeration of all distributions (files?) that might be installed”, which is no different than this proposal, right? The lockfile enumerates all the files that might be installed depending on the platform. It’s still auditable, the difference is just in how it goes about selecting which subset to install. I do not believe that Poetry or PDM need to consider artifacts or any information outside of the lockfile to perform the install. At the very least, I don’t believe that the overall format (of a single listing that can include multiple entries for a single package, rather than a listing per platform) is mutually exclusive with a hermetic lockfile.

charliermarsh · February 24, 2024, 3:36pm

Before I reply, do you mind spelling this out in a little more detail? What would those defaults be, and what would the workflow look like for users whose platforms don’t match those defaults?

pf_moore · February 24, 2024, 3:56pm

~~They don’t need to - the resolver evaluates markers based on the requirements/dependencies. The constraints just restrict the package versions to the specified values.~~

Update - they do (thanks @fungi). Sorry for the misinformation, I’d misunderstood your question.

Ah, defaulting to PyPI is what caused me to think they weren’t there. Thanks.

OK. It apparently records the filename and the index. So you do still need to look up the actual URL via the index, but I don’t know if you’d consider that significant. I’m not sure if I do - we’re at the point of discussing auditability now, and I’m not an expert in what’s acceptable.

But index lookup, and filtering it to a list of known files, is a side issue compared to my main point here, which is that if the installer still has to do a full dependency resolution (regardless of whether it’s on the full index or a filtered one) then that’s not (to me, at least) “locking” the solution. And I remain sympathetic to the idea that a “just install these files” style of locking, like this proposal, is a valid and useful thing to want. Maybe there’s also a need for a “restrict the resolver to only considering these files” form of “locking” as a separate proposal, but:

No-one has yet put forward such a proposal, and
If we have two separate concepts, we urgently need to change our terminology so that we’re not calling both of them “locking”.
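The difference between the two models can be sketched as follows. This is a toy illustration only: the function names and data are hypothetical, versions are plain tuples, and the "resolver" ignores transitive dependencies, which a real one would have to walk.

```python
# Hypothetical data: every file recorded in a lockfile, as (project, version, filename).
LOCKED_FILES = [
    ("foo", (2, 0), "foo-2.0-py3-none-any.whl"),
    ("bar", (0, 9, 9), "bar-0.9.9-py3-none-any.whl"),
    ("bar", (1, 2, 3), "bar-1.2.3-py3-none-any.whl"),
]


def install_listed_files(entry_filenames):
    """'Just install these files': the lock entry already names every file."""
    return list(entry_filenames)


def resolve_within_files(requirements, locked_files):
    """'Restrict the resolver to these files': versions are still chosen at install time.

    requirements maps a project name to a predicate over version tuples.
    """
    chosen = []
    for project, accepts in requirements.items():
        candidates = [(version, filename)
                      for name, version, filename in locked_files
                      if name == project and accepts(version)]
        if not candidates:
            raise LookupError(f"no locked file satisfies the requirement on {project}")
        chosen.append(max(candidates)[1])  # highest acceptable version wins
    return chosen


model_a = install_listed_files(
    ["foo-2.0-py3-none-any.whl", "bar-0.9.9-py3-none-any.whl"]
)
model_b = resolve_within_files(
    {"foo": lambda v: True, "bar": lambda v: v < (1, 0)},
    LOCKED_FILES,
)
```

Both calls end up with the same files here, but only the first can be audited by reading the entry alone; the second depends on the requirements and the selection logic applied at install time.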

@brettcannon my view (speaking as potential PEP delegate) is that your proposal is based around the idea of a lockfile that says “here’s a list of distribution files - download and install them”, and is focused on auditability and a limited form of reproducibility (where by “limited” I mean “sdists muddy the water a bit”). And that you consider the “here’s a list of distribution files - go and resolve the stated requirements using only what’s on this list” approach as out of scope. I’m happy to support that position, on the understanding that:

You explicitly describe in the PEP the two models, and you clearly state which one your proposal is addressing.
You come up with some formal terminology to distinguish the two cases. I’m not so naive as to think that we’ll stop people calling them both “lockfiles” but I want the PEP at least to address the question of “how do we talk about this” precisely.

If someone wants to propose a “list the files you apply a resolver to” specification, I’ll happily consider that as an independent proposal. I’m not ecstatic about having two separate standards here, but I don’t want this work to fail yet again because we’re talking about multiple different things and never getting consensus or clarity as a result. Also, if someone has an idea for a “merged” approach which somehow addresses both models, please speak up! But be prepared for pushback - I’m pretty certain that trying to reconcile these two different views is what’s caused all of the previous proposals to fail.

@charliermarsh - thanks for pushing this question. I think we’ve spotted an important distinction here that we wouldn’t otherwise have identified, and that’s a significant step.

ofek · February 24, 2024, 4:06pm

The 4 default targets would be (in essence):

aarch64-unknown-linux-gnu
x86_64-unknown-linux-gnu
x86_64-pc-windows-msvc
aarch64-apple-darwin
In future (or immediately?), add Windows ARM64

This covers the vast majority of deployment targets and also developer machines. I would estimate over 99%, likely even higher.

If the application is targeting something specific then the maintainers can add to the default targets or override them outright if they want to for some reason.

If a contributor comes along with a setup that is not locked, maybe macOS with the deprecated Intel chips, then lock operations would add the target as usual and maintainers can decide whether or not they want to accept it. They most often would, so I would assume further lock operations would treat the targets that exist in the file as the new defaults.
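The workflow described above might look something like this sketch. The function name and its parameters are hypothetical, and `x86_64-apple-darwin` is used only as an illustrative name for the Intel macOS case; no existing tool is claimed to behave this way.

```python
# The four default targets listed above.
DEFAULT_TARGETS = (
    "aarch64-unknown-linux-gnu",
    "x86_64-unknown-linux-gnu",
    "x86_64-pc-windows-msvc",
    "aarch64-apple-darwin",
)


def targets_to_lock(existing_targets=(), extra_targets=()):
    """Pick the targets for a lock operation.

    With no existing lock file the defaults apply; otherwise the targets already
    recorded in the file act as the new defaults. Extra targets (for example the
    platform of a new contributor) are appended without duplicates.
    """
    base = list(existing_targets) or list(DEFAULT_TARGETS)
    for target in extra_targets:
        if target not in base:
            base.append(target)
    return base


first_lock = targets_to_lock()
after_contribution = targets_to_lock(first_lock, ["x86_64-apple-darwin"])
later_relock = targets_to_lock(after_contribution)
```

Once the extra target has been accepted into the file, `later_relock` keeps it without anyone having to ask for it again.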

alicederyn · February 24, 2024, 4:11pm

“Environment lockfile” vs “requirement lockfile”?

cemici · February 24, 2024, 4:13pm

Right, but if different environments give different solutions, then I would need markers in my constraints to describe that

For example, if my dependency foo has a dependency on bar < 1.0 ; python_version < "3.8" and another on bar >= 1.0 ; python_version >= "3.8", which version of bar should I put in the constraints file?

I would want to put both 0.9.9 and also 1.2.3 (say), distinguishing them by markers.
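A sketch of what that could look like, with the two pins written as marker-qualified constraint lines and a deliberately tiny evaluator. The parser below is a toy that only understands `==` pins and `python_version` markers using `<` or `>=`; real tools would evaluate markers with something like the `packaging` library instead.

```python
import operator

# The two pins from the example above, distinguished by markers.
CONSTRAINTS = [
    'bar==0.9.9 ; python_version < "3.8"',
    'bar==1.2.3 ; python_version >= "3.8"',
]

# Toy subset of marker operators, enough for this example only.
OPS = {"<": operator.lt, ">=": operator.ge}


def version_tuple(text):
    """Turn '3.8' into (3, 8) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pinned_version(constraints, project, python_version):
    """Return the pin for project whose python_version marker matches, or None."""
    for line in constraints:
        requirement, marker = (part.strip() for part in line.split(";"))
        name, pin = requirement.split("==")
        _, op, bound = marker.split()
        matches = OPS[op](version_tuple(python_version),
                          version_tuple(bound.strip('"')))
        if name == project and matches:
            return pin
    return None


old_python_pin = pinned_version(CONSTRAINTS, "bar", "3.7")
new_python_pin = pinned_version(CONSTRAINTS, "bar", "3.12")
```

Comparing version tuples rather than strings matters here: as strings, "3.12" would sort before "3.8".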

But I think this digression into “what does poetry do?” is just that: a digression. I am much more interested in what this proposal does and the ways in which it improves - or not - on the use cases that it is actually aimed at!

pf_moore · February 24, 2024, 4:19pm

One other thought here (specifically in regard to the “tags” and “markers” items in the proposal) is that once an architecture has been added to a lockfile, updates can read the tags/markers from the lockfile and won’t need access to the original architecture.

It may also be useful for this type of operation for someone to maintain a registry of names that match to sets of marker/tags values for common architectures. That way, tools could allow the user to say --lock-for Intel_macOS and look the appropriate tags up somewhere, rather than needing access to an interpreter to query them.
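Such a registry could be as simple as a mapping from a name to marker and tag values. The sketch below is hypothetical: the registry names, its layout, and the helper function are made up for illustration, and the marker and tag values are examples rather than a complete description of either platform.

```python
# Hypothetical registry: names and layout are illustrative, not an agreed standard.
PLATFORM_REGISTRY = {
    "Intel_macOS": {
        "markers": {"sys_platform": "darwin", "platform_machine": "x86_64"},
        "tags": ["macosx_10_9_x86_64"],
    },
    "ARM_macOS": {
        "markers": {"sys_platform": "darwin", "platform_machine": "arm64"},
        "tags": ["macosx_11_0_arm64"],
    },
}


def lookup_lock_target(name):
    """Resolve a --lock-for style name to markers and tags without an interpreter."""
    try:
        return PLATFORM_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(PLATFORM_REGISTRY))
        raise ValueError(f"unknown lock target {name!r}; known targets: {known}") from None


intel = lookup_lock_target("Intel_macOS")
```

A locker given `--lock-for Intel_macOS` could then evaluate markers and filter wheels against `intel["markers"]` and `intel["tags"]` instead of querying a running interpreter on that platform.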

fungi · February 24, 2024, 4:32pm

Do constraints files support markers?

They do. Here’s an example from a constraints file being used as a makeshift lockfile with package versions differing across interpreter versions:

johnthagen · February 24, 2024, 4:55pm

There are also still many x86_64 Macs in the wild, but that proportion will of course decrease over time.

samypr100 · February 24, 2024, 5:13pm

I’d say at a minimum Tier 1 / Tier 2 per PEP 11 (CPython platform support).

ofek · February 24, 2024, 5:21pm

This is a good idea, thanks! I personally have somewhat of an issue with perpetual support of Windows 32-bit but I understand that’s not a discussion for here.

Therefore, my view is now that the defaults when there is no existing lock file should strive to match CPython tiers 1 & 2 within reason (e.g. I don’t know much about WASI).

sirosen · February 24, 2024, 5:31pm

I believe that poetry takes this approach because it supports dependency groups, which are included in the lock. So the singular lock file may be used by multiple distinct requests from the user to install collections of packages.

(This is all based on my understanding as a user, not a maintainer, so sorry if some detail here is wrong.)

It’s not totally outside of the bounds of this proposal, as this use case can be satisfied by creating distinct pylock files for each combination of dependency groups. But trying to solve it in one file is definitely different.

In terms of terminology, I would call the poetry-style lock a “boundary”. i.e. Tools may be asked to do a limited solve “within this declared boundary”. By creating a sufficiently narrow boundary declaration, poetry achieves locking semantics in most or all cases.

There might be other nice terms, but that’s how I understand it, at least.