Lock files, again (but this time w/ sdists!)

ncoghlan · February 28, 2024, 4:39am

If the directory approach was used, then the different dependency installation scenarios could be different directories with a common prefix (pylock.prod, pylock.dev, pylock.tests, pylock.doc, etc). That way you’d retain the ability to use regular file diffing tools to hunt for unexpected differences between environments.

mdrissi · February 28, 2024, 5:16am

I’m currently a pip-tools user and am satisfied with each package having hashes and one version, knowing that when I run install I get precisely that version with one of the hashes matching. I currently run pip-compile on ~80 dependencies, producing a requirements.txt with ~400 dependencies. That file is used on both mac x86 and linux x86.

In theory that shouldn’t work, but in practice it’s very close even with such a large dependency list. The difference between compiling for mac vs linux ended up being only 2 or 3 packages that were mac-specific transitive dependencies. While unnecessary for linux, those dependencies did install fine there. My simple trick was to add them as top-level requirements, and then both environments produced a consistent lock. Usually when I see a library have a dependency that depends on platform/Python version, it’s a question of whether that dependency is added, not a choice between several. Backports such as typing-extensions or importlib-resources are one common case. Something like,

typing-extensions; python_version <= "3.10"

But it’s mostly harmless to install typing-extensions on 3.11 so adding it as a direct dependency in practice makes the file cover more environments.

The truly problematic case would be a dependency that is required for one environment but is incompatible with, and can’t even be installed in, another environment. My experience/luck so far is that this seems rare.

So overall I’d be very happy with a one-environment lockfile standard. My practical experience is that for a normal-ish or even moderately large number of dependencies, a few lockfiles for specific environments end up working in a lot more environments in practice.

edit: I also recently tried upgrading to Python 3.10 for that codebase. The 3.9 and 3.10 pip-tools lock files differed by only 1 dependency (importlib-metadata), which is harmless to install on 3.10 even though it is unneeded.
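The trick described in this post amounts to taking the union of the per-environment pin sets and checking that no package ends up pinned to two different versions. A minimal sketch, assuming each compiled requirements file has already been parsed into a name-to-version mapping. The `union_of_pins` helper and the pins shown are hypothetical, and this is not a description of how pip-tools works internally:

```python
def union_of_pins(*pin_sets):
    """Merge per-environment pins; fail if any package has conflicting versions."""
    merged = {}
    for pins in pin_sets:
        for name, version in pins.items():
            # setdefault keeps the first version seen for each package.
            if merged.setdefault(name, version) != version:
                raise ValueError(
                    f"{name} pinned to both {merged[name]} and {version}"
                )
    return merged


# Hypothetical results of compiling the same inputs on two platforms.
linux_pins = {"requests": "2.31.0", "idna": "3.6"}
mac_pins = {"requests": "2.31.0", "idna": "3.6", "appnope": "0.1.4"}

combined = union_of_pins(linux_pins, mac_pins)
print(sorted(combined))
```

The union only works as a shared lock when every extra package is installable everywhere, which is the "mostly harmless" assumption made above. A version conflict raises an error instead of being silently resolved.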

alicederyn · February 28, 2024, 12:01pm

I think this would be an improvement if this is the intended goal, for sure. I think there’s potential here for combining the two use-cases into a single file, if you made the tag/marker specifiers per-dependency instead of a top-level split. But ruling it out completely by splitting the files makes this proposal a lot tighter.

I’d like to suggest the name “installation snapshot” instead of lockfile.

pf_moore · February 28, 2024, 12:41pm

This would transfer the burden of working out which of multiple lock entries was the most appropriate for the current platform from the installer onto the user. My gut feeling is that doing this would be prioritising the convenience of the creator of the lockfile (and maybe the locking tool implementer) over the user of the lockfile. I’m not sure that’s a good trade-off.

This might address my concern, but it feels like it’s veering into “defining tool UI in a standard”. The standard can say “lockfiles contain tags/markers which installers can use like so to validate whether the file is appropriate for the target environment”, but saying “installers should allow users to provide a list of lockfiles and choose the best one from that list” seems to me to be going a step too far.

On the other hand, “lockfiles contain a list of entries, all marked with tags/markers to define what environment they apply to, and it is up to the installer to pick one of the compatible entries for the environment” seems perfectly reasonable for a standard to say.

alicederyn · February 28, 2024, 1:11pm

Why not just pass a directory to the installer? It can work out which lockfile/snapshot is which from the metadata.

tmk · February 28, 2024, 1:17pm

This seems fine to me. The locking tool could have a default list of the most common platforms and if you want to deviate from that, you need to specify the platforms you want to lock on manually.

I often know my project will only be used on macosx_11_0_arm64 and manylinux_2_17_x86_64 and so most of the information in poetry.lock is actually pointless.

As others have stated, I would prefer a single file.

steve.dower · February 28, 2024, 1:49pm

I like the conceptual separate file for each environment, but also incline towards the usability and auditability of a single file.

Is there a model where conceptually there is one lock per “target” [1], essentially independent lockfiles under their own filenames, but a process to merge them into a single file? So probably a new property that means an installer can trivially filter out entries that don’t apply to the current target (however that may have been inferred/selected).

This would allow “compression”, where identical entries between the separate locks are included once for multiple targets, making it easier to see when one target differs from others and more obvious in an audit when one platform changes but others do not. And it saves inventing a directory-based scheme. I don’t think we need to invent all the target names ahead of time - recommending installers “use the entries targeting the current wheel tag by default, or a user-specified custom value” ought to cover it. (Which might mean that all the none-any packages get every possible value in the file listed, but I expect these to be user-requested/locker-inferred explicitly, so the list will be known at lock time. Or maybe an empty filter just means “always include”.)

Splitting a merged file into separate ones just requires enumerating all the target filters and creating one unfiltered lock file for each. Combining those files back together ought to be a perfect round-trip.

But the key is that the installer isn’t doing anything more complicated than if it had to choose one file from many. It’s just filtering on contents instead of reading from a different path. It doesn’t have to re-resolve markers or dependencies or anything. It’s just the “independent files” approach saved into a single file.

Later, adding an example:

Imagining one file per target platform/version/whatever:

# lockfile 1 for cp312-win_amd64
[spam]
version=1.0
hash=abc123

[eggs]
version=0.5
hash=xyz789

# lockfile 2 for cp312-win32
[spam]
version=1.0
hash=abc123

[eggs]
version=0.5
hash=xyz789

# lockfile 3 for cp312-manylinux_2_17_x86_64
[spam]
version=1.0.1
hash=efg123

[eggs]
version=0.5
hash=xyz789

Combine them with an added targets property. Then the installer can filter by targets rather than selecting a different filename (though of course I’d still like to be able to provide a filename, but the default will satisfy many more cases):

# combined file
[spam]
targets=cp312-manylinux_2_17_x86_64
version=1.0.1
hash=efg123

[spam]
targets=cp312-win32+cp312-win_amd64
version=1.0
hash=abc123

[eggs]
version=0.5
hash=xyz789

(Note that this is a massively simplified example and not a specification. Just trying to provide more than one explanation of what I’m getting at in the hope that it helps more people understand.)
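The merge, filter, and split steps described in this post can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory layout, the function names (`merge`, `entries_for`, `split`), and the choice to list targets explicitly on every entry (rather than treating an empty filter as “always include”) are all hypothetical and not part of any proposal.

```python
from collections import defaultdict

# One lock per target: {target: {package: (version, hash)}}
per_target = {
    "cp312-win_amd64": {"spam": ("1.0", "abc123"), "eggs": ("0.5", "xyz789")},
    "cp312-win32": {"spam": ("1.0", "abc123"), "eggs": ("0.5", "xyz789")},
    "cp312-manylinux_2_17_x86_64": {
        "spam": ("1.0.1", "efg123"),
        "eggs": ("0.5", "xyz789"),
    },
}


def merge(locks):
    """Collapse identical entries, recording which targets each applies to."""
    targets_by_entry = defaultdict(set)
    for target, packages in locks.items():
        for name, (version, digest) in packages.items():
            targets_by_entry[(name, version, digest)].add(target)
    return [
        {"name": n, "version": v, "hash": h, "targets": frozenset(t)}
        for (n, v, h), t in sorted(targets_by_entry.items())
    ]


def entries_for(combined, target):
    """What an installer would do: filter on contents, no resolving."""
    return {
        e["name"]: (e["version"], e["hash"])
        for e in combined
        if target in e["targets"]
    }


def split(combined):
    """Recover one unfiltered lock per target."""
    all_targets = set().union(*(e["targets"] for e in combined))
    return {t: entries_for(combined, t) for t in all_targets}


combined = merge(per_target)
assert split(combined) == per_target  # merging then splitting round-trips
```

Six per-target entries collapse to three combined ones here, and the only package with more than one entry (spam) is exactly the one that differs between targets.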

[1] A made-up term encapsulating the entire stack from the Python runtime on down, including Python version, OS, and any other presumptions the user/locker wants to infer under one label/filename.

h-vetinari · February 28, 2024, 2:02pm

The above example from @jvolkman seems to cover this, albeit in one file.

ncoghlan · February 28, 2024, 2:31pm

Intrinsic de-duplication as a locking requirement would substantially reduce the scope of the auditing problem, too.

brettcannon · February 28, 2024, 11:15pm

Because you may have different lock files w/ different files that were locked. Think of a lock file for each dependency group ala PEP 735.

What use case do you feel is being ignored?

Correct.

How long would you expect to wait? I don’t know how long it’s been since Poetry tweaked their file format (maybe @radoering knows?).

So are you suggesting to not group by environment requirements and instead embed the requirements per file? Do you think that would help or hurt auditing?

So there’s either having a common section of files that applies to every lock entry but keeping the separate lock entry tables as currently proposed, or there’s putting the requirements on each file and having a linear list. Which would you find easier to audit?

mikeshardmind · February 28, 2024, 11:32pm

This specification is good for those wanting reproducible environments for applications, but allowing multiple versions to have hashes provided that a solver should constrain itself to would be useful for administering development environments. While the ideal in such settings is that people run their own index and do not allow direct use of external ones, the reality is that this is significantly more friction than having the installer be constrained, and it is often neglected for development environments, even in situations where developers have been targeted. I see a lot of potential benefit when it comes to defense-in-depth from a standardized format that allows more than just reproducing a specific environment.

Tools that intend to operate only in the case you are considering could use this format and only emit a set of hashes that has one solution. This would still have all the benefits that already exist under the proposal while allowing better defense-in-depth measures to leverage the same work and format, and would result in fewer things to be reviewed.

ncoghlan · February 29, 2024, 2:10am

The first one, as the mere existence of target environment specific sections in the emitted lock file would be enough to indicate that there are environment specific dependencies in the resolution tree.

And when you’re subsequently checking to see if the differences are expected and acceptable, you’re only looking at the files that differ, rather than wading through everything.

However, I’m less sure about the better way to handle the files that do differ between environments. Maybe it would make sense to stick with the spirit of the environment marker approach and have the general structure be:

table of unconditional dependencies (no variation between targets)
table of target environment IDs in preference order with their associated environment markers
table of conditional dependencies, with each entry in this section:
- describing a distribution dependency that is only sometimes installed
- listing the target environment IDs where it would be installed (using “*” to indicate presence in all environments, just potentially varying in exact version)
- listing the full details of the distribution versions installed in at least one environment (with the IDs of those environments given - no shorthand at this level)

That gives:

an explicit list of expected target environments (the table of environment IDs)
a way to choose the dependency set to install (first entry in the target table where all the environment markers match the installation environment)
an easy way to tell there are differences between targets (conditional dependencies table has entries)
an easy way to tell if a conditional dependency merely varies in exact version or may be absent entirely (whether the distribution level target entry is * or not)
a reasonably straightforward way to check if the differences across targets for a particular distribution are expected and acceptable

The minimal requirements for a locker would be to be able to generate independent lock results for each defined target environment, and then merge those into the above format. Smarter lockers might be able to generate the desired result directly without multiple locking passes.
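To make the three-table structure more concrete, here is a rough sketch of how an installer might pick a target and collect its pins. Everything in it is an assumption for illustration: the key names, the target IDs, and especially the marker matching, which is reduced to plain equality checks on a dict instead of real environment marker evaluation.

```python
lock = {
    # No variation between targets.
    "unconditional": {"eggs": "0.5"},
    # In preference order: earlier entries win.
    "targets": {
        "win-cp312": {"sys_platform": "win32", "python_version": "3.12"},
        "linux-cp312": {"sys_platform": "linux", "python_version": "3.12"},
    },
    "conditional": {
        "spam": {
            "targets": ["*"],  # present everywhere, version may vary
            "versions": {"1.0": ["win-cp312"], "1.0.1": ["linux-cp312"]},
        },
        "colorama": {
            "targets": ["win-cp312"],  # absent on other targets
            "versions": {"0.4.6": ["win-cp312"]},
        },
    },
}


def select_target(lock, env):
    """Return the first target whose markers all match the environment."""
    for target_id, markers in lock["targets"].items():
        if all(env.get(key) == value for key, value in markers.items()):
            return target_id
    raise LookupError("no target in the lock file matches this environment")


def pins_for(lock, target_id):
    """Unconditional pins plus the conditional versions tagged with target_id."""
    pins = dict(lock["unconditional"])
    for name, entry in lock["conditional"].items():
        for version, target_ids in entry["versions"].items():
            if target_id in target_ids:
                pins[name] = version
    return pins


env = {"sys_platform": "linux", "python_version": "3.12"}
target = select_target(lock, env)
print(target, pins_for(lock, target))
```

An auditor only needs to read the `conditional` table to see every difference between targets. If that table is empty, all targets install the same thing.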

As far as the spec is concerned, the keys in the table of target environment IDs would be arbitrary (modulo conventional PyPA legal identifier normalisation), but I see at least a few likely origins:

locker accepts a full target table as an input (e.g. via its tools section in pyproject.toml)
locker has its own default target table (e.g. based on Python version + wheel tags)
locker is relocking an existing lock file

The difference between the Poetry/PDM dependency constraints use case and the comprehensive locking use case would show up in the target environment lists for individual versions in the conditional dependency table.

For comprehensive locking, each target environment must be listed against at most one version of each conditional dependency (not being listed at all is fine). Violating that rule would be an error at both lock time and installation time, since the exact file to install in that environment is ambiguous without re-running the dependency resolution for a more exact target.

For the dependency constraints use case, having multiple versions flagged as valid for a given environment is fine, since the installation process wouldn’t need to check those anyway. For that use case, the lock file would only be used to get the set of acceptable dependencies and their respective versions, the installer wouldn’t care about the list of target environments. (It may even make sense to suggest “resolve”, with no associated environment markers, as the conventional name for using the lock file format to describe dependency constraints)
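The “at most one version per target” rule is easy to check mechanically. The sketch below assumes a hypothetical layout in which the conditional table maps distribution name to version to a list of target environment IDs. The function names and sample data are invented for illustration.

```python
# Hypothetical layout: distribution -> version -> target environment IDs.
conditional = {
    "spam": {"1.0": ["win-cp312"], "1.0.1": ["linux-cp312"]},
    "ham": {"2.0": ["linux-cp312"], "2.1": ["linux-cp312", "win-cp312"]},
}


def ambiguous_targets(conditional):
    """Return (distribution, target_id, versions) wherever a target has >1 version."""
    problems = []
    for name, versions in conditional.items():
        seen = {}
        for version, target_ids in versions.items():
            for target_id in target_ids:
                seen.setdefault(target_id, []).append(version)
        problems.extend(
            (name, target_id, sorted(found))
            for target_id, found in seen.items()
            if len(found) > 1
        )
    return problems


def check_lock(conditional, comprehensive=True):
    """Reject ambiguity for comprehensive locks; tolerate it for constraints."""
    problems = ambiguous_targets(conditional)
    if comprehensive and problems:
        raise ValueError(f"ambiguous lock entries: {problems}")
    return problems


print(check_lock(conditional, comprehensive=False))
```

With `comprehensive=True` the same data raises an error, because `ham` lists `linux-cp312` against both 2.0 and 2.1. For the constraints use case that overlap is simply reported and allowed.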

ncoghlan · February 29, 2024, 4:40am

Thinking about the conditional dependencies a bit more, that may need three tiers of breakdown:

by distribution name
then by version
then by exact artifact

A single version with multiple artifacts would then be a common outcome for distributions with binary extensions.

It would probably also make sense to allow entries in the unconditional section to still list target dependent artifacts, so only cases where distribution versions may differ, or the dependency may be absent entirely, appear in the conditional table.
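As a rough picture of the three tiers, the conditional table could be nested by distribution name, then version, then artifact, with target IDs attached at the artifact level. The layout, target IDs, and the `artifacts_for` helper below are assumptions made up for this sketch.

```python
conditional = {
    "spam": {  # tier 1: distribution name
        "1.0.1": {  # tier 2: version
            # tier 3: exact artifact, with the targets it applies to
            "spam-1.0.1-cp312-cp312-win_amd64.whl": ["win-cp312"],
            "spam-1.0.1-cp312-cp312-manylinux_2_17_x86_64.whl": ["linux-cp312"],
        },
    },
    "colorama": {
        "0.4.6": {"colorama-0.4.6-py2.py3-none-any.whl": ["win-cp312"]},
    },
}


def artifacts_for(conditional, target_id):
    """Walk name -> version -> artifact and keep artifacts tagged for target_id."""
    return [
        (name, version, filename)
        for name, versions in conditional.items()
        for version, artifacts in versions.items()
        for filename, target_ids in artifacts.items()
        if target_id in target_ids
    ]


print(artifacts_for(conditional, "linux-cp312"))
```

Here spam shows the common case mentioned above: a single version with one artifact per platform.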

brettcannon · February 29, 2024, 10:26pm

That’s specifically not a goal of my proposal. If you would like to see something like that you will probably have to write your own PEP unless some compromise comes up that everyone is happy with.

That’s what I thought as well.

But if you have multiple versions for an environment then won’t you have to run a resolver to figure out what version to install? And are you assuming you’re listing all files for a version, or are you locking to a specific file? I feel like what you’re describing here is somewhat of a take on Poetry’s lock file, but w/ an optional way to mark individual files as belonging to a specific environment lock.

Yeah, I feel like this is a take on Poetry where you have one more level of specificity representing the environment lock that applies to a specific file. Am I wrong in my understanding?

mikeshardmind · February 29, 2024, 10:59pm

I don’t really have the time to do more than advocate for it and hope someone is interested at this point in time. I believe it would be a small change that allows more potential uses, but realistically not something I’d champion right now. Having had the time to give it more thought, the format is versioned so revisiting it later to extend this capability into the format without harming the existing case can be done later. I’m mostly just interested in ensuring the two use cases can use the same data to reduce the amount of work done overall (by tools that need to parse this, by those needing to audit changes, etc). I don’t think this concern is worth holding up the process with all of that in mind, and I appreciate the effort you’ve gone into with addressing much of this.

ncoghlan · March 1, 2024, 12:13am

Not wrong. With the target environment ID idea, it’s possible that the markers given might not be enough to specify a comprehensive lock outcome, so you’d end up with the same environment ID listed against multiple artifacts, and potentially multiple versions for the same distribution.

I think that’s OK at the format level, though.

Making up some terminology:

Artifact installer: extracts a flat list of exact artifact references from a lock file and installs them
Resolving installer: uses the lock file as a set of constraints on the distributions, versions, and artifacts that can be installed (~~note: to benefit from the metadata publishing server API, would need version metadata hashes in the lock format, not just artifact hashes~~ Scratch that, PyPI prevents changing existing files, it just lets you add new files, so any discrepancies in results will be picked up via the artifact hashes)
Comprehensive lock: environment ID spec that includes enough environment markers to resolve to one artifact per distribution (e.g. exact Python version and target platform wheel tag)
Partial lock: environment ID spec that isn’t exact enough to resolve to exactly one artifact per distribution (e.g. just a minimum Python version)

The distinction between the lock types could even be made explicit in the lock file format, by having a “comprehensive” boolean flag in the table of target environment definitions (defaulting to “true” if omitted) in addition to the environment markers that define the target environment. (Note that “partial = true” would be a reasonable and shorter alternative to “comprehensive = false”, but I think either spelling would be clear enough to be acceptable in a format spec)

An artifact installer would ignore any defined partial locks and only install the artifacts tagged with the first matching comprehensive lock in the target environment table. It would error out if it couldn’t find an applicable comprehensive lock in the lock file.

A resolving installer would instead build a list of all the matching target environments, distributions, versions, and artifacts in the lock file, and use them to build both a set of input requirements and the set of acceptable outcomes for the resolution process.
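The difference between the two installer types can be sketched as two functions reading the same data. This is a simplification under several assumptions: the lock layout and key names are hypothetical, marker matching is reduced to equality checks on a dict, and the “resolving” side only gathers the allowed versions without running an actual resolver.

```python
lock = {
    "targets": [  # in preference order
        {"id": "linux-cp312",
         "markers": {"sys_platform": "linux", "python_version": "3.12"}},
        {"id": "any-linux",
         "markers": {"sys_platform": "linux"},
         "comprehensive": False},  # partial lock
    ],
    "artifacts": [
        {"name": "spam", "version": "1.0.1",
         "file": "spam-1.0.1-cp312-cp312-manylinux_2_17_x86_64.whl",
         "targets": ["linux-cp312", "any-linux"]},
        {"name": "spam", "version": "1.0",
         "file": "spam-1.0.tar.gz",
         "targets": ["any-linux"]},
        {"name": "eggs", "version": "0.5",
         "file": "eggs-0.5-py3-none-any.whl",
         "targets": ["linux-cp312", "any-linux"]},
    ],
}


def matches(markers, env):
    return all(env.get(key) == value for key, value in markers.items())


def artifact_install_list(lock, env):
    """Artifact installer: first matching comprehensive target, flat file list."""
    for target in lock["targets"]:
        if target.get("comprehensive", True) and matches(target["markers"], env):
            return [a["file"] for a in lock["artifacts"]
                    if target["id"] in a["targets"]]
    raise LookupError("no applicable comprehensive lock")


def resolver_constraints(lock, env):
    """Resolving installer: every matching target contributes allowed versions."""
    matching = {t["id"] for t in lock["targets"] if matches(t["markers"], env)}
    allowed = {}
    for a in lock["artifacts"]:
        if matching & set(a["targets"]):
            allowed.setdefault(a["name"], set()).add(a["version"])
    return allowed


env = {"sys_platform": "linux", "python_version": "3.12"}
print(artifact_install_list(lock, env))
print(resolver_constraints(lock, env))
```

On a Linux machine running Python 3.11, `artifact_install_list` would raise because only the partial lock matches, while `resolver_constraints` would still return the allowed versions for a resolver to choose from.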

brettcannon · March 2, 2024, 4:38am

Alyssa Coghlan:

Making up some terminology:

Artifact installer: extracts a flat list of exact artifact references from a lock file and installs them

Resolving installer: uses the lock file as a set of constraints on the distributions, versions, and artifacts that can be installed (~~note: to benefit from the metadata publishing server API, would need version metadata hashes in the lock format, not just artifact hashes~~ Scratch that, PyPI prevents changing existing files, it just lets you add new files, so any discrepancies in results will be picked up via the artifact hashes)

Comprehensive lock: environment ID spec that includes enough environment markers to resolve to one artifact per distribution (e.g. exact Python version and target platform wheel tag)

Partial lock: environment ID spec that isn’t exact enough to resolve to exactly one artifact per distribution (e.g. just a minimum Python version)

That’s the same conclusion I came to with your idea, except w/o the partial lock idea. What I was thinking was an environment has an unambiguous lock or the lock shouldn’t exist. That doesn’t preclude using the constraints in the file to resolve what to install.

I don’t quite see the benefit of a partial lock. I understand the benefit of a full lock and resolving what to install with what’s in the file being treated as constraints, but the halfway point I’m not seeing a purpose for.

I’ll have a think about this and what might need to be specified in the file to make this all work. I think it’s tenable, but specifying exactly what is (not) expected by installers and lockers will be rather important in this situation. Probably the biggest question is: would this mean people would want to have fewer, more strict environment locks if they can fall back to resolving?

groodt · March 2, 2024, 5:26am

+1 for still being interested in the unambiguous environment locks if you do decide on a format for specifying constraints in the same PEP.

radoering · March 2, 2024, 9:10am

That’s a good example of a use case why people (my company included) switch from pipenv to poetry.

Not exactly. Poetry builds intersections, unions, differences, … of marker conditions. For example, the intersection of python_version < "3.10" and python_version > "3.10" is “empty” and the union is python_version != "3.10". That way, we are able to determine “equivalent” environments (which can be satisfied by the same package versions).
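The marker arithmetic in that example can be checked with ordinary set operations if each marker is modelled as the set of Python versions that satisfy it. This is only a sketch of the idea, not Poetry’s implementation: it assumes a small, fixed list of known versions (`KNOWN`) and a hypothetical `satisfying` helper, whereas a real tool would work on the constraints symbolically.

```python
import operator

# Assumed finite universe of Python versions for the illustration.
KNOWN = ("3.8", "3.9", "3.10", "3.11", "3.12")


def as_tuple(version):
    """'3.10' -> (3, 10), so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def satisfying(op, bound):
    """Set of known versions v for which `v op bound` holds."""
    return frozenset(v for v in KNOWN if op(as_tuple(v), as_tuple(bound)))


below = satisfying(operator.lt, "3.10")      # python_version < "3.10"
above = satisfying(operator.gt, "3.10")      # python_version > "3.10"
not_equal = satisfying(operator.ne, "3.10")  # python_version != "3.10"

print(below & above)                 # intersection: no version satisfies both
print((below | above) == not_equal)  # union matches the != marker
```

Two environments whose marker sets select the same package versions can then be treated as equivalent, which is the property described above.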

changes per Poetry version (to the best of my knowledge):

1.1: dunno, that was before my time
1.2: some changes regarding normalized names and pretty names of packages and extras
1.3: major change: store file hashes per package version / lock file entry instead of per package name
1.3: add “@generated” comment at beginning of lock file
1.4: add version of poetry in “@generated” comment
1.4: sort extra dependencies to avoid unnecessary diffs
1.5: drop a field that has not been used for a long time

current version: 1.8

It looks like the format has stabilized over the past year. However, if we want to avoid re-resolving at install time (no plans so far) we have to make at least two (maybe more) changes/additions:

add the resulting marker condition to each locked package / version
add the relevant dependency groups to each locked package / version

alicederyn · March 2, 2024, 9:26am

This sounds like the “compressed” combinatorial combined lockfile we’ve been describing. Is there any other difference between the original proposal and a Poetry lockfile with this extra metadata other than:

Size
Support for restricting the file to only specified target environments?