Lock files, again (but this time w/ sdists!)

brettcannon · February 22, 2024, 2:04am

Two years since PEP 665 was rejected and three years since I started working towards some lock file solution, I present my next (and last regardless of outcome) attempt at coming up with a lock file standard.

Terms

“platform”: OS plus CPU
“environment”: interpreter plus platform
“distribution”: in the PyPA spec sense, i.e. a project
“lock entry”: a set of distribution files locked for a specific environment
“lock file”: a set of lock entries for a specific set of dependency specifiers from a specific set of indexes

Goals

An environment lock file standard for PyPA specifications (i.e. I’m not trying to accommodate conda, and this is not what poetry.lock provides, which you could consider a boundary/constraint file)
The ability to lock for multiple environments simultaneously for the same set of dependency specifiers
You can update all entries in a lock file regardless of the platform you are running on (i.e. all inputs into the resolver are recorded in the lock file)
You can have different lock files for different reasons
Installation involves determining which lock entry best applies to an environment, then it’s a linear install of all distribution files in the lock entry, i.e. no evaluation as to whether an individual distribution file should be installed once a lock entry is chosen, so no SAT solver is necessary
The file format is human-readable for ease of e.g. auditing diffs, but is machine-writable (i.e. not meant to be written by hand)
Not meant to introduce new packaging concepts outside of lock files themselves
Make the sdist people happy so this gets accepted

Spec

Lock files should be written out to a pylock.*.toml file w/ a label to help identify the lock file’s purpose. The file format is TOML.
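As a rough illustration of the naming scheme, a tool could pull the label out of the file name like this. This is only a sketch: the function name `lock_file_label` is made up, and treating the label as a single dot-free segment is my assumption, not something the spec states.

```python
import re
from pathlib import PurePath

# Assumption: the label is one dot-free segment between "pylock." and ".toml".
_PYLOCK_RE = re.compile(r"^pylock\.([^.]+)\.toml$")


def lock_file_label(path):
    """Return the label of a pylock.<label>.toml file, or None if the name doesn't match."""
    match = _PYLOCK_RE.match(PurePath(path).name)
    return match.group(1) if match else None


print(lock_file_label("pylock.dev.toml"))
print(lock_file_label("poetry.lock"))
```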

The allowed keys in the file are listed below (all keys are required unless specified as optional). A TOML file with this spec written out can be found at pylock.spec.toml on the resolve branch of brettcannon/mousebender on GitHub. Examples of lock files are listed later on.

Anything w/ a means it’s a contentious key.

`meta`

Metadata version of the file; initially “1.0”. Since this file is designed to be human-readable but machine-writable, versioning the metadata makes sense as we don’t need to keep a backwards-compatible format for humans to directly work with and instead need a way to help tools migrate to newer metadata versions.

`indexes`

An array of URLs to package indexes to use to find distributions. Recorded in most- to least-preferred order. Recording the indexes used helps when adding new lock entries by making the potential distributions consistent.

`dependencies`

Array of top-level dependency specifiers. This acts as the input of what to resolve for, so all details are to be included (e.g. extras, markers, etc.).

`[[lock]]`

A lock entry for an environment.

`lock.markers`

A table of environment markers used to produce the lock entry.

The resolve-markers-tags-requirements branch of brettcannon/mousebender on GitHub is an alternative that uses a list of relevant markers.

`lock.tags`

An array of wheel tags supported by the environment as used to produce the lock entry. The tags are sorted from most- to least-preferred.

The resolve-markers-tags-requirements branch of brettcannon/mousebender on GitHub is an alternative that only lists the required wheel tag sets.

`[[lock.wheel]]`

A wheel file for the lock entry (optional).

`lock.wheel.name`

The distribution’s normalized name. You cannot solely rely on the wheel filename to calculate this as the file name may not be a valid .whl file name due to direct references (technically a tool could download the arbitrary URL and inspect it to determine the wheel file details if one so desired). Having the name as a distinct key also has the benefit that it’s easier to read than from a wheel file name.

`lock.wheel.filename`

The file’s name.

`lock.wheel.origin`

A URL or file path (via file://) where the wheel that was locked against was found. The location does not need to exist in the future, so this should be treated as only a hint to where to look and/or recording where the wheel file originally came from.

`lock.wheel.hashes`

A table of file hashes; algorithm:value pairs. This makes sure that one is getting the wheel file that was locked against for reproducibility and security purposes.
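For illustration, an installer could check a downloaded file against the `hashes` table roughly like this. It is a sketch under assumptions: `verify_hashes` is a made-up name, the values are assumed to be hex digests, and the algorithm names are assumed to be ones `hashlib` recognizes.

```python
import hashlib


def verify_hashes(data, hashes):
    """Return True only if `data` matches every algorithm:value pair in `hashes`."""
    if not hashes:
        # Nothing to verify against; this sketch treats that as a failure.
        return False
    for algorithm, expected in hashes.items():
        actual = hashlib.new(algorithm, data).hexdigest()
        if actual != expected.lower():
            return False
    return True


payload = b"pretend this is a wheel file"
table = {"sha256": hashlib.sha256(payload).hexdigest()}
print(verify_hashes(payload, table))
print(verify_hashes(b"tampered", table))
```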

`lock.wheel.direct`

Whether origin is the direct URL in terms of direct_url.json.

`lock.wheel.requires-python`

Python version requirement (optional). If an installer chooses to determine environment compatibility that is not as strict as an exact match of lock.markers and lock.tags, knowing the supporting Python versions is important to determine if this wheel file is compatible as this is not necessarily communicated via the wheel file name itself.

`lock.wheel.dependencies`

A list of normalized distribution names which this distribution depends on (optional). Viewing the overall lock entry as the entire worldview of distributions available, each entry can be just the distribution name (a perk of Python not allowing multiple distribution versions simultaneously). This allows for introspection as to why a distribution is included in the lock entry (i.e. calculate the dependency graph between distributions). Details like extras and markers are not necessary as the resolver has already handled them.

Because this is not required to have a successful install, it is considered optional.
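As a sketch of the kind of introspection this enables, the snippet below answers "why is this distribution in the lock entry?" with a breadth-first search over the `dependencies` arrays. The function name `why_included` and the sample data are hypothetical, and the top-level dependencies are assumed to be already reduced to normalized names.

```python
from collections import deque


def why_included(entries, top_level, target):
    """Return one shortest chain of names from a top-level dependency to `target`, or None."""
    graph = {entry["name"]: entry.get("dependencies", []) for entry in entries}
    queue = deque([name] for name in top_level if name in graph)
    seen = set(top_level)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for dep in graph.get(path[-1], []):
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return None


# Made-up wheel tables from a single lock entry.
wheels = [
    {"name": "example-app", "dependencies": ["example-http", "example-idna"]},
    {"name": "example-http", "dependencies": ["example-idna"]},
    {"name": "example-idna"},
]
print(why_included(wheels, ["example-app"], "example-idna"))
```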

`[[lock.sdist]]`

A source distribution file for the lock entry (optional).

`lock.sdist.name`

See lock.wheel.name.

`lock.sdist.filename`

See lock.wheel.filename.

`lock.sdist.origin`

See lock.wheel.origin.

`lock.sdist.hashes`

See lock.wheel.hashes.

`lock.sdist.direct`

See lock.wheel.direct.

`lock.sdist.requires-python`

See lock.wheel.requires-python.

`lock.sdist.dependencies`

See lock.wheel.dependencies for what is recorded in this array (optional). The contents may come from either:

PKG-INFO if Metadata-Version is 2.2 or higher and the appropriate fields are not Dynamic.
From building the sdist.
Etc.

`[[lock.sdist.build-requires]]`

An array of files that can be used or were used to build the sdist based on build-system.requires from pyproject.toml (optional). The acceptable keys are wheel and sdist and their values match what is acceptable under the same name directly under [[lock]]. Any future expansion of acceptable distribution types under [[lock]] will also be supported here.

This effectively makes the table a self-contained lock entry just for this sdist with build-system.requires providing the value for dependencies.

If absent, it is at the installer’s discretion how to go about building the sdist (including refusing to).

`[[lock.git]]`

A Git repository of source code for the lock entry (optional).

`lock.git.name`

See lock.wheel.name.

`lock.git.repo`

A URL to the Git repository; it may be a file:// path.

`lock.git.commit`

The commit of the repository to use. It should be a specific commit and not a tag or branch as those can change.

`lock.git.direct`

See lock.wheel.direct.

`lock.git.requires-python`

See lock.wheel.requires-python.

`lock.git.dependencies`

See lock.wheel.dependencies for what is recorded in this array (optional). The contents may come from:

pyproject.toml if project.dependencies (and project.optional-dependencies as appropriate) exists and is not dynamic.
From building the repository based on its pyproject.toml file.
Etc.

`[[lock.git.build-requires]]`

See lock.sdist.build-requires (optional).

`[tool]`

Same as pyproject.toml (optional).

Examples

A lock file with entries for multiple environments.
A larger lock file w/ accompanying dependency graph of the lock file.

Proof of Concept

The resolve branch of my mousebender project has a wheels-only, requires-PyPI-PEP-714-metadata lock generation tool (the restrictions are because I only have so much time and we now have alternative installers showing up, so I don’t need to aim for completeness). You can look at the shell script that I use to generate the examples listed above to see how to play with it. The install subcommand doesn’t do anything but list out the wheel filenames that would be installed since installation isn’t interesting thanks to the `installer` project and the only decision at install time is which lock entry to use.

The key point, though, is I was able to write a proof-of-concept that produces and consumes lock files based on this spec.

PEP 665 Comparison

The most obvious difference is the inclusion of sdists from the start. But that was facilitated by making each distribution file type its own concept, which is also different. The concept of lock entries is also different.

brettcannon · February 22, 2024, 2:07am

cc’ing folks for various tools:

@ofek for Hatch
@frostming for PDM
@Secrus and @radoering for Poetry
@konstin and @charliermarsh for uv
@pf_moore sort of for pip, but also since he will either have to delegate or decide on the eventual PEP

EpicWink · February 22, 2024, 7:05am

Are you wedded to having pylock at the start of the filename? Having *.pylock.toml makes more sense in terms of hierarchy (more specific first), and puts the lock’s purpose at the start. I couldn’t find any discussion on the change from PEP 665 in the last topic.

Also, this change better supports an idea of a default lock-file pylock.toml, where the purpose isn’t named.

Are installers free to pull from an index not specified in this list? PEP 665 seems to suggest this.

Even if the answer is no, this is still effectively impossible to enforce by installers due to proxies and routing configuration.

Does that allow PEP 517 prepare_metadata_for_build_wheel?

The actual build requirements would come from PEP 517 hook get_requires_for_build_wheel (backends may inject their own after reading pyproject.toml).

Any support for non-Git VCS? Version specifiers support arbitrary VCS.

To be the same as pyproject.toml, this would be a table, not an array of tables:

[tool.foo]
bar = 42

[tool.spam]
eggs = "beans"

flying-sheep · February 22, 2024, 8:02am

I think git is the only non-pypa standard that appears in a key here. (Compare hashes as opposed to a specific hash algorithm)

pf_moore · February 22, 2024, 9:35am

I’m happy to be PEP-delegate. I feel I owe you another go at this after PEP 665

Overall, this looks reasonable to me - I’ll avoid bikeshedding on details, as I’m sure plenty of people will do that

pf_moore · February 22, 2024, 9:49am

One point I was unsure about was how a lockfile consumer picks which lock entry to use - specifically around the tags. Checking the mousebender implementation I see you have “strict” and “compatible” matching defined. I think that in the actual PEP, it would be useful to document those two matching modes, even if you prefer them only to be examples, and explain when you’d expect each of them to be appropriate.

pf_moore · February 22, 2024, 10:52am

I just wrote a very rough draft of a function to take a pip installation report and convert it into a lockfile. It seems pretty straightforward, but there are a couple of places where I had some questions.

The installation report doesn’t record the wheel tags pip used, so I can’t populate the lock.tags array. The spec says this is optional, so I assume this isn’t technically an issue. A missing (which I assume is the same as “empty”) tag set in the lockfile is compatible with any environment, so it’s simply up to the user not to use the lockfile in an inappropriate environment, which I think is fine.
I have to use the filename from the URL to determine if the entry file is a wheel or not. I don’t think that’s a major issue, but it does make the whole “wheel vs sdist” split feel a little artificial. The PEP should probably include a rationale for why it’s important to have separate wheel and sdist tables^[1].
Pip doesn’t record build environment details, so I have to omit lock.sdist.build-requires. I assume this would mean “installers should use their default mechanism for creating a build environment, and so can’t guarantee reproducibility”, and I think that’s fine. Is that your expectation?
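For anyone wanting to try the same thing, a conversion along these lines might look like the sketch below. It is not the draft function mentioned above: the names `normalize` and `report_to_lock_entry` are made up, the report keys used (`install`, `download_info`, `archive_info`, `is_direct`, `metadata`) follow my understanding of pip's installation report format, and only archive downloads are handled (VCS and local directory installs are ignored).

```python
import re
from urllib.parse import urlparse


def normalize(name):
    """Normalize a project name: runs of -, _ and . become a single -, lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def report_to_lock_entry(report):
    """Build one lock entry (as a dict) from a pip installation report dict."""
    # tags, markers and build-requires are omitted: the report does not carry them.
    entry = {"wheel": [], "sdist": []}
    for item in report["install"]:
        info = item["download_info"]
        filename = urlparse(info["url"]).path.rsplit("/", 1)[-1]
        record = {
            "name": normalize(item["metadata"]["name"]),
            "filename": filename,
            "origin": info["url"],
            "hashes": dict(info.get("archive_info", {}).get("hashes", {})),
            "direct": item.get("is_direct", False),
        }
        requires_python = item["metadata"].get("requires_python")
        if requires_python:
            record["requires-python"] = requires_python
        # The file name is the only hint for wheel vs. sdist.
        kind = "wheel" if filename.endswith(".whl") else "sdist"
        entry[kind].append(record)
    return entry


# Made-up report with one wheel and one sdist.
sample_report = {
    "install": [
        {
            "download_info": {
                "url": "https://files.example.invalid/Example_Pkg-1.0-py3-none-any.whl",
                "archive_info": {"hashes": {"sha256": "abc"}},
            },
            "is_direct": False,
            "metadata": {"name": "Example_Pkg", "version": "1.0", "requires_python": ">=3.8"},
        },
        {
            "download_info": {
                "url": "https://files.example.invalid/other-2.0.tar.gz",
                "archive_info": {"hashes": {"sha256": "def"}},
            },
            "is_direct": True,
            "metadata": {"name": "other", "version": "2.0"},
        },
    ]
}
lock_entry = report_to_lock_entry(sample_report)
print(lock_entry)
```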

I also note that the lock.git type doesn’t include a build-requires section. I assume it should, as a git repository contains source.

Also, while I see the practical reasons for having a specific “git” lock entry file type, I can see people pushing back on git getting “special treatment”. Maybe the PEP should describe this in terms of general support for VCS file types, with git being the only one defined in this iteration of the spec, but additional VCS types can be added in future spec versions as needed. That doesn’t need to be anything more than a general statement that [[lock.OTHER-VCS-NAME]] is reserved for future use, at this point.

beyond “everyone got uptight last time about sdists, so I wanted to keep them separate” ↩︎

cemici · February 22, 2024, 10:53am

Do I understand correctly that these lockfiles contain one [lock] entry for every possible (distinct) environment?

eg an entire description for “python 3.10 and x86-64”, another entire description for “python 3.9 and pypy and not extras=foo”, and so on?

for something like poetry, that would seem to require it to examine all of the markers that it encountered during locking, enumerate all of the exponentially many possible combinations, optionally merge where possible, and then write complete solutions for all of those possible combinations?

Edit: a package with n extras - even if it supported exactly one environment - would have 2**n entries?
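To put a number on that worry, enumerating every subset of n extras does give 2**n combinations, as the sketch below shows. Whether the spec actually requires one lock entry per combination is exactly the open question here; the helper name and the extras are made up.

```python
from itertools import combinations


def extras_combinations(extras):
    """Yield every subset of `extras`, including the empty one."""
    extras = sorted(extras)
    for size in range(len(extras) + 1):
        for subset in combinations(extras, size):
            yield frozenset(subset)


subsets = list(extras_combinations(["docs", "test", "gui"]))
print(len(subsets))
```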

aragilar · February 22, 2024, 11:02am

Some additional things you may want to consider (or defer/ignore):

How should get_requires_for_build_sdist and get_requires_for_build_wheel be handled? (I’m guessing they appear in build-requires tables, though maybe that should be called out explicitly.)
What happens when the metadata produced during a build differs from the lock file (e.g. sdist that produces different dependencies given a different environment)?

jeanas · February 22, 2024, 11:03am

What algorithms are allowed?

pf_moore · February 22, 2024, 11:43am

I’m not sure pip can determine this accurately. We know the dependency metadata from the distribution file, and we can strip out dependencies that don’t apply in this environment, but I’m not sure we can tell what extras apply for a transitive dependency. We know the extras requested by the user for a top-level dependency, but not necessarily for transitive ones. For example, A depends on B[foo], B depends on C, and if foo is specified, B also depends on D. If we’re asked to install A, I don’t think pip knows that D is in the final resolve because of the extra foo - clearly resolvelib knows that, but I’m not sure if pip can recover that information after the resolve. I do know that the pip installation report doesn’t currently contain that data, so at the very least we’d need to modify the report output.

If it’s not possible to calculate the dependencies accurately, what’s the best thing for lockfile producers to do? Omit the field altogether, report the minimum (in this case, B depends on C), or report the maximum (that B depends on some of C and D)^[1]? Basically, I don’t really know what the intended use is for this field, and therefore whether partially accurate information is better or worse than no information at all.

Maximum might be difficult, actually - the marker API in packaging is rather limited in that it doesn’t really allow for “evaluate, but ignoring extras”. ↩︎
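The A/B/C/D example can be written out as a toy model to show the three possible answers. This is not how pip or resolvelib represent anything: the `METADATA` shape (name, requested extras, gating extra) and all three function names are invented for the illustration.

```python
# Hypothetical metadata: each requirement is (name, requested_extras, gating_extra).
# gating_extra is the extra that must be active on the parent for the requirement
# to apply; None means the requirement is unconditional.
METADATA = {
    "a": [("b", {"foo"}, None)],
    "b": [("c", set(), None), ("d", set(), "foo")],
    "c": [],
    "d": [],
}


def accurate_edges(root):
    """Walk from `root`, tracking which extras are active on each distribution."""
    active = {root: set()}
    edges = {}
    pending = [root]
    while pending:
        name = pending.pop()
        deps = set()
        for dep, extras, gate in METADATA[name]:
            if gate is not None and gate not in active[name]:
                continue  # requirement is behind an extra nobody asked for
            deps.add(dep)
            before = active.get(dep)
            after = (before or set()) | extras
            if before is None or after != before:
                active[dep] = after
                pending.append(dep)  # (re)visit with the enlarged extras set
        edges[name] = sorted(deps)
    return edges


def minimum_edges():
    """Report only unconditional requirements (extras ignored)."""
    return {name: sorted(dep for dep, _, gate in reqs if gate is None)
            for name, reqs in METADATA.items()}


def maximum_edges():
    """Report every requirement, whether or not its extra is active."""
    return {name: sorted(dep for dep, _, _ in reqs)
            for name, reqs in METADATA.items()}


print(accurate_edges("a"))
print(minimum_edges())
print(maximum_edges())
```

In this particular example the accurate answer coincides with the maximum; with an extra that nobody requests, the maximum would over-report instead.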

Kound · February 22, 2024, 12:10pm

It might be worth considering also conda as a package lock source here.
There are already several projects that try to combine conda and pip, e.g.

GitHub - basnijholt/unidep: Single source of truth with requirements for pip and conda
GitHub - macro128/pdm-conda: A PDM plugin to install project dependencies with Conda
GitHub - OldGrumpyViking/hatch-conda: Hatch plugin for conda environments

And I also understood that uv is intended to be included within pixi so that would also profit from this.

I am aware that this is a controversial topic but given the scientific community is still heavily dependent on conda for their work and they need reproducible environments I think this is worth considering.

groodt · February 22, 2024, 12:31pm

conda has been explicitly mentioned as a non-goal in the initial post.
“(i.e. I’m not trying to accommodate conda)”

conda already has lockfiles for environments and dependencies that users can benefit from, so I don’t think it makes sense to expand the scope of an already complex area (that has had multiple false starts and failures in its history) to accommodate a broadly isolated and not generally compatible ecosystem that doesn’t really suffer from lack of lockfiles.

Sure, perhaps in future once both ecosystems have lockfiles, interoperability between ecosystems could become a noble goal. However that feels like an entirely different problem and an entirely different evolution would be required if there was motivation for such a large overhaul of both ecosystems.

groodt · February 22, 2024, 12:34pm

What’s the PEP number going to be?

hugovk · February 22, 2024, 1:35pm

It’s a bit early for that. Let’s have a discussion first, and when Brett is ready to submit a PEP it’ll be the next number available.

radoering · February 22, 2024, 3:27pm

Let me try to assess this with my Poetry hat on (without saying that something is generally good or bad):

“lock entry”: a set of distribution files locked for a specific environment

[Goal] The ability to lock for multiple environments simultaneously for the same set of dependency specifiers

I think this goal is subtly but significantly different from Poetry’s goal. Poetry does not lock for a specific environment, it creates an environment-independent lock file. It’s more like one lock file for all possible environments than a lock file for multiple specific environments.

[lock.markers] A table of environment markers used to produce the lock entry.

Poetry does not use a set of markers to produce an entry in the lock file. A lock file entry in Poetry’s format cannot be mapped to a set of markers like {python_version = "3.10", sys_platform = "linux"} but to a marker condition like "python_version >= '3.9' and python_version < '3.12' or sys_platform == 'linux'" (even though that’s not written to the lock file).

Further, we just lock all available dists for a locked package version and decide at install time which dist to use. I assume cemici’s post (#8 in this thread) is right and it will result in an exponential explosion when we try to create a lock file in the proposed format.

Maybe I’m missing something, but so far I don’t think that Poetry could adopt this format without giving up its key features. I’m not saying that Poetry’s lock file format is better (it definitely has flaws), it just has a slightly different goal.

sethmlarson · February 22, 2024, 4:39pm

Exciting! Love the approach so far.

I love the goal of this, but I’m concerned about this value being based on serialized TOML content instead of the values encoded in the TOML. I’m imagining interactions between TOML auto-formatters and lockfile tooling causing frustration for users. To avoid this we could base this hash value on the values themselves instead of the serialized TOML? A simple example to illustrate my suggestion being: hash(json.dumps(..., sort_keys=True))
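Spelled out a little further, the idea could look like the sketch below: hash a canonical JSON rendering of the parsed values, so key order and TOML formatting stop mattering. The function name `content_hash`, the choice of SHA-256, and the compact separators are my assumptions; the values are also assumed to be JSON-serializable (TOML dates and times would need converting first).

```python
import hashlib
import json


def content_hash(data):
    """Hash the parsed values rather than the TOML text."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same values, different key order: the digests agree.
first = {"dependencies": ["example-pkg>=1.0"], "indexes": ["https://pypi.org/simple/"]}
second = {"indexes": ["https://pypi.org/simple/"], "dependencies": ["example-pkg>=1.0"]}
print(content_hash(first) == content_hash(second))
```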

Should we include the wheel/sdist files’ version for similar reasons?

Love the definition for lock.sdist.build-requires, do you have an example using this feature?

For hashes it’s common to require at least one algorithm always be present for interoperability, I recommend sha256?

RazerM · February 22, 2024, 4:41pm

I think what I’ve been hoping for, is as @radoering put it, an “environment-independent lock file”.

I can’t tell if the idea here is that lock file generators are expected to lock multiple environments automatically based on the environment markers and wheels they come across, or if the idea is that a user needs to configure the environments they’re targeting. I assume the former, because things like manylinux versions can be confusing for beginners or developers that don’t follow packaging PEPs.

willingc · February 22, 2024, 5:01pm

cc’ing science folks for awareness:

@lwasser for pyOpenSci
@rgommers @trallard

pf_moore · February 22, 2024, 5:19pm

In pip’s case we could create either a lock file with a fully-specified environment matching the environment pip was run against, or a “generic” environment where we leave the environment markers empty and it’s the user’s responsibility to only install the lockfile against the correct environment. Pip isn’t a multi-environment installer and so couldn’t do multi-environment locks.

I’m not sure what installers are intended to do here. Brett’s mousebender implementation has two modes:

Strict match, which checks if the markers and tags in the lock entry match the current environment exactly.
“Compatible” match, which checks if the lock entry tags are a subset of the environment tags, and ignores markers.

These seem reasonable options, although I feel that strict matches could be fragile, especially given that we now have packaging utilities written in both Python and Rust, and we can’t assume that the exact values produced by the packaging library will be used by all tools (the precise list of tags supported by a given Python implementation isn’t standardised, for example, nor is it obtained by querying the interpreter).
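Going only by the description of the two modes above, a sketch of them might look like this. It is not mousebender's code: the function names are invented, and trying strict first and then falling back to compatible in `choose_entry` is my own assumption about how a consumer might combine them.

```python
def strict_match(entry, env_markers, env_tags):
    """Markers and tags must equal the environment's exactly (tag order included)."""
    return entry["markers"] == env_markers and entry["tags"] == env_tags


def compatible_match(entry, env_tags):
    """Lock entry tags must be a subset of the environment's tags; markers are ignored."""
    return set(entry["tags"]) <= set(env_tags)


def choose_entry(lock_entries, env_markers, env_tags):
    """Prefer a strict match, then fall back to the first compatible entry."""
    for entry in lock_entries:
        if strict_match(entry, env_markers, env_tags):
            return entry
    for entry in lock_entries:
        if compatible_match(entry, env_tags):
            return entry
    return None


# Made-up lock entries and environment.
entries = [
    {"markers": {"python_version": "3.11"}, "tags": ["cp311-cp311-linux_x86_64", "py3-none-any"]},
    {"markers": {"python_version": "3.12"}, "tags": ["py3-none-any"]},
]
env_markers = {"python_version": "3.12"}
env_tags = ["cp312-cp312-linux_x86_64", "py3-none-any"]
print(choose_entry(entries, env_markers, env_tags))
```

Here no entry matches strictly (the environment has an extra tag), so the second entry is picked via the compatible check, which illustrates the fragility concern about strict matching.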