PEP 751: lock files (again)

brettcannon · September 9, 2024, 9:30pm

You might have a lock file that uses only the latest versions and then another lock file that uses all the oldest versions. Lock files are not meant to be layered, if that’s what you were thinking multiple lock files were for.

Same here.

If the PEP bothers taking a stance on this (and it’s a bit tricky as it skews into UX), it would probably say installers MUST support installing into an empty environment, SHOULD support syncing an environment to match the lock file (as an optimization to avoid some I/O, but otherwise can just clear out the environment and then do a clean install), and MAY support some way to install into a pre-existing environment that tries to keep packages not listed in the lock file working.
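One way to picture the MUST and SHOULD cases is as install plans computed from the lock file. The sketch below is only an illustration and is not taken from the PEP: `plan_sync` and its name-to-version dicts are hypothetical, and a real installer would work with much richer metadata.

```python
def plan_sync(
    installed: dict[str, str], locked: dict[str, str]
) -> tuple[dict[str, str], set[str]]:
    """Return (to_install, to_remove) so the environment matches the lock file.

    Both arguments map a package name to a pinned version (a simplification).
    """
    # Anything missing, or pinned to a different version, must be (re)installed.
    to_install = {
        name: version
        for name, version in locked.items()
        if installed.get(name) != version
    }
    # Anything present but absent from the lock file is removed when syncing.
    to_remove = set(installed) - set(locked)
    return to_install, to_remove


locked = {"anyio": "4.4.0", "packaging": "24.1"}

# MUST: installing into an empty environment installs everything.
print(plan_sync({}, locked))

# SHOULD: syncing keeps what already matches and removes the rest.
print(plan_sync({"anyio": "4.4.0", "ruff": "0.6.4"}, locked))
```

The MAY case would skip the removal set, but it then has to decide whether the unlisted packages still work, which this sketch does not attempt.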

pf_moore · September 9, 2024, 9:50pm

Given where that conversation ended up, I tend to agree.

That seems like a reasonable statement. Yes it’s tending towards UX, but the MUST condition is little more than saying “installers must be able to install in at least the simplest case” so it’s hardly controversial…

I do think there’s enough variation in people’s expectations that a statement like this would help to set some boundaries. Speaking as a pip maintainer, I think this would give us some useful context to state what support we’d provide for lockfiles.

EpicWink · September 9, 2024, 10:49pm

There was lock-file inclusion, where one lock-file could refer to another and extend the list of packages, discussed in this thread, but it was deemed too complex.

brettcannon · September 10, 2024, 10:19pm

I created a gist using groups, which are non-standard but similar to PEP 735. I also locked using both stable Poetry and the PR that @radoering has done where Poetry has a linear reading of the lock file (got it working with pipx run --spec git+https://github.com/radoering/poetry.git@lock-markers-and-groups3a poetry lock).

I created 2 groups (when possible): dev1 had "sphinx>=8.0.2" and dev2 had "sphinx>=8.0.2", "packaging>=24.1". This allowed for overlapping and disparate dependencies.

PDM

[metadata]
groups = ["default", "dev1", "dev2"]
# ...

[[package]]
name = "anyio"
groups = ["dev1", "dev2"]
# ...

[[package]]
name = "packaging"
groups = ["dev2"]

[[package]]
name = "trove-classifiers"
groups = ["default"]
# ...

Poetry

New

[[package]]
name = "anyio"
groups = ["dev1", "dev2"]
# ...

[[package]]
name = "packaging"
groups = ["dev2"]
# ...

[[package]]
name = "trove-classifiers"
groups = ["main"]
# ...

Stable

Group details are not recorded in the lock file.

uv

[[package]]
name = "lock-example"
# ...

[package.dev-dependencies]
dev = [
    { name = "httpx" },
    { name = "packaging" },
    { name = "ruff" },
]

[package.metadata.requires-dev]
dev = [
    { name = "httpx", specifier = ">=0.27.2" },
    { name = "packaging", specifier = ">=24.1" },
    { name = "ruff", specifier = ">=0.6.4" },
]

- PDM and Poetry support multiple groups; uv only has a dev group
- PDM and Poetry have a default name for the group representing anything not in a group
  - PDM has “default”
  - Poetry has “main”
- Because of uv’s graph traversal, it only lists the members of the group once

brettcannon · September 10, 2024, 10:31pm

OK, I don’t have any more examples planned. That means it’s time for me to start thinking about what I have learned from these examples and update the PEP accordingly.

How is this coming along, @charliermarsh ? Do you have anything that you can share? Or is it really close to uv’s current format?

So, it’s time for the next poll question! One of the key differences between uv and PDM/Poetry is how uv’s lock file breaks out more data into the TOML format. For instance, uv separates out the sdist file from all the wheel files while PDM and Poetry group all files together. The approach uv takes saves installers from having to repeatedly separate the sdist from the wheel files while searching for a fitting wheel file. But PDM and Poetry’s approach doesn’t necessarily require updating the file format if e.g. a wheel 2 format rolls around (although a wheel 2 format would be big enough that I suspect updating the lock file spec would be the least of anyone’s concerns).

E.g., uv is:

sdist = { url = "https://files.pythonhosted.org/packages/e6/e3/c4c8d473d6780ef1853d630d581f70d655b4f8d7553c6997958c283039a2/anyio-4.4.0.tar.gz", hash = "sha256:5aadc6a1bbb7cdb0bede386cac5e2940f5e2ff3aa20277e991cf028e0585ce94", size = 163930 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/7b/a2/10639a79341f6c019dedc95bd48a4928eed9f1d1197f4c04f546fc7ae0ff/anyio-4.4.0-py3-none-any.whl", hash = "sha256:c1b2d8f46a8a812513012e1107cb0e68c17159a7a594208005a57dc776e1bdc7", size = 86780 },
]

while PDM is:

files = [
    {file = "anyio-4.4.0-py3-none-any.whl", hash = "sha256:c1b2d8f46a8a812513012e1107cb0e68c17159a7a594208005a57dc776e1bdc7"},
    {file = "anyio-4.4.0.tar.gz", hash = "sha256:5aadc6a1bbb7cdb0bede386cac5e2940f5e2ff3aa20277e991cf028e0585ce94"},
]

So, what do people prefer? Separating out data as much as possible in the lock file or keeping it more general?

How should data be recorded?

Separate out the data
Keep it general


ofek · September 10, 2024, 10:39pm

I haven’t had time to follow closely, could you please add a minimal example of what you mean for voters?

brettcannon · September 10, 2024, 11:39pm

I added an example to my original post.

ncoghlan · September 11, 2024, 3:18am

I voted “separate the data”, as I’d like to see as much complexity as possible left to the locking stage.

It also makes it easier to check a “binary only” lock actually is binary only.
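As a rough illustration of that check, assuming package entries shaped like uv's example (an optional `sdist` key next to a `wheels` list), a "binary only" audit reduces to a key lookup. The `is_binary_only` helper and the sample data are hypothetical.

```python
def is_binary_only(packages: list[dict]) -> bool:
    """True if no package records an sdist and every package has a wheel."""
    return all(
        "sdist" not in pkg and bool(pkg.get("wheels"))
        for pkg in packages
    )


wheel = {"url": "https://example.invalid/anyio-4.4.0-py3-none-any.whl"}
sdist = {"url": "https://example.invalid/anyio-4.4.0.tar.gz"}

print(is_binary_only([{"name": "anyio", "wheels": [wheel]}]))
print(is_binary_only([{"name": "anyio", "sdist": sdist, "wheels": [wheel]}]))
```

With a single combined `files` list, the same audit would have to inspect every file name instead.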

I suspect it won’t make a huge practical difference either way, though.

brettcannon · September 11, 2024, 11:21pm

Thanks to everyone who voted to provide feedback! Not a landslide either way, so I’ll see what sort of format feels the best to me and update the PEP accordingly.

I’m going to start on a reworking of the PEP that implements everything I said I would change – I have a list – as well as focus on universal / per-module lock files. Once I have that new draft I will also pull back in the security experts to see if we would still need a modification for strict / per-file lock files. I will also review any changes @charliermarsh / @konstin / @zanie might have w/ introducing a linear lock file while I rework the PEP.

charliermarsh · September 12, 2024, 1:04am

Sorry for the delay. I have a PR here that modifies the lockfile to include the “fully-resolved” markers for each node. I spot-checked some of the simpler examples but I should do a bit more work to ensure that (e.g.) the transformers markers are correct.

Just scanning visually, the largest marker I see is for `greenlet` in the `transformers` lockfile, which is 463 characters long:

markers = "(python_full_version < '3.13' and platform_machine == 'AMD64') or (python_full_version < '3.13' and platform_machine == 'WIN32') or (python_full_version < '3.13' and platform_machine == 'aarch64') or (python_full_version < '3.13' and platform_machine == 'amd64') or (python_full_version < '3.13' and platform_machine == 'ppc64le') or (python_full_version < '3.13' and platform_machine == 'win32') or (python_full_version < '3.13' and platform_machine == 'x86_64')"

(Edit: Ibraheem from our team pointed out to me that this marker is actually very simple as CNF but our normalization uses DNF by default. We could probably toggle to CNF in some cases.)

frostming · September 12, 2024, 3:46am

FWIW, packaging’s marker parser refuses to parse marker strings longer than a certain length (1000 IIRC), or at least the parsing will be extremely slow. So PDM and Poetry do both CNF and DNF and pick whichever is shorter.
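To see why the choice of normal form matters, the `greenlet` marker quoted above can be rebuilt in both forms using plain string operations. This is only a sketch of the "pick the shorter one" idea, not the normalization code of uv, PDM, or Poetry.

```python
MACHINES = ["AMD64", "WIN32", "aarch64", "amd64", "ppc64le", "win32", "x86_64"]
VERSION = "python_full_version < '3.13'"

# DNF: an OR of AND-clauses, so the version check is repeated in every clause.
dnf = " or ".join(
    f"({VERSION} and platform_machine == '{m}')" for m in MACHINES
)

# CNF: the version check appears once, ANDed with a single OR-clause.
cnf = (
    f"{VERSION} and ("
    + " or ".join(f"platform_machine == '{m}'" for m in MACHINES)
    + ")"
)

# Keep whichever spelling is shorter.
shortest = min(dnf, cnf, key=len)
print(len(dnf), len(cnf))
print(shortest)
```

Both strings describe the same set of environments; only the amount of repetition differs.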

You may also want to test locking pyobjc which has a large number of conditional dependencies.

charliermarsh · September 12, 2024, 1:31pm

One thing I realized after putting up this PR, though, is that adding the markers to the nodes isn’t sufficient for our use-case, because we support installing from multiple different “roots” in the lockfile (because we support “workspaces”, groups of local packages that depend on one another).

With uv, you could have a lockfile that includes both local package A and local package B, which may not depend on one another (they can, though – A could depend on B, etc.). And you can install just A and its dependencies, or just B and its dependencies (uv sync --package A vs. uv sync --package B, from the same lockfile). So to support that, you can’t have a flat list of packages with markers that tell you when to install them, since A could depend on C with sys_platform == 'win32' while B depends on C with sys_platform == 'darwin'. The “combined” marker would be sys_platform == 'win32' or sys_platform == 'darwin'… But you don’t want to always install C on those platforms – it depends on whether the user requested the dependency tree rooted at A or B.

So, from that perspective, the markers kind of represent… the superset of platforms on which the package might be installed? Like, C would never be installed on platforms other than win32 or darwin, but it wouldn’t be installed on win32 unconditionally.

As long as we (uv) want to support this, I think we need to track markers on edges, not nodes. (I don’t have a strong objection to including them on the nodes, but we wouldn’t use those markers for anything.)
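The A/B/C scenario can be sketched as a tiny graph walk in which the markers live on the edges. This is a simplified illustration and not uv's implementation: markers are reduced to a single required `sys_platform` value, and `EDGES` and `packages_to_install` are hypothetical names.

```python
# Each edge is (dependency, required sys_platform); None means unconditional.
EDGES = {
    "A": [("C", "win32")],
    "B": [("C", "darwin")],
    "C": [],
}


def packages_to_install(root: str, sys_platform: str) -> set[str]:
    """Walk the graph from one root, following only edges whose marker matches."""
    selected: set[str] = set()
    pending = [root]
    while pending:
        name = pending.pop()
        if name in selected:
            continue
        selected.add(name)
        for dependency, required_platform in EDGES[name]:
            if required_platform in (None, sys_platform):
                pending.append(dependency)
    return selected


print(packages_to_install("A", "win32"))  # A and C
print(packages_to_install("B", "win32"))  # only B
```

A node-level marker of `sys_platform == 'win32' or sys_platform == 'darwin'` on C would select C in both calls, which is the over-installation described above.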

brettcannon · September 12, 2024, 9:59pm

For those who don’t know what “CNF” and “DNF” are: conjunctive normal form (an AND of OR clauses) and disjunctive normal form (an OR of AND clauses).

And this is where things get tricky. If you are constructing lock files where the markers on files weren’t accurate and instead were documentation, then we have to make sure everyone follows the same practice or else you could end up with incorrect installs.

Now, this seems similar to lock files with extras. In PDM’s case they record which extras/groups a package belongs to so they know when to even care about a package based on where you entered the file from. So in this instance you could have a group representing each of your possible entry points into the lock file and still rely on the markers being accurate per-package. Or am I misunderstanding the concern?

pf_moore · September 12, 2024, 10:12pm

Is this not (in effect) simply having multiple lockfiles in a single file? With each root basically defining its own lockfile?

charliermarsh · September 13, 2024, 1:51am

Yeah that’s a reasonable mental model, though they share a dependency graph. So, for example, if local projects A and B both depend on package C from PyPI, the lockfile ensures that they both depend on the same version of C.

(Also, in general, it tends to be the case that A depends on B, and so B and its dependencies are a sub-graph of A. So they’re typically not entirely independent.)

pradyunsg · September 13, 2024, 8:36am

Oooh, could you file an issue with an example or two of this in packaging? I’d be interested in exploring ways to make this quicker.

frostming · September 13, 2024, 9:50am

Sorry, my knowledge needs to be updated. The latest version of packaging doesn’t have this problem, thanks to the new parser implementation. Anyway, the following code would raise a max recursion error on packaging<22:

from packaging.markers import Marker

versions = [f"{n}.0" for n in range(1, 16)]

# One clause that ORs together 15 platform_release comparisons.
or_clause = " or ".join(f'platform_release=="{v}"' for v in versions)
# AND ten copies of that clause together to get a very long marker string.
long_marker = " and ".join(or_clause for _ in range(10))
print(len(long_marker))

m = Marker(long_marker)
print(m.evaluate())

ncoghlan · September 14, 2024, 12:26am

The issue noted here could also come up via groups: there may be parts of the combined marker that only apply if a particular group is installed, or a particular install root is requested.

Which means we may want to revisit the idea of adding a marker syntax for groups (in this PEP, not the groups PEP, since the base PEP doesn’t have a use case for it). Then the path dependent parts of the specifier could be qualified with things like group = "dev" (and roots could get synthetic group names like group = "via-root-A").

pradyunsg · September 15, 2024, 3:48pm

Aha! I’m happy to hear this!

This wasn’t something I’d set out to resolve with the new parser implementation but, like the performance improvements, this is another one of the unintended positive effects of the hand-written parser! I checked and this works with a recursion limit of 20 (which is 1/50th of the default 1000). ^.^

brettcannon · September 16, 2024, 7:32pm

That would be the most composable, but also a bigger lift for the PEP to make (which isn’t a showstopper, but more for the PEP to get approved).

Another way to approach it is to have markers potentially work at the group level by specifying markers per group in the TOML, e.g.:

groups = {name = "A", marker = "sys_platform == 'win32'"}

But doing a marker as group == "A" and sys_platform == 'win32' isn’t cumbersome either. The question is whether markers would get too big in this instance?
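A flat package entry with a condition per group could be evaluated roughly like this. The data shape is an assumption made for illustration (a list of group entries rather than the single inline table shown above), the marker is reduced to a plain `sys_platform` value, and the `via-root-A` style names follow the synthetic group names suggested earlier in the thread.

```python
PACKAGE_C = {
    "name": "C",
    # Hypothetical shape: one entry per group, each with its own condition.
    "groups": [
        {"name": "via-root-A", "sys_platform": "win32"},
        {"name": "via-root-B", "sys_platform": "darwin"},
    ],
}


def should_install(package: dict, requested_groups: set[str], sys_platform: str) -> bool:
    """True if any requested group wants this package on this platform."""
    return any(
        group["name"] in requested_groups
        and group["sys_platform"] == sys_platform
        for group in package["groups"]
    )


print(should_install(PACKAGE_C, {"via-root-A"}, "win32"))
print(should_install(PACKAGE_C, {"via-root-A"}, "darwin"))
```

The decision stays local to the package entry, at the cost of one condition per group that can reach the package.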

Is that meant to be a new marker syntax or TOML?

Potentially putting more into markers does bring up the question of whether the flat, linear format that PDM and future Poetry use is preferred over the graph traversal of uv (remember, the graph traversal is not doing a resolution; it’s just evaluating markers)? The flat, linear file has package details more contained so things can be read in-place more to understand if something would be installed by just looking at the details for any individual package, but it can be noisier due to the marker for a package having to do more heavy lifting and thus potentially becoming large and hurting readability (e.g., PEP 751: lock files (again) - #250 by charliermarsh). Compare this with the graph traversal which is more compact (and potentially more flexible; see PEP 751: lock files (again) - #253 by charliermarsh), but would require hopping around the file or using a tool to help explore when a package may get installed on a platform (I’m purposefully not making a judgment over which is easier to implement as I think both are doable).

What do people prefer? And I think this ties into how important human readability of the entire file is to people (I don’t think diffs would be too bad as you would see the overall effect for a package pulled together in the diff).

Linear list of packages or a graph?

Graph
Linear list
