Lock files, again (but this time w/ sdists!)

pf_moore · February 26, 2024, 9:44pm

The direct URL data structure abstracts VCS URLs. It would be reasonable to follow that specification here - it supports the major SCMs (specifically, the ones pip supports).

pf_moore · February 26, 2024, 9:50pm

In case anyone wants to play with it, I have a very prototype implementation of a locker that uses pip to lock a set of requirements. It only locks for the single platform the script is run on, and it’s probably not up to date with the latest iteration of the spec, but it at least gives an example of how this might work for pip.

gist.github.com

https://gist.github.com/pfmoore/d595e2bfc97fabb7a62e8c29368277cd

lock_pip_install.py

"""Lock data from pip install report."""
import json
from datetime import datetime
import subprocess
import sys
from tempfile import TemporaryDirectory
from urllib.parse import urlparse
from packaging.requirements import Requirement
from dataclasses import dataclass
from textwrap import dedent

This file has been truncated. show original

Please excuse the state of the code - I knocked this together very quickly and it’s far from polished…

charliermarsh · February 27, 2024, 12:10am

My first reaction is that something like this makes sense. One clarification: if I’m on Python 3.8, how do I know which entry to use, since they’re both valid?

brettcannon · February 27, 2024, 12:57am

I actually think it is very hard for them to do that if pip-tools is implemented the way I think it is. As neither of us work on the project, though, I don’t think we should speculate how hard something would be for them.

So you want to know which index provided the file, correct?

Fair enough if others prefer that idea.

I don’t have a good answer to that w/o recording all the inputs like I was doing before you asked about this idea. If you lock all the lock entries simultaneously you could test each marker you accumulated and see if they would fail for the other entries, and if they do then invert the marker. You could also select based on how many markers you do meet (i.e. the more markers the stricter the match and thus higher the chance its the more appropriate fit).

If none of these seem feasible then either we are back to the strict matching of markers and tags or users choose the right lock file for their needs and lock entries get tossed as an unworkable idea.

BrenBarn · February 27, 2024, 3:39am

Is that certain? Based on the examples @cemici gave it’s not clear to me that it’s going to be as simple as just “Windows”.

What I meant (and what I interpreted your list of three cases to mean, but maybe I misunderstood) is that in these “private” cases the person creating the lockfile knows not just the platform but very nearly knows the exact specs of the system where the lockfile will be used (e.g., it’s going to be this specific dev/production server, not just “linux”). And they know that because they know that they will also be the person installing the lockfile^[1]. I agree that it is possible this proposal will work for other cases but I’m still unsure about the combinatorial explosion of environment parameters. Will people actually be able to create a lock file that will just “work on Windows” regardless of anything else (for instance, regardless of what Python version the lockfile-installer-user may have installed)?

(As an aside, this is yet another example of a problem that would be reduced or eliminated if we had a manager-first system. I think it’s important to be aware of how much this assumption contorts our thinking about these proposals. Perhaps somewhat more on-topic-ly, it makes me think that how well pip can support this may not be as important as how well manager-first systems like rye or uv can support it, because those systems, in theory, could fully instantiate a locked environment, including the Python version. But I know everyone disagrees with me about this so I won’t say more about it. :-))

Well, that’s a bit of a tautology. Obviously people asking for something else won’t change the scope of your proposal, but the question is whether your proposal has the “right” scope. That is, the issue isn’t just the scope of your proposal but also the scope of “what do we mean by a lock file and what do we expect it do”. And I think it’s reasonable to think about whether the proposal covers enough of what people want from a lockfile that it will lead to less confusion in the future (i.e., “if you want a lockfile, use this”) and not more (i.e., “if you want a lockfile use one of these ten tools — oh you mean that kind of lockfile? in that case use one from this different set of ten tools”). Maybe some of this can be resolved with a better name, though, one that more clearly bounds what is and is not handled by this particular type of lockfile.

Well, yeah, that’s exactly why it’s important to think about the scope and purpose of lock files as a concept, not just the scope and purpose of this particular proposal. Because the issue is not just different file formats, its different use cases that may be covered. If there are too many things people want to do with lock files that they can’t do with this proposal, then yes, we will have different formats: we’ll have the ones from this proposal, and the other ones that already exist or will be created to do the things that can’t be handled by this kind of lockfile. And yeah that will be a pain. But people won’t stop wanting to do some things just because this kind of lockfile lets them do some other things.

or it will be someone from the same organization, etc., but not just some unknown person from the world at large ↩︎

notatallshaw · February 27, 2024, 6:33am

I had a quick try of this script.

One quick note requires-python = {self.requires_python!r} shows None for a lot of values, looks like toml has no concept of None or Null, so I guess it would be an empty string.

Testing apache-airflow[all] resulted in a ~4.5k of non-empty lines and a ~250 kB file, which was smaller than I expected.

alicederyn · February 27, 2024, 7:16am

One thing I’ve found hard in following all of this is that nobody has said (or I missed) what Poetry actually does that prevents it using this proposal. When I used it, it just seemed to capture a single version per requirement (direct or transitive) plus hashes. But from the descriptions it seems to capture a potentially exponential amount of data from the index, as it supposedly locks for any target environment.

Perhaps this question will clarify things. If I have a dependency on project A, and that has two versions, version 1 (Windows only) and version 2 (Linux only), and I make a lockfile on Linux with Poetry, what versions and hashes will it capture in the lockfile?

mikeshardmind · February 27, 2024, 7:48am

I’m not sure if I’m missing something here, but with my knowledge of what’s allowed on indexes (for instance, on pypi, releases are open ended) I don’t think the current version here does anything for me which I don’t already get with --require-hashes requirements file that has hashes. or at least if this does more for me, it isn’t obvious how.

What I’d want out of a lockfile standard would be the ability to lock as strongly or loosely as appropriate for my use case. If the goal is to lock for a specific platform and be sure the exact same sources were used each time on that platform, I’d want to generate a lockfile with only the relevant hashes to what would be installed.

If I wanted to lock to a set of “known good” dependencies for versions that have been tested to allow more flexible use while still getting the supply chain security benefits, then what I want is a set of hashes for each version/platform-specific wheel for each dependency. (this is closer to what poetry does)

So, I guess what’s missing here for me is why I would want this specific standard? Why would I want multiple separate lock file paths instead of just a specification that says “these are the versions/artifacts allowed, here’s where they should be sourced from, and these are the hashes they should match with”? If I want strict matching, I have a tool emit exactly 1 entry (or 1 per platform for non-universal dependencies, which will still have a single unique solution per platform), if I want permissive, I emit more, and then whatever is consuming this lockfile either has or doesn’t have options to resolve an environment.

cemici · February 27, 2024, 9:11am

I am quite sure that pip-tools could resolve to individual files if they wanted to.

Specifically, they could transfer the Link from the InstallationCandidate onto the InstallRequirement here, and then use that when getting hashes here.

I do not think it is damaging to your proposal that pip-tools does not do this, only interesting.

If you think it would be helpful to your proposal to maintain that pip-tools wish they were doing file-level locking: you could offer them a merge request along those lines and find out whether it is a thing they want or not.

groodt · February 27, 2024, 9:54am

It’s been an open issue since 2021, so if it’s easy, it’s not trivial or been implemented yet. There does seem to be some level of interest based on the thumbs up and comments on the issue.

github.com/jazzband/pip-tools

Provide a way to limit to a single hash per requirement when generating hashes

opened 12:55PM - 23 Feb 21 UTC

cjerdonek

feature hashes

#### What's the problem this feature will solve? Currently, when the `--gener…ate-hashes` option is passed to `pip-compile`, `pip-compile` will include potentially many hashes per requirement, even if only one is needed / desired. #### Describe the solution you'd like This feature request is for `pip-compile` to expose a command-line option that would limit the output to including only a single hash, namely the hash of what would be installed in the environment in which `pip-compile` is being run. This option would be useful in situations where the exact deployment target is known (e.g. when using containers). For one, the requirements files generated by `pip-compile` would be shorter and easier to review since they wouldn't include extraneous info. Secondly, I believe this would provide greater determinism / reproducibility. For example, currently, if a release that was previously being installed from a requirements file generated by `pip-compile` was [yanked](https://pypi.org/help/#yanked) from PyPI, then the result of what would be installed from that requirements file could change, even though the requirements file didn't change (because the requirements file currently includes more hashes than what was originally installed).

I don’t personally understand the fascination on this topic. The proposal supports locking for multiple distributions (files and hashes) for different target environments. pip-tools does the same, although it’s not as capable as the proposal because it’s not able to produce lockfiles for different target environments if there are any differences in the dependency closures. And it’s not able to produce lockfiles for a single target environment without the PR linked above (or unless the dependency closure is only universal wheels or sdists).

pf_moore · February 27, 2024, 10:59am

Thanks! That’s just a dumb bug on my part - the requires-python line should just be omitted in that case. I’ll fix this, but doing so while preserving the file layout (e.g., without accidentally including unnecessary blank lines, and while keeping the code clean) is surprisingly non-trivial^[1].

(Edit: Fixed now)

One thing this prototype has shown me is that hand-formatting the TOML output is possibly the most difficult part of the job. I could have just used tomli-w, but I wasn’t particularly keen on its formatting choices. Unlike JSON, TOML has a lot of formatting choices, which is what makes it human-readable, but conversely, it’s what makes it annoying to machine-generate, assuming you want human-readable output.

There’s a part of me that wants to say that the file format should be JSON, because the file is machine-generated, and the stdlib includes a perfectly adequate JSON writer. Is there actually any good argument for lockfiles being human-readable?

Other lessons I learned:

Pip doesn’t retain the information about which index a candidate was found on, so writing index data would be non-trivial. And given that the same URL could be linked from multiple indexes, with it being arbitrary which one the resolver actually used to find the URL, I’m not actually sure how useful the index information is in practice. @sethmlarson how does “getting the package’s identity” rely on the index? My feeling is that if I have the URL to a precise installable artefact, and that’s not enough to establish “the package’s identity”, then we’re doing something wrong with how we record identity, rather than needing extra data for the installer.

I’m still struggling to understand how I can meaningfully write marker or tag data. Sure, I can blindly follow whatever instructions the PEP ultimately gives to say how resolvers should compute the values (and I can give pip’s perspective on whether those instructions are achievable or not). But if the PEP leaves things “up to the resolver”, or I want to understand why I should compute the values a certain way, I’m lost. Without knowing what the values are intended to be used for, I have no intuition about how to calculate them. And something as vague as “to help the installer pick between different lock entries” is no use, as that just means the locker and the installer need to agree between themselves, and that’s precisely what an interoperability standard should cover!

+1 on this. I think the proposal here is pretty solid, in terms of technical details (and where it isn’t, that can be fixed). It addresses the glaring issue with the previous PEP, which is that not including sdist support wasn’t considered acceptable. But it still suffers from the issue that it is trying to solve the problem “we need a lockfile standard”. But we don’t need a lockfile standard, actually. What we need is a standard solution to issues that are currently being solved by individual tools, in incompatible or adhoc ways, with features that are described as “lockfiles”, or “locking”, or “pinning” or similar. And yet, instead of looking at the underlying issues, we’re looking at the existing solutions and trying to invent a standard based on them.

For this PEP to succeed (and I’d really like it, or something like it, to succeed!) we need to look at the problems people are solving with the existing solutions. That means we need to talk to people who are using those solutions, not people who are implementing them! It’s no use knowing that Poetry implements multi-platform lockfiles. It’s not even useful to know that multi-platform lockfiles must be important because Poetry users are using them. What’s important is knowing what Poetry users are using them for. How many platforms do the users actually care about? What are the practical edge cases, not the theoretical ones, that the Poetry implementation addresses for them? Etc.

@brettcannon - for PEP 722/723, you arranged a user survey to determine which option was most acceptable to users. Do you have resources to do a survey on what users want from lockfiles? It would be incredibly useful for this proposal to have that sort of user-focused data.

and I don’t have much time this morning ↩︎

EpicWink · February 27, 2024, 12:36pm

Two. The first is that reading Git diff on a lock-file is how we perform quick auditing in merge-requests. TOML isn’t that much better than JSON for human reading however, and pip-tools format (with required-by comments) is also okay (the huge number of hashes for eg cffi hurt).

The second is explained by example (first is JSON, second is TOML):

 "deps": [
   "apackage",
-  "bpackage"
+  "bpackage",
+  "cpackage"
 ]

vs

 deps = [
   "apackage",
   "bpackage",
+  "cpackage",
 ]

I just spend the last 30 minutes dealing with the merge of a package-lock.json, most of the conflicts were due to trailing commas.

If you want a format that’s easy to write without a library, try YAML. Seriously, you only need a recursive for-loop and a few print statements. Of course, the parser (even without code execution) has to be fairly complex.

A problem here is that the community at large, and even some of the members of this discussion, are conflating the tools which have different purposes. A lot of ink has been spilled due to the overloaded nature of the term “lockfile”.

I would guess that in open-source, version-pinning files are more common due to the distributed nature and unknowable target environments; whereas environment-locking files are more common in closed-source where the environment is known better (and there is more likely more time spent on auditing and reproducibility). This means that many of the popular (and open-source) projects mentioned in this discussion lean towards comparing with the version-pinning formats.

So the trick here is to extract the requirements for environment-locking from the closed-source devs. The version-pinning format is not this proposal (its goal is reproducibility without dependency resolution), and would have to be another proposal, so the open-source devs’ requirements need to be considered in the context of this proposal’s goals.

My 2c here, with internal tooling to produce, consume, and audit lockfiles: this proposal looks good, I like the non-comment form of following the dependency trail, I have no use for source indexes (we only care about the hash), the sdist compromise is fine, TOML is fine, and we would generally only have one lock entry.

A thought I just had, I wonder how I would extend a lock entry with more dependencies? In essence, when I run testing, I want to use the same dependencies as in production, but with the testing dependencies as well (I know the proper solution here is to include the testing dependencies in the production deployment, so the testing environment is as close as possible, but sometimes I don’t want to be proper). Another lock entry would be completely independent of the first entry, right?

ofek · February 27, 2024, 12:57pm

When Hatch supports this there will be optionally one file for each environment under a .pylock directory so you would just have to modify the direct dependencies for the environments with which you choose to lock.

pf_moore · February 27, 2024, 1:50pm

Thanks. The use of diffs for auditing was the key reason I’d forgotten.

That’s a very good point. A lot of potential users for this feature are working on closed-source, or otherwise private codebases, and making sure their views and needs are represented is hard. Conversely, though, it would also be bad if we standardised something that’s focused entirely on “internal” use cases, to the detriment of open-source use - if only because open source developers won’t be motivated to build or maintain the code needed to support this feature if they gain nothing from it themselves.

I’m not sure there’s a compromise between version-pinning and environment-locking, so we probably have to view them as two separate features. But give that a lot of the existing work in this area has been around version-pinning, we have to consider how to frame the environment-locking proposal in such a way that it’s clearly of general use.

(By the way, having just used them, I quite like the terms “version-pinning” and “environment-locking”. They may not be 100% accurate, but they are easy to understand, and to distinguish).

sethmlarson · February 27, 2024, 3:13pm

My thinking was that libraries implementing Package URLs would need to maintain a list of mappings for “files.pythonhosted.org → PyPI”, but after checking around it appears that’s already being done so this might be a moot point.

ncoghlan · February 27, 2024, 4:37pm

The most recent variation on this topic that I perpetrated relied on a combination of pipenv and pip-tools (I say “perpetrated”, because it kinda sucked, but it also usually worked well enough that we generally couldn’t justify spending time on making it better).

This was an in-house system for a number of Linux-only Python projects, with the deployment targets involved being:

local dev machines (various flavours of x86-64 Linux machines and VMs)
CI build environments (mostly 64-bit Linux, but also 32-bit Linux running in a container)
actual hardware (mixture of 64-bit and legacy 32-bit Linux platforms)

pipenv handled the 64-bit build & test environments OK, and we managed to make it work for the 32-bit CI environments by running it inside the containers, but for the full hardware deployments we relied on pip-compile to generate locked-requirements.txt files and prepackaged everything as a set of precompiled wheels (also relying on containers to deal with the 64-bit vs 32-bit distinction at build time).

While some devs did attempt to work on the components without a strong Linux dependency on their regular Windows laptops, they were only able to do so by ignoring Pipfile.lock and the locked-requirements.txt files and running directly off Pipfile instead (so they occasionally got burned by unexpected dependency upgrades).

A standardised target environment lock file, together with a locker that could target platforms other than the currently running one, would likely have helped clean up a few different aspects of that system:

less reliance on containers to handle the distrinction between 64-bit & 32-bit targets
potentially the ability to generate a Windows target profile to make that a genuinely supported dev environment
we could have more easily added ARM systems to the potential deployment hardware mix (instead, the related Python build & deployment challenges ended up as one more item in the downside list for adding the new hardware variant)

We would have wanted some of the potential locker features that come up in this thread (like disallowing version drift between target environments without specific approvals), but those are UX features of the locking tools rather than something that needs to be baked into the lockfile format itself.

So, by and large, I think pip-compile with hash generation actually works pretty well when you genuinely only have a single target environment. It just gets painful fast when you have multiple potential target environments (even if it’s only a handful of them), and that’s where this multi-target environment locking proposal comes in.

tmk · February 27, 2024, 5:37pm

Sometimes poetry has to capture two versions. For example if you have this pyproject.toml file:

[tool.poetry]
# <metadata stuff>

[tool.poetry.dependencies]
python = ">=3.7,<3.11"
ray = "^2.8.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

and you generate a lock file (for example with poetry lock --no-update), then (at least in my experiment – I hope this reproduces), the lock file will have two versions of numpy:

[[package]]
name = "numpy"
version = "1.21.6"
...

[[package]]
name = "numpy"
version = "1.26.4"
...

I can’t quite figure out why it’s necessary in this case, but I think it has something to do with the fact that certain numpy versions are only compatible with certain Python versions, so if you declare a wide range of Python versions (>=3.7,<3.11) then it can happen that you need two different numpy versions to make it possible to construct the environment for all cases.

cemici · February 27, 2024, 7:28pm

thanks, I had looked for an issue like that but missed this. There was even a pull request (more or less as I suggested) - Generate a single hash by plannigan · Pull Request #1406 · jazzband/pip-tools · GitHub - but it seems to have fizzled out through lack of love.

I am happy to see this as evidence that there is demand for per-file locking, though the slow death of that PR does weaken that somewhat.

I don’t personally understand the fascination on this topic.

Fascination is a bit strong!

For me this is tied up with: we should talk about use cases. The proposal is introducing this novel-ish feature of locking individual distributions, I am trying to understand whether there is precedent, whether this is something that is wanted: if it is wanted then what it is wanted for, and why existing tools are not doing it.

pf_moore · February 27, 2024, 7:49pm

It’s hardly that novel. Locking to URLs was how the previous lockfile PEP worked, and no-one (as far as I recall) objected then. And we’ve had pinning and hashing of versions for years, in the form of things like requirement files generated by pip-tools, and yet people have still been saying they want “proper lockfiles”. The latter isn’t quite a precedent, but it does suggest that locked versions is not what those people want.

I’d much rather people focused on what is possible with the proposed approach and what is not possible, rather than getting tied up in the implementation details. My impression is that locking to URLs can potentially result in the need for more lock entries in complex multi-platform cases^[1]. But it’s not like you can’t generate them, just that it might be more work. (Or something? I’m not actually sure…)

As a potential implementer, I’m also interested in the user interface side of this. I’d always assumed that when locking, the approach would be to ask the user to specify what targets they want to lock for (that’s essentially what pip does right now, although we’re installing, not locking). But cross-platform lockers like PDM and Poetry seem to just “magically” determine all the possible target configurations - I’m not clear how they do that, and as a result I’m finding it difficult to understand the practical impact of the “combinatorial explosion” being talked about here.

IMO, it’s important to remember that the “typical” case may well involve nothing more than a bunch of universal, pure Python wheels ↩︎

groodt · February 27, 2024, 8:03pm

In terms of the indexes field, I would prefer keeping it, but changing it slightly. Not a blocker, but I would prefer it.

I don’t love the idea of using the indexes as a list of ordered fallbacks. I would prefer that the lockfile producer decided which order to use when producing the lockfile.

I would like to have each lock.{wheel|sdist} to have the origin index id to indicate which index the distribution came from (so indexes would need to change to a name mapping).

Then at installation time, the original indexes and urls would be used by default, or I could provide alternative mappings (or transform the file) to remap to mirrors etc if needed.

If this is too complicated or not seen as desirable then it’s not a dealbreaker.