Lock files, again (but this time w/ sdists!)

thanks @willingc watching the thread now! @ofek is hatch planning to support lock files in the future? i wasn’t aware of that feature currently but do know pdm and poetry each have a version of lock files.

Note: for conda i normally use conda-lock fwiw. i am not sure how or if those plugins for pdm and hatch would fit into this equation. my experience is that they allow you to use your conda environments during development. my suspicion is that focusing on pip-related locking vs trying to do conda as well might be cleaner, because a full conda build will not happen using hatch or pdm on its own currently.

Disclaimer: take my thoughts with a grain of salt!


Yes, but as a wrapper around tools that will actually write the lock files; Hatch will then map those lock files to specific environments. Most likely uv is what I’ll use.

This is a good distinction and I think the trade-off that Brett has made is worth it (and is in fact what the Pixi folks were going to do). Lockers would most likely default to whatever most developers use in addition to where services are most likely to be deployed. In my mind this means 4 targets: Windows 64-bit, macOS ARM, Linux 64-bit & ARM. Users can set specific targets to override the defaults if they wish.

This minimizes the work during installations which in turn limits what can go wrong in terms of bugs and security. Installation becomes iterating the entries to find which has tags that are compatible and then installing every URL. This format is also more amenable to security scanners and dependency updaters because the format is so simple.
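A minimal sketch of that install loop, assuming a hypothetical in-memory shape for lock entries (a `tags` list plus a `urls` list; these field names and the example URLs are illustrative, not the proposal's schema):

```python
from typing import Optional

# Hypothetical lock entries: each one targets a set of wheel tags and
# lists every file URL to install. The field names are assumptions.
LOCK_ENTRIES = [
    {
        "tags": ["cp312-cp312-win_amd64"],
        "urls": ["https://example.invalid/foo-1.0-cp312-cp312-win_amd64.whl"],
    },
    {
        "tags": ["cp312-cp312-manylinux_2_17_x86_64", "py3-none-any"],
        "urls": [
            "https://example.invalid/foo-1.0-cp312-cp312-manylinux_2_17_x86_64.whl",
            "https://example.invalid/bar-2.0-py3-none-any.whl",
        ],
    },
]


def pick_entry(entries: list[dict], supported_tags: list[str]) -> Optional[dict]:
    """Return the first entry whose tags are all supported by the environment."""
    supported = set(supported_tags)
    for entry in entries:
        if set(entry["tags"]) <= supported:
            return entry
    return None


def urls_to_install(entries: list[dict], supported_tags: list[str]) -> list[str]:
    """Installation is then just 'fetch and install every URL in the entry'."""
    entry = pick_entry(entries, supported_tags)
    if entry is None:
        raise LookupError("no lock entry is compatible with this environment")
    return list(entry["urls"])
```

No resolution happens at install time in this sketch; the installer only compares tags and reads a list.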

In my opinion, this is The Way :tm:.

I on the other hand don’t love the goal of that and also think this particular implementation would make it hard to contribute. I think there should be no hashes of the file contents nor timestamps. If you include either of these then multiple engineers contributing to the same code base will encounter merge conflicts frequently. I believe there was an open issue with Poetry about this that I think is still not fixed.

edit: found the Poetry issue: “Make the lock file more merge-friendly” (Issue #496, python-poetry/poetry on GitHub)

as long as there is a line content-hash we will never be able to conveniently use dependabot. When dependabot runs, it creates 1 branch and merge request per update, w/ the idea that you let your CI run, and then auto-merge.
However, because of content-hash , with poetry, every single one of these branches has a merge conflict and must be manually dealt with, increasing the human time from a few seconds to ~10-minutes per, and, causing another run of the CI to be required.

edit 2: a cursory glance at other lock files appears to show no precedent for storing such data. I looked at Cargo.lock, yarn.lock, go.mod/go.sum, and Gemfile.lock

Other than this critique, I love the proposal and thank you very much for doing this Brett :smile:


Good to see some ideas on this matter! I will have to think about it more but have a couple small points:

Like @sethmlarson I’m a bit concerned about the hashes being based on the raw text of the file. It means that formatting distinctions which are semantically meaningless in TOML (such as whitespace or the use of different string delimiters) will affect the hash.

Notably, the hash itself relies on one such formatting distinction: it is specified that it must be an inline table. This makes me a bit nervous as it seems like an obvious pitfall for machine-writing. It looks like the POC implementation doesn’t use any TOML-writer lib but rather a textual template into which values are placed. That is okay for the POC, but in real life, if we expect this file format to be consumed and written by programs, a lot of those programs are going to want to just use a TOML-writing library. As far as I can see, toml (the main TOML lib I see on PyPI) doesn’t have any support for stipulating that a particular table be output as inline.

It seems like the spec would be more robust if it could be described purely in terms of the semantic units of TOML (e.g., keys and values), without requiring a layer of special format handling (on input or output).

My other comment is that it would be helpful to have an example lockfile that includes sdists, since including sdists was the impetus for this proposal. It looks like the current examples don’t have any.


Citing from the Poetry issue you linked, since this makes sense to me:

In general it is not possible safely to merge two independent changes to a lockfile without re-solving. One can construct examples in which an upgrade to A is possible, an upgrade to B is possible, but the upgraded A and B are incompatible with one another. So if some bot has proposed both upgrades then it would be an error to merge both, even if they are compatible so far as git can tell. In that sense, being merge-friendly would actually be an anti-feature!
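A toy model of that situation, with made-up packages `a` and `b`, integer versions, and a hypothetical `REQUIRES` table standing in for real metadata:

```python
# Toy model: versions are integers and each requirement maps a dependency
# name to the set of versions of it that are acceptable.
REQUIRES = {
    ("a", 1): {"b": {1, 2}},
    ("a", 2): {"b": {1}},  # the upgraded A only works with the old B
    ("b", 1): {},
    ("b", 2): {},
}


def is_consistent(lock: dict[str, int]) -> bool:
    """Check that every pinned package accepts the pinned versions of its deps."""
    for name, version in lock.items():
        for dep, allowed in REQUIRES[(name, version)].items():
            if lock.get(dep) not in allowed:
                return False
    return True


base = {"a": 1, "b": 1}
upgrade_a = {**base, "a": 2}       # branch 1: bump A only
upgrade_b = {**base, "b": 2}       # branch 2: bump B only
merged = {**base, "a": 2, "b": 2}  # what a clean textual merge produces
```

Each branch is consistent on its own, but the merged pins are not, even though the two changes touch different lines and git would merge them without a conflict.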

I don’t know how other ecosystems without a hash in the lock file solve this issue or if they just accept getting invalid lock files after merging. :person_shrugging:


It doesn’t have to be that string, but I do want a prefix so any lock files sort together. I don’t want to ls a directory and have to hunt for all of the lock files.

I think “default” is a fine name to use. :wink:

I purposely didn’t prescribe how tools were to interpret the data. I wouldn’t expect lockers to suddenly use different indexes as that would cause drifts between lock entries.

Implicitly, yes. Just consider the listed sources as examples, not definitive.

You can consider the Git section as an example if you want, but I thought that Git made sense to start with. Plus the format is flexible enough that adding more types later wouldn’t be hard standards-wise.

Installer tool’s choice. I’m assuming the installer will decide which lock entry makes the most sense for the environment. Since each lock entry is meant to represent the same outcome, albeit in an environment-specific way, it shouldn’t lead to a semantically different outcome if there’s more than one option.

:+1:

You could fake it with the wheel tags from the wheel files themselves.
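One reading of "fake it", sketched below: derive the tags from the wheel file names, which follow the `name-version[-build]-python-abi-platform.whl` convention with `.`-separated compressed tag sets. The helper name is hypothetical, and a real tool would more likely rely on an existing parser than on string splitting.

```python
def tags_from_wheel_filename(filename: str) -> set[str]:
    """Expand the compressed tag set encoded in a wheel file name."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform: the tags are the last three.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    pythons, abis, platforms = parts[-3:]
    return {
        f"{py}-{abi}-{plat}"
        for py in pythons.split(".")
        for abi in abis.split(".")
        for plat in platforms.split(".")
    }
```

The union of these sets over every wheel in a lock entry would give a tag list for that entry without asking the resolver for it.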

:+1: And it’s for flexibility for future additions of types by making all distinct types, well, distinct. :grin:

Sure. I wasn’t expecting pip to throw that info out, so I had not explicitly considered it. But as I said earlier in this post, I’m trying to not be too prescriptive for at least installers.

Yep, that’s an oversight on my part.

:+1: I can also leave lock.git out of the PEP if that’s easier. It just seemed the only other thing beyond wheels and sdists that people might ask for in a 1.0 spec.

Yes except for the extras part; the top-level dependencies keeps the requirements consistent.

If they want to keep their environment-agnostic lock file approach, probably.

I don’t see why the number of extras matters unless every extra is used somehow? And each extra is like a completely different project for resolving, so it’s no different than requiring yet another distribution.

Any. I don’t think any of the packaging standards that involve hashes mandate anything, e.g. the Simple repository API (Python Packaging User Guide). If you wanted to be specific here I would say that’s a separate PEP to update all standards (and as I said in my opening, I am not trying to revolutionize anything here).

Huh, I wasn’t expecting that as I seem to have it from my resolver.

I honestly don’t know because I wasn’t expecting to be told that info was in no way available.

The use case is to know why a distribution file is there. I thought recording the parent → child relationship would be the easiest, but would inverting the direction be easier? Basically I want to be able to generate a dependency graph of what ends up in a lock entry so I can see who pulled in a certain distribution.
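Either direction can be derived from the other, as this sketch shows. The `DEPENDENCIES` mapping is a hypothetical stand-in for whatever the lock file records, and the package names are only examples.

```python
from collections import defaultdict

# Hypothetical parent -> children data, as a locker might record it.
DEPENDENCIES = {
    "app": ["requests", "rich"],
    "requests": ["urllib3", "idna"],
    "rich": ["pygments"],
}


def invert(graph: dict[str, list[str]]) -> dict[str, set[str]]:
    """Turn parent -> children edges into child -> parents edges."""
    parents: dict[str, set[str]] = defaultdict(set)
    for parent, children in graph.items():
        for child in children:
            parents[child].add(parent)
    return dict(parents)


def why(package: str, graph: dict[str, list[str]]) -> set[str]:
    """Return every package that directly or transitively pulled in `package`."""
    parents = invert(graph)
    seen: set[str] = set()
    stack = [package]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

For example, `why("urllib3", DEPENDENCIES)` walks up through `requests` to `app`, which answers the "who pulled this in" question from parent → child data alone.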

Correct. Poetry seems to want to generate a lock file that shrinks the world down to what’s possibly needed, but still do resolving at install time. That is different than what I’m proposing.

Fair enough. I would not be offended if you said you would propose Poetry’s lock file for standardization if my idea gets rejected.

It’s a reasonable idea to shift that way. That’s what conda-lock does. It’s obviously a bit more involved thanks to dicts not being hashable, but I would assume we would convert them to tuples of two-item tuples which are sorted (and lists to tuples).
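A rough sketch of that conversion, assuming the parsed lock data contains only dicts with string keys, lists, and hashable scalars; the `freeze` name is made up for illustration:

```python
def freeze(value):
    """Recursively convert dicts and lists into hashable, order-stable tuples."""
    if isinstance(value, dict):
        # Sorted two-item tuples, so key order in the file does not matter.
        return tuple(sorted((key, freeze(item)) for key, item in value.items()))
    if isinstance(value, list):
        # List order is kept because it may be meaningful.
        return tuple(freeze(item) for item in value)
    return value
```

The result can be passed to `hash()` or serialized and fed to a cryptographic hash. One caveat of this simple form is that a dict and a list of two-item lists freeze to the same tuple.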

:person_shrugging: I honestly only added lock.wheel.name and lock.sdist.name because of direct references.

Nope. :sweat_smile: Adding sdists to the resolver I had to write for wheel files would not be a small undertaking.

See my comment earlier on hashes and packaging standards.

I expect the strict match is more for when you write a lock file to cache what you installed for faster installs in the future.

Fair enough. A hash could also be managed outside of the file itself. As I said earlier I got it from conda-lock, but I am not strongly attached to it being required or even supported.

That’s a massive undertaking, so people may need to use their imaginations for that one. I already had to write my own resolver which means I have partially implemented an installer already. Source distribution support would easily double that workload for me.


If I have an application with n optional extras, and I want to build a lockfile associated with that package, then it seems like a solution for each of the 2**n possible combinations of extras would need to be written separately in that lockfile.
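To make the count concrete, this sketch enumerates every subset of a list of extras; the extras names used in the example are invented:

```python
from itertools import combinations


def extras_combinations(extras: list[str]) -> list[tuple[str, ...]]:
    """List every subset of the extras, from none of them to all of them."""
    ordered = sorted(extras)
    return [
        combo
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]
```

Three extras such as `["cli", "docs", "test"]` already give 8 combinations, and each additional extra doubles the number.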

Perhaps I have just not understood what the use case for these lockfiles is. Is that not a thing that someone would want to do with them?


Proposal updates:

  • Made file-hashes hash the contents of the file instead of the raw bytes (I have not updated my proof-of-concept as some are suggesting to drop file-hashes entirely)
  • [[lock.git.build-requires]] was added (it was an oversight to leave it out)

Only if they are used or you are trying to emulate Poetry. Otherwise you would have to do this today anyway w/ e.g. requirements files, so I don’t see how what I’m proposing is any different in this regard.

Depends on what you’re trying to accomplish. Lock files are typically used to record/cache what was installed for faster future installs or making sure everyone is using the same thing.

I guess so. But that makes the tags field very unpredictable, and makes strict mode matching essentially impossible for installers. I’m fine with the “keep it flexible” principle, but I think we have to consider the realities here as well, and that might mean being a little more specific than you’d ideally like.

Let’s come back to this later, though - I’m sure people will start weighing in with use cases, which will probably make it easier to decide what works best.

It does mean that you need to mark build-requires as optional. You said everything is required unless stated otherwise, but I think the only thing you marked as optional is requires-python.

I’ll look into this a bit more, but it’s definitely the case that it isn’t available from the current pip installation report. It might be possible to determine it from the resolvelib output - that’s the part I haven’t checked yet.

It seems like a “nice to have” bit of information, rather than something essential for an installer, so maybe it should be marked as optional anyway?

My pip report to lockfile code (when complete) can produce a lockfile containing sdists. But of course it’s only as correct as my interpretation of the spec, which is no better than anyone else’s :wink:

Correct; I will make that change.


I’m fine w/ that.


I would expect that you could normally provide a single requirements file which included entries like foo==1.2.2 ; extra == "whatever" - rather than having to provide two requirements files, one with and one without that entry.

Perhaps I was confused by the invitation to poetry / pdm / uv folk to comment on this - it seems like that kind of package management and lockfile is not what this is for. Fair enough!

Updates:

  • Made lock.*.dependencies optional
  • Made lock.*.build-requires optional, stating that installers can then do what they want to build the sdist, including outright refusal
  • Added :fire: to the controversial keys

Nope, you would need separate lock files because you effectively said the “whatever” extra was required by specifying it.

They were cc’ed so they can:

  • Know this is being proposed
  • Provide feedback
  • Say if they would support it or not

Poetry said they won’t. We haven’t heard from PDM yet, and uv has no lock file, so I can’t comment on whether they would adopt it. But Hatch seems on board (sans Ofek not liking the timestamp and whole-file hash).


Yeah, the exponential scaling in markers (including extras) is perhaps only a problem for the poetry-like general-environment case.

AIUI uv intends to have poetry-like capabilities in due course, so I would guess they also will find that property unattractive.

I have lost track of whether pdm lockfiles are supposed to be cross-platform or not - I think perhaps currently they are not but this is considered a bug?

Anyway those projects either will or will not chip in for themselves.

In my discussions with them they have not suggested that, so I would rather wait for them to say what their plans are before any of us make any more guesses.


Yes, we absolutely want to support platform-agnostic lockfiles. Very important t... | Hacker News (and similar comments in that thread) is I think where I got this idea, but of course it will be better to hear from the source


Another source to confirm this: “Can we use uv for dependency management like poetry” (Issue #1870, astral-sh/uv on GitHub), but it is probably better to hear from them directly here wrt the specific concern.

Small nitpick from me (and I know the field is controversial anyway), but if the file-hashes field is retained, perhaps consider a more descriptive name such as integrity-hashes or integrity-checksum.

I can see an argument for retaining this field as a prompt to regenerate the lockfile. I think things like “go mod tidy” get run regularly enough, and are fast enough due to the checksum-serving infrastructure they use, that perhaps it’s not needed in go.sum.

I guess it’s not a deal breaker if it’s not included either. If folks do try to manually merge files in a way that results in unnecessary lock file entries or broken environments, the errors will eventually be fixed with regeneration of the lockfile.

I expect if I had a lockfile, I would have tests in CI that regenerate the file and fail if the version in vcs differs.
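Such a CI check could be as small as the sketch below. The `regenerate` callable is a placeholder for whichever locker a project uses (no specific tool or command is implied), and the comparison is only meaningful if the locker's output is deterministic, which is one more argument against embedding timestamps.

```python
from pathlib import Path
from typing import Callable


def lock_file_is_current(committed: Path, regenerate: Callable[[], str]) -> bool:
    """Return True when regenerating the lock file reproduces the committed text."""
    return committed.read_text(encoding="utf-8") == regenerate()
```

A test would then assert `lock_file_is_current(...)` and fail the build whenever the version in vcs has drifted from what the locker produces.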


For clarification, is the primary focus on whether the format is sufficient for installers to install into a sufficient number of environments for the user?

In other words, we are assuming the happy scenario where a user is installing from a lockfile into an environment they know must be supported (strict)? And then possibly an environment that they think is close enough so should work (compatible)? (I realise these semantics are at the discretion of the installer to offer)

Or should we also be considering how one can produce these lockfiles? It may be tricky for a lockfile producer to produce a lockfile for the desired number of environments, unless they limit themselves to non-dynamic metadata. I’m guessing it’s intentional to not focus on this part so that we can agree on the format itself before worrying about how on earth such a thing could be produced?
