Lock files, again (but this time w/ sdists!)

OK, in that case I thought there was an issue with older data (for example uploaded before twine was in common use). But most importantly, what you’re looking at is not guaranteed to be what installers will be looking at, and that’s the real issue. I tend to not use BigQuery (in spite of the fact that I am a database specialist by trade) because the metadata historically was very problematic. Maybe it’s better now, I’ve not looked in a long time.

Ah, that would make a significant difference. Looking at just the latest version isn’t that useful, because (a) the cases I’d expect to cause problems would be ones where dependency conflicts meant that older versions were needed, and (b) older versions are the ones most likely to have bad data.

11k sounds a lot more plausible. I’m about 20% of the way through and have around 2,000 problem cases.

If we say 10k out of 150k possibilities, that means about 6% of the project/versions that could have inconsistent data actually do. Of course, only 3% of project/versions have more than one wheel, so that 6% is 6% of that 3% (roughly 0.2% of all project/versions overall).

You can look at this two ways:

  1. Nearly all project versions don’t have inconsistent metadata (mostly because they only have one wheel, probably pure Python), suggesting that the problem is vanishingly rare in practice.
  2. A non-trivial number of projects which have the potential to be inconsistent actually are, suggesting that consistent metadata across wheels is not something the community is making a significant effort to achieve.

There’s also the fact that 99% of the projects on PyPI are probably garbage. There’s a huge long tail of experimental, unused, unmaintained and otherwise useless packages on there[1], and ideally we should only focus on “useful” packages. Pretty much any analysis that ignores this is going to be flawed (either too optimistic, or too pessimistic, it’s impossible to tell…).


  1. some of which are mine :slightly_frowning_face: ↩︎

2 Likes

On the “useful” + problematic front there is the known-problematic instance of torch, which is maintained, packaged, and published by a ~platinum sponsor no less. They are known to publish disjoint metadata across their platform-specific wheels for a given version, and this causes much pain when trying to create a multi-platform lock in the Poetry / PDM style.

I maintain a locker / installer in Pex that handles both that Poetry / PDM style (we call it “universal”) as well as platform-specific locks (one or more in the same file). Pex also takes the - seemingly necessary - step of using the metadata of a single distribution when forming a “universal” lock, out of expediency. The Pex lock functionality has existed since January 2022, which means that to grab all possible applicable metadata for a lock, it would need to pull down many torch wheels for a given release (GBs worth of data). That, on top of the fact it uses Pip under the covers to do the lock resolve (with 3 small runtime patches), which also picks just 1 instance of a release’s metadata IIUC, means this sort of popular outlier is a real concern in practice.

1 Like

Instead of making it a publishing requirement we could just make it a requirement of lock files. That way we aren’t forcing it (yet), but we can provide a powerful nudge when lockers detect an inconsistency. The way I could write it is:

  • If locking for an environment (i.e. per-file locking):
    1. The locker MUST read the metadata for each file, assuming it may be unique to the file.
  • If creating a set of constraints (e.g. the PDM case):
    1. The Requires-Dist and Requires-Python values MUST be consistent across wheels, else the locker raises an error.
    2. If the core metadata version in the sdist is < 2.2, the same metadata is listed as dynamic, or differs from the wheels, then the sdist MUST be left out of the lock file. Lockers MAY raise an error if they so choose.
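
A rough sketch of how a locker might implement those two checks, assuming it already has each file’s core metadata parsed into dicts (the function name and dict layout here are hypothetical):

```python
def check_constraint_lock(wheel_metas: list[dict], sdist_meta: dict | None):
    """Apply the two rules above when creating a set of constraints.

    Each metadata dict is assumed to hold 'requires_dist' (a list of
    PEP 508 strings), 'requires_python', 'metadata_version', and
    'dynamic' (lower-cased field names declared as Dynamic).
    """
    # Rule 1: Requires-Dist and Requires-Python MUST agree across wheels.
    reference = (sorted(wheel_metas[0]["requires_dist"]),
                 wheel_metas[0]["requires_python"])
    for meta in wheel_metas[1:]:
        if (sorted(meta["requires_dist"]), meta["requires_python"]) != reference:
            raise ValueError("inconsistent dependency metadata across wheels")

    # Rule 2: keep the sdist only if its metadata is version 2.2+, these
    # fields are not dynamic, and the values match the wheels'.
    if sdist_meta is not None:
        usable = (
            tuple(map(int, sdist_meta["metadata_version"].split("."))) >= (2, 2)
            and not ({"requires-dist", "requires-python"} & set(sdist_meta["dynamic"]))
            and (sorted(sdist_meta["requires_dist"]),
                 sdist_meta["requires_python"]) == reference
        )
        if not usable:
            sdist_meta = None  # leave it out of the lock file
    return wheel_metas, sdist_meta
```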

Otherwise we could have a way for projects to opt into consistent metadata as part of the future PEP. Something like introducing core metadata 2.4 which adds a “Consistent-Metadata” key that if set means all released files of that project for that version can be assumed to be consistent. Or we get fancy and have “all” represent sdist and wheels and “wheels” to mean the sdist differs but the wheels are all the same.
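
To make that concrete, here is how a locker might read the proposed key. Note that “Consistent-Metadata” does not exist in any current core metadata version; the fragment and its values are entirely speculative:

```python
from email.parser import HeaderParser

# Hypothetical metadata 2.4 fragment; "Consistent-Metadata" is the key
# proposed above and is not part of any existing metadata version.
METADATA = """\
Metadata-Version: 2.4
Name: example
Version: 1.0
Consistent-Metadata: wheels
Requires-Dist: requests>=2.0
"""

scope = HeaderParser().parsestr(METADATA).get("Consistent-Metadata")
if scope == "all":
    print("sdist and wheels share one set of metadata; read any one file")
elif scope == "wheels":
    print("wheels are mutually consistent; treat the sdist separately")
else:
    print("no promise made; fall back to per-file inspection")
```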

Yep, that’s the worst case if we can’t figure something out.

Actually PyPI at least now serves the metadata for all files, so the file size shouldn’t be a concern anymore (the backfill finished earlier this week).

4 Likes

It’s a concern since PyPI is not the only index, as everyone here is aware and has pointed out.
The torch folks also use / require a custom index or find-links depending on the vintage of torch you require.

Very true, but hopefully people are asking their alternative indexes to support the appropriate standards to help with this. But if you’re standing up your own server then it’s more directly under your control and/or you may not care about the bandwidth costs.

Once again, hopefully they are doing the right thing here and exposing the metadata.

1 Like

Nope: https://download.pytorch.org/whl/torch/

Putting myself in their shoes though (and ignoring their larger-than-average resources), the standards here are taking a while to get invented and settle. There has been steady progress in the packaging space since I started to be heavily involved maintaining Pex in 2018, but that’s still 6 years, and more before that. With at least 1 significant PEP in the packaging space per year, that’s a lot of thrash to keep up with, I’d imagine, in an attempt to be a good citizen.

Suffice it to say - the Poetry / PDM style lock would greatly benefit from your Consistent-Metadata proposal, but it will take a while for that to be useful in practice as it gradually becomes prevalent.

2 Likes

:pensive:

If the standard empowering this wasn’t over 2.5 years old and it wasn’t about serving static files then I would agree.

That’s fine. We measure uptake in years. :wink:

2 Likes

Given that

  1. “consistent metadata” seems to be already true for the majority of wheels (see discussions/queries from earlier today)
  2. popular tools like Poetry, PDM, and now Pex assume this style today to make the locking process feasible and seem unlikely to change

…maybe a new metadata field should be opt-out (i.e., Variable-Metadata) rather than an opt-in? Otherwise, if it’s an optional, opt-in field, it seems likely that a large number of wheel authors will neglect to set it even though their wheels likely do have consistent metadata.

2 Likes

I think this is a vital point. It’s especially important because broken/useless packages may be most likely to have metadata that’s broken in various ways. We don’t want to bend over backward with this particular system to support packages that will remain broken in other ways. :slight_smile:

6 Likes

Unfortunately, the job has now been running for over 24 hours, with my PC at 100% CPU throughout. It’s about halfway through, but I don’t want to leave it running for another 24 hours or more. And I wrote the code fairly naively (expecting to only have to leave it running overnight), so I haven’t persisted the data as I went along. I won’t kill it immediately, but I’m not promising I’ll let it run to completion.

I may have another go, but I’ll probably do a bit more research on writing scalable, restartable large scale jobs first (pointers to good tutorials would be welcome).
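
For what it’s worth, the usual trick is to persist each result as soon as it’s computed and skip already-done items on startup. A minimal sketch, assuming items keyed by strings and results written to a JSON-lines file (all names here are placeholders):

```python
import json
from pathlib import Path

RESULTS = Path("results.jsonl")  # one JSON object per line

def run(items, process_one):
    """Process items, persisting each result immediately, so a crash or
    Ctrl-C loses at most the item currently in flight."""
    done = set()
    if RESULTS.exists():
        with RESULTS.open() as f:
            done = {json.loads(line)["key"] for line in f}
    with RESULTS.open("a") as f:
        for key in items:
            if key in done:
                continue  # already handled on a previous run
            result = process_one(key)
            f.write(json.dumps({"key": key, "result": result}) + "\n")
            f.flush()  # make the line durable before moving on
```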

I think we got the main information, though - there are enough problem cases to be an issue, but not so many that we can’t get useful results if the spec doesn’t handle them.

Edit: And a few minutes after posting that, after 27 hours of processing, the job crashed with an “index out of range” error. Sigh.

1 Like

Poetry has a content hash that is computed from the values of certain fields in the pyproject.toml (everything under tool.poetry.dependencies, tool.poetry.extras, and tool.poetry.group, I believe). But I thought the proposal here was to compute the hash from the lock file? That seems very different. The former won’t lead to merge conflicts when I run github’s dependabot on my lock file; the latter will.
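
As an illustration of the former approach (this is only a sketch of the idea, not Poetry’s actual field list or algorithm):

```python
import hashlib
import json
import tomllib  # Python 3.11+

def content_hash(pyproject_path: str) -> str:
    """Hash only the inputs to the resolve, so the value changes exactly
    when the declared dependencies change."""
    with open(pyproject_path, "rb") as f:
        data = tomllib.load(f)
    poetry = data.get("tool", {}).get("poetry", {})
    relevant = {key: poetry.get(key)
                for key in ("dependencies", "extras", "group")}
    canonical = json.dumps(relevant, sort_keys=True)  # stable key order
    return hashlib.sha256(canonical.encode()).hexdigest()
```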

1 Like

Yes, we assume that a package will produce consistent metadata across all wheels, and also that the metadata we produce when building a source distribution locally will be consistent with the metadata produced when building on other Python versions, etc. I’m comfortable with it because (1) other, popular tools were already relying on this assumption; and (2) without this assumption, it (IIUC) becomes impossible to resolve for any system other than that which matches your markers exactly. I would of course support encoding it in a spec, though honestly I’m unfamiliar with the counter-arguments.

2 Likes

While I can see the practical point of taking that view, I’m not particularly comfortable with the potential this gives for lockfiles to install broken environments. Someone needs to validate that the requirements are met, and in the context of a standard, neither the locker nor the installer can assume “the other one” will do the check, unless the standard explicitly states whose responsibility it is.

I like Brett’s formulation:

as this makes it completely clear that it’s the locker’s responsibility (which I think is the right choice, because in some scenarios, such as audit scanning, there may not even be an installer involved).

I don’t understand why this would be. That’s at least in part because I’ve never seen anyone explain the high-level process involved in creating a cross-platform lock. All the descriptions I’ve seen leave me feeling that there’s some unexplained “magic” involved. Or maybe it’s simply that anyone doing cross-platform locking is assuming that dependency metadata is a project version level thing, and I’m failing to follow the description because I can’t get past the obvious (to me, at least) fact that this is an invalid assumption.

But even if that’s the case, it doesn’t mean that it’s impossible to do cross-platform locking with file-level dependencies, just that it’s hard, and no-one has tried to do it yet, preferring to work with simplifying assumptions that are “good enough”. Which is fine, but if you’re going to do that you should validate your assumptions.

To be 100% clear, I don’t like the fact that metadata can vary at the file level. I think it’s a horrible misfeature of the Python packaging ecosystem, that we would never have allowed if we’d realised. And I’d happily change the rules to require that metadata is defined at the level of package+version, and must be the same in all files built for that combination. But getting to a point where that is the case, not just for current versions of packages but also for any older versions that can participate in dependency resolution, is a huge undertaking. And until we do it, we have to live in the reality we have, like it or not.

3 Likes

Unfortunately this would be horribly impractical, requiring the locker to download and inspect every distribution associated with a release, just in case there is variation. As noted earlier, this can be a lot of distributions, e.g. charset-normalizer · PyPI.

PEP 658 makes this somewhat more feasible, but still, inspecting a hundred metadata files instead of one is… unattractive.
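
For the record, with PEP 658 (surfaced through the PEP 691 JSON index as the core-metadata key) a locker can fetch just the metadata document rather than the whole distribution. A rough sketch against PyPI (the function name and error handling are illustrative):

```python
import json
import urllib.request

def fetch_metadata(project: str, filename: str) -> str:
    """Fetch the core metadata for one file of a PyPI project without
    downloading the distribution itself."""
    request = urllib.request.Request(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},  # PEP 691
    )
    with urllib.request.urlopen(request) as response:
        index = json.load(response)
    for file in index["files"]:
        # "core-metadata" is truthy when <file URL>.metadata is served.
        if file["filename"] == filename and file.get("core-metadata"):
            with urllib.request.urlopen(file["url"] + ".metadata") as response:
                return response.read().decode()
    raise LookupError(f"{filename}: no served metadata")
```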

Not to mention that if the answer is not “they were all the same all along”, the challenge of building a cross-platform solution just got a lot harder! Should the locker try to infer a set of markers to express the variation, as it wishes the author had written in the first place? If not, then what?

The Poetry issue tracker probably provides a pretty good proxy for the amount of dissatisfaction users have with “assume consistent metadata”.

Mostly an answer along the lines of “please advise the creator of the package you are trying to install to use PEP 508 markers” seems to go ok-ish.

The projects for which this has been most painful are, I am pretty sure, torch and tensorflow - though it looks as though they both are coming round in their latest or upcoming releases…

3 Likes

I’ll try to clarify by asking some questions. Imagine I’m trying to generate a lockfile, and I need to include a source distribution. So, I need to build that source distribution in order to extract its metadata, to include its dependencies in the output resolution.

What assumptions should I be able to make about the stability of that metadata? If the answer is “nothing”, then the output resolution doesn’t really have any meaning, does it? If the answer is, “it will generate consistent metadata for your Python platform”, then I’d ask, what does that mean? If it’s “consistent metadata for your markers”, then the only assumption you could make is that the resolution is correct for your exact markers.

1 Like

That if it is metadata 2.2, and the dependencies are not marked as “dynamic”, then you’re good. If not, then you can’t use the sdist because the metadata may not be stable. This is the reality we’re in - a sdist that doesn’t declare its dependency metadata as static can do anything, up to and including changing its dependencies based on the time of day the build runs. This is basically what Brett suggested, in the comment I quoted.
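
In code terms that check is cheap once you have the sdist’s PKG-INFO: per PEP 643 (metadata 2.2), a field is static unless it’s listed under Dynamic. A minimal sketch:

```python
from email.parser import HeaderParser

def sdist_deps_are_static(pkg_info_text: str) -> bool:
    """True if the sdist's dependency metadata can be trusted as-is."""
    meta = HeaderParser().parsestr(pkg_info_text)
    version = tuple(int(p) for p in meta.get("Metadata-Version", "1.0").split("."))
    if version < (2, 2):
        return False  # pre-2.2: no static/dynamic promise at all
    dynamic = {value.lower() for value in (meta.get_all("Dynamic") or [])}
    # PEP 643: any field not listed as Dynamic is static.
    return not ({"requires-dist", "requires-python"} & dynamic)
```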

Sdists are problematic, because they do arbitrary code execution at build time. We’re already making an assumption that a sdist isn’t pathological, in the sense that it doesn’t build a completely different wheel based on (say) the user doing the build. However, I don’t think we can reasonably say[1] that sdists which calculate dependency data at build time[2] are pathological in that sense. We can say that changing dependencies based on time of day is pathological, but that’s not what we’re talking about.

Can I point out, though, that we need to be very careful here. This is very close to the issue that caused the previous lockfile PEP to fail. As a community, we have consistently failed to come up with a clear definition of what these “Poetry-style” lockfiles are, or how they should behave (I get the impression that the details aren’t even consistent between Poetry, PDM and uv). And we’re heading down the same path we did last time, of claiming that this behaviour needs to be covered in the standard, but not being able to say what “this behaviour” actually is.

Personally, I’d much rather see a successful, but more limited scope, proposal that leaves Poetry-style locks for a later iteration, over a proposal that tries to make Poetry-style locks work, and as a result fails because it can’t find an acceptable middle ground. As PEP delegate, I think that a standard lockfile format would be an overall benefit to the packaging ecosystem, even if it doesn’t support all of the use cases that people currently describe under the heading of “lockfiles”. The key is whether it supports enough use cases, which means we should be focusing on what we can do with it, rather than what we can’t…


  1. much as we might like to :slightly_frowning_face: ↩︎

  2. such as torch - see here ↩︎

4 Likes

One clarification: I think I don’t quite understand why this is limited to “Poetry-style” lockfiles. Can we ignore those for a second, and assume we’re just focused on the goals outlined in the initial post of the thread? (To be transparent, I would also prefer a proposal that is limited in scope.)

If you allow source distributions to produce dynamic metadata based on arbitrary conditions, why does the same problem not apply to the “locking for an environment (i.e. per-file locking)” case?

Alternatively, if you’re saying that source distributions can produce dynamic metadata, but that it must be consistent modulo some reasonable conditions (e.g., on the same machine), what are those conditions?

4 Likes

(We don’t have a lockfile in uv today – we read requirements.txt and pyproject.toml files, and output requirements.txt files.)

2 Likes

Maybe.

I think that assumes it’s a manual setting and not one the build back-end sets.

Yes, but a content hash was also brought up and not entirely loved either.

Correct, as they have different goals. Poetry tries to restrict the world, but still has to do a resolve to figure out what to install. PDM has a linear list of package versions that can be individually skipped if a marker is provided on that individual package version (and it only supports one version per package). The way I think of it is: Poetry is thorough by trying to handle any situation, while PDM is pragmatic by assuming most conditional dependencies are just a yes/no question (e.g. “include on Windows” or “ignore if running on PyPy”).
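
To illustrate that PDM-style model as described (the lock entries below are invented), each locked entry carries at most one marker and is simply kept or dropped for a given environment:

```python
from packaging.markers import Marker  # pip install packaging

# Invented flat lock: one version per package, plus an optional skip marker.
LOCK = [
    {"name": "requests", "version": "2.31.0", "marker": None},
    {"name": "colorama", "version": "0.4.6",
     "marker": "sys_platform == 'win32'"},
    {"name": "cffi", "version": "1.16.0",
     "marker": "platform_python_implementation != 'PyPy'"},
]

def entries_for_current_environment(lock):
    """Keep an entry if it has no marker or its marker evaluates true here."""
    return [entry for entry in lock
            if entry["marker"] is None or Marker(entry["marker"]).evaluate()]

print(entries_for_current_environment(LOCK))
```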

Because people insisted on sdists being supported, and that requires somehow allowing for all the muckiness they bring. Since you can’t really build an sdist to dynamically get the dependencies unless you’re building on that platform, you can probably assume it will be okay. But the key thing is the PEP is going to have to spell out what an sdist can cause in terms of relaxing of promises if certain conditions are met (and to allow lockers to error out if they don’t want to loosen those promises).

2 Likes

It’s not clear to me that the lockfile has to live in that reality, though. Couldn’t we say “sure, you can do the old-style bad thing where your metadata varies per file, but that is explicitly not supported for lockfiles, and if you do that, your package will break lockfiles, and if you want people to be able to use your package with lockfiles, you better fix your package”? It seems to me that the way to get towards that future world is to start now, by explicitly un-supporting the old way for new proposals, like this lockfile proposal.

I suppose the worry is that people will consider their lock tool broken, rather than the package? If we are able to eventually find all PyPI packages that do this per-file-metadata thing, could those be marked on PyPI, and then the data updated for new packages, so that we know which ones aren’t supported and can early-error on those? Of course this still wouldn’t support the case where someone’s using a non-PyPI index that doesn’t have that info.

But basically my point is that I feel like the way to move away from the bad decisions of the past is to just be up front about the fact that we are doing so, make new tools and standards that shave off the sharp edges of the old ones, and clearly document that. In fact, it seems that that is already what is happening, since based on what others have said in this thread, it seems many existing tools already don’t support the per-file-metadata case. If we think it was a bad idea and it already isn’t supported in practice, why bend over backwards to continue supporting it in theory (i.e., in a standard like this)?

This is something that’s been nagging at me throughout this discussion. In almost every post someone will make a statement and then say something like “modulo sdists” or “yes there are sdist issues”. But a major motivation for this entire proposal (maybe the motivation) was to handle sdists, since the previous lockfile proposal didn’t and was therefore rejected. But if we think we can never be sure, without building, what the dependencies of an sdist are, how can we hope to ever reliably generate lockfiles for them?

I don’t really use sdists at all so I’m not clear on what cases fall into that middle ground between “pathological, this isn’t supported” (like the time-of-day thing) and “easy to support” (static metadata).

Well, but the earlier proposal was rejected based on what it couldn’t do (namely sdists) so there’s a natural desire to worry about what needs to be covered. From my own perspective (which as I say is quite naive about some stuff like sdists), it looks like it’s possible the “sweet spot” might be far less comprehensive than this proposal: for instance, just drop sdists entirely and make lots of simplifying assumptions, resulting in a lockfile that works for 90%+ of useful packages and just punts on the rest. (And, again, that seems to be sort of what we have, in the sense that current tools are apparently leaning towards that.)

2 Likes