Pre-PEP: Patch files as releases

I’ve been considering various ways to reduce the impact of packaging on file sizes and bandwidth. Improved compression of wheels can help, but for some libraries there are likely even bigger gains available by only sending what has changed between releases. This is likely especially true for libraries that wrap and bundle native code, either where the bundled native code has not changed since the previous version, or the inverse, where an update changes only the bundled native code.

Patch files as a release.

It’s relatively simple, but the idea here is that for a small change, especially in libraries that bundle large binaries that may not have changed, a patch release can consist of

  • New metadata.
  • A prior version that had a full release.
  • A mapping of hashes of wheels to the binary patch files that should be applied, and the hash each should result in.
  • The patch files must preserve the platform tag: it must be the same before and after patching.
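
As a sketch, the hash mapping might look like the following. All field names, file names, and hash values here are hypothetical, purely to make the shape of the mapping concrete; nothing here is part of an existing metadata spec.

```python
# Hypothetical patch-release metadata. Every field name, file name, and
# hash value below is illustrative only, not part of any existing spec.
patch_release = {
    "version": "1.2.1",
    "patches_version": "1.2.0",  # the prior version that had a full release
    "patches": {
        # sha256 of a 1.2.0 wheel -> the patch to apply and the expected result
        "sha256:1111deadbeef": {
            "patch": "pkg-1.2.1-cp312-cp312-manylinux_2_28_x86_64.patch",
            "result": "sha256:2222cafef00d",
        },
    },
}

# The platform tag must be identical before and after patching, so each
# patch entry maps a wheel for one platform to a new wheel for the same one.
assert patch_release["patches_version"] == "1.2.0"
```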

In terms of uploader and index actions:

  • An uploader may (but does not have to) upload a full wheel alongside the patch, but only if it matches the resulting hash of the patched wheel (it should be bit-for-bit identical).
  • A multi-uploader index should not allow other users to provide patches for packages they did not upload.
  • Indexes may (but do not have to) generate patches if an uploader has not, so long as the uploader uploaded a wheel.

Installation of this would be:

  • Use an already locally cached wheel, or fetch the wheel
  • Verify the starting hash
  • Create a new wheel by applying the binary patch
  • Verify the resulting hash
  • Install the new wheel
  • Choose how to cache this (up to the installer): it’s possible to discard one of the wheels if the patch file and knowledge of the patch are kept, or just cache both
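
The steps above can be sketched as follows. The patch format here is a toy (copy/insert opcodes, not a real delta format), just to make the verify-apply-verify flow concrete:

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def apply_patch(old: bytes, ops) -> bytes:
    # Toy delta format: ("copy", offset, length) reuses bytes from the old
    # wheel; ("insert", data) adds new bytes. A real implementation would
    # use an established binary delta format instead.
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

def install_from_patch(old_wheel: bytes, ops, start_hash: str, result_hash: str) -> bytes:
    assert sha256(old_wheel) == start_hash    # verify the starting hash
    new_wheel = apply_patch(old_wheel, ops)
    assert sha256(new_wheel) == result_hash   # verify the resulting hash
    return new_wheel                          # hand off to the normal install path

# Stand-ins for two wheels that share a large unchanged portion.
old = b"HEADER|native-blob-unchanged|version=1.0"
new = b"HEADER|native-blob-unchanged|version=1.1"
ops = [("copy", 0, len(old) - 3), ("insert", b"1.1")]
rebuilt = install_from_patch(old, ops, sha256(old), sha256(new))
assert rebuilt == new
```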

Considerations:

  • If we haven’t yet gotten to a “you can’t delete from PyPI” policy, an existing patch release should block deleting the version it patches for as long as the patch release exists.

  • From a safety standpoint, a patch generated by an index should not require a marker, given the index’s position of trust and the requirement that the result be bit-for-bit equivalent, but this might need discussion.

  • Getting the most out of this on the maintainer side may require some changes to build processes, so that native dependencies are built in a way that lets them be cached separately and reused, or otherwise built reproducibly. Changes along these lines are likely a win for these projects in terms of build time even without this PEP, and some people’s build systems already look like this.

2 Likes

This is a really cool idea. I do wonder if it’s possible to estimate the potential impact, or somehow gather relevant statistics from users or indexes.

My assumption is that in terms of file size, [previous wheel] + [patch] > [a fresh wheel] almost every time, even if it’s a very small relative difference. So how often are users installing a package into an environment with an older version that can be patched[1], versus how often a package gets installed into a fresh environment? In the latter case, they end up downloading wheel + patch [ + patch ...].

The potential downside here is that the developer starts to upload patch releases because it’s easier for them, but a substantial majority of the installs just have to download everything from scratch. It would still reduce file storage on the index, but there’s an overall increase in bandwidth usage.

Maybe an addition that would help is: an index may generate a full wheel from the corresponding patch release if it decides that’s worth doing[2]. This seems like a natural flipside to the “indexes can create a patch if they want to” suggestion.


  1. or have an older version in their cache ↩︎

  2. maybe they notice that a lot of requests are for both the previous wheel and the patch ↩︎

1 Like

Dropping some breadcrumbs: Chrome and ChromeOS use a binary delta patching mechanism to drastically reduce the size of the updates billions of users download: Software Updates: Courgette (and a link to the code)

Binary delta compression is the typical search term for these things.

1 Like

Without a doubt, yes, at least when only considering a single version. For a single version, the calculus is more interesting for disk space than for bandwidth, especially with frequently updated libraries. For libraries that are further along in stability and may have months to years between updates, it’s more likely a user will already have the previous version, and more likely that this also helps with bandwidth.

It also gets much more interesting when considering multiple successive patch versions: for example {1.0, 1.1 (patch), 1.2 (patch)}, where the minor versions change less between releases.

So it comes down to less data stored in the places that cache it (not just PyPI, but CDNs and every mirror). It could also be used in local wheel caches to make them smaller on disk if installers generate or store existing patches.

This seems reasonable, but I’d want to leave this and the other entirely up to the index to decide when it is worth it. Especially if done dynamically, this could become a space/memory/CPU/bandwidth tradeoff.

I would have thought to link to xdelta or HDiffPatch, but Courgette is probably a better example showing off the gains in software people are familiar with, so thank you for that.

zstd also has a patching mechanism. Given the existing interest in making wheels compressible with zstd in the future, I thought there’d be a benefit in using a tool already being considered for inclusion, and it was what I was considering building on top of. With that said, that only gives it a slight lean in its favor, and I’m open to other options; I haven’t compared existing options for implementation performance in enough detail yet to be sure this is the right choice, and won’t do that until I’m more sure I’ve covered the design constraints different parties would need or want.
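
zstd’s `--patch-from` mode essentially treats the old file as a compression dictionary for the new one, so the output only needs to encode what differs. The same mechanism can be illustrated with the standard library’s zlib `zdict` parameter (a real implementation would use zstd via the `zstd` CLI or a binding such as the `zstandard` package; this stdlib sketch is only to show the idea):

```python
import random
import zlib

# Two "wheels" that share a large, poorly-compressible payload and differ
# only in a small trailer.
random.seed(0)
shared = bytes(random.randrange(256) for _ in range(4000))
old = shared + b"version 1.0"
new = shared + b"version 1.1"

# "Patch" = the new file compressed against the old file as a dictionary.
c = zlib.compressobj(zdict=old)
patch = c.compress(new) + c.flush()

# Applying the patch requires bit-identical old bytes on the client side,
# which is why the starting hash must be verified first.
d = zlib.decompressobj(zdict=old)
rebuilt = d.decompress(patch) + d.flush()

assert rebuilt == new
assert len(patch) < len(zlib.compress(new))  # delta beats fresh compression
```

The dictionary trick is why the patch can be tiny even when the payload itself is incompressible: the compressor emits back-references into the old file rather than re-encoding the shared bytes.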

2 Likes

Minor note here: Courgette has been replaced by Zucchini. I’m not familiar with those algorithms, but my understanding is that they rely on decompiling PE/ELF files and diffing the instructions. The underlying goal, however, is that given an original file plus a patch, you end up with a perfect match to the file you would have downloaded from scratch.


I like that concept in general because it means that it does not complicate anything for package authors or the packaging tools. It would be up to PyPI to generate those patches, and an installer could leverage them if it already has a baseline download it can use.

I think for this system to work, it would be necessary to be able to apply the patches to a structure that tools like uv, poetry, and pip already have cached. For instance, uv today does not cache wheels but unpacked wheels, which already discards some of the information needed to retain the original wheel. That is, one would have to run the diff on the files contained in the archive instead of on the wheel itself. My suspicion is that this will also be easier than doing it on the container. One could use Courgette/Zucchini for the contained .so files and a basic text diff for Python files.

Just to make sure the complexity of doing this well isn’t missed, let me note that Fedora supported a package delta scheme (Delta RPMs) which, just from watching upgrades over the years, didn’t seem to do much after some early successes - the numbers the package manager reported at the end of an upgrade were usually minimal (single-digit percentages), and not infrequently some deltas failed to apply, and then the whole new package had to be downloaded anyway. I’m not part of the Fedora team, just a long-term user, so I didn’t worry too much about it - it only just dawned on me that I wasn’t seeing any deltarpm traffic any longer.

Turns out this wasn’t just anecdotal - Fedora dropped generating these just recently. Some thoughts justifying the change are here:

https://fedoraproject.org/wiki/Changes/Drop_Delta_RPMs

and the discussion page:

Of course every situation is different, but it may be worth considering a case where it went wrong, for learnings.

2 Likes

I also love the idea, but want to point out that MSP (the patch format that applies to Windows MSIs) doesn’t get anywhere near the use or love that it deserves. Even the incremental update features of MSIs[1] tend not to be used, and leads to more confusion and broken installs than anything else. So it’s hardly been a massive success there.

That said, Steam saves huge amounts of time and bandwidth by patching games rather than replacing entire files, so there are clearly some contexts where it is very beneficial. I think if those can be identified and then justified as applying to wheels, then the complexity argument can be overcome.

In the meantime, I don’t think there’s any harm in building the tooling independent of PyPI itself. For organisations maintaining entire clusters there’s certainly going to be value in binary diffing an environment once and deploying a patch, and possibly that’s going to be most efficient at the package level? This looks like an idea that can be implemented first without needing to standardise everything, and if it’s valuable, then people will beg for it to be standardised.


  1. Updating a specific file/component only if the MSI contains a newer version. ↩︎

1 Like

I’d also add Arch Linux to the list of packaging ecosystems that tried this for a while but were underwhelmed by how little bandwidth was actually saved and gave up. From mucking around locally with these binary delta tools, I’ve generally observed that even small changes in source code lead to massive deltas in the executable once the compiler/optimiser has done its work.

If we’re looking for size optimizations, I’d personally suggest doing what Fedora does with debug symbols – namely strip-ing the main binaries, then banishing the debug symbols into separate optional subpackages (or would-be sub-wheels in our case). I’ve seen extension modules go down an entire order of magnitude after using strip.

1 Like

Another factor to consider is that the majority of downloads are likely made from CI/CD or containerized environments with no cache or previous version installed, so none of them would benefit from this.

1 Like

Games are very well-suited to this, since they’re often a huge initial download that is persistently installed for a long time, and many of the assets won’t change in a patch.

Patch releases seem most useful for wheels that include large binaries, but I wonder how much those binaries can change in a patched version. A useful experiment would be to build patches for some of the biggest wheels on PyPI (pytorch and tensorflow come to mind) and look at those diffs to see if this would improve storage and/or bandwidth.

2 Likes

On my radar, but something I think Python wheels are positioned to handle well. I will have to consider this more in-depth as part of the analysis. I played around with some wheels I had on hand a few months ago and found meaningful gains in about half the ones that had binaries, and nearly all of the ones that did not, but that was only simple testing I did for myself to determine whether this was actually worth working on in spare time. I’ve now reached the point of considering the design enough to reach out and make sure I’m adequately covering what other people would need before implementing.

I’m also fine if we get to the point of me implementing it and the community doesn’t find the gains worth the added complexity. That isn’t wasted work; it may still be usable as a non-standard tool in some cases, and it would still serve as an exploration and documentation of approaches that have been tried before and why they were ultimately not enough.

So please do mention more things like this where you are thinking of them, I want whatever succeeds to be as good as reasonably possible, or if it fails, to fail because as good as reasonably possible wasn’t a worthwhile improvement.

This is something I considered, but there are two problems IMO

  • Courgette/Zucchini are much more expensive to compute and harder to integrate into the existing ecosystem. If the gains are high enough, and the ecosystem is open to it, it’s worth considering, but my initial “playing around with the idea” months ago mentioned above seemed worth it with simpler options. I’ll look into it more for comparison.

  • I think this also results in more overhead, reducing the number of cases it would be useful for. This might indicate that to get the most out of this, we need a flexible patch format and multiple defined strategies within the format, along with tooling smart enough to pick a strategy intelligently. I’m willing to work through the concerns of a flexible patch format that sufficiently allows different updates to use different strategies, but I’m concerned about this part in particular:

For instance, uv today does not cache wheels but unpacked wheels, which already discards some of the information needed to retain the original wheel.

I’m inclined to think that discarding the original in a non-recoverable way is a speed/soundness tradeoff that may usually be correct for uv. While it may not be incompatible with one of my design goals (reproducibility at each step), I’ll really need to look into how well specified the spreading step of wheel installation is, and this may require tightening that up, and possibly even coming to a consensus on things like directory hashing in Python packaging - both of which are things I did not want to predicate this improvement on.
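
To make the directory-hashing question concrete, here is a minimal sketch of a deterministic hash over an unpacked tree. The traversal order, path encoding, and separator choices here are exactly the kind of details that would need consensus; nothing about this particular scheme is standardized:

```python
import hashlib
import os
import tempfile

def hash_tree(root: str) -> str:
    # Hash (relative path, file contents) pairs in sorted order, so the
    # result does not depend on filesystem iteration order. Separator
    # bytes and path normalization are arbitrary choices for this sketch.
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # force deterministic descent order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            h.update(rel.encode() + b"\0")
            with open(path, "rb") as f:
                h.update(f.read() + b"\0")
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "pkg"))
    with open(os.path.join(d, "pkg", "__init__.py"), "w") as f:
        f.write("__version__ = '1.0'\n")
    digest = hash_tree(d)
    assert hash_tree(d) == digest  # stable across runs over the same tree
```

Note this deliberately ignores file metadata (permissions, timestamps); whether those belong in the hash is one of the open questions.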

This might need to interact with some other concerns you’ve brought up about what is safe to cache and not in other contexts.

I’d also add Arch Linux to the list of packaging ecosystems that tried this for a while but were underwhelmed by how little bandwidth was actually saved and gave up. From mucking around locally with these binary delta tools, I’ve generally observed that even small changes in source code lead to massive deltas in the executable once the compiler/optimiser has done its work.

Debian does it for package indices (not package files), but even then it’s only on by default for testing/unstable where the indices have a lot of churn, and past a few days worth of incremental patches it falls back to just downloading the entire file because the diffs in aggregate wind up being larger anyway.

For reference, I’ll note that Fedora uses zchunk for package list files. It’s a chunked format, allowing clients to download only the changed parts of a file – like rsync, but without needing dedicated software on the server (clients use HTTP range requests, so it’s compatible with HTTP proxies/CDNs). It only needs one file to be hosted, which decreases the load on mirrors compared to having separate “delta” files (AFAIK that was a major issue with deltarpm).
But it’s probably a no-go for PyPI: it needs a bespoke client library, and there’s some size increase. It’s suitable for frequently changed files with lots of unchanged parts; I doubt that applies to wheels.

2 Likes

(To state the obvious: the best way to avoid downloading a lot of stuff from PyPI is to make sure that metadata is easily accessible and correct. A ton of downloads today come from resolvers catastrophically backtracking and fetching entire archives.)

4 Likes

Note that PyPI supports HTTP range requests, which allows installers to only download the metadata and not the entire archive when backtracking (e.g., pip’s implementation).
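
The reason a ranged GET is enough is that a wheel is a zip, and a zip’s central directory listing (plus the end-of-central-directory record that locates it) sits at the end of the file. A local sketch of the first step a lazy remote reader performs (file names are illustrative; a real reader like pip’s would next fetch the central directory range and then the METADATA member by offset):

```python
import io
import struct
import zipfile

# Build a small wheel-like zip in memory (member names are illustrative).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/__init__.py", "x = 1\n")
    zf.writestr("pkg-1.0.dist-info/METADATA", "Name: pkg\nVersion: 1.0\n")
data = buf.getvalue()

# Simulate an HTTP range request: fetch only the tail of the file.
tail = data[-8192:]

# The end-of-central-directory (EOCD) record is found by its signature;
# it says how many entries the archive has and where the listing starts.
pos = tail.rfind(b"PK\x05\x06")
assert pos != -1
total_entries = struct.unpack("<H", tail[pos + 10:pos + 12])[0]
assert total_entries == 2  # both archive members are listed
```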

1 Like