PEP 777: How to Re-invent the Wheel

Wanted to say this is a really cool and exciting PEP! ^_^

I do not have answers for the compatibility questions raised upthread, but I wanted to note a few points about zips and wheels I’ve found from personal research. I was particularly impressed by the care taken to define forwards and backwards compatibility, and I believe the stability requirements defined in the current draft would be more than sufficient for me to perform some of the experimentation I describe below.

HTTP Range Requests in the Wild

First, I have some very positive and lengthy comments regarding this section:

This PEP relies on resolvers being able to efficiently acquire package metadata, usually through PEP 658. This might present a problem for users of package indices that do not serve PEP 658 metadata. However, today most installers fall back on using HTTP range requests to efficiently acquire only the part of a wheel needed to read the metadata, a feature most storage providers and servers include. Furthermore, future improvements to wheels such as compression will make up performance losses due to inspecting files in the wheel.

I’m super glad this practice has been noted in a PEP! It turns out doing this robustly is complex (see https://github.com/pypa/pip/pull/12208): pip currently performs a naive version, and the implementation in that PR was improved based on feedback from the poetry maintainer.

It turns out there is some additional standardization that could be useful here: Fastly’s PyPI CDN doesn’t support negative (suffix) range requests, which forces an additional request to learn the file size. But pip needs to support the whole range of possible backends beyond PyPI anyway, so supporting this quirk is not an additional burden, except in PyPI’s bandwidth usage. It may be worth mentioning that the range request approach needs to read from the end of the file, so supporting negative range requests can be an optimization for the backend. A sketch of the fallback logic follows.
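For reference, a minimal sketch of that fallback, assuming the third-party requests package and a made-up tail size; the pip PR above handles many more edge cases:

import requests

SUFFIX_LEN = 64 * 1024  # placeholder: enough to cover the central directory

def fetch_zip_tail(url: str) -> bytes:
    # Try a negative (suffix) range first: "give me the last N bytes".
    resp = requests.get(url, headers={"Range": f"bytes=-{SUFFIX_LEN}"})
    if resp.status_code == 206:
        return resp.content
    # Backends like Fastly reject suffix ranges, so spend an extra HEAD
    # request to learn the length, then ask for an absolute range.
    head = requests.head(url, allow_redirects=True)
    length = int(head.headers["Content-Length"])
    start = max(0, length - SUFFIX_LEN)
    resp = requests.get(url, headers={"Range": f"bytes={start}-{length - 1}"})
    resp.raise_for_status()
    return resp.content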

I’ll note that I’ve seen some wheels put out by Google that place the METADATA file at the front (see the discussion in https://github.com/pypa/pip/pull/12208), which is noncompliant and required further workarounds to make the pip implementation of range requests work against all of PyPI. This practice appears to be very uncommon, and is already covered by the existing wheel standard, so no change is needed, but it may be another reason to mention the negative range request optimization.

You’ve very effectively described it here already, but I also wanted to note how HTTP range requests are specifically useful for achieving metadata-only resolves against a remote --find-links repo (i.e. the simplest possible HTTP server, one that just serves a folder of wheels; see the example below). This is the approach Twitter employed for several years, and it’s extremely convenient for maintenance. It can be used in tandem with a standard simple repository API, as an additional index for testing or staging versions of a specific package, so it’s an excellent component of internal developer tooling. I think packaging standards should support self-hosted indexes, so I think it’s great that this practice is finally codified in a PEP.
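To make that concrete, the simplest version of such a repo is just a stdlib directory listing (paths and package name hypothetical). Note that http.server itself doesn’t honor range requests, so a real deployment would sit behind nginx, S3, or similar to get the metadata-only optimization:

$ python -m http.server --directory ./wheelhouse 8000
$ pip install --find-links http://localhost:8000/ somepackage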

Because the range request approach takes advantage of existing HTTP features (as you’ve noted already), and standardized features of the zip file format (the index at the end), I don’t think it deserves more treatment than you’ve given it here. But I’m very glad to see it finally identified as a valid approach to resolve against wheel repos.

Complex Zip Functionality

These are some mechanisms I’ve identified which leverage standardized features of the zip file format to achieve greater performance, reduced space/bandwidth usage, or both. I am mentioning them here to motivate further progress in this area and to describe specific alternate wheel formats I would like to generate from a build system, if that capability were available.

  1. Zip extraction can be parallelized.

    • I’ve been working on this in Rust (https://github.com/zip-rs/zip2/pull/236), but it requires a lot of calling directly into libc (functionality which should really be in Rust’s stdlib), which makes it hard to ship as a crate.
      • Python already exposes e.g. os.pipe() and os.pread() on POSIX platforms, so it would be easier to build support there, and pip would be able to make use of faster extractions (a minimal sketch follows this list).
    • As it pertains to the wheel format, I think there are no constraints on the content of a zip file that would prevent employing this technique (although avoiding symlinks makes it much easier). A conforming wheel file should already be prepared for parallel extraction.
      • But since we’re considering other mechanisms like HTTP range requests, I thought it might be useful to raise this as well, since it would give all Python code the ability to extract a wheel more quickly.
      • I think maybe I should post the “parallel extraction in the stdlib” idea in Ideas?
  2. zstd dictionaries can be employed to reduce the size of a wheel archive more than using zstd alone.

    • (This one is more complicated, but I wanted to raise it here because it would motivate a wheel format with additional metadata files.)
    • For many large codebases, creating a zstd dictionary (using a block size of 1000 or so) from all text files (ones which can be parsed as UTF-8) enables greater compression ratios for the resulting files. (For CPython, creating a dictionary of all text files in the git repo reduces their combined compressed size by 15%.)
      • This can also be applied to binary files like .so or .a outputs (I was able to shrink a numpy wheel from 17M => 13M by creating separate dictionaries for text and compiled binary files).
    • This leads to the possibility of intentionally tagging certain types of files in a wheel, so that they can be incorporated into a dictionary and used to reduce the overall output size. The tagging process could be performed by a build system when generating a wheel.
      • To complicate matters further, this can be encoded more efficiently by making use of zip extra data fields to tag classes of outputs. But that would begin to make use of (standardized) zip features that wheels haven’t accessed yet.
      • Note that (if I understand correctly) dictionaries are also required for decompression, so this would be incompatible with clients which expect plain zip files, without extra steps.
    • This is a very complicated idea, and I’m currently prototyping it, so I’m not proposing it as a standard at all (a sketch of dictionary training also follows this list).
      • I mostly wanted to raise it to describe one particular way extending the wheel format would lead to direct efficiency savings and reduced PyPI download bandwidth.
      • If a build system could provide alternate wheels upon upload (or whatever approach is decided upon here), it would enable experimentation like this.
  3. Wheels can be used to generate zipapps without decompression.

    • When generating conformant zipapps from a set of wheels (as is done by the pex tool), it’s possible to directly copy entries from a wheel file into the zipapp without decompressing them (see e.g. https://github.com/pex-tool/pex/pull/2175).
      • This in particular means that decompression and filesystem interactions can be obviated entirely, which is really useful for performance and disk usage. (Ideally, generating a zipapp composed of cached wheels should take < 200 milliseconds.)
      • This approach could also be extended to deduplicate entries with the same content across e.g. multiple versions of the same wheel, to reduce the disk space used by very large cached wheels like tensorflow.
    • This is already supported by the Python stdlib without any changes (the raw-copy trick is sketched after the quote below). But it particularly motivates the requirement you’ve stated here:

Finally, future wheel revisions MUST NOT use any compression formats not in the CPython standard library of at least the latest release.
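Here are the sketches referenced in the list above. First, parallel extraction in pure Python: a minimal sketch only, assuming a local .whl path. Each task opens its own ZipFile handle (a shared handle isn’t safe across threads), and zlib decompression releases the GIL, so threads genuinely overlap:

import zipfile
from concurrent.futures import ThreadPoolExecutor

def _extract_one(whl_path: str, name: str, dest: str) -> None:
    # A fresh ZipFile per task avoids contention on a single seek position.
    with zipfile.ZipFile(whl_path) as zf:
        zf.extract(name, dest)

def parallel_extract(whl_path: str, dest: str, workers: int = 8) -> None:
    with zipfile.ZipFile(whl_path) as zf:
        names = zf.namelist()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # .result() re-raises any extraction error from the worker.
        for fut in [pool.submit(_extract_one, whl_path, n, dest) for n in names]:
            fut.result()

(Re-opening the archive per entry re-reads the central directory, so a real implementation would share the parsed metadata and use os.pread() on one underlying fd instead.)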
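Second, the dictionary idea from item 2, sketched with the third-party zstandard package; the dictionary size and sample selection are placeholders, and my actual prototype first groups files by type:

import zstandard

def compress_with_shared_dict(samples: list[bytes], payloads: list[bytes]):
    # Train one dictionary from representative samples (e.g. every UTF-8
    # text file in the codebase), then compress each payload against it.
    dict_data = zstandard.train_dictionary(110 * 1024, samples)  # size is a placeholder
    cctx = zstandard.ZstdCompressor(dict_data=dict_data)
    compressed = [cctx.compress(p) for p in payloads]
    # The same dictionary must ship alongside the archive for decompression:
    #   zstandard.ZstdDecompressor(dict_data=dict_data).decompress(blob)
    return dict_data.as_bytes(), compressed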
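And finally, for item 3, the core trick is reading an entry’s still-compressed payload straight out of the source archive. A rough sketch, assuming no zip64 entries and no data descriptors (a real tool like pex must handle both); note zf.fp is an internal attribute, which is fine for a sketch:

import struct
import zipfile

# Fixed portion of the zip local file header (30 bytes).
LOCAL_HEADER = struct.Struct('<IHHHHHIIIHH')

def raw_entry_bytes(zf: zipfile.ZipFile, info: zipfile.ZipInfo) -> bytes:
    fp = zf.fp
    fp.seek(info.header_offset)
    fields = LOCAL_HEADER.unpack(fp.read(LOCAL_HEADER.size))
    assert fields[0] == 0x04034b50  # local file header signature
    name_len, extra_len = fields[9], fields[10]
    # Skip the variable-length name/extra fields, then read the compressed
    # payload verbatim -- no decompression happens at any point.
    fp.seek(info.header_offset + LOCAL_HEADER.size + name_len + extra_len)
    return fp.read(info.compress_size)

Writing that payload into a zipapp then only requires emitting a matching local header and central directory record, which is essentially what the pex PR linked above does.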

All three of these designs are still in the prototype phase, but the PEP as it stands seems to enforce sufficient compatibility guarantees for them to be employed. I have spent many many hours with the zip file format over the past few years and I would love for this PEP to reach standardization so I can make use of these techniques to make packages smaller and faster. I understand the above was a lot of information, but I hope it provides useful food for thought as to what this PEP would enable. Sorry if not!

4 Likes

It is…? You mean this part:

Recommended archiver features

Place .dist-info at the end of the archive.

That’s a pretty significant misreading, or misrepresentation, of what the standard says.

Please be a bit more careful tossing around “non-compliant” here - it’s often taken as a slight, and can offend people (especially when they are actually compliant).

5 Likes

(I was going to edit that term out but am having trouble finding the edit button now, will do after sending this.)

Thank you for your patient and clear response. I agree that “non-compliant” was not a correct nor useful description of the behavior, and it is very clear to me why that’s a very harsh phrasing. I used it carelessly there and won’t do so again.

1 Like

There is also genuine utility to that specific approach (.dist-info/ at the front), in that it doesn’t require any zip parsing logic to get to the METADATA file. I can actually see it being possible to parse out Requires-Dist lines from a wheel purely with grep for testing (at least when the entry is stored uncompressed).

And in the context of remote requests, having the metadata at the front also seems easier to optimize, and it’s actually more likely to be supported by HTTP caches than negative range requests (as we’ve found). And if anyone’s going to be called “non-compliant”, I would think my pip implementation before I fixed it to handle this case would be a prime candidate, since it made non-standard assumptions about the structure of a wheel, instead of sticking to the stable semantics of HTTP range requests and the zip file format itself (which I appealed to above).

So thanks again for calling me out and giving me the opportunity to re-evaluate this!

1 Like

On the topic of data layout in zip archives: while it’s convenient for streaming purposes that the zip format has its index at the end (you can create zip files recursively in parallel this way), I think @zwol noted that that’s an example of optimizing for ease of writing (which happens just once) rather than reading (which usually happens more than once).

I believe this complaint can also be leveled against tar archives, which don’t even support seeking to an entry without being augmented with an index like the one estargz provides (https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md).

In fact, even the recursive parallel merge approach for creating zip files just throws away the index entries for every intermediate zip except the final one. Which would seem to imply that the problem isn’t which end you crack your egg on, but that we have to crack the egg in the first place!

Which is to say that encoding the entry index inline with the entries themselves seems like an assumption worth revisiting. In discussions around 2020 or so with Jon Johnson, I recall it was exciting to realize that --fast-deps and (e)stargz had independently discovered the same principle, but at the time I was convinced that zip files already had an advantage over tar archives by having that index available. Now, I see that estargz can claim:

This extension is backward-compatible, so the eStargz-formatted image can be pushed to the registry and can run even on eStargz-agnostic runtimes.

and even just the ability to specify the TOC format like this, as a generic name-value mapping, is a huge advantage (https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md).

APPNOTE.TXT is reasonably parseable, but the zip index is really inflexible because it’s part of the binary file format itself, which restricts it both for versioning and for purely technical parsing reasons. Understanding it requires understanding both the structure of the index and how that structure is encoded into a serializable form; a sketch of just locating it follows.
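For contrast with the JSON TOC below, here is a minimal sketch of locating the zip central directory by hand, assuming no trailing archive comment and no zip64:

import struct

# End-of-central-directory record: 22 bytes at the tail of the archive.
EOCD = struct.Struct('<IHHHHIIH')
EOCD_SIG = 0x06054b50

def central_directory_span(tail: bytes) -> tuple[int, int, int]:
    # Scan backwards for the signature (a trailing comment would push the
    # record away from the very end, hence rfind rather than slicing).
    offset = tail.rfind(struct.pack('<I', EOCD_SIG))
    (_sig, _disk, _cd_disk, _n_disk, n_entries,
     cd_size, cd_offset, _comment_len) = EOCD.unpack(tail[offset:offset + EOCD.size])
    return cd_offset, cd_size, n_entries

Every field there is positional and fixed-width; compare with the estargz TOC, where the same information is a name-value document any JSON parser can read.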

By contrast, estargz supports identifying the TOC separately from the tar itself (https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md):

{
  "version": 1,
  "entries": [
    {
      "name": "bin/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "bin/busybox",
      "type": "reg",
      "size": 833104,
      "modtime": "2019-06-12T17:52:45Z",
      "mode": 33261,
      "offset": 126,
      "NumLink": 0,
      "digest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f",
      "chunkDigest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f"
    },
    {
      "name": "lib/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "lib/ld-musl-x86_64.so.1",
      "type": "reg",
      "size": 580144,
      "modtime": "2019-08-07T07:15:30Z",
      "mode": 33261,
      "offset": 512427,
      "NumLink": 0,
      "digest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e",
      "chunkDigest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e"
    },
    {
      "name": ".prefetch.landmark",
      "type": "reg",
      "size": 1,
      "offset": 886633,
      "NumLink": 0,
      "digest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8",
      "chunkDigest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8"
    }
  ]
}

Honestly, my first thought about the TOC JSON is that it looks quite a bit like the JSON format for a simple repository API page (https://packaging.python.org/en/latest/specifications/simple-repository-api/):

{
  "meta": {
    "api-version": "1.4",
    "project-status": "active",
    "project-status-reason": "this project is not yet haunted"
  },
  "name": "holygrail",
  "files": [
    {
      "filename": "holygrail-1.0.tar.gz",
      "url": "https://example.com/files/holygrail-1.0.tar.gz",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "yanked": "Had a vulnerability",
      "size": 123456
    },
    {
      "filename": "holygrail-1.0-py3-none-any.whl",
      "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "dist-info-metadata": true,
      "provenance": "https://example.com/files/holygrail-1.0-py3-none-any.whl.provenance",
      "size": 1337
    }
  ],
  "versions": ["1.0"]
}

(And I realize PEP 658 works great because it too takes this approach to adding metadata—putting it adjacent and not inline!)

I’m not 100% sure what the purpose of this subthread is, but I’d emphasise that the only actual requirement on a wheel is that it’s in zip format. There’s an implied constraint that the format used must be something that the stdlib zipfile implementation can handle, but that’s it. Files can be compressed or uncompressed, ordered in any way you like, and it’s still a valid wheel.

Putting the .dist-info directory at the end is a recommendation, to allow “potentially interesting zip tricks”. But those tricks are only optimisations - conforming tools are still required to work on any wheel[1], not just ones that are structured for ease of use.


  1. Adhoc code can be as casual and slapdash as you like of course, the standards don’t care :slightly_smiling_face: ↩︎

1 Like

Thank you for the reminder and the summary of above discussions. I think my intent from the above subthread (since my first post) was to understand if there were other constraints still being worked out that I should be aware of as an implementor.

There are general questions of caching and data architecture for Python resolvers that are almost completely irrelevant to the wheel format, which is a format for local and network distribution of Python dists. I was trying to understand that distinction above—thanks for grounding me.

Ambiguity in Wheel Versioning

I’m trying now to work out the mechanics of how to publish a wheel that uses PEP 777 to mark its Wheel-Version, and there are a few points that remain unclear to me:

Check that installer is compatible with Wheel-Version. Warn if minor version is greater, abort if major version is greater.

or:

Wheel-Version is the version number of the Wheel specification.

If it can be any valid string conforming to the current version specifier specification (https://packaging.python.org/en/latest/specifications/version-specifiers/), great! But the specific identification of “minor” and “major” versions (unchanged from the wheel 1.0 PEP) implies a much more restrictive X.Y format.
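For concreteness, here is how I read the quoted check, assuming the restrictive X.Y form (the names here are mine, not the PEP’s):

import warnings

SUPPORTED = (1, 0)  # highest Wheel-Version this hypothetical installer implements

def check_wheel_version(wheel_version: str) -> None:
    # Assumes exactly "X.Y"; anything else would need the full version spec.
    major, minor = (int(part) for part in wheel_version.split("."))
    if major > SUPPORTED[0]:
        raise RuntimeError(f"cannot install wheel with Wheel-Version {wheel_version}")
    if (major, minor) > SUPPORTED:
        warnings.warn(f"wheel declares newer Wheel-Version {wheel_version}; proceeding")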

But also, now that I think about it, I don’t think I’m actually allowed to just make up a version string, right? Because the capital-W Wheel specification refers to the result of a standards process?

Cyclic Dependency

I am probably missing something here, but I don’t see how to get out of the chicken-and-egg problem if I can’t test against or even build an experimental wheel format until it’s already an accepted PEP. I could certainly add my own ad-hoc marker to my experimental wheels, but I would also then need to recognize that marker in my resolver/installer too, and get everyone else who wants to try the format to recognize that ad-hoc marker.

The constraints defined in Limitations on Future Wheel Revisions and that Paul kindly reiterated just above seem to give free rein to experimentation, but I don’t understand how to close the loop and actually resolve or install a funky new wheel format without adding another separate field besides Wheel-Version alone, and that makes me nervous. Maybe experimental wheel formats would just be a separate proposal? But that seems to defeat the purpose of “Re-inventing” the wheel!

In pre-PEP: User-Agent schema for HTTP requests against remote package indices, I proposed turning pip/25.2 <json> into pip/25.2 (PEP NNN) <json>, because the name/version convention for User-Agent is already established, but we needed a generic way to signal conformance to the proposed standard that was separate from client name and version. We have even less flexibility than that here, because the Wheel-Version is not something we can change on a whim.

But let’s say I just bite the bullet and add my own ad-hoc field to identify my experiments: PyPI only speaks Python standards, so it wouldn’t understand any upload that’s not backwards-compatible with its declared Wheel-Version (maybe that’s fine?). And as with my telemetry PEP proposal, it seems to benefit PyPI to know when users provide input that’s supposed to be experimental.

Have I missed an obvious answer that would allow deploying my funky new wheel format e.g. to PyPI as opposed to my own index? Right now, it seems like the way to do that is:

  1. Ensure it’s backwards-compatible with a declared wheel version,
  2. Add a separate field (into .dist-info/WHEEL? but what if I want to resolve against it too?),
  3. Use a modified resolver which recognizes the ad-hoc marker from (2).

I think I’m missing something here! Sorry!

I’m not sure you’re missing anything, or at least anything that actually enables this. Part of why improvement of the wheel format has consistently stalled is that there isn’t a way to encode new features that remains backwards compatible. Compliant tools are supposed to reject wheels they haven’t been updated for. Relatedly, there’s no way to ship multiple files with multiple wheel versions for the same (library version, platform tag) tuple, as the wheel version info is inside the file itself.

1 Like

I was originally going to quote many of the things you’ve observed about possible improvements to compression and file format, @cosmicexplorer, but the post was turning into a gigantic wall.

The short of the matter is that if we want to get the most out of a new wheel format, we probably do need, at minimum, a clean break on the file extension, because of the interaction with indexes and old tools. And if there’s actually support for that, it’s worth reconsidering the use of zip files at all. For one, as you’ve noted, the index location and format are optimized for writing and updating rather than reading. Another reason would be to allow for better iteration going forward.

With this, and a change to the filename scheme to allow multiple versions to be uploaded side by side for compatibility during a transition (something along the lines of [library]-[library version]-[wheelx version].whlx, where the wheelx version is also inside the file but is exposed in the filename on indexes to allow iteration without immediately dropping old tool support), we could have a very simple “exterior binary format” whose only structure is: wheelx version, metadata start offset, and remaining data start offset.

The actual binary format of the metadata and the content of the library itself could then change in the future. We could keep the inner format as a zip file in the first iteration to minimize the work needed, and get immediate benefits most people can agree on now, like zstd compression.
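To make the shape of that concrete, here is a purely hypothetical sketch of reading such a container; every field name and width is made up for illustration, not a proposal:

import io
import struct
import zipfile

# Hypothetical fixed header: wheelx version, metadata offset, data offset.
HEADER = struct.Struct('<HQQ')

def read_whlx(path: str):
    with open(path, 'rb') as f:
        version, meta_start, data_start = HEADER.unpack(f.read(HEADER.size))
        f.seek(meta_start)
        metadata = f.read(data_start - meta_start)  # e.g. TOML bytes
        f.seek(data_start)
        # First iteration: the payload stays an ordinary zip (a current
        # wheel), so the stdlib can already open it.
        payload = zipfile.ZipFile(io.BytesIO(f.read()))
    return version, metadata, payload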

1 Like

I dunno if this is going to get any more traction than it did a year ago. :sweat_smile:

The zip format is very simple (there isn’t much overhead from using it) and it’s universal (i.e. it can be extracted with common OS tools). From the last conversation, I don’t think there are sufficient advantages to inventing a maybe-more-flexible but Python-packaging-specific format.

1 Like

I still don’t see a reason to care whether OS tools can open a wheel as long as Python can. If in version 1 the inner format is an (unchanged, current) wheel and the metadata is TOML, unzipping this format is less than 10 lines of Python with no new library support needed; it could be trivially included somewhere in the standard library, and is also trivially vendored. If the inner format always remains something that Python itself knows how to handle, wherever that support is included can be updated as needed, making the lag time to update one Python version instead of “well, every tool breaks now”.

1 Like

Shrug; it’s a pretty small convenience that some people expressed a desire for, but there has to be some justification for losing it. The theoretical ability to do something better in the future isn’t very compelling.

But we’re just going through the same discussion from last October.

1 Like

The fact that we’re still discussing how to get the other improvements people want is exactly what this would spare us from waiting on in the future. By putting the data needed to tell whether we even support the file both in a stable location that doesn’t depend on other future improvements (it can’t be a compressed file in the data section of an archive format…) and in the filename on indexes, that problem goes away and we can actually make improvements.

It’s also not really inconvenient to run python -m some_module --extract filename if we were to go this route. Anyone who has any business caring about what’s in a wheel should be able to do that.

1 Like

I frequently open wheels to look at their contents directly, probably more often than I actually install them, and any typed command is far less convenient than selecting the file in my browser downloads list and having 7-Zip/equiv. open immediately.

6 Likes

I find it far more convenient to type

$ unzip foo<tab>

than go round clicking on things in any browser. That said, yes, I also use OS utilities to extract wheel files. For manipulating wheels you need something more like wheel pack/unpack, but for manual inspection it is nice that you can just unzip them as you would any other zip file.

1 Like

This decision isn’t what’s blocking progress, though[1]. The existing proposal, still using zip, would similarly make it possible to make future updates without much disruption.


  1. frankly, I think it was settled before ↩︎

Any correct zip file utility should be able to view zip files with prepended data, as that’s a key feature of the zip format.

2 Likes