Wanted to say this is a really cool and exciting PEP! ^_^
I do not have answers for the compatibility questions raised upthread, but wanted to note a few points about zips and wheels I’ve found from personal research. I was particularly impressed at the care taken to define forwards and backwards compatibility, and I believe the stability requirements defined in the current draft would be more than sufficient for me to perform some of the experimentation I describe below.
HTTP Range Requests in the Wild
First, I have some very positive and lengthy comments regarding this section:
> This PEP relies on resolvers being able to efficiently acquire package metadata, usually through PEP 658. This might present a problem for users of package indices that do not serve PEP 658 metadata. However, today most installers fall back on using HTTP range requests to efficiently acquire only the part of a wheel needed to read the metadata, a feature most storage providers and servers include. Furthermore, future improvements to wheels such as compression will make up performance losses due to inspecting files in the wheel.
I’m super glad this practice has been noted in a PEP! It turns out that doing this robustly is complex (https://github.com/pypa/pip/pull/12208): pip currently performs a naive version, and the implementation in that PR was improved based on feedback from the poetry maintainer.
It turns out there is actually some additional standardization that could be useful here: Fastly’s PyPI CDN doesn’t support negative range requests, which requires performing an additional request. But pip needs to support the whole range of possible backends beyond PyPI anyway, so supporting this quirk is not an additional burden, except in PyPI’s bandwidth usage. It may be worth mentioning that the range request approach needs to read from the end of the file, so supporting negative range requests can be an optimization for the backend.
I’ll note that I’ve seen some wheels put out by Google that place the METADATA file at the front of the archive (https://github.com/pypa/pip/pull/12208#issuecomment-1667384444), which is noncompliant and required further workarounds to make pip’s range request implementation work against all of PyPI. This practice appears to be very uncommon, and it’s already covered by the existing wheel standard, so no change is needed, but it may be another reason to mention the negative range request optimization.
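To illustrate the fallback concretely, here’s a rough sketch of the two request shapes (using `requests` for brevity; the function name and tail size are just illustrative, and pip’s real implementation in the PR above handles many more edge cases):

```python
import requests

def fetch_zip_tail(url: str, tail_size: int = 8192) -> bytes:
    """Fetch the end of a remote zip, where the central directory lives."""
    # Preferred: a negative (suffix) range fetches the last N bytes in one
    # round trip.
    resp = requests.get(url, headers={"Range": f"bytes=-{tail_size}"})
    if resp.status_code == 206:
        return resp.content
    # Fallback for backends that reject suffix ranges (e.g. Fastly's PyPI
    # CDN, as noted above): learn the total size first, then request an
    # explicit byte range. This costs one extra round trip. (A plain 200
    # would mean the server ignored Range entirely; robust code must
    # handle that too.)
    head = requests.head(url, allow_redirects=True)
    size = int(head.headers["Content-Length"])
    start = max(size - tail_size, 0)
    resp = requests.get(url, headers={"Range": f"bytes={start}-{size - 1}"})
    resp.raise_for_status()
    return resp.content
```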
You’ve very effectively described it here already, but I also wanted to note how HTTP range requests are specifically useful for achieving metadata-only resolves against a remote `--find-links` repo (i.e. the simplest possible HTTP server, one that just serves a directory of wheels). This is the approach Twitter employed for several years, and it’s extremely convenient for maintenance. It can particularly be used in tandem with a standard simple repository API as an additional index for testing or staging versions of a specific package, so it’s an excellent component of internal developer tooling. I think packaging standards should support self-hosted indexes, so I think it’s great that this practice is finally codified in a PEP.
Because the range request approach takes advantage of existing HTTP features (as you’ve noted already), and standardized features of the zip file format (the index at the end), I don’t think it deserves more treatment than you’ve given it here. But I’m very glad to see it finally identified as a valid approach to resolve against wheel repos.
Complex Zip Functionality
These are some mechanisms I’ve identified which leverage standardized features of the zip file format to improve performance, reduce space/bandwidth usage, or both. I am mentioning them here to motivate further progress in this area and to describe specific alternate wheel formats I would like to generate from a build system, if that capability were available.
- Zip extraction can be parallelized. (A stdlib-only sketch follows this list.)
  - I’ve been working on this in Rust (https://github.com/zip-rs/zip2/pull/236), but it requires a lot of calling directly into libc (which should really be in the stdlib), which makes it hard to ship as a crate.
  - Python happens to have more platform support built in, e.g. `os.pipe()` and `os.pread()` on POSIX platforms, so it would be easier to add support there, and pip would be able to make use of faster extractions.
  - As it pertains to the wheel format, I think there are no constraints on the content of a zip file needed to employ this technique (however, avoiding symlinks makes it much easier). A conforming wheel file should already be prepared for parallel extraction.
  - But since we’re considering other mechanisms like HTTP range requests, I thought it might be useful to raise this as well, since it would give all Python code the ability to extract a wheel more quickly.
  - I think maybe I should post the “parallel extraction in stdlib” idea in Ideas?
- zstd dictionaries can be employed to reduce the size of a wheel archive more than using zstd alone. (A sketch of dictionary training follows this list.)
  - (This one is more complicated, but I wanted to raise it here because it would motivate a wheel format with additional metadata files.)
  - For many large codebases, creating a zstd dictionary (using block size 1000 or so) of all text files (ones which can be parsed as UTF-8) enables greater compression ratios for the resulting files. (For CPython, creating a dictionary of all text files in the git repo reduces their combined compressed size by 15%.)
  - This can also be applied to binary files like `.so` or `.a` outputs (I was able to shrink a numpy wheel from 17M => 13M by creating separate dictionaries for text and compiled binary files).
  - This leads to the possibility of intentionally tagging certain types of files in a wheel, so that they can be incorporated into a dictionary and used to reduce the overall output size. The tagging process could be performed by a build system when generating a wheel.
  - To complicate matters further, this can be more efficiently encoded by making use of zip extra data fields to tag classes of outputs. But that would begin to make use of (standardized) zip features that wheels haven’t accessed yet.
  - Note that (iiuc) dictionaries are also required for decompression, so this would be incompatible with clients which expect zip files without extra steps.
  - This is a very complicated idea, and I’m currently making a prototype for it now, so I’m not proposing it as a standard at all.
  - I mostly wanted to raise it to describe one particular way extending the wheel format would lead to direct efficiency savings and reduced PyPI download bandwidth.
  - If a build system could provide alternate wheels upon upload (or whatever approach is decided upon here), it would enable experimentation like this.
- Wheels can be used to generate zipapps without decompression. (A sketch of the raw entry read follows this list.)
  - When generating conformant zipapps from a set of wheels (as is done by the pex tool), it’s possible to directly copy entries from a wheel file into the zipapp (see e.g. https://github.com/pex-tool/pex/pull/2175).
  - This in particular means that decompression and filesystem interactions can be obviated entirely, which is really useful for performance and disk usage. (Ideally, generating a zipapp composed of cached wheels should take < 200 milliseconds.)
  - This approach could also be extended to deduplicate entries with the same content across e.g. multiple versions of the same wheel, to reduce the disk space used by very large cached wheels like `tensorflow`.
  - This is already supported by the Python stdlib without any changes. But it particularly motivates the requirement you’ve stated here:

    > Finally, future wheel revisions MUST NOT use any compression formats not in the CPython standard library of at least the latest release.
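To make the parallel extraction idea concrete, here’s a minimal stdlib-only sketch (the function name and worker count are just illustrative). It sidesteps the shared-offset problem by giving each worker thread its own `ZipFile` handle; a lower-level implementation could instead share one file descriptor and read with `os.pread()`:

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor

def extract_parallel(wheel_path: str, dest: str, workers: int = 4) -> None:
    """Extract all entries of a wheel using several threads."""
    with zipfile.ZipFile(wheel_path) as zf:
        names = zf.namelist()

    def extract_chunk(chunk: list[str]) -> None:
        # One independent handle per worker: a single shared ZipFile
        # serializes reads on an internal lock, so per-thread handles are
        # what let the decompression actually overlap (zlib releases the
        # GIL while it works).
        with zipfile.ZipFile(wheel_path) as zf:
            for name in chunk:
                zf.extract(name, dest)

    # Round-robin split; a smarter split would balance by compressed size.
    chunks = [names[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(extract_chunk, chunks):
            pass
```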
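Here’s a similarly rough sketch of the dictionary idea, assuming the third-party `zstandard` package (none of this is in the stdlib today); the sample-gathering heuristic is just the UTF-8 check described above:

```python
import os
import zstandard  # third-party: pip install zstandard

def train_text_dictionary(root: str, dict_size: int = 100_000) -> zstandard.ZstdCompressionDict:
    """Train a shared dictionary over every UTF-8-decodable file under root."""
    samples = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            with open(os.path.join(dirpath, fname), "rb") as f:
                data = f.read()
            try:
                data.decode("utf-8")  # crude "is this a text file?" check
            except UnicodeDecodeError:
                continue
            samples.append(data)
    return zstandard.train_dictionary(dict_size, samples)

# The dictionary must be shipped alongside the archive, since decompression
# needs it too:
#   compressed = zstandard.ZstdCompressor(dict_data=d).compress(payload)
#   restored = zstandard.ZstdDecompressor(dict_data=d).decompress(compressed)
```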
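And finally, a stdlib-only sketch of the read half of the “copy without decompression” trick: the central directory gives us each entry’s offset and compressed size, so the still-compressed bytes can be pulled out directly. (Writing those bytes into a new archive additionally requires emitting local headers and a new central directory, which I’ve elided; the header layout here is from the zip appnote.)

```python
import struct
import zipfile

LOCAL_HEADER_FMT = "<4s5H3L2H"  # fixed portion of a local file header
LOCAL_HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FMT)  # 30 bytes

def read_raw_entry(wheel_path: str, name: str) -> bytes:
    """Return one entry's bytes exactly as stored, still compressed."""
    with zipfile.ZipFile(wheel_path) as zf:
        # The central directory (at the end of the file) records each
        # entry's offset and compressed size.
        info = zf.getinfo(name)
    with open(wheel_path, "rb") as f:
        f.seek(info.header_offset)
        fields = struct.unpack(LOCAL_HEADER_FMT, f.read(LOCAL_HEADER_SIZE))
        sig, name_len, extra_len = fields[0], fields[-2], fields[-1]
        assert sig == b"PK\x03\x04", "not a local file header"
        # Skip the variable-length filename and extra field, then read the
        # compressed payload verbatim (no decompression happens here).
        f.seek(name_len + extra_len, 1)
        return f.read(info.compress_size)
```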
All three of these designs are still in the prototype phase, but the PEP as it stands seems to provide sufficient compatibility guarantees for them to be employed. I have spent many, many hours with the zip file format over the past few years, and I would love for this PEP to reach standardization so I can make use of these techniques to make packages smaller and faster. I understand the above was a lot of information, but I hope it provides useful food for thought as to what this PEP would enable. Sorry if not!