Yes, just like PyPI might require DEFLATE only in .zip’s or .gz only for sdists, even though it is possible to create and install different kinds of archives with pip, it should limit the uploaded wheels until everyone is very comfortable with any change. The nested format would be either standard compression of the nested archive by the outer zip, or zstd compression (and no other nonstandard zip compression) in the nested zip.
“Debian packages are standard Unix ar archives that include two tar archives. One archive holds the control information and another contains the installable data.”
"Conda has been based around .tar.bz2 files since its inception. The actual file format doesn’t matter much to conda. Really only the relative paths within the container are what matter. Thanks to this simplifying assumption, Anaconda has been developing a new file format that will speed up conda. The new file format (.conda) consists of an outer, uncompressed ZIP-format container, with two inner compressed .tar files. These inner compressed tar files are free to use any compression that libarchive is built to support. For the inaugural .conda files that Anaconda is creating, the zstandard compression format is being used. The zstandard-compressed tarballs can be significantly smaller than their bzip2 equivalents. In addition, they decompress much more quickly. "
“Package metadata (the “info” folder) makes up one of the inner tarballs, while the actual package contents (everything but the “info” folder) make up the other. By separating the metadata this way, we take advantage of the indexed nature of zip files for rapid access to package metadata.”
RPM separates the metadata and contents a little differently.
“An RPM package is simply a header structure on top of a CPIO archive. The package itself is comprised of four sections: a header with a leading identifier (magic number) that identifies the file as an RPM package, a signature to verify the integrity of the package, the header or ‘tagged’ data containing package information, version numbers, and copyright messaging, and the archive containing the actual program files.” - https://blog.packagecloud.io/eng/2015/10/13/inspect-extract-contents-rpm-packages/
The reason for the nested archives is that zip compression operates on a file-by-file basis, which means you lose a meaningful amount of possible compression because it’s not acting over the entire archive at once. The flip side is that it’s situationally useful to be able to get random access to some of the files inside a wheel (notably the metadata files inside the .dist-info directory). So a nested archive gives us the best of both worlds: we can use per-file compression for the .dist-info directory, enabling random access, and then compress the rest of the payload as a single stream of data, allowing greater compression wins.
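To make the tradeoff concrete, here is a minimal sketch using only the standard library. It is not the proposed format: the inner member name `payload.zip`, the helper names, and the made-up package contents are all assumptions for illustration, and DEFLATE stands in for zstd, which the stdlib `zipfile` used here does not provide. The metadata stays as ordinary zip members, while everything else goes into an uncompressed inner zip that the outer zip compresses as one stream.

```python
import io
import zipfile


def build_flat(files):
    """Conventional layout: each file is its own DEFLATE-compressed member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def build_nested(files, inner_name="payload.zip"):
    """Nested layout: .dist-info members stay directly addressable; everything
    else goes into an uncompressed inner zip that the outer zip compresses
    as a single stream."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w", zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            if ".dist-info/" not in name:
                zf.writestr(name, data)
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            if ".dist-info/" in name:
                zf.writestr(name, data)
        zf.writestr(inner_name, inner.getvalue())
    return outer.getvalue()


# Made-up package: many small modules sharing the same boilerplate header.
HEADER = b"# Licensed under the Example License; see LICENSE for details.\n" * 4
files = {"demo-1.0.dist-info/METADATA": b"Name: demo\nVersion: 1.0\n"}
for i in range(200):
    files["demo/mod%d.py" % i] = HEADER + b"def f():\n    return %d\n" % i

flat = build_flat(files)
nested = build_nested(files)
print("flat:", len(flat), "bytes; nested:", len(nested), "bytes")
```

Reading `demo-1.0.dist-info/METADATA` from the nested archive never touches the payload, which is the random-access property described above. How much the single-stream payload saves depends heavily on how much the files in a real wheel have in common.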
For the record, I’m not sure that zstd is the right win here in general.
I do think that the recompression hook is kind of an odd duck, at least for long term support for a hypothetical new wheel format. Ideally I think if we change the format, the tooling would just learn to natively support it, not rely on recompressing it to different formats on the fly.
I changed my mind on the recompression hook as well. I had assumed that the baseline, DEFLATE-compressed “flat” wheels would be faster than a more-compressed “nested” version. If that were the tradeoff, you would fill up your local wheel caches with old-style wheels to speed up many-local-installs workflows (e.g. virtualenv-per-deployment), and only recompress for deployment. It turns out that faster and smaller at the same time is an option, so the extra step would waste time.
I think the download then decompress once math looks like:
If Algo A = 70 MB, decompresses in 5 seconds, Algo B = 75 MB, decompresses in 2 seconds, then
A is faster if your download speed is less than (5 MB / (5 - 2) s) * 8 bits/byte ≈ 13.3 Mbps; otherwise B is faster.
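The same arithmetic as a small sketch, in case anyone wants to plug in their own numbers. The function names are made up for illustration, and the model only counts download time plus a single decompression, ignoring latency and disk speed.

```python
def break_even_mbps(size_a_mb, secs_a, size_b_mb, secs_b):
    """Download speed (Mbit/s) at which A and B take the same total time.

    Assumes A is the smaller download but the slower one to decompress.
    Below the returned speed A wins; above it B wins.
    """
    extra_mb = size_b_mb - size_a_mb   # extra data B has to download
    saved_s = secs_a - secs_b          # decompression time B saves
    if extra_mb <= 0 or saved_s <= 0:
        raise ValueError("expected A to be smaller and slower than B")
    return extra_mb / saved_s * 8      # MB/s -> Mbit/s


def total_seconds(size_mb, decompress_s, mbps):
    """Download time at the given link speed plus one decompression."""
    return size_mb * 8 / mbps + decompress_s


threshold = break_even_mbps(70, 5, 75, 2)
print(round(threshold, 1), "Mbps")
print(total_seconds(70, 5, 10), total_seconds(75, 2, 10))   # slow link
print(total_seconds(70, 5, 20), total_seconds(75, 2, 20))   # fast link
```

At 10 Mbps the smaller archive comes out ahead (61 s versus 62 s); at 20 Mbps the faster decompressor does (33 s versus 32 s).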
We’ve gone pretty far, but I don’t think the questions asked in the OP got answered directly. I’ve just read through all the posts here, and the discussion has basically transitioned into “what would work well”, which is great.
Seems like the answer to the first question here is a yes, since we’re considering doing that right now.
These are the questions that haven’t been answered directly; can someone please shine some more light in this direction?
I’ve been interested in the idea of improving compression in wheel for a long time, but it was one too many features at the beginning. Excuse my enthusiasm that there is some traction now.
We need a test harness so that critics can download, say, the top 30 or 100 wheels, re-compress them with all variants of the idea and see a report about the change in time and space. A Jupyter notebook with wonderful charts? I have a compressor in my dholth/wgc repository but it does not do all compression algorithms, and a patch to install with pip. If we agree that we like a 25% reduction in bandwidth/storage or an increase in speed for the wheels built locally then,
We build the prototype up to a stage where it is useful locally and in private repositories, just like the original wheel which was used for local build caches long before it was used for public distribution on PyPI. Interested parties use the technology for ~6-12 months.
If there is consensus that this version is an improvement, then PyPI makes rules for the subset of the spec that it will accept (Which ZIP compression algorithms are allowed? How much memory may decompression use?) and allows the new version.
A similar gradual rollout should happen for “installable with pip without passing a flag” with “generated by bdist_wheel as the default” going last.
What version number should the prototype wheels have? We will choose a version number for testing wheels and increment that for a final spec. As long as the WHEEL file is in the same place as 1.0 all existing wheel installers will notice these and alert the user if they are installing an unsupported version.
I don’t think there’s an exact criterion where if it falls under X line of improvement then it’s a no go, but if it’s above X line then it’s a yes. Ultimately it comes down to how big the change is, how backwards compatible it is, and how big the improvement is. I suspect the real answer here is going to be a very vague “when the person who wants to make the change can convince enough people, particularly the BDFL delegate, that the win outweighs the costs by a big enough degree”.
Write a PEP, get it approved.
With wheel specifically, if we keep the wheel itself a zip file, and we keep the .dist-info directory as normal zip file members stored with deflate or with no compression at all, then we would retain the ability for installers to support warnings/errors on unknown wheel version types and give useful error messages. Otherwise we’d be limited to putting the wheel version in the filename somehow.
Given the original wheel spec was designed with a wheel version embedded precisely so that new versions could be introduced cleanly, I’d be very uncomfortable with a new wheel version that throws away that versioning scheme before it’s even been of benefit for a single release bump. Feel free to consider that a BDFL-delegate pronouncement, if you want.
(To put it another way, if we don’t follow the wheel versioning standard defined in PEP 427, then the new format should probably be given a new name and be treated as a replacement for wheels, not a new version of the wheel format).
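For reference, a rough sketch of the check this relies on: as long as the WHEEL file is an ordinary member of the outer zip, an installer can read Wheel-Version before looking at anything else. As I read PEP 427, an installer should warn when the version is newer than it supports and must fail when the major version is greater. The function names and the `SUPPORTED` constant below are made up for illustration and are not pip's actual code.

```python
import io
import zipfile
from email.parser import Parser

SUPPORTED = (1, 0)  # hypothetical: the newest Wheel-Version this installer knows


def read_wheel_version(zf):
    """Return Wheel-Version from the top-level *.dist-info/WHEEL member."""
    names = [n for n in zf.namelist()
             if n.endswith(".dist-info/WHEEL") and n.count("/") == 1]
    if len(names) != 1:
        raise ValueError("expected exactly one .dist-info/WHEEL file")
    # WHEEL uses the same key: value layout as email headers.
    msg = Parser().parsestr(zf.read(names[0]).decode("utf-8"))
    value = msg["Wheel-Version"]
    if value is None:
        raise ValueError("WHEEL file has no Wheel-Version field")
    return tuple(int(part) for part in value.strip().split("."))


def check_wheel_version(version, supported=SUPPORTED):
    """Return "ok" or "warn"; raise if the major version is too new."""
    if version[0] > supported[0]:
        raise ValueError("unsupported wheel major version: %r" % (version,))
    if version > supported:
        return "warn"  # newer minor version: install, but tell the user
    return "ok"


# Build a tiny in-memory archive that claims a newer minor version.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("demo-1.0.dist-info/WHEEL",
                "Wheel-Version: 1.1\nGenerator: demo\n")
with zipfile.ZipFile(demo) as zf:
    print(check_wheel_version(read_wheel_version(zf)))
```

None of this works if the WHEEL file itself is only reachable by decompressing a payload the installer does not understand, which is the point being made above.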
Very neat… I was confused about some of the results in this discussion because I was using the OSX tensorflow wheel instead of the larger Linux one.
I tried again using wgc2.py on the 402MB tensorflow-2.1.0-cp37-cp37m-manylinux2010_x86_64.whl and an inner zip. I’m testing on a recent dual core MacBook Air, I’ll also try a Raspberry Pi 3.
With default ZIP_DEFLATE, 399MB, 1m20.254s.
With zstd -3 -T0, 275MB, 0m28.405s.
With zstd -19 -T0, 146MB, 4m49.152s.
I thought it would be interesting to rank PyPI packages by downloads * size as a way to choose wheels to test instead of selecting them at random. Here’s a notebook that compares bandwidth versus popularity at the file level. https://nbviewer.jupyter.org/github/dholth/wgc/blob/master/pypipopular.ipynb . We could develop a corpus of wheels by taking the top 32 or 64 or so and compare the results running them through various schemes.
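The ranking itself is simple; a sketch of the idea, separate from the notebook above. The function name is hypothetical and the file names and figures are invented purely to show the shape of the data.

```python
def rank_by_bandwidth(stats, top=32):
    """Rank files by downloads * size, largest estimated bandwidth first.

    `stats` is an iterable of (filename, downloads, size_bytes) tuples.
    Returns a list of (filename, estimated_bytes_served).
    """
    ranked = sorted(stats, key=lambda row: row[1] * row[2], reverse=True)
    return [(name, downloads * size) for name, downloads, size in ranked[:top]]


# Invented numbers, only to illustrate the ordering.
stats = [
    ("small_but_popular-1.0-py3-none-any.whl", 5000000, 20000),
    ("huge_but_rare-2.0-cp37-cp37m-manylinux2010_x86_64.whl", 1000, 400000000),
    ("medium-3.1-cp37-cp37m-manylinux2010_x86_64.whl", 200000, 10000000),
]
for name, total in rank_by_bandwidth(stats, top=2):
    print(name, total)
```

A file that is downloaded rarely can still dominate bandwidth if it is large enough, which is why ranking by the product picks a different corpus than ranking by download count alone.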
I’m more interested in making the format better for the end user than reducing bandwidth usage on the server, but those are related.
What if the wheel zip did not use any internal, per-file compression, but the PyPI server stored the artefacts raw, gzip-compressed, and br-compressed, and served them according to the Accept-Encoding request header field?
The actual wheel on PyPI seems to have been generated with a compression level of 3 or so, by its size. Perhaps wheel should configure zlib deflate with a slightly higher level of 6 or so?
Indeed. The compresslevel option was added to ZipFile relatively recently (Python 3.7) which explains its absence in the wheel project. It’s not a big deal to add the same option to the WheelFile class. Perhaps a warning should be emitted on earlier Python versions where the option is not supported?
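A minimal sketch of what that could look like with plain `zipfile.ZipFile`. This is not the wheel project's WheelFile API; the helper names and sample data are made up, and warning on older Pythons is just one possible behaviour.

```python
import io
import sys
import warnings
import zipfile


def open_deflate_zip(fileobj, compresslevel=None):
    """Open a DEFLATE zip for writing, passing compresslevel where supported."""
    kwargs = {}
    if compresslevel is not None:
        if sys.version_info >= (3, 7):
            kwargs["compresslevel"] = compresslevel
        else:
            # ZipFile only accepts compresslevel from Python 3.7 onwards.
            warnings.warn("compresslevel is ignored on Python < 3.7")
    return zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED, **kwargs)


def deflated_size(data, compresslevel):
    """Size in bytes of a one-member archive written at the given level."""
    buf = io.BytesIO()
    with open_deflate_zip(buf, compresslevel) as zf:
        zf.writestr("module.py", data)
    return len(buf.getvalue())


# Made-up, source-like sample data.
sample = "\n".join(
    "value_%d = compute(%d, %d)" % (i, i * 7, i * 13) for i in range(5000)
).encode("ascii")

for level in (1, 3, 6, 9):
    print(level, deflated_size(sample, level))
```

Running something like this over real wheels would show how much a higher default level actually buys, and at what cost in build time.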