Making the wheel format more flexible (for better compression/speed)

You’re only talking about differences in (de)compression speeds but ignoring the complexity of the wheel format changes entirely. Merely improving the decompression speed is, in my opinion, not a good enough argument to change the format in a backwards incompatible way. Were zstd supported on the same level as lzma in the standard library, I would have no argument against this.

After spending so much time working on making it easy to run code outside the standard library, it would feel wrong to not use some :grinning:

The proposal will be: unpack the wheel; zstd -d {nested}.zip.zst; unzip {nested}.zip; then install the current directory with the same old ‘unpacked wheel’ logic. Sure, it’s more complex, but not very much. Then your wheels are magically both smaller and faster, and PyPI costs less to operate. If an old wheel installer encounters the new wheel, it will see 2.0 in the WHEEL metadata and let you know you should upgrade.
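
Roughly, in code, the install side could look something like the sketch below. It assumes the third-party zstandard module and a hypothetical wheel.data.zip.zst member name; it only illustrates the two extra steps, not a finished design.

import os
import zipfile

import zstandard  # third-party bindings; not in the standard library

def install_nested(wheel_path, dest_dir):
    # Unpack the outer wheel as usual; metadata stays directly readable.
    with zipfile.ZipFile(wheel_path) as outer:
        outer.extractall(dest_dir)
    nested = os.path.join(dest_dir, "wheel.data.zip.zst")  # hypothetical member name
    if os.path.exists(nested):
        # zstd -d {nested}.zip.zst
        with open(nested, "rb") as src, open(nested[:-4], "wb") as dst:
            zstandard.ZstdDecompressor().copy_stream(src, dst)
        # unzip {nested}.zip
        with zipfile.ZipFile(nested[:-4]) as inner:
            inner.extractall(dest_dir)
    # ...then run the existing 'unpacked wheel' install logic on dest_dir.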

It would still be an improvement with the nested .zip and without the .zst step, but not as effective. In either case the nested archive is optional, so any wheel generator could skip creating it.

Thank you for your feedback! I am going to write a proposal, but I will make doubly sure that no one will ever be forced to use zstd if they don’t want it. At least until a decompressor is available in the standard library.

1 Like

Not sure what you’re referring to, but since we’re talking about core PyPA tools here, using software outside of PyPA and the stdlib does not make a great case here (again, in my opinion).

If reducing PyPA operating costs was the intended effect then that would argue for LZMA compression, not against it.

Does that mean that PyPI will not be accepting zstd wheels? Will the nested format be limited to zstd compressed wheels only?

Yes. Just as PyPI might require DEFLATE only in .zip files or .gz only for sdists, even though it is possible to create and install different kinds of archives with pip, it should limit the uploaded wheels until everyone is very comfortable with any change. The nested format would allow either standard compression of the nested archive by the outer zip, or zstd compression of the nested zip (and no other nonstandard zip compression).

Having nested archives seems pointless to me regardless of the compression algorithm. I could be convinced otherwise by good arguments though.

Nested archives are not an unusual design choice.

“Debian packages are standard Unix ar archives that include two tar archives. One archive holds the control information and another contains the installable data.” [2]

"Conda has been based around .tar.bz2 files since its inception. The actual file format doesn’t matter much to conda. Really only the relative paths within the container are what matter. Thanks to this simplifying assumption, Anaconda has been developing a new file format that will speed up conda. The new file format (.conda) consists of an outer, uncompressed ZIP-format container, with two inner compressed .tar files. These inner compressed tar files are free to use any compression that libarchive is built to support. For the inaugural .conda files that Anaconda is creating, the zstandard compression format is being used. The zstandard-compressed tarballs can be significantly smaller than their bzip2 equivalents. In addition, they decompress much more quickly. "

“Package metadata (the “info” folder) makes up one of the inner tarballs, while the actual package contents (everything but the “info” folder) make up the other. By separating the metadata this way, we take advantage of the indexed nature of zip files for rapid access to package metadata.”

RPM separates the metadata and contents a little differently.

“An RPM package is simply a header structure on top of a CPIO archive. The package itself is comprised of four sections: a header with a leading identifier (magic number) that identifies the file as an RPM package, a signature to verify the integrity of the package, the header or ‘tagged’ data containing package information, version numbers, and copyright messaging, and the archive containing the actual program files.” - https://blog.packagecloud.io/eng/2015/10/13/inspect-extract-contents-rpm-packages/
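
To make the conda example concrete, here is a rough sketch of why the indexed outer zip helps: a tool can read just the metadata tarball without touching the much larger payload. The info-*.tar.zst member naming is my assumption about the layout, and the third-party zstandard module stands in for libarchive.

import io
import tarfile
import zipfile

import zstandard  # third-party zstd bindings

def conda_metadata_names(conda_path):
    # The outer .conda container is an uncompressed zip, so reading one member is cheap.
    with zipfile.ZipFile(conda_path) as outer:
        info_member = next(n for n in outer.namelist() if n.startswith("info-"))
        compressed = outer.read(info_member)
    data = zstandard.ZstdDecompressor().decompressobj().decompress(compressed)
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        return tar.getnames()  # metadata paths, without touching the package payload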

2 Likes

The reason for the nested archives is that the zip format’s compression operates on a file-by-file basis, which means you lose a meaningful amount of possible compression because it’s not acting over the entire archive at once. The flip side is that it’s actually situationally useful to be able to get random access to some of the files inside a wheel (notably the metadata files inside the .dist-info directory). So a nested archive lets us get the best of both worlds: we can use per-file compression for the .dist-info directory, enabling random access, and then compress the rest of the payload as a single stream of data, allowing greater compression wins.
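
As a sketch of that layout (the wheel.data.zip.zst name, the compression level, and the exact member split are placeholders, not a spec): keep *.dist-info as ordinary deflated members of the outer zip and pack everything else into one inner archive compressed as a single stream.

import io
import os
import zipfile

import zstandard  # third-party; stand-in for whichever codec is finally chosen

def build_nested_wheel(unpacked_dir, out_path):
    payload = io.BytesIO()
    dist_info = []
    # Everything outside *.dist-info goes into one inner zip with stored members,
    # so the real compression happens once, over the whole payload.
    with zipfile.ZipFile(payload, "w", zipfile.ZIP_STORED) as inner:
        for root, _, files in os.walk(unpacked_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, unpacked_dir)
                if rel.split(os.sep)[0].endswith(".dist-info"):
                    dist_info.append((full, rel))
                else:
                    inner.write(full, rel)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as outer:
        # Metadata stays as ordinary per-file deflated members, keeping random access.
        for full, rel in dist_info:
            outer.write(full, rel)
        # The payload is compressed as one zstd stream and stored without re-deflating.
        blob = zstandard.ZstdCompressor(level=19).compress(payload.getvalue())
        outer.writestr("wheel.data.zip.zst", blob, compress_type=zipfile.ZIP_STORED)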

For the record, I’m not sure that zstd is the right win here in general.

I do think that the recompression hook is kind of an odd duck, at least for long term support for a hypothetical new wheel format. Ideally I think if we change the format, the tooling would just learn to natively support it, not rely on recompressing it to different formats on the fly.

2 Likes

I changed my mind on the recompression hook as well. I had assumed that the baseline, DEFLATE-compressed “flat” wheels would be faster to install than a more highly compressed “nested” version. If that were the tradeoff, you would fill your local wheel cache with old-style wheels to speed up many-local-install workflows (e.g. a virtualenv per deployment) and only recompress for deployment. It turns out that faster and smaller at the same time is an option, so the extra step would just waste time.

I think the download then decompress once math looks like:

If Algo A = 70 MB and decompresses in 5 seconds, while Algo B = 75 MB and decompresses in 2 seconds, then

A is faster if your download speed is less than (75 MB − 70 MB) / (5 s − 2 s) × 8 bits/byte ≈ 13.3 Mbps; otherwise B is faster.
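
The same break-even arithmetic in code, using the example numbers above:

def break_even_mbps(size_a_mb, time_a_s, size_b_mb, time_b_s):
    # A wins when size_a/rate + time_a < size_b/rate + time_b,
    # i.e. when the download rate is below this threshold.
    return (size_b_mb - size_a_mb) / (time_a_s - time_b_s) * 8  # megabits/second

print(break_even_mbps(70, 5, 75, 2))  # ~13.3 Mbps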

We’ve gone pretty far, but I don’t think the questions asked in the OP got answered directly. I’ve just read through all the posts here, and the discussion basically transitioned into “what would work well”, which is great.

Seems like the answer to the first question here is a yes, since we’re considering doing that right now.

These are the questions that haven’t been answered directly; can someone please shine some more light in this direction?

I’ve been interested in the idea of improving compression in wheel for a long time, but it was one too many features at the beginning. Excuse my enthusiasm that there is some traction now.

We need a test harness so that critics can download, say, the top 30 or 100 wheels, re-compress them with all variants of the idea, and see a report about the change in time and space (a rough sketch of such a harness follows this plan). A Jupyter notebook with wonderful charts? I have a compressor in my dholth/wgc repository, but it does not do all compression algorithms, plus a patch to install with pip. If we agree that we like a 25% reduction in bandwidth/storage or an increase in speed for wheels built locally, then:

We build the prototype up to a stage where it is useful locally and in private repositories, just like the original wheel which was used for local build caches long before it was used for public distribution on PyPI. Interested parties use the technology for ~6-12 months.

If there is consensus that this version is an improvement, then PyPI makes rules for the subset of the spec that it will accept (which ZIP compression algorithms are allowed? how much memory may decompression use?) and allows the new version.

A similar gradual rollout should happen for “installable with pip without passing a flag” with “generated by bdist_wheel as the default” going last.

What version number should the prototype wheels have? We will choose a version number for testing wheels and increment that for a final spec. As long as the WHEEL file is in the same place as in 1.0, all existing wheel installers will notice these wheels and alert the user that they are installing an unsupported version.
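
For the test-harness step, a minimal sketch of the shape it could take: given a directory of downloaded wheels and a set of candidate recompression schemes, time each one and report the size change. The variants callables are placeholders for whatever schemes we end up comparing.

import glob
import os
import time

def benchmark(wheel_dir, variants):
    # variants: mapping of label -> callable(src_path, dst_path)
    for whl in glob.glob(os.path.join(wheel_dir, "*.whl")):
        original = os.path.getsize(whl)
        for label, recompress in variants.items():
            out = whl + "." + label
            start = time.perf_counter()
            recompress(whl, out)  # e.g. re-deflate, nested zip, nested zip + zstd
            elapsed = time.perf_counter() - start
            ratio = os.path.getsize(out) / original
            print(f"{os.path.basename(whl)}  {label:>16}  {ratio:6.1%}  {elapsed:7.2f}s")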

1 Like

I don’t think there’s an exact criterion where anything below line X is a no-go and anything above it is a yes. Ultimately it comes down to how big the change is, how backwards compatible it is, and how big the improvement is. I suspect the real answer here is going to be a very vague “when the person who wants to make the change can convince enough people, particularly the BDFL delegate, that the win outweighs the costs by a big enough degree”.

Write a PEP, get it approved.

With wheel specifically, if we keep the wheel itself a zip file, and we keep the .dist-info directory as normal zip file members stored with deflate or with no compression at all, then we retain the ability for installers to warn or error on unknown wheel versions and give useful error messages. Otherwise we’d be limited to putting the wheel version in the filename somehow.

Given the original wheel spec was designed with a wheel version embedded precisely so that new versions could be introduced cleanly, I’d be very uncomfortable with a new wheel version that throws away that versioning scheme before it’s even been of benefit for a single release bump. Feel free to consider that a BDFL-delegate pronouncement, if you want :wink:

(To put it another way, if we don’t follow the wheel versioning standard defined in PEP 427, then the new format should probably be given a new name and be treated as a replacement for wheels, not a new version of the wheel format).

Sure, but I was asking more on a technical level. :slight_smile:

And those two quotes together get me my answer: the *.dist-info/WHEEL file needs to continue to be readable on its own by code just like it is today to get at the Wheel-Version key.
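
Which is roughly what installers already do today; a simplified sketch (not pip’s actual code) of reading that key:

import zipfile
from email.parser import Parser

def wheel_version(path):
    # Locate *.dist-info/WHEEL inside the outer zip and parse its headers.
    with zipfile.ZipFile(path) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".dist-info/WHEEL"))
        msg = Parser().parsestr(zf.read(name).decode("utf-8"))
    return msg["Wheel-Version"]  # e.g. "1.0"; unknown major versions trigger a warning/error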

1 Like

I was curious what effect the zlib tunables would have here. zipfile uses the zlib default compression level.

So if we drop a patch in wheelfile.py (with zopfli in there as well):

import os, zipfile, zlib, zopfli

# Compression level (1-9) or 'zopfli', taken from the environment.
z_comp = os.environ['Z_COMPRESSION']

def _get_compressor_patch(compress_type):
    # Leave non-DEFLATE members (e.g. stored) to the original implementation.
    if compress_type != zipfile.ZIP_DEFLATED:
        return _get_compressor(compress_type)
    if z_comp == 'zopfli':
        # Zopfli emits DEFLATE-compatible output, just much more slowly.
        return zopfli.ZopfliCompressor(zopfli.ZOPFLI_FORMAT_DEFLATE)
    # Raw DEFLATE stream (wbits=-15) at the requested level, as zipfile expects.
    return zlib.compressobj(int(z_comp), zlib.DEFLATED, -15)

# Monkeypatch zipfile, keeping a reference to the original for the fallback above.
_get_compressor, zipfile._get_compressor = zipfile._get_compressor, _get_compressor_patch

With that in place I used wheel pack on a tensorflow wheel at varying compression levels:

compression     user   sys   max RSS  size
deflate -1     50.23  1.04  1555476k  403M
deflate 1      23.58  0.91  1605604k  457M
deflate 2      24.79  0.97  1594144k  445M
deflate 3      28.16  1.01  1584876k  435M
deflate 4      31.30  0.94  1571500k  419M
deflate 5      37.34  0.93  1561152k  409M
deflate 6      50.17  0.98  1555252k  403M
deflate 7      60.30  0.95  1554268k  401M
deflate 8      88.70  1.12  1552852k  400M
deflate 9     115.30  1.27  1552480k  400M
zopfli       6154.12  3.24  3075700k  388M

This was run on an i5-4670 desktop PC (scheduler powersave, frequency boost off). The user, sys, and max RSS columns are as output by /usr/bin/time; the size column is the human-readable size per ls.

The actual wheel on PyPI seems to have been generated with a compression level of 3 or so, by its size. Perhaps wheel should configure zlib deflate with a slightly higher level of 6 or so?

I also threw zopfli in there for comparison. While the resulting wheel is smaller it took over an hour and a half to generate.

1 Like

Very neat… I was confused about some of the results in this discussion because I was using the OSX tensorflow wheel instead of the larger Linux one.

I tried again using wgc2.py on the 402MB tensorflow-2.1.0-cp37-cp37m-manylinux2010_x86_64.whl and an inner zip. I’m testing on a recent dual-core MacBook Air; I’ll also try a Raspberry Pi 3.

With the default ZIP_DEFLATED: 399MB, 1m20.254s.
With zstd -3 -T0: 275MB, 0m28.405s.
With zstd -19 -T0: 146MB, 4m49.152s.
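
For anyone reproducing these numbers without the zstd command-line tool, the python zstandard bindings expose the same knobs; a rough equivalent of the level/threads settings above (ignoring the inner-zip packaging itself):

import zstandard

def compress_file(src, dst, level=19, threads=-1):
    # Roughly `zstd -<level> -T0 src -o dst`; threads=-1 means use all CPU cores.
    cctx = zstandard.ZstdCompressor(level=level, threads=threads)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)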

I thought it would be interesting to rank PyPI packages by downloads * size as a way to choose wheels to test, instead of selecting them at random. Here’s a notebook that compares bandwidth versus popularity at the file level: https://nbviewer.jupyter.org/github/dholth/wgc/blob/master/pypipopular.ipynb. We could develop a corpus of wheels by taking the top 32 or 64 or so and comparing the results of running them through various schemes.
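
The ranking itself is trivial once you have per-file download counts and sizes; a small sketch, where the column names are my assumptions about the export rather than the notebook’s exact schema:

import csv

def top_by_bandwidth(csv_path, n=64):
    # Rank files by downloads * size, i.e. approximate bandwidth consumed.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))  # assumes 'filename', 'downloads', 'size' columns
    rows.sort(key=lambda r: int(r["downloads"]) * int(r["size"]), reverse=True)
    return [r["filename"] for r in rows[:n]]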

I’m more interested in making the format better for the end user than reducing bandwidth usage on the server, but those are related.

3 Likes

I’ve put up an index of 925 popular wheels in the prototype “nested .data.zip.zst” format. It is at https://w2test.website-us-east-1.linodeobjects.com/ unless I run out of transfer. It is suitable for use as pip install --extra-index-url=https://w2test.website-us-east-1.linodeobjects.com/simple/ [package] but you will need a patched pip from https://github.com/dholth/pip/tree/zstd-wheel

The collection went from 2.6GB of originals to 1.5GB using zstd -19 -T2.

A silly counter-idea:

What if the wheel zip did not use any internal, per-file compression, but the PyPI server stored the artefacts raw, gzip-compressed, and br-compressed, and thus served them according to the Accept-Encoding request header field?

That was my idea:

Here are @uranusjr’s thoughts:

The actual wheel on PyPI seems to have been generated with a compression level of 3 or so, by its size. Perhaps wheel should configure zlib deflate with a slightly higher level of 6 or so?

Indeed. The compresslevel option was added to ZipFile relatively recently (Python 3.7), which explains its absence in the wheel project. It’s not a big deal to add the same option to the WheelFile class. Perhaps a warning should be emitted on earlier Python versions where the option is not supported?
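
Something along these lines might work; this is only a sketch of how the option could be threaded through, and the warning/fallback behaviour is just a suggestion:

import sys
import warnings
import zipfile

class WheelFile(zipfile.ZipFile):  # simplified stand-in for wheel's WheelFile
    def __init__(self, file, mode="r", compresslevel=None):
        kwargs = {}
        if compresslevel is not None:
            if sys.version_info >= (3, 7):
                kwargs["compresslevel"] = compresslevel
            else:
                # compresslevel is not accepted by ZipFile before Python 3.7.
                warnings.warn("compresslevel is ignored on Python < 3.7; "
                              "the zlib default will be used")
        super().__init__(file, mode, compression=zipfile.ZIP_DEFLATED, **kwargs)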

I’ve created a new issue: https://github.com/pypa/wheel/issues/355