Improving wheel compression by nesting data as a second .zip

As you may know, .zip files compress each file's contents individually and don't compress any of the zip metadata, such as filenames and the central directory. This is inefficient for archives with a lot of small files: the metadata stays uncompressed, and compressing each file separately means the compressor never has time to “get going” on a longer input. This is one reason why we let you put files in the root of the wheel instead of giving all files a prefix.
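To make the overhead concrete, here's a toy measurement (illustrative only, with made-up paths): a zip of many tiny files is dominated by its uncompressed per-entry headers and central directory.

```python
import io
import zipfile

# Each zip entry stores its filename twice (local header + central directory),
# uncompressed, so long paths and tiny files inflate the archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(1000):
        zf.writestr(f"some/deeply/nested/package/path/module_{i:04d}.py", b"x = 1\n")

payload = 1000 * len(b"x = 1\n")  # 6000 bytes of actual content
archive = len(buf.getvalue())     # many times larger, almost all metadata
print(f"payload: {payload} bytes, archive: {archive} bytes")
```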

One way around this problem is to put an uncompressed zip inside another zip, so that the entire inner zip is compressed as a single unit, similar to the nested archive structure of a .deb. This helps most for wheels with a lot of small files and a lot of long filenames. I finally got around to seeing what this would do for wheel.

This program transforms a wheel so that everything except the .dist-info metadata goes into an inner .zip, and can compress that .zip with various algorithms: https://gist.github.com/dholth/42a9b452f024d283b7c3c34cfa40832a
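The core of the transform looks roughly like this (a simplified sketch, not the gist's actual code; the inner archive name `payload.zip` is my placeholder):

```python
import io
import zipfile

def nest_wheel(src_path: str, dst_path: str) -> None:
    """Move everything except *.dist-info/* into an inner, uncompressed zip,
    then compress that inner zip as a single member of the outer archive."""
    inner_buf = io.BytesIO()
    metadata = []
    with zipfile.ZipFile(src_path) as src:
        with zipfile.ZipFile(inner_buf, "w", compression=zipfile.ZIP_STORED) as inner:
            for info in src.infolist():
                data = src.read(info)
                if ".dist-info/" in info.filename:
                    metadata.append((info.filename, data))  # stays at top level
                else:
                    inner.writestr(info.filename, data)     # stored, not compressed
    with zipfile.ZipFile(dst_path, "w", compression=zipfile.ZIP_LZMA) as outer:
        for name, data in metadata:
            outer.writestr(name, data)
        # The whole inner zip -- filenames, directory, and contents --
        # is now compressed as one continuous stream.
        outer.writestr("payload.zip", inner_buf.getvalue())
```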

To test, I took the top 512 packages over the last 30 days according to https://hugovk.github.io/top-pypi-packages/ and recompressed the ones that have wheels. The report shows the original size, the size change when they are recompressed with gz -9 (the only compression algorithm allowed in wheel), bzip2, or LZMA (xz), and the savings weighted by download count. Did I make a mistake with LZMA or is it that good (and slow)? https://docs.google.com/spreadsheets/d/1PPGK5m7mX5G3blpqG77AFeR59-PiXn9guOhySZow1PI

Change in transfer with nested .zip:

gz -9: 93.8%
bzip2: 83.0%
xz:    58.7%

LZMA is indeed able to achieve very high compression rates, especially at the highest (and slowest) levels.

As for bz2, besides not being much better than gz, it is also very slow to decompress, which makes the user experience less pleasant (I see this especially with conda packages, which use bz2 and can be quite large sometimes @msarahan ).
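To get a rough feel for the relative ratios, here's a toy comparison on repetitive, code-like input (illustrative only; real wheels will differ, and this says nothing about decompression speed):

```python
import bz2
import lzma
import zlib

# Highly repetitive input; absolute numbers are meaningless, but on data
# like this the ordering (xz < bzip2 < gz) tends to hold.
data = b"".join(b"def func_%04d(): return %d\n" % (i, i) for i in range(5000))

sizes = {
    "gz -9": len(zlib.compress(data, 9)),
    "bzip2": len(bz2.compress(data, 9)),
    "xz":    len(lzma.compress(data, preset=9)),
}
for name, size in sizes.items():
    print(f"{name:5s} {size:6d} bytes ({100 * size / len(data):.1f}%)")
```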

This is a lot of why we’re moving to a new package format for conda. https://github.com/conda/conda-package-handling


One thing to watch out for with LZMA is that XZ is kind of a terrible archival format, i.e. you probably shouldn’t use it for data you care about being able to read a decade from now. This isn’t necessarily the main criterion for something like PyPI, but it’s not irrelevant either. See: https://www.nongnu.org/lzip/xz_inadequate.html

Also, something that's not mentioned there: XZ/LZMA can be configured in modes that are very memory-hungry while decompressing, which might be problematic if you're trying to install a package on a small-memory VM.

On purely technical grounds, zstd or brotli are probably better options than anything else, but neither is in the stdlib, which is a serious problem for pip…

Brotli is cool. Might it have a super slow pure Python version? I don't know whether LZMA as done by zip has the same issues as a literal .xz file. LZMA in zip is supported by the zip standard and by stdlib zipfile; brotli isn't. Memory usage is probably bounded by the modest size of Python packages?

Is there a standard way to use lzma directly in zipfiles? My impression is that lzma itself is not nearly as messy as xz, but I thought xz was the only widely supported way to use lzma.

If we’re talking about the technical quality of archive formats, then this critique of tar is also really interesting reading: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar


If I remember correctly, @Dino and I talked about Python-specific archive formats at PyCon US.

Or perhaps you should anyway. The article you link to mentions many theoretical issues, but it seems unlikely that XZ will end up unmaintained in 10 years, or will fail to read previously written files.

As for data corruption, I’m not sure this is something you want to worry about at the application file format level nowadays.

I’m not sure zstandard achieves compression levels as high as xz. Brotli is behind zstd AFAIU.

Compression algorithms are cool. Zstd sounds amazing (a modern, parallel-CPU-friendly design) but it also sounds hard to reimplement. LZMA is painfully slow to compress at the default setting and so might make users sad. Brotli is a tasty snack.

It’s borderline crazy to suggest reimplementing a compression algorithm in pure Python (at least for use with CPython). Just put the C wrapper in the stdlib.

Sure, there are probably easier bootstrapping methods if you want the native decompressor but don't already have it.

If we do implement this, LZMA would be the obvious choice. It provides excellent compression; it's in the standard library as zipfile.ZipFile("zipfile.zip", "a", compression=zipfile.ZIP_LZMA); memory usage on compression would surely be no more than 2x the size of the wheel in the worst case (and only two of the top 512 are > 50MB), and less on decompression; and if you were unhappy with the speed of repeated installs from cache, you could reverse the transform or recompress the nested archive with gzip.
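The "recompress the nested archive" idea could be as simple as repacking the outer members with deflate (a sketch under the assumption that the nesting transform produced an ordinary zip; member names are whatever that transform used):

```python
import zipfile

def recompress_for_cache(nested_path: str, out_path: str) -> None:
    """Repack an LZMA-compressed nested wheel with deflate so that
    repeated installs from a local cache decompress quickly."""
    with zipfile.ZipFile(nested_path) as src, \
         zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Re-store each member unchanged, just with the faster codec.
            dst.writestr(info.filename, src.read(info))
```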

Just for the record: https://www.phoronix.com/scan.php?page=news_item&px=Fedora-31-Zstd-RPM-Change

The switch from XZ to Zstd compression for Fedora RPMs is currently being considered in the name of greater decompression performance. Tests done by Red Hat engineers show this would pay off big time in much faster decompression speeds – around a third of the time it takes to decompress XZ’ed RPMs currently either to Tmpfs or an actual on-disk file-system. If going for the Zstd Level 19 compression level that’s being considered, it would also offer a much better compression ratio. At present, Fedora’s XZ-compressed RPMs are done at level two.
