Making the wheel format more flexible (for better compression/speed)

Ok, the above is wrong! I was doing too many things at once and didn’t notice that the tar variants above were writing empty files into the archive (all the file names were there, but with no contents).

I’ve updated the gist with the fixed code, this is what pip is now:

  • output/current: 1.4M
  • output/current+bzip2: 1.3M
  • output/current+deflate: 1.4M
  • output/current+lzma: 1.3M
  • output/tar+brotli: 910K
  • output/tar+bz2: 1.0M
  • output/tar+gz: 1.2M
  • output/tar+xz: 914K
  • output/tar+zstd: 959K
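For reference, here’s a minimal sketch of this kind of repacking experiment (hypothetical code, not the actual gist; zstandard is the third-party binding): unpack the wheel’s zip, re-archive its contents as a single uncompressed tar, then run that one stream through each compressor.

import io
import lzma
import tarfile
import zipfile

import zstandard  # third-party package, assumed installed


def wheel_to_tar_bytes(wheel_path):
    # Unpack the wheel (a zip) and re-archive its contents as one uncompressed tar stream.
    buf = io.BytesIO()
    with zipfile.ZipFile(wheel_path) as zf, tarfile.open(fileobj=buf, mode="w") as tf:
        for name in zf.namelist():
            if name.endswith("/"):          # skip directory entries
                continue
            data = zf.read(name)            # read the member's real contents
            info = tarfile.TarInfo(name)
            info.size = len(data)           # forgetting the size is how you end up with empty members
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compressed_sizes(wheel_path):
    tar_bytes = wheel_to_tar_bytes(wheel_path)
    return {
        "tar+xz": len(lzma.compress(tar_bytes)),  # xz container, preset 6 by default
        "tar+zstd": len(zstandard.ZstdCompressor(level=22).compress(tar_bytes)),
    }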

tensorflow will have to wait until later.


There’s also the decompression speed of each compression algorithm to consider. If decompression takes a while, then the topic on sharing installed packages across a machine becomes more relevant (I’m on my phone or else I would link to it). But if decompression is fast, then such special considerations aren’t necessarily needed.

Here’s the output for tensorflow’s manylinux wheels:

  • output/current: 402M
  • output/current+bzip2: 314M
  • output/current+lzma: 136M
  • output/tar+brotli: 148M
  • output/tar+bz2: 309M
  • output/tar+gz: 396M
  • output/tar+xz: 126M
  • output/tar+zstd: 122M

So my original numbers were definitely wrong :slight_smile: Compression algorithm does matter after all. I think that all of these except for bzip2 have relatively fast decompression available (and I think gzip/xz decompression actually gets faster the higher the compression level is? I could be misremembering that though).

That being said, assuming the compression numbers hold up across other projects on PyPI the same way they hold up on pip’s latest wheel and tensorflow’s latest manylinux wheel, it does appear that some significant savings can be had here.

It appears that zstd gives us the best compression for tensorflow, but lzma gives us the best compression for pip (besides brotli, but brotli did worse on tensorflow). They’re both pretty close in general though so I think either one would be perfectly fine.

LZMA has a massive advantage, though, in that it has been in the standard library since Python 3.3, which significantly lowers the barrier to actually being able to rely on it. That really leaves just two major problems:

  • How would we handle Python 2.7?
  • Are we willing to tell people that they HAVE to ensure that their Python has the lzma module for pip to function? It’s currently an optional dependency for Python itself (as is zlib, but zlib is far more common to have available than lzma).

For the first of those questions, there’s actually a fairly clean solution: in the spec, we simply ignore the fact that lzma doesn’t exist natively in Python 2.7, except to say that the new format should be opt-in for wheels that are valid on Python 2, but on by default for wheels that only target Python 3. People could even ship two wheels: a py2 wheel in the older format and a py3 wheel in the newer format.

Which really, at least in my opinion, brings us to the last question: do we feel comfortable mandating that the lzma module must be available in addition to zlib? I don’t really know the right answer to that. Probably we would want to survey the popular places people get Python from and see if LZMA is available in those Pythons. If nobody really has the lzma module by default in the Pythons they are shipping, then that sort of puts a nail in the coffin of the idea, I think; if most places make it available… then maybe it’s fine?
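The check itself is cheap, for whatever that survey is worth; a hypothetical installer-side guard could be as simple as:

# Hypothetical guard; the error message is illustrative only.
try:
    import lzma  # missing when CPython was built without the optional _lzma extension
except ImportError:
    raise SystemExit(
        "Installing this wheel requires the standard library 'lzma' module, "
        "which is missing from this Python build."
    )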


You should double check CPU and memory overhead too… my vague memory is that LZMA is one of the most expensive algorithms to run, and one of zstd’s major selling points is that it’s cheap as well as effective.

This is particularly relevant when running pip on small systems like Raspberry Pis.

Here are my results from last year for the most popular packages. The deltas tab shows the size difference between the original and recompressed packages. Other columns try to show the size difference weighted by the number of times each package is downloaded.


Recompressor https://gist.github.com/dholth/42a9b452f024d283b7c3c34cfa40832a

Some of the gain comes from turning many small individually compressed files (and never-compressed zip metadata) into one large singly compressed file.

I think that only “big” packages would prioritize moving to a new format.

bikeshed portion

I would do it without .tar. Invest in a streaming zip encoder/decoder, e.g. https://github.com/SpiderOak/ZipStream. Zip is a better-designed format, and the combined format would not depend on multiple archive formats. I would not be offended by a compressed, stored .zip instead of relying entirely on zip-supported compression algorithms.

Am I reading this correctly that this is basically the embedded tarball idea, except it’s an embedded zip file whose members are ZIP_STORED?

That is how it works. It is a classic technique. The inner zip is not compressed so that it can be compressed by the outer zip. If the inner zip was compressed in some way then it would be desirable to use ZIP_STORED in the outer zip.
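A minimal stdlib-only sketch of that classic technique (the file names are made up): the inner zip’s members are ZIP_STORED, so the outer zip’s compressor sees one large uncompressed stream.

import zipfile

# Inner archive: members stored uncompressed (ZIP_STORED) so they stay compressible.
with zipfile.ZipFile("data.zip", "w", compression=zipfile.ZIP_STORED) as inner:
    inner.writestr("pkg/__init__.py", "")
    inner.writestr("pkg/big_module.py", "x = 1\n" * 10_000)

# Outer archive: the inner zip is compressed as a single member.
# The stdlib's zip codecs are limited to ZIP_DEFLATED, ZIP_BZIP2, and ZIP_LZMA.
with zipfile.ZipFile("example-1.0-py3-none-any.whl", "w", compression=zipfile.ZIP_LZMA) as outer:
    outer.write("data.zip")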

I’ve been thinking about this issue since the initial design in 2012. The typical Python package, at least at the time, was a small collection of small files. This kind of package will not be significantly smaller with other compression techniques, so it was a good trade-off to have fast random access to those files in exchange for a tiny bit of space. Now we also have very big packages…

Yea, I don’t actually care about zip vs tarball; the only thing is that using zip’s native compression limits us to deflate, bzip2, and lzma (and funnily enough, unzip on macOS doesn’t appear to support anything but deflate for a zip file, or I did something wrong).

That could mean that if we ultimately push for something like zstd or brotli or whatever, then instead of using zip compression to store the data.zip, we do something similar to the tarball work, where we store a data.zip.zst or so.
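Roughly like this hedged sketch, reusing an uncompressed inner data.zip (data.zip.zst is just the name floated in this thread, not anything standardized):

import zipfile

import zstandard  # third-party binding, assumed installed

# Compress the ZIP_STORED inner zip as one stream, outside of zip's own codecs.
with open("data.zip", "rb") as f:
    payload = zstandard.ZstdCompressor(level=19).compress(f.read())

# The outer wheel zip stores the result without applying any further compression.
with zipfile.ZipFile("example-1.0-py3-none-any.whl", "w", compression=zipfile.ZIP_STORED) as outer:
    outer.writestr("data.zip.zst", payload)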

Yea. Looking at requests, pip, and tensorflow, it appears we can get a file size that is somewhere between 30% and 80% of the current file size (30% being tensorflow, 60% being pip, 80% being requests).

My suggestion would be to standardize on a single archive format rather than having two. I think it just makes things more coherent, with less chance that people have to jump around in their head as to which archive format they are dealing with.

And I do like the idea of the data.zip.zst file name for the decoupling of compression from archive format.

And one last thing to help motivate this whole discussion: based on Donald’s numbers, tensorflow could store 3 separate copies of the data files compressed with brotli, xz, and zstd in a single wheel and still save 6 MB compared to the status quo.

I wanted to like brotli but zstd is the best.

Here’s a pure-Python zstd decoder (uses WASM): https://github.com/dholth/zstdpy/blob/master/zstddec.py . It is not particularly fast because it uses a pure-Python WASM runtime. It uses the compact ~50k single-file decompress-only zstd. If this were to be made real, I would recommend wrapping the streaming decompression API and using a different runtime.


cc: @indygreg, for his experience writing the Python wrapper for the zstandard C API.

Is there a particular question you want me to answer?

I’m a huge fan of zstd due to its versatility. lzma/xz can still achieve higher compression ratios (read: smaller files) in many scenarios. But what makes zstd excel IMO is that decompression speed is always fast (>1 GB/s on modern CPUs) whereas lzma can be in the dozens of MB/s. This can add up to seconds when you are decompressing 100+ MBs.

The one thing to watch out for with zstd is that as you crank up the compression levels, it starts using a lot more memory. We’re talking hundreds of megabytes by the time you get to compression levels in the upper teens. Now, you can save a lot of bytes by cranking up the compression levels on large inputs. But it comes at a price. When the decoder can be on a memory-constrained device (as is the case with Python, think Raspberry Pis), you’d probably want to set an upper bound on the compression/decompression window size so you don’t blow through a memory budget. Since most wheels will be very small (hundreds of kilobytes to a few megabytes), I think a max window size of a few megabytes is reasonable, as there will be very few extensions larger than that. One area where lzma has an advantage over zstd is that lzma only uses a fraction of the memory on decompression for high compression ratios. (Compression is another matter entirely, however.)
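To make that concrete, here’s a hedged sketch with the zstandard package (the 8 MiB budget is just an illustrative figure): the producer caps window_log, and the consumer refuses frames that would need a larger window than it is willing to allocate.

import zstandard

# Producer: compress at a high level but cap the window at 8 MiB (2 ** 23 bytes).
params = zstandard.ZstdCompressionParameters.from_level(19, window_log=23)
cctx = zstandard.ZstdCompressor(compression_params=params)
frame = cctx.compress(b"example payload " * 100_000)

# Consumer: refuse to decompress any frame needing more than an 8 MiB window.
dctx = zstandard.ZstdDecompressor(max_window_size=2 ** 23)
original = dctx.decompress(frame)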

I see some comments here about the choice of tar versus zip. It is important to consider how compression applies here. In a zip, each file is stored independently and compression is often applied per file, so a zip is effectively an iterable of individually compressed files. (Although you can store files without compression.) A tar is just a stream of file content + metadata. If you apply compression to a zip containing independently compressed entries, it won’t compress well, as most of the data is already compressed. This is why a compressed tar file is almost always smaller than a zip file.

The main thing that zip files buy you is the ability to efficiently extract the content of a single file. This can be useful if you want to access a subset of files; tar files require you to read the entire stream until you get to the data for the file you want. You should probably not use zip files unless you care about random file access or want compatibility with Windows Explorer. Of course, if you don’t compress the files within the zip file and you feed that into a stream compressor, the results will be similar to tar. But then you lose the benefits of efficient random file access, so you should probably just use a tar.

Finally, if you wanted, you could compress elements of a zip file using whatever compression you want. I think Firefox is using brotli in its omni.ja[r] files (which are effectively zip files). No standard software can extract them. But if you control the decoder, you can do whatever you want :slight_smile:


This is an interesting point.

My tests above were all using whatever the highest compression level was for each algorithm, so that would be 22 for zstd. This suggests that a trade-off with zstd is that we’re likely going to have to tune the compression level down because of the memory issue on decompression. Do you happen to know offhand what a good target for the zstd compression level would be? I can rerun my tests at a different compression level to see what the difference is.

The trade-off then with lzma would be whether we’re OK with requiring higher memory usage for building a wheel (and that’s an if!). If so, lzma could possibly (assuming lowering the zstd compression level makes a meaningful difference) allow us to have much smaller wheels without having to sacrifice file size for decompression memory.

Obviously the ultimate trade-off with going with lzma is going to be the slower speed. Personally I think that’s fine, because for most of our wheels the speed difference isn’t going to matter, and even a few extra seconds for the big wheels is going to be noise in terms of the total install process, I think.

The tar vs zip discussion is largely moot here: in both cases we’d generate a container holding all the files with no compression applied, and then pass that entire file into the compressor, so the differences in how zip handles compression are not going to affect us.

>>> import zstandard
>>> for l in range(1, 22):
...     print('l=%d; memory=%d' % (l, 1 << zstandard.ZstdCompressionParameters.from_level(l).window_log))
...
l=1; memory=524288
l=2; memory=1048576
l=3; memory=2097152
l=4; memory=2097152
l=5; memory=2097152
l=6; memory=2097152
l=7; memory=2097152
l=8; memory=2097152
l=9; memory=2097152
l=10; memory=4194304
l=11; memory=4194304
l=12; memory=4194304
l=13; memory=4194304
l=14; memory=4194304
l=15; memory=4194304
l=16; memory=4194304
l=17; memory=8388608
l=18; memory=8388608
l=19; memory=8388608
l=20; memory=33554432
l=21; memory=67108864

Actually memory utilization on the decompressor will be a bit greater. But the window log/size is usually the biggest contributor.

You can also manually set the window log / size larger than what is exposed by the default compression levels:

>>> 1 << zstandard.WINDOWLOG_MAX
2147483648

I see applications in the wild use level 19 as the highest compression level when they want to optimize for size but not risk excessive memory utilization. If you can run a python process, you can probably allocate an 8 MB buffer :slight_smile: Level 19 is what Ubuntu’s dpkg-deb -Zzstd is using by default for compressing Debian packages.

The C library and zstandard package also allows you to explicitly control compression parameters. So if you wanted to use level 21’s default parameters but use an e.g. 16 MB window size, you could do that.
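For example (a sketch against the same zstandard API; a 16 MB window is window_log 24):

>>> params = zstandard.ZstdCompressionParameters.from_level(21, window_log=24)
>>> 1 << params.window_log
16777216
>>> cctx = zstandard.ZstdCompressor(compression_params=params)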


I didn’t put any thought into it beyond “what’s the highest” and figured we could tune from there, if 19 makes more sense that’s fine with me. I’ll rerun my numbers with that.

requests:

  • current: 57k
  • tar+xz: 44k
  • tar+zstd: 45k

pip:

  • current: 1.4M
  • tar+xz: 914k
  • tar+zstd: 960k

tensorflow:

  • current: 402M
  • tar+xz: 126M
  • tar+zstd: 141M

Here are the numbers again with zstd at 19 (xz is actually still at its default compression level of 6).

So to me the trade-offs seem to be:

lzma:

  • Smaller without blowing up the memory on decompression.
  • Slower
  • More memory on compression.
  • Already part of the Python standard library, so transition path is simpler.

zstd:

  • Much faster.
  • Somewhat larger files, but within the same ballpark.
  • Less memory on compression.

zstd can still use quite a bit of memory on the compressor side. I think it is in the same ballpark as lzma. Again, it is highly configurable via compression parameters.

I reran my tests with lzma set to 9 instead of 6: pip and requests were basically unaffected, and tensorflow dropped to 112M (from 126M with 6). So probably not worthwhile, but just as an FYI.

Could all compression methods be allowed? As long as the uncompressed blob is in the same format (e.g. tar or uncompressed zip; choose one) regardless of what is downloaded, the compression method could be whatever the client accepts (similar to an Accept header in HTTP).

Of course, this would mean the package index would need to generate the different compressed versions on package upload.
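Purely as a hypothetical illustration of that negotiation (the media types and header values here are invented, and no index implements this today), the client side might look roughly like:

import requests  # third-party HTTP client, assumed available

# Hypothetical: advertise which wheel compression formats this client understands.
response = requests.get(
    "https://files.example.invalid/packages/pip-19.0-py2.py3-none-any.whl",
    headers={"Accept": "application/x-wheel+zstd, application/x-wheel+xz;q=0.5"},
)
# The index would pick the best match and report it back, e.g. via Content-Type.
print(response.headers.get("Content-Type"))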

Python package exchange is much more than PyPI, and request headers are not a thing in a lot of cases. The only way I can think of to do this is to introduce an extra wheel tag (e.g. zstd.whl for a wheel compressed with zstd), but that would be very difficult to do in a backwards-compatible way. And as @dstufft mentioned, very few packages would be motivated to upload those new wheels, since the old ones are perfectly serviceable; tools in turn would be reluctant to implement support due to low adoption.

I do think this would be a good way to do it, but feel the existing ecosystem isn’t structured for it to work.

I know it’s potentially some work, but what if PyPI generated the better-compressed packages on demand and served those when it gets a zip-v2 header? The tool and the PyPI server could negotiate the compression protocol. (Not sure how easy this would be to achieve.)