Making the wheel format more flexible (for better compression/speed)

While dreaming about plans to standardize sdists, I have been thinking about how sdists are strongly tied to gzipped tarballs and wheels to zip files (the tie is obviously stronger for wheels than for sdists). I view this as unfortunate, as it means neither build artifact can take advantage of e.g. zstd for compression and speed benefits.

I noticed that conda has its own two-tiered solution: an uncompressed zip file containing two tarballs, consisting of:

  1. Metadata required to do stuff with the package
  2. Everything else

This aligns with .dist-info and .data, if I understand the wheel spec correctly.

Anyway, this topic isn’t about changing the format to exactly this, but rather: would we ever even consider changing the file format for ‘wheel’, what would that threshold be, and how would we go about doing it?


Related: Improving wheel compression by nesting data as a second .zip

We’re almost entirely limited by what is available in the Python standard library, which afaik pretty much limits us to gzip, deflate, bzip2, and lzma. Unfortunately Python does not make any of them mandatory (the related compression module will just be unavailable at runtime), so diversifying the compression algorithms we use means placing additional constraints on the Python environment.
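(For reference, a quick way to see which of those optional modules a given interpreter was actually built with is just to try importing them; a minimal sketch:)

```python
# Minimal sketch: probe which optional stdlib compression modules this
# interpreter was built with (any of them can be missing at runtime).
import importlib

def have(name):
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

available = {name: have(name) for name in ("zlib", "bz2", "lzma")}
print(available)  # e.g. {'zlib': True, 'bz2': True, 'lzma': False}
```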

The current situation is that, since we’ve standardized on deflate (via zip files) and gzipped tarballs, the only constraint we place on a Python environment is that it has to have been linked with zlib, which is an incredibly common dependency to have available, even in docker containers, etc.

The other constraint here is that currently we still support Python 2 (at least to some degree), which rules out LZMA since that wasn’t available in Python 2 (unless we decide that this hypothetical new wheel format should only be available for Python 3, or we decide to say that people have to install some LZMA backport to install from wheels on Python 2).

Beyond that, I see two real options for what the format would look like. Either we do something similar to conda, where we have a tarball within a zip file and the tarball is compressed with some specific compression scheme, or we just use the zip file support for a different compression algorithm, or a non-gzip tarball (xz or bzip2 or whatever).

Honestly, I think either one is perfectly fine, and I would probably suggest that the decision be driven largely by what the compressed sizes look like between the two options on real-world examples. There is an argument to be made that the metadata could/should continue to be just normal zip members (i.e. not in an inner tarball) to enable easier introspection without having to extract multiple levels of artifact formats.
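To make the first option a bit more concrete, something along these lines would do it (purely a sketch, not a proposal; the inner .data.tar.xz name is invented, and RECORD handling is hand-waved):

```python
# Sketch of the "tarball inside a zip" variant: .dist-info members stay as
# ordinary, introspectable zip entries, everything else goes into one inner
# tarball compressed with LZMA and STORE'd in the outer zip.
# The "{distribution}.data.tar.xz" name is invented for illustration.
import io
import tarfile
import zipfile

def repack(old_wheel, new_wheel, distribution="pkg-1.0"):
    with zipfile.ZipFile(old_wheel) as src, \
         zipfile.ZipFile(new_wheel, "w", zipfile.ZIP_DEFLATED) as dst:
        inner = io.BytesIO()
        with tarfile.open(fileobj=inner, mode="w:xz") as tar:
            for info in src.infolist():
                data = src.read(info)
                if ".dist-info/" in info.filename:
                    dst.writestr(info, data)  # metadata stays directly readable
                else:
                    member = tarfile.TarInfo(info.filename)
                    member.size = len(data)
                    tar.addfile(member, io.BytesIO(data))
        # the payload is already LZMA-compressed, so don't deflate it again
        dst.writestr(f"{distribution}.data.tar.xz", inner.getvalue(),
                     compress_type=zipfile.ZIP_STORED)
```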

+1 on basically what @dstufft said. One additional point is that there’s been some work done (I can’t find the link right now, GitHub has issues) on improving the speed of pip, which took the approach of doing partial downloads of wheel zipfiles just to get the metadata file. I don’t have details, but apparently it gave notable savings. So a new format that made it impossible to get metadata via partial downloads would block that option.

Of course, if we move to externally-held static metadata before any change in wheel format, this point becomes irrelevant :slightly_smiling_face:


… for now. :wink: My question was more about whether there was any historical knowledge about how we might change things in the future if the opportunity presented itself, rather than about making an actual change right now (I think our communal TODO list is long enough to keep us all busy for the foreseeable future :grin:).

Considering the amount of flak I still get about PEP 518 choosing TOML despite there not being a TOML parser in the stdlib, that doesn’t scare me. :wink:

If this were to ever happen, I think the timelines would be such that limiting ourselves to what Python 2 supports wouldn’t be worth it.

Huh, that is another approach. I guess the question is whether pushing that work onto PyPI makes sense, as it would have to do the CPU work and caching to make it effective (it probably does make sense from the community’s perspective; it’s just a question of who would have the bandwidth).

You mean for sdists, or wheels specifically? And if the latter, what isn’t static? (And if it’s the former, I had an idea this morning on how to make everyone but setuptools_scm users static :grin:, but I have to finish some research on other things before it’s a complete proposal that could be turned into a PEP.)

Yea, we’ve never really done changes like that to the Wheel format, so it would be breaking new ground in general.

This is less about flak and more about technical constraints. get-pip.py, ensurepip, virtualenv, etc. all bootstrap pip by adding pip to sys.path and installing pip with itself; a C library means that we can’t do that anymore, since we would need a build step beforehand (and decompression in pure Python sounds awful). I also think we can’t depend on C libraries at all on Windows (unless there’s some mechanism for doing so that we missed), because importing a C library locks the file for deletion, which, if pip imports it, makes it impossible to upgrade the version.

I meant for wheels; wheel metadata is static, yes. I was talking about accessing it without downloading the whole wheel file. The simplest way of doing that would be if PyPI (or, more precisely, an optional extension to PEP 503) exposed a way of downloading just the metadata, which is what I meant by “externally-held static metadata”, but it can be done via partial reads at the moment, I believe (I think that’s what the work I referenced does). An alternative format might not support the partial-read option.


I think that even with an extension to PEP 503 we’d still want to retain that property so pip can keep doing that for --find-links (although another possibility is that --find-links just stays slower, or we deprecate and remove --find-links and require an actual repository, like most package managers do).


A more practical reason why we’d want to keep the .dist-info directory as-is is that it lets pip fail with a meaningful error when asked to install a wheel made with a newer format version (the WHEEL file inside the .dist-info directory lets us version the wheel format). The other options would be a weird, hard-to-debug error for older versions of clients, OR making a .whl2 or something that pip wouldn’t see as a valid old-style wheel.
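(That check is roughly the one the wheel spec already describes: abort if the major part of Wheel-Version is newer than the installer supports, warn if only the minor part is. A sketch, with a hypothetical supported version:)

```python
# Sketch of the versioning check the wheel spec describes: read Wheel-Version
# from <name>.dist-info/WHEEL, abort on a newer major version, warn on a newer
# minor version. SUPPORTED is a hypothetical installer capability.
import zipfile

SUPPORTED = (1, 0)

def check_wheel_version(path):
    with zipfile.ZipFile(path) as whl:
        wheel_file = next(n for n in whl.namelist()
                          if n.endswith(".dist-info/WHEEL"))
        text = whl.read(wheel_file).decode("utf-8")
    for line in text.splitlines():
        if line.startswith("Wheel-Version:"):
            version = tuple(int(p) for p in line.split(":", 1)[1].strip().split("."))
            break
    else:
        raise ValueError("no Wheel-Version in WHEEL file")
    if version[0] > SUPPORTED[0]:
        raise RuntimeError(f"wheel format {version} is too new for this installer")
    if version > SUPPORTED:
        print(f"warning: wheel format {version} is newer than supported {SUPPORTED}")
```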

We also thought about implementing decompression in pure Python. It would be a shame if it wasn’t possible to cross-compile a C compression library to Python. It would be slower, but it would get you through the day, or until you could download an extension.

We suggested putting the metadata at the end of the zip archive; bdist_wheel probably still does that. I had privately built a thing that used partial HTTP requests to download parts of a zip on demand (for example, you start with the last 16k, which probably gets you the zip manifest, and then you can download a single file from inside the zip with more HTTP partial requests). It seems like a weird thing to rely on, however. Would it make a CDN sad?
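The trick looks roughly like this (a sketch that assumes the server honours Range requests and that both the zip central directory and the metadata member happen to land in the tail you fetched; a robust version needs fallbacks for when they don’t):

```python
# Sketch: fetch only the tail of a remote wheel and read *.dist-info/METADATA
# out of it. Works only when the central directory and the metadata member
# both fit inside the fetched tail (bdist_wheel writes .dist-info last).
import io
import urllib.request
import zipfile

def remote_metadata(url, tail=16 * 1024):
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{tail}"})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        name = next(n for n in z.namelist()
                    if n.endswith(".dist-info/METADATA"))
        return z.read(name).decode("utf-8")
```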

I like the idea of a nested .zip.zstd that holds the .data directory. The wheel’s root could be empty or not.

I don’t understand why adding zstd compression (or brotli compression) couldn’t be optional. If the Python environment has the library, we can check whether a compressed variant is available on the server using a HEAD request with the Accept header set to the right MIME types, and if it is, we download it.
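Something like this, purely hypothetically (PyPI doesn’t do this kind of negotiation today, and the MIME type here is invented):

```python
# Entirely hypothetical sketch of the suggested negotiation: ask the server
# via HEAD whether a zstd variant of the file exists before requesting it.
# The MIME type is invented; no index does this today.
import urllib.error
import urllib.request

ZSTD_WHEEL = "application/x-zstd-compressed-wheel"  # invented MIME type

def server_has_zstd_variant(url):
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"Accept": ZSTD_WHEEL})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("Content-Type") == ZSTD_WHEEL
    except urllib.error.HTTPError:
        return False
```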

pip can have optional dependencies as long as they are properly documented.

There are probably workarounds if you really want it.

  • Necessary build step for a C extension: this can be avoided by using ctypes (compression libraries generally have a stable API/ABI, so this should be less fragile than it sounds).
  • Importing a C library locks the file for deletion: you can probably decompress in worker processes using a ProcessPoolExecutor (see the sketch below), with the added advantage of parallelizing decompression when multiple packages are installed. Shut down the process pool when you are done.

Both are non-trivial workarounds with potential pitfalls of their own, but they seem doable.
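For the second one, the shape would be roughly this (a sketch; extract_wheel stands in for whatever decompression work pip would actually do):

```python
# Sketch: do the decompression in worker processes so the main process never
# imports a file-locking extension module itself, and get parallelism across
# packages for free. extract_wheel is a stand-in for the real work.
from concurrent.futures import ProcessPoolExecutor

def extract_wheel(wheel_path, dest):
    import zipfile  # any compression extension gets imported in the worker
    with zipfile.ZipFile(wheel_path) as z:
        z.extractall(dest)
    return wheel_path

def extract_all(jobs):  # jobs: iterable of (wheel_path, dest) pairs
    with ProcessPoolExecutor() as pool:  # pool is shut down on exit
        futures = [pool.submit(extract_wheel, path, dest) for path, dest in jobs]
        return [f.result() for f in futures]
```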

or just shell out

If I publish a wheel that uses zstd compression, then anyone who installs it needs the capability to decompress zstd-compressed files. They don’t really have an option to avoid that dependency other than just not using that wheel (so falling back to the sdist? using a different project?) or something along those lines. It’s not like HTTP compression or something where we can do real negotiation to determine capabilities; a wheel is a static file. The only other option would be to duplicate the wheel for each compression scheme, but I suspect very few authors are going to bother doing that.

I would do it as a lossless transform. Then you could go backwards from the more compressed version and then install with a wheel v1 tool. But it would be a stretch to expect the publisher to recompress.

The wheel manifest is designed to be signable independently of the bytes produced by the compression algorithm. The details of making that happen with a “wheel with greater compression” variant might be tricky.

A lossless transform… by whom? I don’t even understand how this would even hypothetically work without breaking practically every security-minded feature we have and are planning to add, other than TLS and the wheel-specific signing thing that practically nobody uses.

I feel like you dislike all ideas especially mine. So I’ve mostly stayed away from open source work since.

It would be possible to define a transform between a “more compressed” v2 wheel and a “less compressed” v1 wheel. If you were very careful, it would preserve the hash. The most likely situation would be that the publisher publishes the v2 wheel, some tool wants a v1 wheel, and you run the decompressor on the client. Or you would retrofit a wheel builder with the new “compress more” tool before publishing.

One simple transform would be to replace everything except the .dist-info directory with an embedded {packagename}-{version}.data.zip.{xz}, either compressed with zip compression or compressed with an arbitrary algorithm and STORE’d in the parent zip. Unpack the embedded archive, rewrite the MANIFEST, repack, and you have yourself a backwards-compatible v1 wheel again.
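The “back to v1” direction is the easy half. Roughly (a sketch using an inner tar.xz for illustration, rather than the zip.xz described above, since the stdlib can stream that directly; regenerating RECORD is left out):

```python
# Sketch: turn a hypothetical "v2" layout (plain .dist-info members plus an
# inner STORE'd .data.tar.xz) back into an ordinary v1 wheel. The inner
# archive name is invented; regenerating RECORD is left as an exercise.
import tarfile
import zipfile

def to_v1(v2_wheel, v1_wheel):
    with zipfile.ZipFile(v2_wheel) as src, \
         zipfile.ZipFile(v1_wheel, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name.endswith(".data.tar.xz"):
                with src.open(name) as raw, \
                     tarfile.open(fileobj=raw, mode="r:xz") as tar:
                    for member in tar:
                        if member.isfile():
                            dst.writestr(member.name,
                                         tar.extractfile(member).read())
            else:
                dst.writestr(name, src.read(name))
        # a real tool would regenerate RECORD hashes/sizes here
```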

It’s true that no one cares about the hash of MANIFEST. They may use the hash of the entire .zip file. You could probably even generate a bit-identical zip file with some improved compression / decompression if you were really, really careful. But if recompression became popular then demand for a compression-independent hash could appear.

I dislike poorly thought out ideas that aren’t actually functional in reality.

This would require at least a collision attack on the hash function, if not a preimage attack. Effectively you’d have to switch to a broken hash… which would be a very bad idea. Unless you have some other scheme in mind for keeping the hashes lined up? But afaik:

Hash(open(Compressed_with_gzip).read()) == Hash(open(Compressed_with_zstd).read())

is just not possible with a good hash function.

How do you envision this working? If pip can decompress the wheel in order to recompress it, wouldn’t it just install it? Or do you expect people to download the wheel, run some other command to transform it, and then pip install that wheel manually?

It would be a chore to get the same compressed output round tripping between an extra-compressed and normal wheel, but it should be easy to convert a hypothetical extra-compressed wheel back and forth to a normal one. You could use a hash of the hashes of all the decompressed files in a fixed order, like Java’s JAR, if it was important to have a hash that was not dependent on the compression algorithm.
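A compression-independent digest along those lines could be as simple as hashing the decompressed members in a fixed order (a sketch, not any existing standard; a v2 tool would have to descend into any inner archive so both forms hash identically):

```python
# Sketch of a compression-independent digest: hash each member's decompressed
# bytes in sorted-name order, then hash the concatenation of those hashes.
# Illustration only, not an existing standard.
import hashlib
import zipfile

def content_digest(wheel_path):
    outer = hashlib.sha256()
    with zipfile.ZipFile(wheel_path) as z:
        for name in sorted(z.namelist()):
            inner = hashlib.sha256(z.read(name)).digest()
            outer.update(name.encode("utf-8") + b"\0" + inner)
    return outer.hexdigest()
```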

pip might only want such a compression tool to bootstrap the "Python does not have " problem until an extension could be installed. Or a user might prefer to cache less-compressed wheels on disk for faster re-installs.

Just for kicks I wrote a small script that can convert a wheel between a number of different “wheel formats”. I currently have the following:

  • current: The existing wheel format
  • current+bzip2: The existing wheel format, using ZIP_BZIP2 instead of ZIP_DEFLATED
  • current+lzma: The existing wheel format, using ZIP_LZMA instead of ZIP_DEFLATED
  • tar+gz: A new format that keeps using the existing format for .dist-info directories, but puts all the “data” into a tarball and compresses with gzip
  • tar+bz2: The above new format, but using bz2
  • tar+lzma: The above new format, but using lzma
  • tar+zstd: The above new format, but using zstd
  • tar+brotli: The above new format, but using brotli

These are the results for pip:

  • output/current: 1.4M
  • output/current+bzip2: 1.3M
  • output/current+lzma: 1.3M
  • output/tar+brotli: 22K
  • output/tar+bz2: 21K
  • output/tar+gz: 23K
  • output/tar+xz: 22K
  • output/tar+zstd: 22K

These are the results for tensorflow:

  • output/current: 402M
  • output/current+bzip2: 314M
  • output/current+lzma: 136M
  • output/tar+brotli: 222K
  • output/tar+bz2: 206K
  • output/tar+gz: 234K
  • output/tar+xz: 218K
  • output/tar+zstd: 218K

I’m not 100% sure that those results are correct; I’ve done some basic spot checking and they look correct, but I could be wrong. The script I’m using to make these is https://gist.github.com/dstufft/fc3e60b89e87d20518ff944c85ac3d9e, which is a really dumb brute-force way of doing it, but it appears to be working fine.

That being said, assuming these results are correct, they suggest that the discussion about compression algorithms is largely needless, and that any gains there are by far overshadowed by simply taking the “stuff to be installed”, sticking it in a tarfile, and compressing that.

An interesting observation here is that for cases like pip (which are likely to be much more common than ones like tensorflow), a significant portion of the leftover file size is down to the RECORD file. Unfortunately, trying to compress that with something other than ZIP_DEFLATED (I just left the .dist-info members at that) doesn’t seem to meaningfully change the outcome. It’s possible that putting the RECORD file into the data payload, and/or a more fundamental change (tarballs instead of zip files for the entire thing?), would help, but that would make it more difficult to introspect.
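(If anyone wants to check that on their own wheels, the per-member compressed sizes are easy to pull out of the zip directory; a quick sketch:)

```python
# Quick sketch: print each member's share of a wheel's compressed payload,
# to see how much of the total the RECORD file accounts for.
import zipfile

def member_sizes(wheel_path, top=10):
    with zipfile.ZipFile(wheel_path) as z:
        infos = sorted(z.infolist(), key=lambda i: i.compress_size, reverse=True)
        total = sum(i.compress_size for i in infos) or 1
        for i in infos[:top]:
            print(f"{i.compress_size:>10}  {i.compress_size / total:>6.1%}  {i.filename}")
```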