Making the wheel format more flexible (for better compression/speed)

The use of an extension to indicate the compression method was suggested by @dstufft (e.g. “spam-1-py3-none-any.whl.gz”, “spam-1-py3-none-any.whl.bz2”), and in addition most compressed file types I know of have a magic string.

If the package index could (optionally, with the current format as the default) generate the different compressed file types for an uploaded package, then there would be no difference to the current upload process for distributors.

I don’t think it’s the ecosystem, but rather the standards. The internet has matured beyond these kinds of issues in other areas.

The current PyPI already has too much load. My suggestion is to generate the other compressions on upload, and store them on disk.

The point here is that a significant number of package consumers get packages from places other than PyPI: devpi mirrors, other ad hoc mirrors, local directories used with --find-links.

Any solution that requires implementation at the PyPI end needs at a minimum to be standardised as part of the simple repository format (which is designed so that it can be implemented via static web pages) or must fall back gracefully to something workable when the new standard isn’t available (possibly just the existing wheel format).

The problem isn’t so much the standards, as backward compatibility. To break the logjam, we need to explicitly desupport certain things, and it’s hard to find improvements that are sufficiently compelling to justify that. So progress is slower than many of us would like.

@dstufft I think you will find that zstd’s memory usage is the smaller of the window or the size of the (decompressed) data.

Replace everything but the .dist-info directory with a .data.zip with the same name as the existing .data directory plus the .zip extension. The inner .zip file includes members at the root of the wheel.

(With improved compression it would be equally valid to put every file in the .data/purelib and .data/platlib folders and not use the root, but that would not be the case when converting existing wheels).

Only .zip or .zip.zst, not .bz2 etc.

If it is a .zip, STORE the files in the inner .zip and compress the inner .zip with standard zipfile (DEFLATE) compression in the outer zip. Otherwise, STORE the files in the inner zip, compress the inner zip with zstd, and STORE the result in the outer zip.
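A rough sketch of what a converter along those lines might look like with the stdlib zipfile module; the `repack` helper, the `example-1.0.data.zip` name and the use of the zstd command line are all illustrative, and a real converter would also regenerate RECORD as described below:

```python
import io
import subprocess
import zipfile

def repack(src_whl, dst_whl, data_name="example-1.0.data.zip", use_zstd=False):
    """Repack a regular wheel into the nested layout described above."""
    with zipfile.ZipFile(src_whl) as src:
        # Build the inner zip from every non-.dist-info member, STORED.
        inner_buf = io.BytesIO()
        with zipfile.ZipFile(inner_buf, "w", zipfile.ZIP_STORED) as inner:
            for name in src.namelist():
                if ".dist-info/" not in name:
                    inner.writestr(name, src.read(name))
        with zipfile.ZipFile(dst_whl, "w") as dst:
            # .dist-info stays as ordinary, individually compressed members.
            for name in src.namelist():
                if ".dist-info/" in name:
                    dst.writestr(name, src.read(name), zipfile.ZIP_DEFLATED)
            if use_zstd:
                # Compress the whole inner zip with the zstd CLI, then STORE it.
                data = subprocess.run(["zstd", "-3", "-c"],
                                      input=inner_buf.getvalue(),
                                      capture_output=True, check=True).stdout
                dst.writestr(data_name + ".zst", data, zipfile.ZIP_STORED)
            else:
                # Plain nested zip: DEFLATE the STORED inner zip in the outer zip.
                dst.writestr(data_name, inner_buf.getvalue(), zipfile.ZIP_DEFLATED)
```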

To preserve the hash of RECORD (the wheel’s file manifest) when converting back from an extra-compressed wheel to a regular wheel:

When converting from a regular wheel to an extra-compressed wheel, follow the file order in RECORD so that the same order is expressed in the inner zip’s index. Regenerate RECORD with the inner zip and .dist-info entries in that order.

When converting from an extra-compressed wheel back to a regular wheel, copy the inner zip’s index into RECORD in order, again leaving .dist-info last.
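A minimal sketch of the first direction, assuming RECORD keeps its usual CSV rows of path, hash and size (the helper name is made up):

```python
import csv
import io

def record_order(record_bytes):
    """Yield paths in the order they appear in RECORD, so the packer can
    add inner-zip members in the same order and reproduce RECORD exactly."""
    for row in csv.reader(io.StringIO(record_bytes.decode("utf-8"))):
        if row:  # skip blank lines
            yield row[0]
```

The converter would walk record_order(...) while writing the inner zip, then append the .dist-info entries last.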

To resolve ambiguity:

Allow empty directories as empty files with names ending in /, per the zip standard, plus an optional mention in RECORD with the hash of an empty file.

Allow zip standard symlinks.

Install files based on whether they start with a prefix package-1.0.data/{purelib,platlib,scripts,…}/; anything matching none of these is deemed to be at the root. Allow an unrecognised subdirectory such as .data/unknownlib/ to be installed as a directory under PURELIB or PLATLIB (depending on where the root of the wheel is installed), with a warning, so that additional install paths can be added in the future.
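A hedged sketch of that dispatch rule; the scheme paths, the data directory name and the helper are all placeholders:

```python
import os
import warnings

SCHEME = {  # placeholder install locations
    "purelib": "/env/lib/site-packages",
    "platlib": "/env/lib/site-packages",
    "scripts": "/env/bin",
    "headers": "/env/include",
    "data": "/env",
}

def destination(member, data_dir="package-1.0.data", root="purelib"):
    """Map a path inside the wheel to an install location."""
    prefix = data_dir + "/"
    if member.startswith(prefix):
        key, _, rest = member[len(prefix):].partition("/")
        if key in SCHEME:
            return os.path.join(SCHEME[key], rest)
        # Unknown .data/ subdirectory: install under the wheel's root, with a warning.
        warnings.warn("unknown data directory %r; installing under %s" % (key, root))
        return os.path.join(SCHEME[root], key, rest)
    # No prefix: the member lives at the root of the wheel.
    return os.path.join(SCHEME[root], member)
```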

There would be a reference converter. Until installers have had time to write optimized logic for this update, they could trap extra-compressed wheels, run them through the converter, and install with their existing wheel code. Alternatively, an iterator producing (‘path inside wheel’, file-like) pairs could transparently extract any inner zip.
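A sketch of such an iterator; the nested member names are illustrative, and it shells out to the zstd command line rather than a binding:

```python
import io
import subprocess
import zipfile

def wheel_members(path):
    """Yield ('path inside wheel', file-like) pairs, transparently
    expanding any nested .data.zip or .data.zip.zst member."""
    with zipfile.ZipFile(path) as outer:
        for name in outer.namelist():
            if name.endswith(".data.zip") or name.endswith(".data.zip.zst"):
                data = outer.read(name)
                if name.endswith(".zst"):
                    # Decompress with the zstd CLI; a binding could do the same.
                    data = subprocess.run(["zstd", "-d", "-c"], input=data,
                                          capture_output=True, check=True).stdout
                inner = zipfile.ZipFile(io.BytesIO(data))
                for inner_name in inner.namelist():
                    yield inner_name, inner.open(inner_name)
            else:
                yield name, outer.open(name)
```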

Recompressor and pip install for nested-zip wheels. There’s some room to optimize pip by not extracting everything to a temporary directory first, but the current approach makes this kind of enhancement easy.

I tested a collection of 330 popular wheels. They compressed to 398M, down from 527M.

To use this you should have zstd on the command line; I had no success with the bindings’ stream wrappers.

Interestingly, recent versions of Python’s zipfile have streaming write built in (writing to a non-seekable stream). Funnily enough, you could do a reasonable streaming read implementation if the zipfile was written to a seekable stream, but Python’s zipfile module isn’t designed for that kind of thing at all.
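For instance, this writes a deflated zip through an object that can only write(), which is roughly what a pipe or socket gives you (a minimal illustration):

```python
import zipfile

class WriteOnly:
    """File-like object that only supports write()/flush(), like a pipe."""
    def __init__(self, raw):
        self.raw = raw
    def write(self, data):
        return self.raw.write(data)
    def flush(self):
        self.raw.flush()

with open("streamed.zip", "wb") as f:
    # zipfile falls back to data descriptors when it cannot seek or tell.
    with zipfile.ZipFile(WriteOnly(f), "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("hello.txt", b"hello, streaming world\n")
```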

(Standard pip will install the nested .zip.zst as a file instead of unpacking it. Don’t upload these.)

Streaming decompression-only wrapper (rough draft), compiles to a ~50kb wheel. https://github.com/dholth/zstdpy/tree/master/dezstd

Just to make sure everybody knows, I’ve relatively recently added a feature to wheel that allows disabling compression entirely, but at the same time this change makes it straightforward to add support for compression algorithms more advanced than deflate, if the community decides to go that way.

Also, given that setuptools dropped Python 2 support some time ago, I don’t feel a strong need to keep wheel py2-compatible either – keeping it just hasn’t been inconvenient enough to justify dropping it.


Relevant wheel issue: https://github.com/pypa/wheel/issues/247

Hi @agronholm. How would you feel about a setuptools ‘post build’ entry point that could be used by a recompressor, or an API that bdist_wheel could use to swap out the WheelFile implementation?


To what end?

I would use it to embed a publisher signature in the wheel (including the metadata, which is why it has to be post build).

What’s stopping you from embedding that signature now?

We’re talking about improving the wheelfile aka archiver layer. If bdist_wheel took that class as a parameter, and agreed on its interface, everyone could get better archives. On the other hand bdist_wheel is already a plugin, and “knobs on knobs” or “plugins with plugins” can get a bit awkward.

Better archives how?

I was very surprised, but it turns out that a nested .zip.zst (with zstd -3) is both smaller and faster than compressing all the files individually with DEFLATE. So that kind of improvement. I don’t have any significant non-storage-related changes in mind.

If we do improve the wheel archive format it will probably be a little harder to generate than the current format. That may mean non-legacy build systems would actually use a documented WheelFile implementation instead of always rolling their own.

So are we talking about making a backwards incompatible modification to the wheel file format just to make the resulting wheels smaller?

Yes. We would have to add about thirteen lines of code to pip to support this, but we can keep the data model exactly the same, so that at every other layer it doesn’t matter whether you are installing from a “new” or an “old” wheel.

Wouldn’t installing such wheels require adding yet another dependency since zstd is not (to my knowledge) supported in the stdlib?

What I’m trying to say here is that maybe using lzma is a better option since it doesn’t require changes to the wheel format itself? Not to mention superior compression ratios, plus it’s already supported by the zipfile module.

EDIT: LZMA support would need just one line of code to be added to the current wheel codebase.
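For illustration, that presumably boils down to passing a different compression constant to the ZipFile writer; an unmodified stdlib zipfile can already read the result (the file and member names below are made up):

```python
import zipfile

# Write a wheel-like archive with LZMA-compressed members...
with zipfile.ZipFile("demo-1.0-py3-none-any.whl", "w",
                     compression=zipfile.ZIP_LZMA) as zf:
    zf.writestr("demo/__init__.py", b"__version__ = '1.0'\n")
    zf.writestr("demo-1.0.dist-info/WHEEL", b"Wheel-Version: 1.0\n")

# ...and read it back with the same, unmodified stdlib module.
with zipfile.ZipFile("demo-1.0-py3-none-any.whl") as zf:
    print(zf.read("demo/__init__.py"))
```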

I’ve been experimenting at https://github.com/dholth/wgc

It should be easy to play with, except that the decompressor uses an unreleased zstd binding. I’ve been testing a set of 330 wheels totalling 525MB, based on running ‘pip wheel’ to download the most popular wheels, including the big, popular outlier tensorflow.

The compressor needs the zstd binary. The decompressor is not that interesting: 0m9.919s to unpack the nested zips vs 0m11.104s for the standard wheels.

The converter is more interesting. It takes a directory full of wheels and converts them both to the nested .zip.zst style, and also “rewrites” them by doing the same conversion while skipping the nesting, just recompressing all members with DEFLATE. My test set takes 1m24s to rewrite as standard wheels, but 51.3s to pack as .zip.zst style wheels. For that time saving you get 448M of output instead of the slower 543M worth of standard wheels (they grew slightly). If you wanted to spend more time for smaller output you could increase the zstd compression level.

I went ahead and tried the same thing with nested zips compressed with zip_info.compress_type = zipfile.ZIP_LZMA; it took 17m48s, but the compression was very effective at 355MB.

It’s pretty simple why Zstandard is so desirable: it provides good or very good compression at high speed. I didn’t know that before trying it out; I thought it would surely be slower than DEFLATE, but that is not the case.

So I think we could require the ~100k decompressor by making it available as an old-style wheel, and then compress wheels with Zstandard as an option when the compressor is available. That way the new algorithm could be introduced slowly, as users with bandwidth concerns go to the extra trouble to post-process before upload.

That doesn’t answer any of my concerns so far. Users with bandwidth concerns would surely prefer lzma over zstd due to the smaller compressed size, wouldn’t they? Not to mention that lzma support does not require any changes to pip – I just tested this with an lzma-enabled wheel and an unmodified pip.

The answer to the “is it worth it” question is to measure the difference in transfer time plus the difference in compression or decompression time. (The Zstandard docs talk about this tradeoff a lot; I haven’t timed much LZMA decompression.) In the LZMA example we spend 17 minutes compressing and avoid 170MB of transfer compared to the standard wheels, if we only count the transfer spent uploading them once.
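As a rough illustration (the 10 MB/s effective bandwidth figure is an assumption): 170MB saved is about 17 seconds per transfer, while the extra compression time over the plain DEFLATE rewrite is roughly 16 minutes, so it only pays for itself once the wheels have been transferred on the order of sixty times.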

The LZMA test included a nested .data.zip compressed all at once, not individually compressed archive members, so it would require a change in pip to get that level of compression.

I tried converting the same wheels with zstd -19, without using multi-threaded compression. It took about 11 minutes to yield 367MB of wheels, compared to the 355MB of LZMA wheels.

I tried uncompressing just tensorflow from the LZMA-compressed nested zip using wheel unpack; it took 9.5 seconds. Compare that with 12 seconds to decompress the entire set of zstd -19 compressed wheels.

That is why Zstandard is so special. It can be either faster and better than DEFLATE, or faster and almost as good as LZMA.