PEP 777: How to Re-invent the Wheel

Just a small question on Wheel-Version: 2.0: Does it make sense to tie the Metadata spec to a package format name? You could imagine a future where what we use changes enough to warrant being called something other than “Wheel”, or possibly the adoption of some pre-existing format where it makes more sense to also use the pre-existing format name and not just call it Wheel-Version: $N.0.

The simplest change would be Package-Format: <Format-Name>-<Format-Version> (so, Wheel-2.0), but you could also avoid substring parsing and use Package-Format: Wheel; Package-Format-Version: 2.0 or similar.

If this is taking forward compatibility consideration too far, feel free to ignore without comment. :slight_smile: Brief consideration and out-of-hand rejection are satisfactory responses, AFAIC.


Finally, regarding pronunciation, whlx could be pronounced “whelk” as in the TeX tradition.

2 Likes

Well, it’s not all we’d gain. We’d also get much faster version resolution for situations where .metadata isn’t available. I think that’s worth something. That being said I’m very curious about

What downsides do you know of? FWIW the PKWARE ZIP specification says files with prefix bytes are still ZIP files:

ZIP files MAY be identified by the standard .ZIP file extension
although use of a file extension is not required. Use of the
extension .ZIPX is also recognized and MAY be used for ZIP files.
Other common file extensions using the ZIP format include .JAR, .WAR,
.DOCX, .XLSX, .PPTX, .ODT, .ODS, .ODP and others. Programs reading or
writing ZIP files SHOULD rely on internal record signatures described
in this document to identify files in this format.

My gut tells me there are probably tools that may behave poorly by not following the standard, but I’m not sure. When reading a ZIP you MUST rely on the central directory, which is located by scanning backward from the end of the file, so prepended data doesn’t break it. A standards-compliant ZIP reader should have no issue with a prefix byte.
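
For what it’s worth, Python’s own `zipfile` module already copes with prepended data (this is how self-extracting archives work). A quick sketch, with a single arbitrary byte standing in for a hypothetical version prefix:

```python
import io
import zipfile

# Build a normal ZIP in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")

# Prepend one arbitrary byte, standing in for a version prefix.
prefixed = io.BytesIO(b"\x02" + buf.getvalue())

# zipfile locates the end-of-central-directory record by scanning
# backward from the end of the file, so the prefix is simply skipped.
with zipfile.ZipFile(prefixed) as zf:
    content = zf.read("hello.txt")
print(content)  # b'hi'
```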

I think Wheel-Version would merely be excluded from metadata if a new package format were introduced - it only makes sense in wheels. It would be up to the new format to define metadata about its own version. I think it doesn’t make sense to do this because I want the parsing of Wheel-Version in METADATA to be the same as in WHEEL, so it’s easier for package tools to compare them. Also, it’s unclear to me what we’d do about sdists; they are un-versioned at the moment (AFAIK).

1 Like

The obvious one is that it would be easy to accidentally lose the prepended data (for example, if for some reason you repack the zipfile using unzip/edit/zip). That would leave you with an invalid wheel file.

It’s basically the same issue as with any custom format - you need custom tools to be sure you’re working safely with it. It’s just that prepended data will work safely with generic tools in readonly situations, so the risk is smaller. But it’s still there.

IMO, it’s not a big problem, but I’m not convinced the benefit is significant, either.

3 Likes

Yeah, creating them is the downside I was thinking of, not reading them. I find myself frequently recommending people repack their wheels so they can do things that build backends won’t, like code signing or patching certain files.

1 Like

I suggested this specifically because it can work with any future format; it just requires that we actually want that level of future compatibility and do a minimal amount of work. And I mean it when I say minimal: we don’t need to reinvent archival formats, just do minimal preprocessing/postprocessing of a binary file that contains one.

[8-bit version] [data based on version]

Actually, just prefixing a byte to the container for this upcoming version means simply reading offset 0 for the version; then, if it’s version 2, bytes 1 through the end are a ZIP file.

Want a different container in the future, but nothing else in the header?

For version 3 we could just say the remainder is a tar file.

Maybe we start allowing wheels to use a small set of non-standard-library compression schemes, and then

version 4 might be [4] [flags for required compression support as a single byte] [tar file]

By prefixing the version, we can communicate the version forevermore in a predictable location that works without having to try unpacking, and without dictating a specific implementation for the rest of the file.
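
A minimal sketch of the dispatch this enables. Only the version-2 (ZIP payload) branch is implemented here; the version numbers beyond 2 are hypothetical, as above:

```python
import io
import zipfile

def read_wheel_container(data: bytes):
    """Dispatch on the leading version byte described above."""
    version = data[0]
    if version == 2:
        # Version 2: bytes 1..end are a plain ZIP archive.
        return version, zipfile.ZipFile(io.BytesIO(data[1:]))
    # Versions 3 (tar payload) and 4 (flags byte + tar) are the
    # hypothetical future layouts sketched above; unimplemented here.
    raise ValueError(f"unsupported wheel container version {version}")

# Build a toy "version 2" container.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo/RECORD", "")
version, archive = read_wheel_container(b"\x02" + buf.getvalue())
print(version, archive.namelist())  # 2 ['demo/RECORD']
```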

This isn’t really a novel concept; it’s the same way people ensure binary wire protocols can be versioned, just applied to a file.

Would this remain the case if we said that tools that generate and consume wheels going forward must give users the ability to unpack and repack the inner archive to/from a wheel? This objection seems predicated on users currently needing to work around tooling, so why not have tools expose this by specification? It’s a capability these tools would already need anyway. And I’m not suggesting we get rid of .dist-info/, so there’s a canonical source of that version that isn’t lost on unpacking, allowing repacking with a simple tool --pack-to-wheel some.zip

You’re missing the point, which is that both @steve.dower and I have use cases for working with wheel files where we don’t want to use a dedicated wheel management tool. I use the standard command line zip/unzip tools on wheels a lot, and I often open them in 7-zip file manager (and I’d use Windows explorer native zip handling if it wasn’t reliant on the file extension :slightly_frowning_face:).

It may not be essential to be able to work with wheel files using standard zip handling tools, but it’s definitely convenient, and it’s not a feature I’d be willing to give up unless there was a compelling benefit - being able to read the wheel version from the first byte of the file rather than by looking at the embedded file metadata like we’ve been doing forever simply isn’t that compelling an argument for me.

Remember, the critical aspect of this PEP is that the wheel version is available from the index, to save downloading the wheel at all. If you’ve had to read the wheel, you’ve lost the major saving already (which might be the case for find-links, and/or indexes that don’t support PEP 658, but that’s a different question).

4 Likes

I didn’t think I was missing it; I was just asking whether, if the ability remained widely available in similar ways, that would be enough to clear the hurdle. I appreciate your answer anyway, because it’s clear this breaks your workflow and you don’t find the trade-off compelling.

I was hoping to end up in a situation with less reliance on the index and without being stuck with zipfiles. There are a lot of reasons why PEP 658 doesn’t work for everything, including even just pip’s --wheel-dir used with a common local network cache. There are plenty more; tools like auditwheel don’t have an index to look at. This isn’t strictly about the index, but also about the ability to iterate on the format in the future without being tied to “we use a zip file forever”, and to have it continue to work even when the index isn’t in play.

I’m concerned about the reliance on the index and PEP 658 support, as well. But I’m hoping we can find something that helps avoid the need to download the wheel, rather than something that needs installers to download and discard unsupported wheels.

I don’t know what such a solution would look like, though :slightly_frowning_face:

I honestly don’t see what’s so bad about zipfiles. If we allow for the possibility of a zipfile containing the metadata and an embedded archive with the data, that seems pretty flexible to me…

2 Likes

With the version as the first byte, when PEP 658 support isn’t available, HTTP range requests can always choose to stop the download after the very first chunk, minimizing the amount downloaded in many cases. The same goes for reading from a local disk: you can stop after the first byte, before attempting any decoding, if you see an incompatible version there.
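
A sketch of that early exit for the local-disk case (the `SUPPORTED_VERSIONS` set and the version byte used here are invented for illustration; over HTTP the equivalent would be a range request with a `Range: bytes=0-0` header):

```python
import os
import tempfile

SUPPORTED_VERSIONS = {2}  # hypothetical: this tool only knows version 2

def container_version(path: str) -> int:
    # Only the first byte is needed to decide compatibility.
    with open(path, "rb") as f:
        return f.read(1)[0]

# Demo with a throwaway file whose payload we never need to parse.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x05" + b"...rest of some future archive format...")
os.close(fd)
v = container_version(path)
print(v, v in SUPPORTED_VERSIONS)  # 5 False -> stop before decoding
os.unlink(path)
```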

They’re not the worst format in the world, but they have more overhead than many other formats, and with the number of wheels stored and downloaded all the time, I think it’s a place where we should be aware that certain formats are less efficient. We have a large enough N that small inefficiencies are massive in their total impact. I really don’t like the idea of trading a small amount of convenience in what I (personally, here) view as something most users should never need to do if tools are working for them, for a massive amount of bandwidth and disk storage across multiple locations (both CDN edge nodes and mirrors), in a way that makes the trade perpetual and unchangeable.

It’s why I was hoping something like

py -m wheel unpack [wheel path]
py -m wheel repack [archive path/directory]

would be sufficient to bridge the existing convenience.

1 Like

Reading just the needed parts of a zipfile via range requests isn’t that hard. I know, I’ve implemented it. But OK, it’s a fair point. I still think it’s not something we should be worrying about right now, though.

Efficiency isn’t the only factor, though. And in any case, switching to a non-zip container format is out of scope for PEP 777. And I think that’s the right call for now. Prepending metadata content is technically not a “non-zip container format”, but I think it’s something that should be part of a future PEP specifically about changing the container format, and not part of this PEP.

1 Like

But by not considering it now, we’re creating a situation where, when we do want to consider it in the future (something that reads as in-scope to me, process-wise), there isn’t a clean upgrade path, and we’ll need a more involved process again. If the version isn’t inside the interior archive, the upgrade path is just “version higher than you know? Okay, you don’t support that”; no new extension needed again.

1 Like

How significant is the zip format overhead relative to other sources of inefficiency in the packaging system? I tend to think that we wouldn’t need to worry much about the relative efficiency of different archive formats if we had improvements in other areas of the distribution system — most notably, hardlinking packages across environments (so libraries don’t take up duplicate space) and separating metadata from package content (so installers don’t need to download an actual archive until they know exactly which one they need). I don’t have any data on that, though, it’s just a hunch.

1 Like

Is it really that inefficient compared to other options? In its most basic form[1], a zip archive adds less than 200 bytes per file stored, at the end of the archive. It seems like that’s about the most space that a non-zip version could save by doing something more clever. At best you are compressing the .dist-info metadata a little bit better, which is a similar amount of data.


  1. no compression at all, just stick a file in another file ↩︎

200 bytes per file, even in a future case where there are only 3 files in the wheel (which, as I understand it, is the proposed minimum while keeping metadata in), is 600 bytes, plus losing the ability to represent the metadata more compactly (as it needs to remain as-is for the wheel major version to be read from it reliably going forward).

Let’s be pessimistic and say that last part is never improved on: 600 bytes.

Let’s be further pessimistic and say we can only eliminate half that overhead.

pypistats.org says there were 51,014,097,947 downloads from PyPI last month. Not all of them were wheels, but the vast majority were; let’s undersell it at 50%. I’m assuming their methodology is reasonable, but please provide a better source if there is one.

With those generous lower bounds, that’s still over 7 TB of traffic per month on just the low-balled eliminable overhead, and that number goes up as more people use Python and as wheel adoption goes up. And that’s counting only the traffic from the edge CDN to the downloader, and only the PyPI use case.
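
The back-of-the-envelope arithmetic, using the numbers above (the 50% wheel share and the 300-byte saving are the deliberately pessimistic assumptions already stated):

```python
downloads = 51_014_097_947  # PyPI downloads last month, per pypistats.org
wheel_share = 0.5           # deliberately low-balled wheel fraction
overhead_saved = 300        # bytes: half of the 600-byte estimate

total = downloads * wheel_share * overhead_saved
print(total / 1e12)  # ~7.65 TB per month
```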

Yes, it’s a meaningful amount in a format as widely used as this.

It’s also more efficient when considering the future discussions the community will need to have for foreseeable iteration on this format: there won’t be a reason for “we can’t change the format of metadata” or “we may need a new file extension” discussions; simply tick the major version for changes of that nature.

2 Likes

You have to compare to an alternative to really make this argument. All formats will need to record the filenames, sizes, etc of each file and will pay some overhead per-file. There are no free lunches!

zip files take at minimum 90 bytes per file of overhead (size on top of filename). tar files take several hundred bytes per file entry. A custom format could be used, but then we can’t use existing tools, which people have already said they really like being able to do.
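
This is easy to measure directly. For a single stored (uncompressed) file, the fixed ZIP records are a 30-byte local file header, a 46-byte central directory entry, and a 22-byte end-of-central-directory record, plus the filename stored twice:

```python
import io
import zipfile

payload = b"x" * 1000
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("a.txt", payload)  # stored as-is, no compression

# Archive size minus payload size = total framing overhead:
# just over 100 bytes here for a 5-character filename.
overhead = len(buf.getvalue()) - len(payload)
print(overhead)
```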

Unless there is an existing format in wide use that has way less overhead, I don’t think it makes sense to create a completely new format to save <0.1% of PyPI’s bandwidth.

I’d much rather have a format that is easier to adopt so that compression can be shipped sooner, and we can get huge wins on the order of 25-50%.

6 Likes

That’s something that should IMO be supported by tooling, not by a loophole in the file format (“actually, it’s just a zipfile”). I definitely do not see manual edits and inspection of the wheel content as a driving factor here (well, except to the degree that it’s a personal preference of the decision maker); the interests of the ecosystem would be far better served by something that minimizes total file size.

That said, the requirements of supporting format-changes, reading metadata without decompressing everything, package signing, etc. do make a good counter-argument to just put everything into one highly-compressed archive (with the format in some prefix bytes or whatever).

The conda ecosystem switched from .tar.bz2 archives to a custom .conda archive format in 2019[1]; it took a bit of digging, but the design ended up in a similar place: an outer shell with zip, which contains metadata and the core artefact (which can use different formats, or be signed etc.). Perhaps the design notes are interesting for the purpose of this discussion (not 100% up to date; the default inner compression is now zstd, with a configurable zstd_compression_level).


  1. well, the implementation landed. conda-forge switched over in 2022 ↩︎

4 Likes

The zip overhead seems large, yet acceptably small in the grand scheme if keeping it enables faster iteration on things that improve size elsewhere. But I don’t see how that’s intended to happen if the things needed to improve size are things like compressing the zip differently or using features that aren’t part of the specification now. How do I implement something that checks what wheel version something is without a reliable location for future wheel versions?

2 Likes

(prior post edited back, I accidentally was in edit rather than adding a new post)

Some of the proposed changes for the future involve giving metadata a different form:

and these were considered nearer-term. We can’t do that forward-compatibly with what is here, because there’s no way to differentiate between a new format my tool doesn’t understand and a malformed wheel.

1 Like

As PEP 777 specifies, .dist-info/METADATA would need to be kept at least as long as Python 3.13 is around (~5 years from now). It probably should be kept for longer. However, we can include the future JSON formatted metadata in the wheel right away alongside the existing metadata file, there are no backwards compatibility concerns with that.

As long as your version of CPython is supported, you can simply use importlib.metadata to query the wheel format version.
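
For a wheel file that isn’t installed yet, the same information can be read straight out of the archive’s `.dist-info/WHEEL` file. A hedged sketch (the toy wheel and its `pkg-1.0` name are invented for illustration; real wheels carry more metadata):

```python
import io
import zipfile
from email.parser import Parser

def wheel_format_version(wheel_bytes: bytes) -> str:
    """Read Wheel-Version from the .dist-info/WHEEL file of a wheel."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        name = next(n for n in zf.namelist()
                    if n.endswith(".dist-info/WHEEL"))
        # WHEEL uses the same RFC 822-style key/value format as METADATA.
        return Parser().parsestr(zf.read(name).decode())["Wheel-Version"]

# Toy wheel for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg-1.0.dist-info/WHEEL",
                "Wheel-Version: 1.0\nGenerator: example\n")
print(wheel_format_version(buf.getvalue()))  # 1.0
```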