PEP 777: How to Re-invent the Wheel

then there’s no point in ever adding another form in the wheel. That’s just duplicating information that has to be there, going in the wrong direction.

Thank you for these design notes! I agree that the .deb design is really nice, and it is what I have in mind for some future enhanced wheel compression PEP.


I don’t get the impression that this is designed to be future-compatible. Tools that design for this are not going to be able to gracefully handle what comes after this, and I think that’s an objectively poor outcome when you already know you intend on changing details.


To be clear, the timeline could, as an example, look like this:

2024

  • wheel 2.0 PEP accepted, wheels now include .dist-info/metadata.json
  • Python 3.14 is updated to read the new format

2029

  • wheel 3.0 introduced, removing .dist-info/METADATA
  • Python 3.13 becomes end of life. 3.0 wheels may start being published without .dist-info/METADATA, only including .dist-info/metadata.json

importlib.metadata will continue to be able to read distribution data in both formats. The new format is adopted, while the old can be removed.


So that’s 5 years of duplicating all of the data to avoid a stable version in the header, and the payoff is that we have to support both for 5 years and we only gain JSON? Is the goal here to improve things or to be maximally compatible with only today’s widely used existing technology? JSON is a terrible format for a binary distribution to use internally.


The things I don’t think should be goals:

  • opening it with 7zip (unless someone writes a 7zip extension)
  • being backward compatible at the format level (that’s what the version number is for)

The things that I think should be goals:

  • improving the user experience
  • decreasing the resource use
  • ensuring that there is a meaningful way to raise useful errors when a tool encounters a wheel it cannot understand.
  • ensuring we can continue to improve the format faster than once every 5 years.

This set of goals, in terms of the direction of compatibility, matches SQLite’s forward compatibility, and nobody complains that upgrading SQLite breaks their database, because that doesn’t actually happen.

The direction of compatibility here should be backward compatible on the tooling side, and forward compatible on the format side. It’s perfectly fine to tell people they need to upgrade to use the new format.

There are two reasonable ways to do that.

  1. Keep metadata where it is forever
  2. Break it once, and place only the version somewhere unchanging and minimal, allow the rest to evolve.

I was thinking 2 was the better option here, but the small (though not negligible) overhead of the status quo is preferable to duplicating all of the metadata in multiple formats in the same archive.


I’m another -1 on custom byte headers. I don’t want to have to install custom tools when manipulating wheels. Being able to easily work with archives using standard tooling is a strength and good user experience.

Feel free to dismiss this idea, but have we considered being explicit in the file extension?

E.g.
*.whlx.zip
*.whlx.tar
*.whlx.zst
*.whlx.7z


Worth adding: I don’t think tools can ever drop support for a wheel format and still be acting in the interest of their users. abi3 wheels exist for a reason, and tools dropping support for old wheel formats would break their reason for existing. There’s no reason an abi3 wheel should stop working before the stable ABI is broken, especially for software that is stable and finished.

While I’m aware that can be seen as a point toward keeping zip and metadata the same forever, I only view it as a point in favor of ensuring that we mentally separate the idea of backward compatibility of tools from designing a format that we can continue to support going forward. All that requires is that tools consuming a wheel can know that:

  1. This is a wheel.
  2. This is a version they support.
  3. When they have a version they support, they can validate it to that version.

I think if we are committing to keeping zip (and as others have pointed out, the overhead of it can theoretically buy faster adoption of other features), we should commit to keeping metadata the same and the .whl extension, and hold off on any re-extension or restructuring of metadata for a single time break that actually allows a better forward compatibility of the format, because the version remains always accessible.


It’s not terrible, it’s very widely supported (i.e. you can easily write code to parse that JSON metadata in almost any existing programming language), it will probably compress quite well, and it will still be around in 2030.

There is no obviously better alternative IMHO. If you’re thinking “well, I’d use a binary format such as Protocol buffers or Flatbuffers”, then you’re vastly underestimating the amount of scaffolding work required to use them.


I’d be more than happy to write a reference implementation using only what’s available in the standard library for a binary format that’s actually efficient, but I think it’s a moot point. If we have to keep the existing metadata, having it in another format too isn’t helpful.

While my preference would be an optimized binary format, JSON is also worse than CSV, TSV, or any other basic single-delimiter format for non-arbitrary data with a fixed number of fields, even in cases where variable-length array nesting needs expressing.

Just to try to bring the discussion back on track, metadata format is not part of PEP 777. It’s a change that PEP 777 hopes to enable, but there’s no proposal on the table yet to change the format of wheel metadata. Equally, there’s no proposal on the table to change the wheel format itself. So I suggest we defer these discussions until there is a proposal to debate.

What is relevant to PEP 777 is how we detect the format of a new-style wheel file. We can start by looking at the file extension (.wheel, .whlx, whatever shade the bikeshed finally gets coloured). That enables distinguishing current format wheels from newer format wheels.

Beyond that, PEP 777 is trying to define a mechanism so that consumers can detect the version of a new-style wheel without extracting the metadata from that file (which is, in the general case, costly - particularly given that the goal is to reject unknown formats).

For wheels served from an index, the current proposal in the PEP covers that (and it could be made even more efficient by adding Wheel-Version to the simple API, rather than relying on PEP 658 metadata). But for wheels obtained from any other source, the only approach is to read the wheel metadata.

For local files, reading the metadata isn’t prohibitively costly, so I think we can accept that case. The problem is with wheels accessed via remote URLs (direct URL requirements, approaches like pip’s --find-links option for treating a remote directory as a source of distributions, or even remote directories mounted as a filesystem).

It’s possible to take the view that remote wheels aren’t an important enough case to worry about, and say that consumers will need to download and read the wheel metadata. That’s a valid view, although I’m not convinced it will be sufficient - installer performance is important to users.

We’ve already had the discussion about including the wheel version in the file extension, and rejected that option. But is there any reason why PEP 777 couldn’t change the wheel filename format, to require a wheel version to be included? That would be immediately accessible with all forms of access, it wouldn’t have the same limitations that including the version in the extension would have, and it’s where all current information that lets installers discard incompatible wheels is held, so it’s known to work as we want.

This approach would not only work well with existing codebases, it would also remove the need to alter the core metadata spec to add data that’s related to the distribution artifact rather than to the project, which IMO is a good thing[1].

What am I missing here?


  1. And I say that as the person who added the Dynamic metadata item, which is similarly awkward in that respect ↩︎


PEP 777 essentially breaks the existing method tools have to determine a wheel version. The current rule is: wheels are zip files; check the major version from the metadata; if the major version is higher than the one you know, error.
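For readers who have not looked at that rule in code, here is a minimal sketch of it using only the standard library. It assumes the version lives in the `Wheel-Version` field of the `.dist-info/WHEEL` file, as it does for current wheels. `SUPPORTED_MAJOR`, `read_wheel_version`, `check_installable`, and `make_demo_wheel` are names made up for this illustration, not any installer's real API.

```python
import io
import zipfile
from email.parser import Parser

SUPPORTED_MAJOR = 1  # assumption: this hypothetical tool only understands wheel 1.x


def read_wheel_version(wheel_file):
    """Return (major, minor) parsed from the archive's .dist-info/WHEEL file."""
    with zipfile.ZipFile(wheel_file) as zf:
        # The WHEEL file sits in the single top-level *.dist-info directory.
        wheel_paths = [
            name for name in zf.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/WHEEL")
        ]
        if len(wheel_paths) != 1:
            raise ValueError("expected exactly one .dist-info/WHEEL file")
        text = zf.read(wheel_paths[0]).decode("utf-8")
    # WHEEL uses the same "Key: value" header layout as email messages.
    version = Parser().parsestr(text)["Wheel-Version"]
    if version is None:
        raise ValueError("WHEEL file has no Wheel-Version field")
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def check_installable(wheel_file):
    """Fail on a newer major version (minor-version warnings are omitted here)."""
    major, _minor = read_wheel_version(wheel_file)
    if major > SUPPORTED_MAJOR:
        raise RuntimeError(f"unsupported wheel major version {major}")
    return True


def make_demo_wheel(wheel_version):
    """Build a tiny in-memory archive that only contains a WHEEL file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "demo-1.0.dist-info/WHEEL",
            f"Wheel-Version: {wheel_version}\nGenerator: demo\n"
            "Root-Is-Purelib: true\nTag: py3-none-any\n",
        )
    buf.seek(0)
    return buf
```

Note that this check only runs once the tool has the archive in hand and can open it, which is the cost the rest of the thread is arguing about.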

By trying to say that isn’t good enough, but still keeping it a zip file, the question is why isn’t that good enough? PEP 777 suggests keeping it a zip file forever, and yet at the same time says the extension must change (this is breaking the existing compatibility method), and yet it also puts in place a bunch of limitations that need not exist if we’re already breaking forward compatibility:

Wheel files, when installed, MUST stay compatible with the Python standard library’s importlib.metadata for all supported CPython versions. For example, replacing .dist-info/METADATA with a JSON formatted metadata file MUST be a multi-major version migration with one version introducing the new JSON file alongside the existing email header format, and another future version removing the email header format metadata file. The version to remove .dist-info/METADATA also MUST be adopted only after the last CPython release that lacked support for the new file reaches end of life. This ensures that code using importlib.metadata will not break with wheel major version revisions.

Wheel files MUST remain ZIP format files as the outer container format. Additionally, the .dist-info metadata directory MUST be placed at the root of the archive without any compression, so that unpacking the wheel file produces a normal .dist-info directory holding any metadata for the wheel. Future wheel revisions MAY modify the layout, compression, and other attributes about non-metadata components of a wheel such as data and code. This assures that future wheel revisions remain compatible with tools operating on package metadata, while allowing for improvements to code storage in the wheel, such as adopting compression.

Package tooling MUST NOT assume that the contents and format of the wheel file will remain the same for future wheel major versions beyond the limitations above about metadata folder contents and outer container format. For example, newer wheel major versions may add or remove filename components, such as the build tag or the platform tag. Therefore it is incumbent upon tooling to check the metadata for the Wheel-Version before attempting to install a wheel.

Finally, future wheel revisions MUST NOT use any compression formats not in the CPython standard library of at least the latest release. Wheels generated using any new compression format should be tagged as requiring at least the first released version of CPython to support the new compression format, regardless of the Python API compatibility of the code within the wheel.

These restrictions seem pointless to break over, because compliant tools can already do this by respecting the major version that exists in the current wheel spec. That invites several questions: what can’t we do using only the major version in an existing file inside the zip file? How would other breaking changes enable even better things overall? And if we are rejecting all of those things, why are we breaking the existing spec for this?


And the easiest way to do this that makes it part of the file was explicitly rejected by you and others. The filename is possible to use instead, but the filename is not part of the file; it’s filesystem metadata. Perhaps that’s too pedantic, but I definitely wasn’t considering it once it was ruled out of the extension, and I think you’re the first person to bring up that it could still be in the filename without being in the extension. This still leaves the question: “what do we gain by breaking the format now?” The cost is that existing installers won’t see a wheel and will fall back to source distributions or older versions, which other packaging threads have called out as a bad user experience before. Keeping the existing format, installers will get a wheel and then be able to error in compliance with the current specification for what to do when encountering a wheel version they don’t speak.


That’s an opinion one can have, but it’s certainly not a universal truth. I have the exact opposite experience – CSV/TSV are orders of magnitude worse for arbitrary string content (the escaping games you have to play if your delimiter appears in the payload…) than JSON, and that’s even before talking about any nested content.

Not that I’m a huge fan of JSON, but it’s ubiquitous exactly because it has a very good ratio of “it works” vs. the amount of effort required, which – for better or worse (depending on your view) – includes the fact that it does not prescribe a fixed schema and is thus very easily extensible.


If you’re discussing manually handling delimiters, you should be comparing to manually parsing JSON. Python has an extremely capable csv library built in that handles delimiters correctly.
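As a small illustration of that point, the sketch below round-trips a field containing a comma, quotes, and a newline through the standard `csv` module, then shows what a naive split on commas and newlines does to the same text. The sample rows are invented for the example.

```python
import csv
import io

rows = [
    ["name", "summary"],
    ["demo", 'A field with a comma, a "quote", and a\nnewline'],
]

# Write with the csv module, which quotes and escapes fields as needed.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
encoded = buf.getvalue()

# Read it back with the csv module; newline="" avoids newline translation.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))

# A hand-rolled split sees extra rows and extra columns instead.
naive = [line.split(",") for line in encoded.splitlines()]
```

Here `decoded` equals `rows`, while `naive` ends up with a different number of rows and columns. Whether every producer in a pipeline actually uses a correct writer is a separate question.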

There are also actual binary formats that are schema-less and generally compare reasonably well with a handwritten protobuf. msgpack is a much better choice for a machine format than JSON. If you only need to support a small subset of msgpack that can store the same types that JSON can, you can hand-roll a pure Python implementation in under 100 lines of code.
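To give a rough feel for that claim, here is a sketch of such a subset using only `struct`. It is an illustration, not a complete or interoperable msgpack library. The encoder emits valid msgpack but not always the most compact encoding, and the decoder only understands the type codes this encoder produces. The names `pack` and `unpack` are chosen for the example.

```python
import struct


def _length_prefix(n, fix_base, fix_limit, tag32):
    """Use the compact 'fix' form when the length fits, else a 32-bit length."""
    if n < fix_limit:
        return bytes([fix_base | n])
    return bytes([tag32]) + struct.pack(">I", n)


def pack(obj):
    """Encode None, bool, int, float, str, list, and dict as msgpack bytes."""
    if obj is None:
        return b"\xc0"
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int):
        if 0 <= obj < 128:
            return struct.pack("B", obj)           # positive fixint
        if -32 <= obj < 0:
            return struct.pack("b", obj)           # negative fixint
        if obj > 0:
            return b"\xcf" + struct.pack(">Q", obj)  # uint 64
        return b"\xd3" + struct.pack(">q", obj)      # int 64
    if isinstance(obj, float):
        return b"\xcb" + struct.pack(">d", obj)      # float 64
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _length_prefix(len(data), 0xA0, 32, 0xDB) + data
    if isinstance(obj, list):
        head = _length_prefix(len(obj), 0x90, 16, 0xDD)
        return head + b"".join(pack(item) for item in obj)
    if isinstance(obj, dict):
        head = _length_prefix(len(obj), 0x80, 16, 0xDF)
        return head + b"".join(pack(k) + pack(v) for k, v in obj.items())
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def _unpack_at(data, pos):
    """Decode one value starting at pos; return (value, next_pos)."""
    tag = data[pos]
    pos += 1
    if tag <= 0x7F:
        return tag, pos                            # positive fixint
    if tag >= 0xE0:
        return tag - 256, pos                      # negative fixint
    if tag in (0xC0, 0xC2, 0xC3):
        return {0xC0: None, 0xC2: False, 0xC3: True}[tag], pos
    if tag in (0xCF, 0xD3, 0xCB):
        fmt = {0xCF: ">Q", 0xD3: ">q", 0xCB: ">d"}[tag]
        return struct.unpack_from(fmt, data, pos)[0], pos + 8
    if 0xA0 <= tag <= 0xBF:
        kind, n = "str", tag & 0x1F
    elif 0x90 <= tag <= 0x9F:
        kind, n = "array", tag & 0x0F
    elif 0x80 <= tag <= 0x8F:
        kind, n = "map", tag & 0x0F
    elif tag in (0xDB, 0xDD, 0xDF):
        kind = {0xDB: "str", 0xDD: "array", 0xDF: "map"}[tag]
        n = struct.unpack_from(">I", data, pos)[0]
        pos += 4
    else:
        raise ValueError(f"type code 0x{tag:02x} is outside this subset")
    if kind == "str":
        return data[pos:pos + n].decode("utf-8"), pos + n
    if kind == "array":
        items = []
        for _ in range(n):
            item, pos = _unpack_at(data, pos)
            items.append(item)
        return items, pos
    result = {}
    for _ in range(n):
        key, pos = _unpack_at(data, pos)
        value, pos = _unpack_at(data, pos)
        result[key] = value
    return result, pos


def unpack(data):
    """Decode a single msgpack value and reject trailing bytes."""
    value, end = _unpack_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after msgpack value")
    return value
```

For example, `unpack(pack({"name": "demo", "requires": ["a", "b"]}))` gives back the original dictionary.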


I’m discussing real-world content and how it breaks assumptions of CSVs. If your payload contains a newline, you’re hosed. If whatever source of the CSV didn’t correctly quote content containing delimiters, you’re hosed (and often there are many different tools of varying quality or implementation choices touching such files). Worse, the breakage is silent, because whatever happens, it can still be interpreted as a CSV (just with a different number of columns than expected). All of that is a recipe for lots of headaches.

JSON is easy to parse, validate, default-format etc. CSVs are not even close to being in competition (I get that we all live in our own bubbles, but thankfully I haven’t seen CSVs for years in either professional or FOSS life, while JSON files are literally everywhere. I take this as circumstantial evidence that other people came to similar conclusions).

I like msgpack too! My guess why it isn’t more widely adopted is that in the end even “machine formats” often end up being read by humans, and in most cases, removing that layer of introspectability/patchability has not been worth the savings in overall footprint.

In any case, I don’t think we should be discussing the merits of JSON here, so I’ll leave it at that.


A lot more discussion overnight than I expected.

I don’t think this is possible to decouple. Whether or not it is worth committing now to breaking an existing compatibility rule is predicated on whether the gains are worth it. This PEP has not provided anything that requires breaking, and the options for improvement that would benefit from breaking are largely unpopular. [1]

  • Access to the wheel version without downloading is possible by supplementing the index, and the download cost has already been minimized with HTTP range requests.
  • zstd use is possible via an archive within an archive.
  • A new metadata format would be breaking, and would not enable dropping existing code: we can’t stop supporting installation of v1 wheels without undermining abi3.
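To show why range requests help with the first point, here is a local stand-in that involves no network. It wraps an in-memory archive in a file object that counts the bytes the `zipfile` module actually reads when asked for a single small member. `CountingFile` and `build_demo_wheel` are hypothetical helpers for this sketch. A real installer would need a file-like object that issues HTTP `Range` requests, which is not shown.

```python
import io
import os
import zipfile


class CountingFile(io.BytesIO):
    """In-memory file that records how many bytes callers actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_demo_wheel():
    """Build an archive with one large payload and a tiny WHEEL file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("demo/blob.bin", os.urandom(1_000_000))
        zf.writestr("demo-1.0.dist-info/WHEEL", "Wheel-Version: 1.0\n")
    return buf.getvalue()


archive = build_demo_wheel()
counting = CountingFile(archive)
with zipfile.ZipFile(counting) as zf:
    # Only the central directory and this one member need to be read.
    wheel_text = zf.read("demo-1.0.dist-info/WHEEL").decode("utf-8")

fraction_read = counting.bytes_read / len(archive)
```

In this setup `fraction_read` is a small fraction of the archive, because the reader only touches the end of the file and the requested member. Each seek would still be a separate round trip over HTTP, so this reduces the cost without removing it.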

The PEP also supposes that if breaking happens, we would do things extremely similar to how they already are, which sets up repeated breakage as not just a realistic outcome, but an expected one.

While I disagree with @Liz on some of the specifics, the general idea that we should be looking for improvements enabled by breaking is pretty reasonable if someone is suggesting breaking, because if there aren’t any that require breaking, why are we doing this? And if we are going to break things, why would we not do everything we reasonably could to ensure breakage was a single event based on specific needs and future-proofing based on where the current approach fell short?


  1. The fact that manually editing a binary distribution format is something people seriously want to retain as normal is a clear signal to me that tooling generating wheels isn’t serving users right now. There is zero apparent confidence from experienced packaging experts that the tools will be able to do the right thing without manual intervention, but none of the issues related to that are being put forth as something requiring breaking to fix. ↩︎


I don’t see how zstd is possible without changing the wheel version–an installer wouldn’t know that it needs zstd to install the package. And if the wheel version needs to change, there are two ways to do that: keep the extension and let old installers fail completely[1], or use a different extension so that the old installer ignores them entirely and finds the most recent whl.

That’s the rationale laid out in the PEP and it makes a lot of sense to me–and the reason to do this as a separate step is to iron out the adoption phase before there’s a rush to transition to a new format, minimizing breakages and disruption.


  1. because the resolver will consider them, reject them for being the wrong version, and stop ↩︎


Yeah, I had a similar line of thinking. If storing the information in the metadata is too costly, putting it in the filename is the next best thing. I haven’t had time to think through all of the implications. I think the main concerns I have with putting it in the filename are:

  • it would significantly limit wheel filenames in the future (e.g. we’d need to define the delimiter used as part of the spec)
  • I’m not really sure where it should go. If it’s at the end then it further limits the filename. Prior tags would have to be mandatory
  • If it’s at the beginning, it would look weird before the distribution name
  • We could make it right after distribution version, since I expect any wheel filename to start with {distribution}-{version}, but then people might think it’s a build tag based on experience from wheel 1?

Of all of the above, I think after distribution/version seems the most appealing, but I’d love to hear others’ thoughts.
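To make the "after distribution/version" option concrete, here is a sketch of how a consumer might read such a component from the name alone. The layout is purely hypothetical: nothing in the PEP or this thread defines it. The `wv{major}` marker, the `.whlx` extension, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical layout: {distribution}-{version}-wv{major}-...rest....whlx
# The "wv" prefix is an invented way to avoid confusion with wheel 1 build
# tags, which start with a digit.
_PREFIX = re.compile(
    r"^(?P<dist>[^-]+)-(?P<version>[^-]+)-wv(?P<major>[0-9]+)(?:-|\.)"
)


def wheel_major_from_filename(filename):
    """Return the wheel major version encoded in the name, or None if absent."""
    match = _PREFIX.match(filename)
    if match is None:
        return None
    return int(match.group("major"))


def is_candidate(filename, supported_majors=frozenset({2})):
    """Decide from the filename alone whether this tool should consider it."""
    major = wheel_major_from_filename(filename)
    return major is not None and major in supported_majors
```

With this layout, `is_candidate("demo-1.0-wv2-py3-none-any.whlx")` is true and a `wv3` file is skipped, without opening either archive. A current-style name such as `demo-1.0-1-py3-none-any.whl` does not match at all.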

The other thing about putting the version in the filename is this would re-allow side-by-side upload of wheels of different versions for migrations to new wheel versions. I’m still not sure if we want that to be supported, but I think it would be nice if it was.


The current behavior is an issue because resolvers will pick incompatible wheels, since they don’t have the information. The point of changing the file extension is to allow new wheels to be uploaded in an ecosystem where some old tools don’t check the version during resolution.

Current tooling cannot reasonably select new wheels as installation candidates today, even when it supports them, because it cannot see the major version at resolution time.
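A toy model of the difference, under simplifying assumptions: `old_resolver_pick` is a made-up stand-in for a wheel-1-only tool that filters on the `.whl` extension and picks the highest version by filename, never opening the archive. Real resolvers also handle tags, markers, and proper version parsing.

```python
def old_resolver_pick(filenames):
    """Mimic an old tool: consider only *.whl names and pick the highest version."""
    candidates = []
    for name in filenames:
        if not name.endswith(".whl"):
            continue  # files with an unknown extension are invisible to this tool
        _dist, version = name.split("-")[:2]
        key = tuple(int(part) for part in version.split("."))
        candidates.append((key, name))
    return max(candidates)[1] if candidates else None


# Scenario 1: the 2.0 release uses a new wheel format but keeps the extension.
same_extension = ["demo-1.0-py3-none-any.whl", "demo-2.0-py3-none-any.whl"]

# Scenario 2: the 2.0 release uses a new extension (".whlx" is a placeholder).
new_extension = ["demo-1.0-py3-none-any.whl", "demo-2.0-py3-none-any.whlx"]

picked_same = old_resolver_pick(same_extension)
picked_new = old_resolver_pick(new_extension)
```

In the first scenario the old tool commits to the 2.0 file and only discovers the version problem at install time. In the second it never sees the 2.0 file and settles on the 1.0 wheel.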
