What would it look like to have a new archival format for distributions?

thatch · June 8, 2026, 3:27pm

>@mikeshardmind We can’t allow arbitrary zstd. We have to pick specific options.

Can you explain more of the “why” here? Pretty sure the only one in that list that affects decompression is the window size. The others are only compression-side options that produce equivalent (but not bit-identical) streams with some ratio variation. You should probably double-check whether 28 is allowed by default, I don’t think it is.

Because the effective window is limited by the total uncompressed size of the frame (and of the archive, worst case), windows beyond 8 MB show diminishing returns on current real-world examples. Even large wheels need to have contrived contents to benefit much.
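As a rough illustration of that limit, here is a minimal sketch, assuming the window is expressed as a power-of-two "window log" and that back-references cannot reach further back than the start of the input. The function name and the sample sizes are hypothetical, not taken from any draft.

```python
def smallest_useful_window_log(uncompressed_size: int) -> int:
    """Smallest window log whose window already covers the whole input.

    A window larger than the input cannot reference any extra data,
    so raising the window log past this value should not help the ratio.
    """
    if uncompressed_size < 1:
        raise ValueError("size must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    # 10 is the smallest window log the frame format can express (1 KiB).
    return max(10, (uncompressed_size - 1).bit_length())


MiB = 1 << 20
for size in (5 * MiB, 8 * MiB, 300 * MiB):  # hypothetical archive sizes
    print(size // MiB, "MiB ->", smallest_useful_window_log(size))
```

This only bounds where a larger window can possibly matter; whether it actually improves the ratio below that bound depends on the contents.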

EDIT:

Specific examples would help. I don’t know of any where you end up with different resulting data, only differences in accept vs reject.

thatch · June 8, 2026, 3:35pm

@emmatyping good thoughts on all of those [and some drafts already do, there are actually four half-baked ideas so far]

The reason I at-ted you originally is I’m also interested in what the motivating example for symlinks actually is. I think that’s what led us down the route of considering other formats, given the backwards-incompatible change it would be within zipfile.

mikeshardmind · June 8, 2026, 4:29pm

Sigh, thanks. I got bitten by this; it’s noted in a comment there:

The limit does not apply for one-pass decoders (such as ZSTD_decompress()), since no additional memory is allocated

because my initial testing for just the resulting compression ratios was simplified.

Diminishing returns still add up, especially for some of the larger GPU wheels right now, given how many users they have. That last setting was chosen with those wheels in mind.

Per the zstd specification, supporting zstd doesn’t require supporting the full gamut of options, or being able to decompress archives that were compressed with options a tool doesn’t support. I’d much rather avoid an issue down the road where some tool implemented in another language looks fine, but then users run into issues on some packages that chose different options on compression.

The differential on accept/reject here is important to me. It should be clear from the moment a tool sees the compression label whether or not the tool can handle unpacking it, as well as whether it can reasonably add it to a pool of ongoing decompression threads (memory limit) to parallelize over wheels being installed. Without standardizing these options, and with needing to support tools not written in Python using what we can know exists, I was attempting to limit the potential for issues there.
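To make the accept/reject point concrete, here is a minimal sketch of reading the window requirement from the start of a zstd frame before allocating anything. It follows the frame header layout in the Zstandard format specification (RFC 8878). The function names and the default limit of window log 23 (8 MiB) are assumptions for illustration, not values agreed in this thread.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # stored little-endian: 28 B5 2F FD


def required_window_size(frame_start: bytes) -> int:
    """Return the window size in bytes that the frame header declares."""
    if len(frame_start) < 6:
        raise ValueError("need at least 6 bytes of the frame")
    (magic,) = struct.unpack_from("<I", frame_start, 0)
    if magic != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")

    descriptor = frame_start[4]
    fcs_flag = descriptor >> 6            # Frame_Content_Size_flag
    single_segment = bool(descriptor & 0x20)
    dict_id_flag = descriptor & 0x03

    if not single_segment:
        # Window_Descriptor: 5-bit exponent, 3-bit mantissa.
        window_descriptor = frame_start[5]
        exponent, mantissa = window_descriptor >> 3, window_descriptor & 0x07
        base = 1 << (10 + exponent)
        return base + (base // 8) * mantissa

    # Single-segment frames have no Window_Descriptor; the window is the
    # frame content size, which follows the optional dictionary ID.
    pos = 5 + (0, 1, 2, 4)[dict_id_flag]
    fcs_len = (1, 2, 4, 8)[fcs_flag]
    field = frame_start[pos:pos + fcs_len]
    if len(field) != fcs_len:
        raise ValueError("truncated frame header")
    size = int.from_bytes(field, "little")
    return size + 256 if fcs_len == 2 else size  # 2-byte form is offset by 256


def can_accept(frame_start: bytes, max_window_log: int = 23) -> bool:
    """Hypothetical policy check: reject frames above the pledged window."""
    return required_window_size(frame_start) <= (1 << max_window_log)
```

A scheduler could run a check like this on each archive up front and use the result to decide how many decompressions fit in its memory budget at once.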

If we go with requiring full zstd support, because much of the zstd api is only “partially stable”, we need a robust definition of what that actually entails, rather than defining it based on the compression and decompression parameters + test vectors.

I was storing the compression settings as an integer, which would require that tools know what each one maps to in a specific version. I have a slightly higher bar for expectations of tool authors when it comes to consuming a for-purpose binary format, and I also anticipate that if we get agreement on a format, we might be able to get implementations that we can all verify exhibit the correct behavior in multiple languages to make adoption easier.

thatch · June 8, 2026, 7:25pm

I don’t think we have to settle this early on, but that’s not my read of it. The word “parameters” to me implies only the window-size-related limits, not that partial implementations are ok. Could you quote the specific passages?

I believe the bitstream (unlike parts of the API) is stable, and there have only been minor errata.

mikeshardmind · June 8, 2026, 8:53pm

Unless otherwise indicated below, a compliant compressor must produce data sets that conform to the specifications presented here. It doesn’t need to support all options though.
A compliant decompressor must be able to decompress at least one working set of parameters that conforms to the specifications presented here. It may also ignore informative fields, such as checksum. Whenever it does not support a parameter defined in the compressed stream, it must produce a non-ambiguous error code and associated error message explaining which parameter is unsupported.

Which, upon rereading, does change the calculus a little bit. It’s insufficient to specify only the compressor’s parameters, because different valid implementations can produce different conformant outputs, and a given valid decompressor may only be able to decompress some of them.

We have to require that unpackers support the entirety of the current version of the zstandard specification, and we should still require pledging a specific window log size to allow for safe implementations of unpacking.

emmatyping · June 8, 2026, 11:06pm

Ah, sorry I missed the at! I agree with Paul’s comments on this, but perhaps with different priorities.

For me, the most important use case is projects that want to provide libraries that users can build against on Linux: the common foo.so.1.2 -symlink> foo.so trick, which provides both a runtime library and a library to build against without duplicating the library.

I also think symlink-based editable installs would be nice, since the current mechanism does not provide sufficient information for static analyzers. There are other solutions to that problem, but symlinks are one.

hpkfft · June 9, 2026, 5:08am

It would be nice to make a Python wheel that contains native libraries such that the native libraries retain the names naturally used by the system. For example, on my system:

$ ls -lh libzstd*
lrwxrwxrwx 1 root root   16 Mar 13  2025 libzstd.so.1 -> libzstd.so.1.5.7
-rw-r--r-- 1 root root 806K Mar 13  2025 libzstd.so.1.5.7

As you can see, the actual file libzstd.so.1.5.7 has a name that includes the full version information.
The file named libzstd.so.1 must be present since that’s what’s loaded at runtime:

$ readelf -d libzstd.so.1 | grep SONAME
 0x000000000000000e (SONAME)             Library soname: [libzstd.so.1]

Libraries that depend on other libraries indicate their NEEDED dependencies. So, if another native library needs libzstd.so.1, it puts exactly that in its header because that’s the SONAME of what it wants. For example, libzstd.so.1 depends on libc.so.6:

$ readelf -d libzstd.so.1 | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]

so the file named libc.so.6 must be found on the system.

C or C++ developers would have installed the development package libzstd-dev and would have yet another symlink, libzstd.so -> libzstd.so.1 so they could compile as follows:

cc my-application.c -lzstd

and get the most recent version installed on their system.

My FFT module, hpk.cpython-313-x86_64-linux-gnu.so, has a NEEDED dependency on libhpk_core.so.0. Currently, in packaging my wheel, I have a choice:

1. Package only libhpk_core.so.0, which lacks full version information, or
2. Package both libhpk_core.so.0 and libhpk_core.so.0.6.0, which are two copies of the same thing.

That’s why I’d like to see wheels support symlinks (at least within the installation directory).
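For readers less familiar with the layout described above, here is a minimal sketch that recreates the real-file / SONAME / development-name chain in a temporary directory. The helper name, the placeholder file contents, and the unversioned libhpk_core.so name are assumptions for illustration; it needs a filesystem that supports symlinks.

```python
import os
import tempfile
from pathlib import Path


def make_soname_chain(libdir: Path, real_name: str, soname: str, dev_name: str) -> None:
    """Create one real file plus the two conventional symlinks to it."""
    # Stand-in bytes; a real wheel would ship the actual shared object here.
    (libdir / real_name).write_bytes(b"\x7fELF placeholder")
    os.symlink(real_name, libdir / soname)    # loaded at runtime via SONAME
    os.symlink(soname, libdir / dev_name)     # used by the linker for -l<name>


with tempfile.TemporaryDirectory() as tmp:
    libdir = Path(tmp)
    make_soname_chain(libdir, "libhpk_core.so.0.6.0", "libhpk_core.so.0", "libhpk_core.so")
    for name in sorted(os.listdir(libdir)):
        path = libdir / name
        arrow = f" -> {os.readlink(path)}" if path.is_symlink() else ""
        print(f"{name}{arrow}")
```

Only one of the three names occupies disk space for the library itself, which is the duplication the second packaging choice above cannot avoid.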

pf_moore · June 9, 2026, 7:36am

One interesting point here is that the use cases for symlink support are with wheels, whereas the discussion is about a new format for all distribution types (sdists and wheels). Is it OK to have a case like this, where certain features are only allowed for particular distribution types?

pitrou · June 9, 2026, 9:26am

Symlink support could be useful for sdists too, for example if a license file needs to be replicated in several subprojects.

pf_moore · June 9, 2026, 10:02am

Of course, with any symlink support, we’d need to define what tools should do when faced with a filesystem that doesn’t support symlinks. There are two obvious options:

1. Use copies instead.
2. Fail with an error.

Option (1) would break the editable install case (I don’t know if Linux .so files work with copies rather than symlinks). Option (2) would discourage using symlinks except where absolutely necessary (for example, the license example could make an sdist unusable in cases where copying would be absolutely fine).

We could leave fallback behaviour to tools, but that affects reproducibility, which I think some people here have said they care about.

A more complex option could be to have additional metadata in the file for symlinks, saying whether copies are acceptable or not. Consumers could then fall back to copies if that’s allowed, but error if there are required symlinks that can’t be created.
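A minimal sketch of that last option, assuming a hypothetical per-link flag (called copy_ok here) that the archive metadata would carry. Nothing like it exists in a current format. The symlink parameter exists only so the fallback path can be exercised on systems that do support symlinks.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def materialize_link(
    target: Path,
    link: Path,
    *,
    copy_ok: bool,
    symlink: Callable[..., None] = os.symlink,
) -> str:
    """Create `link` for `target`; return "symlink" or "copy".

    If symlinks are unavailable, copy only when the (hypothetical)
    metadata says a copy is acceptable; otherwise re-raise the error.
    """
    try:
        # Relative target, so the unpacked tree stays relocatable.
        symlink(os.path.relpath(target, link.parent), link)
        return "symlink"
    except (OSError, NotImplementedError):
        if not copy_ok:
            raise
        shutil.copy2(target, link)
        return "copy"
```

With copy_ok=True this behaves like option (1), and with copy_ok=False like option (2), so the choice moves from the installer to whoever built the archive.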

konstin · June 9, 2026, 12:33pm

pip already supports symlinks in source distributions (pip/src/pip/_internal/utils/unpacking.py at 486db076e2f4f0bf6780c24cd487f09dc2a14015 in pypa/pip), and the CPython standard library handles the case where the platform doesn’t support symlinks (cpython/Lib/tarfile.py at v3.14.4 in python/cpython), using option (1).

pitrou · June 9, 2026, 12:46pm

There’s no reason why they wouldn’t work. Symlinks should be considered a size and maintenance optimization.

mikeshardmind · June 9, 2026, 3:19pm

I’m still of the opinion that the archive format shouldn’t contain symlinks directly, just indicate which paths have the same content. Leave it to those unpacking to make the optimization or not. (This can be optimized within the archive format without specifically calling it a symlink)
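A minimal sketch of what "indicate which paths have the same content" could mean, assuming the archive (or the tool reading it) can map each path to its bytes. The function names and the choice of SHA-256 are illustrative assumptions, not part of any proposal here.

```python
import hashlib
from collections import defaultdict


def group_by_content(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each content digest to the sorted list of paths that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        groups[digest].append(path)
    return dict(groups)


def duplicated_paths(files: dict[str, bytes]) -> list[list[str]]:
    """Return only the groups where more than one path has the same bytes."""
    return [paths for paths in group_by_content(files).values() if len(paths) > 1]
```

The archive would then store each distinct blob once, and an unpacker would be free to realize each group as copies, links, or anything else that makes the content available at every listed path.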

jamestwebber · June 9, 2026, 3:36pm

That makes sense, but does it just delay the task of standardization? If different tools make different choices about how to unpack the redundant path, is it possible for them to create incompatible installations? Like, one tool was used to install a dependency, and another tool assumes it was installed in a different way. I’m not sure how that could cause a problem but it feels like someone would find a way if it’s possible.

Maybe it would be acceptable to say “use a single tool for your environment, they might make incompatible choices” but I think eventually a standard would be desirable.

Perhaps that’s not that complicated: just say “the first listed path [1] should be the file, the rest can be links or copies or whatever”. Although that could prevent making all of the instances symlinks, if you had a global cache or something.

[1] and/or highest in the directory tree

mikeshardmind · June 9, 2026, 3:58pm

I don’t think so, at least not if there are reasonable statements made. It just explicitly makes it clear that those distributing should only rely on the content they ship being available at a specific path relative to where it is unpacked, not the means by which it is made available at that location.

mikeshardmind · June 9, 2026, 4:02pm

Oh, right.

The way it could create an incompatibility is if a wheel/sdist ends up modifying its own files. In the case of a symlink, every path to that file sees the modification; in the case of copies, only the one modified does.
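The divergence is easy to reproduce. A minimal sketch, assuming a filesystem with symlink support and an in-place write (a tool that replaces the file via rename would behave differently); the file names are placeholders.

```python
import os
import shutil
import tempfile
from pathlib import Path


def other_path_sees_write(link_kind: str) -> bool:
    """Write through the second path; report whether the first one changed."""
    with tempfile.TemporaryDirectory() as tmp:
        first, second = Path(tmp, "data.txt"), Path(tmp, "alias.txt")
        first.write_text("v1")
        if link_kind == "symlink":
            os.symlink(first.name, second)
        elif link_kind == "copy":
            shutil.copy2(first, second)
        else:
            raise ValueError(link_kind)
        second.write_text("v2")  # the package "modifies its own file"
        return first.read_text() == "v2"


print("symlink:", other_path_sees_write("symlink"))
print("copy:", other_path_sees_write("copy"))
```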

I have zero issues ruling out this being supported behavior for wheels. I don’t think sdists should be doing it either, but maybe there’s an argument there.

tiran · June 9, 2026, 4:12pm

Equal file content does not necessarily mean that files should be symlinked. A common example is an empty __init__.py file.

You’d also need a hint to instruct the unpacker which file should be the solid file. In the shared library use case, libzstd.so.1.5.7 should be the file and libzstd.so.1 should be the symlink. And there is also the hypothetical case where file-a and file-b have the same content and the user wants link-a -> file-a / link-b -> file-b.

mikeshardmind · June 9, 2026, 4:39pm

I don’t really follow any of those given examples.

Why should we care if all empty __init__.py are possibly symlinks? Why should the distribution be relying on which file is the solid file? What behavior that we need to support within Python requires these things? If it’s just a hypothetical, let’s rule it out. I don’t see any reason why we should allow either distributors or users to rely on this when those distributing can’t reliably assume that the file system being unpacked to supports symlinks at all.

hpkfft · June 9, 2026, 4:53pm

This just seems to be inventing a new word for symlink. That is, a filesystem has a “symlink”, which is a pointer named libzstd.so.1 to the file libzstd.so.1.5.7. An archive file has a “contentlink” which is a pointer named libzstd.so.1 to the archived contents of the file libzstd.so.1.5.7.
Regardless, my hope would be that installing on a system that supports filesystem symlinks results in symlinks.

I don’t know much about Windows. Is that the only Python-supported system that does not support symlinks? I kinda think, maybe, that Windows supports hard links? If so, that might be better than installing two copies of the same file (both to save space and to avoid the copies diverging if one file is later modified).
I kinda think, maybe, that Windows supports symlinks; they just call them “shortcuts”. But, maybe, one has to be an administrator or something to create them? [Note: I really don’t know what I’m talking about here. I’m just brainstorming…]

mikeshardmind · June 9, 2026, 4:55pm

One of the reasons to not let the archive specify it as a symlink is that a symlink carries a predetermined meaning.

An optimized installer might keep a central cache of all package files and, for all venvs it manages, link all files to that cache based on content hash.

A symlink with a notion of a specific relation is incompatible with that potential optimization, as are junctions and hardlinks. I’d rather not confuse the terminology or rule out other optimizations.
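A minimal sketch of such an installer-side cache, assuming files are keyed by SHA-256 and realized in each environment with a hard link, falling back to a copy. The post does not say which link type an installer would use, so that choice, the function name, and the directory layout are all assumptions.

```python
import hashlib
import os
import shutil
from pathlib import Path


def install_from_cache(cache: Path, dest: Path, content: bytes) -> str:
    """Store `content` once in `cache`, then make it available at `dest`.

    Returns "hardlink" or "copy" depending on what the filesystem allowed.
    """
    digest = hashlib.sha256(content).hexdigest()
    blob = cache / digest
    if not blob.exists():
        cache.mkdir(parents=True, exist_ok=True)
        blob.write_bytes(content)
    dest.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(blob, dest)  # installer's choice, not dictated by the archive
        return "hardlink"
    except OSError:
        shutil.copyfile(blob, dest)
        return "copy"
```

Here every installed path points at the cache entry rather than at a sibling path inside the package, which is why an archive entry that says "this path is a symlink to that path" would not map onto it directly.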