This is in the vein of my other “what would it look like?” posts. If you were at PyCon with me this year, you know that I like throwing big (not necessarily practical!) ideas out there to get people thinking about how we can make fundamental, future-oriented changes to Python packaging.
TL;DR: It’s my opinion that ZIP (for wheels) and tar (for sdists) have served the ecosystem well for years (decades!), but that we’re overdue for a better archive format, one that is ideally unified between dist kinds. This is my attempt to lay out the justification for that plus some desiderata for a “better archive” format. I’m positive that it isn’t exhaustive.
Benefits of ZIP and tar
ZIP and tar benefit significantly from being old, mature formats: they’re baked into every version of Python, are widely available via system utilities, are understood by users and analysis tools, etc. Both are also capable of incremental modernization: ZIPs support zstd members, and tar’s status as a flat archival format means we can always wrap it in a newer compression format (e.g. .tar.zstd).
Problems with ZIP and tar
Both ZIP and tar are not just old, but ambiguous: both suffer from decades of layered/overlapping/conflicting specifications, and are additionally complicated by real-world considerations (a long tail of implementations that diverge from specifications, but are important enough in practice that other implementations must also support their divergences).
To briefly summarize, here are the different ways in which ZIP and tar are standardized/documented:
- ZIP: PKWARE’s APPNOTE, ISO/IEC 21320-1:2015, INFO-ZIP’s APPNOTE
- tar: Unix v7, ustar, GNU, pax, plus vendor variants that are not widely emitted but are often supported on read paths (e.g. Sun’s variant of pax).
On top of this, the two formats have disjoint limitations:
- ZIP has very limited file metadata support, including a limited (MS-DOS derived) filetime field and only a loose convention for UNIX-style permissions.
- tar has various path and file size limitations, which are worked around with ustar/pax or GNU extensions (and unfortunately, sometimes both in the same archive).
- tar lacks a (native) index/TOC, meaning that readers have to stream the entire archive to locate a single member.
- Relatedly, tar doesn’t support per-member compression, meaning that a (compressed) tar stream can’t designate a single member as “stored” for cheap access. The traditional workaround to this is to nest tars (or tars and ZIPs) together, with the outermost layer being “stored” and the inner layer being compressed.
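
To make the index point concrete, here is a minimal sketch using only the standard library. The file names and helper functions are made up for illustration. It builds the same three files as a ZIP and as a `.tar.gz`, then fetches one member from each: the ZIP reader goes through the central directory, while the streaming tar reader has to walk past every earlier member first.

```python
import io
import tarfile
import zipfile

# Hypothetical archive contents, loosely shaped like a wheel.
FILES = {
    "pkg/__init__.py": b"",
    "pkg/data.bin": b"\x00" * 100_000,
    "pkg-1.0.dist-info/METADATA": b"Name: pkg\nVersion: 1.0\n",
}


def build_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def build_tar_gz(files):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_from_zip(blob, wanted):
    # The central directory maps names to offsets, so only the
    # requested member is located and decompressed.
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(wanted)


def read_from_tar_gz(blob, wanted):
    # Stream mode ("r|gz"): no index, so members are visited in order
    # and everything before the wanted one is decompressed and skipped.
    skipped = 0
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r|gz") as tf:
        for member in tf:
            if member.name == wanted:
                return tf.extractfile(member).read(), skipped
            skipped += 1
    raise KeyError(wanted)
```

In this toy layout the metadata happens to be last, so the tar reader skips two members before reaching it. Putting metadata first in the stream avoids that for this one case, but not for arbitrary member lookups.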
Furthermore, both formats are prone to parser differentials, even when implementations are careful to obey one or more of the specifications[1]. This is particularly true in the case of tar, where the layering of specifications means that each implementation makes bespoke (and not necessarily consistent) decisions in its extension handling state machine.
All of the above has practical consequences: we’re fans of adding MUST language to PEPs, but when it comes to distribution formats these are largely unenforceable. For example, PEP 625 stipulates that source distributions use the “pax” flavor of tar, but in reality a significant percentage of sdists on PyPI (including ones made after PEP 625) either use the GNU flavor or, worse, are chimeras/hybrids of pax and GNU. I think it’s a real risk that any future attempts to mandate a tractable/safe subset of these formats will meet a similar fate.
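
As a rough illustration of how one might spot these flavors, here is a sketch that walks the raw 512-byte headers of an uncompressed tar stream and records which markers it sees: the POSIX `ustar` magic versus the GNU one, pax extended headers (typeflags `x`/`g`), and GNU long-name/long-link headers (typeflags `L`/`K`). The function names and labels are my own, the classification rules are a simplification rather than anything a PEP defines, and a real sdist would need to be gunzipped first.

```python
import io
import tarfile

BLOCK = 512


def _field_size(field: bytes) -> int:
    # Size is normally NUL/space-padded octal; GNU uses a base-256
    # encoding (high bit set) for large values. Negative values are ignored.
    if field[0] & 0x80:
        return int.from_bytes(field[1:], "big")
    text = field.strip(b"\x00 ")
    return int(text, 8) if text else 0


def tar_markers(blob: bytes) -> set:
    """Collect format markers from every header in an uncompressed tar."""
    markers = set()
    offset = 0
    while offset + BLOCK <= len(blob):
        header = blob[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # zero block: end of archive
            break
        magic = header[257:265]  # magic (6 bytes) + version (2 bytes)
        typeflag = header[156:157]
        if magic == b"ustar\x0000":
            markers.add("posix-magic")
        elif magic == b"ustar  \x00":
            markers.add("gnu-magic")
        else:
            markers.add("no-magic")
        if typeflag in (b"x", b"g"):
            markers.add("pax-extension")
        elif typeflag in (b"L", b"K"):
            markers.add("gnu-extension")
        size = _field_size(header[124:136])
        # Skip the header plus the payload, rounded up to whole blocks.
        offset += BLOCK + -(-size // BLOCK) * BLOCK
    return markers


def classify(markers: set) -> str:
    pax = "pax-extension" in markers
    gnu = "gnu-extension" in markers or "gnu-magic" in markers
    if pax and gnu:
        return "hybrid"
    if pax:
        return "pax"
    if gnu:
        return "gnu"
    return "ustar" if "posix-magic" in markers else "v7-or-unknown"


def sample(fmt: int) -> bytes:
    """Build a one-member archive whose name is too long for a plain header."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo("d/" * 60 + "file.txt")
        info.size = 5
        tf.addfile(info, io.BytesIO(b"hello"))
    return buf.getvalue()
```

For example, `classify(tar_markers(sample(tarfile.GNU_FORMAT)))` reports the GNU flavor, since Python’s `tarfile` emits a GNU long-name header for the over-long path. Even this toy needs special cases, which is more or less the point.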
Desiderata
So, what would it look like to do better? Here are some (certainly not exhaustive) desiderata I can think of, in the context of Python packaging:
- We should probably use the same archive format for both wheels and sdists, in the (distant) future. It seems like mostly a historical quirk that we ended up with two different formats for the two, since they’re already uniquely distinguished by their filenames.
- Having an index/table of contents[2] in the archive is extremely useful for numerous purposes: optimized metadata reads, live inspection with something like PyPI’s inspector, efficient static scanning/malware analysis, etc. The first of these is accomplishable through other mechanisms (putting metadata first in the stream, stacked tars, etc.), but the more general case is difficult to accomplish without a true file index.
- Having a completely unambiguous format definition: we’ve learned a lot about making parse-resilient archive formats in the last ~30 years, and we can probably improve on the distinct limitations in tar/ZIP by making proper accommodations for long paths and UNIX permissions, and by making it harder to represent extractions outside of the target path.
- (Maybe) having per-member compression. This is a pretty nice property of ZIP: some entries can be stored verbatim, while others can be compressed with various compression algorithms (for Python, in practice this is DEFLATE and - eventually - zstd), meaning that a client performing an incremental read of a ZIP off the wire can often cheaply seek to exactly the metadata it needs and read it directly without another round of decompression. The downside to this is some lost compression opportunity (between members) plus complexity/malleability in member representation.
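
Here is a small sketch of the per-member compression property, using the standard library’s `zipfile`. The archive layout and helper names are hypothetical. The metadata member is written as “stored” while the payload is DEFLATE-compressed, and the stored bytes can then be sliced straight out of the raw archive using the offset recorded in the index, with no decompression step.

```python
import io
import struct
import zipfile


def build_wheel_like(metadata: bytes, payload: bytes) -> bytes:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Payload uses the archive default (DEFLATE).
        zf.writestr("pkg/module.py", payload)
        # Metadata is stored verbatim for cheap access.
        zf.writestr(
            "pkg-1.0.dist-info/METADATA",
            metadata,
            compress_type=zipfile.ZIP_STORED,
        )
    return buf.getvalue()


def read_stored_member(blob: bytes, name: str) -> bytes:
    """Slice a stored member out of the raw bytes, bypassing zipfile's reader."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError(f"{name} is not stored verbatim")
    # The local file header is 30 fixed bytes followed by the file name
    # and extra field, whose lengths sit at offsets 26 and 28.
    fixed = blob[info.header_offset:info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", fixed[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    return blob[start:start + info.file_size]
```

Over the network, the same offsets could drive a ranged read instead of a slice. The malleability downside shows up here too: a reader has to handle whichever method each member happens to use.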
Prior art and other considerations
Programmers love inventing new formats (often for marginally beneficial reasons), and it’s usually a bad idea to create a new standard. But usually doesn’t mean always, and I think the ecosystem has a prevailing interest in something better here.
Some prior art for consideration:
- The cpio family: cpio predates tar, and has a similar lineage (in terms of extensions). It doesn’t have many of the desiderata above, since it’s a flat archival format that is typically wholly wrapped in a layer of compression.
- xar: @jjhelmus pointed this out: it uses a header-index-heap design where the index points into the heap, with heap members being individually compressed. The index is an XML manifest, and the format itself appears to be de facto unmaintained, even though it’s seemingly still widely used within macOS.
- Others? There’s 7z, ar, etc., but I’m not very familiar with them.
CCing some folks who I’ve talked about this with and who I know have widely ranging and valuable opinions on the subject: @jjhelmus @emmatyping @sethmlarson @geofft @konstin
