What would it look like to have a new archival format for distributions?

This is in the vein of my other “what would it look like?” posts. If you were at PyCon with me this year, you know that I like throwing big (not necessarily practical!) ideas out there to get people thinking about how we can make fundamental, future-oriented changes to Python packaging :slightly_smiling_face:

TL;DR: It’s my opinion that ZIP (for wheels) and tar (for sdists) have served the ecosystem well for years (decades!), but that we’re overdue for a better archive format, one that is ideally unified between dist kinds. This is my attempt to lay out the justification for that plus some desiderata for a “better archive” format. I’m positive that it isn’t exhaustive.

Benefits of ZIP and tar

ZIP and tar benefit significantly from being old, mature formats: they’re baked into every version of Python, are widely available via system utilities, are understood by users and analysis tools, etc. Both are also capable of incremental modernization: ZIPs support zstd members, and tar’s status as a flat archival format means we can always wrap it in a newer compression format (e.g. .tar.zstd).

Problems with ZIP and tar

Both ZIP and tar are not just old, but ambiguous: both suffer from decades of layered/overlapping/conflicting specifications, and are additionally complicated by real-world considerations (a long tail of implementations that diverge from specifications, but are important enough in practice that other implementations must also support their divergences).

To briefly summarize, here are the different ways in which ZIP and tar are standardized/documented:

On top of this, the two formats have disjoint limitations:

  • ZIP has very limited file metadata support, including a limited (MS-DOS derived) filetime field and only a loose convention for UNIX-style permissions.
  • tar has various path and file size limitations, which are worked around with ustar/pax or GNU extensions (and unfortunately, sometimes both in the same archive).
  • tar lacks a (native) index/TOC, meaning that readers have to stream the entire archive to locate a single member.
    • Relatedly, tar doesn’t support per-member compression, meaning that a (compressed) tar stream can’t designate a single member as “stored” for cheap access. The traditional workaround to this is to nest tars (or tars and ZIPs) together, with the outermost layer being “stored” and the inner layer being compressed.
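To make the index point concrete, here is a small sketch using only the standard library. The member names and helper functions are made up for illustration; it counts how many tar headers get parsed before the last member turns up, while the ZIP lookup goes through the central directory.

```python
import io
import tarfile
import zipfile

def build_archives(names: list[str]) -> tuple[bytes, bytes]:
    """Build an in-memory tar and ZIP holding the same tiny members."""
    tar_buf, zip_buf = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tf:
        for name in names:
            payload = name.encode()
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    with zipfile.ZipFile(zip_buf, "w") as zf:
        for name in names:
            zf.writestr(name, name.encode())
    return tar_buf.getvalue(), zip_buf.getvalue()

def tar_headers_walked(raw: bytes, wanted: str) -> int:
    """Count how many tar headers are parsed before `wanted` is found."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for count, member in enumerate(tf, start=1):
            if member.name == wanted:
                return count
    raise KeyError(wanted)

names = [f"pkg/mod_{i}.py" for i in range(50)]  # hypothetical member names
tar_raw, zip_raw = build_archives(names)
walked = tar_headers_walked(tar_raw, names[-1])
with zipfile.ZipFile(io.BytesIO(zip_raw)) as zf:
    # The offset comes straight from the central directory, no member walk needed.
    zip_offset = zf.getinfo(names[-1]).header_offset
```

On a local, uncompressed file the walk is cheap; the cost shows up when the tar is compressed or sitting behind HTTP.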

Furthermore, on top of that, the two standards are differential-prone, even when implementations are careful to obey one or more of the specifications[1]. This is particularly true in the case of tar, where the layering of specifications means that each implementation makes bespoke (and not necessarily consistent) decisions in its extension handling state machine.

All of the above has practical consequences: we’re fans of adding MUST language to PEPs, but when it comes to distribution formats these are largely unenforceable. For example, PEP 625 stipulates that source distributions use the “pax” flavor of tar, but in reality a significant percentage of sdists on PyPI (including ones made after PEP 625) either use the GNU flavor or (worse) use chimeras/hybrids of pax and GNU. I think it’s a real risk that any future attempts to mandate a tractable/safe subset of these formats will meet a similar fate.
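As a rough illustration of what checking for flavors involves, here is a sketch that walks the raw headers of an uncompressed tar and reports which extension families show up. The function names are made up; it assumes plain octal size fields (no GNU base-256 sizes) and treats typeflags `x`/`g` as pax and `L`/`K` as GNU long-name/long-link records.

```python
import io
import tarfile

BLOCK = 512  # tar header and padding granularity

def tar_extension_kinds(raw: bytes) -> set[str]:
    """Report which extension families ("pax", "gnu") appear in a raw tar."""
    kinds: set[str] = set()
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset:offset + BLOCK]
        if header == b"\0" * BLOCK:  # end-of-archive marker
            break
        typeflag = header[156:157]
        if typeflag in (b"x", b"g"):
            kinds.add("pax")
        elif typeflag in (b"L", b"K"):
            kinds.add("gnu")
        size_field = header[124:136].rstrip(b"\0 ")
        size = int(size_field, 8) if size_field else 0  # assumption: octal only
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return kinds

def make_tar(fmt: int) -> bytes:
    """Write one member whose name is too long for a plain ustar header."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo("d/" + "a" * 150)
        info.size = 3
        tf.addfile(info, io.BytesIO(b"abc"))
    return buf.getvalue()

pax_kinds = tar_extension_kinds(make_tar(tarfile.PAX_FORMAT))
gnu_kinds = tar_extension_kinds(make_tar(tarfile.GNU_FORMAT))
```

A result containing both `"pax"` and `"gnu"` would be the chimera case. A real checker would need to handle far more than this sketch does.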

Desiderata

So, what would it look like to do better? Here are some (certainly not exhaustive) desiderata I can think of, in the context of Python packaging:

  1. We should probably use the same archive format for both wheels and sdists, in the (distant) future. It seems like mostly a historical quirk that we ended up with two different formats for the two, since they’re already uniquely distinguished by their filenames.
  2. Having an index/table of contents[2] in the archive is extremely useful for numerous purposes: optimized metadata reads, live inspection with something like PyPI’s inspector, efficient static scanning/malware analysis, etc. The former is accomplishable through other mechanisms (putting metadata first in the stream, stacked tars, etc.), but the more general case is difficult to accomplish without a true file index.
  3. Having a completely unambiguous format definition: we’ve learned a lot about making parse-resilient archive formats in the last ~30 years, and we can probably improve on the distinct limitations in tar/ZIP by ensuring that we make proper accommodations for long paths, UNIX permissions, making it harder to represent extractions outside of a path, etc.
  4. (Maybe) having per-member compression. This is a pretty nice property of ZIP: some entries can be stored verbatim, while others can be compressed with various compression algorithms (for Python, in practice this is DEFLATE and - eventually - zstd), meaning that a client performing an incremental read of a ZIP off the wire can often cheaply seek to exactly the metadata it needs and read it directly without another round of decompression. The downside to this is some lost compression opportunity (between members) plus complexity/malleability in member representation.
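For anyone who hasn't poked at per-member compression directly, here is a minimal sketch with `zipfile`. The project name and file contents are invented; the point is that a stored member's bytes sit verbatim in the archive, so a reader that knows the offset can slice them out without touching a decompressor.

```python
import io
import struct
import zipfile

METADATA = b"Name: demo\nVersion: 1.0\n"  # hypothetical metadata payload

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Metadata is stored verbatim; code is DEFLATE-compressed.
    zf.writestr("demo-1.0.dist-info/METADATA", METADATA,
                compress_type=zipfile.ZIP_STORED)
    zf.writestr("demo/__init__.py", "x = 1\n" * 1000,
                compress_type=zipfile.ZIP_DEFLATED)
raw = buf.getvalue()

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    meta = zf.getinfo("demo-1.0.dist-info/METADATA")
    code = zf.getinfo("demo/__init__.py")

# The local file header is 30 fixed bytes, followed by the name and extra field;
# the two length fields live at byte 26 of that header.
name_len, extra_len = struct.unpack_from("<HH", raw, meta.header_offset + 26)
data_start = meta.header_offset + 30 + name_len + extra_len
stored_bytes = raw[data_start:data_start + meta.file_size]
```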

Prior art and other considerations

Programmers love inventing new formats (often for marginally beneficial reasons), and it’s usually a bad idea to create a new standard. But usually doesn’t mean always, and I think the ecosystem has a prevailing interest in something better here :slightly_smiling_face:

Some prior art for consideration:

  • The cpio family: cpio predates tar, and has a similar lineage (in terms of extensions). It doesn’t have many of the desiderata above, since it’s a flat archival format that is typically wholly wrapped in a layer of compression.
  • xar: @jjhelmus pointed this out: it uses a header-index-heap design where the index points into the heap, with heap members being individually compressed. The index is an XML manifest, and the format itself appears to be de facto unmaintained, even though it’s seemingly widely used within macOS still.
  • Others? There’s 7z, ar, etc., but I’m not very familiar with them.

CCing some folks who I’ve talked about this with and who I know have widely ranging and valuable opinions on the subject: @jjhelmus @emmatyping @sethmlarson @geofft @konstin


  1. Which, to be clear, I would not say that most implementations are careful in this regard :slightly_smiling_face: ↩︎

  2. Central Directory in ZIP parlance. ↩︎

8 Likes

Others? There’s 7z, ar, etc., but I’m not very familiar with them.

I think AR is noteworthy in the prior art category as the outer aggregation container for the DEB package format. Its benefit there is a cheap means of concatenating separate TAR bundles for metadata and installable content. A basic description can be found in the deb(5) manpage.

1 Like

Silly use case to consider: I often want to inspect locally built wheels and sdists for testing. Having a CLI to inspect (without extracting) and to extract is a must of course, but I usually find it easier to just visit the file in Emacs[1]. Having support for that format would be nice.


  1. yeah, yeah, substitute your code editing thingie of choice ↩︎

6 Likes

Thanks for writing this up, William! Big +1 from me for replacing zip and tar for the reasons you wrote.

The main feature I’m looking for (apart from security) is streaming compression and decompression, which is crucial for performance. If it clashed with a central index, I’d take installation performance over inspection convenience.

Maybe we can still have cheap metadata reads by placing the metadata at the front so you only have to read the first compression block? This would be more effective than zip currently as it’s a single request and we wouldn’t even need range requests for missing index metadata, we could just start a streaming HTTP request and terminate it once we got our file. But that’s also only a less important consideration, we should focus on indexes providing .metadata files, not pre-PEP hack support.

Per-file compression tends to perform badly on packages with many small files. For example, the plotly wheel plotly-6.7.0-py3-none-any.whl is 9.5MB baseline, 9.3MB when switching the compression to zstandard in the wheel (per-file compression), but 6.6MB when zstandard compressing a zip with uncompressed files. Even when zstandard compressing the original plotly-6.7.0-py3-none-any.whl wheel that’s already DEFLATE’d (plotly-6.7.0-py3-none-any.whl), it’s only 8.6MB.[1]
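The effect is easy to reproduce in miniature. This is only a sketch on synthetic data (500 made-up, near-identical source files) and it uses DEFLATE via `zlib` as a stand-in for zstandard, so the absolute numbers mean nothing; it just shows the direction of the gap between per-member and whole-archive compression.

```python
import zlib

# Synthetic stand-in for a wheel with many small, similar source files.
files = [
    (f"def handler_{i}(request):\n"
     f"    return render(request, 'page_{i}.html')\n" * 3).encode()
    for i in range(500)
]

# Per-member: every file is compressed independently, with no shared history.
per_member = sum(len(zlib.compress(data, 9)) for data in files)

# Solid: one compression stream over the concatenation of all files.
solid = len(zlib.compress(b"".join(files), 9))

ratio = solid / per_member
print(f"per-member: {per_member} bytes, solid: {solid} bytes, ratio: {ratio:.2f}")
```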


  1. No guarantees for those quick numbers, but I did a proper benchmark some time ago and it was a similar effect ↩︎

7 Likes

It should be of no surprise given our chats during the summit that I am in favor of this. Big +1 from me.

I think I would like to see something that is solely designed for the complexities of Python packaging and addresses those concerns.

My thoughts after thinking about this on the way home from PyCon.

  1. Index at a known offset. Metadata reads without full archive downloads by reading Index and seeking to Metadata location. This kind of works in ZIP but iirc requires scanning backwards until you find the EOCD signature?
  2. Both distribution formats use the same format and are distinguished by contents ie dist-info vs project layout.
  3. Per-member compression with a fixed algorithm menu - By this I mean each entry in the index declares its compression with maybe 1-2 options like stored and zstd.
  4. Deliberately limited metadata per entry - No UIDs, no GIDs, no xattrs, no ACLs, no link targets.
  5. No “extra fields” blocks. No vendor-specific headers. If the format needs to evolve, it gets a new version number in the magic, and the spec for that version is singular and complete.
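On the EOCD question in item 1: as far as I understand it, yes, a reader has to search backwards from the end because the record can be followed by a variable-length comment. A minimal sketch of that search follows; the function name is made up and it ignores ZIP64 and multi-disk archives entirely.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4sHHHHIIH"  # fixed 22-byte part of the end-of-central-directory record
EOCD_SIZE = struct.calcsize(EOCD_FMT)
MAX_COMMENT = 0xFFFF     # the trailing comment can be up to 65535 bytes

def find_eocd(raw: bytes) -> tuple[int, int, int]:
    """Return (total_entries, cd_size, cd_offset) from the last EOCD signature."""
    tail_start = max(0, len(raw) - EOCD_SIZE - MAX_COMMENT)
    pos = raw.rfind(EOCD_SIG, tail_start)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    (_sig, _disk, _cd_disk, _entries_here, total, cd_size, cd_offset,
     _comment_len) = struct.unpack_from(EOCD_FMT, raw, pos)
    return total, cd_size, cd_offset

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world")
raw = buf.getvalue()
total, cd_size, cd_offset = find_eocd(raw)
```

Note that the signature bytes can legally appear inside the comment itself, which is one of the places where readers can disagree. An index at a known offset removes the search altogether.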

Thanks for triggering this conversation William :slight_smile:

4 Likes

Most of this looks fine. Mixed feelings on per-member compression. There are tradeoffs here.

If designing a new format, I would expect it to be composed of something along the lines of:

  • Header: whatever magic you want, followed by a file format version, followed by a value indicating if this is acting as a wheel or sdist.
  • distribution metadata block: (fixed offset per file format version): length prefixed, streamable, any metadata needed by resolvers should be here, as well as the offsets of the following blocks.
  • index block: info about the content data block. This could be designed to allow a choice of per-member compression. Can have multiple index files point to the same data, etc. Any compression allowed needs to be standard from a set of choices allowed in that format version.
  • content data block: data that is opaque without referencing the index block. contains all the file content.
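A minimal sketch of what the first two pieces of that layout could look like with `struct`. Everything specific here is an assumption for illustration: the magic value, the field widths, the kind codes, and the use of JSON for the metadata block are all invented, not a proposal.

```python
import json
import struct

MAGIC = b"PYDIST\0\0"               # hypothetical 8-byte magic
HEADER = struct.Struct("<8sHBxI")   # magic, version, kind, pad byte, metadata length
KIND_WHEEL, KIND_SDIST = 0, 1       # hypothetical kind codes

def pack_prefix(kind: int, metadata: dict) -> bytes:
    """Serialise the header followed by a length-prefixed metadata block."""
    blob = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return HEADER.pack(MAGIC, 1, kind, len(blob)) + blob

def unpack_prefix(raw: bytes) -> tuple[int, dict]:
    """Parse the prefix, bailing out on anything the reader does not understand."""
    magic, version, kind, meta_len = HEADER.unpack_from(raw, 0)
    if magic != MAGIC:
        raise ValueError("not a recognised archive")
    if version != 1:
        raise ValueError(f"unsupported format version {version}")
    if kind not in (KIND_WHEEL, KIND_SDIST):
        raise ValueError(f"unknown distribution kind {kind}")
    end = HEADER.size + meta_len
    if end > len(raw):
        raise ValueError("truncated metadata block")
    return kind, json.loads(raw[HEADER.size:end])

prefix = pack_prefix(KIND_WHEEL, {"name": "demo", "version": "1.0"})
kind, metadata = unpack_prefix(prefix + b"...index and content blocks...")
```

Because the metadata block sits at a fixed offset and carries its own length, a resolver only ever needs the first `HEADER.size + meta_len` bytes of the file.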

Agreed, and I’d also say that file paths in metadata should be encoded in a way that it is impossible for a path to express anything other than “relative to where it would be unpacked and without any upward traversal operator.”
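As a sketch of the kind of rule I mean (the function name and the exact rejection list are assumptions, and a real format would more likely store path components separately so that these cases can't be written down at all):

```python
def validate_member_path(path: str) -> tuple[str, ...]:
    """Split a member path into components, rejecting anything non-relative."""
    if not path:
        raise ValueError("empty path")
    if "\0" in path or "\\" in path:
        raise ValueError("NUL and backslash are not allowed")
    parts = tuple(path.split("/"))
    for part in parts:
        # An empty part covers leading "/", trailing "/", and "//".
        if part in ("", ".", ".."):
            raise ValueError(f"illegal path component {part!r}")
        # Assumption: colons are rejected to rule out Windows drive prefixes.
        if ":" in part:
            raise ValueError("colon is not allowed in a component")
    return parts

good = validate_member_path("pkg/sub/mod.py")
rejected = []
for candidate in ["/etc/passwd", "../x", "a/../../b", "a//b", "C:/x", "a\\b", ""]:
    try:
        validate_member_path(candidate)
    except ValueError:
        rejected.append(candidate)
```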

2 Likes

Warning: opinions

including a limited … filetime field and only a loose convention for UNIX-style permissions

I don’t think this is 100% true – like tar, it does occasionally require extra extensions but those are essentially part of the de facto spec.

I think we can define a boring subset of zip or tar with the intent that it’s validated by local tools, validated by warehouse, and available for third-party mirrors to validate as well. Do we consider that validation sufficient to let permissive decoders do what they already do (just in a more predictable way)?

I’m not a fan of tar-in-cpio or tar-in-tar or squashfs nonsense – If we get real benefits (like seekable tar) then great, but I’m kind of with Barry that there should be a bias towards existing formats if fit for purpose.

Another point, not directly related: there’s stuff in zipfile that we’re not fixing because of the potential for behavior change. I hope that whatever’s decided here is done in a forward-compatible way that lets decoders know if they “understand” it fully.

5 Likes

Here are my initial thoughts after a quick skim.

  • Whatever we do should future-proof by:
    • Choosing a new file extension for both binary and source distributions that will be used forever. The extensions are purely for informational purposes and tools must not acknowledge them except as a user aid.
    • Having format identifiers contained within that will dictate how the file is processed. If we end up not creating something bespoke, then we should probably use the multiformats standard directly.
  • Improving security is important, and package resolution performance should certainly be considered, but so far we seem to be thinking too narrowly for my liking. I think the most compelling reason to redesign this space is to make lower-level, foundational improvements using content-addressed files. We should be thinking about how to deduplicate content not just on the package repository but also locally, which is particularly relevant for the large assets required by packages for machine learning, GPUs, etc. I’m aware that this introduces complexity around repository “ownership” of shared content and questions around how to make the runtime behavior work in this way but I think the benefits outweigh any implementation hardship on our side.
3 Likes

This would be neat, but (by count) the vast majority of files in wheels are tiny. I assume you’re trying to reduce pypi’s storage burden, the download time for users, and also potentially local disk? Maybe having a separate regime for >1MB members would help the user experience, but complicates the job of quota calculation for pypi.

2 Likes

Thanks for getting this discussion started!

I’m working on writing up a very restrictive definition of .tar.zst, based on a whole lot of conversations at PyCon (with @thatch and others). The actual problem here is that we need a precise and unambiguous archive format, and tar has multiple specifications and both it and Zstandard give a lot of leeway. But we do not necessarily need a brand new format that’s incompatible with them, and I think we can write up a “new” format that happens to be parseable by .tar.zst readers (i.e., our format is a subtype of regular .tar.zst) but avoids the usual tarpits. I’m working on writing up what I’m thinking soon, and that’s my actual (current) preference.

But since you mentioned a new format, I think it’s worth talking about NAR, the Nix Archive. Here’s the original definition via an encoder in 31 lines of functional pseudocode. For those of us who find it easier to read Python, here’s a Python translation:

from pathlib import Path
import os

def st(b: bytes) -> bytes:
    return len(b).to_bytes(8, "little") + b + (-len(b) % 8) * b"\0"

def serialise(path: Path) -> bytes:
    return st(b"nix-archive-1") + serialise1(path)

def serialise1(path: Path) -> bytes:
    return st(b"(") + serialise2(path) + st(b")")

def serialise2(path: Path) -> bytes:
    t = b""
    if path.is_symlink():
        t += st(b"type") + st(b"symlink")
        t += st(b"target") + st(bytes(path.readlink()))
    elif path.is_file():
        t += st(b"type") + st(b"regular")
        if path.stat().st_mode & 0o100:
            t += st(b"executable") + st(b"")
        t += st(b"contents") + st(path.read_bytes())
    elif path.is_dir():
        t += st(b"type") + st(b"directory")
        for child in sorted(path.iterdir()):
            t += st(b"entry") + st(b"(")
            t += st(b"name") + st(os.fsencode(child.name))
            t += st(b"node") + serialise1(child)
            t += st(b")")
    else:
        raise RuntimeError("nar does not support this")
    return t

Even in our more verbose language, the reference encoder is only one line longer. Despite this simplicity, there are a lot of nice things in the format: it supports 64-bit file contents (and file names!), and it can encode symlinks and executable bits, which are useful for representing built software.

There’s also a lot of things it leaves out, compared to other archive formats, and for representing built software I think you don’t need them. The only permission bit is the single executable bit. You can’t store the setuid bit. You can’t encode file ownership (users and groups) at all. You can’t represent hard links, device nodes, etc. Nix, like us, is transferring directories of compiled code to run on other people’s computers and accounts, so supporting any of this is an anti-feature. (NixOS manages things like setuid bits and devices and users by constructing them in a separate location on the user’s system after the package is installed; the Nix package itself cannot contain these things.) It’s worth noting that this is also the data model Git uses to represent directories (trees) exactly: there’s technically a full set of permission bits in the serialized format, but all the client actually stores or reads is plain files, executable files, symlinks, or other trees.

Another important thing it leaves out is support for sparse files, which it can do because it assumes NARs are always going to be compressed, and the compressor will take care of collapsing ranges of zeroes on its own. I think we should do the same in our format. (I assume this is also why the format is relatively unconcerned about space usage and doesn’t try to pack things into tight binary structures.)

One thing it doesn’t quite handle that we’d want to account for is encodings. Nix is designed for UNIX-style systems, where pathnames are all arbitrary byte strings (char *) that are by convention usually valid UTF-8, but you’re allowed to put anything you want in a path name other than a NUL character. We need to support wheels on Windows, so I think we need a rule that filenames are, in fact, UTF-8. (I don’t see a reason to support non-UTF-8 filenames even on UNIX.)

I think it also satisfies your requirement of being completely unambiguous. There’s no binary structures, no “extended headers,” no repurposing of one field to mean something else, no archive headers at arbitrary locations, etc. Everything is a length-prefixed bytestring read in order from the beginning so there isn’t really a way to misinterpret anything. It’s pretty obvious how you would extend it to store more types of objects if you wanted, and how to do that in a way that doesn’t introduce ambiguity with existing objects.
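To show how little a reader needs, here is a sketch of the matching decoder for the string framing. `st` is repeated from the encoder above so the snippet stands alone; `read_st` and `tokens` are names I made up, and the strictness about padding is my own choice rather than something I've checked against Nix's reader.

```python
def st(b: bytes) -> bytes:
    """Same framing as the encoder: 8-byte little-endian length, data, zero padding."""
    return len(b).to_bytes(8, "little") + b + (-len(b) % 8) * b"\0"

def read_st(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Read one framed string at `offset`; return (payload, next offset)."""
    if offset + 8 > len(buf):
        raise ValueError("truncated length field")
    length = int.from_bytes(buf[offset:offset + 8], "little")
    start = offset + 8
    end = start + length
    padded_end = end + (-length % 8)
    if padded_end > len(buf):
        raise ValueError("truncated string")
    if any(buf[end:padded_end]):
        # Rejecting non-zero padding keeps one byte sequence per logical value.
        raise ValueError("non-zero padding")
    return buf[start:end], padded_end

def tokens(buf: bytes) -> list[bytes]:
    """Decode a whole buffer into its sequence of framed strings."""
    out, offset = [], 0
    while offset < len(buf):
        tok, offset = read_st(buf, offset)
        out.append(tok)
    return out

stream = st(b"nix-archive-1") + st(b"(") + st(b"type") + st(b"regular") + st(b")")
decoded = tokens(stream)
```

A full parser is just a small state machine over this token list that insists on the exact field order the encoder emits.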

It’s also, of course, an example of an ecosystem deciding they were willing to bother with a completely novel format and it working out fine for them (… as far as I know, would be interesting to see if it really did).

A notable feature from your list that’s missing is an index / random access. I think I want to argue that this is an anti-feature, because it inherently introduces ambiguity. If you have a completely standalone index, then it allows, for instance, two files to overlap each other. If you have something more ZIP-shaped, then you have two sources of truth. I also do think that being able to compress files against each other is important (though it’d be useful to see numbers on this), and coming up with a format where each file is independently decompressible but they share compression state seems tricky.

I suspect that it’s sufficient to have a rule that archives SHOULD (not even MUST!) put metadata like .dist-info at the beginning of the file. Then, with a compressed tar format, a client can do a single range request for the first (e.g.) 64 kB of the file, uncompress it, see if it has the metadata it needs, and if not, read more of the file. From a security point of view, it seems good that it’s uncompressing with its normal decompressor; it’s not using a separate archive header, or scanning from the end of the file when some other tool might scan from the beginning, or whatever. Your average .dist-info directory should compress well, and a single round-trip for a short request is likely going to beat out two requests for an index and the actual file. And the downside (as I understand it) is just slower installs, so there’s an incentive for tool authors to order their wheels correctly. Finally, even for servers that don’t support range requests, the client can drop the connection after it has the data it needs.
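Here is a sketch of the client side of that idea, with the network left out: a byte slice stands in for the range response, and gzip stands in for Zstandard so it runs on any stdlib. The archive contents and function names are invented for the example.

```python
import gzip
import io
import random
import tarfile
import zlib

def build_metadata_first(metadata: bytes, payload: bytes) -> bytes:
    """Build a .tar.gz whose first member is the metadata file."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tf:
        for name, data in (("demo-1.0.dist-info/METADATA", metadata),
                           ("demo/blob.bin", payload)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return gzip.compress(raw.getvalue())

def first_member_from_prefix(prefix: bytes) -> tuple[str, bytes]:
    """Decompress only `prefix` and return the first tar member if it is complete."""
    inflater = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 selects gzip framing
    partial = inflater.decompress(prefix)               # tolerates truncated input
    with tarfile.open(fileobj=io.BytesIO(partial), mode="r:") as tf:
        member = tf.next()
        if member is None or member.offset_data + member.size > len(partial):
            raise ValueError("prefix too short; fetch more bytes")
        return member.name, tf.extractfile(member).read()

# 200 kB of incompressible filler plays the role of the rest of the package.
archive = build_metadata_first(b"Name: demo\nVersion: 1.0\n",
                               random.Random(0).randbytes(200_000))
name, data = first_member_from_prefix(archive[:4096])  # simulated range response
```

The reader never looks past the prefix, and it goes through the same decompressor and tar parser as a full install would.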

Anyway, again, I think I favor a locked-down variant of .tar.zst over NAR, but NAR seems like an illustrative starting point for a new format.

2 Likes

Assuming the NAR state machine requires fields to be in that particular order, and bails otherwise, I don’t see much opportunity for differentials there (it’s entirely in things like utf-8 parsing, embedded nulls, mixed pathseps, overlong encodings and normalization). It reminds me of a more verbose bencode which has the added benefit of only having one checksum for identical data.

If we go that route would we still need the zip trailer?

.tar.zst has more differentials than I’d like across languages (table with axis labels intentionally removed attached) – we can only validate what goes through warehouse, not other mirrors. Is that a concern?

1 Like

I’m inclined to think a lowest-common-denominator format is actually preferred here. We have a range of significantly different platforms to think about, and so when people suggest “new formats properly handling [one specific platform and not the rest’s] features”[1] it just sounds like a compatibility nightmare waiting to happen.

I’m also very keen to ensure that there are common/likely tools that already exist that can unpack the files. (Ever since Conda switched their format it is so painful to debug anything to do with indexes or packaging, and I long for .tar.bz2 to come back as a Windows user.)

I like Geoffrey’s idea of taking an existing format and scoping down what we support, so that existing readers have no problems but writers are constrained. I lean towards ZIP over TAR (for the lack of fancy features), though obviously there are all sorts of other issues with ZIP and I’m not totally confident we can spec them out of what we want to allow.

I do think we can make installers handle more special cases on install (such as creating symlinks) based on our own metadata rather than needing it to be native to the format. Manually unpacking the files should get you a working library install,[2] but if there are extra steps required for e.g. scripts or headers or whatever (as there already are today), that’s fine by me.

Not sure if this is a hypothetical future observation of the current situation, or if you’re observing the past situation that led us to .tar and .whl [.zip]? We’re here because ZIP made the most sense for wheels, but when we decided to drop one of the two supported sdist formats, “everyone” wanted to drop ZIP and keep TAR. If we’d dropped TAR instead, then we’d only have one format right now.

So it’s not a “quirk” - it was decided by a discussion like this exact one that you’ve started :wink: Hopefully we don’t deliberately run ourselves into adding a third quirk for future historians to analyse.


  1. Relocatable installs being top of mind, which means absolute symlinks are out. ↩︎

  2. That is to say, if you make a wheel that can’t be imported after a naive unpack (assuming dependencies are present and can also be imported), you’ve messed up and should fix your wheel. ↩︎

9 Likes

For reference, this was PEP 517 “A build-system independent format for source trees”.

1 Like

I also like this idea (and I think @geofft advances it strongly!), but I’m simultaneously concerned that it won’t play out as we hope in practice: people keep finding new ways to induce differentials in archive formats, and I’m concerned that the practical impact of saying “you must use this subset” will be somewhat similar to the current status with pax-for-sdists (where the standard says must, but everybody ignores it).

That could be somewhat ameliorated/mitigated by ensuring that PyPI only accepts the strict subset on day 1, but I think even that strict subset will develop differentials over time (particularly for formats like tar, where there’s no one “true” specification). I think it’ll also be functionally impossible to assert for third-party indices, whereas a format that’s constrained by construction (versus by whittling) would be harder for them to get wrong.

(Separately, there’s a problem with enforcing subsets of archive formats on PyPI: neither zipfile nor tarfile is a “forensic” parser, i.e. neither attempts to present all state in a parsed file, only whatever subset of state matches the structures and APIs exposed. So we’d probably need a stronger/lower-level set of parsers than we currently have to reject e.g. unrecognized pax extensions, extension sequencings that get ignored, overlaps and holes, etc. At which point we’re very close to having written a new parser - or two - ourselves :slightly_smiling_face:).

Yes, 100% agreed. I think enforcing UTF-8-only names should be table stakes regardless of whether we invent a new format or constrain an existing one.

(We also then get the joys of deciding on a UTF-8 normal form, or just punting on that problem. This is a real differential in some archive formats as well!)

This is a good point, but I think there’s a middle-ground between these two designs: you can have an index where the index is the sole source of truth, with a parser-enforced property that heap entries referenced by the index are (1) fully packed, and (2) never overlapping. In effect this would be pretty similar to xar’s design, except with those two properties being enforced rather than assumed. Unless I’m missing something (which I could be!), that avoids two sources of truth while also preventing overlap.
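Concretely, the parser-enforced check could be as small as this sketch. The `IndexEntry` shape and the function name are hypothetical; the only claim is that "fully packed and never overlapping" is a single linear pass over the sorted index.

```python
from typing import NamedTuple

class IndexEntry(NamedTuple):
    path: str
    offset: int   # byte offset of the member within the heap
    length: int   # number of heap bytes the member occupies

def validate_index(entries: list[IndexEntry], heap_size: int) -> None:
    """Raise unless the entries tile the heap exactly: no holes, no overlaps."""
    cursor = 0
    # Sorting by (offset, length) puts zero-length members before their neighbour.
    for entry in sorted(entries, key=lambda e: (e.offset, e.length)):
        if entry.length < 0:
            raise ValueError(f"negative length for {entry.path!r}")
        if entry.offset < cursor:
            raise ValueError(f"{entry.path!r} overlaps the previous member")
        if entry.offset > cursor:
            raise ValueError(f"hole before {entry.path!r}")
        cursor = entry.offset + entry.length
    if cursor != heap_size:
        raise ValueError("index does not cover the whole heap")

ok = [IndexEntry("b.py", 10, 5), IndexEntry("a.py", 0, 10)]
validate_index(ok, heap_size=15)  # passes silently
```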

The point about gains from compressing members together is also really good though, and I agree it’d be good to have numbers for this – I think it’d be bad to leave >10% space savings on the table if that’s what per-member compression would cost us. But if it’s below that, maybe it’s fine (especially given that we’ll be getting savings from zstd)? I don’t know :slightly_smiling_face:

On the topic of fully random access though: I think putting the metadata up-front solves the main problem for resolvers, but there are probably other pieces of data within distributions that people want (or will want) to access randomly. For example it can be useful to access pyproject.toml itself (for source distributions that have no dynamism, this is a legal way to avoid a build backend invocation when resolving), and the aforementioned cases of inspector/static analysis. We could probably carve each of these out, but there’s a trend of more and more stuff going into .dist-info and I think people might eventually expect cheap access for those members too.

OTOH, this is arguably YAGNI and is best solved by having indices do more sidecar presentation, e.g. detached metadata. I think there’s a strong argument for this.

I’m less familiar with the sdist ones (since I mostly live on Windows, where even if you wanted to use the fancy TAR features, you just can’t), but the ZIP ones I’m familiar with really just require us to resolve arguments quickly, rather than preemptively filter out all potential misuse.

So when someone comes and says “I put 16 characters in this field and now my Windows 95-era ZIP extractor fails” we can say “we defined the Python 3.14 ZIP extractor as the baseline, now go away” instead of having an existential crisis for months over whether their issue should be a supported feature or a vulnerability.

I certainly don’t want to police format specificities on third-party indexes. PyPI users should have a guaranteed level of “the archive you get will be safely extractable by XYZ…”[1] but the whole point of a third-party index is to do things that PyPI doesn’t, can’t, or won’t. The users and maintainers of that index can figure it out from there, but I think this discussion is us as a community figuring it out for PyPI.


  1. Hence my example of using a Python version as the baseline. Certainly needs a better definition than just that, but it’s a starting point. ↩︎

3 Likes

Most of the existing archive formats that people want to reuse (because their other tools already understand the format) have features that conflict with what is best for this specific case, and pretty much all of them also have a ton of potential for differentials.

I think a new format is warranted here, and we should be looking to make it as easy as possible for users to work with the new format: the standard library should be able to unpack it/repack it, and this functionality should be exposed for CLI use.

We get a bunch of benefits here, and the one downside people have is that their existing archive tools won’t understand the format. How important is that really here if we keep it easy for users to still interact with the format manually when they need to?

3 Likes

I think this is a different class of problems than the one I’m concerned about: differentials will naturally occur via platform differences, but the more dangerous ones (from a security perspective) come from places where the spec is ambiguous/contradictory (or, worse, is not obeyed in practice because nobody obeys it).

Concrete examples of this would be malleability between the ZIP central directory and local file entries, or ZIP’s explicit allowance for “holes” in a ZIP stream’s contents. A well-formed file shouldn’t have those problems, but they’re common enough in practice for parsers to be flexible in accepting them, and may not be super easy to enforce within PyPI without effectively implementing even more of a custom ZIP parser.
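A small sketch of the first case, comparing names in the central directory against the raw local file headers. The helper name is made up, and this only looks at names; sizes, flags, and compression methods can disagree in the same way.

```python
import io
import struct
import zipfile

LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30-byte fixed local file header

def name_mismatches(raw: bytes) -> list[tuple[str, str]]:
    """Return (central_name, local_name) pairs where the two records disagree."""
    mismatches = []
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():  # infolist() reflects the central directory only
            fields = LOCAL_HEADER.unpack_from(raw, info.header_offset)
            signature, name_len = fields[0], fields[9]
            if signature != b"PK\x03\x04":
                raise ValueError("central directory points at a non-header")
            start = info.header_offset + LOCAL_HEADER.size
            local_name = raw[start:start + name_len].decode("utf-8", "replace")
            if local_name != info.filename:
                mismatches.append((info.filename, local_name))
    return mismatches

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
clean = buf.getvalue()
# Rewrite only the first occurrence of the name, which is the local header copy.
tampered = clean.replace(b"a.txt", b"b.txt", 1)
clean_result = name_mismatches(clean)
tampered_result = name_mismatches(tampered)
```

A tool that lists via the central directory sees `a.txt`, while a tool that streams local headers sees `b.txt`.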

I think we could indeed resolve these as they come up, but that’s an indefinite commitment to a more loosely defined problem. I think our future selves might thank us for constructing the problem out of existence instead :slightly_smiling_face:

If the stdlib is the only tool that supports it, then it really needs to be available in every non-EOL stdlib before we can assume it’s present. Even then, the lack of GUI tooling will be a legitimate pain (not everyone wants to extract an entire archive or scan a list of 10K lines of text just to check one filename). It’ll also mean that we’ll lose support from virus scanning tools, which are proving to be fairly important right now.

Right, those are also the ones I’m thinking of when I want to say “well, if your Rust library makes a different assumption from our designated reference implementation, it should stop making that assumption”. I’m yet to see any complaint about this that solely involves Python - it’s always a third-party tool making a different assumption (similar concerns/complaints happen with URL parsing for similar reasons).

Or curse ourselves for constructing us into a different set of problems or responsibilities…

4 Likes

It should be possible to provide a canonical implementation that can be vendored as a single file. The tools needed for what people have mused about for a bespoke format are the struct module + whatever compression is supported. If you want, I could put together a prototype example of this later.

I would expect that rather than wait for every non-EOL version to support it, it’s vendored for older but supported Python versions until then.

Maybe, but we’re also seeing that many of these tools miss these issues right now even with a format that’s just a zip file. If the canonical way to use the format requires unpacking (no zipimport-like behavior) and unpacking is designed to always be safe at a format level, then traditional endpoint AV still has visibility between unpack and any later import/execution.

As for people doing ecosystem level scanning, it’s also safe for them to unpack without any extra consideration as a result of the design. It does mean those building research tools will have an intermediate unpack step, but if we keep the actual capabilities of the format limited to the exact things the ecosystem needs, and roll this out with proper notice, I don’t think adapting to this is going to be a significant barrier.

I would also expect that major AV companies would want to understand the format, and could easily do so if we kept the format simple, identifiable, and built off of things they already need.

The differentials last summer show the problem with this: if the spec is ambiguous it’s locally optimal for third-party tools to follow what Python does, but the globally optimal thing is to disambiguate the spec itself. But more generally, that’s how differentials manifest — there’d be no differential if there weren’t multiple implementations.

Yeah, I think this is a pretty critical point: one of the main current problems is that it’s very easy to construct a ZIP (or tar) that presents differently to different applications (not just different packaging tools, but AVs, IDPs, etc.). Those tools would need to learn a new format if we went with one, but they’re already (and IMO more gravely) exposed to significant analysis problems at the moment.

1 Like