A tool claiming to analyse malicious/etc. archives that isn’t detecting inconsistencies like that is not doing what it claims. I’d rather those tools go out of their way to handle non-standard interpretations than force an entire ecosystem to go out of its way just to simplify those tools (at least until there are irresolvable conflicts between legitimate use cases, e.g. two popular archive libraries in different ecosystems that claim to be compatible with each other but aren’t, while both comply with the same spec… pretty sure that can only happen in email parsing though).
I’m very much with you guys on the safety/security side, but I think you’re overindexing on that and ignoring/devaluing the actual impact on regular users (in this discussion so far - obviously not in your actual work, since I know you’ve both done things that help regular users at a “cost” to the specialists… but I think it’s the specialists’ responsibility to carry that cost).
To clarify: I agree that security vendors bear responsibility for making their tools correct. I think this discussion has relevance outside of that scope; they were just an example.
Conversely, I think Python packaging has a responsibility to advance safer and future-oriented standards, and that advancing those standards does benefit users both in the short and long terms outside of security interests. Speaking concretely: a simpler format means a faster index and faster tools, as well as less disruption from security patches. Such a format would also be a sufficient (but not necessary) lead-in for other things discussed up-thread that have previously been brought up as user interests: being able to safely represent symlinks in distributions, making it easy to index/partially read from all distributions, being able to modernize dist compression, etc.
All of that has a cost in terms of a new format, but I think we shouldn’t pre-emptively index on that cost either. Instead, I think we should inventory that cost and figure out which parts are acceptable/mitigable or not. Non-exhaustive examples:
Users won’t be able to extract a new format with native OS archival tooling. But is this a problem that the median (or even +/- 3 SD) Python package consumer has? And if it is, then the alternative of setting an incompatible baseline for an existing format presents a similar issue. My preference: we’d expose a reference implementation and tool for this, and users who need to manually introspect distributions could use it.
A new format here would run ahead of CPython, unless we spend ~5 years cycling it into place. But I think this is OK: the goal here is to have a simple format, one that could be parsed (performantly!) in Python and vendored directly into pip. I think this would probably be the same as the reference implementation above. I believe this precedent has been established already with pip’s adoption of TOML parsing in advance of CPython’s inclusion of tomllib.
I completely agree that zip and tar are problematic formats that are under-specified.
When considering a new format, the question in my mind is whether it is better to specify requirements for tar/zip, or make a new format.
I think I lean towards specifying requirements for tar/zip for a few reasons:
Changing the format would mean users won’t see updates for packages released in the new format, unless you allow side-by-side upload, which has its own security implications and complications. This was discussed at length in the PEP 777 discussion, and the consensus was people felt very uneasy that users would silently not receive updates. If we do allow dual publishing, this has its own security and compatibility concerns, such as what happens with symlinks.
Specifying a new file format is incredibly difficult to do well. There are a lot of subtleties and footguns related to defining the header, versioning/how to make it extensible, whether or not to have an index (as @geofft pointed out, this is not always a winning proposition), and chunking. The Python standard library already has a lot of hard-learned lessons about safely extracting symlinks from tar files for package formats. I think we should not dismiss the fact that for a new format this will need to be reimplemented and introduces the potential for security issues.
A new format would likely preclude zipimport of packages. I am not personally a fan of zipimport but people do use this feature and users of older Python packages would miss out on it.
Some additional thoughts:
I’m glad I was able to nerd-snipe you into doing this! (In all seriousness, thank you and Tim for working on writing these details up!)
I think it would be good to know how many behaviors/details of tar/zip would need to be specified to make a reasonable comparison to a new format. I look forward to your write up!
I think PyPI having strict checks on day 1 would be a forcing function for build backends to generate correctly from the start, and significantly avoid issues like we see with sdists. Like I proposed at the packaging summit, I think a wheel 2.0 still based on zip can place many additional constraints on the format as long as .dist-info/WHEEL is extractable.
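As a rough illustration of what "additional constraints on a zip-based wheel" could look like at upload time, here is a minimal sketch using only the standard library's `zipfile`. The specific checks (no duplicate member names, exactly one `.dist-info/WHEEL`, only stored or deflated members) and the helper names `build_zip` and `check_wheel_zip` are assumptions chosen for the example, not rules proposed in this thread.

```python
import io
import warnings
import zipfile


def build_zip(names: list[str]) -> bytes:
    """Build an in-memory zip with the given member names (test helper)."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # zipfile warns on duplicate names
        with zipfile.ZipFile(buf, "w") as zf:
            for name in names:
                zf.writestr(name, b"data")
    return buf.getvalue()


def check_wheel_zip(data: bytes) -> str:
    """Apply a few illustrative strict checks; return the WHEEL member name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        names = [info.filename for info in infos]
        if len(names) != len(set(names)):
            raise ValueError("duplicate member names")
        for info in infos:
            if info.compress_type not in (zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED):
                raise ValueError(f"unexpected compression for {info.filename!r}")
        wheel_files = [
            n for n in names
            if n.count("/") == 1 and n.endswith(".dist-info/WHEEL")
        ]
        if len(wheel_files) != 1:
            raise ValueError("expected exactly one top-level .dist-info/WHEEL")
        zf.read(wheel_files[0])  # the WHEEL file must actually be extractable
        return wheel_files[0]
```

An index could run checks of this shape before accepting an upload, so that build backends get a hard failure instead of producing archives that only some tools accept.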
One final thought is it is hard to evaluate a new format without a concrete proposal. I realize that is a big ask for unknown results, but I worry it is easy to view a new format as grass being greener on the other side. It very well may be greener, but it’s hard to say unless we can look over the fence.
One thing I’d like to highlight: tar in particular has a header/extension state machine that isn’t well defined in the presence of multiple different tar “flavors”. For example, a tar stream might have a mixed pax/GNU sequence like this:
g -> x -> L -> [entry]
In practice most tar parsers will apply the relevant states from each sequence in order of appearance, so the GNU L will take precedence over the pax x. But this isn’t well defined: a “pure” pax parser is supposed to treat unrecognized typeflags as regular files, in which case the “correct” (but obviously wrong from the user-perspective) behavior would be to apply x (or g’s) attributes to L instead of [entry], resulting in stream desynchronization.
In effect this means that a “pure” pax parser needs to be stricter than the spec says, and reject any typeflags other than the ustar and pax ones. I believe tarfile doesn’t currently do this (the “pure” pax path is write-only), and doesn’t expose “longname” extensions via TarInfo (I’ll confirm that in a second).
(Another thing to cover would be hdrcharset in pax and ensuring it’s constrained to UTF-8 or something similarly reasonable.)
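To make the "stricter than the spec" idea concrete, here is a minimal sketch of a scanner that walks the 512-byte headers of an uncompressed tar stream and rejects any typeflag outside the ustar and pax sets. The function names are hypothetical, the scan ignores base-256 size fields, and it is not how `tarfile` itself behaves; it only shows the shape of the check.

```python
import io
import tarfile

BLOCK = 512
# Typeflags defined by ustar ('0'-'7' and NUL) plus the pax 'x' and 'g' types.
PAX_TYPEFLAGS = set(b"01234567\x00xg")


def header_typeflags(data: bytes) -> list[bytes]:
    """Return the typeflag byte of every header block in a tar stream."""
    flags = []
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset:offset + BLOCK]
        if header == bytes(BLOCK):  # an all-zero block marks end of archive
            break
        flags.append(header[156:157])
        # Size is octal ASCII in bytes 124..135 (base-256 sizes not handled).
        size = int(header[124:136].strip(b"\x00 ") or b"0", 8)
        # Skip the header plus the payload, which is padded to 512 bytes.
        offset += BLOCK + -(-size // BLOCK) * BLOCK
    return flags


def check_pure_pax(data: bytes) -> None:
    """Raise ValueError on any typeflag outside the ustar + pax set."""
    for flag in header_typeflags(data):
        if flag[0] not in PAX_TYPEFLAGS:
            raise ValueError(f"non-pax typeflag {flag!r}")


def make_tar(fmt: int) -> bytes:
    """Build a one-member archive whose name exceeds the 100-byte name field."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        info = tarfile.TarInfo("d/" + "n" * 150)
        info.size = 5
        tf.addfile(info, io.BytesIO(b"hello"))
    return buf.getvalue()
```

With `make_tar(tarfile.GNU_FORMAT)` the long name is carried by a GNU `L` header, which `check_pure_pax` rejects; with `make_tar(tarfile.PAX_FORMAT)` it is carried by a pax `x` header, which passes.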
I agree, although I think we currently lack a way to express some of those strict checks per above
(I’m also worried that we don’t have a firm grounding of what “strict” would be in the case of something like tar, since there’s a read of the pax spec that does allow GNU extensions, just not with clear semantics.)
Very fair. I feel appropriately nerd sniped into writing up an initial spec, if only for comparison
I can have a draft format ready by end of this week.
The point would be to only specify the archive format, not any of the internal content, including the actual content of the resolver metadata block.
These are the decisions I’d be designing around. If anyone has a suggestion for an additional design consideration that I haven’t covered or ruled out, mention it before Wednesday and it may end up considered in a draft.
Strict format with no ambiguous parsing behavior.
No unneeded features (i.e. no filesystem flags, file attributes, etc.)
Ability to efficiently deduplicate content in archive, independent of how tools handle it (allow tools to symlink if user wants, but don’t prescribe the method of accomplishing that in the archive, just give the archive an efficient way to specify “next file is duplicate content of previous”)
Streamable (both download and any decompression)
Metadata needed by resolvers is always in a fixed location near start of file.
All blocks are length-prefixed and either versioned or opaque data to the format.
Ability to vendor an implementation of this in a reasonable single Python file.
Ability to store without features not guaranteed to be available (i.e. store without compression)
Partial deterministic file content: With the same versions of tools used on the same system, and the same options used to pack the files, the resulting archive content should be the same.
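As one possible reading of "all blocks are length-prefixed and either versioned or opaque", here is a minimal sketch of a block reader and writer. The header layout (a one-byte block version followed by an eight-byte big-endian length, with version 0 meaning "opaque to the format") and the size cap are assumptions made up for this example, not part of any draft.

```python
import io
import struct
from typing import BinaryIO

# Hypothetical block header: u8 block version, u64 big-endian payload length.
HEADER = struct.Struct(">BQ")
MAX_BLOCK = 1 << 20  # arbitrary cap so a bogus length cannot force a huge read


def write_block(out: BinaryIO, version: int, payload: bytes) -> None:
    """Write one length-prefixed block."""
    out.write(HEADER.pack(version, len(payload)))
    out.write(payload)


def read_block(stream: BinaryIO, known_versions=frozenset({0, 1})) -> tuple[int, bytes]:
    """Read one block, failing loudly instead of guessing on anything odd."""
    raw = stream.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("truncated block header")
    version, length = HEADER.unpack(raw)
    if version not in known_versions:
        raise ValueError(f"unknown block version {version}")
    if length > MAX_BLOCK:
        raise ValueError("block length exceeds limit")
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated block payload")
    return version, payload
```

Because every block declares its length up front, a streaming reader can skip opaque blocks (such as resolver metadata it does not understand) without parsing them.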
Things I don’t think are worth designing around and why:
I don’t want to implement random read for compressed files or optimize for time packing or adding files to an existing archive.
Since the primary use of this archive format involves unpacking either just the metadata or the entire file (depending on whether the file is used in a full resolver solution), and since it’s used primarily for one-to-many distribution, I think an additional cost for people interested in only a single file is fine, as is telling people to rebuild the entire archive when packing new files if it keeps the format simpler and more efficient.
I don’t want to optimize for fully deterministic content
Compression can vary system to system and version to version with the compression libraries we want to be able to leverage. The same system, content, and options producing the same content is strong enough.
Hi Michael, I came to a similar set of requirements (although I spent most of my time trying to figure out what “ambiguity” means for filesystems). I only got through what maps to your points 1+4 last week; the design draft is at malo/zar/design.md in the fastzip/malo repository on GitHub (along with two other files in that directory, a simple spec.md and a filename-focused unicode.md, plus test cases). I’d like to see what you come up with as well!
I don’t quite understand what this one means. Both tar and zip are at least partially length-prefixed, and this means that any offset bugs quickly can be turned into differing-contents bugs. We can’t validate what goes in contents, so it’s easy to hide an alternate header there while remaining valid-looking to most unpackers.
For this, I went with requiring that filenames be NFKC normalized (because Python normalizes Unicode identifiers this way, import names will match), and rejecting non-normalized names during unpacking, rather than allowing the normalization to change where a file ends up from where it was declared. The packing implementation does normalize.
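The split between "normalize when packing" and "reject when unpacking" can be sketched with the standard library's `unicodedata` module. The function names are hypothetical; only the NFKC rule comes from the post above.

```python
import unicodedata


def normalize_for_packing(name: str) -> str:
    """Packing side: rewrite the name into NFKC before it is stored."""
    return unicodedata.normalize("NFKC", name)


def check_on_unpack(name: str) -> str:
    """Unpacking side: never rewrite, only accept names already in NFKC."""
    if not unicodedata.is_normalized("NFKC", name):
        raise ValueError(f"file name is not NFKC-normalized: {name!r}")
    return name
```

For example, a name starting with the ligature U+FB01 is rewritten to start with plain `fi` at packing time, while an unpacker that meets the ligature form in an archive refuses it rather than silently writing to a different path than the one declared.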
Paths are stored in the archive as a length-prefixed array of length-prefixed bytes. The path separator is not stored; paths must be constructed during unpacking, relative to the unpacking target.
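A minimal sketch of that path encoding, assuming (hypothetically) 16-bit big-endian counts and lengths and UTF-8 fragments. The decoder applies the `.`/`..` rejection from this post; rejecting empty fragments and embedded separators is an extra assumption for the example and is not the post's own list of rejected characters.

```python
import struct
from pathlib import PurePosixPath

U16 = struct.Struct(">H")  # hypothetical width for counts and lengths


def encode_path(fragments: list[str]) -> bytes:
    """Encode a path as a counted array of length-prefixed UTF-8 fragments."""
    out = bytearray(U16.pack(len(fragments)))
    for fragment in fragments:
        raw = fragment.encode("utf-8")
        out += U16.pack(len(raw)) + raw
    return bytes(out)


def decode_path(data: bytes) -> PurePosixPath:
    """Decode fragments and join them; no separator is ever read from the archive."""
    (count,) = U16.unpack_from(data, 0)
    offset = U16.size
    fragments = []
    for _ in range(count):
        (length,) = U16.unpack_from(data, offset)
        offset += U16.size
        raw = data[offset:offset + length]
        if len(raw) != length:
            raise ValueError("truncated path fragment")
        offset += length
        fragment = raw.decode("utf-8")
        # "." and ".." come from the post; the other checks are assumptions.
        if fragment in ("", ".", "..") or "/" in fragment or "\\" in fragment:
            raise ValueError(f"rejected path fragment {fragment!r}")
        fragments.append(fragment)
    if not fragments or offset != len(data):
        raise ValueError("empty path or trailing bytes")
    return PurePosixPath(*fragments)
```

Since the result is always a relative path built from vetted fragments, the unpacker joins it onto its target directory itself instead of trusting a separator-laden string from the archive.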
The exact values . and .. are rejected for any path fragment, as are these characters:
For zstd compression, I’m not allowing arbitrary settings. According to the zstd specification, a zstd implementation is considered valid if it can produce and consume a zstd archive; it’s not required to support all possible settings.
We can avoid a possible issue here by requiring support for predetermined configurations, this is the remaining work I have on this.
Packing step also requires multiple passes in my implementation, but for a different reason. There are no symlinks in the format as I’ve gone about it, just a way to say “use the previous file content again”. This leaves the choice of symlink to the installer as a tool choice and doesn’t require symlink support where unpacked to result in the same content at the declared location. It also avoids a situation where people packing an archive get to assume something is a symlink when an underlying install might choose a different means, including a copy.
It’s built off of blake2b, as Python guarantees this is available for all supported Python versions. sha256 is not guaranteed to be available by Python. As brought up in the discussion for the prior work, blake2b is as platform- and language-available as sha256, with formally verified implementations available, and should therefore be available to tools implemented in other languages (e.g. uv).
Files in the archive format are ordered by (content hash, content_size, encoded_destination). This places identical file content next to each other (used for format-level optimization of duplicate content), and avoids a mapping in the format, instead iteratively matching the manifest block during streaming decompression.
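The ordering rule can be sketched as a sort key, with duplicate content detected by comparing each row against the one before it. This is only an illustration: the 32-byte digest size, the UTF-8 destination encoding, and the `build_manifest` name are assumptions, not details from the draft.

```python
import hashlib


def build_manifest(files: dict[str, bytes]) -> list[tuple[bytes, int, bytes, bool]]:
    """Order entries by (content hash, size, destination) and flag repeats.

    The final flag means "same content as the previous entry", which is all a
    streaming unpacker needs in order to reuse the content it just wrote.
    """
    rows = sorted(
        (
            hashlib.blake2b(content, digest_size=32).digest(),  # assumed size
            len(content),
            destination.encode("utf-8"),
        )
        for destination, content in files.items()
    )
    manifest = []
    previous = None
    for digest, size, destination in rows:
        manifest.append((digest, size, destination, previous == (digest, size)))
        previous = (digest, size)
    return manifest
```

Because identical content always ends up adjacent, no lookup table from hash to earlier entry is required while streaming; the unpacker only ever looks one entry back.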
I’ve written up the basic idea for an archive format as well, here:
(Per the warning in that repo’s README, this is (1) paperware, (2) not at all “final” in terms of how thought-out it is, and (3) currently focuses primarily on the semantics of the archive, rather than how it’s actually encoded. Please push back on everything about this design!)
Edit: I should also note that war is not intended to be the “real” name, it’s just a placeholder.
I realize I forgot to comment on this bit. This isn’t to try and prevent it from being valid as multiple different types of file, this one is to limit the potential to need to later iterate further on the archive format, and if such a need arises, it’s versioned.
For example, in my current working draft, distribution metadata is just length prefixed bytes. It’s up to resolvers to then parse the metadata. Changes to distribution metadata (such as new keys) don’t break the archive format.
I also have this, which has been my playground for creating and trying things as I have time, to help me further fine-tune what I do and do not like in a new archive format. Feel free to use pieces or create discussions. It is a playground for me to ideate, so it’s fair game for criticism and stealing ideas from.
I’d like to +1 this use case, packers should be permitted to put things first, and the format should have enough metadata early to figure out if the optimistic-read got all the files you need.
I see “the archive format” as separate from “python using the archive format” – the latter may have additional restrictions (like “.dist-info dir(s) come first”, or “no archives over 100MB” or “no compression ratios over 3000:1”) that warehouse and packers are welcome to tack on, but the core format shouldn’t care.
I’ve read through the various ideas and can kind of squint at the common themes – these are the remaining ones that don’t feel settled enough to make a combined draft (pre-coffee, please excuse brevity and typos):
Reproducibility - what does this gain us? We’d have to give up a lot to get there (decisions made in stone, and probably not reasonable for the compressed data)
CAS - Can anyone interested in this say more (is this for saving local disk? download time? CDN storage?). Is there an example other than vendoring giant native libraries that this is a win that wouldn’t better be solved by putting the duplicated item in its own project? I have bad memories from backups with hardlinks where there are surprising scaling limits like max number of links for “the” empty file.
Multiple compression algo choices - what’s the argument for supporting deflate (and store)? We would still require unpackers to support zstd, is this just for faster archive creation when testing locally? I’d have to benchmark, but would expect zstd level 1 to compress faster than you can write to a consumer SSD.
We’ve spent a lot of time on “how” to store symlinks, but not the “why”. I’ve taken this as a requirement from a discussion last year with @emmatyping but think this needs some more words about the problem it’s solving (for example, a couple of the drafts can’t symlink files outside their own archives as a security limitation, but I don’t know if that’s still useful). Target FS support is also varied, and I think we need to give clear guidance about what fallback is allowed or we end up with zip-bomb behavior on FAT32.
The one use case I know of for symlinks in wheels is editable installs. Some people want to be able to create a wheel that contains a set of symlinks back to the project source directory, so that what’s installed is all “live” versions of the files the user is editing. That’s a nasty case, because it relies on being able to link to arbitrary filesystem locations. However, it’s also well constrained, because it’s only needed for wheels built using the build_editable hook - such wheels are explicitly not publishable. Also, we’ve managed for some time now without needing this form of editable install, so it’s not clear to me at this point if it’s still a real need (nobody has ever cared enough about it to put together a proposal for symlink support in the current wheel format).
I believe the way Linux creates versioned .so files involves symlinks, but those are just links between two files in the same directory, so they should be fairly safe.
This is rather difficult to do robustly, as there is no OS API to check this. CPython doesn’t do this currently for ZIP.
uncompressed_size_hint
Minor nit but I think this should contain the exact uncompressed size, right? “hint” implies that it is a guesstimate, which we don’t want.
currently focuses primarily on the semantics of the archive, rather than how it’s actually encoded
I think this is actually one of the hardest parts of defining a new format. I don’t really want to ask you to go do this but I also think the hardest and perhaps most important part of defining a new format is the encoding of the data. There are a number of gotchas such as ensuring a null byte in the header magic that people have learned which I worry we’d miss if we try to design a new format.
Separately, I also want to reiterate that a custom format makes it significantly harder to adopt for users. We have already seen a number of index providers not implement new features in the simple index. If we require those vendors to implement interactions with a custom file format (e.g. JFrog’s Artifactory is implemented in Java, so I imagine they’d need their own implementation), then I fear that format won’t see uptake by those vendors.
I had some other things come up this week that demanded my free time, but I did manage to come to a few conclusions about a custom format for Python by looking into actual outcomes on sizes of various popular packages, and by running into little implementation details.
We can’t allow arbitrary zstd.
We have to pick specific options. This prevents some differentials, and ensures that there is a guarantee that valid zstd compressed archives have specific memory use associated with streaming decompression.
These are the options that seem to best fit popular packages when optimizing for the consumer of the distribution. They were picked to include choices with lower memory use for packages where long-distance matching isn't a major gain, or that expect to be installed on platforms with less available memory.
This should come with recommendations to not use the option corresponding with the most memory use for packages under a certain size.
We should be compatible with multiformat CIDv1 hashes
There are potential gains to using something else[1], but they don’t appear to be of a large enough magnitude to justify not building on top of something where there is some level of standardization already in use.
We don’t have to bake in anything else, but providing this may allow various optimizations for installers that keep their own cache.
However, if we do this, the only suitable choice of hash while remaining portable and backportable as a pure Python package to supported Python versions is blake2b, with a 64-bit (or larger, if someone can argue a reason to) digest.
We don’t need to support deflate, xz, etc.
None of these are guaranteed to be available in Python, and they perform worse than zstd. We should keep an option for just storing without compression for packages meant to bootstrap, backport, and other cases where a compression option may not be available for use, but supporting more compression options that we can’t show are worth it just increases the complexity, the potential for not immediately noticed differentials, and the downstream dependencies required for tools written in other languages.
I’m still looking into potential performance differences in allowing per-file vs full block compression. I would suggest disallowing (or rather, not supporting) per-file initially based on the data I currently have.
I’m partial to my own directory hashing solution, but the things it offers aren’t enough of a gain here. ↩︎
Yeah. I’ve been experimenting with writing a tar parser with ‘safe’ defaults as well, and I’ve come to the conclusion that this is the wrong layer to perform these kinds of checks at – the archive format should probably reject things like control codes in path components, but actual reserved names are a matter of OS/FS semantics that would probably be an exercise in frustration to fully capture.
I’m inclined to remove this and limit to just the “sane subset of Unicode” requirement, perhaps along with a normalization requirement. I’m curious if @thatch has opinions about requiring one of the normal forms from his work
The intent behind “hint” was to convey that it’s essentially an untrusted field, i.e. that a parser can’t safely malloc(uncompressed_size_hint) and blindly stream the compressed entry into the resulting buffer. But I agree it also implies a guesstimate, which is inaccurate.
Maybe just uncompressed_size, with the requirement that the decoder must cross-check the index’s claimed uncompressed size after each decompression step? Or maybe even that is too strict, I don’t have a strong intuition there
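One way to treat the declared size as untrusted but still exact is to bound the decompressor's output by the claim and then verify it. The sketch below uses the standard library's `zlib` purely as a stand-in for whatever codec the format ends up using; the function name and the `limit` policy parameter are hypothetical.

```python
import zlib


def decompress_checked(compressed: bytes, claimed_size: int, limit: int = 1 << 30) -> bytes:
    """Decompress without trusting the declared size beyond a policy limit."""
    if not 0 <= claimed_size <= limit:
        raise ValueError("claimed uncompressed size outside policy limit")
    decoder = zlib.decompressobj()
    # Allow one byte more than claimed so that an overrun is detectable
    # without ever producing much more than the declared size.
    out = decoder.decompress(compressed, claimed_size + 1)
    if len(out) > claimed_size:
        raise ValueError("entry is larger than its declared size")
    if not decoder.eof:
        raise ValueError("compressed stream is truncated")
    if decoder.unused_data:
        raise ValueError("trailing data after compressed stream")
    if len(out) != claimed_size:
        raise ValueError("entry is smaller than its declared size")
    return out
```

A chunked streaming decoder could apply the same overrun check after each step; the single bounded call here just keeps the example short.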
That’s fair. My original thought was to do this with one of the popular IDLs, since they (1) have mature pre-existing tooling and (2) don’t require me or anyone else to innovate a specific binary encoding. I’m not a huge fan of protobufs for unrelated reasons, but something like that doesn’t seem inherently poorly suited for the task[1].
(As part of avoiding the “format wrapped in a format” problem, I’d imagine this would be conveyed as (header, encoded-payload) i.e. the header would be its own small format, and then encoded-payload would be the IDL’s encoding’s format.)
I’m worried about this too, but I think the situation is slightly different: we see third-party indices refuse to adopt new index features because they can get away with it, but they can’t get away with rejecting new metadata versions, etc. because that would actually break users who roll onto new versions of their dependencies. In other words the economics are on our side in this case, similarly to how the economics favored wheel adoption[2] (and will favor wheelnext).
(With that said, any actual adoption story here needs real thought, since the end user experience will also be very disruptive by default. I freely admit I don’t have a great solution there yet!)
I don’t know the compression side well enough to have a strong opinion here, but I’ll take it on face value that this is true!
If so, I think we can borrow a trick from the TLS world and avoid parametrization by fully specifying the parameters in the “format” itself. For example, instead of a STORE_ZSTD or similar flag indicating that the member is zstd compressed, we could have STORE_ZSTD_PYTHON_PROFILE_YYYY_MM_DD where the “Python profile” is defined as matching the options you’ve noted.
(This wouldn’t change the actual value, it just makes it harder for an implementer of the format to ignore the requirements.)
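A sketch of what "the parameters live in the identifier" could look like in an implementation. Everything concrete here is a placeholder: the date in the member name, the numeric tag values, and the `level`/`window_log` numbers are invented for illustration and are not the options referred to above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass(frozen=True)
class ZstdProfile:
    """Fully pinned compression parameters (placeholder fields)."""
    level: int
    window_log: int


class Storage(Enum):
    STORE = 0
    # The date is a placeholder; a real profile name would carry the date
    # on which its parameter set was fixed.
    STORE_ZSTD_PYTHON_PROFILE_2000_01_01 = 1


PROFILES = {
    Storage.STORE_ZSTD_PYTHON_PROFILE_2000_01_01: ZstdProfile(level=19, window_log=23),
}


def parameters_for(tag: int) -> Optional[ZstdProfile]:
    """Map an on-disk tag to its pinned parameters; unknown tags raise ValueError."""
    storage = Storage(tag)
    return PROFILES.get(storage)  # None means "stored without compression"
```

There is deliberately no code path that accepts caller-supplied zstd settings: a new parameter set means a new named profile, and readers that don't know it fail with an explicit error.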
I think there’s some benefit to supporting deflate in terms of supporting the current matrix of supported Python versions, but that’s not a hill I’ll die on. I agree fully about not supporting xz (lzma, bz2, etc.) though, and I also agree with @jamestwebber about being able to provision other compression schemes in the future.
I wouldn’t pick protobufs in this specific case, though, since the Python support story for them is not great. I also don’t think we need all of the weird nullability/option/default semantics that protobufs have evolved over the years. ↩︎
Wheel adoption took a long time, but I think that’s more attributable to larger modernization problems in Python packaging at the time than to index provider resistance. ↩︎
opinions about requiring one of the normal forms from his work
From the security side of things, I don’t see an inherent issue with allowing un-normalized Unicode. I don’t think you could require normalization in the file format unless you also specify the version of Unicode it was normalized with, or ban unassigned codepoints. Neither of those seems great, and my draft explicitly says “No particular normalization form is required”.
the archive format should probably reject things like control codes in path components, but actual reserved names are a matter of OS/FS semantics that would probably be an exercise in frustration to fully capture
Yes, but to be clear these are not about making it so you can use the names unescaped in a string command line, or ensure that there are never filesystem collisions or lookalikes. They are about moving the error reporting to when the archive is made because that’s being done by the person that ought to do something differently.
I think the core set of DOS device names dates back to at least DOS 3.3 from 40 years ago, and is enumerable with a simple regex. I’d be in favor of preventing their use, not for security [we get that with CreateFile flags], but because I think it’s a better user experience to do so.
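For reference, a regex of the kind described might look like the sketch below. It covers the classic device names (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9), with or without an extension and ignoring case; treating that as the complete set is an assumption, since Windows documentation lists further variants, and the helper name is hypothetical.

```python
import re

# Classic DOS device names, optionally followed by an extension ("NUL.txt").
DOS_DEVICE = re.compile(
    r"(?:CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(?:\..*)?",
    re.IGNORECASE,
)


def is_dos_device_name(component: str) -> bool:
    """Return True if a single path component collides with a device name."""
    return DOS_DEVICE.fullmatch(component) is not None
```

A packer could run this over every path component and fail the build with a clear message, which is the "error at archive creation time" experience argued for above.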
A few posts up I mentioned the format vs our use – I’m perfectly fine with our use being more restrictive than what’s allowed in the archive itself, as long as the archive is specific about what you do if you find it, there are tests, etc. My vote is to allow device names in the archive format, but disallow them at the user-friendly layer (with a bypass allowed).