As a way to properly fix CVE-2007-4559, I propose adding a filter argument to tarfile extraction methods, and, after a deprecation period, limiting tarfile’s more dangerous features unless the user specifically asks for them.
The previous discussion was positive in general.
I’m not too happy with the interaction with shutil.unpack_archive and thus zip – basically, with my proposal you’ll either get DeprecationWarning or need to detect and special-case tar archives. I don’t see a good way to solve this without adding filters there, which would need a thorough understanding of zipfile, which would probably mess up my plans for 3.12 :(
Major changes since the previous discussion:
The default is no longer a filter function, but a special default value. The default is to warn once per extraction, not per member. (Turns out that many members are changed in some way, e.g. mode is ignored for all directories, so warning on changed members only isn’t very useful.)
Mention that code that relies on TarInfo object identity may break – IMO it’s a small concern in practice, but worth noting
Adjusted path filter rules to better work across platforms: I based them on *nix-based tar, which e.g. assumes absolute paths start with /, which isn’t the case on Windows.
The default can be set via a class attribute (and as a function, so monkeypatchers will probably need staticmethod)
Filter functions take the destination path, so they don’t match the addfile filter API
The filter that keeps benign *nix-specific (UNIX-ish/POSIX) details is still named tar, after too much internal bikeshedding
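The class-attribute default mentioned above could be set globally like this sketch (assuming a Python build with the proposed filters; the `hasattr` check on `tarfile.data_filter` is the feature probe for filter support):

```python
import tarfile

# Sketch: set the default extraction filter globally via the class
# attribute. Because the attribute holds a plain function, assigning
# one requires wrapping it in staticmethod() so attribute lookup
# doesn't turn it into a bound method.
if hasattr(tarfile, "data_filter"):  # are PEP 706 filters available?
    tarfile.TarFile.extraction_filter = staticmethod(tarfile.data_filter)
```

After this, every `extract()`/`extractall()` call without an explicit `filter=` argument would use the `'data'` filter.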
Can we justify making unpack_archive use the more secure default straight away?
Or alternatively, could there be a “warn on blocked member” option? Primarily for cases where tools are extracting on behalf of a user rather than some internal operation, which I would guess make up a significant number of shutil uses. I’d rather add a logger argument to unpack_archive (mirroring make_archive) than a filter argument.
Maybe. I don’t feel qualified to make the call, and don’t know who would be.
I guess there are two reasons to use unpack_archive. Either you don’t know much about the archive (not even the format), in which case you absolutely want the secure, no-surprises option. But it’s also the easier-to-use way to extract a known and trusted archive, compared to tarfile.open(tarpath).extractall(destpath) – and then it would just break compatibility.
I guess that would mean passing errorlevel=0.
FWIW, tarfile always logs to stderr, and it’s fairly quiet about skipping un-extractable files. It predates both PEP 282 and PEP 20’s note on silent errors :).
FWIW, I don’t think the proposed unpack_archive behavior is terrible. You get a DeprecationWarning for two releases, telling you that you should look at the code – which you should, especially if you’re someone paying attention to warnings, since it’s likely to be insecure. You can then do what’s necessary, in 2-3 lines, e.g.:
add filter (if you don’t care about unpatched/old CPython and non-tar archives)
filter the warning out (if your archives are compatible with the new default, and not dangerous with the old one)
or just ignore the warning (perfect for a quick personal script)
add filter conditionally depending on archive type (and filter support, if you support unpatched/old CPython)
The more I look at zipfile, the more I think I don’t want to add named filters to it now.
Here’s a summary, which I’ve put in the PEP as a “Possible Further Work” section (i.e. let’s discuss, but IMO the PEP is good enough as it is):
For consistency, zipfile and shutil.unpack_archive() could gain support for a filter argument. However, this would require research that this PEP’s author can’t promise for Python 3.12.
Filters for zipfile would probably not help security. Zip is used primarily for cross-platform data bundles, and correspondingly, ZipFile.extract’s defaults are already similar to what a 'data' filter would do. A 'fully_trusted' filter, which would newly allow absolute paths and .. path components, might not be useful for much except a unified unpack_archive API.
Filters should be useful for use cases other than security, but those would usually need custom filter functions, and those would need an API that works with both TarInfo and ZipInfo. That is definitely out of scope of this PEP.
If only this PEP is implemented and nothing changes for zipfile, the effect for callers of unpack_archive is that the default for tar files is changing from 'fully_trusted' to the more appropriate 'data'. In the interim period, Python 3.12-3.13 will emit DeprecationWarning. That’s annoying, but there are several ways to handle it: e.g. add a filter argument conditionally, set TarFile.extraction_filter globally, or ignore/suppress the warning until Python 3.14.
Also, since many calls to unpack_archive are likely to be unsafe, there’s hope that the DeprecationWarning will often turn out to be a helpful hint to review affected code.
I’ve added another note to stateful filters, i.e.:
with StatefulFilter() as filter_func:
The need for stateful filters is a reason against allowing registration of custom filter names in addition to 'fully_trusted', 'tar' and 'data'.
With such a mechanism, API for (at least) set-up and tear-down would need to be set in stone.
I’ve removed the open question “How far should this be backported?”, which will need an answer but doesn’t need to be part of the proposal. The PEP does specify how to backport.
Those changes don’t touch the “meat” of the PEP, and it looks like further discussion won’t affect acceptance, so when the changes are live I’ll submit to the SC.
1: If an exception is raised by the filter to abort extraction, is cleanup performed or is that left to the caller? I’m assuming the easy implementation of leaving that to the caller. Explicitly specify that. You partially touch on this concept later in the FilterError section. (cleanup code brings its own potential security issues, so not claiming to implement that is a good thing IMNSHO)
2: Specify what happens when an unknown filter= string name is passed in. Presumably raise ValueError?
3a: on the specification of what a filter callable looks like, it is currently listed as
Is there a reason these must be positional-only parameters, as indicated by the /? I assume your implementation intends to only ever call it using positional args, as is natural. I wouldn’t specify it as a requirement.
I’d also add a note about the typeshed type signature for filter functions. For the typing community it is probably better to annotate the filter= argument as:
Rather than specifically restricting to TarInfo in the annotation, since we do expect universal filter functions to be written that use introspection and operate on other archive formats, such as an equivalent feature in zipfile in the future. Particularly in the context of what’d be passed into the archive-format-agnostic shutil API.
Overall I like the PEP; my comments are mostly nitpicks. I agree that getting this in for tarfile even before a zipfile equivalent is ready is fine. A zipfile variant can build on this and likely won’t be its own PEP as it’ll be mirroring the API concepts.
The FilterError note has all there is to say. I’ll copy it up:
Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.
It’s not a change from the status quo, though before this PEP, such errors were much rarer.
Sure, ValueError sounds good.
Yes, I intended to allow any callable with a compatible signature, like def filter(tarinfo, dest_path='.', extra_data=None):.
I’ll reword it as “a callable that can be called as filter(member: TarInfo, path: str) -> TarInfo | None”.
I’d rather leave that to typeshed experts and/or to the eventual PR that extends this to ZipFile.
If I had fully thought the generalization through, I’d include all of it in the PEP, not just the type annotation.