PEP 706 – Filter for tarfile.extractall

Hello,
As a way to properly fix CVE-2007-4559 , I propose adding filter argument to tarfile extraction methods, and, after a deprecation period, limiting tarfile’s more dangerous features unless the user specifically asks for them.

I’m not a tarfile expert (certainly I wasn’t one two months ago), so I’m progressing carefully, and I wrote a PEP to help make sure I’m not forgetting something: PEP 706 – Filter for tarfile.extractall | peps.python.org

The draft implementation should work now, but lacks docs and the tests only work on my machine.

The previous discussion was positive in general.
I’m not too happy with the interaction with shutil.unpack_archive and thus zip – basically, with my proposal you’ll either get DeprecationWarning or need to detect and special-case tar archives. I don’t see a good way to solve this without adding filters there, which would need a thorough understanding of zipfile, which would probably mess up my plans for 3.12 :‍(

Major changes since the previous discussion:

  • The default is no longer a filter function, but a special default value. The default is to warns once per extraction, not per member. (Turns out that many members are changed in some way, e.g. mode is ignored for all directories, so warning on changed members only isn’t very useful.)
  • Mention that code that relies on TarInfo object identity may break – IMO it’s a small problem concern in practice, but worth noting
  • Adjusted path filter rules to better work across platforms: I based them on *nix-based tar that e.g. assumes absolute paths start with /, which isn’t the case on Windows.
  • The default can be set via a class attribute (and as a function, so monkeypatchers will probably need staticmethod)
  • Filter functions take the destination path, so they don’t match the addfile filter API
  • The filter that keeps benign unix/*nix-/posix-specific details is still named tar, after too much internal bikeshedding

How does this look to you?

6 Likes

Can we justify making unpack_archive use the more secure default straight away?

Or alternatively, could there be a “warn on blocked member” option? Primarily for cases where tools are extracting on behalf of a user rather than some internal operation, which I would guess make up a significant number of shutil uses. I’d rather add a logger argument to unpack_archive (mirroring make_archive) than a filter argument.

Maybe. I don’t feel qualified to make the call, and don’t know who would be.
I guess there are two reasons to use unpack_archive. Either you don’t know much about the archive (not even the format), in which case you absolutely want the secure, no-surprises option. But it’s also the easier-to-use way to extract a known and trusted archive, compared to tarfile.open(tarpath).extractall(destpath) – and then it would just break compatibility.

I guess that would mean passing errorlevel=0.
FWIW, tarfile alwas logs to stderr, and it’s fairly quiet about skipping un-extractable files. It predates both PEP 282 and PEP 20’s note on silent errors :‍).

For me (Windows, Firefox), that link returns “Page not found”.

EDIT: 4 days later, link works

Weird, works for me. Please search for CVE-2007-4559 instead, that should get you the info.

FWIW, I don’t think the proposed unpack_archive behavior is terrible. You get a DeprecationWarning for two releases, telling you that you should look at the code – which you should, especially if you’re someone paying attention to warnings, since it’s likely to be insecure. You can then do what’s necessary, in 2-3 lines, e.g.:

  • add filter (if you don’t care about unpatched/old CPython and non-tar archives)
  • filter the warning out (if your archives are compatible with the new default, and not dangerous with the old one)
    • or just ignore the warning (perfect for a quick personal script)
  • add filter conditionally depending on archive type (and filter support, if you support unpatched/old CPython)

The more I look at zipfile, the more I think I don’t want to add named filters to it now.
Here’s a summary, which I’ve put in the PEP as a “Possible Further Work” section (i.e. let’s discuss, but IMO the PEP is good enough as it is):

For consistency, zipfile and shutil.unpack_archive() could gain support for a filter argument. However, this would require research that this PEP’s author can’t promise for Python 3.12.

Filters for zipfile would probably not help security. Zip is used primarily for cross-platform data bundles, and correspondingly, ZipFile.extract’s defaults are already similar to what a 'data' filter would do. A 'fully_trusted' filter, which would newly allow absolute paths and .. path components, might not be useful for much except a unified unpack_archive API.

Filters should be useful for use cases other than security, but those would usually need custom filter functions, and those would need API that works with both TarInfo and ZipInfo. That is definitely out of scope of this PEP.

If only this PEP is implemented and nothing changes for zipfile, the effect for callers of unpack_archive is that the default for tar files is changing from 'fully_trusted' to the more appropriate 'data'. In the interim period, Python 3.12-3.13 will emit DeprecationWarning. That’s annoying, but there are several ways to handle it: e.g. add a filter argument conditionally, set TarFile.extraction_filter globally, or ignore/suppress the warning until Python 3.14.

Also, since many calls to unpack_archive are likely to be unsafe, there’s hope that the DeprecationWarning will often turn out to be a helpful hint to review affected code,


I’ve added another note to stateful fiters, i.e.:

with StatefulFilter() as filter_func:
    my_tar.extract(path, filter=filter_func)

saying:

The need for stateful filters is a reason against allowing registration of custom filter names in addition to 'fully_trusted', 'tar' and 'data'.
With such a mechanism, API for (at least) set-up and tear-down would need to be set in stone.


I’ve removed the open question How far should this be backported?, which will need an answer but doesn’t need to be part of the proposal. The PEP does specify how to backport.


Those changes don’t touch the “meat” of the PEP, and it looks like further discussion won’t affect acceptance, so when the changes are live I’ll submit to the SC.

3 Likes

Questions/comments that arise as I read the PEP:

1: If an exception is raised by the filter to abort extraction, is cleanup performed or is that left to the caller? I’m assuming the easy implementation of leaving that to the caller. Explicitly specify that. You partially touch on this concept later in the FilterError section. (cleanup code brings its own potential security issues, so not claiming to implement that is a good thing IMNSHO)

2: Specify what happens when an unknown filter= string name is passed in. Presumably raise ValueError?

3a: on the specification of what a filter callable looks like, it is currently listed as

filter(/, member: TarInfo, path: str) -> TarInfo|None

Is there a reason these must be positional only parameters as indicated by the /, ? I assume your implementation intends to only ever call it using positional args as is natural. I wouldn’t specify it as a requirement.

3b:

I’d also add a note about the typeshed type signature for filter functions. For the typing community it is probably better to annotate the filter= argument as:

_ArchiveInfo = TypeVar("UnspecifiedArchiveTypeMemberInfo")
filter: Callable[[_ArchiveInfo, str], _ArchiveInfo|None]

Rather than specifically restricting to TarInfo in an annotation as we do expect universal filter functions to be written that use introspection and operate on other archive formats such as an equivalent feature in zipfile in the future. Particularly in the context of what’d be passed into the archive format agnostic shutil API.


Overall I like the pep, my comments are mostly nitpicks. I agree that getting this in for tarfile even before a zipfile equivalent is ready is fine. A zipfile variant can build on this and likely won’t be its own PEP as it’ll be mirroring the API concepts.

1 Like

The FilterError note has all there is to say. I’ll copy it up:

Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

It’s not a change from the status quo, though before this PEP, such errors are much more rare.

Sure, ValueError sounds good.

Yes, I intended to allow any callable with a compatible signature, like def filter(tarinfo, dest_path='.', extra_data=None):.
I’ll reword it as ”a callable that can be called as filter(/, member: TarInfo, path: str) -> TarInfo|None”.

I’d rather leave that to typeshed experts and/or to the eventual PR that extends this to ZipFile.
If I had fully thought the generalization through, I’d include all of it in the PEP, not just the type annotation.

1 Like

The Python Steering Council is happy to accept PEP 706 adding a filter to tarfile extraction methods. This was a very well written PEP that laid out a nice technical design.

personal hat: I can help review PRs if you’d like.

3 Likes

The feature is in now Python 3.8-3.12, either released or planned for the next security update.

Unfortunately, I failed to backport it to 3.7.

I use os.path.realpath to ensure files stay “in” the destination (see code here).
Unfortunately, Python 3.7’s os.path.realpath function does not resolve symbolic links on Windows (GH-54158). Teaching it to do so involved dozens of lines of platform-specific code and a fix-up commit, so it doesn’t seem like something I can backport to 3.7 (or worse, maintain separately in tarfile).
So, I don’t have a good way to prevent symlink attacks on Windows.

Limiting PEP-706 features to non-Windows platforms would be quite involved, I can’t realistically do it and test properly.

Backporting as-is would mean hasattr(tarfile, 'data_filter') would lie about data_filter protecting you from symlink attacks. Since I expect libraries to drop their home-grown security precautions if data_filter is available, this would be dangerous.

So, I cannot backport to 3.7 in good conscience.

Full disclosure: I will backport it to Python 3.6 for Red Hat, where Windows symlinks aren’t a concern. The 3.7 patch is available for any other redistributor.
I’m sorry I couldn’t do it for everyone.

2 Likes

The fact that symlinks don’t really work in 3.7 is a pretty good defence :wink: It requires admin privileges, as we didn’t support the “allow non-admins to create symlink” setting at that time.

Backporting with a platform-specific runtime warning seems fine to me, if you wanted to go that route. Or a platform-specific block on all symlinks. Chances are nobody using 3.7 has home-grown security precautions that are better than your implementation anyway, so they likely need to be warned.

Oh, I wouldn’t be that sure! Rejecting everything but regular files named [a-z]+ is pretty good security-wise, if it works for your use case. (Of course it’s an extreme case and the edge of a spectrum.)

IMO, making data_filter/tar_filter block all Windows symlinks on 3.7 would be too much of a behaviour difference between 3.7 and 3.8.
Note that when a symlink can’t be created (default on Windows), tarfile falls back to extracting a copy of the linked file (if it is in the archive). And we can’t really know whether symlink() will fail until we try. So, a block on all symlinks would block some benign archives for non-admin Windows users.

1 Like

That’s the default configuration, but an administrator can grant the symlink privilege to any user or group with just a few mouse clicks in the secpol snapin.

1 Like