Policies for tarfile.extractall, a.k.a. fixing CVE-2007-4559

Q: Can you describe the shape of changes you are intending to make in security fix backports to older versions of Python?

This post seems to focus on 3.12+ (good in the PEP-like sense), but we should also consider what shape it’ll take in older versions so that it is usable by the world and is not just Red Hat Python specific.

A general theme for security-fix backports to existing releases: no change in API surface, i.e. you can’t add a policy= parameter in a backport, because nobody would be able to write code that works on both an unpatched and a patched runtime stdlib (without difficult-to-maintain, costly API introspection hacks).

A workaround is to provide an alternative mechanism for changing the behavior that is a no-op on older, unpatched versions, such as setting a magic attribute that is otherwise unused by existing code and unlikely to be used by any subclasses: t = TarFile(...); t._policy = ..., for example. This API should be documented, but limited to legacy purposes and its use discouraged for anyone targeting only 3.12+, with the attribute documented as write-only (e.g. implemented as a property whose getter raises AttributeError).
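A minimal sketch of that write-only attribute idea (the class, `_policy`, and the internal storage name are illustrative, not an actual patch):

```python
class PatchedTarFile:
    """Illustrative stand-in for a backported tarfile.TarFile."""

    @property
    def _policy(self):
        # Write-only: reading behaves as if the attribute did not exist,
        # so introspection via hasattr() still reports it as absent.
        raise AttributeError('_policy')

    @_policy.setter
    def _policy(self, value):
        # Hypothetical internal storage consulted by the patched extractall().
        self._active_policy = value


t = PatchedTarFile()
t._policy = 'data'            # works on a patched runtime
print(hasattr(t, '_policy'))  # False: the getter raises AttributeError
```

On an unpatched runtime the same assignment simply creates an unused instance attribute, so it is a harmless no-op there.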

If you are allowed to describe why Red Hat seems to be treating this as a high priority issue to be resolved after decades, it’d be interesting to know that. But feel free to leave that story for later to avoid derailing the work with a digression. (not a complaint, I’m glad to see someone actually care!)

Thank you for the comments!
Looks like the next step is to turn this into a PEP.

Good point. I’m not sure if tar-specific options are OK. Presumably you’re using unpack_archive because you don’t care that much about the format.
Or you’re using unpack_archive because the API is more ergonomic. Maybe tarfile could get a top-level extract function that takes tar-specific options?

OK, I’ll switch to filter. After all, it’s about cleaning up the archive in general, rather than just security.

Yup, functions will be allowed. And I’ll remove TarInfo.apply_policy, which just calls the given function. (An earlier draft needed policies to do a bit more than what a filter function can, but it turns out a member-wise filter is enough if we don’t provide “dry run” functionality.)
I’ll keep the string names though: they’re much nicer to use (I sigh every time I write subprocess.PIPE), and better generalize to other formats.
And no, this is not about site-specific configuration – that could simply keep a dict or do getattr(tarfile, name).

I don’t see much harm in keeping it forever. Getting DeprecationWarning about something that gives you DeprecationWarning sounds silly.
That said, I realize it would make sense to hide it more. Perhaps with tarfile.legacy_deprecation_policy, without a string 'legacy_deprecation' alias.

Good idea! I won’t have time to work on that myself, unfortunately.
I think a good first step would be exposing the same interface from “member info” objects, possibly os.stat_result whose st_ prefix won’t collide with pre-existing attributes. But there’s a bunch of complexity down this road: for example, mode could use some way to model tri-state logic (set/clear/honor) for each bit, numeric/named users/groups are a pain … and I expect Windows-specific features are just as “fun” as the tar/UNIX ones.

A simpler way would be to use common policy filter names; that should take care of the simple cases.

Good point, thanks!
Even better: let’s use a proper public TarFile.extraction_filter attribute, instead of tarfile.DEFAULT_POLICY.

Umm, there’s a customer request. I’m trying to keep myself on the engineering side of things, so I don’t know too much more. If I did, I probably couldn’t tell you; customer confidentiality is a big thing.
Speaking of this, let me clarify a few things:

  • Red Hat can and will use vendor patches for anything customers want but isn’t appropriate for CPython. I only propose the changes that (IMHO) will benefit the wider community, and aren’t system-specific.
  • The RHEL integration part will be public when released. It’ll probably look like what we did for CVE-2021-23336 (patch).
  • I’d be happy to draw the CPython/vendor boundary somewhere else.

This raises a question that I had when I read the proposal. Tarfiles do get created on Windows, and they are somewhat weird, in that they try to make Windows-specific details conform to the (historically Unix-based) tar format. I don’t know if that impacts security considerations (although I imagine that anything which works based on user IDs would be confused as heck by a UID fabricated by a Windows tar creation tool - the tarfile module creates everything with UID 0, for example) but maybe it should be clarified?

My guess is that when extracting on Windows, the “data” policy (filter - but in this context “policy” reads better IMO) is the only one that really makes sense. When extracting tarfiles created on Windows, on Unix systems, there might be more issues to consider - but again “use the data policy unless you know the file comes from a source that is compatible with yours” might be sufficient.

I agree. 'data' is good for both cross-platform (lowest common denominator) archives, and cross-platform destination filesystems. An archive models a filesystem, after all.

My proposal is for skipping extraction of some pieces of metadata, and for modifying the input data. I’m staying far away from the actual mechanics, mostly because I can’t really design and test the right behavior on Windows.

As for creating tar files from a Windows filesystem, that’s entirely out of scope here. Extraction filters are meant to deal with all kinds of weird tarballs.

But yes, it would be good to clarify that the tar filter/policy will preserve the UNIX-specific features. Perhaps unix would even be a better name for it?


This might be reasonably implemented by passing an exists callable into the filter. Then a dry-run operation can shim that while a real extract can just provide the real function.

But if it’s not really essential, keep things simple.
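A minimal sketch of what that shim could look like (all names here are hypothetical; this is just the suggestion, not an actual API):

```python
import os

# Dry-run replacement for an `exists` callable that the extraction
# machinery could pass into a filter: it consults a set of paths the
# dry run "would have created", in addition to the real filesystem.
def make_dry_run_exists():
    planned = set()

    def exists(path):
        return path in planned or os.path.exists(path)

    def record(path):
        # Called instead of actually writing; remembers the path.
        planned.add(path)

    return exists, record


exists, record = make_dry_run_exists()
record('dest/some_dir')
print(exists('dest/some_dir'))  # True, although nothing was written to disk
```

A real extraction would simply pass `os.path.exists`. (The reply below points out why this still falls short once symlinks get involved.)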

Nope, you’d also need to track an evolving zoo of symlinks, see _check_member or _get_effective_name in Lars’s patch.
I have great respect for Lars; I don’t think dry-run code can be much simpler.
Myself, I wouldn’t want to maintain code like that, claim that it’s safe, or design ways for users to hook into it.


I think we should either make an exception to the general theme, or not backport at all.
Assuming the attribute is extraction_filter (but it would work the same if it was hidden under _policy as you suggest), and that the function that implements a “safe” policy will be called tarfile.data_filter():

  • my_tarfile.extraction_filter = getattr(tarfile, 'data_filter', lambda x: x) says “be secure if possible, silently allow everything if not”, and it’s verbose enough to imply the author really means that.
  • my_tarfile.extraction_filter = (lambda x: x) says “trust this archive” and works everywhere.
  • There’s no good way to “always be secure”, short of using a third-party backport from PyPI or writing one yourself.
  • my_tarfile.extraction_filter = 'data' would make older/unpatched versions of Python ignore an explicit security-related request, which is terrible. So, whatever else I’ll do, the attribute will not accept the string shortcuts.
  • my_tarfile.extraction_filter = tarfile.data_filter would only work on patched versions of Python. That just shows that exposing the filter functions is a change in API surface.
    • If we allow that, why not allow extractall(filter=tarfile.data_filter) too?
    • If we don’t expose the function, there’s not much left from the feature to backport. (Maybe the warning by default, but we don’t add new warnings in patch releases.)

It would be nice to allow third-party code to write:

if hasattr(tarfile, 'data_filter'):  # (or some other feature check)
    ...  # new way of doing things
else:
    # remove this after a deprecation period
    warn('Extracting may be unsafe, consider updating Python')

rather than

if hasattr(tarfile, 'data_filter'):
    my_tarfile.extraction_filter = 'data'
    # XXX: reset the filter? Do we need to be thread-safe here?
else:
    # remove this after a deprecation period
    warn('Extracting may be unsafe, consider updating Python')

Nice analysis. These examples of how to write “secure when possible, warn when not” code that works across all versions are an ideal type of thing to add to a tarfile documentation update that goes along with the security fix.

Agreed that a magic write-only attribute API for the backports doesn’t make sense in this case, given the hasattr test and the desire to encourage people’s code to log when something may be hazardous.


The behavior should possibly be as close to zipfile as possible, to prevent confusion and make them as interchangeable as possible for shutil. I am mostly uncertain about the chosen behavior for user and group ownership… zipfile says that it tries to preserve as much information as possible, except for stripping some paths like .. and absolute paths. IMO the data policy feels too extreme as a default; I would like a mix between tar and data which might ignore pipes and other special file-like entries but preserves ownership if possible.

IMO, data is a pretty good approximation of zipfile behavior, simply because zipfile doesn’t support *nix-specific features. (At least when extracting. For example, as far as I can see from the source, permissions are stashed in the archive, as external_attributes, but ignored when unpacking.)
data does honor the executable permission bit, unlike zipfile. That part is inspired by Git, FWIW.

As for ownership info, note that tarfile itself does not set it unless you are the *nix superuser.
It’s quite a specialized piece of metadata. As far as I know, zipfile doesn’t support it at all.


The TarFile initialiser accepts a tarinfo argument, which sets the type of tarinfo objects (it defaults to the TarInfo class).

Q: Could we implement the policies described above in subclasses of TarInfo? It might be easier to retrofit. E.g.:

import tarfile
tarball = tarfile.TarFile.open(..., tarinfo=tarfile.FullyTrustedTarInfo)

Thanks for the consolidation of ideas. A few small suggestions:

Please clarify in the documentation when the policy object is called. Am I right to assume that during a call of extractall() the policy is called exactly once per tarfile member, with each call made in the order the members appear in the tarfile? This is important for users who want to implement stateful policies such as TrustPreExistingSymlinksPolicy(). (Users would create a new policy object at each use for stateful policies as they cannot be shared across different archives. One could add a reset() function to tell the policy that the next call is for a new archive but this would not be enough when working with multiple tar files concurrently.)

BTW: I think I see a way to implement TrustPreExistingSymlinksPolicy without having to store the tar file’s symlinks in memory: (1) Read the pre-existing symlinks into memory. (2) Delete the symlinks. (3) Extract the tarfile but change the name attribute of the TarInfo objects according to the memorised symlinks and check the path as if the target path was the destination of the symlink. (4) Restore the memorised symlinks. This way, memory exhaustion cannot occur the first time the attacker gets the user to extract a malicious archive but will instead occur on the next extraction to the same target path, unless either the filesystem or the tarfile module stopped the creation of an excessive number of symlinks.

This leads to another suggestion for additional checks users may want to perform: limit the number of symlinks created. The documentation should mention that checks can be implemented by sub-classing one of the existing policy classes. Counting the number of symlinks, folders and files created and the number of bytes written to files are also good examples of state a policy object may want to keep.

Re unification with zip, TarInfo and policies would have to be renamed and placed into a separate module that both tarfile and zipfile import. It can also make sense to have a global policy for filesystem access, e.g. with open() and os.link(), to restrict newly created files to a target folder or to apply other constraints such as on the number of files and bytes written. However, one could say that such protection should come from a chroot environment and filesystem quota. This could also be applied to the initial tarfile issue: instead of inspecting member.name ourselves, we could delegate the tarfile extraction to a new process inside a chroot environment and let the VFS handle protection of everything outside the target path.


An issue is that TarInfo is used for everything – adding files, introspecting them, and extraction. The filters here only make sense for extraction. If I didn’t want to change the default, implementing this for TarInfo could work, but then users would need to carefully select the right TarInfo for what they want to do with the archive.

Thanks for mentioning stateful policies.
Your assumption is right. Using a new policy for each extraction sounds best. (I wouldn’t bother with reset() – throwing the policy away and creating a new one is the best kind of reset. If there’s anything to keep between runs, whatever you create the policy from – factory, parent, module, etc. – could keep it.)

TrustPreExistingSymlinksPolicy and deeper integration into the system (chroots/quotas) are exactly the kinds of things I’d like users to create. I hope the proposed API is enough, but I’ll probably need to draft an example to be sure.

Like, one thing that became face-palmingly obvious once I started an implementation is that filter needs a destination path argument to do the checks it needs. I’d like to avoid that kind of mistake for stuff like TrustPreExistingSymlinksPolicy.


I think this is a case where composition is better than inheritance.

This is now PEP 706, discussed here.


Post-acceptance changes!

Eagle-eyed @hroncok reviewed the PEP and implementation, and found ambiguity in error handling. I intend to add the following summary & resolution to the PEP (with SC approval):

Errorlevel, and fatal/non-fatal errors

Currently, TarFile has an errorlevel argument/attribute, which specifies how errors are handled:

  • With errorlevel=0, documentation says that “all errors are ignored when using extract and extractall”. The code only ignores non-fatal and fatal errors (see below), so, for example, you still get TypeError if you pass None as the destination path.

  • With errorlevel=1 (the default), all non-fatal errors are ignored. (They may be logged to sys.stderr by setting the debug argument/attribute.) Which errors are non-fatal is not defined in documentation, but code treats ExtractError as such. Specifically, it’s these issues:

    • “unable to resolve link inside archive” (raised on systems that do not support symlinks)
    • “fifo/special devices not supported by system” (not used for failures if the system supports these, e.g. for a PermissionError)
    • “could not change owner/mode/modification time”

    Note that, for example, file name too long or out of disk space don’t qualify. The non-fatal errors are not very likely to appear on a Unix-like system.

  • With errorlevel=2, all errors are raised, including fatal ones. Which errors are fatal is, again, not defined; in practice it’s OSError.

A filter refusing to extract a member does not fit neatly into the fatal/non-fatal categories.

  • This PEP does not change existing behavior. (Ideas for improvements are welcome in Discourse topic 25970.)
  • When a filter refuses to extract a member, the error should not pass silently by default.

To satisfy this, FilterError will be considered a fatal error, that is, it’ll be ignored only with errorlevel=0.

Users that want to ignore FilterError but not other fatal errors should create a custom filter function, and call another filter in a try block.

FWIW, I think errorlevel didn’t age well. If I designed it today, it would look like this: ExceptionGroup for TarFile.extractall

Additionally, the PEP currently says the data filter will:

  • Remove the group & other read permission (S_IRGRP|S_IROTH) if the owner doesn’t have it (stat.S_IRUSR).

This never happens, since the data filter sets read permissions for the owner. I found this in the implementation but forgot to update the draft PEP, and then copied the PEP to the docs.
I’ll remove the point. Thanks @hroncok again for finding it.


to add the following summary & resolution to the PEP (with SC approval):

Go ahead. No objection from the SC.

Thank you! That was fast :)

The 3.12 implementation is merged. I’ll start on the 3.11 backport, and update the PEP after its docs are rebuilt, so I can link them in the canonical-doc admonition.