Policies for tarfile.extractall, a.k.a. fixing CVE-2007-4559

Thanks for the consolidation of ideas. A few small suggestions:

Please clarify in the documentation when the policy object is called. Do I assume right that during a call of extractall() the policy is called exactly once per tarfile member and each call is made in the order the members appear in the tarfile? This is important for users who want to implement stateful policies such as TrustPreExistingSymlinksPolicy(). (Users would create a new policy object at each use for stateful policies as they cannot be shared across different archives. One could add a reset() function to tell the policy that the next call is for a new archive but this would not be enough when working with multiple tar files concurrently.)

BTW: I think I see a way to implement TrustPreExistingSymlinksPolicy without having to store the tar file’s symlinks in memory: (1) Read the pre-existing symlinks into memory. (2) Delete the symlinks. (3) Extract the tarfile but change the name attribute of the TarInfo objects according to the memorised symlinks and check the path as if the target path was the destination of the symlink. (4) Restore the memorised symlinks. This way, memory exhaustion cannot occur the first time the attacker gets the user to extract a malicious archive but will instead occur on the next extraction to the same target path, unless either the filesystem or the tarfile module stopped the creation of an excessive number of symlinks.

This leads to another suggestion for additional checks users may want to perform: Limit the number of symlinks created. The documentation should mention that checks can be implemented by sub-classing one of the exsiting policy classes. Counting the number of symlinks, folders and files created and the number of bytes written to files are also good examples of state a policy object may want to keep.

Re unification with zip, TarInfo and policies would have to be renamed and placed into a separate module that both tarfile and zip import. It can also make sense to have a global policy for filesystem access, e.g. with open() and os.link(), to restrict newly created files to a target folder or to apply other constraints such as on the number of files and bytes written. However, one could say that such protection should come from a chroot environment and filesystem quota. This though could also be applied to the initial tarfile issue: Instead of inspecting member.name ourselves, we could delegate the tarfile extraction to a new process inside a chroot environment and let the VFS handle protection of everything outside the target path.

1 Like

An issue is that TarInfo is used for everything – adding files, introspecting them, and extraction. The filters here only make sense for extraction. If I didn’t want to change the default, implementing this for TarInfo could work, but then users would need to carefully select the right TarInfo for what they want to do with the archive.

Thanks for mentioning stateful policies.
Your assumption is right. Using a new policy for each extraction sounds best. (I wouldn’t bother with reset() – throwing the policy away and creating a new one is the best kind of reset. If there’s anyhing to keep between runs, whatever you create the policy from – factory, parent, module, etc. – could keep it.)

TrustPreExistingSymlinksPolicy and deeper integration into the system (chroots/quotas) are exactly the kinds of things I’d like users to create. I hope the proposed API is enough, but I’ll probably need to draft an example to be sure.

Like, one thing that became face-palmingly obvious once I started an implementation is that filter needs a destination path argument to do the checks it needs. I’d like to avoid that kind of mistake for stuff like TrustPreExistingSymlinksPolicy.


I think this a case that compositions is better the inheritance.

This is now PEP 706, discussed here.

1 Like

Post-acceptance changes!

Eagle-eyed @hroncok reviewed the PEP and implementation, and found ambiguity in error handling. I intend to add the following summary & resolution to the PEP (with SC approval):

Errorlevel, and fatal/non-fatal errors

Currently, TarFile has an errorlevel argument/attribute, which specifies how errors are handled:

  • With errorlevel=0, documentation says that “all errors are ignored when using extract and extractall”. The code only ignores non-fatal and fatal errors (see below), so, for example, you still get TypeError if you pass None as the destination path.

  • With errorlevel=1 (the default), all non-fatal errors are ignored. (They may be logged to sys.stderr by setting the degug argument/attribute.) Which errors are non-fatal is not defined in documentation, but code treats ExtractionError as such. Specifically, it’s these issues:

    • “unable to resolve link inside archive” (raised on systems that do not support symlinks)
    • “fifo/special devices not supported by system” (not used for failures if the system supports these, e.g. for a PermissionError)
    • “could not change owner/mode/modification time”

    Note that, for example, file name too long or out of disk space don’t qualify. The non-fatal errors are not very likely to appear on a Unix-like system.

  • With errorlevel=2, all errors are raised, including fatal ones. Which errors are fatal is, again, not defined; in practice it’s OSError.

A filter refusing to extract a member does not fit neatly into the fatal/non-fatal categories.

  • This PEP does not change existing behavior. (Ideas for improvements are welcome in Discourse topic 25970.)
  • When a filter refuses to extract a member, the error should not pass silently by default.

To satisfy this, FilterError will be considered a fatal error, that is, it’ll be ignored only with errorlevel=0.

Users that want to ignore FilterError but not other fatal errors should create a custom filter function, and call another filter in a try block.

FWIW, I think errorlevel didn’t age well. If I designed it today, it would look like this: ExceptionGroup for TarFile.extractall

Additionally, the PEP currently says the data filter will:

  • Remove the group & other read permission (S_IRGRP|S_IROTH) if the owner doesn’t have it (stat.S_IRUSR).

This never happens, since the data filter sets read permissions for the owner. I found this in the implementation but forgot to update the draft PEP, and then copied the PEP to the docs.
I’ll remove the point. Thanks @hroncok again for finding it.


to add the following summary & resolution to the PEP (with SC approval):

Go ahead. No objection from the SC.

Thank you! That was fast :‍)

The 3.12 implementation is merged. I’ll start on the 3.11 backport, and update the PEP after its docs are rebuilt, so I can link them in the canonical-doc admonition.