Policies for tarfile.extractall, a.k.a. fixing CVE-2007-4559

Hello,
You might have heard of CVE-2007-4559, the tarfile directory traversal vulnerability. Discussion about it has been going on and off since 2007, but no conclusion was reached.

tarfile’s previous maintainer, Lars Gustäbel, wrote a post on the issue before stepping down. I agree with the conclusion:

I understand the community’s desire to fix this issue no matter how. Unfortunately, fixing these kinds of issues is not trivial and requires a lot of effort.

Anyway, I was asked to fix this on our systems. I need to fix the issue, rather soon. And I’d rather fix it in CPython, and fix it well, so everyone can benefit.
I spent the last few days (literally) reading up on this, and here’s what I came up with. It’s long (sorry) and roughly PEP-shaped – probably appropriate after 15 years of discussions.

And it starts from a different angle than fixing a CVE.

Motivation

The tar format is user for several use cases, many of which have different needs. For example:

  • A backup of a UNIX workstation should faithfully preserve all kinds of details like file permissions, symlinks to system configuration, and various kinds of special files.
  • When unpacking a data bundle, it’s much more important that the unpacking will not have unintended consequences – like exposing password file by symlinking it to a public place.

To support all the use cases, tar has very many features. But in many cases, it’s best to ignore or disallow some of them when extracting an archive.

Python allows extracting tar archives using tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly.
And so many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. But, TarFile.extractall looks benign in a code review.

Tarfile extraction is also exposed via shutil.unpack_archive, which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting to extracting archives without prior inspection, even though the docs again warn against it.

It could be argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall() is what people reach for when they need to extract a tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So, I propose adding several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:

  • Current behavior: trusting the archive. Suitable e.g. as a building block for libraries that do the check themselves, or extracting an archive you just made yourself.
  • Unpacking a UNIX archive: roughly following GNU tar (e.g. stripping leading / from filenames).
  • Unpacking a general data archive: the shutil.unpack_archive use case.

Eventually, after a deprecation period, the last option – the most limited but most secure one – would become the default, and the others would be easy to specify.

Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:

  • Call TarFile.getmembers()
  • Verify or modify each member’s TarInfo
  • Pass the result to extractall’s members

However, there are some issues with this approach:

  • TarInfo does not officially allow modification. You can modify it, but you can get surprising results since TarInfo objects are cached/shared.
  • Calling getmembers can be expensive and it requires a seekable archive.
  • When verifying members in advance, it may be necessary to track how each member would have changed the disk, e.g. how symlinks are being set up. This is hard. We can’t expect users to do it.

To solve these issues we’ll:

  • Provide a supported way to modify TarInfo when needed, without unnecessary work when not. A replace method, similar to what dataclass or namedtuple have, should do the trick.
  • Provide a way to hook into extractall’s loop to fitler/modify members as they are processed.
  • Require that this hook is called just before extracting each member, so it can scan the current state of the disk. This will greatly simplify the implementation of policies (both in stdlib and user code), at the cost of not being able to do a precise “dry run”.

The default policies described above will be implemented using the API for user-defined policies, so they can be used as building blocks or examples.

Full disclosure & redistributor info

I work for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. We’re planning to carry vendor patches to:

  • Allow configuring the defaults system-wide, and
  • Change the default as soon as possible, even in older Python versions.

The proposal should make this easy to do, and it allows users to query the settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will work similarly to dataclasses.replace. It will return a deep copy of the TarInfo object with attributes replaced as specified by keyword arguments:

  • name
  • mtime
  • mode
  • linkname
  • uid
  • gid
  • uname
  • gname

Any of these, except name and linkname, will be allowed to be set to None.
When extract or extractall encounters such a None, it will not set that piece of metadata, leaving it as default (current user/group, mode based on umask, current time).
When addfile encounters such a None, it will raise an error. (It could also not store the attribute, if the format allows it, but that’s a possible future enhancement.)

The documentation will mention why the method is there: TarInfo objects retreived from TarFile.getmembers() are “live”, and modifiying them directly will affect subsequent unrelated operations.

Policies

TarFile.extract and TarFile.extractall methods will grow a policy parameter, which take a function with the signature:

policy(member: TarInfo) -> TarInfo|None

(XXX would filter be a better name than policy?)

When used it will be called on each member as it is extracted, and extraction will work with the result. On None the member will be skipped.

The function can also raise an exception. This can, depending on Tarfile.errorlevel, abort the extraction or cause the member to be skipped.

We will also provide a set of defaults for common use cases. In addition to a function, the policy argument will take one of the following strings: (XXX better names welcome)

  • 'fully_trusted': Current behavior: honor the metadata as is. Should be used if the user trusts the archive completely, or implements their own complex verification.

  • 'tar': Roughly follow defaults of the GNU tar command (when run as a normal user without --absolute-names):

    • Strip leading / from filenames
    • Refuse to extract files with a .. component in the filename
    • Refuse to extract files whose absolute path (after following symlinks) would end up outside the destination. (Note that GNU tar instead delays creating some links.)
    • Clear high mode bits (setuid, setgid, sticky) and group/other write bits (S_IWGRP|S_IWOTH). This is an approximation of tar’s default, which limits the mode by the current umask setting.
  • 'data': Extract a “data” archive, disallowing common attack vectors but limiting functionality. In addition to tar:

    • Refuse to extract links (hard or soft) which end up linking to a path outside of the destination.
    • Refuse to extract device files (incl. pipes)
    • For regular files and hard links:
      • Set the owner read and write permissions (S_IRUSR|S_IWUSR).
        (By now only the executable bits depend on information in the archive.)
      • Remove the group & other executable permission (S_IXGRP|S_IXOTH) if the user doesn’t have it (S_IXUSR).
    • For other files (directories), ignore mode entirely (set it to None).
    • Ignore user and group info (set uid, gid, uname, gname to None).
  • 'warn': Like 'fully-trusted', but emit a DeprecationWarning for each member that would be changed or removed under the 'data' policy.

The default will be 'warn' in Python 3.12-3.13, and change to 'data' in Python 3.14.

The corresponding functions will be available as tarfile.fully_trusted_policy(), tarfile.tar_policy(), etc, so they can be easily used in custom policies.

A new exception, PolicyError, will be added. It’ll have several new subclasses: one for each of the refusal reasons above. PolicyError’s member attribute will contain the relevant TarInfo.

In the lists above, “refusing" to extract a file means that a PolicyError will be raised. If the TarFile.errorlevel is 1 or more, this will abort the extraction. With errorlevel=0, the error will be logged and the member will be ignored, but extraction will continue.
Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

(The default errorlevel is currently 1, not 0 as the documentation says. The docs will be fixed.)

A new method, TarInfo.apply_policy(self, policy=...), will raise PolicyError or return a (possibly modified) TarInfo to be extracted.
The docs will warn that calling this for all members does not do a “dry run” extraction (previously extracted files might make changes on disk that affect later members, e.g. set up symlinks).

Introspection

The module will get new constants DEFAULT_POLICY (name of the default policy, i.e. 'warn' for now) and POLICIES (a dict of available policy values).
Users are encouraged to look for these, rather than the Python version, to check whether policy is supported.
Applications and system integrators may wish to change DEFAULT_POLICY to suit their requirements.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for extracting untrusted file without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.

New docs will tell users to consider:

  • extracting to a new empty directory,
  • checking filenames against an allow-list of characters (to filter out control characters, confusables, etc.),
  • checking that filenames have expected extensions (discouraging files that execute when you “click on them”, and extension-less files like Windows special device names),
  • limiting the total size of extracted data, size of individual files, and number of files,
  • checking for files that would be shadowed on case-insensitive filesystems.

Also, the docs will note that:

  • tar files commonly contain multiple versions of the same file: later ones are expected to overwrite earlier ones on extraction.
  • tarfile does not protect against issues with “live” data, e.g. an attacker tinkering with the destination directory while extraction (or adding) is going on (see GNU tar manual for more info)

This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if grows too long or if it needs to be consolidated with zipfile or shutil (which is out of scope for this proposal).

Shutil

shutil.unpack_archive will use tarfile’s default policy: 'warn' for a deprecation period, then 'data'.

Rejected ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (tarfile: Traversal attack vulnerability · Issue #65308 · python/cpython · GitHub). There are two major issues with this approach:

  • The name is misleading. General achive operations can never be made “safe” from all kinds of unwanted behavior, without impacting legitimate use cases.
  • It does not solve the problem of unsafe defaults.

However, many of the ideas behind SafeTarFile can be used.

Add absolute_path option to tarfile

A minimal change to check the “CVE resolved” box doesn’t go far enough to protect the unaware, nor to empower the dilligent and curious.

Thanks

This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, and many others.

17 Likes

Thanks for writing this up! I don’t use tarfile and don’t have experience with it, so I won’t have much to say on the details, but it’s great to see progress being made.

I assume it will get a new parameter to override the policy.

This certainly sounds like the best way to handle it (better than spamming the whole world with pull requests, at least).

I personally prefer filter over policy, though that would also imply that you can provide your own callable to evaluate each file (which I think you should be able to do). Perhaps filter=tarfile.warn_filter is a suitable approach? (That is, we provide a callable that implements the policy - even if we just treat it as a unique constant and implement the behaviour some other way.)

Your proposed defaults make sense. Thanks for thinking it through so thoroughly.

3 Likes

And the spam PR had silly bugs in it when i got one…

In Pygments, after the first one was closed (it was not a security-sensitive context), they even reposted the exact same PR a second time… :frowning:

(Sorry for the OT.)

Will the 'warn' policy itself be deprecated, or do you expect to keep it forever?

Would it make sense to allow passing the policy as callable (policy=tarfile.fully_trusted_policy) rather than via string? (Also suggested by Steve.) Do I suspect correctly that passing by string is motivated by the need to allow system-wide configuration?

Overall I like the plan as proposed going forward.

Q: The concept makes sense for more than just tarfile. Any filesystem container should offer something like this. This suggests that the policy concept should be abstract enough that zipfile could also use it? I realize TarInfo and ZipInfo have different APIs, but both modules otherwise feature similar extract and extractall concepts. I don’t expect a universal policy filter function without doing some isinstance calls internally, but it should at least be possible that way.

Motivation: Lets acknowledge the general theme of the problem and apply the pattern everywhere. My cursory guess is that it won’t be much more work in this case? I assume some software accepts archives in multiple formats and wants to do equivalent processing on them regardless.

7 Likes

Q: Can you describe the shape of changes you are intending to make in security fix backports to older versions of Python?

This post seems to focus on 3.12+ (good in the PEP-like sense), but we should also consider what shape it’ll take in older versions so that it is usable by the world and is not just RedHat-python specific.

A general theme for security fix backports to existing releases: No change in API surfaces: ie, you can’t add a policy= parameter in a backport. Because nobody will be able to write code that works on both an unpatched and patched runtime stdlib (without difficult to maintain costly API introspection hacks).

A workaround to this is to provide an alternative mechanism for changing the behavior that is a no-op on older unpatched versions. Such as setting a magic otherwise unused by existing code and unlikely to be used by any subclasses attribute. t = TarFile(...); t._policy = ... for example. This API should be documented, but limited to legacy purposes and its use discouraged for anyone targeting only 3.12+. With the attribute documented as write only (ex: implemented a property who’s getter raises AttributeError).

If you are allowed to describe why RedHat seems to be treating this as a high priority issue to be resolved after decades, it’d be interesting to know that. But feel free to leave that story for later to avoid derailing the work with a digression. (not a complaint, I’m glad to see someone actually care!)

Thank you for the comments!
Looks like the next step is to turn this into a PEP.

Good point. I’m not sure if tar-specific options are OK. Presumably you’re using unpack_archive because you don’t care that much about the format.
Or you’re using unpack_archive because the API is more ergonomic. Maybe tarfile could get a top-level extract function that takes tar-specific options?

OK, I’ll switch to filter. After all, this it’s about cleaning up the archive in general, rather than just security.

Yup, functions will be allowed. And I’ll remove TarInfo.apply_policy, which just calls the given function. (An earlier draft needed policies do a bit more that what a filter function can, but it turns out a member-wise filter is enough if we don’t provide “dry run” functionality.)
I’ll keep the string names though: they’re much nicer to use (I sigh every time I write subprocess.PIPE), and better generalize to other formats.
And no, this is not about site-specific configuration – that could simply keep a dict or do getattr(tarfile, name).

I don’t see much harm in keeping it forever. Getting DeprecationWarning about something that gives you DeprecationWarning sounds silly.
But saying that, I realize it would make sense to hide it more. Perhaps with tarfile.legacy_deprecation_policy, without a string 'legacy_deprecation' alias.

Good idea! I won’t have time to work on that myself, unfortunately.
I think a good first step would be exposing the same interface from “member info” objects, possibly os.stat_result whose st_ prefix won’t collide with pre-existing attributes. But there’s a bunch of complexity down this road: for example, mode could use some way to model tri-state logic (set/clear/honor) for each bit, numeric/named users/groups are a pain … and I expect Windows-specific features are just as “fun” as the tar/UNIX ones.

A simpler way would be to use common policy filter names, that should take care of the simple cases.

Good point, thanks!
Even better: let’s use a proper public TarFile.extraction_filter attribute, instead of tarfile.DEFAULT_POLICY.

Umm, there’s a customer request. I’m trying to keep myself on the engineering side of things, so I don’t know too much more. If I did I probably couldn’t tell, customer confidentiality is a big thing.
Speaking of this, let me clarify a few things:

  • Red Hat can and will use vendor patches for anything customers want but isn’t appropriate for CPython. I only propose the changes that (IMHO) will benefit the wider community, and aren’t system-specific.
  • The RHEL integration part will be public when released. It’ll probably look like what we did for CVE -2021-23336 (patch).
  • I’d be happy to draw the CPython/vendor boundary somewhere else.
2 Likes

This raises a question that I had when I read the proposal. Tarfiles do get created on Windows, and they are somewhat weird, in that they try to make Windows-specific details conform to the (historically Unix-based) tar format. I don’t know if that impacts security considerations (although I imagine that anything which works based on user IDs would be confused as heck by a UID fabricated by a Windows tar creation tool - the tarfile module creates everything with UID 0, for example) but maybe it should be clarified?

My guess is that when extracting on Windows, the “data” policy (filter - but in this context “policy” reads better IMO) is the only one that really makes sense. When extracting tarfiles created on Windows, on Unix systems, there might be more issues to consider - but again “use the data policy unless you know the file comes from a source that is compatible with yours” might be sufficient.

I agree. 'data' is good for both cross-platform (lowest common denominator) archives, and cross-platform destination filesystems. An archive models a filesystem, after all.

My proposal is for skipping extraction of some pieces of metadata, and for modifying the input data. I’m staying far away from the actual mechanics, mostly because I can’t really design and test the right behavior on Windows.

As for creating tar files from a Windows filesystem, that’s entirely out of scope here. Extraction filters are meant to deal with all kinds of weird tarballs.

But yes, it would be good to clarify that the tar filter/policy will preserve the UNIX-specific features. Perhaps unix would even be a better name for it?

2 Likes

This might be reasonably implemented by passing an exists callable into the filter. Then a dry-run operation can shim that while a real extract can just provide the real function.

But if it’s not really essential, keep things simple.

Nope, you’d also need to track an evolving zoo of symlinks, see _check_member or _get_effective_name in Lars’s patch.
I have great respect for Lars; I don’t think dry-run code can be much simpler.
Myself, I wouldn’t want to maintain code like that, claim that it’s safe, or design ways for users to hook into it.

1 Like

I think we should either make an exception to the general theme, or not backport at all.
Assuming the attribute is extraction_filter (but it would work the same if it was hidden under _policy as you suggest), and that the function that implements a “safe” policy will be called tarfile.data_filter():

  • my_tarfile.extraction_filter = getattr(tarfile, 'data_filter', lambda x: x) says “be secure if possible, silently allow everything if not”, and it’s verbose enough to imply the author really means that.
  • my_tarfile.extraction_filter = (lambda x: x) says “trust this archive” and works everywhere
  • There’s no good way to “always be secure”, short of using a third-party backport from PyPI or writing one yourself.
  • my_tarfile.extraction_filter = 'data' would make older/unpatched versions of Python ignore an explicit security-related request, which is terrible. So, whatever else I’ll do, the attribute will not accept the string shortcuts.
  • my_tarfile.extraction_filter = tarfile.data_filter would only work on patched versions of Python. That just shows that exposing the filter functions is a change in API surface.
    • If we allow that, why not allow extractall(filter=tarfile.data_filter) too?
    • If we don’t expose the function, there’s not much left from the feature to backport. (Maybe the warning by default, but we don’t add new warnings in patch releases.)

It woud be nice to allow third-party code to write:

if hasattr(tarfile, 'data_filter'):  # (or some other feature check)
    # new way of doing things
    my_tarfile.extractall(filter='data')
else:
    # remove this after a deprecation period
    warn('Extracting may be unsafe, consider updating Python')
    my_tarfile.extractall()

rather than

if hasattr(tarfile, 'data_filter'):
    my_tarfile.extraction_filter = 'data'
    my_tarfile.extractall()
    # XXX: reset the filter? Do we need to be thread-safe here?
else:
    # remove this after a deprecation period
    warn('Extracting may be unsafe, consider updating Python')
    my_tarfile.extractall()
3 Likes

Nice analysis. These examples of how to write a secure when possible / warn when not across all versions code snippet are an ideal type of thing to add to a tarfile documentation update that goes along with the security fix.

Agreed that a magic write only attribute api for the backports doesn’t make sense in this case given the hasattr test and desire to encourage people’s code to log when something may be hazardous.

1 Like

The behavior should possibly be as close to zipfile as possible to prevent confusion and make them as interchangeable as possible for shutil. Am am mostly uncertain about the chosen behavior for user and group ownership… zipfile says that it tries to preserve as much information as possible except for stripping some paths like .. and absolut paths. IMO the data policy feels to extreme as a default and I would like a mix between tar and data which might ignore pipes and other special file-like entries but preserves ownership of possible.

IMO, data is a pretty good approximation of zipfile behavior, simply because zipfile doesn’t support *nix-specific features. (At least when extracting. For example, as far as I can see from the source, permissions are stashed in the archive, as external_attributes, but ignored when unpacking.)
data does honor the executable permission bit, unlike zipfile. That part is inspired by Git, FWIW.

As for ownership info, note that tarfile itself does not set it unless you are the *nix superuser.
It’s quite specialized piece of metadata. As far as I know, zipfile doesn’t support it at all.

1 Like

The TarFile initialiser accepts a tarinfo argument, which sets the type of tarinfo objects (it defaults to the TarInfo class).

Q: Could we implement the policies described above in subclasses of TarInfo? It might be easier to retrofit. E.g.:

import tarfile
tarball = tarfile.TarFile.open(..., tarinfo=tarfile.FullyTrustedTarInfo)