You might have heard of CVE-2007-4559, the
tarfile directory traversal vulnerability. Discussion about it has been going on and off since 2007, but no conclusion was reached.
tarfile’s previous maintainer, Lars Gustäbel, wrote a post on the issue before stepping down. I agree with the conclusion:
I understand the community’s desire to fix this issue no matter how. Unfortunately, fixing these kinds of issues is not trivial and requires a lot of effort.
Anyway, I was asked to fix this on our systems. I need to fix the issue, rather soon. And I’d rather fix it in CPython, and fix it well, so everyone can benefit.
I spent the last few days (literally) reading up on this, and here’s what I came up with. It’s long (sorry) and roughly PEP-shaped – probably appropriate after 15 years of discussions.
And it starts from a different angle than fixing a CVE.
tar format is user for several use cases, many of which have different needs. For example:
- A backup of a UNIX workstation should faithfully preserve all kinds of details like file permissions, symlinks to system configuration, and various kinds of special files.
- When unpacking a data bundle, it’s much more important that the unpacking will not have unintended consequences – like exposing password file by symlinking it to a public place.
To support all the use cases,
tar has very many features. But in many cases, it’s best to ignore or disallow some of them when extracting an archive.
Python allows extracting
tar archives using
tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly.
And so many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.
Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. But,
TarFile.extractall looks benign in a code review.
Tarfile extraction is also exposed via
shutil.unpack_archive, which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting to extracting archives without prior inspection, even though the docs again warn against it.
It could be argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.
How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility.
TarFile.extractall() is what people reach for when they need to extract a tarball. Its default behaviour needs to change.
What would be the best behaviour? That depends on the use case. So, I propose adding several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:
- Current behavior: trusting the archive. Suitable e.g. as a building block for libraries that do the check themselves, or extracting an archive you just made yourself.
- Unpacking a UNIX archive: roughly following GNU
tar(e.g. stripping leading
- Unpacking a general data archive: the
Eventually, after a deprecation period, the last option – the most limited but most secure one – would become the default, and the others would be easy to specify.
Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:
- Verify or modify each member’s
- Pass the result to
However, there are some issues with this approach:
TarInfodoes not officially allow modification. You can modify it, but you can get surprising results since
TarInfoobjects are cached/shared.
getmemberscan be expensive and it requires a seekable archive.
- When verifying members in advance, it may be necessary to track how each member would have changed the disk, e.g. how symlinks are being set up. This is hard. We can’t expect users to do it.
To solve these issues we’ll:
- Provide a supported way to modify
TarInfowhen needed, without unnecessary work when not. A
replacemethod, similar to what dataclass or namedtuple have, should do the trick.
- Provide a way to hook into
extractall’s loop to fitler/modify members as they are processed.
- Require that this hook is called just before extracting each member, so it can scan the current state of the disk. This will greatly simplify the implementation of policies (both in stdlib and user code), at the cost of not being able to do a precise “dry run”.
The default policies described above will be implemented using the API for user-defined policies, so they can be used as building blocks or examples.
I work for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. We’re planning to carry vendor patches to:
- Allow configuring the defaults system-wide, and
- Change the default as soon as possible, even in older Python versions.
The proposal should make this easy to do, and it allows users to query the settings.
TarInfo class will gain a new method,
replace(), which will work similarly to
dataclasses.replace. It will return a deep copy of the
TarInfo object with attributes replaced as specified by keyword arguments:
Any of these, except
linkname, will be allowed to be set to
extractall encounters such a
None, it will not set that piece of metadata, leaving it as default (current user/group, mode based on
umask, current time).
addfile encounters such a
None, it will raise an error. (It could also not store the attribute, if the format allows it, but that’s a possible future enhancement.)
The documentation will mention why the method is there:
TarInfo objects retreived from
TarFile.getmembers() are “live”, and modifiying them directly will affect subsequent unrelated operations.
TarFile.extractall methods will grow a
policy parameter, which take a function with the signature:
policy(member: TarInfo) -> TarInfo|None
filter be a better name than
When used it will be called on each member as it is extracted, and extraction will work with the result. On
None the member will be skipped.
The function can also raise an exception. This can, depending on
Tarfile.errorlevel, abort the extraction or cause the member to be skipped.
We will also provide a set of defaults for common use cases. In addition to a function, the
policy argument will take one of the following strings: (XXX better names welcome)
'fully_trusted': Current behavior: honor the metadata as is. Should be used if the user trusts the archive completely, or implements their own complex verification.
'tar': Roughly follow defaults of the GNU
tarcommand (when run as a normal user without
- Strip leading
- Refuse to extract files with a
..component in the filename
- Refuse to extract files whose absolute path (after following symlinks) would end up outside the destination. (Note that GNU
tarinstead delays creating some links.)
- Clear high mode bits (setuid, setgid, sticky) and group/other write bits (
S_IWGRP|S_IWOTH). This is an approximation of
tar’s default, which limits the mode by the current
- Strip leading
'data': Extract a “data” archive, disallowing common attack vectors but limiting functionality. In addition to
- Refuse to extract links (hard or soft) which end up linking to a path outside of the destination.
- Refuse to extract device files (incl. pipes)
- For regular files and hard links:
- Set the owner read and write permissions (
(By now only the executable bits depend on information in the archive.)
- Remove the group & other executable permission (
S_IXGRP|S_IXOTH) if the user doesn’t have it (
- Set the owner read and write permissions (
- For other files (directories), ignore mode entirely (set it to
- Ignore user and group info (set
'fully-trusted', but emit a
DeprecationWarningfor each member that would be changed or removed under the
The default will be
'warn' in Python 3.12-3.13, and change to
'data' in Python 3.14.
The corresponding functions will be available as
tarfile.tar_policy(), etc, so they can be easily used in custom policies.
A new exception,
PolicyError, will be added. It’ll have several new subclasses: one for each of the refusal reasons above.
member attribute will contain the relevant
In the lists above, “refusing" to extract a file means that a
PolicyError will be raised. If the
TarFile.errorlevel is 1 or more, this will abort the extraction. With
errorlevel=0, the error will be logged and the member will be ignored, but extraction will continue.
extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.
errorlevel is currently 1, not 0 as the documentation says. The docs will be fixed.)
A new method,
TarInfo.apply_policy(self, policy=...), will raise
PolicyError or return a (possibly modified)
TarInfo to be extracted.
The docs will warn that calling this for all members does not do a “dry run” extraction (previously extracted files might make changes on disk that affect later members, e.g. set up symlinks).
The module will get new constants
DEFAULT_POLICY (name of the default policy, i.e.
'warn' for now) and
POLICIES (a dict of available
Users are encouraged to look for these, rather than the Python version, to check whether
policy is supported.
Applications and system integrators may wish to change
DEFAULT_POLICY to suit their requirements.
Even with the proposed changes,
tarfile will not be suited for extracting untrusted file without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.
New docs will tell users to consider:
- extracting to a new empty directory,
- checking filenames against an allow-list of characters (to filter out control characters, confusables, etc.),
- checking that filenames have expected extensions (discouraging files that execute when you “click on them”, and extension-less files like Windows special device names),
- limiting the total size of extracted data, size of individual files, and number of files,
- checking for files that would be shadowed on case-insensitive filesystems.
Also, the docs will note that:
- tar files commonly contain multiple versions of the same file: later ones are expected to overwrite earlier ones on extraction.
- tarfile does not protect against issues with “live” data, e.g. an attacker tinkering with the destination directory while extraction (or adding) is going on (see GNU tar manual for more info)
This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if grows too long or if it needs to be consolidated with
shutil (which is out of scope for this proposal).
shutil.unpack_archive will use
tarfile’s default policy:
'warn' for a deprecation period, then
An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (tarfile: Traversal attack vulnerability · Issue #65308 · python/cpython · GitHub). There are two major issues with this approach:
- The name is misleading. General achive operations can never be made “safe” from all kinds of unwanted behavior, without impacting legitimate use cases.
- It does not solve the problem of unsafe defaults.
However, many of the ideas behind SafeTarFile can be used.
A minimal change to check the “CVE resolved” box doesn’t go far enough to protect the unaware, nor to empower the dilligent and curious.
This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, and many others.