Hello,
You might have heard of CVE-2007-4559, the tarfile
directory traversal vulnerability. Discussion about it has been going on and off since 2007, but no conclusion was reached.
tarfile
’s previous maintainer, Lars Gustäbel, wrote a post on the issue before stepping down. I agree with the conclusion:
I understand the community’s desire to fix this issue no matter how. Unfortunately, fixing these kinds of issues is not trivial and requires a lot of effort.
Anyway, I was asked to fix this on our systems. I need to fix the issue, rather soon. And I’d rather fix it in CPython, and fix it well, so everyone can benefit.
I spent the last few days (literally) reading up on this, and here’s what I came up with. It’s long (sorry) and roughly PEP-shaped – probably appropriate after 15 years of discussions.
And it starts from a different angle than fixing a CVE.
Motivation
The tar
format is user for several use cases, many of which have different needs. For example:
- A backup of a UNIX workstation should faithfully preserve all kinds of details like file permissions, symlinks to system configuration, and various kinds of special files.
- When unpacking a data bundle, it’s much more important that the unpacking will not have unintended consequences – like exposing password file by symlinking it to a public place.
To support all the use cases, tar
has very many features. But in many cases, it’s best to ignore or disallow some of them when extracting an archive.
Python allows extracting tar
archives using tarfile.TarFile.extractall()
, whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly.
And so many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.
Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. But, TarFile.extractall
looks benign in a code review.
Tarfile extraction is also exposed via shutil.unpack_archive
, which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting to extracting archives without prior inspection, even though the docs again warn against it.
It could be argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.
Rationale
How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall()
is what people reach for when they need to extract a tarball. Its default behaviour needs to change.
What would be the best behaviour? That depends on the use case. So, I propose adding several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:
- Current behavior: trusting the archive. Suitable e.g. as a building block for libraries that do the check themselves, or extracting an archive you just made yourself.
- Unpacking a UNIX archive: roughly following GNU
tar
(e.g. stripping leading/
from filenames). - Unpacking a general data archive: the
shutil.unpack_archive
use case.
Eventually, after a deprecation period, the last option – the most limited but most secure one – would become the default, and the others would be easy to specify.
Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:
- Call
TarFile.getmembers()
- Verify or modify each member’s
TarInfo
- Pass the result to
extractall
’smembers
However, there are some issues with this approach:
TarInfo
does not officially allow modification. You can modify it, but you can get surprising results sinceTarInfo
objects are cached/shared.- Calling
getmembers
can be expensive and it requires a seekable archive. - When verifying members in advance, it may be necessary to track how each member would have changed the disk, e.g. how symlinks are being set up. This is hard. We can’t expect users to do it.
To solve these issues we’ll:
- Provide a supported way to modify
TarInfo
when needed, without unnecessary work when not. Areplace
method, similar to what dataclass or namedtuple have, should do the trick. - Provide a way to hook into
extractall
’s loop to fitler/modify members as they are processed. - Require that this hook is called just before extracting each member, so it can scan the current state of the disk. This will greatly simplify the implementation of policies (both in stdlib and user code), at the cost of not being able to do a precise “dry run”.
The default policies described above will be implemented using the API for user-defined policies, so they can be used as building blocks or examples.
Full disclosure & redistributor info
I work for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. We’re planning to carry vendor patches to:
- Allow configuring the defaults system-wide, and
- Change the default as soon as possible, even in older Python versions.
The proposal should make this easy to do, and it allows users to query the settings.
Specification
Modifying and forgetting member metadata
The TarInfo
class will gain a new method, replace()
, which will work similarly to dataclasses.replace
. It will return a deep copy of the TarInfo
object with attributes replaced as specified by keyword arguments:
name
mtime
mode
linkname
uid
gid
uname
gname
Any of these, except name
and linkname
, will be allowed to be set to None
.
When extract
or extractall
encounters such a None
, it will not set that piece of metadata, leaving it as default (current user/group, mode based on umask
, current time).
When addfile
encounters such a None
, it will raise an error. (It could also not store the attribute, if the format allows it, but that’s a possible future enhancement.)
The documentation will mention why the method is there: TarInfo
objects retreived from TarFile.getmembers()
are “live”, and modifiying them directly will affect subsequent unrelated operations.
Policies
TarFile.extract
and TarFile.extractall
methods will grow a policy
parameter, which take a function with the signature:
policy(member: TarInfo) -> TarInfo|None
(XXX would filter
be a better name than policy
?)
When used it will be called on each member as it is extracted, and extraction will work with the result. On None
the member will be skipped.
The function can also raise an exception. This can, depending on Tarfile.errorlevel
, abort the extraction or cause the member to be skipped.
We will also provide a set of defaults for common use cases. In addition to a function, the policy
argument will take one of the following strings: (XXX better names welcome)
-
'fully_trusted'
: Current behavior: honor the metadata as is. Should be used if the user trusts the archive completely, or implements their own complex verification. -
'tar'
: Roughly follow defaults of the GNUtar
command (when run as a normal user without--absolute-names
):- Strip leading
/
from filenames - Refuse to extract files with a
..
component in the filename - Refuse to extract files whose absolute path (after following symlinks) would end up outside the destination. (Note that GNU
tar
instead delays creating some links.) - Clear high mode bits (setuid, setgid, sticky) and group/other write bits (
S_IWGRP|S_IWOTH
). This is an approximation oftar
’s default, which limits the mode by the currentumask
setting.
- Strip leading
-
'data'
: Extract a “data” archive, disallowing common attack vectors but limiting functionality. In addition totar
:- Refuse to extract links (hard or soft) which end up linking to a path outside of the destination.
- Refuse to extract device files (incl. pipes)
- For regular files and hard links:
- Set the owner read and write permissions (
S_IRUSR|S_IWUSR
).
(By now only the executable bits depend on information in the archive.) - Remove the group & other executable permission (
S_IXGRP|S_IXOTH
) if the user doesn’t have it (S_IXUSR
).
- Set the owner read and write permissions (
- For other files (directories), ignore mode entirely (set it to
None
). - Ignore user and group info (set
uid
,gid
,uname
,gname
toNone
).
-
'warn'
: Like'fully-trusted'
, but emit aDeprecationWarning
for each member that would be changed or removed under the'data'
policy.
The default will be 'warn'
in Python 3.12-3.13, and change to 'data'
in Python 3.14.
The corresponding functions will be available as tarfile.fully_trusted_policy()
, tarfile.tar_policy()
, etc, so they can be easily used in custom policies.
A new exception, PolicyError
, will be added. It’ll have several new subclasses: one for each of the refusal reasons above. PolicyError
’s member
attribute will contain the relevant TarInfo
.
In the lists above, “refusing" to extract a file means that a PolicyError
will be raised. If the TarFile.errorlevel
is 1 or more, this will abort the extraction. With errorlevel=0
, the error will be logged and the member will be ignored, but extraction will continue.
Note that extractall()
may leave the archive partially extracted; it is the user’s responsibility to clean up.
(The default errorlevel
is currently 1, not 0 as the documentation says. The docs will be fixed.)
A new method, TarInfo.apply_policy(self, policy=...)
, will raise PolicyError
or return a (possibly modified) TarInfo
to be extracted.
The docs will warn that calling this for all members does not do a “dry run” extraction (previously extracted files might make changes on disk that affect later members, e.g. set up symlinks).
Introspection
The module will get new constants DEFAULT_POLICY
(name of the default policy, i.e. 'warn'
for now) and POLICIES
(a dict of available policy
values).
Users are encouraged to look for these, rather than the Python version, to check whether policy
is supported.
Applications and system integrators may wish to change DEFAULT_POLICY
to suit their requirements.
Hints for further verification
Even with the proposed changes, tarfile
will not be suited for extracting untrusted file without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.
New docs will tell users to consider:
- extracting to a new empty directory,
- checking filenames against an allow-list of characters (to filter out control characters, confusables, etc.),
- checking that filenames have expected extensions (discouraging files that execute when you “click on them”, and extension-less files like Windows special device names),
- limiting the total size of extracted data, size of individual files, and number of files,
- checking for files that would be shadowed on case-insensitive filesystems.
Also, the docs will note that:
- tar files commonly contain multiple versions of the same file: later ones are expected to overwrite earlier ones on extraction.
- tarfile does not protect against issues with “live” data, e.g. an attacker tinkering with the destination directory while extraction (or adding) is going on (see GNU tar manual for more info)
This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if grows too long or if it needs to be consolidated with zipfile
or shutil
(which is out of scope for this proposal).
Shutil
shutil.unpack_archive
will use tarfile
’s default policy: 'warn'
for a deprecation period, then 'data'
.
Rejected ideas
SafeTarFile
An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (tarfile: Traversal attack vulnerability · Issue #65308 · python/cpython · GitHub). There are two major issues with this approach:
- The name is misleading. General achive operations can never be made “safe” from all kinds of unwanted behavior, without impacting legitimate use cases.
- It does not solve the problem of unsafe defaults.
However, many of the ideas behind SafeTarFile can be used.
Add absolute_path option to tarfile
A minimal change to check the “CVE resolved” box doesn’t go far enough to protect the unaware, nor to empower the dilligent and curious.
Thanks
This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, and many others.