PEP 721: Using tarfile.data_filter for source distribution extraction

Per PEP 706 (Filter for tarfile.extractall), Python 3.12 adds a warning for unpacking tarballs, and 3.14 will change the default behaviour. This affects tools that unpack source distributions.
It’s a chance to write down how the unarchiving should work, and what kind of metadata sdists should preserve.

It seems like a PEP is the best way for this, so I looked a bit into how tools unpack sdists, and tried to write something in standardese.

Note that generally, after a packaging tool unpacks a sdist it immediately executes unpacked code. So this proposal is not about solving security issues.

Does the following sound reasonable?


Abstract

Extracting a source distribution archive should normally use the data
filter added in :pep:706.
We clarify details, and specify the behaviour for tools that cannot use the
filter directly.

Motivation

The source distribution (sdist) is defined as a tar archive.
The tar format is designed to capture all metadata of Unix-like files.
Some of these are dangerous, unnecessary for source code, and/or
platform-dependent.
As explained in :pep:706, when extracting a tarball, one should always either
limit the allowed features, or explicitly give the tarball total control.

Rationale

For source distributions, the data filter introduced in :pep:706
is enough. It allows slightly more features than git and zip (both
commonly used in packaging workflows).

However, not all tools can use the data filter,
so this PEP specifies an explicit set of expectations.
The aim is that the current behaviour of pip download
and setuptools.archive_util.unpack_tarfile is valid,
except cases deemed too dangerous to allow.
Another consideration is ease of implementation for non-Python tools.

Unpatched versions of Python

Tools are allowed to ignore this PEP when running on Python without tarfile
filters.

The feature has been backported to all versions of Python supported by
python.org. Vendoring it in third-party libraries is tricky,
and we should not force all tools to do so.
This shifts the responsibility to keep up with security updates from libraries
to the users.

Permissions

Common tools (git, zip) don’t preserve Unix permissions (mode bits).
Telling users to not rely on them in sdists, and allowing tools to handle
them relatively freely, seems fair.

The only exception is the executable permission.
We recommend, but do not require, that tools preserve it.
Given that scripts are generally platform-specific, it seems fitting to
say that keeping them executable is tool-specific behaviour.

Note that while git preserves executability, zip (and thus wheel)
doesn’t do it natively. (It is possible to encode it in “external attributes”,
but Python’s ZipFile.extract does not honor that.)
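
As a rough illustration of the “external attributes” point: zip writers that record Unix permissions put the mode in the upper 16 bits of ZipInfo.external_attr. The helper name zip_entry_is_executable below is hypothetical (it is not part of zipfile), and the sketch assumes the archive was written by a Unix-aware tool.

```python
import stat
import zipfile


def zip_entry_is_executable(info: zipfile.ZipInfo) -> bool:
    """Hypothetical helper: report whether a zip entry records the user executable bit.

    Assumes the writer stored a Unix mode in the high 16 bits of
    external_attr; archives from other systems may leave those bits at zero.
    """
    unix_mode = info.external_attr >> 16
    return bool(unix_mode & stat.S_IXUSR)
```

A tool that wants executable files from a zip would have to apply this bit itself (for example with os.chmod) after extraction.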

Specification

The following will be added to the PyPA source distribution format spec <https://packaging.python.org/en/latest/specifications/source-distribution-format/>_
under Source distribution archive features:

Because extracting tar files as-is is dangerous, and the results are
platform-specific, archive features of source distributions are limited.

Unpacking with the data filter

When extracting a source distribution, tools MUST either use
tarfile.data_filter (e.g. TarFile.extractall(..., filter='data')), OR
follow the Unpacking without the data filter section below.

As an exception, on Python interpreters without hasattr(tarfile, 'data_filter')
(:pep:706), tools that normally use that filter (directly or indirectly)
MAY warn the user and ignore this specification.
The trade-off between usability (e.g. fully trusting the archive) and
security (e.g. refusing to unpack) is left up to the tool.
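
A minimal sketch of the rule above, for illustration only. The function name unpack_sdist is hypothetical, and the fallback branch picks just one of the allowed trade-offs (warn, then trust the archive); a tool could equally refuse to unpack.

```python
import tarfile
import warnings


def unpack_sdist(archive_path, dest_dir):
    """Hypothetical helper: extract an sdist, preferring the data filter.

    Returns True if the data filter was used, False if the interpreter
    lacks tarfile filters and the archive was extracted unfiltered.
    """
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
            return True
        # Interpreter without PEP 706 support: the tool MAY warn and
        # ignore the specification. Trusting the archive is this
        # sketch's choice, not a requirement.
        warnings.warn(
            "tarfile.data_filter is unavailable; extracting without a filter",
            RuntimeWarning,
        )
        tar.extractall(dest_dir)
        return False
```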

Unpacking without the data filter

Tools that do not use the data filter directly (e.g. for backwards
compatibility, allowing additional features, or not using Python) MUST follow
this section.
(At the time of this writing, the data filter also follows this section,
but it may get out of sync in the future.)

The following files are invalid in a sdist archive.
Upon encountering such an entry, tools SHOULD notify the user,
MUST NOT unpack the entry, and MAY abort with a failure:

  • Files that would be placed outside the destination directory.
  • Links (symbolic or hard) pointing outside the destination directory.
  • Device files (including pipes).

The following are also invalid. Tools MAY treat them as above,
but are NOT REQUIRED to do so:

  • Files with a .. component in the filename or link target.
  • Links pointing to a file that is not part of the archive.

Tools MAY unpack links (symbolic or hard) as regular files,
using content from the archive.
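
For tools that implement the checks themselves, the three MUST NOT cases could be sketched roughly as follows. The name rejection_reason is hypothetical, the optional checks (.. components, links to files missing from the archive) are left out, and the sketch assumes leading slashes are dropped before the path check.

```python
import os
import tarfile


def rejection_reason(member: tarfile.TarInfo, dest_dir: str):
    """Hypothetical helper: return why an entry must not be unpacked, or None."""
    dest = os.path.realpath(dest_dir)
    # Assumption: leading slashes are dropped before the path is checked.
    name = member.name.lstrip("/")
    target = os.path.realpath(os.path.join(dest, name))
    if os.path.commonpath([dest, target]) != dest:
        return "file would be placed outside the destination directory"
    if member.issym() or member.islnk():
        if member.issym():
            # Symlink targets are relative to the directory holding the link.
            link = os.path.join(os.path.dirname(target), member.linkname)
        else:
            # Hard link targets are relative to the archive root.
            link = os.path.join(dest, member.linkname)
        if os.path.commonpath([dest, os.path.realpath(link)]) != dest:
            return "link points outside the destination directory"
    if member.isdev():
        # isdev() covers character devices, block devices and FIFOs.
        return "device file"
    return None
```

A caller would iterate over the archive members, notify the user for each non-None result, and skip (or abort on) those entries.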

When extracting sdist archives:

  • Leading slashes in file names SHOULD be dropped.
    (This is nowadays standard behaviour for tar unpacking.)

  • For each mode (Unix permission) bit, tools MUST either:

    • use the platform’s default for a new file/directory (respectively),
    • set the bit according to the archive, or
    • use the bit from rw-r--r-- (0o644) for non-executable files or
      rwxr-xr-x (0o755) for executable files and directories.
  • High mode bits (setuid, setgid, sticky) MUST be cleared.

  • It is RECOMMENDED to preserve the user executable bit.
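
One way to satisfy the mode rules above is to take the third option for every bit. A minimal sketch; the name extracted_mode is hypothetical, and using the platform default or the archive's own bits would be equally valid.

```python
import stat


def extracted_mode(archive_mode: int, is_dir: bool = False) -> int:
    """Hypothetical helper: pick the mode to apply to an extracted entry.

    Uses 0o755 for directories and for files whose user executable bit
    is set in the archive, and 0o644 otherwise. The setuid, setgid and
    sticky bits never survive, because only those two values are returned.
    """
    if is_dir or archive_mode & stat.S_IXUSR:
        return 0o755
    return 0o644
```

For example, a setuid executable recorded as 0o4755 is extracted as 0o755, and a private file recorded as 0o600 is extracted as 0o644.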

Permission bits

To create a portable sdist, tools SHOULD use only rw-r--r-- (0o644)
for non-executable files, and rwxr-xr-x (0o755) for executable files
and directories.

Users MAY rely on the user read and write permissions being set.
The other bits are tool- and platform-specific.
(Platforms where these permissions are not set by default MAY be ignored.)

Further hints

Tool authors are encouraged to consider how hints for further verification
in tarfile documentation apply to their tool.

Backwards Compatibility

The existing behaviour is unspecified, and treated differently by different
tools.
This PEP makes the expectations explicit.

There is no known case of backwards incompatibility, but some project out there
probably does rely on details that aren’t guaranteed.
This PEP bans the most dangerous of those features, and the rest is
made tool-specific.

Security Implications

The recommended data filter is believed safe against common exploits,
and is a single place to amend if flaws are found in the future.

The explicit specification includes protections from the data filter.

How to Teach This

The PEP is aimed at authors of packaging tools, who should be fine with
a PEP and an updated packaging spec.

Reference Implementation

TBD

Rejected Ideas

None yet.

Open Issues

None yet.

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

Considering the change to tarfile directly addresses an existing CVE, can’t we just file this against tools and say “you can say that you addressed this CVE when you merge this change”?

“Addressing CVEs” is increasingly used to bully maintainers into making changes, so I’d prefer to go deeper and show the actual benefit to the tools and their users.

Alas, for some tools the only benefit is avoiding the warning that I added to tarfile, and handling the pending behaviour change. I hope they can understand why I did it, and forgive the disruption.

“Implementing PEPs” is also often seen as the same kind of bullying, so I don’t think there’s a winning path here.

Maybe posting issues as a heads-up that the feature is coming and would enable various $THINGS for them? Though now it’s posted here, I suspect most of the relevant maintainers are going to see it anyway.

Yeah, I do hope this is visible enough. I did file some issues:

The advantage of a PEP is to document what we’ve agreed to do, as a reference for users who appear and say things like “my project relies on being able to store a device file in the sdist and you broke it”. It’s much easier to refer to a standard that addresses a CVE in a general manner than to have that argument on a case-by-case basis.

I’m basically +1 on the PEP. Most of the details I have no opinion on (as all I really care about is having well-defined and standardised behaviour) so I’ll let others thrash out any questions over them.

This is now PEP 721.
What’s next? Should I open a new topic? Alert the tool authors somehow?

The key for me is that there’s a reasonable discussion of the PEP and interested parties have had a chance to comment. It doesn’t feel like we should need a new topic, but because this was split from another discussion (at least I assume it was, given the way the first post reads like it was a continuation of something - I don’t recall where this originated) maybe it needs something to pull the discussion together.

I’m not after a big debate - I hope this is a relatively uncontroversial proposal. A link back to the original discussion may be enough, I couldn’t find it on a quick search.

Once you feel that there’s been enough feedback (or enough time with no objections), submit it for pronouncement.

Whooops! I thought I was adding a reply, but instead I edited the initial post. No idea how I managed to do that. Sorry for the mess!
I’ve restored the initial post.

Changes between the initial post and PEP 721:

Removed “Permission bits” section:

To create a portable sdist, tools SHOULD use only rw-r--r-- (0o644)
for non-executable files, and rwxr-xr-x (0o755) for executable files
and directories.

Users MAY rely on the user read and write permissions being set.
The other bits are tool- and platform-specific.
(Platforms where these permissions are not set by default MAY be ignored.)

This doesn’t concern extraction, and IMO isn’t worth standardizing.

Leading slashes in file names MUST be dropped.

SHOULD → MUST. If tools don’t do this, the files would usually be placed outside the destination directory, making the sdist invalid.

And some wording tweaks

I’ve posted on the PyPA Discord.
I plan to submit the PEP for pronouncement after EuroPython.

lol, that explains my confusion! Thanks. This thread is definitely fine for discussion.

Your schedule sounds fine to me, I’ll be happy to pronounce at that point (assuming no massive concerns arise, which I doubt).

I’m back from EuroPython (and some other stuff that kept me from computers)!

Could you please make the pronouncement?

Hi, while there’s not been much (any, TBH!) feedback on this proposal, I’m happy with it. It’s well-written and explicitly defines behaviour in an important area that’s been entirely implementation-dependent until now.

I therefore confirm that this PEP is formally accepted. Congratulations @encukou, and thanks for putting the work into both this and the tarfile enhancement that’s behind it!
