Following up on my talk from the Python Packaging Summit at PyCon US 2022, here is a proposed PEP, rendered inline as markdown for now:
Title: External Data for Python Packages
Author: Steven Silvester
PEP-Delegate: <PEP delegate’s real name>
Type: Standards Track
PEP 427 describes a data directory as follows: “The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution.”
To date, there has not been a standard mechanism for build backends to enable packagers to target the data directory.
The goal of this PEP is to standardize a suggested practice for build backends
to expose this capability, and a set of suggestions to package authors for
the intended use of the data directory.
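
To make the installed location concrete, here is a minimal sketch of how a member of a wheel's data directory could be resolved to its destination. It assumes the PEP 427 layout `{name}-{version}.data/<key>/...`, where `<key>` names an install-scheme path. The function name `install_destination` is hypothetical, and only keys present in `sysconfig.get_paths()` are handled; real installers do considerably more.

```python
import os
import sysconfig
from pathlib import PurePosixPath


def install_destination(archive_name, paths=None):
    """Resolve a wheel .data member to the path it would be installed at.

    `paths` defaults to the current interpreter's install scheme; pass a
    dict to resolve against a different one.
    """
    paths = sysconfig.get_paths() if paths is None else paths
    parts = PurePosixPath(archive_name).parts
    # Expected layout: "<name>-<version>.data/<key>/<relative path ...>"
    if len(parts) < 3 or not parts[0].endswith(".data"):
        raise ValueError(f"not a .data directory member: {archive_name!r}")
    key, relative = parts[1], parts[2:]
    if key not in paths:
        raise ValueError(f"unknown install scheme key: {key!r}")
    return os.path.join(paths[key], *relative)
```

With the default scheme, the `data` key usually resolves to `sys.prefix`, which is why files placed there are visible outside site-packages.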
The data directory is useful for making shared content discoverable outside of
a package’s site-packages directory, for example man pages or shared
data such as that used by the Jupyter extension system.
However, there is no standard way to provide such data,
nor a recommendation as to which best practices should be used for the
data directory.
PEP 427 defined a “Data Directory” but did not specify how build backends
should make use of the feature. By defining a standard, backends and package
authors can use the feature in a supported manner.
Build backends should provide a simple mechanism to provide files to the
data directory of a wheel. The actual implementation and
semantics can be backend-specific. Such variations could include
whether to specify the files using glob patterns or as a single directory
to map to the data directory.
Backends should link to this PEP, or to the appropriate section of the
Python Packaging User Guide, when providing such an option, so that centralized
context and guidelines can be given to package authors.
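
As an illustration of the "single directory" variant mentioned above, a backend could compute the archive names for shared files roughly as follows. This is only a sketch: the function name `shared_data_archive_names` is hypothetical, the distribution name is assumed to be already normalized for use in a wheel filename, and an actual backend would also write the files into the archive and record them in `RECORD`.

```python
from pathlib import Path


def shared_data_archive_names(source_dir, name, version):
    """Map each file under source_dir to its path inside the wheel.

    Returns a dict of {source path: archive name}, where every archive
    name sits under "<name>-<version>.data/data/".
    """
    root = Path(source_dir)
    prefix = f"{name}-{version}.data/data"
    mapping = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Wheel archive names always use forward slashes.
            relative = path.relative_to(root).as_posix()
            mapping[str(path)] = f"{prefix}/{relative}"
    return mapping
```

A glob-based variant would differ only in how the source files are selected; the target prefix inside the wheel stays the same.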
Such guidance includes using appropriate namespaces for the data.
For example, the Jupyter extension ecosystem uses
share/jupyter and etc/jupyter for runtime and configuration data, respectively.
Alternatives such as
entry points should be considered where appropriate
for plugin systems.
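
For comparison, plugin discovery through entry points can be sketched with the standard library's `importlib.metadata`. The helper name `discover_plugins` and any group name passed to it are hypothetical; the branch on `select` is there only because the return type of `entry_points()` differs between Python versions.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {name: EntryPoint} for every installed entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points object.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}
```

Calling `.load()` on one of the returned entry points imports the plugin object, so no files need to live outside the package for this style of discovery.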
The data directory should only be used for truly shared data, while internal
data files should still be contained as package data within the package, and
contained within the namespaced
There exist three reference implementations.
The original, now deprecated, option in
setuptools was called data_files.
The data was specified as “a list of data files to install” in the setup script.
The files given were mapped to the
data directory in the wheel.
flit implemented “external data”, which is given as a directory that is copied explicitly into the
data directory of the wheel with no modification. Additionally,
flit specifies that for editable installs (PEP 660), these files are copied to their destination, so changes there won’t take effect until you reinstall the package.
hatch implemented “shared-data”, a “mapping similar to the explicit selection option corresponding to data that will be installed globally in a given Python environment, usually under sys.prefix”.
Additionally, there is a proposed external data feature for
setuptools that would
follow the convention of the
Discouraging backends from providing this feature. We discussed the implications of supporting this feature, and its potential for abuse.
The site-packages directory is by definition scoped by package name, whereas the
data directory allows files to be installed at the sys.prefix
level. However, there are valid reasons to want to provide data at the
sys.prefix level, as long as appropriate messaging is given to package authors
about intended usage and namespacing. An additional concern is that for system-level installs, sys.prefix can be a system-wide location. However, there is precedent for system-wide installs, such as man pages.
A final concern raised was that
large files could be provided in the
data directory, but such a risk
already exists with in-package data.
We had also discussed using
entry points instead of data files.
We had explored this possibility for Jupyter extensions, but had rejected it
because the configuration files need to be scanned at runtime, and the
Jupyter data files need to be served by a web server at runtime. In both cases it is
beneficial to have them co-located to avoid disk scanning penalties
across multiple locations.
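
The co-location argument can be illustrated with a small sketch of runtime discovery: a single namespaced directory under `sys.prefix` is scanned once, instead of walking every installed package. The helper name `find_shared_files` and its parameters are hypothetical, and this is not how Jupyter itself implements its lookup.

```python
import sys
from pathlib import Path


def find_shared_files(namespace, kind="share", pattern="*", prefix=None):
    """List files under <prefix>/<kind>/<namespace> matching `pattern`.

    `prefix` defaults to sys.prefix, the usual target of a wheel's
    data directory.
    """
    base = Path(sys.prefix if prefix is None else prefix) / kind / namespace
    if not base.is_dir():
        return []
    return sorted(p for p in base.rglob(pattern) if p.is_file())
```

For example, `find_shared_files("jupyter", kind="etc", pattern="*.json")` would look only under `etc/jupyter` in the active environment.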
We also discussed making data files part of the core package metadata, but rejected it because it is a build-time concern that is
not relevant to the installed package.
Since the data files are not explicitly namespaced, we would have to have a full
manifest of the installed data files for it
to be useful.
- Should there be a new standard for installing arbitrary data files?
- Is there a specification for how data files are included in sdists and bdists (wheels)?
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.