Following up on my talk at the Python Packaging Summit at PyCon US 2022, here is a proposed PEP, rendered inline as markdown for now:
PEP: TBD
Title: External Data for Python Packages
Author: Steven Silvester
Sponsor:
PEP-Delegate: <PEP delegate’s real name>
Discussions-To: TBD
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 9-May-2022
Post-History: 9-May-2022
Resolution: TBD
Abstract
PEP 427 describes a data directory as follows: “The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution.”
To date, there has not been a standard mechanism for build backends to enable packagers to target the data directory.
The goal of this PEP is to standardize a suggested practice for build backends
to expose this capability, and a set of suggestions to package authors for
the intended use of the data directory.
Motivation
The data directory is useful for making shared content discoverable outside of
a package’s site-packages directory, for example man pages or shared data such
as that used by the Jupyter extension system.
However, there is no standard way to provide such data, nor a recommendation
on best practices for using the data directory.
Rationale
PEP 427 defined a “Data Directory” but did not specify how build backends
should make use of the feature. By defining a standard, backends and package
authors can use the feature in a supported manner.
Specification
Build backends should provide a simple mechanism for adding files to the data
directory of a wheel. The actual implementation and semantics can be
backend-specific. Such variations could include whether the files are specified
using glob patterns or as a single directory that maps to the data directory.
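As a rough illustration of the "single directory" variation, the sketch below maps every file under a source directory to an archive path inside a wheel's data directory, following the `{name}-{version}.data/data/` layout from PEP 427. The function name `wheel_data_paths` and its parameters are hypothetical and not part of any backend's API; it also assumes the distribution name has already been normalized for use in a wheel filename.

```python
from pathlib import Path, PurePosixPath


def wheel_data_paths(name, version, data_root):
    """Map files under ``data_root`` to archive paths in a wheel's data directory.

    Hypothetical helper for illustration only; real backends implement their
    own file selection and normalization rules.
    """
    root = Path(data_root)
    # PEP 427 layout: <name>-<version>.data/data/<relative path>
    prefix = PurePosixPath(f"{name}-{version}.data") / "data"
    mapping = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root)
            # Archive paths inside a wheel always use forward slashes.
            mapping[str(path)] = str(prefix / PurePosixPath(*rel.parts))
    return mapping
```

On installation, the contents of the `data` subdirectory are placed relative to the environment's data root, which is usually `sys.prefix`.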
Backends should link to this PEP, or to the appropriate section of the Python
Packaging User Guide, when providing such an option, so that centralized
context and guidelines can be given to package authors.
Such guidance includes using appropriate namespaces for the data.
For example, the Jupyter extension ecosystem uses /share/jupyter and
/etc/jupyter for runtime and configuration data, respectively.
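To make the namespacing concrete, here is a minimal sketch of how a consumer might locate such namespaced directories at runtime using the standard library's `sysconfig.get_path("data")`. The helper `shared_data_dir` is a hypothetical name, and this is not how Jupyter itself resolves its paths; it only shows where files from a wheel's data directory typically end up in the current environment.

```python
import sysconfig
from pathlib import Path


def shared_data_dir(*namespace):
    """Return the directory under the environment's data root for ``namespace``.

    Hypothetical helper: the data root is usually ``sys.prefix``, but the
    exact location depends on the active install scheme.
    """
    return Path(sysconfig.get_path("data")).joinpath(*namespace)


# Namespaced locations modelled on the Jupyter example above.
runtime_dir = shared_data_dir("share", "jupyter")
config_dir = shared_data_dir("etc", "jupyter")
```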
Alternatives such as entry points should be considered where appropriate for
plugin systems.
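For comparison, a plugin system built on entry points can discover plugins through `importlib.metadata` rather than by reading files from a shared directory. The sketch below lists the names registered under a group; the function name `plugin_names` and any group name passed to it are hypothetical, and the branch on `select` is only there to cope with the different return types of `entry_points()` across Python versions.

```python
from importlib.metadata import entry_points


def plugin_names(group):
    """Return the sorted names of entry points registered under ``group``."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a ``select`` method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)
```

A group with no registered plugins simply yields an empty list, so a host application can call this unconditionally at startup.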
Additionally, the data directory should only be used for truly shared data,
while internal data files should still be included as package data within the
package, inside the namespaced site-packages folder.
Reference Implementation
There exist three reference implementations.
The original, deprecated feature in setuptools was called data_files.
The data was specified as “a list of data files to install” in the setup script.
The files given were mapped to the data directory in the wheel.
Next, flit implemented “external data”, which is given as a directory that is
copied explicitly into the data directory of the wheel with no modification.
Additionally, flit specifies that for editable installs (PEP 660), these files
are copied to their destination, so changes there won’t take effect until you
reinstall the package.
Finally, hatch implemented “shared-data”, a “mapping similar to the explicit selection option corresponding to data that will be installed globally in a given Python environment, usually under sys.prefix”.
Additionally, there is a proposed external data feature for setuptools that
would follow the convention of the flit feature.
Rejected Ideas
Discouraging backends from providing this feature. We discussed the implications of supporting this feature, and its potential for abuse.
The site-packages directory is by definition scoped by package name,
while the data directory allows files to be installed at the sys.prefix
level. However, there are valid reasons to want to provide data at the
sys.prefix level, as long as appropriate messaging is given to package authors
about intended usage and namespacing. An additional concern is that for
system-level installs, sys.prefix can be a system-wide location. However, there
is precedent for system-wide installs of man pages.
A final concern raised was that large files could be provided in the data
directory, but such a risk already exists with in-package data.
We had also discussed using entry points instead of data files.
We had explored this possibility for Jupyter extensions, but had rejected it
because the configuration files need to be scanned at runtime, and the
Jupyter data files need to be served by a web server at runtime. In both cases it is
beneficial to have them co-located to avoid disk scanning penalties
across multiple locations.
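The co-location argument can be sketched as follows: when all configuration files live in one shared directory, a single directory listing is enough to load them. The function name `load_configs` and the assumption that each file is a standalone JSON document are illustrative only and do not describe Jupyter's actual loader.

```python
import json
from pathlib import Path


def load_configs(config_dir):
    """Load every ``*.json`` file in ``config_dir``, keyed by file stem.

    Illustrative sketch: one directory scan replaces a search across each
    installed package's own location.
    """
    configs = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        configs[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return configs
```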
We also discussed making data files part of the core package metadata, but
rejected it because it is a build-time concern that is not relevant to the
installed package.
Since the data files are not explicitly namespaced, we would have to have a full
manifest of the installed data files for it to be useful.
Open Issues
- Should there be a new standard for installing arbitrary data files?
- Is there a specification for how data files are included in sdists and bdists (wheels)?
Copyright
This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.