PEP Proposal: External Data for Python Packages

Following up on my talk from the Python Packaging Summit - PyCon US 2022, here is a proposal PEP, rendered inline as markdown for now:

PEP: TBD
Title: External Data for Python Packages
Author: Steven Silvester
Sponsor:
PEP-Delegate: <PEP delegate’s real name>
Discussions-To: TBD
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 9-May-2022
Post-History: 9-May-2022
Resolution: TBD

Abstract

PEP 427 describes a data directory as follows: “The .data directory contains subdirectories with the scripts, headers, documentation and so forth from the distribution.”

To date, there has not been a standard mechanism for build backends to enable packagers to target the data directory.

The goal of this PEP is to standardize a suggested practice for build backends
to expose this capability, and a set of suggestions to package authors for
the intended use of the data directory.

Motivation

The data directory is useful for making shared content discoverable outside of
a package’s site-packages directory, for example man pages or the shared
data used by the Jupyter extension system.

However, there is not a standard way to provide such data,
or a recommendation as to what best practices should be used for the
data directory.

Rationale

PEP 427 defined a “Data Directory” but did not specify how build backends
should make use of the feature. By defining a standard, backends and package
authors can use the feature in a supported manner.

Specification

Build backends should provide a simple mechanism to provide files to the
data directory of a wheel. The actual implementation and
semantics can be backend-specific. Such variations could include
whether to specify the files using glob patterns or as a single directory
to map to the data directory.
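
As a rough illustration of the single-directory variation (a sketch under stated assumptions, not how any particular backend implements it): the hypothetical helper `map_shared_data` below walks a source directory and computes where each file would sit inside the wheel archive, under `{name}-{version}.data/data/`. The helper name, the `example_pkg` distribution, and the file layout are all made up for this example.

```python
import tempfile
from pathlib import Path, PurePosixPath


def map_shared_data(source_dir, dist_name, version):
    """Map each file under source_dir to its path inside the wheel archive.

    Keys are source paths relative to source_dir (POSIX style); values are
    archive paths under the wheel's "<name>-<version>.data/data/" directory.
    """
    source_dir = Path(source_dir)
    prefix = PurePosixPath(f"{dist_name}-{version}.data") / "data"
    mapping = {}
    for path in sorted(source_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(source_dir)
            mapping[rel.as_posix()] = str(prefix / PurePosixPath(*rel.parts))
    return mapping


# Demo with a throwaway directory holding one namespaced file.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "share" / "jupyter" / "example.json"
    example.parent.mkdir(parents=True)
    example.write_text("{}")
    demo_mapping = map_shared_data(tmp, "example_pkg", "1.0")

for source, archive_path in demo_mapping.items():
    print(source, "->", archive_path)
```

A glob-based variation would differ only in how the source files are selected; the archive prefix stays the same.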

Backends should link to this PEP specification when providing such an option,
or to the appropriate section of the Python Packaging User Guide, so that centralized
context and guidelines can be given to package authors.

Such guidance includes using appropriate namespaces for the data.
For example, the Jupyter extension ecosystem uses /share/jupyter and /etc/jupyter for runtime and configuration data, respectively.

Alternatives such as entry points should be considered where appropriate
for plugin systems.
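
For comparison, a plugin system built on entry points discovers extensions through installed package metadata rather than through files on disk. The following is a minimal sketch using the standard library's `importlib.metadata`; the function name `find_plugins` and the group name `example.plugins` are hypothetical.

```python
from importlib.metadata import entry_points


def find_plugins(group):
    """Return {name: entry point} for every installed entry point in `group`."""
    all_eps = entry_points()
    if hasattr(all_eps, "select"):
        # Python 3.10+ selection API
        selected = all_eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name
        selected = all_eps.get(group, [])
    return {ep.name: ep for ep in selected}


plugins = find_plugins("example.plugins")  # hypothetical group name
for name, ep in plugins.items():
    print(name, "->", ep.value)
```

Each entry point can then be imported on demand with `ep.load()`, so nothing needs to be placed under sys.prefix for discovery to work.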

Additionally, the data directory should only be used for truly shared data, while internal
data files should still be contained as package data within the package, and
contained within the namespaced site-packages folder.

Reference Implementation

There exist three reference implementations.

The original, deprecated
feature in setuptools was called data_files.
The data was specified as “a list of data files to install” in the setup script.
The files given were mapped to the data directory in the wheel.

Next, flit implemented “external data”, which is given as a directory that is copied into the data directory of the wheel with no modification. Additionally, flit specifies that for editable installs (PEP 660), these files are copied to their destination, so changes there won’t take effect until you reinstall the package.

Finally, hatch implemented “shared-data”, a “mapping similar to the explicit selection option corresponding to data that will be installed globally in a given Python environment, usually under sys.prefix”.

Additionally, there is a proposed external data feature for setuptools, that would
follow the convention of the flit feature.

Rejected Ideas

Discouraging backends from providing this feature. We discussed the implications of supporting this feature, and its potential for abuse.
The site-packages directory is by definition scoped by package name,
while the data directory allows files to be installed at the sys.prefix
level. However, there are valid reasons to want to provide data at the
sys.prefix level, as long as appropriate messaging is given to package authors
about intended usage and namespacing. An additional concern is that for system level installs, sys.prefix can be a system-wide package. However, there is precedent for system-wide installs for man-pages.

A final concern raised was that
large files could be provided in the data directory, but such a risk
already exists with in-package data.

We had also discussed using entry points instead of data files.
We had explored this possibility for Jupyter extensions, but had rejected it
because the configuration files need to be scanned at runtime, and the
Jupyter data files need to be served by a web server at runtime. In both cases it is
beneficial to have them co-located to avoid disk scanning penalties
across multiple locations.

We also discussed making data files part of the core package metadata, but rejected it because it is a build-time concern that is
not relevant to the installed package.
Since the data files are not explicitly namespaced, we would have to have a full
manifest of the installed data files for it
to be useful.

Open Issues

Copyright

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

This doesn’t quite read correctly. Perhaps “to add files to the data directory…” instead of “provide”?


I would link to the PyPUG guide on package data here, and for entry points above, so it’s super clear what the other options are.


Are there any rules on “ownership” of namespaces? PEP 518 has the rule that the owner of the name on PyPI owns the tool.<name> table. Presumably Jupyter would “own” /etc/jupyter – other people can put their config files there, but it corresponds to your specification. I shouldn’t come along and stick stuff there that’s unrelated to Jupyter.

I wasn’t at PyCon US, and I’m quite confused by this PEP.

Am I wrong in summarising this PEP as “backends should expose a mechanism for putting data in the .data section of the wheel”? Beyond that, it seems to be specifying an opinionated approach to using the mechanism provided.

What problem is this solving, and how?

This is not possible with the files that go in the data mechanisms provided via wheels today.

This maps to the location pointed to by “data” in the schemes. As an example, see what sysconfig.get_paths() prints on the latest Python.
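
For instance, a minimal sketch (the exact paths printed depend on the platform and environment, so none are shown here):

```python
import sysconfig

# Paths for the default install scheme of the running interpreter.
paths = sysconfig.get_paths()

# Print a few of the install locations, including the "data" location
# discussed above.
for key in ("purelib", "platlib", "scripts", "data"):
    print(f"{key:8} {paths[key]}")

data_root = paths["data"]
```

In a virtual environment the `data` entry is typically the environment root, which is why files placed there end up at paths such as share/ or etc/ relative to sys.prefix.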

This seems like a somewhat arbitrary recommendation, and it’s not clear what value this provides. Why should a package not store the data it needs to use, within the data directory?

This PEP draft does not currently provide a standard way of doing things. It basically lists the three most popular backends and what they do for allowing putting files in data.


Overall, this draft seems to play sufficiently fast and loose with standard terminology as to cause substantial confusion, and is vague enough that I don’t really understand what it intends to specify. Most prominently, it appears to fundamentally confound the top-level .data directory defined in the wheel specification, which contains all the various sub-levels defined by the sysconfig schemes, with the non-standardized data label inside the top-level .data directory, which is currently used ad hoc to store arbitrary data files under the installation prefix. Without clearing up this confusion to start with, it is difficult to figure out what it actually wants to achieve, much less help ensure it actually expresses that.

FWIW, I was at PyCon US and was involved in the discussion, but am equally confused, considering I seem to recall us delving at least somewhat into a discussion involving some degree of actual specifics and concrete mechanics, which I don’t really see represented in this PEP.

Indeed; this appears to be all the “Specification” section says, which is merely restating the goal of the PEP (which belongs in the Abstract and/or a Goals, Motivation or Rationale section). The Specification could really use some specifics, otherwise it could just be at most an Informational PEP, or perhaps just a blog post.

Yup, it’s hard to see how the PEP is actually useful to achieve tool interoperability (i.e. the purpose of packaging PEPs) aside from just telling tools they should do something in a tool-specific way (that they apparently already all do, in a tool-specific way).

I agree. It’s very important to note that even if the subject was discussed in person at PyCon, the audience for the PEP (and the feature) will include a lot of people who weren’t party to that discussion, and it’s crucial to make sure the PEP requires no familiarity with the discussion.

I’m also very confused because there was an extensive discussion about this here on Discourse, and although that discussion is linked in the proposal, none of the points made there seem to have been incorporated in the proposal (in particular the suggested approach for Jupyter).


Again, just to stress—I was part of the discussion at PyCon, and I’m no less confused myself. Various informed parties delved into at least some level of useful specifics regarding an actual design and implementation, and I don’t really see much if any of that reflected in the PEP, much less the level of concrete, fleshed out detail in the previous thread. As of now, this draft is just a statement that backends should provide some means of allowing users to specify external data, which is already de-facto the case.

There certainly seems to be pretty good support for a PEP on this topic, but this needs a pretty thorough rewrite to actually meet that need, both to garner the support of the packaging community, and to fulfill the minimum baseline for completeness expected of a Standards Track PEP.


Frankly, project Jupyter’s needs are met by the status quo, and any confusion you read in the PEP draft is due to my confusion from trying to parse through various opinions on the topic. Given the feedback thus far, I’d just as soon retract this proposal and move forward with using Flit or Hatch to meet Jupyter’s needs.


Yeah. I am also a bit confused. The .data/data directory is part of the wheel specification, and backends can already use it. I guess the goal with this PEP would be to recommend that build backends actually expose a configuration option for users to use it? Please correct me if I am wrong.

In general, I believe a PEP for that would only be necessary if build backends don’t want to do this without one, or need a common standard for it before implementing something themselves, and it is substantial and potentially controversial enough to be more than just a PR to add a note to an existing spec. And as @blink1073 mentions here and details in the PEP, the major backends already do provide a means of offering this.

(Also, I’m not sure I can recall an accepted Standards Track PEP that basically just said that external tools should provide some way to do thing X, without any specifics on the details—it could potentially be an Informational PEP, though)

That said, I felt the discussion @blink1073 sparked did have significant value, and additional standardization on this topic may certainly be of benefit, as @blink1073 and the previous thread certainly made a case for, if someone wants to pick it up.

Ha, I never saw that! I’m happy I’m not the only one who thought the word “shared” was better for the option name 🙂
