Should there be a new standard for installing arbitrary data files?

AWhetter · July 11, 2021, 1:29am

Thank you for everyone’s input into this discussion so far. With everyone’s feedback, and many of @pf_moore’s suggestions in particular, I’ve written up a first attempt at a proposed solution. It focuses on the main use case that we’ve identified: packages sharing data or configuration files in common directories with applications installed in the Python environment. The proposal also attempts to consider applications that may want different files in different places depending on the platform they are installed on.

We would add two new path names to sysconfig: the share path and the config path.
The share path would map to the following locations dependent on the scheme:

nt: f"{base}/Share"
nt_user: f"{userbase}/Share"
posix_prefix and posix_home: f"{base}/share"
posix_user and osx_framework_user: f"{userbase}/share"

The config path would map to the following locations dependent on the scheme:

nt: f"{base}/Config"
nt_user: f"{userbase}/Config"
posix_prefix and posix_home: f"{base}/etc"
posix_user and osx_framework_user: f"{userbase}/etc"
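
To make the two mappings above concrete, here is a minimal sketch of how they could be expanded. The names PROPOSED_PATHS and proposed_path are hypothetical, and "share" and "config" are not path names that sysconfig knows about today; this only mirrors the templates listed in this proposal.

```python
# Hypothetical templates copied from the proposal; not part of sysconfig today.
_PREFIX_PATHS = {"share": "{base}/share", "config": "{base}/etc"}
_USER_PATHS = {"share": "{userbase}/share", "config": "{userbase}/etc"}

PROPOSED_PATHS = {
    "nt": {"share": "{base}/Share", "config": "{base}/Config"},
    "nt_user": {"share": "{userbase}/Share", "config": "{userbase}/Config"},
    "posix_prefix": _PREFIX_PATHS,
    "posix_home": _PREFIX_PATHS,
    "posix_user": _USER_PATHS,
    "osx_framework_user": _USER_PATHS,
}


def proposed_path(name, scheme, *, base, userbase):
    """Expand the proposed template for a path name under a scheme."""
    template = PROPOSED_PATHS[scheme][name]
    return template.format(base=base, userbase=userbase)
```

In a real environment the base and userbase values would presumably come from sysconfig.get_config_var("base") and sysconfig.get_config_var("userbase"), which is what the existing schemes already use.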

To utilise these locations through setuptools, a new keyword argument to setup() would be added called shared_files. This argument would take a dictionary as its value that would look like the following:

shared_files={
    "<platform_pattern>": {
        "<path_name>": {
            "<dest_path>": [
                "<src_path>",
            ],
        },
    },
},

For example:

shared_files={
    "any": {
        "config": {
            "jupyter/nbconfig/notebook.d": [
                "widgetsnbextension/widgetsnbextension.json",
            ],
        },
        "share": {
            "jupyter/kernels/python3": glob("kernelspec/*"),
        },
    },
},

or as a setup.cfg:

[options.shared_files.any.config]
jupyter/nbconfig/notebook.d = widgetsnbextension/widgetsnbextension.json

[options.shared_files.any.share]
jupyter/kernels/python3 = kernelspec/kernel.json, kernelspec/logo-32x32.png, kernelspec/logo-64x64.png

These examples would result in the following files in the wheel:

{distribution}-{version}.data/config/jupyter/nbconfig/notebook.d/widgetsnbextension.json
{distribution}-{version}.data/share/jupyter/kernels/python3/kernel.json
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-32x32.png
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-64x64.png
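
The layout above can be derived mechanically from the configuration. This is only a sketch under my own assumptions: wheel_data_paths is a hypothetical helper (nothing like it exists in setuptools), and it takes the inner mapping for one platform pattern, i.e. {path_name: {dest_path: [src_path, ...]}}.

```python
import posixpath


def wheel_data_paths(distribution, version, files_by_path_name):
    """Return the sorted archive paths for one platform's shared files."""
    data_dir = f"{distribution}-{version}.data"
    paths = []
    for path_name, destinations in files_by_path_name.items():
        for dest_path, sources in destinations.items():
            for src_path in sources:
                # Only the file name of the source survives; its directory
                # is replaced by <path_name>/<dest_path> inside the wheel.
                file_name = posixpath.basename(src_path)
                paths.append(
                    posixpath.join(data_dir, path_name, dest_path, file_name)
                )
    return sorted(paths)
```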

<platform_pattern> is a glob-style pattern that matches against the platform tag of the resulting wheel (defined in PEP 425 as "distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _" and presumably soon to be "sysconfig.get_platform() with all hyphens - and periods . replaced with underscore _"). By making this a glob-style pattern, users can do things like "win*" to match against all Windows platforms or "linux_*" to match all Linux platforms. The "any" pattern will match against all platforms. If multiple patterns match against a platform, the union of all <dest_path>-<src_path> pairs will be used. If this causes a destination file to have multiple matching source paths, an error will be raised during the packaging process (though raising this error during the packaging of an sdist would require knowledge of all possible platform tags, or for setuptools to have a way of determining if two patterns overlap. Yikes!). Exposing platform tags in this way will require additional documentation to provide clarity around the possible values (and/or layouts) of platform tags.
<path_name> is a subset of the result of sysconfig.get_path_names(). To begin with, only “config” and “share” would be supported. The other path names that exist, excluding “data”, have support through other actively maintained keyword arguments to setuptools. The data_files keyword can install to “data”, but it will remain deprecated. The “data” path won’t be supported through shared_files to prevent installation to non-standard locations in sys.prefix and to places supported by other keywords.
<dest_path> is the location that the listed files will be installed to, relative to sysconfig.get_path(<path_name>, ...). The location of a file can therefore be queried at run time by calling sysconfig.get_path(<path_name>, ...), using whichever schemes the application decides to support. Absolute destination paths are not supported, and neither are paths that include "..". In both cases, setuptools will raise an exception while parsing the configuration.
<src_path> is the location of a file to be installed into the relevant <dest_path>. Symlinks are not supported to prevent the inclusion of files that exist outside of sys.prefix. Each source file will be included in an sdist without needing to be specified in a MANIFEST.in file.
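
As a rough illustration of the rules above (pattern matching, taking the union of matching sections, rejecting conflicting sources, and rejecting absolute or ".." destinations), here is a sketch. The function names are hypothetical and this is not how setuptools works today. It uses fnmatch for the glob matching, which is my assumption about what "glob style" would mean in practice.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def validate_dest_path(dest_path):
    """Reject destinations that are absolute or that contain '..'."""
    path = PurePosixPath(dest_path)
    # Note: this only catches POSIX-style absolute paths; a real
    # implementation would also need to consider Windows-style ones.
    if path.is_absolute():
        raise ValueError(f"absolute destination not supported: {dest_path!r}")
    if ".." in path.parts:
        raise ValueError(f"'..' not supported in destination: {dest_path!r}")


def resolve_shared_files(shared_files, platform_tag):
    """Merge every section whose pattern matches platform_tag.

    Returns {(path_name, dest_file): src_path}.
    """
    resolved = {}
    for pattern, path_names in shared_files.items():
        if pattern != "any" and not fnmatchcase(platform_tag, pattern):
            continue
        for path_name, destinations in path_names.items():
            for dest_path, sources in destinations.items():
                validate_dest_path(dest_path)
                for src_path in sources:
                    file_name = PurePosixPath(src_path).name
                    key = (path_name, str(PurePosixPath(dest_path) / file_name))
                    # Two different sources for one destination is an error.
                    previous = resolved.setdefault(key, src_path)
                    if previous != src_path:
                        raise ValueError(
                            f"{key[1]!r} has conflicting sources: "
                            f"{previous!r} and {src_path!r}"
                        )
    return resolved
```

Note that this only detects conflicts for one concrete platform tag at a time, which is exactly the sdist problem mentioned above: without a tag to resolve against, we would need some way of deciding whether two patterns can overlap.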

When a develop install is performed, each individual destination file will be symlinked to the relevant <src_path> on platforms that support it, or copied on platforms that do not.
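
A minimal sketch of that "symlink if possible, otherwise copy" behaviour for a single file is below. link_or_copy is a hypothetical helper of my own, not something setuptools provides, and replacing an existing destination is my assumption about what a repeated develop install should do.

```python
import os
import shutil


def link_or_copy(src_path, dest_file):
    """Symlink dest_file to src_path, falling back to a copy.

    Returns "symlink" or "copy" so the caller can record what happened.
    """
    parent = os.path.dirname(dest_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    if os.path.lexists(dest_file):
        os.remove(dest_file)  # replace a stale link or copy
    try:
        os.symlink(os.path.abspath(src_path), dest_file)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example, Windows without permission to create symlinks.
        shutil.copy2(src_path, dest_file)
        return "copy"
```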

This proposal does not cover the possibility of two packages installing conflicting files to the same destination, and defers to pip issue #4625.

Variations considered:

Allow the copying of subtrees to a destination path: To make configuration less verbose, rather than specifying individual source files and their destination, users could specify source directories to allow the copying of the whole directory along with subdirectories. Using this type of configuration would require the internals of setuptools to translate the configuration to the “per file source” data structure proposed, to allow for individual files to be symlinked in a develop install and for individual files to be removed from directories that are shared across many packages during the uninstall of a package. Although this alternative method of configuration is less verbose, it may result in unintended files ending up in the destination directory (e.g. .DS_Store files).
Specify source files with MANIFEST.in style syntax: Given that these source files are included in an sdist without further configuration, it might make sense for them to be configured in a similar way. MANIFEST.in files strike a good middle ground of being explicit without being too verbose. The difference in needs between shared_files and MANIFEST.in though is that the source and destination of things specified in a MANIFEST.in are both relative to the root directory. With shared_files, the destination of a file may not share the same directory hierarchy as the source file. For example, it is less clear whether the destination layout of the configuration "jupyter/kernels/python3": ["include kernels/*/kernel.json"], should be jupyter/kernels/python3/kernels/{x,y,z,...}/kernel.json or jupyter/kernels/python3/kernel.json.
Alternative names for shared_files: Given that the primary use case for this new functionality is to integrate with applications in the Python environment, the names application_files and application_data_files make this use case clear. However, despite the discussions we’ve had on the topic, there may well be other uses for these additional files that we have not considered. Including “application” in the name might therefore unnecessarily limit how people think of these files. Another alternative name, shared_data_files, makes the shared nature of these additional files clear. However, using “data files” in the name might create confusion with the “data” path name in sysconfig. A user may ask “why do my shared data files end up in the ‘config’ and ‘share’ paths rather than the ‘data’ path?”.
Use exact platform tags instead of a <platform_pattern>: A downside of allowing patterns is that we can create conflicts where two different source files map to a single destination. Furthermore, there isn’t a great mechanism for excluding files. If a user were to specify a file in "win*", the user cannot prevent that file from being included in a more specific "win_amd64" section without splitting up the "win*" section into individual platform tags. These downsides, plus a more complicated mapping scheme, were accepted in order to serve the more common use case of OS-specific file schemes over platform-specific ones. However, we could choose to simplify the method of configuration and say that a single platform tag (or perhaps a tuple of platform tags) maps to a single dictionary of sources to destinations (if using tuples, a platform tag should exist as a key only once).
Policing the directories created immediately under the location of a path name: The idea here was that because these shared files are supposed to be consumed by an application, perhaps an application would have to advertise that it accepts these files for an installer to allow them to be installed. Such strictness might help prevent the abuse of shared_files to install things in unusual places by, for example, requiring the shared files for an application to be under a subdirectory in the destination named after the application. However, this puts a lot of extra responsibility on installers. It can also be easily worked around by a package simply advertising that it is an application and that it wants those files installed wherever it likes.
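
For the subtree variation in the list above, the translation to the “per file source” structure could look roughly like this. expand_subtree and its ignore parameter are hypothetical names of my own; the ignore list is there only to show where unintended files such as .DS_Store would have to be filtered out.

```python
import os


def expand_subtree(src_dir, dest_path, ignore=(".DS_Store",)):
    """Translate a source directory into {dest_dir: [src_file, ...]}."""
    expanded = {}
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        if rel == ".":
            dest_dir = dest_path
        else:
            # Mirror the subdirectory layout under the destination.
            dest_dir = "/".join([dest_path, *rel.split(os.sep)])
        kept = sorted(
            os.path.join(root, name).replace(os.sep, "/")
            for name in files
            if name not in ignore
        )
        if kept:
            expanded[dest_dir] = kept
    return expanded
```

Each key of the result is a <dest_path> and each value is a list of <src_path> entries, so the output could be dropped straight into a shared_files section.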