Should there be a new standard for installing arbitrary data files?

Yeah, I think so. An equivalent answer for --user as well.

I don’t think resolving etc and share separately is important to us, as long as the parent of where they actually go is accessible and reliable. sys.prefix usually happens to work for this, just not always.

--user is covered by the appropriate “user” scheme, so yes, that works.

Could you not just use sysconfig.get_path("data")? That’s basically “sys.prefix with a different value in the user scheme”. And it’s already available in wheels, because it’s an existing sysconfig path.

It’s worth also linking to this pip issue though. We’re trying to switch from distutils to sysconfig for getting scheme paths, in preparation for the deprecation of distutils, and there’s an impressive number of cases where the two are out of line. Which suggests that either there’s a certain lack of clarity over what “the correct locations” are in many cases, or distributions have been caught out by the fact that there’s no “single source of truth” for this data right now, when they have patched stuff.

Also, I did misread the wheel spec previously - all it refers to is “a dictionary of install paths”, with a note that “this version of the wheel specification is based on the distutils install schemes and does not define how to install files to other locations”.

But I think the following are reasonable clarifications for the spec:

  1. “distutils” becomes “sysconfig” with the impending deprecation of distutils.
  2. The “dictionary of install paths” is clarified, to note that by default installers should use one of the available sysconfig schemes.
  3. The spec doesn’t state which scheme an installer should use by default, nor does it put any requirements on what capabilities installers might have to customise the set of install locations.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?” And there’s no complete answer to that, because it’s not recorded anywhere (and things like --target or editable installs make things even more messy). But looking in sysconfig locations, and declaring custom install schemes unsupported, seems like a reasonable starting point.

Could you not just use sysconfig.get_path("data") ?

Yes, I think we can! As long as it’s documented that this is where data files go by default for pip install and pip install --user. I think that’s what’s been missing: a documented, queryable API that e.g. returns /usr/local on debian. Then it could be considered a bug that debian’s sysconfig.get_path("data") returns /usr instead of /usr/local. As you said, every possible --target install is not critical for us to support. We have post-install configuration to deal with those cases; we mainly care about the really common ones getting us 90% there without that extra second step.

I’ll confess that I was surprised to learn that data_files is deprecated in the docs with the message:

It does not work with wheels, so it should be avoided.

Mainly because we use data_files in wheels all the time and it works great. So I don’t quite understand what the current situation is. We’ve always used ‘relative’ paths, e.g.

data_files=[("share/jupyter/kernels/python3", glob("kernelspec/*")),]

So what part of data_files is really deprecated? Is it only absolute path support? We’ve never needed that. Is the plan to have a new spec that can reasonably be supported and un-deprecate data_files with a more limited scope, or a new keyword arg with a new, very similar spec?

So from the Jupyter perspective, everything is working pretty well today except for an official consistent API that would return /usr/local on debian instead of /usr. If the fix is for debian to patch sysconfig in the same way they do distutils as proposed in that pip issue, that would probably work great for us. We’ve handled that well enough so far with a bit of special-casing for /usr/local anyway, so we have no pressing issues, unless data_files is actually going to be removed.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?”

We actually want something that I think should be even simpler than that, since we don’t actually know what packages we are asking about (and critically, they might not be Python packages). Instead, the question we want to ask is:

  • what is the default install scheme, and where would it put data files
  • what is the --user install scheme, and where would it put data files

so that we can put those on our default search path to make life easier for wheel installs. That’s really it.
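As a rough sketch of what that lookup could look like with sysconfig alone (the helper name data_search_roots is made up for illustration, and the fallback to f"{os.name}_user" is an assumption for Python versions older than 3.10, which lack sysconfig.get_preferred_scheme()):

```python
import os
import sysconfig


def data_search_roots():
    """Return the "data" directories of the default and user schemes."""
    roots = [sysconfig.get_path("data")]  # default scheme for this interpreter

    # get_preferred_scheme() exists on Python 3.10+; older versions need a guess.
    if hasattr(sysconfig, "get_preferred_scheme"):
        user_scheme = sysconfig.get_preferred_scheme("user")
    else:
        user_scheme = f"{os.name}_user"

    if user_scheme in sysconfig.get_scheme_names():
        user_root = sysconfig.get_path("data", user_scheme)
        if user_root not in roots:
            roots.append(user_root)
    return roots


# Hypothetical use: where a Jupyter-like tool would look for shared files.
search_path = [os.path.join(root, "share", "jupyter") for root in data_search_roots()]
```

This only reflects what sysconfig reports, so on a distribution that patches distutils but not sysconfig it would still miss the /usr/local case discussed above.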

To put it another way, data_files for us is distinct from package_data because it is for making files available to other tools or packages, which may not even be Python at all (e.g. xeus in C++ or nteract in JavaScript). Having a consistent $PREFIX/share|etc scheme is our approach to that.

Hey all, I’m another Jupyter person who has been lately looking into the issues around data_files. The current thrust of the conversation is encouraging, especially since the ongoing work on using entry_points in place of data_files seems to have stalled a bit. Mostly over the fact that no matter what, using entry_points means having to scan through the metadata of potentially 1000s of packages on every startup looking for config and extensions, whereas using data_files means that currently we only have to check ~4 fixed paths. Thus, it seems like data_files has entry_points thoroughly beat in terms of performance and simplicity.

So the idea of using sysconfig.get_path("data") or similar to get a single fixed dir (per scheme) to install files to sounds appealing, since it should be even simpler and faster than data_files. My question, though, is how would we actually implement installation to the "data" dir? I can think of a bunch of hacky ways to copy files during install. But in general those copies won’t automagically get included in the package metadata, right? Which means that pip uninstall won’t work.

So what’s the correct way?

Put them in {distribution}-{version}.data/data/ in the wheel. If you’re asking what build backend (e.g. setuptools/flit) arguments are needed to do that, I don’t know, but if the backend doesn’t let you do this right now, that’s just a backend feature request - in the context of this thread, the important point is that it doesn’t need a new standard to make this work.

Cool. I’ll go try that out. The only other wrinkle I can think of is that the files will need to end up in something like ${datarootdir}/share/jupyter, rather than the base ${datarootdir}. So in terms of the wheel schema you refer to, could you stash data at:

{distribution}-{version}.data/data/share/jupyter

Is that valid? I just read through the 2 wheel PEPs (PEP 427 and PEP 491) and I’m still unclear.

The PEP says each directory contains a “subtree” so I believe this is completely valid.
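To make the “subtree” reading concrete, here is a small sketch of how an archive name under the .data directory maps to an install location. The function name install_destination is invented for this example, and it only models the mapping (key directory to scheme path, rest of the subtree preserved), not what any particular installer actually does:

```python
import sysconfig
from pathlib import Path


def install_destination(archive_name, scheme_paths=None):
    """Map '{dist}-{ver}.data/<key>/<subtree>' to a path under the scheme's <key>."""
    paths = sysconfig.get_paths() if scheme_paths is None else scheme_paths
    parts = archive_name.split("/")
    # Anything outside '<name>.data/<key>/...' is not a .data entry.
    if len(parts) < 3 or not parts[0].endswith(".data"):
        return None
    key, subtree = parts[1], parts[2:]
    return Path(paths[key]).joinpath(*subtree)


# With the default scheme, this lands under sysconfig.get_path("data").
print(install_destination("ipykernel-1.0.data/data/share/jupyter/kernels/python3/kernel.json"))
```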

I’m still unclear on whether setuptools should allow users to place things wherever they like under sys.prefix (ie users can do whatever they want to {distribution}-{version}.data/data in the wheel)? Given that we can’t find any real categories of files, like docs, man pages, etc, it’s what makes sense to me.
Do there need to be restrictions on where they can be placed though? Do we need to discourage users trying to use data files as a backdoor to install files that should be installed some other way (eg trying to put scripts in bin/)? Perhaps pip should raise a warning when there is any overlap between directories under the data category and the destinations for the other 7 categories of paths in sysconfig? Ideally this warning would get raised at packaging time, but we would have to assume the default versions of each of the other 7 path categories, or read them from sysconfig at packaging time, but then whether the warning gets raised depends on the packager’s Python installation. Maybe that’s fine though?
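A sketch of what such an overlap check might look like, assuming it reads the other categories from sysconfig (or from a dictionary of paths passed in). The name overlapping_categories is hypothetical and this is not something pip or setuptools does today:

```python
import os
import sysconfig


def overlapping_categories(dest_rel, paths=None):
    """List the non-"data" path categories that a data-relative destination falls inside."""
    paths = sysconfig.get_paths() if paths is None else paths
    target = os.path.normpath(os.path.join(paths["data"], dest_rel))
    hits = []
    for name, location in paths.items():
        if name == "data":
            continue
        location = os.path.normpath(location)
        try:
            inside = os.path.commonpath([target, location]) == location
        except ValueError:  # e.g. paths on different drives on Windows
            inside = False
        if inside:
            hits.append(name)
    return sorted(hits)


# A data file aimed at bin/ would be flagged as landing in "scripts" on a typical POSIX layout.
print(overlapping_categories("bin/tool"))
```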

What should pip do about file conflicts? I don’t think that we need to discuss this here because there’s already a pip issue about the problem: https://github.com/pypa/pip/issues/4625

What should happen to data_files with develop installs? Symlinks have been around on Windows since Vista I believe. Are symlinks well supported enough at this point in time that we could use them for data files in develop installs?

Do we think that a replacement to data files needs to support allowing different destinations for files based on platform or different files to be installed depending on the platform? The fact that data_files can’t do this currently is apparently one of the reasons we aren’t keen on the current implementation of them.

Thank you for everyone’s input into this discussion so far. With everyone’s feedback, and many of @pf_moore’s suggestions in particular, I’ve been able to create my first attempt at writing up a proposed solution to this problem that focuses on the main use case that we’ve identified, which is for packages to share data or configuration files in common directories with applications installed in the Python environment. The proposal also attempts to consider applications that may want different files in different places based on the platform that it is installed to.

We would add two new path names to sysconfig; the share path and the config path.
The share path would map to the following locations dependent on the scheme:

  • nt: f"{base}/Share"
  • nt_user: f"{userbase}/Share"
  • posix_prefix and posix_home: f"{base}/share"
  • posix_user and osx_framework_user: f"{userbase}/share"

The config path would map to the following locations dependent on the scheme:

  • nt: f"{base}/Config"
  • nt_user: f"{userbase}/Config"
  • posix_prefix and posix_home: f"{base}/etc"
  • posix_user and osx_framework_user: f"{userbase}/etc"
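Written out as data, the proposed mapping would look roughly like this. None of it exists in sysconfig today: PROPOSED_PATHS and proposed_path are illustrative names only, and the scheme keys use the names sysconfig already defines (posix_prefix for the default POSIX scheme):

```python
# Hypothetical additions to the sysconfig install schemes (not part of sysconfig today).
PROPOSED_PATHS = {
    "share": {
        "nt": "{base}/Share",
        "nt_user": "{userbase}/Share",
        "posix_prefix": "{base}/share",
        "posix_home": "{base}/share",
        "posix_user": "{userbase}/share",
        "osx_framework_user": "{userbase}/share",
    },
    "config": {
        "nt": "{base}/Config",
        "nt_user": "{userbase}/Config",
        "posix_prefix": "{base}/etc",
        "posix_home": "{base}/etc",
        "posix_user": "{userbase}/etc",
        "osx_framework_user": "{userbase}/etc",
    },
}


def proposed_path(name, scheme, base, userbase):
    """Expand one proposed path template for the given scheme."""
    return PROPOSED_PATHS[name][scheme].format(base=base, userbase=userbase)


print(proposed_path("config", "posix_user", base="/usr", userbase="/home/me/.local"))
```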

To utilise these locations through setuptools, a new keyword argument to setup() would be added called shared_files. This argument would take a dictionary as its value that would look like the following:

shared_files={
    "<platform_pattern>": {
        "<path_name>": {
            "<dest_path>": [
                "<src_path>",
            ],
        },
    },
},

For example:

shared_files={
    "any": {
        "config": {
            "jupyter/nbconfig/notebook.d": [
                "widgetsnbextension/widgetsnbextension.json",
            ],
        },
        "share": {
            "jupyter/kernels/python3": glob("kernelspec/*"),
        },
    },
},

or as a setup.cfg:

[options.shared_files.any.config]
jupyter/nbconfig/notebook.d = widgetsnbextension/widgetsnbextension.json

[options.shared_files.any.share]
jupyter/kernels/python3 = kernelspec/kernel.json, kernelspec/logo-32x32.png, kernelspec/logo-64x64.png

These examples would result in the following files in the wheel:

{distribution}-{version}.data/config/jupyter/nbconfig/notebook.d/widgetsnbextension.json
{distribution}-{version}.data/share/jupyter/kernels/python3/kernel.json
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-32x32.png
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-64x64.png

  • <platform_pattern> is a glob style pattern that matches against the platform tag of the resulting wheel (defined in PEP-425 as "distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _" and presumably soon to be "sysconfig.get_platform() with all hyphens - and periods . replaced with underscore _"). By making this a glob style pattern, users can do things like "win*" to match against all Windows platforms or "linux_*" to match all Linux platforms. The "any" pattern will match against all platforms. If multiple patterns match against a platform, the union of all <dest_path>-<src_path> pairs will be used. If this causes a destination file to have multiple matching source paths, an error will be raised during the packaging process (though raising this error during the packaging of an sdist would require knowledge of all possible platform tags or for setuptools to have a way of determining if two patterns overlap. Yikes!). Exposing platform tags in this way will require additional documentation to provide clarity around the possible values (and/or layouts) of platform tags.
  • <path_name> is a subset of the result of sysconfig.get_path_names(). To begin with, only “config” and “share” would be supported. The other path names that exist, excluding “data”, have support through other actively maintained keyword arguments to setuptools. The data_files keyword can install to “data”, but it will remain deprecated. The “data” path won’t be supported through shared_files to prevent installation to non-standard locations in sys.prefix and to places supported by other keywords.
  • <dest_path> is the location that the listed files will be installed to, relative to sysconfig.get_path(<path_name>, ...). Therefore the location of a file can be queried for at run time by calling sysconfig.get_path(<path_name>, ...), using whichever schemes the application decides to support. Absolute destination paths are not supported. Paths that include .. are not supported. In both these cases, setuptools will raise an exception during parsing of the configuration.
  • <src_path> is the location of a file to be installed into the relevant <dest_path>. Symlinks are not supported to prevent the inclusion of files that exist outside of sys.prefix. Each source file will be included in an sdist without needing to be specified in a MANIFEST.in file.
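To check that the rules above hang together, here is a rough sketch of how the shared_files mapping could be resolved for one platform tag. Everything in it is hypothetical: shared_files is only a proposal, and resolve_shared_files and SUPPORTED_PATH_NAMES are names invented for this example rather than anything in setuptools:

```python
import fnmatch
import ntpath
import posixpath

SUPPORTED_PATH_NAMES = {"config", "share"}


def resolve_shared_files(shared_files, platform_tag):
    """Map wheel-relative destinations (under .data/) to source paths for one platform tag."""
    resolved = {}
    for pattern, by_path_name in shared_files.items():
        # "any" matches every platform; other keys are glob patterns on the platform tag.
        if pattern != "any" and not fnmatch.fnmatchcase(platform_tag, pattern):
            continue
        for path_name, by_dest in by_path_name.items():
            if path_name not in SUPPORTED_PATH_NAMES:
                raise ValueError(f"unsupported path name: {path_name!r}")
            for dest_path, sources in by_dest.items():
                if posixpath.isabs(dest_path) or ntpath.isabs(dest_path):
                    raise ValueError(f"absolute destination not allowed: {dest_path!r}")
                if ".." in dest_path.replace("\\", "/").split("/"):
                    raise ValueError(f"'..' not allowed in destination: {dest_path!r}")
                for src in sources:
                    target = posixpath.join(path_name, dest_path, posixpath.basename(src))
                    # Union of all matching sections; two sources for one target is an error.
                    if resolved.setdefault(target, src) != src:
                        raise ValueError(f"conflicting sources for {target!r}")
    return resolved
```

Note that this can only detect a conflict for the platform tag it is given, which is exactly the sdist problem mentioned in the <platform_pattern> bullet.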

When a develop install is performed, each individual destination file will be symlinked to the relevant <src_path> on platforms that support it, or copied on platforms that do not.
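The symlink-or-copy fallback could be as simple as the following sketch. The helper name link_or_copy is made up, and replacing an existing destination is an assumption of this example rather than something the proposal specifies:

```python
import os
import shutil


def link_or_copy(src, dest):
    """Symlink dest to src if the platform allows it, otherwise copy the file."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if os.path.lexists(dest):
        os.remove(dest)  # assumption: a develop install replaces its own earlier file
    try:
        os.symlink(os.path.abspath(src), dest)
        return "symlink"
    except (OSError, NotImplementedError):
        # e.g. Windows without the privilege to create symlinks
        shutil.copy2(src, dest)
        return "copy"
```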

This proposal does not cover the possibility of two packages installing conflicting files to the same destination, and defers to the pip issue #4625.

Variations considered:

  • Allow the copying of subtrees to a destination path: To make configuration less verbose, rather than specifying individual source files and their destination, users could specify source directories to allow the copying of the whole directory along with subdirectories. Using this type of configuration would require the internals of setuptools to translate the configuration to the “per file source” data structure proposed, to allow for individual files to be symlinked in a develop install and for individual files to be removed from directories that are shared across many packages during the uninstall of a package. Although this alternative method of configuration is less verbose, it may result in unintended files ending up in the destination directory (eg .DS_Store files).
  • Specify source files with MANIFEST.in style syntax: Given that these source files are included in an sdist without further configuration, it might make sense for them to be configured in a similar way. MANIFEST.in files strike a good middle ground of being explicit without being too verbose. The difference in needs between shared_files and MANIFEST.in though is that the source and destination of things specified in a MANIFEST.in are both relative to the root directory. With shared_files, the destination of a file may not share the same directory hierarchy as the source file. For example, it is less clear whether the destination layout of the configuration "jupyter/kernels/python3": ["include kernels/*/kernel.json"], should be jupyter/kernels/python3/kernels/{x,y,z,...}/kernel.json or jupyter/kernels/python3/kernel.json.
  • Alternative names for shared_files: Given that the primary use case for this new functionality is to integrate with applications in the Python environment, the names application_files, and application_data_files make this use case clear. However, despite the discussions we’ve had on the topic, there may well be other uses for these additional files that we have not considered. Therefore including “application” in the name might limit how people think of these files unnecessarily. Another alternative name, shared_data_files, makes the shared nature of these additional files clear. However using “data files” in the name might create confusion with the “data” path name in sysconfig. A user may ask “why do my shared data files end up in the ‘config’ and ‘share’ paths rather than the ‘data’ path”.
  • Use exact platform tags instead of a <platform_pattern>: A downside of allowing patterns is that we can create conflicts where two different source files can map to a single destination. Furthermore, there isn’t a great mechanism for exclusion of files. If a user were to specify a file in "win*", the user cannot prevent that file from being included in a more specific "win_amd64" section without splitting up the "win*" section into individual platform tags. These downsides, plus a more complicated mapping scheme, were accepted in order to serve the more common use case of OS specific file schemes over platform specific ones. However we could choose to simplify the method of configuration and say that a single platform tag (or perhaps a tuple of platform tags) maps to a single dictionary of sources to destinations (if using tuples, a platform tag should exist as a key only once).
  • Policing the directories created immediately under the location of a path name: The idea here was that because these shared files are supposed to be consumed by an application, perhaps an application would have to advertise that it accepts these files for an installer to allow them to be installed. Such strictness might help prevent the abuse of shared_files to install things in unusual places by, for example, requiring the shared files for an application to be under a subdirectory in the destination named after the application. However this puts a lot of extra responsibility on installers. It can also be easily worked around by a package simply advertising that it is an application and definitely wants to install those files where it wants to.

The config path would map to the following locations dependent on the scheme:

  • posix_prefix and posix_home : f"{base}/etc"
  • posix_user and osx_framework_user : f"{userbase}/etc"

I don’t know about macOS but this is really not what one’d expect on an XDG platform. As @aragilar said earlier, using something like appdirs would allow us to avoid having to declare these platform-specific paths in the standard we’re discussing here. There’s a new fork called platformdirs where the specification regarding those paths can reside. What do you think, @Julian and @ofek?

I think the question comes down to how much we want these shared files to interact with the rest of the system. Keeping them out of XDG compliant locations means that only those applications that are Python aware (ie will be querying for these file locations through sysconfig.get_path()) can use them.
User installs of packages don’t really have an effect on a user’s global environment unless they’ve specifically configured that to happen by adding the relevant folders to *PATH environment variables (eg by adding ~/.local/bin to PATH). So I think it’s reasonable to require a user to opt into making these additional files global as well, by setting the relevant *PATH-like variable for whatever application they want to pick up those files.

To me, the issue is that Python packaging should only have control of things under a Python installation’s prefix, otherwise a package installed multiple times in separate virtual environments will fight against each other and produce terrible results, especially when different versions are installed. And allowing packages to install files outside of the prefix defeats arguably the reason virtual environments exist in the first place, so what you want (pip install anywhere and just install to some XDG location) will never be acceptable.

But. What if a Python installation does control ~/.local? This is actually already a thing: PEP 370. When doing a user-scheme package install on POSIX, the operating Python prefix is basically ~/.local, so it is not inconceivable to add additional locations to the scheme that conform to XDG. Those directories should map to XDG locations only when a user-scheme installation is performed (e.g. pip install --user), so the environment separation promise is not broken (and we should have similar mechanisms for basic installation separation like how PEP 370 separates packages installed for different Python versions). And when another scheme is used, those locations should map to directories inside a Python environment instead, so the virtual environments promise is not broken.

The locations I’ve proposed already match the XDG spec for system installs, so I think we’re only talking about user scheme installs.
I don’t think we can add locations that both conform to the XDG spec and split by Python version. If we were following the XDG spec strictly, we can only put things under ~/.config/subdir/filename (where the subdir/filename bit would be what the package defines). So there’s no space to add a directory based on the Python version. We can follow the spec a bit more loosely and put things under ~/.config/pythonX.Y, but then what if Python itself ever wants to put config files somewhere? It also means that because we aren’t following the spec exactly, applications either need to know to look under the pythonX.Y subfolder (which is maybe fine because the applications we’re talking about are supposed to be “Python aware”) or the user needs to opt into these files being used by setting an appropriate environment variable (maybe some *_PATH like variable for the application or $XDG_CONFIG_HOME).
I realise that ~/.local/etc isn’t a great place for these files either, but I don’t know if there is a great place.

Python already allows packages to put files directly under ~/.local/include/pythonX.Y without the per-package qualifier, so I don’t think that’s an issue.

That’s up to how the spec can be interpreted IMO. We can define an application NAME installed under Python version X.Y to be named pythonX.Y/NAME for XDG. If that does not work for XDG interoperations, no amount of work can ever make Python packaging compatible with XDG, and this entire topic is a dead end, unfortunately.

Tzu-ping Chung wrote:

To me, […] Python packaging should only have control of things under
a Python installation’s prefix, […] virtual environments will fight
[…] And allowing packages to install files outside of the prefix
defeats arguably the reason virtual environments exist in the first place

On GNU/Linux I’ve been seeing many new applications written in Python
in recent years. One responsibility of Python packaging
is to install the files to a fake root/base that can be packaged
by downstream distributions. IMHO virtual environments are sandboxing
hacks and they should adapt to the system, not the other way around,
for example on XDG it can set the XDG_* variables.

Ashley Whetter wrote:

The locations I’ve proposed already match the XDG spec for system installs,
so I think we’re only talking about user scheme installs.

Per the XDG specifications:

If $XDG_CONFIG_DIRS is either not set or empty,
a value equal to /etc/xdg should be used.

At present, everything in a --user install goes under the ‘user base’ directory - ~/.local by default on Linux, but controllable with the PYTHONUSERBASE environment variable. This is like the prefix for non-user installs.

The old distutils docs explicitly say that with --user, “Files will be installed into subdirectories of site.USER_BASE”. Pip’s docs don’t quite guarantee that nothing will be installed outside that, but it seems like a fairly obvious expectation, especially as it’s been that way for over a decade already.

So, while I’m generally in favour of the XDG base directories, I don’t think ~/.config is a viable option here. The XDG spec also says it’s configurable with XDG_CONFIG_HOME, which is even harder to reconcile with configuring PYTHONUSERBASE.
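To illustrate why the two are hard to reconcile, here is a sketch that resolves both locations. The xdg_config_home helper is written for this example (it is not a stdlib function) and follows the XDG rule that an unset, empty, or relative value falls back to ~/.config; site.getuserbase() is the existing stdlib call that honours PYTHONUSERBASE:

```python
import os
import site


def xdg_config_home(environ=None):
    """Resolve the XDG config directory: $XDG_CONFIG_HOME, else ~/.config."""
    environ = os.environ if environ is None else environ
    value = environ.get("XDG_CONFIG_HOME", "")
    # The XDG spec treats unset, empty, and relative values as invalid.
    if value and os.path.isabs(value):
        return value
    return os.path.join(os.path.expanduser("~"), ".config")


# Two independent knobs: changing one environment variable does not move the other.
print("user base :", site.getuserbase())
print("XDG config:", xdg_config_home())
```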

In a similar vein, prefix usually defaults to /usr or /usr/local for systemwide installation, so the typical /etc directory for systemwide config is outside the prefix.

Maybe if you need config files to be installable, they can just be considered data files. E.g. systemd unit files are found in both config and data directories: the former is intended for manually created units, the latter for units installed as part of packages.

So, while I’m generally in favour of the XDG base directories,
I don’t think ~/.config is a viable option here. The XDG spec also says
it’s configurable with XDG_CONFIG_HOME, which is even harder
to reconcile with configuring PYTHONUSERBASE.

I interpret the spec as the user config directory is at XDG_CONFIG_HOME
which falls back to ~/.config if undefined. Similar was said for
system-wide. If we are to support configuration files on XDG-compliant
systems, we would be implicitly installing them inside /etc/xdg
or ~/.config most of the time. There would be no point defining
a Python packaging specific location since the installed application
would not be expecting such non-standard location.

Maybe if you need config files to be installable,
they can just be considered data files.

From a packaging perspective, probably. On the other hand, a user should
never need to edit data files, while the config is either written
by the program or directly by the user. It is common to see application
offering an option to write the default config or automatically do it,
and this is what we have been doing without the proposal. One other
concern is that if the config file is managed by the package manager
there should be an option to override or keep it upon update.

Personally I am fine with how configuration files are handled as-is,
and I’d love to see a standard making it easier and more portable,
as long as it is semantically correct and respectful to the surrounding
environment.

Personally, I think this is the best approach, even given the possibility of creating a config file at install time.

Users have been known to delete the config file and still expect the program to launch, so if you can handle that, you can handle a first-launch with no config being installed.

That approach is appropriate for configuration files that users can edit, but the types of configuration files that the use case talks about are those that a package can provide, unedited, to integrate itself with a Python aware application. The configuration is owned by a package to dictate either how it integrates with an application or how an application should behave when this package is installed. The configuration files get installed and uninstalled with a package. The files aren’t for a user to edit.
Not conforming to the XDG spec also makes sense for a similar reason. The XDG config directory is a fairly “user facing” location and often contains files for editing by a user. However, the expectation outlined in the proposal would be for “Python aware” applications to get these files by querying sysconfig, rather than looking in a location determined by the XDG spec. Perhaps we shouldn’t call these files “config files” to prevent this misunderstanding.

Side questions: To the “decision makers” in this thread, although this thread is proving useful for collecting feedback, I’m feeling as though it’s not the best place to iterate on the proposal itself. Is there an existing proposal process that we can promote this discussion to? Is this discussion ready for promotion to that process? Is a PEP appropriate?