Should there be a new standard for installing arbitrary data files?

Hey all, I’m another Jupyter person who has lately been looking into the issues around data_files. The current thrust of the conversation is encouraging, especially since the ongoing work on using entry_points in place of data_files seems to have stalled a bit, mostly because no matter what, using entry_points means scanning the metadata of potentially thousands of packages on every startup to find config and extensions, whereas with data_files we currently only have to check ~4 fixed paths. Thus, it seems like data_files has entry_points thoroughly beat in terms of performance and simplicity.

So the idea of using sysconfig.get_path("data") or similar to get a single fixed dir (per scheme) to install files to sounds appealing, since it should be even simpler and faster than data_files. My question, though, is how would we actually implement installation to the "data" dir? I can think of a bunch of hacky ways to copy files during install, but in general those copies won’t automagically get included in the package metadata, right? Which means that pip uninstall won’t remove them.
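
A minimal sketch of the lookup being described, assuming the application keeps its files under a share/jupyter subdirectory of the "data" path (the subdirectory is an illustrative assumption, taken from the layout discussed further down):

```python
import sysconfig
from pathlib import Path

# "data" is one of the standard sysconfig path names. For the default
# scheme it resolves to the root of the installation (the environment
# directory when running inside a virtual environment).
data_root = Path(sysconfig.get_path("data"))

# Assumed application-specific subdirectory, for illustration only.
jupyter_dir = data_root / "share" / "jupyter"

print(data_root)
print(jupyter_dir)
```

An application would only need to check this one directory per scheme it supports, rather than scanning package metadata.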

So what’s the correct way?

Put them in {distribution}-{version}.data/data/ in the wheel. If you’re asking what build backend (e.g. setuptools/flit) arguments are needed to do that, I don’t know, but if the backend doesn’t let you do this right now, that’s just a backend feature request - in the context of this thread, the important point is that it doesn’t need a new standard to make this work.

Cool. I’ll go try that out. The only other wrinkle I can think of is that the files will need to end up in something like ${datarootdir}/share/jupyter, rather than the base ${datarootdir}. So in terms of the wheel schema you refer to, could you stash data at:

{distribution}-{version}.data/data/share/jupyter

Is that valid? I just read through the two wheel PEPs (PEP 427 and PEP 491) and I’m still unclear.

The PEP says each directory contains a “subtree” so I believe this is completely valid.
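
To illustrate the "subtree" reading, here is a rough sketch of how an entry under the .data directory maps onto an install location: the first component after .data is treated as a key into the installer's path table and the rest is kept as a relative subtree. The function name and the example distribution name are hypothetical, and real installers do more (scheme selection, RECORD handling, and so on).

```python
import sysconfig
from pathlib import Path, PurePosixPath


def data_entry_destination(wheel_entry: str, scheme_paths: dict) -> Path:
    """Map '<dist>-<ver>.data/<key>/<rest>' to '<scheme_paths[key]>/<rest>'."""
    parts = PurePosixPath(wheel_entry).parts
    if len(parts) < 3 or not parts[0].endswith(".data"):
        raise ValueError(f"not a .data entry: {wheel_entry!r}")
    key, rest = parts[1], parts[2:]
    # Everything after the key is preserved as a subtree of the target path.
    return Path(scheme_paths[key], *rest)


paths = sysconfig.get_paths()  # path table for the default scheme
dest = data_entry_destination(
    "example-1.0.data/data/share/jupyter/kernels/python3/kernel.json", paths
)
print(dest)
```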

I’m still unclear on whether setuptools should allow users to place things wherever they like under sys.prefix (ie users can do whatever they want to {distribution}-{version}.data/data in the wheel). Given that we can’t find any real categories of files, like docs, man pages, etc, it’s what makes sense to me.
Do there need to be restrictions on where they can be placed, though? Do we need to discourage users from using data files as a backdoor to install files that should be installed some other way (eg trying to put scripts in bin/)? Perhaps pip should at least raise a warning when there is any overlap between directories under the data category and the destinations for the other 7 categories of paths in sysconfig? Ideally this warning would get raised at packaging time, but then we would have to either assume the default values of each of the other 7 path categories, or read them from sysconfig at packaging time, in which case whether the warning gets raised depends on the packager’s Python installation. Maybe that’s fine though?

What should pip do about file conflicts? I don’t think that we need to discuss this here because there’s already a pip issue about the problem: https://github.com/pypa/pip/issues/4625

What should happen to data_files with develop installs? Symlinks have been around on Windows since Vista I believe. Are symlinks well supported enough at this point in time that we could use them for data files in develop installs?

Do we think that a replacement to data files needs to support allowing different destinations for files based on platform or different files to be installed depending on the platform? The fact that data_files can’t do this currently is apparently one of the reasons we aren’t keen on the current implementation of them.

Thank you for everyone’s input into this discussion so far. With everyone’s feedback, and many of @pf_moore’s suggestions in particular, I’ve been able to create my first attempt at writing up a proposed solution to this problem that focuses on the main use case that we’ve identified, which is for packages to share data or configuration files in common directories with applications installed in the Python environment. The proposal also attempts to consider applications that may want different files in different places based on the platform that it is installed to.

We would add two new path names to sysconfig; the share path and the config path.
The share path would map to the following locations dependent on the scheme:

  • nt: f"{base}/Share"
  • nt_user: f"{userbase}/Share"
  • posix_prefix and posix_home: f"{base}/share"
  • posix_user and osx_framework_user: f"{userbase}/share"

The config path would map to the following locations dependent on the scheme:

  • nt: f"{base}/Config"
  • nt_user: f"{userbase}/Config"
  • posix_prefix and posix_home: f"{base}/etc"
  • posix_user and osx_framework_user: f"{userbase}/etc"
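
As a sketch only, the two proposed mappings can be written as a lookup table. Neither "share" nor "config" exists in sysconfig today; the table and the proposed_path helper are hypothetical, and the POSIX prefix scheme is assumed to be sysconfig's posix_prefix.

```python
# Proposed (not existing) sysconfig path names, keyed by install scheme.
PROPOSED_PATHS = {
    "share": {
        "nt": "{base}/Share",
        "nt_user": "{userbase}/Share",
        "posix_prefix": "{base}/share",
        "posix_home": "{base}/share",
        "posix_user": "{userbase}/share",
        "osx_framework_user": "{userbase}/share",
    },
    "config": {
        "nt": "{base}/Config",
        "nt_user": "{userbase}/Config",
        "posix_prefix": "{base}/etc",
        "posix_home": "{base}/etc",
        "posix_user": "{userbase}/etc",
        "osx_framework_user": "{userbase}/etc",
    },
}


def proposed_path(name: str, scheme: str, base: str, userbase: str) -> str:
    """Expand the proposed template for one path name and scheme."""
    return PROPOSED_PATHS[name][scheme].format(base=base, userbase=userbase)


print(proposed_path("config", "posix_prefix", "/usr", "/home/me/.local"))
print(proposed_path("share", "posix_user", "/usr", "/home/me/.local"))
```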

To utilise these locations through setuptools, a new keyword argument to setup() would be added called shared_files. This argument would take a dictionary as its value that would look like the following:

shared_files={
    "<platform_pattern>": {
        "<path_name>": {
            "<dest_path>": [
                "<src_path>",
            ],
        },
    },
},

For example:

shared_files={
    "any": {
        "config": {
            "jupyter/nbconfig/notebook.d": [
                "widgetsnbextension/widgetsnbextension.json",
            ],
        },
        "share": {
            "jupyter/kernels/python3": glob("kernelspec/*"),
        },
    },
},

or as a setup.cfg:

[options.shared_files.any.config]
jupyter/nbconfig/notebook.d = widgetsnbextension/widgetsnbextension.json

[options.shared_files.any.share]
jupyter/kernels/python3 = kernelspec/kernel.json, kernelspec/logo-32x32.png, kernelspec/logo-64x64.png

These examples would result in the following files in the wheel:

{distribution}-{version}.data/config/jupyter/nbconfig/notebook.d/widgetsnbextension.json
{distribution}-{version}.data/share/jupyter/kernels/python3/kernel.json
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-32x32.png
{distribution}-{version}.data/share/jupyter/kernels/python3/logo-64x64.png

Each placeholder is defined as follows:

  • <platform_pattern> is a glob style pattern that matches against the platform tag of the resulting wheel (defined in PEP 425 as "distutils.util.get_platform() with all hyphens - and periods . replaced with underscore _" and presumably soon to be "sysconfig.get_platform() with all hyphens - and periods . replaced with underscore _"). By making this a glob style pattern, users can do things like "win*" to match against all Windows platforms or "linux_*" to match all Linux platforms. The "any" pattern will match against all platforms. If multiple patterns match against a platform, the union of all <dest_path>-<src_path> pairs will be used. If this causes a destination file to have multiple matching source paths, an error will be raised during the packaging process (though raising this error during the packaging of an sdist would require knowledge of all possible platform tags or for setuptools to have a way of determining if two patterns overlap. Yikes!). Exposing platform tags in this way will require additional documentation to provide clarity around the possible values (and/or layouts) of platform tags.
  • <path_name> is a subset of the result of sysconfig.get_path_names(). To begin with, only “config” and “share” would be supported. The other path names that exist, excluding “data”, have support through other actively maintained keyword arguments to setuptools. The data_files keyword can install to “data”, but it will remain deprecated. The “data” path won’t be supported through shared_files to prevent installation to non-standard locations in sys.prefix and to places supported by other keywords.
  • <dest_path> is the location that the listed files will be installed to, relative to sysconfig.get_path(<path_name>, ...). Therefore the location of a file can be queried for at run time by calling sysconfig.get_path(<path_name>, ...), using whichever schemes the application decides to support. Absolute destination paths are not supported. Paths that include .. are not supported. In both these cases, setuptools will raise an exception during parsing of the configuration.
  • <src_path> is the location of a file to be installed into the relevant <dest_path>. Symlinks are not supported to prevent the inclusion of files that exist outside of sys.prefix. Each source file will be included in an sdist without needing to be specified in a MANIFEST.in file.
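
The rules in the list above could be sketched roughly as follows. This is not setuptools code: resolve_shared_files is a hypothetical helper that takes the proposed shared_files dictionary and a platform tag, applies the glob matching, rejects absolute or ..-containing destinations, and raises when two sources map to the same destination file.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

SUPPORTED_PATH_NAMES = {"config", "share"}


def resolve_shared_files(shared_files: dict, platform_tag: str) -> dict:
    """Return {(path_name, dest_file): src_path} for one platform tag."""
    resolved = {}
    for pattern, by_path_name in shared_files.items():
        # "any" matches every platform; other keys are glob patterns.
        if pattern != "any" and not fnmatchcase(platform_tag, pattern):
            continue
        for path_name, by_dest in by_path_name.items():
            if path_name not in SUPPORTED_PATH_NAMES:
                raise ValueError(f"unsupported path name: {path_name!r}")
            for dest_dir, sources in by_dest.items():
                dest = PurePosixPath(dest_dir)
                if dest.is_absolute() or ".." in dest.parts:
                    raise ValueError(f"invalid destination: {dest_dir!r}")
                for src in sources:
                    key = (path_name, str(dest / PurePosixPath(src).name))
                    # Union of all matches; two different sources for one
                    # destination file is an error.
                    if resolved.get(key, src) != src:
                        raise ValueError(f"conflicting sources for {key[1]!r}")
                    resolved[key] = src
    return resolved


example = {
    "any": {
        "config": {
            "jupyter/nbconfig/notebook.d": [
                "widgetsnbextension/widgetsnbextension.json",
            ],
        },
    },
    "win*": {
        "share": {"jupyter/kernels/python3": ["kernelspec/kernel.json"]},
    },
}
print(resolve_shared_files(example, "linux_x86_64"))
print(resolve_shared_files(example, "win_amd64"))
```

Note that this only detects conflicts for a single, known platform tag, which is exactly the limitation mentioned above for sdists.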

When a develop install is performed, each individual destination file will be symlinked to the relevant <src_path> on platforms that support it, or copied on platforms that do not.
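
A rough sketch of that symlink-or-copy fallback, assuming a hypothetical helper link_or_copy (this is not how setuptools implements develop installs today):

```python
import os
import shutil
from pathlib import Path


def link_or_copy(src: Path, dest: Path) -> str:
    """Symlink dest to src if the platform allows it, otherwise copy."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.symlink(src.resolve(), dest)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example, Windows without the privilege needed to create symlinks.
        shutil.copy2(src, dest)
        return "copy"
```

With a copy, edits to the source file are not reflected until the install is repeated, which is the main behavioural difference between the two branches.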

This proposal does not cover the possibility of two packages installing conflicting files to the same destination, and defers to the pip issue #4625.

Variations considered:

  • Allow the copying of subtrees to a destination path: To make configuration less verbose, rather than specifying individual source files and their destination, users could specify source directories to allow the copying of the whole directory along with subdirectories. Using this type of configuration would require the internals of setuptools to translate the configuration to the “per file source” data structure proposed, to allow for individual files to be symlinked in a develop install and for individual files to be removed from directories that are shared across many packages during the uninstall of a package. Although this alternative method of configuration is less verbose, it may result in unintended files ending up in the destination directory (eg .DS_Store files).
  • Specify source files with MANIFEST.in style syntax: Given that these source files are included in an sdist without further configuration, it might make sense for them to be configured in a similar way. MANIFEST.in files strike a good middle ground of being explicit without being too verbose. The difference in needs between shared_files and MANIFEST.in though is that the source and destination of things specified in a MANIFEST.in are both relative to the root directory. With shared_files, the destination of a file may not share the same directory hierarchy as the source file. For example, it is less clear whether the destination layout of the configuration "jupyter/kernels/python3": ["include kernels/*/kernel.json"], should be jupyter/kernels/python3/kernels/{x,y,z,...}/kernel.json or jupyter/kernels/python3/kernel.json.
  • Alternative names for shared_files: Given that the primary use case for this new functionality is to integrate with applications in the Python environment, the names application_files, and application_data_files make this use case clear. However, despite the discussions we’ve had on the topic, there may well be other uses for these additional files that we have not considered. Therefore including “application” in the name might limit how people think of these files unnecessarily. Another alternative name, shared_data_files, makes the shared nature of these additional files clear. However using “data files” in the name might create confusion with the “data” path name in sysconfig. A user may ask “why do my shared data files end up in the ‘config’ and ‘share’ paths rather than the ‘data’ path”.
  • Use exact platform tags instead of a <platform_pattern>: A downside of allowing patterns is that we can create conflicts where two different source files can map to a single destination. Furthermore, there isn’t a great mechanism for exclusion of files. If a user were to specify a file in "win*", the user cannot prevent that file from being included in a more specific "win_amd64" section without splitting up the "win*" section into individual platform tags. These downsides, plus a more complicated mapping scheme, were accepted in order to serve the more common use case of OS specific file layouts over platform specific ones. However we could choose to simplify the method of configuration and say that a single platform tag (or perhaps a tuple of platform tags) maps to a single dictionary of sources to destinations (if using tuples, a platform tag should exist as a key only once).
  • Policing the directories created immediately under the location of a path name: The idea here was that because these shared files are supposed to be consumed by an application, perhaps an application would have to advertise that it accepts these files for an installer to allow them to be installed. Such strictness might help prevent the abuse of shared_files to install things in unusual places by, for example, requiring the shared files for an application to be under a subdirectory in the destination named after the application. However this puts a lot of extra responsibility on installers. It can also be easily worked around by a package simply advertising that it is an application and definitely wants to install those files where it wants to.

The config path would map to the following locations dependent on the scheme:

  • posix_prefix and posix_home : f"{base}/etc"
  • posix_user and osx_framework_user : f"{userbase}/etc"

I don’t know about macOS, but this is really not what one would expect on an XDG platform. As @aragilar said earlier, using something like appdirs would allow us to avoid having to declare these platform-specific paths in the standard we’re discussing here. There’s a new fork called platformdirs where the specification regarding those paths can reside. What do you think, @Julian and @ofek?

I think the question comes down to how much we want these shared files to interact with the rest of the system. Keeping them out of XDG compliant locations means that only those applications that are Python aware (ie will be querying for these file locations through sysconfig.get_path()) can use them.
User installs of packages don’t really have an effect on a user’s global environment unless they’ve specifically configured that to happen by adding the relevant folders to *PATH environment variables (eg by adding ~/.local/bin to PATH). So I think it’s reasonable to require a user to opt into making these additional files global as well, by setting the relevant *PATH-like variable for whatever application they want to pick up those files.

To me, the issue is that Python packaging should only have control of things under a Python installation’s prefix, otherwise a package installed multiple times in separate virtual environments will fight against each other and produce terrible results, especially when different versions are installed. And allowing packages to install files outside of the prefix defeats arguably the reason virtual environments exist in the first place, so what you want (pip install anywhere and just install to some XDG location) will never be acceptable.

But. What if a Python installation does control ~/.local? This is actually already a thing: PEP 370. When doing a user-scheme package install on POSIX, the operating Python prefix is basically ~/.local, so it is not inconceivable to add additional locations to the scheme that conform to XDG. Those directories should map to XDG locations only when a user-scheme installation is performed (e.g. pip install --user), so the environment separation promise is not broken (and we should have similar mechanisms for basic installation separation like how PEP 370 separates packages installed for different Python versions). And when another scheme is used, those locations should map to directories inside a Python environment instead, so the virtual environments promise is not broken.

The locations I’ve proposed already match the XDG spec for system installs, so I think we’re only talking about user scheme installs.
I don’t think we can add locations that both conform to the XDG spec and split by Python version. If we were following the XDG spec strictly, we can only put things under ~/.config/subdir/filename (where the subdir/filename bit would be what the package defines). So there’s no space to add a directory based on the Python version. We can follow the spec a bit more loosely and put things under ~/.config/pythonX.Y, but then what if Python itself ever wants to put config files somewhere? It also means that because we aren’t following the spec exactly, applications either need to know to look under the pythonX.Y subfolder (which is maybe fine because the applications we’re talking about are supposed to be “Python aware”) or the user needs to opt into these files being used by setting an appropriate environment variable (maybe some *_PATH like variable for the application or $XDG_CONFIG_HOME).
I realise that ~/.local/etc isn’t a great place for these files either, but I don’t know if there is a great place.

Python already allows packages to put files directly under ~/.local/include/pythonX.Y without the per-package qualifier, so I don’t think that’s an issue.

That’s up to how the spec can be interpreted IMO. We can define an application NAME installed under Python version X.Y to be named pythonX.Y/NAME for XDG. If that does not work for XDG interoperations, no amount of work can ever make Python packaging compatible with XDG, and this entire topic is a dead end, unfortunately.

Tzu-ping Chung wrote:

To me, […] Python packaging should only have control of things under
a Python installation’s prefix, […] virtual environments will fight
[…] And allowing packages to install files outside of the prefix
defeats arguably the reason virtual environments exist in the first place

On GNU/Linux I’ve been seeing many new applications written in Python
in recent years. One responsibility of Python packaging
is to install the files to a fake root/base that can be packaged
by downstream distributions. IMHO virtual environments are sandboxing
hacks and they should adapt to the system, not the other way around,
for example on XDG it can set the XDG_* variables.

Ashley Whetter wrote:

The locations I’ve proposed already match the XDG spec for system installs,
so I think we’re only talking about user scheme installs.

Per the XDG specifications:

If $XDG_CONFIG_DIRS is either not set or empty,
a value equal to /etc/xdg should be used.

At present, everything in a --user install goes under the ‘user base’ directory - ~/.local by default on Linux, but controllable with the PYTHONUSERBASE environment variable. This is like the prefix for non-user installs.

The old distutils docs explicitly say that with --user, “Files will be installed into subdirectories of site.USER_BASE”. Pip’s docs don’t quite guarantee that nothing will be installed outside that, but it seems like a fairly obvious expectation, especially as it’s been that way for over a decade already.

So, while I’m generally in favour of the XDG base directories, I don’t think ~/.config is a viable option here. The XDG spec also says it’s configurable with XDG_CONFIG_HOME, which is even harder to reconcile with configuring PYTHONUSERBASE.

In a similar vein, prefix usually defaults to /usr or /usr/local for systemwide installation, so the typical /etc directory for systemwide config is outside the prefix.

Maybe if you need config files to be installable, they can just be considered data files. E.g. systemd unit files are found in both config and data directories: the former is intended for manually created units, the latter for units installed as part of packages.

So, while I’m generally in favour of the XDG base directories,
I don’t think ~/.config is a viable option here. The XDG spec also says
it’s configurable with XDG_CONFIG_HOME, which is even harder
to reconcile with configuring PYTHONUSERBASE.

I interpret the spec as saying the user config directory is XDG_CONFIG_HOME,
which falls back to ~/.config if undefined. The same is said for the
system-wide directories. If we are to support configuration files on
XDG-compliant systems, we would be implicitly installing them inside /etc/xdg
or ~/.config most of the time. There would be no point defining
a Python packaging specific location since the installed application
would not be expecting such non-standard location.
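
For reference, the lookup order being described can be sketched like this. The fallbacks (~/.config and /etc/xdg) follow the XDG Base Directory spec as quoted in this thread; the xdg_config_dirs helper itself is hypothetical and takes the environment as a parameter so it can be exercised without touching the real one.

```python
import os
from pathlib import Path


def xdg_config_dirs(env=os.environ, home=None) -> list:
    """Return config search dirs, user dir first, per the XDG fallbacks."""
    home = Path(home) if home is not None else Path.home()
    # Unset or empty XDG_CONFIG_HOME falls back to ~/.config.
    user_dir = Path(env.get("XDG_CONFIG_HOME") or home / ".config")
    # Unset or empty XDG_CONFIG_DIRS falls back to /etc/xdg.
    system_dirs = (env.get("XDG_CONFIG_DIRS") or "/etc/xdg").split(":")
    return [user_dir] + [Path(d) for d in system_dirs if d]


print(xdg_config_dirs({}, home="/home/me"))
```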

Maybe if you need config files to be installable,
they can just be considered data files.

From a packaging perspective, probably. On the other hand, a user should
never need to edit data files, while the config is either written
by the program or directly by the user. It is common to see applications
offering an option to write the default config or automatically do it,
and this is what we have been doing without the proposal. One other
concern is that if the config file is managed by the package manager
there should be an option to override or keep it upon update.

Personally I am fine with how configuration files are handled as-is,
and I’d love to see a standard making it easier and more portable,
as long as it is semantically correct and respectful to the surrounding
environment.

Personally, I think this is the best approach, even given the possibility of creating a config file at install time.

Users have been known to delete the config file and still expect the program to launch, so if you can handle that, you can handle a first-launch with no config being installed.

That approach is appropriate for configuration files that users can edit, but the types of configuration files that the use case talks about is those that a package can provide, unedited, to integrate itself with a Python aware application. The configuration is owned by a package to dictate either how it integrates with an application or how an application should behave when this package is installed. The configuration files get installed and uninstalled with a package. The files aren’t for a user to edit.
Not conforming to the XDG spec also makes sense for a similar reason. The XDG directories are fairly “user facing” locations and often contain files for editing by a user. However, the expectation outlined in the proposal would be for “Python aware” applications to get these files by querying sysconfig, rather than looking in a location determined by the XDG spec. Perhaps we shouldn’t call these files “config files” to prevent this misunderstanding.

Side questions: To the “decision makers” in this thread, although this thread is proving useful for collecting feedback, I’m feeling as though it’s not the best place to iterate on the proposal itself. Is there an existing proposal process that we can promote this discussion to? Is this discussion ready for promotion to that process? Is a PEP appropriate?

There is already a standardized data directory for packages, I really really do not understand why you need to be able to install to arbitrary locations for your use-cases.

For package configuration, like the use-case you describe, there is also a more pythonic way, having some API to query things.
If you need something more dynamic, you also have entrypoints. Packages can register entrypoints with an arbitrary key, which you could then easily discover.
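
A sketch of what entry-point discovery looks like on the consuming side, using importlib.metadata. The group name in the usage line is made up for illustration; the hasattr check is there because the return type of entry_points() differs between Python versions.

```python
from importlib.metadata import entry_points


def discover(group: str) -> dict:
    """Return {name: EntryPoint} for every installed entry point in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, ())
    return {ep.name: ep for ep in selected}


# Hypothetical group name that an application might advertise.
plugins = discover("example_app.extensions")
print(sorted(plugins))
```

Each returned entry point can be loaded with its load() method; the cost is that discovery has to read the metadata of every installed distribution, which is the startup overhead raised at the top of this thread.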

If there isn’t any good enough API for you to interact with these mechanisms, importlib.metadata could be extended, but I wholeheartedly think that allowing packages to install to arbitrary locations is a big mistake.

This discussion has included the fact that the data directory in wheel’s package-1.0.data/data/ is only there because there was a same-named category in distutils. The category named data is not “a useful place to put your data” because, by mistake, virtualenv and non-virtualenv installations place those files in different locations.

Hence my suggestion for importlib.metadata to be extended. But even ignoring its existence, you can just install the data as a Python package, and you have a pretty good API to access it (importlib.resources), I really do not see the issue. The only new use-case covered by the proposals on this thread is installing to arbitrary locations.
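
For comparison, reading a file shipped inside a package with importlib.resources (Python 3.9+) looks roughly like this. The read_packaged_text helper and the package and file names in the comment are placeholders, not part of any real project.

```python
from importlib.resources import files


def read_packaged_text(package: str, resource: str) -> str:
    """Read a text resource that was installed inside an importable package."""
    return files(package).joinpath(resource).read_text(encoding="utf-8")


# Example call, assuming a hypothetical package "example_kernels" that
# ships a kernel.json next to its __init__.py:
#   read_packaged_text("example_kernels", "kernel.json")
```

This works when the consumer knows which package to ask; it does not by itself give an application one shared directory to scan.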

I just wanted to note that this is a problem for the Roundup Issue Tracker as well.
I just spent a couple of hours changing code to try to make it work with a wheel
format install and I am still not close to done.

We have 3 types of files that should not be installed under site-packages:

  • man pages
  • locale translations
  • template files

Setuptools supports entry_points which puts command line programs
on a normal user path. These other items need to be handled similarly IMO.

Thanks to entry_points the commands are available to the users, but the man pages for the commands are buried under site-packages/usr/share/man/… Where no man (command) has ever gone before. (Yes, the user can change MANPATH but there is no way for me (AFAIK) to advise the user that they need to do this when they run pip install roundup.)

Also gettext._default_localedir points to a useless directory, as the locale files aren’t installed there, so the application has to search to find the locale files. Distutils installed them (as data files) in the default locale directory (sys.prefix/share/locale…), so it “just worked”.

Also roundup has template files for the different tracker use cases. These are useful for the user to view. Having to grovel into .../python3.10/site-packages/.../share/roundup/templates rather than being available under /usr/share/roundup/... where other packages install similar materials fails the principle of least surprise, to say the least.

I am trying to figure out if I can detect when it’s being installed in a container or virtualenv so I can allow a wheel install. This will put the three types of files above in an isolated environment. Outside of those environments, I would use a standard install to make the template, locale and man page files available in their expected locations.
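
For the virtual environment half of that check, a small sketch is below. It relies on sys.prefix differing from sys.base_prefix inside a venv; very old virtualenv releases used sys.real_prefix instead, and detecting a container is a separate problem this does not attempt.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(in_virtual_environment())
```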
