Should there be a new standard for installing arbitrary data files?

In setuptools, the data_files keyword to setup() can be used to install files outside of the Python package. These files get installed to a user specified location relative to sys.prefix (or relative to site.USER_BASE for a user install). This keyword is considered deprecated.
I’m hoping to discuss whether there are use cases that justify a replacement of this functionality, and if so what that replacement might look like. If not then I hope that this discussion will at least provide clarification for users (and maybe maintainers!) about what the future of this functionality is, if there is indeed a future at all!
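
For anyone unfamiliar with the keyword, here is a rough sketch of the behaviour described above. The file names and the `resolve_data_files` helper are made up for illustration; the helper only mimics the “relative to sys.prefix” rule and is not how setuptools implements it.

```python
import os
import sys

# What a setup.py would pass as data_files: (target directory, [source files]).
# These names are hypothetical.
data_files = [
    ("share/man/man1", ["docs/mytool.1"]),
    ("share/applications", ["desktop/mytool.desktop"]),
]


def resolve_data_files(data_files, prefix=sys.prefix):
    """Return the paths that relative data_files entries would be installed to."""
    targets = []
    for directory, sources in data_files:
        for source in sources:
            # Only the file name is kept; the source directory is dropped.
            targets.append(os.path.join(prefix, directory, os.path.basename(source)))
    return targets


if __name__ == "__main__":
    for target in resolve_data_files(data_files):
        print(target)
```

For a user install, passing `prefix=site.USER_BASE` gives the equivalent locations.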

Note that as a new user I can only post two links so I have had to split links into follow up comments.

Why data_files is deprecated

I am no authority on why the data_files keyword is deprecated; I can only interpret what I’ve seen posted about it in the past. That’s what I’ll attempt to do here, but anyone who expressed opinions in the sources I use may have since changed their mind.

  • It exposes platform specific behaviour ([#1]): data_files are installed unconditionally when the package as a whole is installed. But the locations of these sorts of files, as well as the files themselves, are usually platform specific.
  • It goes outside the bounds of what a Python package is supposed to be ([#2]): Nothing in the Python ecosystem should make use of arbitrarily installed files because anything interacting with packages this way can be using some other mechanism within the bounds of the ecosystem. For example a file usually read from /etc/foo.conf.d could also be installed to and read from $VIRTUAL_ENV/foo.conf.d so that it can be overridden in a virtualenv.
  • It encourages usage of sudo pip ([#3]): A user seeing an absolute Unix-filesystem-like path might therefore conclude that a package can be installable directly onto the system.

Use cases

  • I’ll describe my use case first but it’s arguably a weak one. I work in visual effects (VFX). We use third party programs like Maya [#4] and Houdini [#5]. These programs have their own scripting languages that we make use of. However, given that the vast majority of code that we produce is Python, we distribute our code via pip and develop code with virtualenvs. The way that the third party programs we use find the source files written in their proprietary languages is via *_PATH variables. For example Maya finds *.mel files on MAYA_SCRIPT_PATH ([#6]) and Houdini finds *.vex files on HOUDINI_PATH ([#7]).

    • Should we be distributing these files with setuptools? Possibly not. We do have a build system for non-Python packages, so we could be installing these other files with it, alongside the Python package.
    • But how do these extra variables work with virtualenv? We have to extend what a virtualenv is to make it work. We have a virtualenv plugin that makes sure that these locations are on the relevant *_PATH when Maya/Houdini/etc is launched. So from a technical standpoint we’re already going off piste. From a user standpoint, it looks like a virtualenv and it quacks like a virtualenv so it’s a virtualenv.
    • We use virtualenvs because they’re familiar to people, especially newcomers and juniors. If data_files were to go away today, we would need to rethink things. These aren’t really strong enough reasons to make a specific case for support in setuptools though.
  • .desktop files (Mentioned in the comments of [#8]): These are platform specific. They only make sense when installed onto the system (or the user). They cannot be used from a virtualenv because they must exist in /usr/share/applications or ~/.local/share/applications ([#9]).

  • Man pages ([#10]): Similarly to .desktop files, these are platform specific. They only make sense when either installed to the system, or when installed into a virtualenv with the location included in $MANPATH, though this is an unusual case.

  • Files in /etc such as configuration files and systemd units ([#11] and [#12]): These are platform specific. They only make sense when installed to the system. This would be installing both to an absolute location and one outside of sys.prefix. This shouldn’t be getting installed on the system via pip so I don’t know how to justify that these files could be the responsibility of setuptools and not a distro’s package manager, other than it makes the distro packager’s job easier because they can pip install into the rpm/pacman/etc package without any extra steps or digging for supplementary files.

  • Binaries ([#13]): These only make sense when installed to the system or the user. They do not work in a virtualenv because they must exist on PATH to work, but this is only the case when a virtualenv is activated and a virtualenv does not have to be activated to be valid. Console scripts are valid in a package because they expose a function that is importable as a script. So other packages should be importing that function and using it directly. The equivalent for a binary is to do the same and expose that functionality through a library, even if it is a compiled one, even if there needs to be a layer on top of the compiled library to allow for the same import location if the name/location of the library itself is different on different platforms.
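
As a rough illustration of the *_PATH handling described in the first use case: the helper and the directory below are hypothetical (not the actual virtualenv plugin), and the sketch assumes the variable uses the platform’s usual path separator.

```python
import os
import sys


def prepend_to_path_var(env, name, directory):
    """Return a copy of env with directory placed first on the path-like variable name."""
    updated = dict(env)
    existing = updated.get(name, "")
    # Drop empty entries and any earlier occurrence of the same directory.
    parts = [p for p in existing.split(os.pathsep) if p and p != directory]
    updated[name] = os.pathsep.join([directory] + parts)
    return updated


if __name__ == "__main__":
    # Hypothetical location for .mel files installed via data_files.
    mel_dir = os.path.join(sys.prefix, "share", "maya", "scripts")
    env = prepend_to_path_var(os.environ, "MAYA_SCRIPT_PATH", mel_dir)
    print(env["MAYA_SCRIPT_PATH"])
```

The returned mapping could then be passed as the environment when launching the third party program.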

Other notes

  • setuptools documents that it does not pass data_files values to wheel, but from testing, it does for relative values, both through pip and through python setup.py bdist_wheel. Absolute locations are not included, however.
    • Therefore the only way that these can be getting installed at the moment is if someone is doing python setup.py install.
    • As can be deduced from the above, the wheel format does already support data files installed to a location relative to sys.prefix.
  • I wonder if the desire to package .desktop files, man pages, and similar files is a distribution problem. If a package provides extra functionality through these sorts of files, how should a package maintainer advertise that to the packagers of the package on a Linux distro for example? It’s up to the distro packagers to read documentation and make sure that they’ve packaged the complete package properly. It’s not a problem for setuptools to solve. But I think that there’s the expectation from a packager that if it’s a Python package then there will be nothing else to install outside of doing a pip install and doing anything specific to the distro like moving license files to the correct place.
  • I’ll admit that as I started writing this I was thinking “how are these files not supported properly? It would be so simple!”. After compiling this post, I see why. But as I said in the beginning, even if I can provide clarity for future direction then this long post has been worthwhile! I now think of data_files more of a left over feature from a time when pip was supposed to be a full blown package manager. They are, however, a useful backdoor for private use.
    • The fact that the wheel format already supports data files placed anywhere relative to sys.prefix is, I think, what made me and others wonder why these files are not properly supported. setuptools/pip would basically be doing a (possibly platform-specific conditional) pass-through of the file. Everything else “just works”. But just because it’s easy, doesn’t mean we should do it.
  • setuptools still has a reference to usage of an absolute path in data_files [#14] so that’s something that possibly needs removing.
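
To check what a given wheel actually carries, the archive can be inspected directly. This is a minimal sketch assuming the wheel layout where such files sit under `<name>-<version>.data/data/`; the `wheel_data_files` helper and the example file names are made up.

```python
import io
import zipfile


def wheel_data_files(wheel):
    """Return {archive name: path relative to the install prefix} for a wheel's data files."""
    found = {}
    with zipfile.ZipFile(wheel) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Expect <name>-<version>.data/data/<relative path>; skip directory entries.
            if len(parts) > 2 and parts[0].endswith(".data") and parts[1] == "data" and parts[-1]:
                found[name] = "/".join(parts[2:])
    return found


if __name__ == "__main__":
    # Build a tiny wheel-like archive in memory so the example is self-contained.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mytool/__init__.py", "")
        archive.writestr("mytool-1.0.data/data/share/man/man1/mytool.1", "")
    print(wheel_data_files(buffer))
```

Pointing the function at a real .whl path instead of the in-memory buffer works the same way.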

Apparently I’ve run out of links that I’m allowed to post so I’ll have to describe where to find the links :sweat_smile:
[#1] setuptools issue #2341
[#2] wheel issue #92
[#3] setuptools issue #1387
[#4] search for “autodesk Maya”
[#5] Search for “sidefx houdini”
[#6] Search for “Maya 2020 environment variables”
[#7] Search for “list of environment variables in Houdini”.
[#8] Poetry issue #890
[#9] Search for “gnome desktop files putting your application in the desktop menus”
[#10] Poetry issue #890
[#11] Poetry issue #890
[#12] Search for “Is there an idiomatic way to install systemd units with setuptools” on stackoverflow
[#13] setuptools issue #1728
[#14] In docs/userguide/declarative_config.rst of the setuptools repo

I’ve been asked similar questions by people developing system (or internal) libraries for Fedora/Red Hat, and my answer has always been to use RPM rather than pip for things that aren’t virtualenv-friendly.

With that, I guess the relevant question becomes:

I’ve used a Makefile with a note in the docs. I got a few pull requests (and learned stuff about writing cross-*nix Makefiles), so I guess it worked for communication at least partially.

To get around the anti-spam limitation, Ashley sent me the links by mail. Here they are:
[#1]: Can I choose where data files are installed? · Issue #2341 · pypa/setuptools · GitHub
[#2]: https://github.com/pypa/wheel/issues/92#issuecomment-421078938
[#3]: Docs for data_files differ from the general practice · Issue #1387 · pypa/setuptools · GitHub
[#4]: Maya Software | Computer Animation & Modeling Software | Autodesk
[#5]: https://www.sidefx.com/products/houdini/
[#6]: File path variables | Maya 2020 | Autodesk Knowledge Network
[#7]: Environment variables
[#8]: Support for data_files · Issue #890 · python-poetry/poetry · GitHub
[#9]: Desktop files: putting your application in the desktop menus
[#10]: Support for data_files · Issue #890 · python-poetry/poetry · GitHub
[#11]: Support for data_files · Issue #890 · python-poetry/poetry · GitHub
[#12]: https://stackoverflow.com/q/61865481
[#13]: Support for environment markers for data_files · Issue #1728 · pypa/setuptools · GitHub
[#14]: setuptools/declarative_config.rst at a4dbe3457d89cf67ee3aa571fdb149e6eb544e88 · pypa/setuptools · GitHub

This is a feature that has been requested periodically. I think it would be a great idea to replace the feature.

Long ago virtualenv and/or setuptools/pkg_resources broke data_files more or less by accident. Maybe it wasn’t supported very well in the .egg format. And with virtualenv putting your files in different environments there’s no longer a consistent root directory for your data_files, so your program can’t find them at runtime. Eventually you can be tempted to make the argument that this thing we broke by accident was a bad idea.

One version of packaging proposed that every file in a distribution could be relocatable. A manifest would map each individual file to its install location. In wheel we compromise by relocating categories of files but we don’t record where each category was installed.

Suppose we allow the packager to define their own wheel categories with variable substitution, and we make something like the automake standard directory variables available. Then

"a" : "${docdir}/x" in a categories definition
+
package-1.0.data/a/README.txt in the wheel
could be installed to
<root of environment>/share/doc/package-1.0/x/README.txt

The packager has a way to express intent, and the person doing the installing still controls exactly where the files are installed.
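
A minimal sketch of that mapping, to make the idea concrete. Nothing here exists in wheel today: the category table, the variable values, and the `install_path` helper are all hypothetical.

```python
import posixpath
from string import Template

# Hypothetical stand-ins for automake-style directory variables,
# expressed relative to the root of the environment.
VARIABLES = {"docdir": "share/doc/package-1.0", "datadir": "share"}

# Hypothetical category definitions shipped by the package.
CATEGORIES = {"a": "${docdir}/x"}


def install_path(archive_name, root, categories=CATEGORIES, variables=VARIABLES):
    """Map '<dist>.data/<category>/<rest>' to its destination under root."""
    data_dir, category, rest = archive_name.split("/", 2)
    if not data_dir.endswith(".data"):
        raise ValueError(f"not a data entry: {archive_name}")
    # Expand ${...} in the category definition using the installer's variables.
    target_dir = Template(categories[category]).substitute(variables)
    return posixpath.join(root, target_dir, rest)


if __name__ == "__main__":
    print(install_path("package-1.0.data/a/README.txt", "/env"))
```

Because the installer supplies `variables` and `root`, the person doing the install decides the final location while the package only names the category.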

The usual objection is that the limited feature set of post-virtualenv Python packaging makes it easier to use free libraries. I think you’d get new kinds of software that would make up for it.

Allowing installing things relative to sys.prefix and sys.exec_prefix (and maybe even allow some variable replacement in some limited ways) sounds fine to me, but the top post is talking about arbitrary locations, as in anywhere on the user’s machine. That should definitely not be possible IMO.

Python packaging is expected (at least these days, and increasingly so) to only modify the current Python installation (e.g. for reproducibility), so anything that has side effects outside of the Python prefix should not be supported.

Agreed. If you want to install files in machine-wide locations, you should be using an OS-level installer (which is designed to manage this use case), not a Python-specific one.

There’s a need for better ways to build OS-level packages from Python code, but that’s a separate question, and not what wheel and existing packaging tools are focused on.

We’re not talking about having the installer automatically modify paths outside the virtual environment. The person doing the installation would retain complete control.

Instead, give the package a better way to install, and later find, files installed outside of the site-packages location, where all *.py is found - but still inside the virtualenv. Give the package a way to say “by the way, here is a configuration file”.
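
With only the standard library, the “later find” half might look something like the sketch below. The `find_data_file` helper and the file name are hypothetical, and searching the environment’s data root followed by the per-user base is an assumption about where the files were installed.

```python
import os
import site
import sysconfig
import tempfile


def find_data_file(relative_path, roots=None):
    """Return the first existing copy of relative_path under the candidate roots, else None."""
    if roots is None:
        # The environment's data root first, then the per-user base directory.
        roots = [sysconfig.get_path("data"), site.getuserbase()]
    for root in roots:
        candidate = os.path.join(root, relative_path)
        if os.path.exists(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Use a temporary directory as a stand-in for an environment root.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "etc"))
        open(os.path.join(root, "etc", "mytool.conf"), "w").close()
        print(find_data_file(os.path.join("etc", "mytool.conf"), roots=[root]))
```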

Then you can offer a gentler transition for something like a lightweight utility, that would benefit from being distributed on pypi but is also useful standalone, compared to asking them to build a separate Python-focused application packaging system. You could support a more automatic translation between the Python “relocatable categories of files” model into RPM or another system level installer.

I think there are two sets of “arbitrary locations”. One is those files under sys.prefix, and the other is absolutely anywhere on the machine. Despite the fact that I don’t think that it makes sense to support installation outside of sys.prefix either, I was posing the question about those two sets of files separately because I wanted to discuss all potential use cases.

I don’t understand what the categories are used for. Are you saying that setuptools (or some standard) would support a set of valid categories and anything outside of these categories would not be allowed? What would that give us or users?

This.

We already have a bunch of issues with file conflicts as is; introducing a way to install files outside the Python installation will result in even nastier ones. Not only that, but different systems work in different ways and have different behaviors in subtle, and not so subtle, ways. Coming up with a good standard would be extremely hard. As a Linux distro packager I would highly recommend against this.

For specific things like .desktop files and systemd services, we could add an option for the Python interpreter to explicitly opt in to supporting installation of these. The Python distributors would opt in and correctly configure the paths, or installation mechanism, for the system, and then installers like pip could use this if supported. Though even this is tricky.

Here a category is nothing more than a directory of files that can be moved together, in the same way that the scripts currently go into the bin or the Scripts directory depending on the platform, and the Python files go into a different directory depending on the interpreter version.

e.g. the source:

src/myprogram.py
assets/photo.png

setup.py replacement would say: ./assets is in the assets category. install assets into ${datadir}/myspecialsubdir

The variables would have platform-appropriate values. UNIX platforms probably have stronger opinions on where these should go, while sane platforms install everything into per-application folders.

One of the variables e.g. ${root} would be sys.prefix or the root of the virtualenv, so you could install files relative to that directory. But if you start from e.g. ${docsdir} your files are more likely to go into a platform appropriate destination like $docsdir = ${root}/share/docs

The benefit is only to have a way to install, and then locate, files under sys.prefix but outside of site-packages. If the categories are done nicely then it is convenient for a repackager to say "the assets should go here".

From the discussion so far it seems like there’s some interest in trying to replace the functionality that data_files provides. What’s the best way of moving this forwards? I feel like there’s a lot to figure out still before jumping to implementation.

I see. That makes sense. Presumably a repackager would need a way of overriding these variables? If the default values provided by setuptools don’t match where a repackager wants the files to go, then do we need to provide a way to override them?
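
One way an override could look, purely as a sketch: the variable names, the default values, and the `resolve_variables` helper are all hypothetical, and real defaults would have to be agreed on per platform.

```python
import sys
from string import Template

# Hypothetical defaults; later variables may refer to earlier ones.
DEFAULTS = {
    "root": sys.prefix,
    "datadir": "${root}/share",
    "docsdir": "${root}/share/docs",
}


def resolve_variables(overrides=None, defaults=DEFAULTS):
    """Merge overrides into the defaults and expand ${...} references between variables."""
    raw = {**defaults, **(overrides or {})}
    resolved = raw
    # Bounded number of passes, so a reference cycle cannot loop forever.
    for _ in range(len(raw) + 1):
        resolved = {key: Template(value).safe_substitute(raw) for key, value in raw.items()}
        if resolved == raw:
            break
        raw = resolved
    return resolved


if __name__ == "__main__":
    # A repackager overriding one variable and keeping the other defaults.
    print(resolve_variables({"docsdir": "${root}/doc"}))
```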

In terms of how this all gets output to wheels, if we were representing it with the current wheel specification then I think that packages that could otherwise use universal wheels would need to produce platform specific wheels, because you would be producing a wheel for each possible platform-dependent location of ${root}. Repackagers would be using sdists, so there’s no need to override the destinations of category variables at this point, and setuptools will have baked the final destination of files into the wheel already.

With the above, I think that also means that a user’s selection of which files are installed and where ultimately hinges on the list of available platform tags for wheels. Environment markers seem like the natural choice for platform selection because users already know them. I think it would be a case of evaluating the markers at the time of building the wheel but I don’t know if that gets messy when cross compiling (if that’s even possible)?

No, the root value is provided by Python stdlib. Paths in wheel metadata are relative to it, and the wheel installer performs substitution to get the actual target location. All these are already in the wheel specification.
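
Roughly, the substitution can be pictured like this. It is a simplified sketch, not pip’s implementation: `wheel_target` is a made-up helper, and it only handles keys that `sysconfig.get_paths()` also uses (such as data, scripts, purelib and platlib).

```python
import os
import sysconfig


def wheel_target(archive_name, paths=None):
    """Resolve '<dist>.data/<key>/<rest>' against the install paths of this interpreter."""
    if paths is None:
        # The stdlib reports the per-environment locations for each key.
        paths = sysconfig.get_paths()
    data_dir, key, rest = archive_name.split("/", 2)
    if not data_dir.endswith(".data"):
        raise ValueError(f"not a data entry: {archive_name}")
    return os.path.join(paths[key], rest)


if __name__ == "__main__":
    print(wheel_target("mytool-1.0.data/data/share/man/man1/mytool.1"))
    print(wheel_target("mytool-1.0.data/scripts/mytool"))
```

Running it inside and outside a virtualenv shows the same archive entry landing in different places, with nothing platform-specific stored in the wheel itself.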

You’re right. I think I was thinking more about the contents of ${root}. For example if a user wants something installed to ${root} but only on Linux, I don’t think that the wheel specification can represent that in a single wheel file.
Similarly for any other category variables we decide to add. If ${docsdir} is a category variable with platform dependent install locations, currently that would have to be represented by a wheel file per platform.

The same restriction applies to source code in a wheel as well: there’s no way to “partially install” a wheel, which (from my understanding) is a feature.

I don’t understand. Why are docs not supposed to be installed on all platforms?

It’s not that docs (in the abstract) should not be installed on all platforms, but that each platform might prefer docs in different places. For instance, the Ubuntu-provided package for dateutil installs docs in a directory /usr/share/doc/python3-dateutil/, which might not be appropriate on RHEL / CentOS, and isn’t going to make sense on Windows.

Another interesting example would be Unix “man pages”, which are still useful. Installation may require compression on some platforms, and there’s often a utility that needs to be run to update a database. (Though that’s usually an application concern, so less relevant to wheel installation per se.)

Doc formats might also vary a lot; a package might want to provide a .chm file (or whatever modern equivalent there is) on Windows, and something else on Unix.

Lately, I’ve wanted to include three different groups of documentation, where a process that assembles an application package knows how to deal with them. It’s not unusual to have more than one of Python API docs, web API docs, and user-facing docs, all from one package. I don’t think there’s currently a “right way” to get them into a wheel, and it’s even less clear how to get them installed from a wheel in a way that can be driven by a tool generating a system package (such as a .deb or .rpm).

-Fred

Looking at this from a different perspective, why would you want the docs installed in every virtualenv you use the package in? I’d have expected documentation to be installed globally (if at all - most packages seem to have documentation on the web, these days) so it’s not suitable for inclusion in a wheel.

Precisely. Under sys.prefix is certainly possible (and support for more flexibility here is something that could be added). Outside of the environment is not a use case that wheels are intended to support.

My expectation is that I’d like whatever is using the wheel to decide whether & where to install any particular set of additional files. Docs are just one example.

A package that provides a web API might provide resources that represent the API in various ways (OpenAPI definitions, JSON Schemas, documentation for client developers, etc.). Some application that incorporates the package will determine (probably via a packaging tool of some sort), which sets of resources are needed and where they should land in an installation.

I’m running into this issue for an application that supports extensions which can register additional routes for a web framework, but I don’t think that’s a particularly unusual way to structure things. And yes, this is really about allowing packages to provide additional resources for something that builds “an application”; it’s not really about typical wheel installs.

-Fred

The emphasis on correctness here is for similar reasons to why using a screwdriver on a nail is not going to work well. You could make it work in some weird ways, but that’s far from optimal. It might even introduce faults that’ll cause a failure later. :slight_smile:


@nhatkhai Welcome to this forum. I’d encourage you to not be dismissive of the efforts of volunteers who maintain Python packaging tooling. By being dismissive of their approach, you’re very directly contributing to more frustration that’ll lead to fewer maintainers of these tools. :slight_smile:

Take into account that while most users are just looking to solve their problems with the tooling, the maintainers of these tools are outnumbered by a ratio of 1:millions, and have to think about use cases beyond the ones of those specific users.
