Should there be a new standard for installing arbitrary data files?

I’m certain that I do not know what you are looking for, or how many examples would count as enough. I have tried, but I guess I’m failing you.

My initial post was the collection of all related use cases requested by users over the years in various discussions about data_files that I could find. The only one we haven’t discussed so far is my own, very specific, use case. You could generalise that use case to mean “any integrations with any applications that have some sort of search path mechanism and understand virtualenvs”. But that’s not particularly useful as something to drive a new design of data_files.

The only other use case that I can find is Jupyter’s extension mechanism (Distributing Jupyter Extensions as Python Packages — Jupyter Notebook 5.7.6 documentation), and coincidentally it seems to fit the above description. Jupyter looks for both configuration files and data files in directories created using data_files (Jupyter Paths priority order - General - Jupyter Community Forum). Data files are usually javascript files that form the main content of an extension. Both configuration files and data files go into specific directories that Jupyter searches to find extensions.
However, the shortcomings of data_files mean that packages also need to provide an additional installation step for when the data files don’t end up in the expected place. This additional installation step requires the data files to be duplicated as package data so that the package can locate them to report to Jupyter. This also extends Jupyter’s search paths with an additional directory per extension.
For the above reason, plus the fact that data_files are considered deprecated in setuptools and not supported by any other packaging backends yet, there has been a proposal to replace the current extension mechanism with one that does not rely on data_files (RFP for successor to data_files-based extension discovery? · Issue #351 · jupyter-server/jupyter_server · GitHub). However, there don’t seem to have been any ideas that avoid the problems the additional install command has, and there’s still a preference for data_files to stick around. Perhaps @minrk or someone else from the Jupyter team can speak more on this and possibly even on what they would want to see in a redesign of data_files.

Sorry. I glossed over those because I thought the discussion had moved on, but they do deserve comment. See below.

I think the consensus is pretty clear at this point that “absolutely anywhere on the machine” isn’t acceptable. In particular, I agree with @uranusjr’s comment:

I’m therefore going to limit myself to proposals that only allow “arbitrary data files” to be installed under sys.prefix.

Looking at your use cases (I’ll defer the first one for now, as it’s your specific case and you noted that hasn’t been discussed yet):

  1. .desktop files. You said yourself they are platform specific and only useful if installed into system locations, i.e. not under sys.prefix. So the solutions we’re looking at won’t work for these.
  2. Manual pages. Same as .desktop files, platform specific and need to be in a system location. I’ve already called these out specifically as a case where my experience is that placing them under sys.prefix where OS utilities can’t see them is pointless.
  3. Files in /etc - again system locations, and not supported by solutions that install under sys.prefix.
  4. Binaries - these work right now (there’s a “scripts” location that things get installed to, and console scripts go there so we know this works). I don’t understand your comment about not working in a virtualenv. Unless you mean “install to /usr/local/bin” or something like that, but we’re back to locations outside sys.prefix again.

So all of your use cases basically require “arbitrary locations” and I believe that we already have a general feeling that this isn’t acceptable. If you want to still argue for that behaviour, I suggest that you make a specific proposal that describes what you want. But be prepared for it to be rejected - I will definitely vote -1 on it as I’ve already said.

I’m not entirely sure about your VFX use case, but it seems like your solution using virtualenv with an application-specific plugin is a reasonably good approach, so I’m not sure anything more is needed. And if you want to install the files to a location that’s already on MAYA_SCRIPT_PATH, you’re back to installing outside of sys.prefix.

So to be more specific, I’m looking for examples of use cases where it would be necessary to be able to install “arbitrary data files” into particular locations under sys.prefix other than the ones that are already covered by the existing sysconfig locations.

Perhaps @minrk or someone else from the Jupyter team can speak more on this and possibly even on what they would want to see in a redesign of data_files.

Jupyter extension packages typically include files that should go in one or both of these:

  • configuration files to enable extensions/specify config (in $prefix/etc)
  • static resources in $prefix/share (javascript extension sources, html templates, kernel specification files for discovery, etc.)

We never expect to write outside sys.prefix as part of package installation. Ideally, these are staged into the right place using data_files (or whatever its replacement should be) at install time, but there are some exceptions where that can’t/won’t work, so we provide our own jupyter ... install commands which stage files into $prefix/etc or $prefix/share after installation time. This two-step install has led to lots of mixed-up installations, as removing or upgrading a package is no longer associated with removing or changing other files associated with it. data_files installs work great for this today.

Critically for Jupyter, none of the Jupyter specs are Python-specific, and many of these things are not part of Python packages at all, so we standardize on evaluating paths relative to $PREFIX rather than something Python-specific (we generally use sys.prefix for this), and we don’t want to assume that everything comes from a Python package.

For install time, all we really want is a reliable way to write to {sys.prefix}/share/ (or {sys.prefix}/etc), and a way at runtime to get back the same {sys.prefix}/share|etc. Both should work for installs:

  • in venv
  • not in env
  • --user

It’s the lack of a symmetrical “where would you have put it?” API that’s been a challenge for us. We encourage data_files as the easiest way to do this which works almost all of the time (very reliably in venvs and conda envs), but we are aware of custom distutils install schemes like system Pythons where it can get weird, because sys.prefix is not actually where installed files end up.

Cool, that makes a lot of sense to me. So if sysconfig added a new path for “share” (or “etc”), that would satisfy this use case? The wheel spec already covers the installation side of this, as the rule in the PEP applies to any scheme key that exists in sysconfig. So all you need is the lookup side, which sysconfig would provide.

Fedora Python maintainer here. I need to pitch in as somebody who fundamentally disagrees that desktop files or manual pages "don’t work under sys.prefix" or that installing stuff to an arbitrary location under sys.prefix has an increased potential to create file-conflicts. It is a matter of perspective. Let me describe a matrix of the following things I could think of:

Python modules in {sys.prefix}/.../site-packages

Python modules: sys.prefix is /usr

  • The modules work naturally because they are by definition in sys.path.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and distro-package manager.

Python modules: sys.prefix is /usr/local or ~/.local

  • The modules work naturally because they are by definition in sys.path.
  • There is a potential of file-conflict between different Python packages.

Python modules: sys.prefix is within Python virtual environment

  • The modules work naturally because they are by definition in sys.path.
  • There is a potential of file-conflict between different Python packages, albeit the chances are very limited, as virtual environments tend to be one-purpose and the package set usually does not grow without bounds.

Python modules: sys.prefix is another arbitrary location including Windows

  • The modules work naturally because they are by definition in sys.path.
  • There is a potential of file-conflict with different Python packages.

Commands (scripts) in {sys.prefix}/bin

Commands: sys.prefix is /usr

  • The commands work naturally because /usr/bin is (almost) always on $PATH.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and distro-package manager.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install scripts into this location, or manually created content.

Commands: sys.prefix is /usr/local

  • The commands usually work because /usr/local/bin tends to be on $PATH.
  • If needed, users can extend their $PATH easily to make it work.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install scripts into this location, or manually created content.

Commands: sys.prefix is ~/.local

  • The commands usually work because ~/.local/bin tends to be on $PATH on modern distros.
  • If needed, users can extend their $PATH easily to make it work.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install scripts into this location, or manually created content.

Commands: sys.prefix is within Python virtual environment

  • The commands don’t work unless the virtual environment is activated: activate script adds the directory to $PATH.
  • Users can add symbolic links to scripts in a virtual environment to directories on their $PATH.
  • There is a potential of file-conflict between different Python packages, albeit the chances are very limited, as virtual environments tend to be one-purpose and the package set usually does not grow without bounds.

Commands: sys.prefix is another arbitrary location

  • The commands don’t work unless user modifies their $PATH.
  • Users can add symbolic links to scripts in arbitrary locations to directories on their $PATH.
  • There is a potential of file-conflict with anything else.

Commands: Windows

  • This is handled differently on Windows and it seems to work.
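To make the “does it work out of the box?” question for commands concrete, here is a minimal sketch that checks whether the scripts directory of the running interpreter is on $PATH. The helper name is_on_path is made up for illustration; sysconfig.get_path("scripts") is the real lookup.

```python
import os
import sysconfig


def is_on_path(directory, path_value):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == wanted for entry in entries)


# Where this interpreter installs console scripts, and whether a shell would find them.
scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir, is_on_path(scripts_dir, os.environ.get("PATH", "")))
```

The same kind of check could be pointed at {sys.prefix}/share/man and $MANPATH, although man can also derive its search path from other configuration, so a negative result there is less conclusive.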

Manual pages in {sys.prefix}/share/man

Manpages: sys.prefix is /usr

  • The manual pages work naturally because /usr/share/man is (almost) always in manpath.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and distro-package manager.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install manual pages into this location, or manually created content.

Manpages: sys.prefix is /usr/local

  • The manual pages usually work because /usr/local/share/man tends to be in manpath.
  • If needed, users/distros can extend their config easily to make it work.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install manual pages into this location, or manually created content.

Manpages: sys.prefix is ~/.local

  • The manual pages usually work because ~/.local/share/man tends to be on manpath on modern distros.
  • If needed, users/distros can extend their config easily to make it work.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install manual pages into this location, or manually created content.

Manpages: sys.prefix is within Python virtual environment

  • The manual pages don’t work out of the box.
  • Users can make them work by setting/extending $MANPATH.
  • If deemed useful, the activate script could be improved to set/extend $MANPATH.
  • Users can add symbolic links to manual pages in a virtual environment to directories on their manpath.
  • There is a potential of file-conflict between different Python packages, albeit the chances are very limited, as virtual environments tend to be one-purpose and the package set usually does not grow without bounds.

Manpages: sys.prefix is another arbitrary location

  • The manual pages don’t work out of the box.
  • Users can make them work by setting/extending $MANPATH.
  • Users can add symbolic links to manual pages in arbitrary locations to directories on their manpath.
  • There is a potential of file-conflict with anything else.

Manpages: Windows

  • The manual pages are not relevant there but they don’t hurt anything.

Desktop files in {sys.prefix}/share/applications

(This also applies to their icons in {sys.prefix}/share/icons or {sys.prefix}/share/pixmaps.)

Desktop files: sys.prefix is /usr

  • The desktop files work naturally because /usr/share/applications is used by default.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and distro-package manager.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install desktop files into this location, or manually created content.

Desktop files: sys.prefix is /usr/local or ~/.local

  • The desktop files work naturally because /usr/local/share/applications and ~/.local/share/applications are used by default.
  • There is a potential of file-conflict between different Python packages.
  • There is a potential of file-conflict between pip-installed packages and other language stack package managers that would also install desktop files into this location, or manually created content.

Desktop files: sys.prefix is within Python virtual environment

  • The desktop files don’t work.
  • Users might be able to make them work by some configuration (I have not explored this).
  • Users can add symbolic links to desktop files in a virtual environment to the directories that work with desktop files.
  • There is a potential of file-conflict between different Python packages, albeit the chances are very limited, as virtual environments tend to be one-purpose and the package set usually does not grow without bounds.
  • If somebody wants to explore new ideas, we can have a concept of “activating a virtual environment for your desktop environment”, but I don’t think it would be that useful.

Desktop files: sys.prefix is another arbitrary location

  • The desktop files don’t work.
  • Users might be able to make them work by some configuration (I have not explored this).
  • There is a potential of file-conflict with anything else.

Desktop files: Windows

  • The desktop files are not relevant there but they don’t hurt anything.

Static application data in {sys.prefix}/share/{app_name}

(E.g. Jupyter kernels.)

  • They always work regardless of sys.prefix.
  • There is a potential of file-conflict with anything else that uses app_name.

tl;dr

  • The stuff works quite fine for many values of sys.prefix; the degree of “works out of the box” varies, but there is some potential for improvement as well.
  • The potential for file-conflicts is not worse than the existing potential (you can already nuke a system by installing a script named bash or sh, even without data_files).
  • Many files are useless on Windows but I consider that OK.

Thanks, @AWhetter for looping Jupyter in, and @minrk for the thorough summary of our challenges!

To the various venv/not venv/--user cases, I’d add:

  • uninstall leaving the file system “unharmed”
  • a key “can’t/won’t” work case: pip install -e (and whatever flit and poetry do). I don’t know how this can be solved in a cross-platform way, but our experiments in going from 4 search paths to hundreds (via entry_points) were… not encouraging, and would require non-python Jupyter components to shell out to get such a list.

Semi-related: having at least some warning (which could ideally be elevated to an error with e.g. an environment variable) when two packages try to write to the same file in {sys.prefix}/etc or {sys.prefix}/share would be helpful… today we kinda wait for downstreams (or users) to find these issues. Our flagship first-party consumers of etc have well-known (well, at least documented) {sys.prefix}/etc/jupyter_*_config.d/ folders that have made this more robust, but one (well, usually two) badly-behaved packages can make a right mess of things depending on the order of installation.
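As a rough idea of what such a warning could be based on, here is a sketch that looks for files claimed by more than one installed distribution, using the RECORD data that importlib.metadata exposes. The function names and the demo package names are invented for illustration, and this is an after-the-fact check, not something pip does at install time.

```python
import os
from collections import defaultdict
from importlib import metadata


def find_conflicts(files_by_package):
    """Map each path claimed by more than one package to the sorted claimants."""
    owners = defaultdict(set)
    for package, paths in files_by_package.items():
        for path in paths:
            owners[path].add(package)
    return {path: sorted(pkgs) for path, pkgs in owners.items() if len(pkgs) > 1}


def installed_files():
    """Collect the normalised RECORD entries of every installed distribution."""
    result = defaultdict(set)
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or "unknown"
        for entry in dist.files or []:  # files is None when RECORD is missing
            result[name].add(os.path.normpath(str(dist.locate_file(entry))))
    return result


# Hypothetical data: two packages shipping the same config file.
demo = {
    "pkg-a": ["etc/jupyter/jupyter_config.json", "share/jupyter/a.js"],
    "pkg-b": ["etc/jupyter/jupyter_config.json"],
}
print(find_conflicts(demo))
# To scan the current environment instead: find_conflicts(installed_files())
```

This only sees files that were recorded at install time, so anything staged later by a separate install command would be invisible to it.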

In practice, this is somewhat of a side issue. The question of whether desktop files or manual pages "work under sys.prefix" is only significant if someone is arguing that wheels need to support installation of files to locations outside of sys.prefix - and I don’t think anyone is still trying to make that argument.

What we’re left with now, as far as I can see, is that the wheel spec supports custom install locations for any named path that is supported by sysconfig. Currently, sysconfig supports 8 paths (from the Python docs):

  • stdlib : directory containing the standard Python library files that are not platform-specific.
  • platstdlib : directory containing the standard Python library files that are platform-specific.
  • platlib : directory for site-specific, platform-specific files.
  • purelib : directory for site-specific, non-platform-specific files.
  • include : directory for non-platform-specific header files.
  • platinclude : directory for platform-specific header files.
  • scripts : directory for script files.
  • data : directory for data files.
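For anyone who wants to see what their own interpreter reports, a short sketch using the sysconfig functions that exist today (the exact names and locations printed depend on the Python version, platform, and any distro patches):

```python
import sysconfig

# Names of the install locations known to this interpreter's sysconfig.
print(sysconfig.get_path_names())

# Where each of those names resolves to under the default scheme.
for name, location in sysconfig.get_paths().items():
    print(f"{name:12} {location}")
```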

Adding extra install locations is as simple as getting some extra locations added to the sysconfig module. It’s also exactly as hard as doing that - doing anything outside of the stdlib still requires getting agreement on locations for all platforms, supporting virtualenvs, handling ways of letting distros customise the locations, etc. I suspect people have a view that there’s an “easier” way than getting sysconfig changed, but honestly, I doubt that’s the case in practice.

So I think the “new standard” here might simply be a matter of requesting new install locations for the sysconfig module.

We will also have to deal with getting KeyError when a wheel’s .data/ directory has a name not in sysconfig. I’ve wanted to just leave the data directory in site-packages in that case and possibly print() a warning.

That should be the case right now. The PEP doesn’t insist that wheels stick to a list of names, so tools have to be prepared for KeyError already.

I’d be fine with someone checking what existing installers do (pip, wheel, installer, poetry, …?) and if they all have a common behaviour, adding that to the PEP as a clarification. If they do different things, we’d have to say it’s currently implementation-defined and if we want to define a specific behaviour, that would be a proper PEP update.
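Purely as an illustration of the “leave it alone and warn” behaviour floated above, a sketch of how an installer might handle an unknown key. The function destination_for is hypothetical and is not how pip or any other installer is known to behave; it also uses warnings.warn rather than print().

```python
import sysconfig
import warnings


def destination_for(key, scheme_paths=None):
    """Return the install root for a `.data/<key>/` directory, or None if unknown."""
    paths = sysconfig.get_paths() if scheme_paths is None else scheme_paths
    try:
        return paths[key]
    except KeyError:
        # Assumed policy: do not guess a location, just tell the user.
        warnings.warn(f"unknown .data/ directory {key!r}; leaving it in place")
        return None


print(destination_for("data"))
```

Returning None leaves the decision about what “in place” means to the caller, which is exactly the part that would need to be checked against existing installers.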

@AWhetter, @minrk, after looking at pySerial and digging deeper into it, I realize the answer to our problem may lie with the “Conda” tool (Conda for data scientists, conda 4.10.1.post2+b6d32c8d7 documentation)

https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/

Yeah, I think so. An equivalent answer for --user as well.

I don’t think resolving etc and share separately is important to us, as long as the parent of where they actually go is accessible and reliable. sys.prefix usually happens to work for this, just not always.

--user is covered by the appropriate “user” scheme, so yes, that works.

Could you not just use sysconfig.get_path("data")? That’s basically “sys.prefix with a different value in the user scheme”. And it’s already available in wheels, because it’s an existing sysconfig path.
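A quick way to check that claim on a given interpreter is to compare the two values directly. This is only a sketch of the lookup; whether they match depends on the platform and on any distro patches.

```python
import sys
import sysconfig

# Default-scheme data root for this interpreter, next to sys.prefix.
data_root = sysconfig.get_path("data")
print("sys.prefix:    ", sys.prefix)
print("data (default):", data_root)
print("same location? ", data_root == sys.prefix)
```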

It’s worth also linking to this pip issue though. We’re trying to switch from distutils to sysconfig for getting scheme paths, in preparation for the deprecation of distutils, and there’s an impressive number of cases where the two are out of line. Which suggests that either there’s a certain lack of clarity over what “the correct locations” are in many cases, or distributions have been caught out by the fact that there’s no “single source of truth” for this data right now, when they have patched stuff.

Also, I did misread the wheel spec previously - all it refers to is “a dictionary of install paths”, with a note that “this version of the wheel specification is based on the distutils install schemes and does not define how to install files to other locations”.

But I think the following are reasonable clarifications for the spec:

  1. “distutils” becomes “sysconfig” with the impending deprecation of distutils.
  2. The “dictionary of install paths” is clarified, to note that by default installers should use one of the available sysconfig schemes.
  3. The spec doesn’t state which scheme an installer should use by default, nor does it put any requirements on what capabilities installers might have to customise the set of install locations.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?” And there’s no complete answer to that, because it’s not recorded anywhere (and things like --target or editable installs make things even more messy). But looking in sysconfig locations, and declaring custom install schemes unsupported, seems like a reasonable starting point.

Could you not just use sysconfig.get_path("data") ?

Yes, I think we can! As long as it’s documented that this is where data files go by default for pip install and pip install --user. I think that’s what’s been missing: a documented queryable API that e.g. returns /usr/local on debian. Then it could be considered a bug that on debian sysconfig.get_path("data") returns /usr instead of /usr/local. As you said, every possible --target install is not critical for us to support. We have post-install configuration to deal with those cases; we mainly care about the really common ones getting us 90% there without that extra second step.

I’ll confess that I was surprised to learn that data_files is deprecated in the docs with the message:

It does not work with wheels, so it should be avoided.

Mainly because we use data_files in wheels all the time and it works great. So I don’t quite understand what the current situation is. We’ve always used ‘relative’ paths, e.g.

data_files=[("share/jupyter/kernels/python3", glob("kernelspec/*")),]

So what part of data_files is really deprecated? Is it only absolute path support? We’ve never needed that. Is the plan to have a new spec that can reasonably be supported and un-deprecate data_files with a more limited scope, or a new keyword arg with a new, very similar spec?

So from the Jupyter perspective, everything is working pretty well today except for an official consistent API that would return /usr/local on debian instead of /usr. If the fix is for debian to patch sysconfig in the same way they do distutils as proposed in that pip issue, that would probably work great for us. We’ve handled that well enough so far with a bit of special-casing for /usr/local anyway, so we have no pressing issues, unless data_files is actually going to be removed.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?”

We actually want something that I think should be even simpler than that, since we don’t actually know what packages we are asking about (and critically, they might not be Python packages). Instead, the question we want to ask is:

  • what is the default install scheme, and where would it put data files
  • what is the --user install scheme, and where would it put data files

so that we can put those on our default search path to make life easier for wheel installs. That’s really it.
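Those two questions could be sketched with sysconfig roughly as follows. The helper names are made up, the "jupyter" subdirectory is only an example, and the pre-3.10 fallback for the user scheme name is an assumption that ignores macOS framework builds.

```python
import os
import sysconfig


def user_scheme():
    """Best-effort name of the `--user` scheme on this platform."""
    if hasattr(sysconfig, "get_preferred_scheme"):  # Python 3.10+
        return sysconfig.get_preferred_scheme("user")
    return "nt_user" if os.name == "nt" else "posix_user"


def data_roots():
    """Data roots for the default scheme and, when available, the user scheme."""
    roots = [sysconfig.get_path("data")]
    try:
        roots.append(sysconfig.get_path("data", user_scheme()))
    except (KeyError, ValueError):
        pass  # this build has no user scheme
    return roots


def search_path(app="jupyter"):
    """Candidate `share/<app>` and `etc/<app>` directories under each data root."""
    return [
        os.path.join(root, kind, app)
        for root in data_roots()
        for kind in ("share", "etc")
    ]


for directory in search_path():
    print(directory)
```

Nothing here asks which package put a file where; it only answers “where would the default and --user schemes have put data files”, which is the narrower question above.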

To put it another way, data_files for us is distinct from package_data because it is for making files available to other tools or packages, which may not even be Python at all (e.g. xeus in C++ or nteract in JavaScript). Having a consistent $PREFIX/share|etc scheme is our approach to that.

Hey all, I’m another Jupyter person who has been lately looking into the issues around data_files. The current thrust of the conversation is encouraging, especially since the ongoing work on using entry_points in place of data_files seems to have stalled a bit, mostly over the fact that no matter what, using entry_points means having to scan through the metadata of potentially 1000s of packages on every startup looking for config and extensions, whereas using data_files means that currently we only have to check ~4 fixed paths. Thus, it seems like data_files has entry_points thoroughly beat in terms of performance and simplicity.

So the idea of using sysconfig.get_path("data") or similar to get a single fixed dir (per scheme) to install files to sounds appealing, since it should be even simpler and faster than data_files. My question, though, is how would we actually implement installation to the "data" dir? I can think of a bunch of hacky ways to copy files during install. But in general those copies won’t automagically get included in the pkg metadata, right? Which means that pip uninstall won’t work.

So what’s the correct way?

Put them in {distribution}-{version}.data/data/ in the wheel. If you’re asking what build backend (e.g. setuptools/flit) arguments are needed to do that, I don’t know, but if the backend doesn’t let you do this right now, that’s just a backend feature request - in the context of this thread, the important point is that it doesn’t need a new standard to make this work.

Cool. I’ll go try that out. The only other wrinkle I can think of is that the files will need to end up in something like ${datarootdir}/share/jupyter, rather than the base ${datarootdir}. So in terms of the wheel schema you refer to, could you stash data at:

{distribution}-{version}.data/data/share/jupyter

Is that valid? I just read through the 2 wheel PEPs (PEP 427 and PEP 491) and I’m still unclear.

The PEP says each directory contains a “subtree” so I believe this is completely valid.
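To spell out the mapping being discussed, a small sketch of where a file sits inside the wheel and where it would land under the default scheme. The helper names and the distribution name example_kernel are invented, and the distribution name is assumed to already be in the normalised form used in wheel filenames.

```python
import posixpath
import sysconfig
from pathlib import Path


def archive_name(distribution, version, relative_path, key="data"):
    """Path inside the wheel for a file destined for the `key` install location."""
    return posixpath.join(f"{distribution}-{version}.data", key, relative_path)


def installed_location(relative_path, key="data"):
    """Where that file would land under the default scheme of this interpreter."""
    return Path(sysconfig.get_path(key)) / relative_path


rel = "share/jupyter/kernels/python3/kernel.json"
print(archive_name("example_kernel", "1.0", rel))
print(installed_location(rel))
```

Because the file is part of the wheel, it should end up in the installed RECORD like any other file, which is what lets pip uninstall remove it again.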
