Should there be a new standard for installing arbitrary data files?

In practice, this is somewhat of a side issue. The question of whether desktop files or manual pages "work under sys.prefix" is only significant if someone is arguing that wheels need to support installation of files to locations outside of sys.prefix - and I don’t think anyone is still trying to make that argument.

What we’re left with now, as far as I can see, is that the wheel spec supports custom install locations for any named path that is supported by sysconfig. Currently, sysconfig supports 8 paths (from the Python docs):

  • stdlib : directory containing the standard Python library files that are not platform-specific.
  • platstdlib : directory containing the standard Python library files that are platform-specific.
  • platlib : directory for site-specific, platform-specific files.
  • purelib : directory for site-specific, non-platform-specific files.
  • include : directory for non-platform-specific header files.
  • platinclude : directory for platform-specific header files.
  • scripts : directory for script files.
  • data : directory for data files.
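For anyone who wants to see what these resolve to on their own machine, here is a minimal sketch. It only uses `sysconfig.get_paths()` from the standard library; the exact set of keys and the values printed depend on the Python version, platform, and any distro patches.

```python
import sysconfig

# Paths for the interpreter's default install scheme, as a dict of
# {path name: absolute location}.
paths = sysconfig.get_paths()

for name, location in sorted(paths.items()):
    print(f"{name:12} {location}")
```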

Adding extra install locations is as simple as getting some extra locations added to the sysconfig module. It’s also exactly as hard as doing that - doing anything outside of the stdlib still requires getting agreement on locations for all platforms, supporting virtualenvs, handling ways of letting distros customise the locations, etc. I suspect people have a view that there’s an “easier” way than getting sysconfig changed, but honestly, I doubt that’s the case in practice.

So I think the “new standard” here might simply be a matter of requesting new install locations for the sysconfig module.

We will also have to deal with getting KeyError when a wheel’s .data/ directory has a name not in sysconfig. I’ve wanted to just leave the data directory in site-packages in that case and possibly print() a warning.

That should be the case right now. The PEP doesn’t insist that wheels stick to a list of names, so tools have to be prepared for KeyError already.
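To make the fallback idea concrete, here is a rough sketch of "leave it in site-packages and warn". This is not what pip or any other installer is known to do; `resolve_data_key` and its parameters are hypothetical names invented for illustration.

```python
import sysconfig
import warnings
from pathlib import Path


def resolve_data_key(key, site_packages, data_dir_name, paths=None):
    """Return the destination for a wheel's <data_dir_name>/<key>/ subtree.

    Hypothetical helper: falls back to leaving the subtree inside
    site-packages when `key` is not a known install path.
    """
    paths = sysconfig.get_paths() if paths is None else paths
    try:
        return Path(paths[key])
    except KeyError:
        warnings.warn(
            f"{key!r} is not a known install path; "
            f"leaving {data_dir_name}/{key} in site-packages"
        )
        return Path(site_packages) / data_dir_name / key
```

A known key such as "data" resolves through the scheme as usual; an unknown one stays where the wheel was unpacked.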

I’d be fine with someone checking what existing installers do (pip, wheel, installer, poetry, …?) and if they all have a common behaviour, adding that to the PEP as a clarification. If they do different things, we’d have to say it’s currently implementation-defined and if we want to define a specific behaviour, that would be a proper PEP update.

@AWhetter, @minrk: after looking at pySerial and digging into it in depth, I realize the answer to our problem may lie in the Conda tool (see “Conda for data scientists” in the conda 4.10.1.post2+b6d32c8d7 documentation).

https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/

Yeah, I think so. An equivalent answer for --user as well.

I don’t think resolving etc and share separately is important to us, as long as the parent of where they actually go is accessible and reliable. sys.prefix usually happens to work for this, just not always.

--user is covered by the appropriate “user” scheme, so yes, that works.

Could you not just use sysconfig.get_path("data")? That’s basically "sys.prefix with a different value in the user scheme". And it’s already available in wheels, because it’s an existing sysconfig path.
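A small sketch of that query for both the default and the user scheme. `sysconfig.get_preferred_scheme()` only exists on Python 3.10+, so the fallback scheme names below are an assumption based on the stock CPython schemes; `user_scheme_name` is a hypothetical helper, and patched distro Pythons may differ.

```python
import os
import sys
import sysconfig


def user_scheme_name():
    """Best-effort name of the scheme used for --user installs."""
    if hasattr(sysconfig, "get_preferred_scheme"):  # Python 3.10+
        return sysconfig.get_preferred_scheme("user")
    # Assumed fallback for older Pythons, mirroring stock CPython names.
    if os.name == "nt":
        return "nt_user"
    if sys.platform == "darwin" and getattr(sys, "_framework", ""):
        return "osx_framework_user"
    return "posix_user"


# "data" in the default scheme (usually the same as sys.prefix).
default_data = sysconfig.get_path("data")
# "data" in the user scheme (the user base directory).
user_data = sysconfig.get_path("data", scheme=user_scheme_name())

print("default:", default_data)
print("user:   ", user_data)
```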

It’s worth also linking to this pip issue though. We’re trying to switch from distutils to sysconfig for getting scheme paths, in preparation for the deprecation of distutils, and there’s an impressive number of cases where the two are out of line. Which suggests that either there’s a certain lack of clarity over what “the correct locations” are in many cases, or distributions have been caught out by the fact that there’s no “single source of truth” for this data right now, when they have patched stuff.

Also, I did misread the wheel spec previously - all it refers to is “a dictionary of install paths”, with a note that “this version of the wheel specification is based on the distutils install schemes and does not define how to install files to other locations”.

But I think the following are reasonable clarifications for the spec:

  1. “distutils” becomes “sysconfig” with the impending deprecation of distutils.
  2. The “dictionary of install paths” is clarified, to note that by default installers should use one of the available sysconfig schemes.
  3. The spec doesn’t state which scheme an installer should use by default, nor does it put any requirements on what capabilities installers might have to customise the set of install locations.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?” And there’s no complete answer to that, because it’s not recorded anywhere (and things like --target or editable installs make things even more messy). But looking in sysconfig locations, and declaring custom install schemes unsupported, seems like a reasonable starting point.

Could you not just use sysconfig.get_path("data") ?

Yes, I think we can! As long as it’s documented that this is where data files go by default for pip install and pip install --user. I think that’s what’s been missing: a documented, queryable API that e.g. returns /usr/local on Debian. Then it could be considered a bug that Debian’s sysconfig.get_path("data") returns /usr instead of /usr/local. As you said, every possible --target install is not critical for us to support. We have post-install configuration to deal with those cases; we mainly care about the really common ones getting us 90% there without that extra second step.

I’ll confess that I was surprised to learn that data_files is deprecated in the docs with the message:

It does not work with wheels, so it should be avoided.

Mainly because we use data_files in wheels all the time and it works great. So I don’t quite understand what the current situation is. We’ve always used ‘relative’ paths, e.g.

data_files=[("share/jupyter/kernels/python3", glob("kernelspec/*")),]

So what part of data_files is really deprecated? Is it only absolute path support? We’ve never needed that. Is the plan to have a new spec that can reasonably be supported and un-deprecate data_files with a more limited scope, or a new keyword arg with a new, very similar spec?

So from the Jupyter perspective, everything is working pretty well today except for an official consistent API that would return /usr/local on debian instead of /usr. If the fix is for debian to patch sysconfig in the same way they do distutils as proposed in that pip issue, that would probably work great for us. We’ve handled that well enough so far with a bit of special-casing for /usr/local anyway, so we have no pressing issues, unless data_files is actually going to be removed.

The Jupyter issue of locating the install location of data files at runtime, becomes a question of “what was the install scheme used to install this package?”

We actually want something that I think should be even simpler than that, since we don’t actually know what packages we are asking about (and critically, they might not be Python packages). Instead, the question we want to ask is:

  • what is the default install scheme, and where would it put data files
  • what is the --user install scheme, and where would it put data files

so that we can put those on our default search path to make life easier for wheel installs. That’s really it.
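Roughly what I have in mind, as a sketch only: take whatever data roots the schemes report and turn them into a de-duplicated search path. `data_search_path` is a made-up name, and this is not how Jupyter's actual path logic is written.

```python
import sys
import sysconfig
from pathlib import Path


def data_search_path(roots, subdir=("share", "jupyter")):
    """Append `subdir` to each root, keeping order and dropping duplicates."""
    result = []
    for root in roots:
        candidate = Path(root).joinpath(*subdir)
        if candidate not in result:
            result.append(candidate)
    return result


# sysconfig's "data" path usually equals sys.prefix, so the duplicate
# collapses; a user-scheme root would be added to `roots` in the same manner.
roots = [sysconfig.get_path("data"), sys.prefix]
for entry in data_search_path(roots):
    print(entry)
```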

To put it another way, data_files for us is distinct from package_data because it is for making files available to other tools or packages, which may not even be Python at all (e.g. xeus in C++ or nteract in JavaScript). Having a consistent $PREFIX/share|etc scheme is our approach to that.

Hey all, I’m another Jupyter person who has been lately looking into the issues around data_files. The current thrust of the conversation is encouraging, especially since the ongoing work on using entry_points in place of data_files seems to have stalled a bit, mostly over the fact that no matter what, using entry_points means having to scan through the metadata of potentially 1000s of packages on every startup looking for config and extensions, whereas using data_files means that currently we only have to check ~4 fixed paths. Thus, it seems like data_files has entry_points thoroughly beat in terms of performance and simplicity.

So the idea of using sysconfig.get_path("data") or similar to get a single fixed dir (per scheme) to install files to sounds appealing, since it should be even simpler and faster than data_files. My question, though, is how would we actually implement installation to the "data" dir? I can think of a bunch of hacky ways to copy files during install. But in general those copies won’t automagically get included in the pkg metadata, right? Which means that pip uninstall won’t work.

So what’s the correct way?

Put them in {distribution}-{version}.data/data/ in the wheel. If you’re asking what build backend (e.g. setuptools/flit) arguments are needed to do that, I don’t know, but if the backend doesn’t let you do this right now, that’s just a backend feature request - in the context of this thread, the important point is that it doesn’t need a new standard to make this work.

Cool. I’ll go try that out. The only other wrinkle I can think of is that the files will need to end up in something like ${datarootdir}/share/jupyter, rather than the base ${datarootdir}. So in terms of the wheel schema you refer to, could you stash data at:

{distribution}-{version}.data/data/share/jupyter

Is that valid? I just read through the 2 wheel PEPs (PEP 427 and PEP 491) and I’m still unclear.

The PEP says each directory contains a “subtree” so I believe this is completely valid.
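As an illustration of how I read "subtree", here is a sketch of the mapping from a wheel archive member to its installed location. The distribution name `mykernel-1.0`, the `/usr/local` root, and the `data_target` helper are all hypothetical; a real installer does more than this (RECORD handling, other .data keys, and so on).

```python
from pathlib import PurePosixPath


def data_target(member, data_root):
    """Map '<dist>-<ver>.data/data/<subtree>' to '<data_root>/<subtree>'.

    Returns None for archive members outside the .data/data/ directory.
    """
    parts = PurePosixPath(member).parts
    if len(parts) < 3 or not parts[0].endswith(".data") or parts[1] != "data":
        return None
    # Everything below .data/data/ is copied with its structure preserved.
    return PurePosixPath(data_root).joinpath(*parts[2:])


member = "mykernel-1.0.data/data/share/jupyter/kernels/python3/kernel.json"
print(data_target(member, "/usr/local"))
```

So nesting share/jupyter under the data directory just means the files land under share/jupyter below whatever the "data" path resolves to.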
