Deprecating the `headers` wheel data key

FFY00 · February 9, 2023, 4:50pm

Hi everyone, I wanted to ask y’all’s opinion on deprecating the headers key in the wheel .data folder.

The issue

The headers install key is a leftover from distutils and there is no direct equivalent in sysconfig.

Currently, installers that moved away from distutils are manually building its path from include/platinclude. However, this has a big issue: include/platinclude are located in the Python installation itself (they are derived from installed_base/installed_platbase), so they are shared between all environments.

This means that, e.g., installing a wheel that uses the headers key in a virtual environment will make those files available in the default environment and any other virtual environment.
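To check this on a given interpreter, here is a small sketch (an illustration only, not installer code) that prints where sysconfig places include and platinclude. The helper name `describe_include_paths` is made up for this example. Run inside a virtual environment, the behaviour described above would show up as include paths sitting under `sys.base_prefix` rather than `sys.prefix`.

```python
import sys
import sysconfig


def describe_include_paths():
    """Collect the header-related paths sysconfig reports for this interpreter."""
    paths = sysconfig.get_paths()  # default install scheme for this platform
    return {
        "prefix": sys.prefix,            # active environment (the venv, if any)
        "base_prefix": sys.base_prefix,  # underlying Python installation
        "include": paths.get("include"),
        # Not every scheme defines platinclude, hence .get()
        "platinclude": paths.get("platinclude"),
        # sysconfig has no path named "headers"
        "has_headers_key": "headers" in paths,
    }


if __name__ == "__main__":
    for name, value in describe_include_paths().items():
        print(f"{name}: {value}")
```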

Viability

Fortunately, there aren’t that many packages that make use of this feature, though there are some. With a reasonable deprecation timeline, I don’t think it would be a big problem to migrate away from this.

As PEP 427 doesn’t specify any canonical list of valid keys, instead leaving that to the installation mechanism, I don’t think we need a PEP to drop the key. IMO we can just write documentation on migrating away from using the headers key, and installers can start raising a deprecation warning and pointing to that documentation.

I also think most installer authors would be happy to be able to drop their custom code to handle the backwards compatibility.
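As a rough sketch of what such a warning could look like on the installer side, the snippet below scans a wheel's file listing for entries under `<name>-<version>.data/headers/`. The function names and the warning text are hypothetical, and this is not how pip or any other installer implements it.

```python
import re
import warnings

# Matches "<name>-<version>.data/<key>/..." at the top level of a wheel archive.
_DATA_ENTRY = re.compile(r"^[^/]+\.data/([^/]+)/")


def data_keys_used(archive_names):
    """Return the set of .data keys that appear in a wheel's file listing."""
    keys = set()
    for name in archive_names:
        match = _DATA_ENTRY.match(name)
        if match:
            keys.add(match.group(1))
    return keys


def warn_if_headers_used(archive_names):
    """Emit a DeprecationWarning when the wheel uses the headers key."""
    uses_headers = "headers" in data_keys_used(archive_names)
    if uses_headers:
        warnings.warn(
            "wheel uses the 'headers' data key, which is proposed for deprecation",
            DeprecationWarning,
            stacklevel=2,
        )
    return uses_headers
```

For a real wheel, the listing could come from `zipfile.ZipFile(path).namelist()`.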

steve.dower · February 9, 2023, 5:18pm

This sounds fine, though I guess the subsequent question is: should we properly define the canonical list of valid keys? It may be best to do that all at once and just not put headers in it, which then makes it less of a deprecation and more of a positive definition.

FFY00 · February 9, 2023, 5:33pm

Well, I’d like to introduce the concept of “installable locations” in sysconfig, probably via a new interface which better matches the current reality. I don’t have any concrete proposal yet, as it needs to be very well thought through. After that, I think it’d make sense to maybe have a PEP to better define the valid keys.

steve.dower · February 9, 2023, 6:10pm

“Installable locations” would be a great thing to have. Happy to bounce ideas as you have them.

I’d kinda like to deprecate all the schemes now and replace the whole thing with a mechanism that only works for the current runtime (i.e. where is my <whatever>), which lets the distributor define the whole thing and doesn’t claim to accurately (heh) tell you Windows paths on a POSIX system.

But based on the distutils experience, we need a fully fleshed out replacement ready to keep people from getting upset. And that’ll need at least 2 years to be released through the channels where it’s needed. Maybe it’s worth adding a note to sysconfig that we’re aware of its limitations, do not yet have an alternative, and any use should be tested thoroughly on all systems you’re intending to target? That’s pretty well beyond the reach of anyone wanting to write a generic installer, but at least we’ll have stated that it’s not a reliable option.

FFY00 · February 9, 2023, 6:30pm

Yeah, that makes sense. It would also be in line with simplifying the interpreter initialization and making it more consistent (Make the interpreter paths initialization more consistent · Issue #98947 · python/cpython). Another thing I’d like to try to introduce is support for file-system abstractions, as we’ve been doing in importlib.resources with Traversable. Anyway, this is all very up in the air, and we are getting off-topic, so we should follow up in some other place.

jakirkham · February 9, 2023, 8:57pm

@rgommers, is this relevant for you? Know NumPy ships headers, but not sure if this key is being used or not. Quick search suggests maybe the latter, but wanted to check.

pf_moore · February 9, 2023, 9:31pm

I’d have assumed that numpy wheels contained headers, but on inspection it appears not. To be honest, if numpy can manage without the headers key, I’d be surprised if there’s anything that can’t. So +1 from me.

(It does raise the question of how extensions with a C API should publish the relevant headers. But it looks like projects like numpy are solving that without involving the core or needing packaging standards, so I guess there’s an approach that people who need to know are aware of, and that’s fine.)

I’d love to see a new sysconfig API based around “installable locations”. At the risk of jumping into details too soon, we should consider:

The locations a user can ask for in pip (or other installers). At a minimum, these are the default, plus --user, --root, --prefix, and --target. I may have forgotten some. Having common, clearly explained terminology here would be great - I’ve no real idea what the intended difference is between --prefix and --root.
The customisation needs of redistributors. Do they want to be able to influence the layout of a --prefix install? Should they be able to?

I hope that makes sense - I’m not an expert here, not least because the current machinery feels scary hard to me. So if you can make the new approach something I can understand, you’ll have won big time.

oscarbenjamin · February 9, 2023, 9:52pm

NumPy provides the get_include function:

>>> np.get_include()
'.../site-packages/numpy/core/include'

That shows

$ ls 38venv/lib/python3.8/site-packages/numpy/core/include/numpy
arrayobject.h             ndarraytypes.h                npy_interrupt.h          random
arrayscalars.h            _neighborhood_iterator_imp.h  npy_math.h               __ufunc_api.h
experimental_dtype_api.h  noprefix.h                    npy_no_deprecated_api.h  ufunc_api.txt
halffloat.h               npy_1_7_deprecated_api.h      npy_os.h                 ufuncobject.h
libdivide                 npy_3kcompat.h                _numpyconfig.h           utils.h
__multiarray_api.h        npy_common.h                  numpyconfig.h
multiarray_api.txt        npy_cpu.h                     old_defines.h
ndarrayobject.h           npy_endian.h                  oldnumeric.h

Ideally it would not be necessary to use the get_include() function (which has to be called in setup.py). I’m not sure of the history of this but I guess that at least at some point in time it wasn’t possible to distribute headers reliably through other mechanisms.
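For readers who have not seen the pattern, a minimal sketch of how a build script might collect such directories follows. The helper `include_dirs_for` is hypothetical; it only assumes that each named package is importable at build time and exposes a `get_include()` function, as NumPy does.

```python
import importlib


def include_dirs_for(*package_names):
    """Ask each named package for its header directory via get_include()."""
    dirs = []
    for name in package_names:
        module = importlib.import_module(name)  # the package must be importable
        get_include = getattr(module, "get_include", None)
        if get_include is None:
            raise LookupError(f"{name} does not provide get_include()")
        dirs.append(str(get_include()))
    return dirs
```

In a setup.py this would typically feed the `include_dirs` argument of a setuptools `Extension`, for example `include_dirs=include_dirs_for("numpy")`.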

jakirkham · February 10, 2023, 12:13am

Yeah this is my understanding as well. Ralf would likely know more.

This meant other projects that relied on NumPy headers as part of setup.py builds had to have NumPy already installed or do other hacks to grab NumPy headers once available. Expect this is less of an issue with pyproject.toml.

uranusjr · February 10, 2023, 7:00am

They still need to find the headers somewhere; the only difference (from what I can tell) is that a pyproject.toml-based project can be sure numpy is present when the project is being built (via PEP 517 mechanisms).

This makes me wonder whether it would be a good idea to repurpose the headers key to point to the active prefix (e.g. virtual environment) instead. This may need a new key in sysconfig schemes, and pip would need some time to migrate, all for only a moderate improvement (dependent projects could avoid needing to import the dependency to find its header path).

njs · February 10, 2023, 8:27am

I think the current situation actually is ideal. It works, and b/c the lookup goes via sys.path, it works in any python environment no matter how you set it up – venv, PYTHONPATH, whatever, as long as you can import numpy you can build against numpy’s C API. Having a separate include directory just adds more ways for things to go wrong.

So yeah, strong agree that the include keys should be dropped from wheel. And ditto with data, tbh – no-one knows what this is supposed to point to or mean or how you’re supposed to find the data files again. Quoting the setuptools docs:

Historically, setuptools by way of easy_install would encapsulate data files from the distribution into the egg (see the old docs). As eggs are deprecated and pip-based installs fall back to the platform-specific location for installing data files, there is no supported facility to reliably retrieve these resources.

Instead, the PyPA recommends that any data files you wish to be accessible at run time be included inside the package.

That leaves just purelib, platlib, and scripts, which seems right to me: they correspond to sys.path and PATH, which are the two things that you’re guaranteed will exist in any Python environment. And that’s what wheels are – our abstraction layer for describing how to install a package into any Python environment.

rgommers · February 10, 2023, 9:21am

Deprecating the headers key should be fine I think, it’s not very useful right now. NumPy installs headers as data files within its own site-packages/numpy location.

The current situation is pretty bad actually. The recommended approach is to ship headers inside one’s own package, which numpy and pybind11 both do, and then make them accessible through a get_include() function. A big problem there is that the directory isn’t found by default when looking for headers, so you need to add the include path explicitly in your build. For that you need to execute Python code, which isn’t possible when you are cross-compiling. See pypackaging-native.github.io/other_issues/#no-good-way-to-install-headers-or-non-python-libraries.

The ideal way is to have headers in <prefix>/include/pkgname/. However, this isn’t going to happen any time soon because it requires having exactly one include/ directory and one Python install per environment - and that’s not the case in at least two circumstances:

virtualenvs don’t have their own include directory
there are system installs which allow installing multiple Python versions side-by-side

One thing I liked about Posy is that it seemed to work towards a “prefix environment”, which would improve on virtualenvs.

rgommers · February 10, 2023, 9:26am

For more context about pybind11 (vs. pybind11-global, which does install things into <prefix>/include), how to teach CMake about a nonstandard include dir, and a possible way forward, see meson-python/issues/240.

steve.dower · February 10, 2023, 4:44pm

Perhaps build backends could agree on an entry point name that can locate included files? Then a backend can enumerate the build environment’s entry points, invoke any that specify INCLUDE (or LIB?) additions, and then do their build.

(Or can an entrypoint be plain old data? In which case, a relative path to *waves hands* somewhere would also be fine.)

FFY00 · February 10, 2023, 5:04pm

This is possible, but I think it will cause some breakage, and it’s common for you to need something other than the headers when building native packages, so IMO we should try to follow up with a new approach, instead of repurposing headers.

Multiple schemes might be active, so you’d have a bunch of include/ directories. IMO sysconfig should be able to detect which environments are active (even when cross-compiling), and give you a list of all the directories to consider when building.

Yeah, as long as this is just a lookup and doesn’t need to run any code, that is also a viable option.

jakirkham · February 10, 2023, 9:09pm

Right, I just mean one can rely on the dependency to be there. Finding headers still requires some Python code as Ralf said.

If it were to be repurposed, it would probably still make sense to deprecate and then reintroduce later with the new intended behavior (once that is determined).

njs · February 12, 2023, 10:22am

It would be awesome to have more standardized cross-compilation support, and having some way to look up include files from static metadata in .dist-info seems like a reasonable idea to me. But the wheel include key doesn’t help with any of that, so I’m still +1 on deprecating it. And I’m not sure how the static metadata per-package include thing would work when we don’t have a standard way to tell a build backend “here’s the environment you’re building against, which is different from the environment where you’re executing” – it seems like we’d want to figure out what that looks like first?

encukou · February 13, 2023, 10:49am

Let me just note that “where is my <whatever>” and “where should I install my <whatever>” are very different questions. Unfortunately they have the same answer in venvs without site-packages, so in the PyPI/PyPA world they tend to get conflated, and designs that assume they’re the same can be hard to untangle.
The first question can have multiple answers. sysconfig currently only answers the second one.
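The distinction can be seen with two standard-library lookups. This is only an illustration, and the function names are made up: sysconfig gives a single install target, while the interpreter's search path may list several locations.

```python
import sys
import sysconfig


def install_target():
    """Where sysconfig says pure-Python packages should be installed."""
    return sysconfig.get_path("purelib")


def search_locations():
    """Where this interpreter will actually look for packages."""
    return [entry for entry in sys.path if entry]


if __name__ == "__main__":
    print("install into:", install_target())
    for location in search_locations():
        print("searched:", location)
```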

steve.dower · February 13, 2023, 2:20pm

Yeah, we started discussing this a while back, but it’s really messy. Honestly, about the only thing that’s likely to work is some environment variable that backends can agree to agree upon (possibly a config_setting, but that’s another discussion that isn’t really able to make progress…)

However, it seems to me that virtually every realistic cross-compiling scenario involves having the Python runtime for the environment you’re building the binaries for.[1] Once you’ve somehow acquired that, provided you don’t actually have to run it, anything static can be read out of files.

[1] Yes, I’m being careful to not use “target”, “build” or “host” here, because some compiler from years back redefined them and now nobody agrees on which is which.

takluyver · February 15, 2023, 5:18pm

People use the data key from time to time to install things like man pages, .desktop files (Linux application launchers) and Jupyter extension definitions. These are for integration rather than data that packages can retrieve, and it’s often acceptable that you’re not 100% sure where they will go - putting a man page in <data>/share/man/man1 may not always make it findable, but it works often enough to be useful.

Enough people asked for this in Flit that I eventually added support for it (under the name ‘external data’).

If we were designing a packaging format from scratch, I imagine we wouldn’t have this generic ‘data’ directory. But as it works and people are using it, I’d expect pushback if you try to get rid of the data key with no replacement.

To be clear, I’m OK with dropping headers. But data should have a separate discussion.