The tooling that handles this information does not need to understand Python requirement specifiers and, relatedly, these will have different behaviours and semantics than regular Python dependencies. Separating things that have different behaviours and semantics is intentional.
You are completely right, and @tgamblin said something similar higher up:
These questions are nontrivial. A few thoughts:
Yes a central list seems reasonable to add to this PEP.
PURLs do not have to be pre-registered. They are pretty unambiguous and while a registry may be useful (for pkg:generic in particular), I don’t see a need to forbid using any PURL that is valid (parses correctly). It’s similar to Python dependencies: it’s your responsibility to put in a name that makes sense, but nothing is stopping you from writing dependencies = ['nonsensical-package'] I think.
For virtual dependencies the situation is slightly different, because (a) there will be far fewer of them, and (b) an interface or language standard with multiple implementations may not be unambiguously versioned.
There should still not be a validation step for virtuals that, for example, would stop pip/build from building a wheel or an upload to PyPI.
The central list that @msarahan describes can exist standalone, before the name mapping mechanism exists. All it needs is PURL/virtual string, and a description field (free-form).
Allowing duplicate PURLs for the same package may make sense. E.g., pkg:generic/openblas and pkg:github/openmathlib/openblas are both reasonable, and authors may have different preferences here.
In that case, the central list may need an “is alias to” field as well.
If the central list does come into existence, it should be a project/repo under the PyPA org I’d think. Something like pypa/pyproject-external?
Regarding @tgamblin's comment on package structure and non-default build options, I don't think that can reasonably be handled. If we get to well-defined metadata that gives you the exact names and versions of dependencies, that's a significant step forward. I cannot think of a way to say "give me version 1.2 of package X built with flags --with-libfoo --no-gpl-components". That level of detail has to remain a docs-only thing I'd say.
If the central list seems reasonable to everyone, I can try putting that together, seeding it with the virtual:* and pkg:generic from the demo I posted higher up.
That sounds great. I think that diagram should go in the name mapping PEP.
While I agree that it should be possible to build and submit a package that has PURL references that are currently invalid, I think this should be something that the build tool and/or upload tool checks for and warns about.
I think we can do a decent, but not perfect job here. The pattern of splitting up devel-type dependencies from runtime dependencies is common enough that we should probably include that distinction somehow. For package ecosystems that don’t split these, you’d just always get the whole, single package. On the flip side, if the PURL doesn’t specify one of these options, the mapping should rope in both/all of them. For example, let’s think about postgres. Let’s say a package includes a dependency on pkg:generic/postgres. On OS’s that split this into libpq and the server, along with devel packages, all of those packages should be installed. The existence of a PURL for a more specific component does not preclude the existence of a broader PURL.
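As an illustration of that behavior, a hypothetical name-mapping entry (neither this format nor the exact distro package names below are defined by the PEP; they're just a sketch) could look roughly like:

["pkg:generic/postgres"]
description = "PostgreSQL database client library and server"
# a distro that splits runtime, headers and server maps the PURL to all of them
fedora = ["libpq", "libpq-devel", "postgresql-server"]
# a distro that does not split maps it to its single package
somedistro = ["postgresql"]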
This is mapping behavior more than central list behavior, so maybe it should wait until later for discussion.
Imposing overly strict requirements is something to be careful about; that doesn't seem warranted here.
"Invalid" means to me "doesn't parse". That should indeed lead to an error, but I don't think that's what you meant.
Valid (parses correctly) PURLs that are not present in a central list is something a tool may choose to warn about, but it’d be quite an unusual thing to require. Furthermore, there’s a split in my mind:
PURLs that point at another package repository or VCS (e.g., pkg:cargo/ripgrep or pkg:github/apache/arrow) are 100% clear and unambiguous, just like package names on PyPI
PURLs that start with pkg:generic and virtual packages are not 100% unambiguous. So for these, a warning may be more justified.
The comment was not about split packages. Those I agree can be handled just fine. This was more about non-default build options. E.g., if you build libfoo with configure && make, but your package won’t work with such a libfoo unless you build it like configure --my-custom-flag && make. That is something that is impossible to express.
Hi! Thanks a lot for going through with this effort! I'll just drop a few quick comments while I'm trying to grasp the PEP and gather my thoughts.
pkgconfig specification as an alternative to ctypes.util.find_library (this was the most recent discussion)
My understanding is that the motivation for this PEP, at least partly, is to facilitate easier cross-compilation for Python packages. Do you think it would make sense to include a deprecation plan for ctypes.util.find_library in the PEP and start discouraging Python package developers from using it?
This might be important, because find_library never really behaves “correctly”: it relies on glibc and gcc internals, and it also silently returns fundamentally different types of dependencies depending on the environment. E.g. it might return libraries detected by ldconfig which contain code for the current platform (the build platform in the scenario considered by this PEP), or it might return library paths printed by the C compiler which would contain code for the wheel’s host platform. The latter would also be affected by LIBRARY_PATH as opposed to LD_LIBRARY_PATH.
Admittedly, the actual practice of using find_library appears to be predominantly the discovery of native libraries:
Nonetheless this clearly creates extra confusion around an already complex subject. I suppose it’s on glibc to provide a safe public interface for inspecting the dynamic linker and its search paths, but as the PEP points out there already are interfaces for discovering the host platform’s libraries during the build. We could start by making find_library return only the native libraries (meaning ones that would have been discovered by the dynamic linker, ld.so) and display a deprecation warning.
If a package is available under one name/PURL in Fedora, another in Conda and yet another in Chocolatey, what should I do?
I know this has already been addressed later in the comments, e.g. by @rgommers, but I'd like to add that "virtually always" (yes, that's a lie) native dependencies come with either CMake or pkg-config targets maintained upstream (e.g. Meson is capable of generating .pc files, unless I'm mistaken). They define both a "canonical name" and how a "dependency" can be linked. Notably, packages from package distributions usually do not explicitly specify the latter, so it's probably not even very useful to refer to conda or Fedora package names in a Python project's metadata. E.g. a package that refers to python-devel would have to either infer the actual location of Python.h, or make additional assumptions (e.g. the merged FHS layout). OTOH, pkg-config --cflags python is explicit and platform-agnostic.
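To make that last point concrete (the output below is illustrative; depending on the platform the file may be named python3.pc, python-3.12.pc or python3-embed.pc):

$ pkg-config --cflags python3
-I/usr/include/python3.12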
I think that in general, Python package authors should depend on a single package with its canonical name, and if distros split that package up into two or more separate installable packages in their package repository, they should be responsible for dealing with that complexity.
This is a great argument. First of all, I have the impression that this often already works in practice*. Second, the complexity of "managing toolchains and native dependencies" leaking into Python or pip is questionable scoping and might do more harm than good. I imagine that tools/ecosystems like CMake and Meson are already costly to maintain, so it only makes sense to focus on interfacing with these tools rather than on duplicating the effort.
Versioning compilers, versioning “virtuals”, C/CXX standards, compiler features
I wonder if there are two different directions we might be pursuing with this?
On one hand, a Python package developer needs an interface through which to ask for an appropriate toolchain, an MPI implementation, etc. The lack of an agreeable interface forces package developers into implementing ad hoc mechanisms and, in turn, package distributions into writing ad hoc code to support those mechanisms. Tools like CMake, in fact, already provide subsystems addressing all of these issues. For example, say a Python package developer declares target_compile_features(my_python_extension PRIVATE cxx_std_14 cxx_range_for). If a package distribution maintainer attempts to build such a package with an old compiler, they get a clear and actionable error.
On the other hand there’s the need for better static(?) metadata and heuristics that could be used to automate scaffolding of packages (including constraint solving).
I don't think it's immediately obvious that we'd want the Python packaging community to maintain standards and registries describing things like cxx_lambdas or cuda_std_17, as opposed to explicitly offloading this task to existing tools (have the PEP 517 back-end that runs CMake pass this metadata through to vcpkg/conda/Spack/Nix/etc. if they wish to use it).
libpng that would be used at runtime to ensure ABI compatibility
Just for the sake of completeness, we maybe should mention that at least partially (e.g. ignoring cross-compilation) this is addressed by versioned SONAMEs. E.g. looking at libpng’s .pc file right now:
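(A typical libpng16 .pc looks something like this; the exact paths and version vary by distro.)

prefix=/usr
libdir=${prefix}/lib
includedir=${prefix}/include/libpng16

Name: libpng
Description: Loads and saves PNG files
Version: 1.6.40
Libs: -L${libdir} -lpng16
Cflags: -I${includedir}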
…I see that the name to end up in DT_NEEDED is libpng16.so rather than libpng.so.
I maybe have one question right away that I didn't find answers to in the comments: are the declared "external dependencies" to be validated by Python build "front-ends" (not sure what the appropriate name is to summarize pypa-build and pip) prior to the build, and how?
I presume that automatic validation/hardening/correctness enforcement is important to avoid a proliferation of packages that might ship misleading (outdated or plain wrong) metadata. It's somewhat obvious how to sanity-check pkg-config and CMake targets, but the "generic" dependencies seem like a potential issue. On the other hand, a wheel built in reasonable isolation is already evidence in support of correctness.
Cross compilation is important, but not a primary motivation for this particular PEP. It is more a case of “we want good cross compilation support too at some point, hopefully soon, so let’s get the cross-compilation aspects of external dependencies first-time-right”.
I’d much prefer to not mix that into this PEP, since it’s a fairly orthogonal topic. However, I’d certainly discourage Python package developers from using ctypes (and certainly find_library) - there’s a host of reasons for why that is not reliable. Okay for prototyping or quick hacks, or very specific usage scenarios within a single package - but when things go wrong it’s much better to build a Python extension module and get an error at build time, rather than weird errors, crashes or incorrect results at runtime.
Your examples of how Python packages use find_library show that we indeed need a good alternative. I'd say that that alternative is indeed pkg-config, and that we could at least mention it as the better method in the ctypes.util.find_library docs.
Yes, it is a bit of a lie - but to make the situation better, I think there is little choice but to start using .pc / .cmake files more and treating their absence in a distro as a bug.
There is of course a complementary problem: we still need to solve the issue that you cannot sensibly distribute .pc/.cmake files inside a Python package (or more accurately, inside a wheel). It's not hard to fix conceptually, but adding a new component/location to install schemes is in practice quite a bit of work. I think if that were to be tackled, it'd be best to tackle all this in one go and add it inside site-packages:
include/
lib/
lib/pkgconfig
lib/cmake
We have this problem for numpy right now - I’d really like to ship a numpy.pc file because there are so many NumPy C API users, but there is nowhere to install it to.
Agreed. I’d draw the line at a standard version, but not individual language features - that’s too detailed, and best left to build systems.
Those are front-ends, yes. That validation is not specified on purpose. For one, because prescribing the behavior of individual tools tends to be out of scope for packaging PEPs. For another, because it may shift over time. I'd imagine that we add metadata to packages first, then at some point front-ends will start using that to add better diagnostic messages. And then I'd expect a more experimental front-end to start validating more strictly and/or offering users the option to install external dependencies automatically. pip is probably going to be the last to do that, since it has to be quite conservative given its role and large user base.
Yes, that is what I’d like to rely on. It only needs a single CI job on one platform for a package to validate this. And anyway, most dependencies won’t change quickly or at all.
Improvements to Examples (validated with a full proof-of-concept implementation of using the metadata added by this PEP for automated builds, as detailed in this post higher up)
Improve abstract to state briefly what the PEP adds to pyproject.toml (closely following the suggestion by @oscarbenjamin higher up).
Complete the “reference implementation” section
Add section on split pkg / pkg-dev packages, as well as how to treat needing Python development headers
And discussion (under Open Issues for now) on versioning and canonical names of virtual dependencies
The last two items above are probably the most interesting, and were what many of the review comments were about. I'm fairly confident about the pkg/pkg-dev and python-dev one, since I believe there was a reasonable amount of agreement on the approach described, and it worked out quite well in the automated-builds proof-of-concept. I'm reasonably confident about the "versioning and canonical names of virtual dependencies" one too; however, I left it under Open Issues for now because it's important enough to fully prototype before deciding on it. This is, from this post higher up, a prototype for:
diff --git a/peps/pep-0725.rst b/peps/pep-0725.rst
index 172afbd8..13fc866d 100644
--- a/peps/pep-0725.rst
+++ b/peps/pep-0725.rst
@@ -18,6 +18,18 @@ This PEP specifies how to write a project's external, or non-PyPI, build and
runtime dependencies in a ``pyproject.toml`` file for packaging-related tools
to consume.
+This PEP proposes to add an ``[external]`` table to ``pyproject.toml`` with
+three keys: "build-requires", "host-requires" and "dependencies". These
+are for specifying three types of dependencies:
+
+1. ``build-requires``, build tools to run on the build machine
+2. ``host-requires``, build dependencies needed for host machine but also needed at build time.
+3. ``dependencies``, needed at runtime on the host machine but not needed at build time.
+
+Cross compilation is taken into account by distinguishing build and host dependencies.
+Optional build-time and runtime dependencies are supported too, in a manner analogous
+to how that is supported in the ``[project]`` table.
+
Motivation
==========
@@ -36,13 +48,13 @@ this PEP are to:
information.
Packaging ecosystems like Linux distros, Conda, Homebrew, Spack, and Nix need
-full sets of dependencies for Python packages, and have tools like pyp2rpm_
+full sets of dependencies for Python packages, and have tools like pyp2spec_
(Fedora), Grayskull_ (Conda), and dh_python_ (Debian) which attempt to
-automatically generate dependency metadata from the metadata in
+automatically generate dependency metadata for their own package managers from the metadata in
upstream Python packages. External dependencies are currently handled manually,
because there is no metadata for this in ``pyproject.toml`` or any other
standard location. Enabling automating this conversion is a key benefit of
-this PEP, making packaging Python easier and more reliable. In addition, the
+this PEP, making packaging Python packages for distros easier and more reliable. In addition, the
authors envision other types of tools making use of this information, e.g.,
dependency analysis tools like Repology_, Dependabot_ and libraries.io_.
Software bill of materials (SBOM) generation tools may also be able to use this
@@ -100,7 +112,7 @@ Cross compilation
Cross compilation is not yet (as of August 2023) well-supported by stdlib
modules and ``pyproject.toml`` metadata. It is however important when
translating external dependencies to those of other packaging systems (with
-tools like ``pyp2rpm``). Introducing support for cross compilation immediately
+tools like ``pyp2spec``). Introducing support for cross compilation immediately
in this PEP is much easier than extending ``[external]`` in the future, hence
the authors choose to include this now.
@@ -204,9 +216,9 @@ Virtual package specification
There is no ready-made support for virtual packages in PURL or another
standard. There are a relatively limited number of such dependencies though,
-and adoption a scheme similar to PURL but with the ``virtual:`` rather than
+and adopting a scheme similar to PURL but with the ``virtual:`` rather than
``pkg:`` scheme seems like it will be understandable and map well to Linux
-distros with virtual packages and the likes of Conda and Spack.
+distros with virtual packages and to the likes of Conda and Spack.
The two known virtual package types are ``compiler`` and ``interface``.
@@ -262,6 +274,39 @@ allow a version of a dependency for a wheel that isn't allowed for an sdist,
nor contain new dependencies that are not listed in the sdist's metadata at
all.
+Canonical names of dependencies and ``-dev(el)`` split packages
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+It is fairly common for distros to split a package into two or more packages.
+In particular, runtime components are often separately installable from
+development components (headers, pkg-config and CMake files, etc.). The latter
+then typically has a name with ``-dev`` or ``-devel`` appended to the
+project/library name. This split is the responsibility of each distro to
+maintain, and should not be reflected in the ``[external]`` table. It is not
+possible to specify this in a reasonable way that works across distros, hence
+only the canonical name should be used in ``[external]``.
+
+The intended meaning of using a PURL or virtual dependency is "the full package
+with the name specified". It will depend on the context in which the metadata
+is used whether the split is relevant. For example, if ``libffi`` is a host
+dependency and a tool wants to prepare an environment for building a wheel,
+then if a distro has split off the headers for ``libffi`` into a
+``libffi-devel`` package then the tool has to install both ``libffi`` and
+``libffi-devel``.
+
+Python development headers
+''''''''''''''''''''''''''
+
+Python headers and other build support files may also be split. This is the
+same situation as in the section above (because Python is simply a regular
+package in distros). *However*, a ``python-dev|devel`` dependency is special because
+in ``pyproject.toml`` Python itself is an implicit rather than an explicit
+dependency. Hence a choice needs to be made here - add ``python-dev`` implicitly,
+or make each package author add it explicitly under ``[external]``. For
+consistency between Python dependencies and external dependencies, we choose to
+add it implicitly. Python development headers must be assumed to be necessary
+when an ``[external]`` table contains one or more compiler packages.
+
Specification
=============
@@ -324,7 +369,7 @@ strings of the arrays MUST be valid PURL_ strings.
with values of arrays of PURL_ strings (``optional-dependencies``)
- `Core metadata`_: ``Requires-External``, N/A
-The (optional) dependencies of the project.
+The (optional) runtime dependencies of the project.
For ``dependencies``, it is a key whose value is an array of strings. Each
string represents a dependency of the project and MUST be formatted as either a
@@ -347,10 +392,13 @@ cryptography 39.0:
[external]
build-requires = [
+ "virtual:compiler/c",
"virtual:compiler/rust",
+ "pkg:generic/pkg-config",
]
host-requires = [
"pkg:generic/openssl",
+ "pkg:generic/libffi",
]
SciPy 1.10:
@@ -363,19 +411,14 @@ SciPy 1.10:
"virtual:compiler/cpp",
"virtual:compiler/fortran",
"pkg:generic/ninja",
+ "pkg:generic/pkg-config",
]
host-requires = [
"virtual:interface/blas",
"virtual:interface/lapack", # >=3.7.1 (can't express version ranges with PURL yet)
]
- [external.optional-host-requires]
- dependency_detection = [
- "pkg:generic/pkg-config",
- "pkg:generic/cmake",
- ]
-
-pygraphviz 1.10:
+Pillow 10.1.0:
.. code:: toml
@@ -384,9 +427,24 @@ pygraphviz 1.10:
"virtual:compiler/c",
]
host-requires = [
- "pkg:generic/graphviz",
+ "pkg:generic/libjpeg",
+ "pkg:generic/zlib",
+ ]
+
+ [external.optional-host-requires]
+ extra = [
+ "pkg:generic/lcms2",
+ "pkg:generic/freetype",
+ "pkg:generic/libimagequant",
+ "pkg:generic/libraqm",
+ "pkg:generic/libtiff",
+ "pkg:generic/libxcb",
+ "pkg:generic/libwebp",
+ "pkg:generic/openjpeg", # add >=2.0 once we have version specifiers
+ "pkg:generic/tk",
]
+
NAVis 1.4.0:
.. code:: toml
@@ -480,7 +538,22 @@ information about that in its documentation, as will tools like ``auditwheel``.
Reference Implementation
========================
-There is no reference implementation at this time.
+This PEP contains a metadata specification, rather than a code feature - hence
+there will not be code implementing the metadata spec as a whole. However,
+there are parts that do have a reference implementation:
+
+1. The ``[external]`` table has to be valid TOML and therefore can be loaded
+ with ``tomllib``.
+2. The PURL specification, as a key part of this spec, has a Python package
+ with a reference implementation for constructing and parsing PURLs:
+ `packageurl-python`_.
+
+There are multiple possible consumers and use cases of this metadata, once
+that metadata gets added to Python packages. Tested metadata for all of the
+top 150 most-downloaded packages from PyPI with published platform-specific
+wheels can be found in `rgommers/external-deps-build`_. This metadata has
+been validated by using it to build wheels from sdists patched with that
+metadata in clean Docker containers.
Rejected Ideas
@@ -516,6 +589,43 @@ Support in PURL for version expressions and ranges is still pending. The pull
request at `vers implementation for PURL`_ seems close to being merged, at
which point this PEP could adopt it.
+Versioning of virtual dependencies
+----------------------------------
+
+Once PURL supports version expressions, virtual dependencies can be versioned
+with the same syntax. It must be better specified however what the version
+scheme is, because this is not as clear for virtual dependencies as it is for
+PURLs (e.g., there can be multiple implementations, and abstract interfaces may
+not be unambiguously versioned). E.g.:
+
+- OpenMP: has regular ``MAJOR.MINOR`` versions of its standard, so would look
+ like ``>=4.5``.
+- BLAS/LAPACK: should use the versioning used by `Reference LAPACK`_, which
+ defines what the standard APIs are. Uses ``MAJOR.MINOR.MICRO``, so would look
+ like ``>=3.10.0``.
+- Compilers: these implement language standards. For C, C++ and Fortran these
+ are versioned by year. In order for versions to sort correctly, we choose to
+ use the full year (four digits). So "at least C99" would be ``>=1999``, and
+ selecting C++14 or Fortran 77 would be ``==2014`` or ``==1977`` respectively.
+ Other languages may use different versioning schemes. These should be
+ described somewhere before they are used in ``pyproject.toml``.
+
+A logistical challenge is where to describe the versioning - given that this
+will evolve over time, this PEP itself is not the right location for it.
+Instead, this PEP should point at that (to be created) location.
+
+Who defines canonical names and canonical package structure?
+------------------------------------------------------------
+
+Similarly to the logistics around versioning is the question about what names
+are allowed and where they are described. And then who is in control of that
+description and responsible for maintaining it. Our tentative answer is: there
+should be a central list for virtual dependencies and ``pkg:generic`` PURLs,
+maintained as a PyPA project. See
+https://discuss.python.org/t/pep-725-specifying-external-dependencies-in-pyproject-toml/31888/62.
+TODO: once that list/project is prototyped, include it in the PEP and close
+this open issue.
+
Syntax for virtual dependencies
-------------------------------
@@ -572,9 +682,10 @@ CC0-1.0-Universal license, whichever is more permissive.
.. _setuptools metadata: https://setuptools.readthedocs.io/en/latest/setuptools.html#metadata
.. _SPDX: https://spdx.dev/
.. _PURL: https://github.com/package-url/purl-spec/
+.. _packageurl-python: https://pypi.org/project/packageurl-python/
.. _vers: https://github.com/package-url/purl-spec/blob/version-range-spec/VERSION-RANGE-SPEC.rst
.. _vers implementation for PURL: https://github.com/package-url/purl-spec/pull/139
-.. _pyp2rpm: https://github.com/fedora-python/pyp2rpm
+.. _pyp2spec: https://github.com/befeleme/pyp2spec
.. _Grayskull: https://github.com/conda/grayskull
.. _dh_python: https://www.debian.org/doc/packaging-manuals/python-policy/index.html#dh-python
.. _Repology: https://repology.org/
@@ -585,3 +696,5 @@ CC0-1.0-Universal license, whichever is more permissive.
.. _auditwheel: https://github.com/pypa/auditwheel
.. _delocate: https://github.com/matthew-brett/delocate
.. _delvewheel: https://github.com/adang1345/delvewheel
+.. _rgommers/external-deps-build: https://github.com/rgommers/external-deps-build
+.. _Reference LAPACK: https://github.com/Reference-LAPACK/lapack
@encukou Would you be willing to be the PEP delegate for this?
I had a chat with @pf_moore out of band, who expressed preference to have someone who is familiar with the needs of downstream redistributors (and the problem space) as the delegate.
I’ll not answer that this year. (In a few days, I will try to disconnect from the internet until January.)
But I’ll probably need to say no. I can offer the point of view of a Fedora packager, but most likely can’t put in the time to learn all the other points of view.
As a Fedora packager, I’ll reiterate that ignoring the “-devel” split makes this… not great for us. Either we need to pull in unneeded dev dependencies at run time, or there’s no way to specify that some dev dependencies are needed at run time (e.g. for a tool that builds extensions for a particular C library). Or we go back to distro-specific metadata.
If such redistributors are part of the target audience, this IMO needs to be addressed better.
(I’ll note that PURL seems to be currently used for analyzing installed packages, not installing them or their dependencies. In those use cases, you don’t really care about header files – and in fact, if you’re analyzing a production system, they’re most likely not installed!)
The way it's written now has gotten multiple thumbs-ups from devs from other distros and worked out pretty well so far; in rgommers/external-deps-build all packages tried there could be handled for Fedora (which does a -devel split), for Arch Linux (which doesn't split), and for conda-forge (which mostly doesn't split, but sometimes does). So it shouldn't be far off from something that works well for Fedora too. In order to make your potential pain point concrete, would you be able to point out one or a couple of packages that may be problematic? That will make it a lot easier to respond and possibly adjust the design.
For build-requires, it is a key whose value is an array of strings. Each string represents a build requirement of the project and MUST be formatted as either a valid PURL string or a virtual: string.
For optional-build-requires, it is a table where each key specifies an extra set of build requirements and whose value is an array of strings. The strings of the arrays MUST be valid PURL strings.
Why is virtual: disallowed in optional-build-requires (and other optional- keys)?
If I have a package called proj which is pure Python but has an optional Cython accelerator module then I could use something like proj[accelerator] as an extra to build the accelerator module. I would expect to describe this as:
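[external.optional-build-requires]
# hypothetical "proj" package: building the optional Cython accelerator
# module needs a C compiler
accelerator = [
    "virtual:compiler/c",
]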
Thanks for pointing that out @oscarbenjamin. That is an editing bug I believe, rather than a difference introduced on purpose. There's a lot of repetition in the formal specification of the various fields, and some slight differences crept in. The readability of that section is suboptimal, but we kept it in that format to align with PEP 621 and the pyproject.toml specification in the Python Packaging User Guide. It probably deserves a short summary at the top, stating that all fields accept the same PURL plus virtual specifiers, and that the only difference between dependencies and the other fields is that the former maps to the Requires-External core metadata field.
I don’t think I would make a good PEP delegate. I handed over nearly all my packager duties, and plan to focus on the core.
If you have multiple thumbs-ups from devs from other distros, that should be fine. I wouldn't do much more than check what they had to say.
If you have trouble finding a delegate, I can reconsider – but in that case, for downstream experience I’d try to delegate to someone from my old team and rubber-stamp their decision.
As for my concern, if it does become an issue, it can be fixed in a future PEP. It’s a detail that doesn’t need to be perfect on the first try.
Looping back to this, it looks like we're going to bounce this back to your court @pf_moore as the PEP delegate. Let me know if that works for you.
In a similar vein, I'm trying to figure out what we'd need to move forward on this. We have a prototype showing how things would work with Grayskull and that this is a potentially feasible process (for at least a meaningful subset).
I think…
the primary external consumer of this information will be Conda but I can also imagine other redistributors potentially benefiting from this information. Maybe they’d have inputs?
a few opinions/inputs from build backend maintainers would be useful, specifically about the shape of the behaviours here that they’d need to provide around this.
a few opinions/inputs from build frontend maintainers (other than the ones that I can reasonably represent), specifically about this metadata and handling of the external dependencies.
Should someone try proactively asking people to provide inputs on this, so that we can make progress on this idea?
I'm not promising any insightful analysis, but if no-one else is willing to be PEP delegate, I guess it falls to me (I don't see any point pushing it to the SC). I will be heavily reliant on the PEP expressing all of the arguments clearly (I have no relevant experience of my own to draw on), so please make sure that happens.
Allow me to share my thoughts as a Fedora packager and I’ll take the most recent structure of spglib as an example for this. I agree with @encukou that there needs to be some separation of devel packages, but I don’t think we need to be explicit about canonical naming.
First let's consider how the project would be consuming these. If they are used within the build-system for packaging, then we should ask for whatever the package is consuming. Fedora relies on the Provides that are packaged in each relevant (sub-)package, e.g. cmake(Spglib) searches for whatever package provides SpglibConfig.cmake, pkgconfig(spglib) for spglib.pc, etc. This could be adapted to a PURL like pkg:cmake/Spglib, pkg:pkgconfig/spglib, etc. If the package does not have a clear associated Provides, e.g. if it doesn't ship cmake/pkgconfig files or if a virtual is requested, then it could ask for the header files themselves, pkg:header/spglib.h, which would be passed into dnf whatprovides */spglib.h.
Second usage is to link to a library at runtime. Here you have the SONAME/SOVERSION provided by the library in the metadata, e.g. libsymspg.so.2()(64bit), which again could be mapped to pkg:lib/libsymspg@2. Of course asking the developer to navigate through all of those runtime libraries that are used is infeasible; instead consider how Fedora does it, where after the build all of the libraries are scraped and the SONAME/SOVERSION are extracted dynamically. Similarly, it could be the responsibility of the build-system to scrape the artifacts for the shared libraries they are linked to and populate a relevant field in PKG-INFO instead. Tools like auditwheel already do this part. If the linked library is not available in the library metadata, e.g. if it's loaded via dlopen, then it should be the author's responsibility to fill in the necessary field.
Now back to the package structure of spglib. This package is separated mainly as spglib, spglib-devel, python3-spglib on Fedora, but on PyPI these are all grouped together in the same wheel. So for PyPI users, if they need any of libsymspg.so, SpglibConfig.cmake, spglib.py, they only need to add spglib to build-system.requires or project.dependencies. For Fedora packaging however, we need more control over what is BuildRequires and what is Requires, and it is ambiguous which of the 3 artifacts above we need and for which stage. This would be solved if you populate external.build-requires with pkg:cmake/Spglib, or external.dependencies with pkg:lib/libsymspg@2, etc. The special case here is that if you need spglib.py, you populate the original build-system.requires/project.dependencies or have a pkg:pypi/spglib (I'm leaning towards the latter).
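A sketch of what that could look like in pyproject.toml (the pkg:cmake, pkg:lib and pkg:pypi types here follow the tentative syntax from this post, not the PEP):

[external]
build-requires = [
    "pkg:cmake/Spglib",      # whatever package provides SpglibConfig.cmake
]
dependencies = [
    "pkg:lib/libsymspg@2",   # the runtime shared library, identified by SONAME
]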
This, however, highlights an issue: how do you patch out the original build-system.requires/project.dependencies that were used for packaging to PyPI? One option could be to have the build-system populate the relevant parts from the external fields, but that could interfere with the current automation tools that rely on PEP 517/PEP 621.
Hopefully this perspective is useful to illustrate the issues that Fedora packagers would encounter, and @encukou let me know if this is sufficient context or if I missed something. I have pinged the Fedora Python matrix room to see if they can come and chime in as well.
PS: I probably butchered the PURL syntax (maybe it should change pkg: to cmake:?), hopefully you can fix it accordingly.
I have created python-wheel-build/elfdeps to dynamically extract ELF requirements and provides from ELF shared libraries. It's a pure Python reimplementation of RPM's elfdeps tool. Internally it uses pyelftools, which is also used by auditwheel. Provides are required because some packages like Torch provide a shared library that is required by other packages.
The RPM build tool uses information from its elfdeps to add automatic provides and requires to RPM packages.
Thanks @Lecris for sharing your thoughts. spglib is a nice example, since it's indeed a little ambiguous with a C library with Python and Fortran bindings all in a single repository, and treated either as a single package or split up into separate packages by different distros:
Fortran interface: libspglib-f08-2, headers in libspglib-f08-dev
Python package: python3-spglib
Ruby package: ruby-getspg
and, for completeness, debug symbol packages too: libsymspg2-dbgsym, libspglib-f08-2-dbgsym, python3-spglib-dbgsym, ruby-getspg-dbgsym
Arch Linux: a single spglib package containing the C library, headers and other dev files, the Python package, and Fortran bindings (see the Arch Linux spglib 2.5.0-2 (x86_64) package)
There isn’t too much rhyme or reason to this - the only thing that is consistent across package managers is the source repo, which can be referred to with a PURL as pkg:generic/spglib (or pkg:github/spglib/spglib, those are equivalent).
I don’t think this is right. In (almost?) every packaging system, the build-time and runtime dependencies for a package are given by package names of other packages, not specific files from another dependency. That seems mostly true for Fedora as well, e.g. from spglib.spec almost all Requires: and BuildRequires: dependency specifiers are package names. The exception there is BuildRequires: cmake(GTest), which is equivalent to using the package name like BuildRequires: gtest-devel (please correct me if I’m wrong there, since I’m not all that familiar with Fedora).
Using specific file names was considered and rejected; it really isn't workable (see PEP 725 – Specifying external dependencies in pyproject.toml on peps.python.org). It's a lower-level concept than package names, and most distros and packaging tooling won't be able to deal with specifying files. So to the extent that Fedora needs this, it should stay inside Fedora. Which I think is fine? There is nothing in PEP 725 that will be a problem for Fedora continuing to do exactly what it does today whenever a package name isn't enough [1].
I'm not sure I understand this point. Do you mean:
1. Link to a shared library at build time and ensure it's then available at runtime? Or,
2. Don't link to it at build time, but access it at runtime through something like ctypes?
For (1), the answer would be to put the PURL for the package that contains the shared library in external.host-requires. For (2), put that PURL in external.dependencies.
Not quite. Shared libraries and .cmake/.pc files inside a Python package are not usable at build time by another package at all, since they’re not on the search paths of the relevant tools. If a Python package needs the C library of spglib, that’s an external dependency that cannot be obtained from PyPI. So it’s actually not that ambiguous: if a Python package has spglib in build-system.requires or project.dependencies, it’s the Python package only (so in Fedora always python3-spglib). And a dependency on the C library is until now a dependency that isn’t declared at all in pyproject.toml, and when we have the [external] metadata defined by this PEP it would be specified like:
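[external]
host-requires = [
    "pkg:generic/spglib",
]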
Now the remaining ambiguity is this: what if the other Python package needs the Fortran interface? This is something that can't be cleanly expressed, and I don't see a great way of taking care of such corner cases. There are a couple of ways of dealing with it I think:
If a distro did split the C and Fortran shared libraries, then the distro name mapping should return both of those packages for host-requires = ["pkg:generic/spglib"].
The name mapping can leave out the package with the Fortran bindings, and only add it if there’s a bug report for a real-world use case.
Try to be smarter and only add the Fortran bindings if a Fortran compiler was specified in external.build-requires.
All of those options seem reasonable, and it seems okay to leave that up to the distro.
I haven’t been able to find a single Python package that actually needs the spglib C library at build time. If you do know of one @Lecris, please point me to it and I can add it to https://github.com/rgommers/external-deps-build to verify that it actually will work as I described above.
I hope the above made clear that nothing needs patching?
Thank you, it is useful indeed. I do seem to detect a bit of a translation gap still. My impression remains that there won't be any actual issues for Fedora, but that you do have packaging concepts in Fedora that don't translate 1:1 to this PEP (all usage of file names in particular). The best way I can think of bridging this gap is to add more concrete examples for whatever other packages you think are particularly challenging. If you have any suggestions, I'm all ears.
Thanks for sharing @tiran. Is that meant for Fedora’s needs, or do you see that playing a role in Python packaging?
I interpret this as “Provides are required for Fedora”. If you meant something else, like “in PyTorch wheels on PyPI” then I’d like to hear more about what you’re thinking.
The way this works currently relies on preloading. There are indeed shared libraries like libtorch.so inside the torch wheels, and libtorch has a C++ API used by other packages like torchvision. If you unpack a torchvision wheel and check what its extension modules need, you see (macOS example):
% otool -L _C.so
_C.so:
@rpath/libc10.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libtorch.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libtorch_cpu.dylib (compatibility version 0.0.0, current version 0.0.0)
@rpath/libtorch_python.dylib (compatibility version 0.0.0, current version 0.0.0)
@loader_path/.dylibs/libc++.1.0.dylib (compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1345.100.2)
# RPATH entry is pointing to a random path on the AWS build machine:
% otool -l _C.so | rg -A2 RPATH
cmd LC_RPATH
cmdsize 80
path /Users/ec2-user/runner/_work/_temp/conda_environment_9864892836/lib (offset 12)
So torchvision builds against the shared library present inside the PyTorch wheels, doesn’t run auditwheel to vendor that shared library, and has an exact ==2.x.y runtime dependency on the version of torch it built against so it can assume that import torch will load the shared library that torchvision needs into memory.
There’s a couple of different ways to use PyTorch from C++, e.g.:
Libtorch is a separate package, so you’d want to declare this in external.host-requires (if you’re building a Python package that is; Libtorch is primarily for situations where you cannot use/deploy Python).
That said, I don’t quite see why Fedora actually needs this. BuildRequires: cmake(GTest) and BuildRequires: gtest-devel should act the same, and the latter is arguably a bit simpler since it avoids the “reverse lookup of package name” step. ↩︎
The main issue is that if you rely on something like pkg:generic/spglib you are guaranteed to encounter ambiguities with different naming conventions, different packaging policies, etc. My main argument is to discourage a format like pkg:generic/spglib as much as possible, with either:
Specifying specific variants for each distro: debian:libsymspg2-dev, arch:spglib, etc.
Supporting the non-pkg variant, i.e. cmake(spglib), which automatically translates to spglib-devel or whatever naming convention is present on RHEL, openSUSE or any other rpm-based packaging system. If the distro cannot expand it: 1) it should, 2) it could use distro-specific labels, 3) how much do we want it to actually be supported?
Another issue related to the naming convention is who distributes the name maps? Do the distros have to maintain such a name-map for their packages? Would it be the build-system’s responsibility to maintain them for all distros? Would it be on the consumer side to define for each variant?
On the other hand, the cmake(Spglib) artifact does not require maintaining such a map; one can simply use:
$ dnf whatprovides "cmake(Spglib)"
Last metadata expiration check: 0:00:08 ago on 2024年08月29日 18時30分39秒.
spglib-devel-2.2.0-2.fc40.i686 : Development files for spglib
Repo : fedora
Matched from:
Provide : cmake(Spglib) = 2.2.0
Then even when the naming changes, e.g. a better spglib-ng comes along with a compatible API, we don't need to update anything.
Another approach would be to rely on the known paths and ask dpkg, spack etc. to search for the package that provides that file.
$ dnf whatprovides "*/SpglibConfig.cmake"
Last metadata expiration check: 0:04:59 ago on 2024年08月29日 18時30分39秒.
spglib-devel-2.2.0-2.fc40.i686 : Development files for spglib
Repo : fedora
Matched from:
Filename : /usr/lib/cmake/Spglib/SpglibConfig.cmake
Ambiguous request
Let's say we are requesting the spglib package; then what exactly do we need from it? If we need the CMake files, how do we know which of the packages provides them? Do we need the runtime files (more commonly this is the case for pre-processors: swig, fypp, etc.)?
This would again be resolved if we don't request a specific package, but instead the precise artifact that we want: cmake(Spglib), pkg-config(spglib), /usr/bin/swig, etc., with some syntactic sugar to make it a PURL.
Python libraries can advertise what they depend on
At this point, let's say we have a successful build; a lot of the time we do not need to list the runtime dependencies ourselves. @tiran mentioned a tool that can be used within Python which other build-systems can call. The idea is that after you build, you can introspect the libraries that the result needs to run, and if you go with the artifact approach instead of the package mapping, then you can simply ask the system what package provides the relevant library:
$ dnf whatprovides "libsymspg.so.2"
Last metadata expiration check: 0:17:40 ago on 2024年08月29日 18時30分39秒.
spglib-2.2.0-2.fc40.i686 : C library for finding and handling crystal symmetries
Repo : fedora
Matched from:
Provide : libsymspg.so.2
This is especially useful because there can be various compatibility packages, different versions of libraries, etc. which would not be apparent from the package name alone.
And in the case of runtime dependencies, both deb and rpm systems allow introspecting this information, and the consuming project does not have to maintain it; you just need the build-system to run something like elfdeps and populate the relevant Python metadata files in dist-info.
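For example, the libraries a built extension module links against can be read from its ELF dynamic section (the module name below is illustrative); each entry can then be fed to dnf whatprovides as shown above:

$ readelf -d _spglib.cpython-312-x86_64-linux-gnu.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libsymspg.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]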
Beware of how each package builds
Many projects have fall-back build processes, e.g. in CMake you have FetchContent_Declare(... FIND_PACKAGE_ARGS), which runs find_package and, if that fails, downloads the dependency via FetchContent; or they have bundled sources for the dependencies, etc. This should be taken into account, since you may not want to use those fallbacks, and just because a project built successfully after external-deps-build injected a dependency, that does not guarantee that the dependency was actually used.
They actually are. This is what I've been working on recently in spglib, and scikit-build-core is creating first-class support for that. cython-cmake partially shows this integration.
I don't think it is that easy. Take rapidfuzz for example. It supports both a compiled C library and a Python-only implementation. How do we tell it that we want the Python library rapidfuzz with or without the compiled library?
This is about “Python libraries can advertise what they depend on”.
Similarly with requesting the artifact instead of the package:
$ dnf whatprovides "pkgconfig(spglib_f08)"
Last metadata expiration check: 0:43:01 ago on 2024年08月29日 18時30分39秒.
spglib-fortran-devel-2.2.0-2.fc40.i686 : Development files for spglib with Fortran bindings
Repo : fedora
Matched from:
Provide : pkgconfig(spglib_f08) = 2.2.0
The Provides do not cover CMake components, because those are quite a bit trickier to introspect statically, but there are plenty of other artifacts that can be used.