PEP 725: Specifying external dependencies in pyproject.toml

Coming back to the thread and the PEP again months later, I feel confused again. Just to make sure I have it straight in my head:

“build requirements” means things that need to be available on the machine where the wheel is being built (the dev’s machine when preparing a wheel for distribution, or the user’s machine after downloading an sdist), and that need to be looked up according to that machine?

“host requirements” means things that need to be available on the machine where the wheel is being built, but need to be looked up according to the end user’s machine, even when the dev is building it locally?

And “dependencies” means things that need to be available on the user’s machine after installation, to use the installed wheel - the same as with native Python dependencies (where the first two categories would both be “dev dependencies” for native Python)?

Re build/host requirements: that sounds about right @kknechtel. At least, I interpret what you wrote as in agreement with the cross-compilation section of the PEP. Maybe easiest to illustrate by example. Say you have a package with some C code, and that C code depends on libpng. Then your C compiler is a build dependency and libpng is a host dependency - because after the wheel is built, you don’t need the C compiler anymore, but you still need libpng. This also means that if you’re cross-compiling, say you’re building an aarch64 wheel on x86-64, your C compiler will be built for x86-64 while libpng should be a binary built for aarch64.
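
To make this concrete, here is a minimal sketch of how the libpng example could be declared under the proposed [external] table (illustrative only; the exact PURL and virtual: spellings follow the PEP draft and may still change):

[external]
build-requires = [
  "virtual:compiler/c",   # the C compiler: only needed while the wheel is built
]
host-requires = [
  "pkg:generic/libpng",   # linked against: still needed after the wheel is built
]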

Pretty much. However, note that this PEP does not say anything about the distribution method or about how to make a runtime dependency actually available. That’s out of scope for the PEP, and nothing changes here from how it works today. If you’re building the wheel to upload to PyPI, you’d typically run a tool like auditwheel over the wheel, which will vendor native runtime dependencies into the wheel. So in the example above, auditwheel will perform all the steps needed to include libpng in the wheel, so the end user doesn’t really need to worry about this dependency.

2 Likes

Who is “you” in this sentence?

If “you” means the person who installs the wheel, then why would libpng be in [host-requires]? Shouldn’t it be in [dependencies]?

If “you” means the person who builds the wheel, regardless of where the wheel is installed, then why is libpng still needed after the wheel has been built?

Essentially, I’m trying to understand why there are three categories here.

In broad terms, how does this actually work? Is it looking for .o files within the wheel and scanning for their not-statically-linked symbols, or something like that? Once it decides to vendor something, is that any more complicated than adding the corresponding .so to the wheel and updating the RECORD file?

Edit: got turned around by different names, never mind me

Yes, “you” is the person building from source here. Why libpng is still needed after the build: the way this works is that in the build config files for the package (e.g., setup.py, meson.build or CMakeLists.txt), there is code to detect libpng. And then the libpng dependency info (things like paths to directories containing the libpng library and header files) gets passed along to the compile step that in the end produces a Python extension module. It may look like:

gcc -shared -fPIC -I/path/to/include-dir -L/path/to/lib-dir my_pyext_module.c -lpng -o my_pyext_module.so

Now that produces my_pyext_module.so, which has a reference (typically via RPATH, except on Windows) to /path/to/libpng.so. And when the Python interpreter imports the package and that in turn imports my_pyext_module, that reference gets resolved.

This is all a little complex and the details are going to differ for different platforms, build systems, etc. But I hope it’s clear that libpng.so is still needed. In the end you now have a wheel that contains some reference to that .so file somewhere on the hard drive of the machine on which you built it. All fine if the end user is also the person who built the wheel. But that reference is not valid anymore as soon as you move the wheel to another machine. This is where auditwheel & co come in, to remove those references and vendor the .so files inside the wheel.

You’re basically asking how build systems and Python packaging work in the presence of native extensions. I don’t think I can explain it much better in this thread, and there’s no single place AFAIK that will explain everything from beginning to end, so I’ll give you a couple of links:

The above describes the basics. Reality is a lot more complicated - browsing the code, docs and issues on auditwheel, delocate and delvewheel will give you a better impression of that.

The third category, “dependencies”, is something that hasn’t come up yet in the above libpng example. The PEP does contain several other examples. Take Spyder, for example: it’s a pure Python package, but at runtime it needs ripgrep, tree-sitter-cli, and fzf. The first two are written in Rust, the third in Go, and none of them are available from PyPI. So how does Spyder express that it needs those things? Today it can’t; with this PEP it can put them in dependencies = under [external].
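
As a rough sketch (the PURL names here are illustrative, not a vetted mapping), that could look like:

[external]
dependencies = [
  "pkg:generic/ripgrep",
  "pkg:generic/tree-sitter-cli",
  "pkg:generic/fzf",
]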

1 Like

To be clear, it’s needed on the machine where the wheel is installed - either as an existing copy that was referenced when the wheel was built locally, or as a vendored copy included in a wheel that was built remotely?

Say a developer builds and publishes such a wheel, neglecting to vendor libpng. I have libpng on my system already. Could I somehow install the wheel with Pip normally, and then use a similar tool to fix the references so they point at my local libpng? (Does such a tool exist? Would the Pip developers be interested in adding that functionality?) Similarly, what if libpng was vendored, but I want to relink to my system libpng and remove the copy to save space?

But in terms of the actual wheel contents, does the vendored dependency consist of more than just that .so file?

Would it make sense to record a hash for that .so file somewhere, in case a build system implements some kind of cache? Well, I guess RECORD would have a hash, since it has a hash of every other file in the wheel. But might someone want to make a lockfile that records such a hash in order to avoid recompiling?

In the future that you envision, could the wheel for Spyder vendor those dependencies? Would they each conceivably be single files that make sense to just stick in the wheel, and expect a future Pip (or another installer) to know what to do with them (or delegate to something that does)? Would it make sense to record hashes for those?

You need to build against the libpng that will be used at runtime to ensure ABI compatibility. If pip is asked to swap out the vendored lib for a local system one, then pip has no way of knowing whether the two can be expected to be ABI compatible. This PEP does not give pip any way to know that, and realistically it would be impossible to attempt anyway. What could be possible is declaring ABI relationships between wheels, so that pip could understand them from the wheel metadata, but that is not being proposed here.

There is no general way to describe the ABI compatibility of shared libraries except by referring to a very specific collection of binaries that are known to have all been built together, effectively on the same machine. This is basically something like “the set of all x86-64 packages for Ubuntu 22.04”. Without knowing that all packages were effectively built together for compatibility, there is no way to know whether other binaries in the wheel would be compatible with anything not shipped in the wheel.

2 Likes

Ouch. It starts to sound like a miracle that anyone can use dynamic libraries at all, in any ecosystem.

1 Like

I think it would be worth succinctly mentioning the “dependencies” case earlier in the PEP e.g. here is where the other two cases are explained:

The “dependencies” part is not mentioned until later.

Actually it would be good to just summarise the whole thing succinctly at the top:

This PEP proposes to add an [external] table to pyproject.toml with three subheadings “build-requires”, “host-requires” and “dependencies”. These are for specifying three types of dependency:

  1. build-requires: build tools that need to run on the build machine
  2. host-requires: built for and needed on the host machine, but also needed at build time
  3. dependencies: needed at runtime on the host machine, but not needed at build time

Ouch. It starts to sound like a miracle that anyone can use dynamic libraries at all, in any ecosystem.

Which is one of the big reasons why classical “distributions” (e.g. GNU/Linux or the *BSDs) exist, and why, when someone in a language-specific ecosystem says they don’t need a distro and are just going to “wing it”, everyone who has been around that block a few times laughs and takes another shot.

Can compilers somehow be mapped to Chocolatey and libraries to vcpkg? (choco install visualstudio2022-workload-vctools for MSVC maybe? rust and llvm are also on there.)

It’s possible, but you need to track very carefully in the metadata which ABI (i.e. which libraries and their versions) your library depends on.

Depending on the version scheme (vis-à-vis ABI stability) of the library you’re linking to, this could actually be a constraint like >={version_at_build_time},<{next_major_version}; it doesn’t have to be a 1:1 pin.
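
As a purely illustrative sketch (this is not a syntax defined by PURL or the PEP), such a constraint might conceptually be written as:

host-requires = [
  # hypothetical range notation, for illustration only
  "pkg:generic/libpng@>=1.6,<2",
]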

What I wrote above is exactly how conda/conda-forge does it. The big difference is that we’re shipping libpng as well (not dependent on an opaque artefact on the user’s system). If libpng changes its version/ABI, we can rebuild all dependent packages against that new version, and everything stays self-consistent.

Everything listed under short-term migrations on our status page is an ongoing operation of that kind, where some underlying package changed, and we’re rebuilding the ones depending on it for the new ABI.

4 Likes

Let me put it slightly differently: the ABI compatibility between binaries X1, X2, … and binaries Y1, Y2, … might not be as strict as one to one. However, to know that X1 is ABI compatible with whatever Y is installed, you do need to know which of Y1, Y2, … is installed. In the case of conda it is possible to identify uniquely the exact package that installed libpng. The correspondence between that package’s unique ID and all of the potentially ABI-relevant steps that produced the installed binaries is one to one, so if conda knows that Y2 is installed, that is in principle sufficient information to determine ABI compatibility for the libpng binaries present on the system.

In the case of pip it is not even possible to know that any Y is installed at all, and there is no language that could be used to describe the fact that this libpng is Y1 rather than Y2, let alone any way for pip to make use of that information. This is why I said that it could be possible for pip to handle ABI compatibility between wheels: at least in principle, packaging could be designed in such a way that pip could uniquely identify which wheels were installed. There is no possible way that pip could usefully understand ABI compatibility for system shared libraries.

It’s one reason conda is so great. :slight_smile:

FWIW, that’s how Linux distros do it as well.
(Somewhat tongue-in-cheek: it seems that the main difference between distros and Conda is that distros ship the OS kernel(s) as well…)

2 Likes

Absolutely!

If we zoom out as far as we can then:

  • Yes, conda doesn’t take on some responsibilities of the OS (like the kernel and some other fundamental bits)
  • OTOH it takes on all major OSes simultaneously (~triple the infra, toolchain, idiosyncrasies & general packaging woes)
  • Distro packages are more on the “critical path” than conda packages, which generally live in environments that can be recreated. That means more flexibility, easier upgrades and less responsibility for conda
  • Distros generally have one version per package, while conda has many. You can easily install an older or newer version of an important package (than the one your distro ships) into your conda environment. We also have 5 concurrent CPython versions we build for, and so on.

I’d be curious how maintenance effort (e.g. in person hours) would compare between conda-forge and a noncommercial distro like Debian. I honestly don’t have a clue, but I suspect it’s roughly in the same ballpark.

My point above was mostly: it can be done. That doesn’t provide a solution, but at least a data point, or rather an example to learn from.

2 Likes

A few packages show some interesting dependencies where it wasn’t immediately clear what the correct dependency was. E.g., for psycopg2-binary I had to choose whether to depend on PostgreSQL or libpq, and after choosing the latter I found that the mapping from:

host-requires = [
  "pkg:generic/libpq",
]

to distro-specific package names was non-uniform (libpq on Fedora and conda-forge, but postgresql-libs on Arch Linux). The set of dependencies was interesting enough to serve as a proof of concept for a central name mapping repository as well, I’d say.
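
For instance, a mapping file (whatever its eventual format; the keys and layout below are invented purely to illustrate the non-uniformity) would have to record something like this for libpq:

# illustrative format only, not a defined standard
["pkg:generic/libpq"]
fedora = "libpq"
conda-forge = "libpq"
arch = "postgresql-libs"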

I think a central name mapping is not as sustainable as a mapping located and maintained in each ecosystem. What would be centralized instead is perhaps a list of URLs matched to a given ecosystem name, i.e. a link to a default mapping for that ecosystem. The problem I see with centralized maintenance is that it spreads ownership of the maps across too many groups. Who should review PRs? Do the mapping lists follow a release cycle? Who manages that? One nice thing about a link to a mapping, instead of an actual mapping, is that it would allow users to easily customize their behavior. Perhaps these could have a simple hierarchical behavior, where one mapping could override or extend another.

At the very least, the mapping should be split out into one file per ecosystem, such that the file separation would confer “ownership.”

The Postgres example is certainly a worst case, where there is no recognizable relationship (in terms of simple string comparison) between the package names that provide a common resource. We certainly cannot reliably count on mapping pkg:generic/something to any ecosystem-specific mention of something, though perhaps we could have some guessing based on package names and/or other metadata (description?)

In other words, I think what might be most helpful to define in this effort is not a central map, but instead a standard for the format of the mapping, and some effort to seed concrete implementations of this standard mapping into several major external dependency providers.
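
To sketch the “list of URLs” idea from above (the keys and URLs here are made up purely for illustration):

# hypothetical central index that only points at per-ecosystem mapping files
[ecosystems]
fedora = "https://example.org/fedora/external-mapping.toml"
arch-linux = "https://example.org/archlinux/external-mapping.toml"
conda-forge = "https://example.org/conda-forge/external-mapping.toml"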

2 Likes

Thanks Oscar, good idea - we’ll include that in the next update.

I don’t see why not, since this is all just custom code. It doesn’t feel quite right in general of course, but if this is a special case that is necessary, one can just make a rule that’s specific to this OS / package manager combination.

Yes, fully agreed. This draft text aligns quite closely with what you are saying, I believe: https://github.com/rgommers/peps/blob/pep-name-mapping/pep-9999.rst#maintenance-costs-of-name-mappings.

The first time a distro maintainer encounters pkg:generic/some-new-name, they’re indeed going to have to map that to the name of some-new-name in their own package repository. Whether Grayskull and other recipe generators would want to take a guess or simply report the unknown name, I’m not sure; I think in practice the latter would be perfectly okay. The win is to have it become a known name, so that all common external dependencies get translated correctly every time after that first encounter.

Indeed, this is very close to or exactly what I had in mind; sorry I wasn’t up to date on reading that draft.

When you talk about something becoming a known name, I think there are two parts, and I’m not sure the PEP is explicit about this (or maybe I’m just dense).

  1. A centralized list of names that are defined. This is NOT the mapping. It is a central registry that serves to avoid redefinition of a given PURL identifier. This is maintained as a PyPA project.
  2. Per-distro mappings that follow some format defined by the PEP. These could be specified by a web URL or a file URL on a local filesystem. One mapping is maintained by each distro/package ecosystem.

Therefore, a Python package maintainer looking to add external dependencies might go through a process like this:

1a. Look up their package name in their distro’s mapping file. This is an inverse name search, from the native package system they are presumably familiar with, to the PURL that they may not be familiar with. If they find it, they use the PURL and need not continue.

1b. If the mapping does not cover the package yet, look up the package in the central PURL registry directly. If they find it, they use the PURL and need not continue.

1c. If the PURL for this dependency is not registered yet, the Python package maintainer may submit a new entry to the PURL registry.

2. Once a new PURL registry entry has been accepted, all distros/package ecosystems need to add a mapping for it. To facilitate this, we (PyPA, contributors to this effort) should provide a tool that notifies the package ecosystems of the PURLs they should map at their earliest convenience.

Is this too concrete for a PEP? If I’m on the right track, let me know and I can spend more time fleshing this out and coming up with diagrams.

3 Likes

A quick question about the design of the proposal:

Is it explicitly desired to keep “external” dependencies listed in separate places from pure-Python dependencies that are used in the same context?

Or is this only a consequence of concerns for simplicity, backwards-compatibility etc. (e.g., PEP 508 can’t represent these dependencies, so existing tables would have to store either PEP 508 or PURL)?

The reason I ask is that I have an idea for using a table to describe a dependency that needs complex information (perhaps a package-specific index URI), or to reuse a complex specification (even just PEP 508 allows a fair bit) under a simple name that can be applied in multiple contexts, or just to write things out in a more explicit long form. It would generally be more extensible, and it occurred to me that this could be designed to cover non-Python dependencies as well. That, in turn, could reduce the number of places that need to be defined for holding lists of those dependencies, if it’s okay to put them together. With this idea, “existing tables would have to store either” PEP 508 or a plain string name defined elsewhere in the data (and the determination could be made on that basis, for example).

(For what it’s worth, I’m -.5 on the idea of inventing a separate virtual: scheme, specifying it in a few sentences in the middle of a PEP, and then describing the resulting strings that start with either virtual: or pkg: as still being “PURL”.)