PEP 517 Backend bootstrapping

As a proof of concept, I’ve implemented the <bootstrap-backend> + magic file version in the pep517 project. It was very easy.

On the setuptools side, the implementation of __bootstrap_backend__.py will probably look like this:

import sys
sys.path.insert(0, '')
from setuptools.build_meta import *

(Note that this will only insert the source root into sys.path when building setuptools from source. When using setuptools to build your package, sys.path will not be affected.)

It still seems a little inelegant to involve a magic string and an extra magic filename in the root directory of the repo when most self-hosted backends are going to have the same incantation - add a directory to sys.path and then load the hooks from the real backend. I’ll do it for Flit if that’s what’s decided, but I find the suggestions with declarative config for this much neater.

Honestly, elegance really doesn’t matter in this case. I know it’s tempting to design a generic and general interface that works for everyone because that’s what we always do because we’re library maintainers, but this is a super rare problem to have, and you can always abstract over it with a dedicated backend that emulates the semantics you want.

Most backends won’t or shouldn’t bother self-hosting, and of the ones that don’t want to build themselves from setuptools, you can get exactly the semantics you describe by simply using intreehooks for your build backend.

Similarly, all the other ideas in this thread can be implemented fairly easily by other self-bootstrapping meta-backends if anyone cares to use them.

I don’t see a reason for intreehooks to exist if there’s a standardised mechanism to do the same thing. I intend to switch Flit away from it once we have something well supported. We’re probably not talking huge numbers, but I’d imagine that most PEP 517 backends will want to be self-hosting and thus will use the mechanism we design.

In the early designs, PEP 517 explicitly considered bootstrap backends and found them to be out of scope. I’m pretty sympathetic to this view. I do not see why we have to design a “nice” mechanism for people to use for something they shouldn’t need to do in the first place.

Some PEP 517 backends will just not support being built indefinitely from source (they can bootstrap with wheels), some PEP 517 backends will just use a self-bootstrapping backend to build themselves, and others may use meta-backends like intreehooks.

The reason to add a standard self-bootstrapping mechanism is to make it possible to bootstrap. Once it’s possible to bootstrap, you can easily make meta-backends that give you whatever semantics your heart desires. Already we’re starting to discuss supporting backends that use the src layout or that have symlinks in their code organization. These are all edge cases that we can completely ignore if the version that all frontends are required to support takes a form that is super easy to implement, has no edge cases by design, and is in fact deliberately inelegant in this way.

I made an alternative proof of concept for the bootstrap-backend-location design: https://github.com/pypa/pep517/pull/42

I made intreehooks to use with Flit, and I plan to move away from it as and when we standardise something. I’d rather bootstrap without relying on setuptools if possible, and I imagine that other build backends would want this too. So we’re not designing for n=1 here, though I agree that n is probably pretty small.

I’m fine with any of:

  1. @takluyver’s approach, assuming that the sys.path.pop(0) part of it is made part of the spec.
  2. The version with a magic-named build backend file.
  3. Requiring build backends to be built from wheels.

Hopefully this represents something like an end to the bikeshedding nightmare that this thread has become?

Edit: I have a suggested wording for the PEP 517 update in this PR. The change would be:

In order to allow build backends to build themselves, it is possible to supply
an additional location to search for build backends with the
build-backend-location option. If specified, this must be a path relative
to the root of the source tree which will be added to sys.path while the
build backend is imported. Build frontends must remove this location from the
module search path before executing the build backend hooks.
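A minimal sketch of how a frontend might implement that wording. The helper name `load_backend` and its signature are hypothetical, and this is not the pep517 library’s actual code; it only illustrates "add the location while importing, remove it before running hooks":

```python
import importlib
import os
import sys


def load_backend(source_dir, backend_name, backend_location=None):
    """Import a backend module, optionally searching an in-tree location.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if backend_location is None:
        return importlib.import_module(backend_name)

    # The option is a path relative to the root of the source tree.
    extra = os.path.abspath(os.path.join(source_dir, backend_location))
    sys.path.insert(0, extra)
    try:
        return importlib.import_module(backend_name)
    finally:
        # Take the location off the module search path again before
        # any backend hook is executed.
        if extra in sys.path:
            sys.path.remove(extra)
```

With this shape, the backend module itself is imported from the source tree, but anything it imports lazily later on is resolved against the unmodified sys.path.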

Personally, I don’t like @takluyver’s approach because it extends the API with a magic path thing. Of the remaining two, I prefer wheels-only over @pganssle’s approach. I’m not religiously against either of those.

I’m worried that this part does add finicky edge cases… at a minimum, you need to say what the frontend should do if importing the backend mutates sys.path so that sys.path[0] is no longer the bootstrap location. And what if the backend does want to keep that directory on sys.path? For example, lots of large packages eventually start doing lazy imports, which could be a mess if directories are randomly disappearing from sys.path. Removing things from sys.path is a really rare thing in Python in general. Most packages are going to assume it never happens, and may react in weird ways.
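To make the first edge case concrete, here is a small simulation using a plain list as a stand-in for sys.path. All paths and function names are made up for illustration; the point is only that an index-based cleanup removes the wrong entry once the backend import has prepended something of its own:

```python
def simulate(remove):
    """Frontend adds a bootstrap path, the backend import prepends its own
    entry, then the frontend runs the given cleanup step."""
    path = ["/site-packages"]                # stand-in for sys.path
    bootstrap = "/src/project"               # hypothetical bootstrap location
    path.insert(0, bootstrap)                # frontend adds the location
    path.insert(0, "/src/project/_vendor")   # backend import mutates the path
    remove(path, bootstrap)                  # frontend cleans up
    return path


def pop_first(path, bootstrap):
    path.pop(0)             # assumes index 0 is still the bootstrap location


def remove_by_value(path, bootstrap):
    path.remove(bootstrap)  # removes the entry the frontend actually added


print(simulate(pop_first))
# ['/src/project', '/site-packages']  (the backend's entry was removed instead)
print(simulate(remove_by_value))
# ['/src/project/_vendor', '/site-packages']
```

Removing by value avoids the index problem, but it does nothing for the second concern: a backend that relies on lazy imports from that directory would still see it vanish.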

In software, it’s usually better to do things in the most boring and unsurprising way if at all possible, which means setting up sys.path once at the beginning of your program, and then leaving it alone.

You’re right, it doesn’t have to be nice. If there was no nice option, we would pick a non-nice option. But, that doesn’t mean we should intentionally pick the ugliest option! We should pick the nicest of the options available to us.

We want to encourage lots of build backends to be developed. The nicer we can make the experience of developing a build backend, the better the packaging ecosystem will be for everyone.

I agree that your "<backend>" hack is easy to implement, but compared to build-backend-location it’s much harder to explain or remember or use – it’s full of arbitrary quirks. As the Zen says: “If the implementation is hard to explain, it’s a bad idea.”

And I still haven’t seen any example of a problem that people are worried will be caused by build-backend-location? I get that you’re worried that if we implement it, then people might use it, and you think they shouldn’t do that, or that it would be ugly or something. But I don’t know what bad consequences you’re afraid will happen if people use it, beyond “if people do an ugly thing, it will be ugly”. Do you have any concrete examples of a problem that it could cause?

A lot of people (especially at large companies) are wary of using pre-built wheels, because they want to keep reliable records of which source code they’re running in production, so that they can do things like audits (manual or automated). Using pre-built wheels can break this chain. For example, the npm folks recently dealt with a really severe compromise of the event-stream package, and part of how the malicious code was hidden was that it was only injected into the “pre-minified” package on npm, not the “source” package.

Of course there are ways to work around this; a company that wants to bootstrap their internal Python builds could add some sort of manual exceptions for the “root” packages, or build them by hand. Or they could do some recursive bootstrap thing, like C compilers do. These approaches all involve significant manual hassle and introduce a lot of room for errors, neither of which tends to make security auditors happy. (Rust relies on recursive bootstrapping for its compiler, and it’s been such an issue that people have actually built an entire second compiler for the sole purpose of simplifying the bootstrap chain and checking for “trusting trust” attacks.)

Also, this hassle scales with the number of different build backends in use. We want to encourage people to design new build backends. We definitely don’t want to end up in a situation where companies refuse to use any package that doesn’t use one of a small set of pre-selected build-backends that they’ve already set up special bootstrap hacks for.

So… you’re right that in principle we could get away without having any special support for bootstrapping in PEP 517. But given that it’s simple to add, has minimal downsides, and will significantly simplify a lot of people’s lives, I think having some special support is a good idea.

Hi, sorry I was a bit unresponsive in this thread, since I’m currently on vacation. I was wanting to reply, but the Internet is more spotty than I anticipated here in Bohol (wonderful place though!) :stuck_out_tongue:

Thinking about it, my unwillingness isn’t really about putting . in sys.path, but that this is used to import the to-be-installed package before it is installed. This unnecessarily ties the metadata retrieval process (e.g. setup.py egg_info) to the assumption that the package is importable on the target machine, which causes problems for package management tools when the tool builds a (platform-agnostic) dependency tree.

In the related GitHub issue I proposed a similar approach: it may be a good idea to have built-in in-tree build tool discovery similar to @takluyver’s, and AFAICT there isn’t really a downside to that. A magically-named build-backend entry point is also fine, but I feel there is no harm in making an optional feature explicit.

I don’t like sys.path.pop(0) since it has a potential to break in edge cases (0 is too magical). Requiring wheels is an acceptable solution (essentially by telling people we don’t want to “solve” it); I have little experience working with large corporations, but @njs’s concern sounds reasonable to me.

This part of the discussion makes me wonder if there should be a mechanism for backends to declare whether they support recursively building from source. Otherwise, there won’t really be a way to find out without manually trying each version when it comes out, and possibly failing far into the build process (and not knowing whether it’s a bug or just an unsupported use case). (And would the cause even be clear from the failure message to let front-ends provide a good message?) Declaring would also let the process fail faster if --no-binary :all: is passed but not supported.

I think right now, 100% of the backends we know about do or will support recursively building from source. And certainly we hope that most backends will support this. So it’s an interesting idea, but I think we should split that off into a separate discussion, and maybe not bother discussing it until we actually encounter a backend that would need this.

Nothing about this makes it harder to build a backend, to be clear. It just makes it slightly harder to build a self-bootstrapping backend, which is a very different proposition. But I’ve made this point repeatedly and evidently it’s not getting through.

So far, we’ve seen two front-end implementers, @pf_moore and @bernatgabor, plus @dstufft all in favor of “this is out of scope for PEP 517, just make them use wheels”. Given that it seems impossible to prevent scope creep in this proposal, I’m now in the wheel camp as well. The PEP is actually completely silent as to whether or not front-ends are required to give you an option to build everything from source, so it would not require a PEP update for pip to make --no-binary :all: not apply to building the isolated build environment.

I strongly recommend that you do not break the python path discovery semantics just to implement this dubious “feature”, but at this point I don’t think this conversation is going anywhere.

I think we can assume that most build-backend developers will want to actually use their own build backend to build their build backend (say that 5 times fast). Heck, @takluyver actually created a second independent backend package just so he could use flit to package flit :slight_smile:. Making it easy for backend devs to use their own backend definitely counts as a better dev experience IMO.

Heh, overall this has seemed pretty productive to me so far! If you think this is bad you should have seen the original PEP 517 discussions…

No idea what you mean by “break the python path discovery semantics” tho. There is lots of precedent for having explicit ways to add stuff to sys.path, starting with PYTHONPATH and going from there.

Yes, but again only the bottom of the stack needs to be self-bootstrapping. intreehooks is a perfect example of what I’m talking about because it’s basically one backend that is self-bootstrapping that allows every other backend to get the semantics they want for their own backend. You are talking about allowing people to break the build isolation for specific backends in the configuration file. I am suggesting that PEP 517 should always be calling the hooks without manipulation of the PYTHONPATH, which should be the sole domain of the backend.

If you want to understand the part about the semantics, I suggest reading the many times that I explained it earlier. I have no idea why you are so insistent on making it so the front end’s search path for the backend leaks into the backend’s import path, but that has always been my objection. I see no reason why it needs to be a critical feature of the spec and if you want an option for it, you can super easily implement a backend that does that (in fact that’s literally what I’m planning on doing with setuptools).

I read this discussion as “should intreehooks be made part of the PEP”, and I feel the answer might be yes. It is not a must, and things can be done in other ways, but I feel it is most intuitive to be able to do what intreehooks currently does within the built-in section. I would certainly be rolling my eyes to read that I need another backend to build my own backend if I were new to this topic.

Because you’re proposing import semantics that are not normal Python semantics - the closest that exists is the isolated mode option, which disables the normal behaviour of putting the script directory on sys.path (or the current directory for -c, -m, stdin execution, and the interactive REPL), but in isolated mode, nothing gets imported from the directory that gets omitted from sys.path.
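For anyone who wants to see the isolated mode behaviour for themselves, here is a small sketch that asks a child interpreter for its sys.path. The helper name `child_sys_path` is made up for this example; the `-I` flag is the standard isolated mode switch:

```python
import json
import subprocess
import sys


def child_sys_path(*flags):
    """Return sys.path as seen by a child interpreter started with flags."""
    code = "import json, sys; print(json.dumps(sys.path))"
    result = subprocess.run(
        [sys.executable, *flags, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


isolated = child_sys_path("-I")
# For -c execution the current directory is normally represented by an
# empty string at the front of sys.path; isolated mode leaves it out.
print("" in isolated)
```

Calling `child_sys_path()` without flags gives the non-isolated path for comparison, though that result also depends on environment settings such as PYTHONPATH.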

So @njs and I are both objecting to the “This one package can be imported from this directory, but nothing else can” idea from a “Let Python be Python” perspective, not anything to do with packaging specifically.

So I’m generally fine with @takluyver’s proof of concept (“Bootstrap backend from specified directory”, Pull Request #42 in pypa/pyproject-hooks) from an interface specification perspective (it’s very similar to the PEP517_SYS_PATH_0 workaround I added in pip Pull Request #6210, “Issue #6163: Temporary workaround for legacy setup.py files”), but the implicit sys.path.pop(0) is spectacularly weird from a backend execution environment perspective.

Instead, I’d prefer that frontends provide self-bootstrapping backends with normal Python import semantics (similar to the way PYTHONPATH already works), and let backends decide for themselves whether or not to do sys.path.pop(0) before running any project provided code. In the case of setuptools specifically, the way that would end up looking would presumably be:

  • setuptools.build_meta_legacy would make sure that the source directory was on sys.path, but otherwise not care
  • setuptools.build_meta would make sure that any source directory subpaths are not on sys.path, and direct folks to use their own sys.path manipulation instead

I have no idea why you think this is an option, but it’s not realistic. If the PEP 517 path semantics are broken by this change, we’ll simply have to live with it. Backends will have no simple way to tell whether an entry in sys.path was added there because the frontend thinks that it needs to be bootstrapped or for some other reason. The only reasonable place to enforce this is in the frontend.

There are lots of options proposed here that are well within “normal” path semantics (including the __build_backend__.py one) that don’t allow manipulation of the backend’s PYTHONPATH as part of the project configuration. You won’t be selling me on anything that allows a project to manipulate an arbitrary (non-bootstrapped) backend’s sys.path in the configuration file to solve what is essentially a non-problem anyway. pip can just make it so --no-binary :all: doesn’t work for build backends, or do some cycle detection and fall back to a wheel in the case where a project is trying to bootstrap itself - all without any changes to the PEP. If everyone’s so gung ho on making it easier to write backends that can build themselves, the easiest way to do that is to make it so that they can simply declare a dependency on their existing wheels - then they don’t need to maintain any sort of complicated bootstrapping operation in their source tree, they just have to ship a wheel once.

Note that since setuptools declares a build-time dependency on wheel, which itself is built by setuptools, I’m not even sure the in-tree bootstrapping will work without vendoring wheel. This problem is neatly sidestepped by either disabling --no-binary :all: while satisfying the build requirements or disabling it to break cycles. The only downside to the wheels approach is that it will be harder for some superstitious companies to build their entire tree from source, and frankly that’s not a use case we need to subsidize.
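As a rough illustration of what “cycle detection” could mean here, the sketch below does a depth-first search over a build-requirement graph and reports the first cycle it finds. The graph is a hypothetical, simplified picture of the setuptools/wheel situation, and `find_cycle` is not something pip provides:

```python
def find_cycle(build_requires, start):
    """Return a list of project names forming a build-dependency cycle
    reachable from start, or None if there is none.

    build_requires maps a project name to the names it needs at build time.
    """
    stack = []        # projects on the current search path, in order
    on_stack = set()  # same contents as stack, for fast membership tests
    done = set()      # projects already fully explored without a cycle

    def visit(name):
        if name in on_stack:
            # Found a back edge: slice out the loop and close it.
            return stack[stack.index(name):] + [name]
        if name in done:
            return None
        stack.append(name)
        on_stack.add(name)
        for dep in build_requires.get(name, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        stack.pop()
        on_stack.discard(name)
        done.add(name)
        return None

    return visit(start)


# Hypothetical, simplified graph based on the example in this post.
graph = {
    "myproject": ["setuptools", "wheel"],
    "setuptools": ["wheel"],
    "wheel": ["setuptools"],
}
print(find_cycle(graph, "myproject"))
# ['setuptools', 'wheel', 'setuptools']
```

A frontend that found such a cycle could then install one of the projects in it from a wheel instead of building it from source.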

Can I suggest that we put this discussion on hold for a couple of days and then return to it with cooler heads? The messages are flying thick and fast, and we’re getting “I can’t see why you’d think…” kind of posts - the atmosphere is on the way to getting confrontational, and we’re not going to decide anything this way.

I’d like to take the (rest of the) weekend to think through the options, read what other people have written, and see if any new ideas or compromises come to mind. I don’t think this is an urgent one to solve now, and pip/setuptools have more pressing problems to resolve (edit: I’m thinking of https://github.com/pypa/pip/issues/6163, for instance).

For the specific case of Red Hat, there’s already a multi-step bootstrap for new Python stacks in order to get initial bootstrapping binary RPMs built without docs (to break the circular dependency on sphinx et al), without tests (to break circular dependencies on pytest et al), and without ensurepip (to break CPython’s own circular dependency on setuptools and wheel). Only once that core library set has been constructed does it then go back and rebuild them all properly, using their limited-but-good-enough-to-self-host bootstrap variants to build the fully functional versions. (For the curious: python37.yaml on the python37 branch of hroncok/rpm-list-builder on GitHub.)

I’d expect any organisation with a “we build everything from source” policy that extends all the way down to core language ecosystem tooling to have a comparable technical capability, and if they adopt the policy without investing in the capability, I agree that’s their problem, not ours.

That means I’m a fan of the idea of making the enhancement here a way for projects to indicate that build requirements must be resolved from the wheel archive, and not built from source, even if a frontend has been asked to build everything from source.

That then makes the key remaining PEP 517 question on the bootstrapping topic the following: should we specify exact conditions under which a frontend MUST consider a build dependency unresolvable and fall back to installing from an available wheel archive?

In addition to promoting consistency across frontends, the benefit I see to our doing that is that projects like pip that need to modify the behaviour of options like --no-binary to still allow binary build dependencies will be able to point to the relevant section in PEP 517 as the rationale, rather than having to defend themselves to their users on a case-by-case basis.

The first clause I think we should add is something like:

Frontends MAY be implemented such that all declared build dependencies are installed from binary wheel archives and never implicitly built from source.

That lowers the barrier to building compliant frontends (those frontends just won’t be able to build some projects, for the same reason pip 18 couldn’t build them).

And then the second clause would be something like:

Frontends that allow for declared build dependencies to be implicitly built from source MUST fall back to instead installing from a binary wheel archive if the project’s own name is listed in build-system.requires (regardless of any other settings that have been given to the frontend).
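A sketch of how a frontend might check that second clause. The function names are hypothetical, and the requirement parsing is deliberately simplified (a real frontend would use a full PEP 508 parser rather than a regular expression):

```python
import re


def normalize(name):
    """Normalize a project name for comparison (PEP 503 style)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the project name from a simple requirement string.

    Simplified assumption: the name is the leading run of name characters.
    """
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return match.group(1)


def must_use_wheels(project_name, build_requires):
    """True if the project lists itself in build-system.requires."""
    own = normalize(project_name)
    return any(normalize(requirement_name(r)) == own for r in build_requires)


print(must_use_wheels("setuptools", ["setuptools>=40.8.0", "wheel"]))  # True
print(must_use_wheels("flit", ["flit_core >=2,<3"]))                   # False
```

Note that this only catches a project naming itself directly; an indirect loop through another project’s build requirements would not trigger it.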

Given such regression terminating behaviour in pip, both setuptools and wheel would terminate the current infinite regress, since they’re implicitly added to build-system.requires if there’s no build-backend set.


[Reordered this to put it after the wheel related discussion, as I now doubt we’re going to go in the direction of explicit self-bootstrapping, and instead rely on binary archives to terminate regressions (the same way RPM et al already do).]

I still don’t understand why you view this approach that way. The only code that runs in the backend process before the backend’s own code is frontend wrapper code, and even a modified PEP 517 would still tell frontends to keep in-tree paths out of sys.path unless bootstrap-backend-location had been set. Thus even without any help from the front end, backends would be able to infer the use of the setting (or a functional equivalent like the intreehooks meta-backend) from the presence of in-tree paths in sys.path.
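Something along these lines is what I have in mind for that inference. The function name is hypothetical, and it assumes the backend knows the source directory it has been asked to build:

```python
import os
import sys


def in_tree_entries(source_dir, path=None):
    """Return the sys.path entries that lie inside source_dir."""
    root = os.path.realpath(source_dir)
    entries = sys.path if path is None else path
    found = []
    for entry in entries:
        # An empty string on sys.path means the current directory.
        resolved = os.path.realpath(entry or os.getcwd())
        try:
            inside = os.path.commonpath([root, resolved]) == root
        except ValueError:
            # Raised on Windows when the paths are on different drives.
            inside = False
        if inside:
            found.append(entry)
    return found
```

A non-empty result would suggest that a bootstrap location (or an equivalent meta-backend) is in play, and the backend could then decide for itself whether to leave those entries alone or remove them.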

It would also be possible to make the PYTHON_BOOTSTRAP_BACKEND_LOCATION environment variable a defined part of the PEP rather than a pep517 library implementation detail.

That said, I think we can put aside that entire digression, as I agree with you that having the ability to force the use of wheels for build dependencies is a more promising direction to resolve bootstrapping issues.
