PEP 517 Backend bootstrapping

dstufft · February 17, 2019, 6:00am

Already have it. It’s adding a --only-binary setuptools on to the end. Making the options more confusing (no binary…except when we arbitrarily decided it was too hard) is not how you get a good user experience.

cjerdonek · February 17, 2019, 6:00am

This could be another use case to add to the list in @takluyver’s new “Pip options for controlling use of prebuilt packages” thread.

The simplest API might wind up being one that lets the user opt in or out individual packages or types of packages, in which case this would be a special case.

Also, another thing to consider is whether we could let pip re-use dependencies that were built from source in previous invocations. I’m not sure how easy or hard that would be to implement or if it would violate some other assumption.

bernatgabor · February 17, 2019, 8:10am

I would say a simple majority vote among us, or among members of the packaging authority; or shall we delegate the decision, or ask the council to pick one?

takluyver · February 17, 2019, 9:54am

I think it’s pretty much what I meant by 2: “build all packages to be installed from source but satisfy build dependencies normally” (where ‘normally’ means use a wheel if available, build from sdist if not).

I’m still wavering back and forth on this. On the one hand, I don’t recall any specific use case where someone needs to build everything from source and would actually be getting build dependencies automatically from PyPI. Downstream packagers for systems like conda and apt have to satisfy build dependencies through their own packaging system anyway. So maybe it’s fine for general tools to get build dependencies from wheels.

On the other hand, there’s an obvious intellectual appeal in being able to build everything ‘from source’ including self-hosting build backends, and it looks like we can enable that with a relatively small addition to the spec. Plus there are people strongly opposed to the “build dependencies must be wheels” options, so it might be easier to specify a self-hosting mechanism than to get consensus on not specifying it.

I think the argument that swings it for me is packages that want to customise their own build process, without providing a general-purpose packaging tool. We definitely don’t want every package doing that, but it shouldn’t be too much work if a package author needs to. And having to package and publish the customised build backend separately sounds like too much work.

pf_moore · February 17, 2019, 11:24am

If we cannot reach consensus, then I consider it my responsibility as BDFL-delegate to be the person to “pick one”. I’m trying hard at the moment to let the discussions run their course, because I don’t think picking an arbitrary answer is the best option, and I don’t think there’s a huge amount of urgency for this to be resolved. But if the consensus is “just someone make a decision” I’ll do so.

bernatgabor · February 17, 2019, 12:41pm

If that’s the case, I see a clear consensus for option 1. Sure, there are concerns that we might be overengineering this, or that it might open Pandora’s box, but generally the majority seems to be OK with it.

pf_moore · February 17, 2019, 1:12pm

Requiring Build Requirements to be available as Wheels

One of the big side-issues at the moment in this discussion is whether it’s OK to say that build requirements must be satisfied as wheels, bypassing all the self-hosting and recursion issues that we’re struggling with. In the interests of trying to get something concrete to work with, I went digging into the history of this on pip. Probably the best starting point is this comment, where I try to summarise the situation (there were a lot of very confused discussions going on at the time!)

For me, the two critical responses were this one from @njs and this one from @brettcannon, the two PEP 518 authors, clearly stating that when they wrote the PEP, they expected build requirements to be installed from source if they were not available as wheels - and while that opened up the question of recursive builds, that’s an implementation detail, and not something the PEP “hadn’t thought of”.

In fact, re-reading the whole of that thread is quite instructive - 18 months ago we pretty much hashed through everything that’s being said at the moment.

It’s also worth noting that there were a number of issues after pip 10 was released (when we supported build requirements as wheels only) along the lines of this, which were essentially failures where build requirements were only available as source, and pip didn’t at that stage support that. So there was strong practical evidence at the time that “build requirements must be wheels” was not sufficient for our users’ requirements. Note that these issues were not particularly related to --no-binary.

If we were to change the PEP so that it required build requirements to be available as wheels, then either those issues would reappear as regressions, or pip would end up choosing to keep its current behaviour as effectively a frontend-specific extension to the PEP (and we’d be back to the whole problem of pip’s behaviour acting as a de facto standard).

So can I at this point make it very clear:

As BDFL-delegate, I do not consider proposals to insist that all build requirements must be available as wheels to be viable. We have ample evidence from the pip 10 -> pip 18 timeframe that this restriction caused genuine, practical problems for our users, and we have comments from the PEP 518 authors that such a restriction is not in the spirit of PEP 518.

Debating whether --no-binary :all: is a good way of “building from source only”, or whether universal wheels are “as good as source”, while possibly interesting, will not affect that decision.

Sorry to be blunt here, but I think there’s a lot of mental energy being wasted at the moment with people trying to make and rebut arguments that are mostly theoretical, while ignoring the practical experience from pip’s partial implementation of PEP 518 in pip 10. I’d much rather we accepted that we need self-bootstrapping, and move on to focusing on trying to design a solution that is acceptable to the people who need it.

bernatgabor · February 17, 2019, 1:24pm

Agreed, and I would propose that someone create wording for option 1 (yeah, that’s not my proposal, but it seems to have gathered a majority) and let’s move on to other issues. I personally would make it a new PEP, but I’m fine if we extend 517 (extending has the drawback that tools that currently comply with PEP 517 will suddenly become non-compliant).

pf_moore · February 17, 2019, 1:33pm

Self-hosting vs in-tree backends

The original problem that this thread was created to discuss was self-hosting of backends. In particular, setuptools uses itself as its build system, and there is no easy way to express that in PEP 517. Flit currently doesn’t self-host, but would consider doing so if PEP 517 allowed it.

There’s a further complication involved for setuptools, as it also relies on wheel, which in turn uses setuptools as its build system. However, I think there’s general consensus that loops in the build requirements are something that we’re happy to prohibit, and so we don’t need to worry about this any further (it does mean that setuptools may have to vendor wheel, and more generally self-hosting backends may need to be very careful about whether they depend on other packages or vendor them, but that’s something I think we’re all happy to treat as an issue for backend developers, not something the PEP has to solve).

So that leaves us with a basic problem, which is how to allow a self-hosting backend to say that the backend code it needs to build itself is available in the source tree that is being built.

In addition to the above problem, there’s also a potential use case for projects that don’t want to use one of the standard backends directly, but want to customise it. Such customisations could be achieved via a plugin mechanism offered by the backend, or by the project using a “wrapper” backend that modifies the behaviour of the underlying backend. But not all backends will provide plugin mechanisms, and often a “wrapper” backend is too project-specific to be something we’d want to encourage publishing on PyPI. So there’s a case here for again letting the project say “I have my own backend, it’s available within the project source tree”.

Obviously, writing your own backend (even a wrapper) is a non-trivial undertaking, and is not something we’d expect projects to do in general (and for setuptools-based projects in particular, with the full power of Python available in setup.py, it’s hard to imagine a case where you would need to write your own backend). But having the ability to do so would allow projects to provide a solution in those rare cases where it is needed.

Both of the above problems (self hosting of backends and in-tree backends) are actually variations of the same thing - saying to the build frontend “I have my backend code right here, you don’t need to do anything special to make it available”. As a result, I think that any solution we propose should be expressed in those terms, rather than in terms of a single expected use case. By providing a general mechanism, which is agnostic about what it’s used for, we avoid needing to struggle with the possibility of people “abusing” the mechanism and we can just worry about it being safe and robust.

To that end, I see a number of key ways of expressing the intention “my backend is available in the source tree at such-and-such location”:

Just add the project root to sys.path when locating the backend. Note that I’m explicitly only suggesting this for the step of locating the backend. I’d still expect the backend hooks to run with the project root not on sys.path. I do not think this is a good option (explicit is better than implicit) but I wanted to mention it for completeness.
Add a key that lets the project say “add this directory to sys.path while importing the backend”. This is essentially option 1 from @takluyver’s post (although as noted they are all functionally much the same). Although I’m not entirely clear whether the existing proposals limit the sys.path modification to when the backend is looked up, or whether they propose leaving the directory on sys.path when the backend hooks are called.
Have a further form for build-backend that includes a source-relative directory, something like build-backend = "src/dir/module_path:object_path". This would in effect be the same as adding src/dir to sys.path while searching for module_path:object_path, but would be very explicit that the directory is only used to locate the backend code.

I’m inclined to think that the third of these is the best approach, with the second being OK, as long as we’re explicit in the description of the option that the extra sys.path entry is only available when looking up the backend.
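As a rough illustration of the third option, the sketch below splits a string of the proposed form into a directory, a module path and an object path, and keeps the directory on sys.path only while the backend module is being imported. The names split_backend_spec, directory_on_sys_path and load_backend are hypothetical, and the syntax is only the one floated in this post; this is not how any existing frontend is known to behave.

```python
import contextlib
import importlib
import os
import sys
from typing import Iterator, Optional, Tuple


def split_backend_spec(spec: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split 'src/dir/module_path:object_path' into its three parts.

    Hypothetical syntax: the directory and the object part are both optional.
    """
    location, _, object_path = spec.partition(":")
    directory, _, module_path = location.rpartition("/")
    return (directory or None, module_path, object_path or None)


@contextlib.contextmanager
def directory_on_sys_path(directory: str) -> Iterator[None]:
    """Put `directory` first on sys.path, then take it off again afterwards."""
    sys.path.insert(0, directory)
    try:
        yield
    finally:
        sys.path.remove(directory)


def load_backend(source_tree: str, spec: str):
    """Import the backend named by `spec`; the tree is searched only during import."""
    directory, module_path, object_path = split_backend_spec(spec)
    if directory is None:
        backend = importlib.import_module(module_path)
    else:
        with directory_on_sys_path(os.path.join(source_tree, directory)):
            importlib.invalidate_caches()
            backend = importlib.import_module(module_path)
    # Walk dotted attributes for the optional ':object_path' part.
    for attribute in (object_path.split(".") if object_path else []):
        backend = getattr(backend, attribute)
    return backend
```

With this sketch, split_backend_spec("src/dir/module_path:object_path") gives ("src/dir", "module_path", "object_path"), while a plain "setuptools.build_meta" has no directory part and is imported from the normal search path.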

Does anyone have a particular use case for the in-source directory being available on sys.path when hooks are being called? It seems to me that this is the most controversial aspect of the discussion (and in particular the aspect that @pganssle strongly objects to) so if we can agree that’s not needed, that may help us reach at least some level of agreement.

pf_moore · February 17, 2019, 1:38pm

I’m happy to write up the final proposal. I’m OK with it being a new PEP, if people prefer, although I think it’s fine to have it as a revision of PEP 517: that PEP is marked as “Provisional”, and I actually don’t think that anything we’re proposing makes existing tools non-compliant (or at worst it means that they haven’t yet implemented the new bits, and I’m confident we can word things to avoid that being a disaster).

dstufft · February 17, 2019, 2:02pm

I think that adding and then removing a directory from sys.path like that is going to cause more problems than it’s worth. It’s unusual in Python to remove directories from sys.path, so I think that it will be entirely unexpected for people, and I think we’re unlikely to figure out the subtle breakages that are going to occur from it.

For instance, using setuptools as an example: if it’s vendoring wheel only for builds, then I think that, since it uses its plugin system to deal with the wheel dependency, it won’t even try to import wheel itself until “runtime”, which would then fail because we’ve removed the vendored directory from sys.path before executing the build backend.

Now of course, setuptools (and any custom build backend) could work around this by modifying sys.path in their in tree build backend… but it seems very silly to me to expect use cases where a dependency might not be imported until runtime to have to specify additions to sys.path twice, once in pyproject.toml and once in their own code.

I honestly don’t understand @pganssle’s concerns with the ability to add to the sys.path in pyproject.toml. I understand that the goal is to have users opt in to the new setuptools backend which doesn’t default to having . on sys.path, but he’s stated before (I think) that he’s perfectly fine with users adding sys.path.insert(0, ".") to their setup.py to get it back. I believe his statements to that effect were that he’s fine with users doing it, as long as it was explicit what they were doing. I don’t understand why manually munging sys.path inside of a setup.py is considered fine and dandy, but declaratively extending sys.path in pyproject.toml is going to cause a bad outcome.

Overall, I think that removing the items from sys.path will just be another place where Python packaging is weird and confusing while also not actually doing anything useful for the ecosystem. It’s purely additional pain being added (for some subset of users) for no payoff that I can tell.

takluyver · February 17, 2019, 2:29pm

I agree with @dstufft that it would be weirdly surprising to remove the directory from sys.path again before calling the hooks. I see two ways it could go wrong:

While importing things at the module level is the norm, it’s not unusual to put imports inside a function to defer them until that function is called. When we do this, we expect the import to resolve to the same file as if it had been at the module level. This breaks if sys.path is changed from import time to call time.
If the backend module itself modifies sys.path on import (as various people suggested it should if the pyproject.toml file doesn’t give enough flexibility), then undoing the modification you did before that becomes a much more complex proposition.
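The first failure mode can be reproduced with a small self-contained sketch. It writes a throwaway backend into a temporary directory, imports it with that directory on sys.path, removes the directory again, and then calls a hook that defers one of its imports. The module names demo_intree_backend and demo_backend_helper are made up for the demonstration; the hook only borrows the build_wheel name and signature from PEP 517 and does not build anything.

```python
import importlib
import os
import sys
import tempfile
import textwrap

BACKEND_SOURCE = textwrap.dedent(
    '''
    def build_wheel(wheel_directory, config_settings=None, metadata_directory=None):
        # Deferred import: only resolved when the hook is called.
        import demo_backend_helper
        return demo_backend_helper.VALUE
    '''
)


def demo_deferred_import_failure() -> str:
    """Import a backend, drop its directory from sys.path, then call a hook."""
    with tempfile.TemporaryDirectory() as tree:
        with open(os.path.join(tree, "demo_backend_helper.py"), "w") as handle:
            handle.write("VALUE = 42\n")
        with open(os.path.join(tree, "demo_intree_backend.py"), "w") as handle:
            handle.write(BACKEND_SOURCE)

        sys.path.insert(0, tree)
        importlib.invalidate_caches()
        try:
            backend = importlib.import_module("demo_intree_backend")
        finally:
            # The behaviour under discussion: remove the entry before hooks run.
            sys.path.remove(tree)

        try:
            backend.build_wheel("dist")
        except ImportError as exc:
            return f"hook failed: {exc.name}"
        finally:
            sys.modules.pop("demo_intree_backend", None)
        return "hook succeeded"


print(demo_deferred_import_failure())
```

The backend module itself imports without trouble; it is only the helper, looked up at call time, that can no longer be found.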

If we want to ensure that people don’t use this key just to modify sys.path for their setup.py (@pganssle’s concern), then I think a better option is to specify that when python-path = ['.'] is used, frontends MAY verify that the backend is indeed imported from the first location given, and refuse to build if that’s not the case.

(I think I’m going back on an earlier argument I made that this would be too hard. We should think about how the check might go wrong, but I now think every other option to avoid this is harder.)

I’m also happy if we just decide that this isn’t a concern. If you want the CWD on sys.path, there are ways to do that, and this is one of them.

takluyver · February 17, 2019, 2:33pm

Thank you! I agree that we were spending a lot of mental energy on that question, and I think definitively ruling it out based on real experience is an important step forwards.

pf_moore · February 17, 2019, 5:05pm

I’m (relatively) happy to accept that. I thought I remembered some sort of precedent for what would be in essence “import from a specific filesystem location”, and I’m happy to do the research to see if I can find that precedent, but if people think it’s too complex, I’m happy enough to drop it.

As @pganssle has effectively absented himself from this discussion, and no-one else has any strong views on the need to avoid adding directories to sys.path, I don’t think there’s much further we can go in that direction. I’m happy to hear arguments, but I think we’ve exhausted the options for solutions which provide this without costs (in terms of complexity and fragility) and it’s now hard to justify looking further on the basis of a single person’s concerns. So I’m inclined now towards just saying “add a single directory to sys.path prior to loading the backend” and be done with it.

I’m +1 on some sanity checks, specifically the added directory must be somewhere inside the source tree, and the backend must be loaded from it (which should be easy enough to do by checking backend.__file__) but I’d prefer to make them things that frontends SHOULD do (in-tree backends MUST follow these rules, and frontends SHOULD check and refuse to continue if they haven’t)…
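A minimal sketch of what those two checks might look like in a frontend, assuming the frontend knows the source tree root, the declared directory, and the imported backend’s __file__. The function names are hypothetical and the error wording is illustrative only; nothing here is taken from an existing frontend.

```python
import os
from typing import List, Optional


def is_inside(path: str, parent: str) -> bool:
    """Return True if `path` is `parent` or lives somewhere below it."""
    path = os.path.realpath(path)
    parent = os.path.realpath(parent)
    try:
        return os.path.commonpath([path, parent]) == parent
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


def in_tree_backend_problems(
    source_tree: str, backend_dir: str, backend_file: Optional[str]
) -> List[str]:
    """Collect reasons why an in-tree backend declaration looks wrong."""
    problems = []
    full_dir = os.path.join(source_tree, backend_dir)
    if not is_inside(full_dir, source_tree):
        problems.append("backend directory is outside the source tree")
    if backend_file is None or not is_inside(backend_file, full_dir):
        problems.append("backend was not loaded from the declared directory")
    return problems
```

A frontend following the SHOULD wording would call in_tree_backend_problems(tree, declared_dir, backend.__file__) after importing the backend and refuse to continue if the returned list is not empty.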

What do people think about the option of encoding the directory to be added into the backend name? Even if it’s semantically the same as an “add this path” option, it logically ties the path to the idea that it’s “where the backend lives” rather than a more general “path that stuff can go”.

Also, as a result of thinking of the added path as “where the in-tree backend is located”, I’m inclined to stick to adding a single directory. Code can explicitly add more if it wants to, but it’s not what I see as the norm.

njs · February 17, 2019, 9:07pm

I think our primary goal should be to make the semantics as transparent and un-surprising as possible. That’s how we avoid confused users, slightly incompatible frontends, etc. People don’t always read specs carefully (or at all!).

So on those grounds, I think I prefer an explicit python-path kind of argument to anything involving a custom DSL we invented. Our target audience is very familiar with manipulating PYTHONPATH and sys.path. But if I saw backend = "src/whatever:blah", then I wouldn’t have any idea what that did except by going to read the spec. (Does it involve exec somehow, maybe? Some kind of magic involving import hooks?) It would also mean we need to spec out a grammar.

I’ve made the case for multiple directories several times – I think the most recent was here: PEP 517 Backend bootstrapping - #147 by njs

Not sure if you’re saying you’ve considered those arguments and find them unconvincing, or just missed them in the long discussion.

bernatgabor · February 17, 2019, 9:25pm

If we allow it (and it seems we are very keen to do so), then yes, let’s make it an explicit additional configuration argument.

takluyver · February 17, 2019, 9:27pm

Nathaniel already nicely made the case I would have made against this, so I’ll just note I precisely agree with his points.

I can see the arguments both ways on this question, and I don’t have a preference at the moment. I’d be happy with either solution.

dstufft · February 17, 2019, 10:22pm

I don’t have a strong preference, but I feel like making it a list of paths (or able to take a list if we want) makes sense. In either case a backend is going to be capable of adding an arbitrary number of paths to sys.path, so it seems somewhat arbitrary to me to say that only one may be specified in this place.

One example of when you might want two, is if setuptools plans on vendoring wheel (either as a .whl or unpacked into a vendored/ directory) it is probably going to be useful for them to be able to add that as a second path while also adding . as a top level path.
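To show how little changes between the two shapes, here is a sketch that accepts either a single string or a list for the proposed key and resolves each entry against the source tree. The python-path key name is the one used earlier in the thread; the helper names and the vendored/ directory are assumptions for illustration.

```python
import os
from typing import List, Sequence, Union


def backend_path_entries(
    source_tree: str, value: Union[str, Sequence[str]]
) -> List[str]:
    """Resolve a python-path value (one string or a list) against the tree."""
    entries = [value] if isinstance(value, str) else list(value)
    return [os.path.normpath(os.path.join(source_tree, entry)) for entry in entries]


def prepend_backend_paths(
    search_path: Sequence[str], source_tree: str, value: Union[str, Sequence[str]]
) -> List[str]:
    """Return a new search path with the backend entries first, in order."""
    return backend_path_entries(source_tree, value) + list(search_path)
```

Under these assumptions, the setuptools case described above would be spelled as a two-element list such as [".", "vendored"], and a project with a single in-tree backend directory would pass a one-element list or a bare string.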

I think we’re getting into edge cases of edge cases at this point, so I wouldn’t argue too much about it, but I do think the mechanism is basically just as easy to explain as a list, makes some edge cases easier, and specifying it as a single string doesn’t really buy us any additional wins. But if other folks prefer a single directory, that’s fine with me.

pf_moore · February 17, 2019, 11:11pm

I take your point (and I’m not that happy with a custom DSL) but I think there’s also an important distinction I’m trying to draw here. The directory we’re specifying, in my view is the location of the in-tree backend - it’s explicitly not a generic “directory to be added to sys.path”. By making the location part of the backend spec, I was trying to make that clearer. OK, so that’s a fine distinction, but as you yourself say, we want to make the semantics as transparent as possible, and the semantics I want are “here is where the backend code is stored”, and not a generic path-manipulation facility.

Let’s just drop the DSL idea and stick with a new key. What we call that key remains up for bikeshedding at this point (but I don’t think it’s bikeshedding in this case - I think it’s rather crucial).

Thanks for the reminder, I had seen that (and don’t find it convincing) but it was worth me reviewing it.

You’re taking the position that this is a facility for adding a location to sys.path - and I agree that with that framing, making it a list seems relatively harmless, and has some minor benefits. But that’s not how I see this feature at all. Certainly, in terms of its implementation we’ll be adding a location to sys.path, but I want to avoid people thinking of it that way (this is I think the point @pganssle was trying to get across) - if all you want to do is add entries to sys.path you have plenty of options already (in a setuptools-based project, you just manipulate sys.path in your setup.py for example).

What this feature is about is specifying where a very specific piece of code (an in-tree backend) is to be found. The pyproject.toml format is declarative, and we should be looking at this addition as a declarative statement “here’s where the in-tree backend is”, not as a procedural one “add this location to sys.path and then when you try to import the backend, you’ll find it”.

Following on from that, we’ve already said that we would want to insist that if a project specifies an additional sys.path entry, the build backend must come from that path entry. That requirement falls very naturally out of the “this is where the in-tree backend is located” view of the configuration item, whereas it is a fairly arbitrary technical constraint when considered from the “sys.path manipulation” view, and it’ll likely be a lot harder to explain why that constraint is needed under that view.

OK, so the above is my opinion. But while I’m pretty sure it’s the right one, I’m not yet ready to try to mandate it in the face of everyone else preferring the “sys.path manipulation” view. So I think we have a few questions to consider:

Do people understand the distinction I’m trying to make here? It’s pretty subtle, and it’s only just become clear to me, so if not please say so and I’ll try to explain better.
Is anyone persuaded by the above arguments that the “in tree backend location” framing is better? Or do people still prefer the simpler-to-explain “sys.path manipulation” explanation?
Have I missed any key features of the “sys.path manipulation” model that make it easier to explain, more declarative, less error prone, or easier to use in practice, than I’m making it out to be above?

dstufft · February 17, 2019, 11:23pm

I do understand the distinction, I guess I just don’t care a whole lot about the distinction. Like it doesn’t really matter to me if people use it to add . to sys.path instead of sys.path.insert(0, ".") in a setup.py. It doesn’t buy you any additional power except a slightly different way to spell accomplishing the same thing. So I don’t personally find the argument that people might use it for generically adding stuff to sys.path during the build particularly worrisome or compelling.

I think the simpler to explain and more generic function is better. Not better enough I’d lose sleep over it, but still better.

I don’t think so. I think the difference between 1 entry and a list of entries is really small so it doesn’t really matter much in the long run.