PEP 517 Backend bootstrapping

Ah, sorry, this is another one of those “I don’t know what everyone has taken to heart” things. I think you’ve been excellent about keeping my concerns in mind in this thread and I think a solution where you can manipulate sys.path but the build backend must be in the added PATH is fine if the front-end people are fine with it.

When this was first brought up (might have even been me who suggested it), it was originally shot down as too complicated to implement. I agree that it’s probably too complicated, which is why I wanted to give one last push for “known backend location” as the solution that solves both the question of simple implementation and the separation between the imperative and declarative portions of the spec. I still think the spec should be as simple as possible for everyone to comply with and still get everyone what they want, and we can build layers of abstraction on top of it, but the extra work in this case is mostly on the front-ends anyway.

I mainly stepped in because in several places you said that you were only serving as a limited proxy for my views, based on what I’ve already expressed.

I think it would be very weird and unusual for someone to write a custom backend just to mess with the path semantics of setuptools. There’s no point in creating a __build_backend__.py when you can just add sys.path.insert(0, ...) at the top of your setup.py. This is why I don’t care that you can achieve the “same thing” with the “known location” method, because I’m mainly concerned with preserving the distinction between the imperative and declarative portions of the code and between the backend and the frontend. I think the path semantics of setuptools or any other backend that wants to run user-supplied code as part of the backend should be done as part of the backend’s configuration.

I’m fairly certain that if we do not restrict this key to situations where the backend is found on the manipulated path, it will just become a configuration option for setuptools, because the people who want to mess around with setuptools’ PYTHONPATH will vastly outnumber the people who want to write custom build backends. Maybe setuptools or other imperative backends should or would want to provide such an option, but I think it should be the backend’s choice, not a side-effect of how we solve a different problem.

1 Like

Sorry, I think that might have been me shooting it down. While I still think there’s a possibility of some weird edge cases when symlinks are involved, I’ve since come round to the idea:

  1. The use cases of having hooks in-tree shouldn’t need weird symlink tricks anyway, so it shouldn’t be a big practical problem if the implementation isn’t 100% bulletproof.
  2. I’m optimistic that the way Python calculates __file__ will make things work as we expect, or pretty close to that. I’ll try to test this shortly.
  3. I like all the other proposals to solve the same problem less. :wink:

Yes, this is precisely what I was trying to get at. And this might be the most clearly the concern has been stated yet.

For me, stating in the PEP that the path is meant for locating the backend is enough. Otherwise people will try and infer the intent from what is possible (and nobody is coming back to this thread in the future to find out what we were thinking :wink: )

It sounds like people are agreeing, but I can’t tell if what people think we’re agreeing on is:

  • we should note in the spec that the intention of python-path (or backend-bootstrap-python-path or … whatever we call it) is to support in-tree backends
  • that + we use some technical means to enforce this, like checking __file__ or something

If it’s the latter, then I want to repeat my comment from above:

For example, suppose that someone does manipulate sys.path before importing setuptools. What are we worried will happen?

2 Likes

IMO, we’re agreeing that the PEP says that if python-path is used, the backend MUST be loaded from one of the locations specified, and frontends SHOULD check that this is the case (by checking __file__). At the moment, the question as to whether __file__ works in edge cases is still to be tested (@takluyver said he planned on doing so, and I do too). But honestly, I’m 100% fine with the check not working in the odd corner case - it’s not exactly the end of the world if some project can’t ship an in-tree backend in a zip file that it adds directly to sys.path, after all…

@pganssle has explained the concern - without the check, there’s a risk that projects will use the new setting as a way of configuring existing backends (setuptools in particular), and he views that sort of configuration as being something the backend should have control of (which is a reasonable position to take). I view the cost of a check (particularly if it’s optional for frontends to enforce) as minimal, and worth including if it helps us achieve consensus.

At a minimum, I’d want the PEP to allow frontends to include the check, and I’d expect pip to add that check.

Question: is this discussion related at all to this issue (“PEP 517 isolation breaks test module with relative out-of-source dependency”)?

The reason I ask is that the issue involves specifying dependencies on other directories, so I’m wondering if there is any overlap with the issue we’re discussing.

I don’t think so, because that’s about building stuff that depends on data outside of the source tree. That’s not something we are considering here (and as a general principle, I don’t think it’s something PEP 517 should address at all).

I’ve just tested with a few edge cases I could think of:

  • The directory added to sys.path is a symlink
  • A module within that is a symlink
  • A module within a package is a symlink
  • A package containing a module to load is a symlink
  • The module is in a zip file

I’ve endeavoured to test each with both absolute and relative symlinks, and on Python 3.7 and 2.7 (I was doing this by hand, so it’s possible I missed some combination, but I think I got them all).

In all cases, the __file__ attribute of the loaded module started with the relative path I added to sys.path, which would make it easy to implement the check we described.

These tests were all conducted on a case-sensitive filesystem. But if case differences can crop up, that should be easy to resolve with e.g. os.path.normcase().

1 Like

Yes, I understand. I’m literally just asking you or someone to take the next step and explain why this would be a problem. I’m sure it’s very obvious to people named Paul, and so it feels like I’m asking whether water is wet or something, but I think it’s reasonable to try to write down the rationale for technical proposals even when it’s obvious.

If someone were to be Very Naughty and used this to add a directory to sys.path that was not where the top module containing the backend was loaded from, then what is some bad thing that could happen?

1 Like

OK. I’ll do my best, but to be clear - I think this is a minor point. To me, the key point is “this is a small cost to pay to get setuptools on board with this change and achieve consensus rather than having to make a decision to override the concerns of one of the participants in this discussion”. I don’t think it’s a major technical point in the proposal, I think it’s mostly a political one.

So, to answer your question, why would it be a problem for setuptools (or any backend that runs user code) to have to deal with the possibility that users can configure the sys.path that is active when the backend runs?

To give a specific example, if a project wants to use a custom csv module, without the check they can write python-path = "custom" and dump a modified csv.py file in their custom directory. And then, if setuptools uses the stdlib csv module internally, setuptools is broken for this project, and it’s not at all obvious where the fault lies.

Conversely, with the check, the project must also have written their own backend, so it’s clearly on the project to resolve the issue (even if they chose to have their backend depend on setuptools).

Or, with the check, if the project wants to use setuptools and have a mmodified csv.py, they have to do sys.path.insert(0, "custom") in their setup.py and the change is isolated to setup.py and can’t break setuptools.

(Note - that example is entirely off the top of my head, and untested. Treat it for what it is, an example of why this is a concern).

I’ll try to make sure the PEP update includes some clarification of this - although I don’t want the resulting text to become top-heavy with edge case examples.

1 Like

Summary time!

I think we’re converging on prepending a list of relative paths to sys.path, and checking that the backend is loaded from (one of/the first of?) those prepended paths. The purpose of this check is to make it harder for packages to accidentally break stuff in ways that lead people to report bugs to packaging tools, in particular setuptools. Technically, the __file__ attribute seems to be a practical way to implement such a check.

@njs and @steve.dower have both, if I’m reading their posts correctly, indicated that they’d be happier if the spec didn’t refer to any such check. This solution is, I presume, not acceptable to @pganssle.

Nathaniel & Steve, do you want to press that point further, or are you happy to live with a check you consider unnecessary? I understand the concern about adding complexity, especially in a specification which potentially requires extra complexity in many tools. But as far as I’m aware, all the use cases we have discussed involve loading the backend from a specified path, so the check shouldn’t be a problem. I’m also not unhappy to be limiting the scope of the new feature a bit.

I have a question. Are we sure that there isn’t a way to let the front-end prepend additional paths, but still make it so that backends have enough information to reliably detect issues like the above and place the blame more squarely? This seems like it should be a straightforward technical question, and my instinct tells me there should be a more elegant way (maybe even more comprehensive). Or is there a technical reason it’s not possible?

For example, could setuptools at least have enough information to emit a warning along the lines, You are on your own in this scenario...? Or be given enough information to know if the provided paths may cause a conflict? I don’t recall this specific technical question being discussed before, but if it has, my apologies and we can move on.

The backend sees the same pyproject.toml file; there’s nothing to stop it reading that, figuring out that additional paths were specified, and issuing a warning - or even refusing to build entirely. I can’t say this is elegant, though. And if the extra paths cause the backend to fail before it can do this check, then it doesn’t help.

Thinking about it, as a backend maintainer, I’d also rather people weren’t easily able to specify Flit as a backend plus some extra paths which may shadow arbitrary modules Flit loads.

1 Like

I don’t think there’s any danger that bad things will happen if people are allowed to set up sys.path to shadow stdlib modules. This is how every python program has worked since the beginning of time – users get to set their $PYTHONPATH however they want before invoking it. It’s also how setup.py has worked for more than a decade… when was the last time setuptools got a bug report where someone created a csv.py file next to their setup.py, and then blamed setuptools for failing to work? And did the setuptools devs really struggle to decide whether this was their fault for trying to use the stdlib, versus the user’s fault for breaking the stdlib?

And on top of that, the build-system.python-path version of the problem is much harder to trip over than the normal one, because to cause actual harm it would require users to first manually opt-in, and then somehow ship an sdist before noticing that they can… no longer build sdists.

And if anyone actually wants to shadow modules on purpose, then checking __file__ will barely slow them down. They can still trivially use an “in tree backend” that sets up sys.path and then imports flit or setuptools or whatever.

I still don’t know whether this is actually @pganssle’s concern though. Maybe he’s worried about some other scenario none of us have thought of?

One possible way out: normally our specs don’t say anything about what kinds of checks pip is required to do. Maybe the spec should just say that the python-path key is to be used for paths that are needed for loading the backend itself, and then we let the pip devs decide what pip will do?

That is typically something that the user does themselves, though. This mechanism would let a package author add paths for users building their packages.

There are plenty of ways that it could work for one person but not another: they could be getting different versions of the backend (or of some dependency), or it could import different things depending on the platform or the Python version. The module author could also build the sdist without going through the PEP 517 interface at all (e.g. by running setup.py sdist).

Still, I agree that it doesn’t sound that likely. But given that it’s been raised as a concern and we have a seemingly practical option to rule it out, is it a problem to have the check?

I guess, but agreeing that something should be a de-facto standard in pip and then not mentioning it in the actual specification seems odd. If pip is going to do that, I’d at least specify it as a MAY or a SHOULD.

I’d still argue this won’t be very likely to affect build backends. Does a package maintainer file a bug to setuptools claiming it fails to install their package because it contains a csv.py? I see that a lot in user groups, but issue trackers, no so much.

Standard lib shadowing is a very established problem (of Python itself, but that’s another topic), and most people have learned to avoid it by the time they want to deploy packages.

I agree it won’t be very likely, but even so, it’s a definite possibility. E.g. Flit is a backend that doesn’t offer any hooks designed to run code from the project being built. It’s not hard to imagine that someone might decide to hack in a hook by hijacking some module that flit imports. I’d rather they had to explicitly include an in-tree backend and override Flit’s behaviour that way.

Likewise it’s not that hard to imagine that someone adds the project root to sys.path for setup.py and has some utility files lying around there, which shadow something that a future release of setuptools uses.

Problems with installation definitely do get reported to all kinds of places - especially if the installation works with setuptools version X and fails with version X+1. That’s not necessarily setuptools’ fault, but the obvious conclusion is that a change in setuptools ‘caused’ the problem. And even if it doesn’t generate formal bug reports, backend maintainers will see people blaming their tools in other fora, which is still wearing.

On reflection, I’d rather have the check than not, but I don’t feel strongly about it. But so far as I know, @pganssle is strongly in favour of the check. So, to be blunt: does anyone feel strongly enough that the check is a problem to make a concerted argument against it? If you do, please go ahead and lay out your argument. But if you can live with it, can I suggest that we move on?

I feel like we’re stuck in a loop of people arguing “meh, that’s not really necessary”, but without enough conviction to persuade anyone who’s leaning in favour of the check. The check is consistent with the purpose of what we’re trying to specify (allowing build backends defined in-tree), and I don’t recall anyone bringing up any specific problems that it would cause.

1 Like

I agree. We’re really not going to get much further waiting for everyone to agree. So what I propose is that I write up a PR for PEP 517. That will provide a specific set of words that will hopefully be acceptable to everyone for the bulk of the change. I’ll try to do that tomorrow.

For the matter of the check that the backend is loaded from one of the supplied paths, I’ll propose some wording and we can refine it to reflect the consensus. I think we have to include some words on the subject of the check - after all, the issue was a topic of some debate here, so we owe it to the readers of the PEP to document the conclusion of that debate. Leaving it unspecified isn’t an option, IMO.

Regarding the wording of the text about the check, this is the chain of logic I followed:

  1. Does anyone (in particular @njs and/or @steve.dower) feel that we should not explicitly state that the intent of the new functionality is to specify locations from which a backend should be loaded? I get the impression that everyone’s at a minimum resigned to limiting the intent that much.
  2. Assuming everyone is OK with (1), does anyone want to argue that we explicitly support use of the new feature to add paths without loading the backend from (one of) those paths?1 If no-one is willing to argue for that, then I see nothing wrong with the PEP saying “if python-paths is specified, the backend MUST be loaded from one of the specified paths”. That’s basically just standards-speak for the statement of intent described in (1)
  3. Given that we have that statement, does anyone want to argue that frontends must not check that the backend is loaded from one of the added paths? That seems a bit of a strong statement to me, given that we’ve agreed on (2). If not, then it seems to me that we’re accepting that it’s OK for the PEP to say that frontends MAY do so.
  4. Personally, I’d like to strengthen that statement to SHOULD, but I’m OK with using MAY if that breaks the deadlock.

Sorry - I know the above seems like a set of leading questions, but they are genuinely the steps I went through in my head when trying to pin down what I thought @njs was arguing for. So for me, knowing precisely where (if at all) our logic diverges would help a lot.

I propose to treat the nature of the check (look at __file__) as nothing more than a suggested implementation option for frontends who choose to add a check - simply to avoid people inventing their own solutions and potentially introducing confusing inconsistencies. In standards terminology, it would be non-normative.

As a frontend developer, I will say that I’d prefer to add the check - not so much because I think it’ll catch any major issues, but simply to establish boundaries around when projects are doing stuff that’ll trigger the type of bug reports where it’s hard to work out whether it’s the project, the frontend or the backend that’s at fault. In my experience, that’s the biggest pain point here - directing bug reports at the right people - and the first place reports go to is the frontend, so anything that makes those boundaries clearer is good from my POV.

1 Note that the same argument about saying nothing applies as I mentioned above - there’s been a debate over whether we should support using this as a general “add paths” feature, so we need to document the conclusion of that debate in the PEP.

OK, I have created an initial PR for PEP 517, available at https://github.com/python/peps/pull/901

I’d appreciate it if anyone involved in this discussion would take a look and add their review comments (if you’re fine with the change, it would be better if you add an approval review rather than simply staying silent - I don’t want to miss anyone’s input).

I’m not trying to rush anyone into agreeing to this - if anyone feels that they have more to say than will fit in a review comment, please feel free to add a review saying “see my comments on Discourse” and continue over here.

1 Like

I commented in line, but FWIW I think python-path is better than python-paths for symmetry with PYTHONPATH and sys.path, both of which do not use a plural form.

As to checking if the backend is loaded or not, personally I’m slightly in favor of not doing the check because I think that it doesn’t really buy us much. Anything someone could do without the check they can do with the check and a one line file, so it’s not really much of a barrier at all. I also think it presents some slight amount of risk since I’m not sure we can exhaustively test how __file__ occurs in all situations so there’s some amount of risk whatever we code up here won’t behave correctly in the “real world”.

That being said, I’m only slightly in favor of not doing the check because I don’t think the use cases enabled by not having the check are things we generally want to support and I don’t feel bad about making them use a single line file to get them. I also think the risk is very small, so I don’t think there is enough of a risk to fight one way or another about it.

That’s a lot of words to basically say I don’t care that much about it, if it makes people happier to have it in go for it, but I don’t think it really adds much.