A proposal for sdist build complexity signaling, providing user agency

kknechtel · July 6, 2024, 1:41pm

Continuing the discussion from Provide a way to signal an sdist isn't meant to be built?:

I tried to “do the work” and came up with a proposal. This is going to be long, because I’m trying to provide PEP-like levels of detail.

1. Gradual rollout for PEP 725 implementation

(For reference, that’s “Specifying external dependencies in pyproject.toml”.)

My idea is that in theory, maintainers will be following PEP 725 anyway. Rather than having project authors and maintainers make the choice about whether a project’s sdist is “too hard” for users to build (without knowing anything about those users), the goal is to give users (by default, and when using Pip interactively) the information up front about why the sdist might be hard to build (before it has a chance to fail).

Rather than immediately trying to solve the problem of turning dependency specifications like "virtual:compiler/c" or "pkg:generic/freetype" into an automatic lookup process, we should plan on just being able to map those specifications into friendly, human-readable descriptions for use within Pip etc. output (like “a C compiler” or “the freetype library (try installing it with your system package manager)”).
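
As a rough illustration of that mapping idea, here is a minimal sketch. The table name `FRIENDLY_DESCRIPTIONS` and the function `describe_requirement` are hypothetical, not part of Pip or PEP 725; the two keys and their descriptions are simply the examples given above.

```python
# Hypothetical lookup table: keys are the example specifiers from the post,
# values are the friendly descriptions suggested there.
FRIENDLY_DESCRIPTIONS = {
    "virtual:compiler/c": "a C compiler",
    "pkg:generic/freetype": (
        "the freetype library "
        "(try installing it with your system package manager)"
    ),
}


def describe_requirement(specifier: str) -> str:
    """Return a human-readable description for an external dependency."""
    try:
        return FRIENDLY_DESCRIPTIONS[specifier]
    except KeyError:
        # Unknown specifier: show it verbatim rather than guess a description.
        return f"an external dependency identified as {specifier!r}"
```

Falling back to the raw specifier keeps the output honest when no description has been written yet.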

That lays the groundwork for installers to present detailed prompts, so the user can check manually whether the requirements are met before attempting to build the sdist. (Users who don’t understand this information can be advised not to try.)

2. New options for Pip

The existing options for influencing the choice between wheels and sdists are IMO rather confusing. Options like --only-binary and --no-binary and --prefer-binary seem like part of an enumeration of possible approaches, rather than orthogonal binary flags. They also don’t represent all the approaches that make sense, especially if the user is allowed to respond to new information during the process.

I seek to add one new approach for now, without ruling out the possibility of more alternatives being proposed in the future. Thus, I propose to add the following command line flag syntaxes, with an eye towards deprecating the aforementioned options:

--build-sdists STRATEGY
--build-sdist-for PACKAGES STRATEGY

Here, PACKAGES is a comma-separated list of package names, as currently used with --no-binary and --only-binary options. (The :all: and :none: syntaxes seem unnecessary here and I would suggest not supporting them.)

For a given package, Pip would choose to use wheels or build sdists according to the STRATEGY:

never - only use wheels, and fail if no wheel is available (should be equivalent to --only-binary)
always - only use sdists, and fail if no sdist is available (should be equivalent to --no-binary)
when-newest - try to build the newest suitable version if it’s sdist-only; otherwise use the newest suitable wheel (the current default, as I understand it)
when-needed - use a wheel if possible, but build the newest sdist if there are no compatible wheels (should be equivalent to --prefer-binary)
ask-when-newest - see below
ask-when-needed - see below - new, default behaviour
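
To make the four non-interactive strategies concrete, here is a minimal sketch of the selection rule as I read the list above. `BuildStrategy`, `Candidate` and `pick` are hypothetical names (nothing here is Pip's real resolver), and versions are simplified to tuples.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class BuildStrategy(Enum):
    NEVER = "never"
    ALWAYS = "always"
    WHEN_NEWEST = "when-newest"
    WHEN_NEEDED = "when-needed"
    ASK_WHEN_NEWEST = "ask-when-newest"
    ASK_WHEN_NEEDED = "ask-when-needed"


@dataclass(frozen=True)
class Candidate:
    version: Tuple[int, ...]  # simplified stand-in for a real version object
    is_wheel: bool


def _newest(group: List[Candidate]) -> Optional[Candidate]:
    return max(group, key=lambda c: c.version, default=None)


def pick(candidates: List[Candidate], strategy: BuildStrategy) -> Optional[Candidate]:
    """Pick a candidate, or None if the strategy rules all of them out."""
    wheels = [c for c in candidates if c.is_wheel]
    sdists = [c for c in candidates if not c.is_wheel]
    if strategy is BuildStrategy.NEVER:
        return _newest(wheels)
    if strategy is BuildStrategy.ALWAYS:
        return _newest(sdists)
    if strategy is BuildStrategy.WHEN_NEEDED:
        # Any compatible wheel beats building, even if it is older.
        best_wheel = _newest(wheels)
        return best_wheel if best_wheel is not None else _newest(sdists)
    if strategy is BuildStrategy.WHEN_NEWEST:
        # Newest version wins; within that version, prefer the wheel.
        return max(candidates, key=lambda c: (c.version, c.is_wheel), default=None)
    # The ask-* strategies need a user prompt, which this sketch leaves out.
    raise ValueError(f"{strategy.value} requires interactive input")
```

The only difference between when-newest and when-needed in this sketch is whether version or format is compared first.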

3. Prompting the user

When the ask-when-newest strategy is selected and Pip is being used normally (i.e., without --no-input), the user is prompted like:

The newest version of `foo` compatible with everything being installed is `1.2.3`.

There is no compatible wheel available for this version, but Pip could try to build it from source.

The package claims that building requires:
* a C compiler available as `gcc` on the command line
* the `freetype` library, which needs to be installed with
  your system package manager or by following directions at 
  <url>

Please choose:
1. Try to build this package now
2. Stop all installation (and maybe try again later)
3. Look for an older version (including source distributions - 
   some of them could be easier to build locally)
4. Look for an older version, but only accept wheels

If the user chooses to look for an older version, including sdists, the prompt is repeated for each sdist found (until a wheel is found, building commences or the user cancels).

Similarly for ask-when-needed:

Pip can't find any compatible wheels for `foo`, but could try to build version `1.2.3` from source.

The package claims that building requires:
* a C compiler available as `gcc` on the command line
* the `freetype` library, which needs to be installed with
  your system package manager or by following directions at 
  <url>

Please choose:
1. Try to build this package now
2. Stop all installation (and maybe try again later)
3. Check the next most recent version, in case it's easier to build
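
For what it's worth, assembling a prompt like the one above is mostly string formatting. A minimal sketch follows; `render_build_prompt` is a hypothetical helper, not an existing Pip function, and the wording is copied from the ask-when-needed example.

```python
from typing import Sequence


def render_build_prompt(
    name: str,
    version: str,
    requirements: Sequence[str],
    choices: Sequence[str],
) -> str:
    """Build the ask-when-needed prompt text from already-described requirements."""
    lines = [
        f"Pip can't find any compatible wheels for `{name}`, "
        f"but could try to build version `{version}` from source.",
        "",
        "The package claims that building requires:",
    ]
    lines += [f"* {req}" for req in requirements]
    lines += ["", "Please choose:"]
    # Number the choices starting at 1, as in the example prompt.
    lines += [f"{i}. {choice}" for i, choice in enumerate(choices, start=1)]
    return "\n".join(lines)
```

The requirement strings are expected to be human-readable already, so this function stays independent of however the PEP 725 specifiers end up being described.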

4. For CI users

If the --no-input option is provided, such that prompting isn’t possible, corresponding information should be logged. Under these conditions, Pip will try to build the package; so ask-when-newest and ask-when-needed are equivalent to when-newest and when-needed respectively.
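
The fallback rule just described is small enough to sketch directly. `effective_strategy` and `ASK_FALLBACKS` are hypothetical names; strategies are plain strings here to keep the example self-contained.

```python
# Each ask-* strategy degrades to its non-interactive counterpart.
ASK_FALLBACKS = {
    "ask-when-newest": "when-newest",
    "ask-when-needed": "when-needed",
}


def effective_strategy(strategy: str, no_input: bool) -> str:
    """Return the strategy Pip would actually apply, given --no-input."""
    if no_input:
        # Prompting is impossible, so behave like the non-ask variant.
        return ASK_FALLBACKS.get(strategy, strategy)
    return strategy
```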

A log message would be more technical - it could look something like:

Pip is attempting to build `foo==1.2.3` from source because a wheel couldn't be found that's compatible with both the platform and the arguments to Pip.

The package claims that building requires:
* a C compiler available as `gcc` on the command line
* the `freetype` library, which needs to be installed with
  your system package manager or by following directions at 
  <url>

If building fails, please adjust your install scripts appropriately, e.g. by disallowing this version in your requirements specifiers.

To suppress this diagnostic in the future, please pass Pip an appropriate value for `--build-sdists STRATEGY` or `--build-sdist-for foo STRATEGY` as described in the Pip documentation.

5. Backwards compatibility concerns

For the purpose of prompting and logging, if no PEP 725 metadata is available, Pip should not assume that there are no special requirements to build the sdist, but instead that these requirements are unknown. (In a future where the specified dependencies can actually be fetched, of course, it would be reasonable for Pip to act as though there are no such dependencies, and allow the ancient legacy setup.py to deliver the bad news.) If the PEP 725 metadata explicitly has empty values for build-requires and host-requires, of course, then there really aren’t any special requirements.
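
The distinction between "unknown" and "none" could be sketched as a three-way result. This assumes the `[external]` table layout from the PEP 725 draft, already parsed into a dict; the function name is hypothetical, and treating "table present but both keys missing" as unknown is my own assumption.

```python
from typing import List, Optional


def external_build_requirements(pyproject: dict) -> Optional[List[str]]:
    """Return None if requirements are unknown, else the declared list.

    An empty list means the project explicitly declared that it has
    no external build or host requirements.
    """
    external = pyproject.get("external")
    if external is None:
        return None  # no PEP 725 metadata at all: unknown
    if "build-requires" not in external and "host-requires" not in external:
        return None  # assumption: neither key declared, still unknown
    return list(external.get("build-requires", [])) + list(
        external.get("host-requires", [])
    )
```

A caller would show the "unknown system requirements" warning for `None` and skip the requirements section entirely for `[]`.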

The practical, immediate effect is that if, say, foo is a legacy project that uses setup.py and only distributes an sdist despite being pure Python, the user will get a false-alarm warning that the project needs to be “built from source” with unknown system requirements. Users who disregard that warning will see the project install just as it did before, and that could well continue to be the case indefinitely.

If, on the other hand, foo is one of these new giant AI libraries that depend on Torch and a bunch of other things, the same warning would be real, and disregarding it would lead to the same “help, what is subprocess-exited-with-error” situation we have today - but at least the user was warned and got a decently clear explanation up front. And the packages would be able to add PEP 725 metadata to give a bit more clarity about what’s involved.

If it’s really desirable to let the package authors give custom warnings here, I think that might be best implemented by extending PEP 725 to describe pseudo-dependencies (that are always considered “met” by whatever future resolver, but allow custom descriptions).

pf_moore · July 6, 2024, 3:04pm

This is a very different proposal than the existing “--only-binary by default” one. That’s not to say it’s a bad thing, just that it has different trade-offs. Personally I prefer --only-binary being the default, but that’s because I’m in the privileged position of using a platform where most packages I’m interested in come with binaries.

IMO, this proposal has a significantly smaller risk of breaking existing usage than --only-binary by default, but in contrast it would be much harder to implement. I’d be fine with seeing a PR from someone implementing it, but in the absence of one, I’d rather see --only-binary by default, as I think that realistically it’s more achievable.

On the other hand, maybe the trade-offs would be different for uv, so it might be interesting to see if they would be interested in implementing this proposal…

kknechtel · July 6, 2024, 3:26pm

I mean, I’d be interested in giving it a shot, except that the existing Pip codebase is apparently something like 150kloc of Python (plus whatever else) and I wouldn’t know where to jump into it.

(Also, I guess the sort of PEP 725 mapping I have in mind would be blocked on finalization of PEP 725. But that could be a placeholder in an example implementation.)

oscarbenjamin · July 6, 2024, 4:09pm

You don’t have to read all the code. Take a look at src/pip/_internal/commands/install.py and then check the functions and classes imported from elsewhere as needed. Stick a breakpoint() in install.py and you can follow the whole process in a debugger. The architecture docs are also useful for a high level overview.

pf_moore · July 6, 2024, 4:43pm

@oscarbenjamin is correct, but also yes, this is why it’s hard to get new features into pip - pip is already very complex.

kknechtel · July 6, 2024, 6:16pm

Hmm.

After looking in a bit, the implementation (completely disconnected from --no-binary and --only-binary) and semantics (doesn’t apply to individual packages, but to “everything not specified by --no-binary or --only-binary”) of --prefer-binary would make it difficult to implement what I have in mind. At the least, it seems like it would have to start with a major refactoring of FormatControl and the code using it (as well as the prefer_binary instance of Option). Getting my head around this would really require a clear mapping from package to… what I called “strategy” above; for the existing options a tuple would suffice (giving acceptable formats in preference order), but in the long run I’d want it to be an enum.
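
To show what I mean by the tuple representation, here is a minimal sketch. It is not how FormatControl works today; `LEGACY_FORMAT_PREFERENCES`, `normalize` and `formats_for` are hypothetical names, and the name normalization follows the usual PEP 503 rule.

```python
import re
from typing import Dict, Tuple

# Acceptable formats in preference order for each existing option.
LEGACY_FORMAT_PREFERENCES: Dict[str, Tuple[str, ...]] = {
    "--only-binary": ("wheel",),
    "--no-binary": ("sdist",),
    "--prefer-binary": ("wheel", "sdist"),
}


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def formats_for(
    package: str,
    overrides: Dict[str, Tuple[str, ...]],
    default: Tuple[str, ...],
) -> Tuple[str, ...]:
    """Look up the per-package preference, falling back to the global one.

    Keys in `overrides` are expected to be normalized names.
    """
    return overrides.get(normalize(package), default)
```

A tuple like this can't express "ask the user first", which is part of why an enum seems like the better long-run representation.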

tiran · July 7, 2024, 12:03pm

I’m a big fan of PEP 725, +1 for the idea. But (and I’m sorry that there is always a but) I see two problems for the broader adoption for the PEP.

The version specifiers for PURLs proposal has not landed yet. There is currently no way to express a minimum version in a PURL.
There is no authoritative list of pkg:generic/ names and the PEP does not specify how to canonicalize the name of a dependency.

Lack of versioning is not a problem for most packages. Only a few packages like llvmlite have hard version requirements.

The second item is a bigger problem, because Linux distributions and vendors don’t agree on common names. For example, Debian-based distributions have package names like libssl-dev, zlib1g-dev, and libncurses5-dev. Fedora-based distributions have openssl-devel, zlib-devel, ncurses-devel. Even the examples in the PEP are not consistent and reflect preferred names from distributions. A user may wonder why it’s lcms2 and freetype on the one hand, and libtiff and libjpeg on the other hand. (*)

It would be helpful to have guidelines and a list of common package names to avoid fragmentation. Fedora naming guidelines recommend taking the upstream source tarball, project name, or name in other distributions into account.
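
A list like that could start as a simple table. In this sketch the distribution package names are the ones quoted above, while the canonical keys (`openssl`, `zlib`, `ncurses`) and the helper `distro_package` are only assumptions for illustration; no such authoritative list exists yet.

```python
from typing import Optional

# Hypothetical canonical name -> package name per distribution family.
DISTRO_PACKAGE_NAMES = {
    "openssl": {"debian": "libssl-dev", "fedora": "openssl-devel"},
    "zlib": {"debian": "zlib1g-dev", "fedora": "zlib-devel"},
    "ncurses": {"debian": "libncurses5-dev", "fedora": "ncurses-devel"},
}


def distro_package(canonical: str, family: str) -> Optional[str]:
    """Return the distribution-specific package name, or None if unknown."""
    return DISTRO_PACKAGE_NAMES.get(canonical, {}).get(family)
```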

And please use virtual:compiler/c++ for a C++ compiler. For me, virtual:compiler/cpp stands for C Pre Processor.

(*) The names are based on the upstream project names with the exception of Little CMS.

JamesParrott · July 7, 2024, 12:38pm

That’s fantastic you’ve done the leg work on this one.

But as regards the current alternatives: are setup.py installs going to be deprecated any time soon?

I would use a build hook that throws an error, with pyproject.toml instead of setup.py (as I’m late to the party), or simply not publish an sdist.

pradyunsg · July 7, 2024, 1:01pm

FWIW, this has already been discussed on the PEP’s thread. It would be a separate PEP as mentioned in PEP 725: Specifying external dependencies in pyproject.toml - #6 by rgommers (with a link to a draft of such a future PEP in that post too).

oscarbenjamin · July 7, 2024, 1:10pm

See “Is setup.py deprecated?” You can still use setup.py if you want and you can have it throw an error.

Having a setup.py that throws or not uploading an sdist are both not good solutions because either way it means that there is no sdist for anyone downstream who actually does want to build from source. Some packages already do these things though. An intended benefit of being able to signal to a tool like pip not to build by default is that there should be no need for projects to consider the suboptimal workarounds of having no sdist or a broken sdist.

kknechtel · July 7, 2024, 2:13pm

Admittedly I haven’t had the attention span to keep up with that PEP process. But I do need to say, I don’t think the idea has much value unless that canonicalization is done. There’s not much use in knowing what keys to use in the TOML, if I can’t determine the exact values too. This is far more important IMO than coming up with ways to automate the lookup of those dependencies - really, that should be the packaging tools’ responsibility anyway.

tl;dr: the role of setup.py has changed over time, but there’s no reason to expect it to go away. I can’t fathom a future where Pip stops supporting local builds, and it’s not realistic to orchestrate an arbitrary compilation process through TOML. But you’re already expected not to run setup.py as a standalone script. Instead…

if I understood you correctly, is a distinction without a difference.

These approaches will likely be supported forever, regardless of my current proposal or the simpler idea of making --only-binary a default or anything else. There isn’t really a way to prevent them from working.

But they have clear disadvantages.

If you don’t publish an sdist, you might lose potential users who could build the project easily but aren’t given the option. If you also want to do open source, then you need a separate channel to publish the source (admittedly not hard), and certain third parties will want additional assurances that said published source corresponds to the wheel. It might also cause problems for certain licenses.

If you error out from compilation, it could be that much harder for the user to understand what happened. More importantly, it can’t happen until building has already started. It’s possible to give more detailed information this way (“you said you have a C compiler, but I tried these paths and none of them led to an executable at all!”), but I would still do it in addition to advertising PEP 725 requirements.

My proposal is an attempt to leverage said requirements to warn users ahead of time.

JamesParrott · July 7, 2024, 2:21pm

it means that there is no sdist for anyone downstream who actually does want to build from source

Only those downstream relying on pip, who are not necessarily humans. They just have to read and follow the build instructions instead. A good exception message would explain to ordinary users that their use case is not supported, and point devs and downstream third party builders to the docs. There’ve been a couple of excellent example messages in this and the previous thread, which I would have loved to have had pip show me, when I previously failed to install (by inadvertently building) cryptography with it.

JamesParrott · July 7, 2024, 2:40pm

I agree with the disadvantages. But they’re all the package author or owner’s prerogative.

I’m a fan of reproducible builds too. In the past I’ve provided instructions, a Dockerfile, a CI process, and build scripts, all to make that easier for third parties (and myself of course, should I forget).

All those principles are fantastic, I’m not debating any of them at all. I just think firstly, relying on unzipping .tar.gz files from PyPI is not a user friendly way of achieving open source status. And secondly, I don’t buy the implication that it necessarily falls to pip and PyPI to provide third parties their assurances (reproducibility), or GPL compliance.

Passing on failed compilation attempts, let alone compiler errors, would be poor communication indeed. I’d use an env var, or carry out a few LBYL tests first, instead of using EAFP for compilation, whatever the build environment.
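
Something like the following is what I have in mind for the LBYL approach: a minimal sketch that a build script could run before attempting any compilation. The environment variable name `FOO_BUILD_FROM_SOURCE` and both function names are made up for illustration.

```python
import os
import shutil
from typing import Iterable, List, Mapping, Optional


def missing_build_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_before_build(
    tools: Iterable[str],
    opt_in_var: str = "FOO_BUILD_FROM_SOURCE",  # hypothetical variable name
    environ: Optional[Mapping[str, str]] = None,
) -> None:
    """Raise a readable error before compilation starts, not during it."""
    env = os.environ if environ is None else environ
    if env.get(opt_in_var) != "1":
        raise RuntimeError(
            f"Building from source is not supported by default. "
            f"Set {opt_in_var}=1 to opt in, and see the build docs."
        )
    missing = missing_build_tools(tools)
    if missing:
        raise RuntimeError(
            "Missing required build tools: " + ", ".join(missing)
        )
```

Checking up front means the user sees one short message instead of a compiler traceback.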

But isn’t there some way to raise a useful exception when setting up the build environment, that could occur much sooner than compilation? E.g. via a sentinel package on PyPI that can never be installed, but to which helpful error messages can be added.

The proposal is still a really great feature, if it’s decided to proceed with it. It would directly solve a problem that has frustrated me in the past (silent install failures). But it’s not obligatory.

oscarbenjamin · July 7, 2024, 2:42pm

Not just pip but all PEP 517 build frontends. There is a standardised way to distribute the source code and to build from the source code. If you use the setup.py to make that process error out then you have broken the expected way to build a Python package.

JamesParrott · July 7, 2024, 2:44pm

you have broken the expected way to build a Python package.

Indeed. Intentionally. But ideally, helpfully.

Isn’t the whole point of this, for packages for which the standardised build won’t work, for Python packages that must be built some other way, or that have special requirements beyond what the standard build backends can provide?

kknechtel · July 7, 2024, 4:16pm

… By the way, does Pip generally consider PRs for pure refactoring without new functionality? To implement what I have in mind, the only way forward I can see would involve changing how the existing options parsing is done for --only-binary etc. - and probably also the interface between that and the rest of the system, so it would ripple a bit. But I think this would be beneficial regardless; that I have a decent chance of simplifying the existing codebase; and that it makes sense to separate this part out.

pf_moore · July 7, 2024, 5:04pm

If a feature requires both a refactoring and functional changes, we strongly prefer the refactoring gets split out into its own PR. Trying to identify which are the functional changes and which are refactorings in an “all in one” PR can be a nightmare. As a rule of thumb, whatever you can do to split a large PR into smaller, more easily reviewable chunks, is worth doing.

On the other hand, PRs that are simply isolated refactorings, with no motivating feature PR, while they aren’t unwelcome, are low priority - and given how hard it is to get reviews of even feature changes, you shouldn’t expect a quick turnaround…

pradyunsg · July 7, 2024, 6:24pm

Yes, although as Paul said, there is very limited review capacity on the project so, it can take a while for any open PRs to land.

oscarbenjamin · July 7, 2024, 6:55pm

You might be thinking of different cases from the ones I am, but for the examples I am thinking of, the situation is that the standard build works if some external requirements are satisfied and does not work otherwise. If someone wants to build, then they need to satisfy the external requirements first. When they do build, they can still use a standard PEP 517 frontend (such as pip) to do the building, though.

For example, on Ubuntu, after you apt-get install libopenblas-dev build-essential, pip install numpy should be able to build from sdist successfully. However, most users on e.g. Windows will not have openblas lying around and they can’t just apt-get to install it (outside of WSL).

In this context it is still useful and expected that the project distributes an sdist and that the sdist can be built. It is just not useful for most end users when pip attempts the build as part of pip install. There are still plenty of people who build these projects from sdist but they are generally a smaller more experienced group who can more reasonably be expected to provide an opt-in flag that tells pip to build from sdist.

Ideally there should be a way for projects to communicate to pip that it should not build by default while still distributing an sdist that does not have a broken build script. For now there is no such way to communicate this and so it might be that shipping a broken setup.py is a useful workaround. The purpose of this thread and the other linked in the OP is that there should be a better way to do this that does not break the build for people who actually do want to build from source and who can be expected to take responsibility for providing the external requirements before doing so.

JamesParrott · July 8, 2024, 9:05am

That’s a really useful clarification and a great example - thanks. That shows the usefulness of what Karl’s written, over a boolean flag.

I’m trying to understand firstly what’s wrong with a simple broken build script (albeit intentionally and constructively, and perhaps conditionally), and secondly which users the proposal is useful for, beyond the majority of us, who would all benefit from being told we can’t install a library the standard way on our platform.

Why would any of these more experienced users, that still want to build from an sdist, for their platform or for other aforementioned reasons, who know how to compile C extensions, object to setting an environment variable beforehand (or reading the docs)? What’s their goal? Which tools are they using? It’d be good to understand what value I’m adding by packaging my projects to suit them.

Nonetheless, overall I think implementing this is a great feature and adds value to pip for all of us. Explicit is better than implicit. I think the decision to compile from an sdist should be intentional, and made explicit.

The proposal’s just not obligatory. The onus to figure out how to compile something a non-standard way, off the well trodden path, is on the user.