Trusted publishing of other people’s wheels?

thejcannon · September 7, 2023, 1:45am

There’s a handful of reasons to prefer wheel-only releases. Most notably, and increasingly relevant as it gets more popular, Python Build Standalone by @indygreg (who I’m still convinced is not a hivemind collective of several super-people) has quirks with compiling native modules (due to the hardcoded values in the sysconfig module).

However, not all projects have wheels. And certainly not older versions of projects. For 3 projects I discovered we depend on that don’t have wheels, I reached out on GitHub, opened issues, and then opened PRs and walked the owners through the steps in the changes. All were grateful, but took varying speeds to respond. One required compiling C code, the others were pure python.

So… what if it was easy to have wheels be built for you, in a way that everyone in the community could trust?

- Package authors wouldn’t have to worry about remembering to publish wheels (or how to publish them at all), but the community could still benefit.
- Older versions of packages could have wheels uploaded if missing.
  - This is especially useful if, say, a newer version of Python was released.
  - “Just use a newer version” isn’t always a solution.

The new solution would have to be something blessed and maintained by PyPI to have any kind of trust. I’m sure Microsoft would love to donate some compute resources (as they already kinda do to the open source community…).

Just a musing…

BrenBarn · September 7, 2023, 2:18am

Are you envisioning something similar to conda-forge, but for PyPI?

h-vetinari · September 7, 2023, 5:28am

This is essentially the “build farm for PyPI” idea, the lack of which is felt in many different ways. With the string “build farm”, you can also find some more references to this topic, including this recent mention:

I’m obviously in favour of the concept, but then again, I spend most of my FOSS time on conda-forge.

hansgeunsmeyer · September 7, 2023, 1:40pm

One of the reasons I got enamored with Rust recently is that Rust seems to have gotten this right (or at least a lot more right than the Python dev community has) with cargo and rustc. But it seems we aren’t even able to have a purely data-driven build – specified just by pyproject.toml – we still need a pretty complex setup, with setup.py, setup.cfg, pyproject.toml, MANIFEST.in for repos that contain more than pure Python code. And this setup is not really standardized… (poetry is doing its own thing, building a conda package again requires a special setup…). Instead of seeing more tools and tooling, personally I think we first need more agreement about some kind of public standard that is shared by various builders and packaging “authorities”.

steve.dower · September 7, 2023, 3:23pm

Rust is an entirely different scenario - their main distribution is sources and it’s all built around local compilation, and they control their entire native stack. JavaScript/npm is a more relevant comparison, but they’re still in the “bundle Go/.NET executables and shell out” stage, so don’t get too excited about more info

Personally, I’m a huge fan of trusted third parties building and distributing packages written by other people. We can call them “distributors”. Or alternatively, if the project maintainers themselves allow them to be published as part of the project, we can call them “other maintainers”.

That was a bit sarcastic, but my point is that these already exist and people don’t like using them.^[1] Also worth noting that virtually every distribution has figured out that they can’t just use the existing infrastructure to distribute their builds, which is why none of them do it. But that aside, there’s nothing stopping anyone hosting their own index containing builds and nothing stopping users pointing at it when they install. There’s also nothing stopping you rehosting wheels from PyPI on your index to fill in gaps if you don’t want to build everything yourself.

What’s missing is persuading users to use anything other than the default, which is incredibly difficult. Among other reasons, as soon as you are perceived to lag even one version of one package, users will declare your solution “unusable”, no matter what other benefits you offer.

And if you want your builds to be available in the default, well, the way to do that is to approach the project and get them to make you a maintainer on PyPI. There’s no requirement for all the files to be published at once or to come from the same uploader, so what’s missing here is persuading the projects to (in their view) give up control over their own builds. Some are willing (some are keen!) but others will refuse. Again, it’s a people problem, and it needs to be solved by talking to the people.

(With my “Microsoft representative” hat on briefly, GitHub Actions and Azure Pipelines free tiers are likely to be as much as we’ll donate to public projects right now. Special cases may still come up, but “we want to run builds” is adequately covered right now.)

^[1] The distributions, that is. People generally like maintainers.

thejcannon · September 7, 2023, 4:08pm

I think a lot of these points would be mitigated by PyPI’s involvement, as PyPI is already used widely and trusted. Which you point out (as I see PyPI as “the default”):

> What’s missing is persuading users to use anything other than the default, which is incredibly difficult.

And to

> And if you want your builds to be available in the default, well, the way to do that is to approach the project and get them to make you a maintainer on PyPI.

I don’t think that scales, and I suspect you don’t either. And it certainly isn’t a solution for “the ecosystem”.

So I guess the point of all this, aside from musing into the void, was to judge interest and then figure out next steps. I’m guessing the immediate next step would be to informally chat with someone(s) from PyPI about their appetite.

steve.dower · September 7, 2023, 4:25pm

They read this forum and will chime in when they feel like it. But I expect their response to be “projects can already designate approved uploaders by making them a Maintainer”.

One project at a time is the only way this scales.

pf_moore · September 7, 2023, 5:32pm

Usually, the issue here is when a distribution is incompatible with builds distributed on PyPI. If you can pip install from PyPI when you care about getting the latest version before the “distribution” makes it available, then any “unusability” problem is significantly addressed.

The problem is that most of the things that people normally call distributions don’t want their curated and tested versions mixed with arbitrary releases from PyPI. That’s not unreasonable if they want to provide some sort of support, but it’s not what every user wants.

Christoph Gohlke’s Windows wheels were a great example of “publishing other people’s wheels”. They were very popular, even though he didn’t publish his builds as a package index, so it was non-trivial to use them. I’m relatively sure that if someone published an index containing PyPI-compatible wheels of “stuff that isn’t available as wheels on PyPI”, it would get interest. How much depends on what packages it provides, of course: many of the packages that made Christoph’s builds popular (numpy, scipy, …) are now available as wheels on PyPI.

Having to add --extra-index-url https://index.pypi-extras.org to your pip invocation would be a minor disadvantage, IMO. And explicit is better than implicit, especially when dealing with the question of “where did this code I’m running come from?”

steve.dower · September 7, 2023, 5:57pm

Considering most distributions have a solid definition of “compatible” and PyPI does not, this is impossible to address.

One of my big hopes for PyBI was that PyPI would be able to adopt a specific PyBI package as the baseline for anything published to PyPI. Without any baseline at all, the only way to release compatible wheels is to also release the entire stack below (and occasionally above) that wheel.

The practical incompatibilities we see on PyPI are mostly from packages trying to handle this situation independently, by bundling or linking more than they ought to.

(If you just meant the packaging format, such that most distros don’t allow you to update their packages to incompatible ones from PyPI, see my earlier point about our current set of metadata not being able to capture the necessary info to make suitable decisions, and so distros have to invent workarounds. But we’ve covered this ground before.)

davidism · September 8, 2023, 1:27am

8 posts were split to a new topic: Why build wheels for pure Python projects?

h-vetinari · September 8, 2023, 4:24am

Some more background on why this is so much harder for Python.

If we do this, it needs to become the default IMO, i.e. get the original maintainers invested/involved in the process, because it’ll be the main delivery path to their users. I’m aware that this is a very tall order… Speaking of:

One thing that makes this endeavour a couple orders of magnitude more difficult still, is that build farms are expensive, both in terms of resources and maintenance. Such costs are the death of many a good idea in FOSS, because if you can’t find someone to pay for it, it’s just not happening.

For example, conda-forge could not exist without Microsoft (and other providers) offering free CI resources, and without Anaconda footing the bill for Petabytes worth of storage and traffic. Completely aside from that, there’s an army of bots & volunteers, plus a substantial core team (some of which are paid at least in part for their work) to keep things running.

As a counter-example, CRAN has made it such that the default publication path needs to go through a build farm. I just doubt that it scales to Python, because Python is much more of a glue language than R, and so needs to build a much wider variety of stuff across its ecosystem. Having a build farm that does all of {C, C++, Fortran, Rust, CUDA, Java, JavaScript, …} sanely, much less across all relevant platforms, is a mind-boggling amount of work^[1].

^[1] and would be essentially reinventing conda & conda-forge

steve.dower · September 8, 2023, 10:10am

Agreed, and I believe it could scale to Python, except now you need to convince the maintainers to modify their projects so that they build in a clean environment with nothing other than a python -m build command.

R basically started with this requirement, and so packages were developed with this assumption, but Python’s history means it is full of things you need to pre-install on your machine, or environment variables to set, etc. Cleaning those up would be great, but a lot of work.

“Build on GitHub Actions with the default images” is a pretty good build farm, IMHO. The trick is still to force things like native library downloads, extracts, builds and installs into the (*gasp*) setup.py so that only one command is needed to build. Most projects’ CI build scripts are much more complex than this today, in exchange for simpler Python build scripts.

thejcannon · September 8, 2023, 11:27am

To me, anything is better than nothing.

I envision the v1 of the build farm to basically:

- Run via a GitHub Actions Workflow
- The simplest definition of a build:
  1. Download the sdist
  2. Try pipx run build
  3. If that fails due to extensions, run cibuildwheel
  4. If that fails, bail. Womp womp.
  5. Otherwise, hurrah we have wheels
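
As a rough sketch of steps 2 to 5 (not a real implementation): the function name `build_wheels` and the injectable `run` callback are made up for illustration, and it assumes the sdist from step 1 is already unpacked into `src_dir`. It also treats any failure of `build` as the cue to try cibuildwheel, rather than detecting “fails due to extensions” specifically.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# A runner takes a command and returns its exit code. It is injectable so the
# control flow can be exercised without invoking any real build tool.
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and report its exit code."""
    return subprocess.run(cmd).returncode


def build_wheels(src_dir: Path, out_dir: Path, run: Runner = _run) -> str:
    """Return which step produced wheels: "build", "cibuildwheel" or "failed"."""
    # Step 2: plain build of a wheel from the unpacked sdist.
    build_cmd = ["pipx", "run", "build", "--wheel", "--outdir", str(out_dir), str(src_dir)]
    if run(build_cmd) == 0:
        return "build"
    # Step 3: fall back to cibuildwheel (assumption: any failure triggers it).
    cibw_cmd = ["pipx", "run", "cibuildwheel", "--output-dir", str(out_dir), str(src_dir)]
    if run(cibw_cmd) == 0:
        return "cibuildwheel"
    # Step 4: bail.
    return "failed"
```

Keeping the runner pluggable means the same logic could be driven from a GitHub Actions step or dry-run locally.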

I think the more challenging bit would be some kind of way to ensure we aren’t redoing work (e.g. we aren’t rerunning a build for a pure python project which already has a wheel, or the build failed for good reason). And figuring out how users can request it to be triggered.
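
For the “aren’t redoing work” part, a minimal sketch of the check, assuming the input is the already-fetched response of PyPI’s JSON API for one release (`https://pypi.org/pypi/<project>/<version>/json`), whose `urls` list carries a `packagetype` per file. The function name `needs_wheel_build` is made up, and remembering builds that “failed for good reason” is not covered here.

```python
from typing import Any, Mapping


def needs_wheel_build(release: Mapping[str, Any]) -> bool:
    """True when the release ships an sdist but no wheel of any kind."""
    kinds = {f.get("packagetype") for f in release.get("urls", [])}
    return "sdist" in kinds and "bdist_wheel" not in kinds


# Example with hand-written data in the same shape as the JSON API response.
sdist_only = {"urls": [{"packagetype": "sdist", "filename": "demo-1.0.tar.gz"}]}
print(needs_wheel_build(sdist_only))
```

This only answers “is there any wheel at all”; deciding whether a specific platform or Python version is missing would need the wheel tags as well.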

I imagine this would take us pretty far. And certainly farther than we’re at now.

For the projects I mentioned earlier where I needed wheels, this would’ve worked.

fungi · September 8, 2023, 12:45pm

> I envision the v1 of the build farm to basically:
>
> - Run via a GitHub Actions Workflow
> - The simplest definition of a build:
>   1. Download the sdist
>   2. Try pipx run build
>   3. If that fails due to extensions, run cibuildwheel
>   4. If that fails, bail. Womp womp.
>   5. Otherwise, hurrah we have wheels
>
> I think the more challenging bit would be some kind of way to
> ensure we aren’t redoing work (e.g. we aren’t rerunning a build
> for a pure python project which already has a wheel, or the build
> failed for good reason). And figuring out how users can request it
> to be triggered.

We do something along these lines today in the OpenDev
Collaboratory, in order to accelerate CI jobs for projects we host.
These projects often have many transitive dependencies to packages
from PyPI, and some of those dependencies lack appropriate wheels
yet are expensive time-wise to compile. Probably the biggest
difference (aside from only building a tiny subset of the things on
PyPI) is that we build platform-specific Linux wheels rather than
building manylinux wheels, in part because this process was
developed before there was such a thing as manylinux, and since it’s
still working there’s been little reason to revisit it.

Our process goes like this: A daily CI job calls pip to install the
various branches of Python projects we host into venvs on all the
Linux distributions we use regularly for CI. The wheel caches from
these installation calls are pooled, and the output from pip is
analyzed to determine which dependencies were built from sdist and
which were downloaded as wheels (either from our own pre-built wheel
mirrors or from our caching PyPI proxy, we try not to hit PyPI
directly from CI jobs both for efficiency and to be good citizens of
the Python community). Any wheels which were downloaded from
somewhere are removed from the pool, and then the remainder (the
ones which actually got built by the job) are synchronized to a
network filesystem.
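
To illustrate the filtering step only (this is a sketch, not the
actual OpenDev tooling): the function name `wheels_to_publish` is
made up, and the regular expression assumes pip reports downloads
as “Downloading <file-or-URL>.whl”, which may differ between pip
versions.

```python
import re

# Assumed pip log format for downloaded wheels; adjust to the pip in use.
_DOWNLOADED = re.compile(r"Downloading (\S+\.whl)")


def wheels_to_publish(pip_output: str, pooled: set[str]) -> set[str]:
    """Drop wheels that pip downloaded, keeping only those built locally."""
    # Older pip versions print a full URL, newer ones only the filename,
    # so reduce every match to its final path component.
    downloaded = {m.rsplit("/", 1)[-1] for m in _DOWNLOADED.findall(pip_output)}
    return pooled - downloaded
```

Whatever remains in the returned set is what would get synchronized
to the network filesystem.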

That network filesystem backs a distributed server farm which serves
them up with the “simple API” (separated into different file trees
by build platform) as sort of bespoke CDN endpoints in all our donor
CI regions. Our CI jobs are generally configured to look to the
closest of these mirror network endpoints for wheels as an “extra
index” so that they’ll pull things from there rather than waste
precious CI resources uselessly rebuilding the same sdists over and
over. Since the wheel cache building jobs are themselves also CI
jobs, they can take advantage of the existing cache too, so that
they don’t keep rebuilding the same ones day after day either, only
whatever’s new.

The biggest gap we have in this process is that it is additive only,
so doesn’t react to releases on PyPI getting “yanked” (we end up
continuing to serve wheels for those, which can lead to some
unexpected CI results from time to time, and requires a service
admin to pull the offending wheels out when that occurs).
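
One way this gap could be closed, sketched under assumptions: the
helper names are made up, the input is the per-release response of
PyPI’s JSON API (each entry in `urls` has a boolean `yanked`), and a
release counts as yanked only when all of its files are.

```python
import re
from typing import Any, Iterable, Mapping


def release_is_yanked(release: Mapping[str, Any]) -> bool:
    """True when the release has files and every one of them is yanked."""
    files = release.get("urls", [])
    return bool(files) and all(f.get("yanked", False) for f in files)


def wheel_key(filename: str) -> tuple[str, str]:
    """Extract (normalized project name, version) from a wheel filename."""
    name, version = filename.split("-")[:2]
    return re.sub(r"[-_.]+", "-", name).lower(), version


def prune_yanked(wheels: Iterable[str], yanked: set[tuple[str, str]]) -> list[str]:
    """Keep only the mirrored wheels whose (project, version) is not yanked."""
    return [w for w in wheels if wheel_key(w) not in yanked]
```

A periodic job could build the `yanked` set from PyPI and remove the
matching wheels without waiting for a service admin.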

brettcannon · September 8, 2023, 8:41pm

Getting a build farm is already listed as a fundable project, so it’s more about someone getting the money together (i.e., people power and resources).

What do you mean by “fails due to extensions”? What is cibuildwheel doing that build isn’t (ignoring projects that don’t specify a [build-system])?

And we all look forward to you implementing your proof-of-concept. In all seriousness, I don’t think I have ever heard anyone say this is a bad idea, just that no one has figured out how to make it sustainable.

If you want to really try this and maybe avoid paying massive costs thanks to Microsoft’s free CI (I don’t know if what I’m about to suggest goes against GitHub’s ToS or how quickly you will exhaust your free tier), you could try:

- Create an org on GitHub
- Set up a server somewhere that monitors PyPI
- Create a repo in the org per-project on PyPI (with probably some protections for spam); GraphQL mutation, REST API endpoint
- Set up each repo to do steps 1 & 2 via GitHub Actions when triggered by your server when it detects a new release (once again, with some protections to avoid spam)
- Create a release on GH and push the wheels to it
- Have a server that creates a package index that points to the built wheels in the release
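
For the last item, a minimal sketch of what generating one project page of such an index could look like, following the “simple” repository layout from PEP 503. The function names and the example URL are placeholders, and hashes, `data-requires-python` and the root index page are left out.

```python
import html
import re


def normalize(name: str) -> str:
    """Normalize a project name as PEP 503 prescribes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_page(name: str, files: dict[str, str]) -> str:
    """Render a project page; `files` maps wheel filename to download URL."""
    links = "\n".join(
        f'    <a href="{html.escape(url, quote=True)}">{html.escape(fn)}</a><br>'
        for fn, url in sorted(files.items())
    )
    title = html.escape(normalize(name))
    return (
        "<!DOCTYPE html>\n<html>\n  <body>\n"
        f"    <h1>Links for {title}</h1>\n{links}\n"
        "  </body>\n</html>\n"
    )


# Hypothetical release asset URL, for illustration only.
page = project_page("Demo_Pkg", {"demo_pkg-1.0-py3-none-any.whl": "https://example.invalid/demo_pkg-1.0-py3-none-any.whl"})
```

Serving the result at `/simple/<normalized-name>/` would let pip use it via `--extra-index-url`.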

That feels like conda-forge, but without trying to do the hard work of custom conda build scripts for projects and instead saying, “make it work with [build-system] or you don’t get the automatic builds”.

jamestwebber · September 8, 2023, 8:58pm

This puts me in mind of a post I made in @BrenBarn’s “10 year view” thread–the idea that more-generic packaging tools (like conda/mamba) could delegate installation to more specific tools like pip.

Maybe what I’m trying to say is: as part of building such a proof-of-concept (or maybe in place of it), it would be neat to see if a) conda-forge could replace many recipes with a delegation to pip and b) then could expose those automatically built wheels as an extra index. Or, if those aren’t possible, figure out the blockers.

thejcannon · September 8, 2023, 10:18pm

It really isn’t hard to get a really neato proof-of-concept. But it really isn’t worth more than a waste of Microsoft’s storage and compute, since the end goal is for PyPI to be the one hosting the wheels. Unless it was super public, I don’t expect every user of Python Build Standalone (or any other wheel wanter) to know about this toy cheeseshop.

Any proof of concept is essentially “look, I can run a GitHub action that builds a wheel” with some fun tricks on top for hosting. And that’s not terribly exciting.

And if it wasn’t clear: the way to make this inexpensive to start is to build by request. That is probably where most of the complexity lies: not in the building or hosting, but in the management of requests.

brettcannon · September 11, 2023, 11:45pm

Sure, if that’s your end goal. But if your end goal is some index that backfills missing wheels from PyPI then you can do it independent of PyPI. Really depends on how you define success and how important it is to you to reach it. You can either work towards it iteratively and maybe not get that far or that much uptake, or you can aim for the whole thing and maybe get nowhere.

dstufft · September 17, 2023, 6:47pm

I think that having something set up where PyPI can automatically build wheels (either directly, or by farming things out to some other system) would be a pretty nice UX improvement overall.

However, one thing that I think is very important here is that I don’t think that PyPI should be producing these wheels without the project itself asking for it to. I know that there are ecosystems where this is the norm, but it’s not the norm for PyPI and I think that it’s important that we allow authors to retain control over how/who is building their code.

I think we also can’t legally blanket build for all of PyPI. AIUI, our Terms of Service guarantee us the right to distribute what was uploaded, but not to build it or execute it.