Rebuilding Python Wheels

Hey there!

I’m working on open-source packaging ecosystem improvements at Google and recently prototyped a rebuilder for Python Wheels.

I’ve put together a summary of the system and its results, and I presented the broader work at PackagingCon last month (talk).

Overall, the system successfully rebuilt about 50% of the top 100 packages with very little complexity and achieved coverage of ~25% of all PyPI downloads. I think the strategy is promising for assessing the degree to which packages reflect their advertised upstreams and for encouraging package owners to maintain good metadata and release hygiene.
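To make the "does the package reflect its upstream" check concrete, here is a minimal sketch of one way a published wheel could be compared against a rebuilt one. It is not the prototype's actual code: the function names `wheel_digests` and `diff_wheels` are hypothetical, and it assumes a simple per-file content comparison is good enough for a first pass.

```python
import hashlib
import zipfile


def wheel_digests(wheel):
    """Map each archive member name to the SHA-256 of its contents.

    `wheel` may be a path or a binary file object (a wheel is a zip archive).
    """
    with zipfile.ZipFile(wheel) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def diff_wheels(published, rebuilt):
    """Report members found in only one wheel or whose contents differ."""
    a, b = wheel_digests(published), wheel_digests(rebuilt)
    return {
        "only_published": sorted(a.keys() - b.keys()),
        "only_rebuilt": sorted(b.keys() - a.keys()),
        "changed": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }
```

Hashing member contents rather than the whole archive sidesteps differences in zip timestamps and compression, though files such as RECORD will still show up as changed whenever anything else does.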

I’d love to get any feedback or suggestions everyone here might have, and I’ll be sure to keep this list updated on future work in this space.

Thanks!


Thanks for sharing!

Would you envision this feeding into or using Sigstore to publish the hashes of the builds?

For those projects for which you couldn’t infer the repo, did you try against the sdist for at least that level of reproducibility?

It seems like you would benefit the most from the core metadata being expanded to record source code provenance a bit more (or from a separate file like the direct URL recording spec). Is that a fair assessment? If so, are you thinking of starting a discussion to submit a PEP?

Thanks for sharing!

Absolutely!

Would you envision this feeding into or using Sigstore to publish the hashes of the builds?

Yeah we’ve been working with the Sigstore team on this project and that’s one of the options for bringing this into wider use.

Another smaller opportunity might be to provide feedback directly to maintainers when we have high confidence we detect a change not present upstream. There’s definitely a spectrum of short- to long-term applications of this technique and many ways of getting to an end state of guaranteed source metadata for Python packages.

For those projects for which you couldn’t infer the repo, did you try against the sdist for at least that level of reproducibility?

That’s a great idea! I hadn’t thought of it, but it’s certainly worth exploring. I had some trouble with sdists early on and dropped them from the prototype, but I think sdist packaging processes have a lot of room for improvement in the ecosystem, too.
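As a rough illustration of what an sdist-level check could look like (this was not part of the prototype), the sketch below reports the `.py` files in a wheel whose exact contents do not appear anywhere in the matching sdist. The function names are hypothetical, and matching by content hash is an assumption made to avoid guessing how sdist paths map onto wheel paths.

```python
import hashlib
import tarfile
import zipfile


def sdist_hashes(sdist_fileobj):
    """Return the SHA-256 of every regular file in a .tar.gz sdist."""
    hashes = set()
    with tarfile.open(fileobj=sdist_fileobj, mode="r:gz") as tf:
        for member in tf:
            if member.isfile():
                data = tf.extractfile(member).read()
                hashes.add(hashlib.sha256(data).hexdigest())
    return hashes


def wheel_sources_missing_from_sdist(wheel_fileobj, sdist_fileobj):
    """List .py members of the wheel with no byte-identical file in the sdist."""
    known = sdist_hashes(sdist_fileobj)
    with zipfile.ZipFile(wheel_fileobj) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if name.endswith(".py")
            and hashlib.sha256(zf.read(name)).hexdigest() not in known
        )
```

Both arguments are binary file objects, e.g. `open("pkg-1.0.tar.gz", "rb")`. Generated or compiled files would need separate handling, since they legitimately exist only in the wheel.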

It seems like you would benefit the most from the core metadata being expanded upon to record source code provenance a bit more (or as a separate file like the direct URL recording spec). Is that a fair assessment?

An in-package indicator as found in pbr.json, an adaptation of PEP 610, or others like it (e.g. .cargo_vcs_info.json) is certainly an option. My intuition is that, if present at all, it belongs alongside a more complete record of the steps to reproduce the build, like buildinfo, rather than being included as a lone hint to the package user.
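For reference, a PEP 610 `direct_url.json` record for a VCS install carries a `url` plus a `vcs_info` object with `vcs` and `commit_id` keys. The sketch below shows how a rebuilder might read a source hint from a record of that shape. It assumes, hypothetically, that such a record was shipped with or published for the package (today installers write it at install time), and `vcs_hint` is an illustrative name, not an existing API.

```python
import json


def vcs_hint(direct_url_text):
    """Extract the repository URL and commit from a PEP 610-shaped record.

    Returns None when the record does not describe a VCS source.
    """
    data = json.loads(direct_url_text)
    info = data.get("vcs_info")
    if info is None:
        return None
    return {
        "vcs": info["vcs"],
        "url": data["url"],
        "commit_id": info["commit_id"],
    }


# Example record; the repository URL and commit are placeholders.
example = json.dumps(
    {
        "url": "https://example.invalid/project.git",
        "vcs_info": {"vcs": "git", "commit_id": "0" * 40},
    }
)
print(vcs_hint(example))
```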

I’d probably favor keeping this data in the API (or even in an entirely separate data store like a transparency log a la sigstore) over changing the internal package format just yet.

Regardless of the mechanics, though, the overall goal would be incorporating this source info into the package metadata such that users are better able to understand the code they use. I’m sure there are many ways of doing so and those with background on Python packaging’s recent history are well-placed to make suggestions (I’d love to hear them!).

If so, are you thinking of starting a discussion to submit a PEP?

I don’t think the path forward is clear enough for a PEP just yet, but that is where I’d like this to end up. Getting some form of rebuilder integration agreed upon in the community would be a great target for the next few months!
