WheelNext & Wheel Variants: An update, and a request for feedback!

That really depends on the type of package. For numerical libraries like NumPy, PyTorch, and OpenBLAS there are hundreds of performance-critical functions, not just a few. I can imagine (but have no hard knowledge on this) that for a package with higher-level primitives like Pillow-SIMD, there may be a smaller set of accelerated functions. I do have some numbers, and am happy to consolidate them in the rationale for “why not dynamic dispatch”. To give some rough indication now:

  • OpenBLAS: most BLAS and some LAPACK functions have multiple implementations. We took wheel sizes from roughly 9.5 MB down to 5.5 MB through careful tradeoffs, by reducing the number of x86-64 CPU architectures dispatched for from 15 to 5 (see this issue). There’s another 25-40% in wheel size to be gained if we could have wheel variants. There is a lot of code needed to support dynamic dispatch, and more than 50% of the total size came from it.
  • NumPy: there are probably O(100) performance-sensitive functions with SIMD implementations, out of O(1000) public functions in total. If I compare a default local wheel build (with optimizations) to one without optimizations (add -Csetup-args=-Ddisable-optimization=true to python -m build), the wheel size drops from 7.5 MB to 5.5 MB.
  • PyTorch: I don’t have exact numbers at hand, but:
    • For CPU code, the SIMD usage is probably a little heavier than NumPy’s
    • For GPU code, the story is completely different. Almost every operator is compiled 8 times or so for different GPU architectures, which matters not only for performance but for supporting the hardware at all - and binary size scales linearly with the number of architectures supported (the sketch below this list illustrates why).

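To make the size effect a bit more concrete: below is a minimal, hypothetical C sketch (not taken from OpenBLAS, NumPy, or PyTorch - the function names and target sets are made up for illustration) of how a single hot kernel ends up in the binary once per x86-64 ISA level it is built for. Multiply that by hundreds of kernels and the numbers above follow naturally.

```c
/* Hypothetical hot kernel, built once per target ISA level (GCC/Clang
 * target attributes). Every copy ends up in the shared library, so
 * binary size grows roughly linearly with the number of architectures
 * you dispatch for. */
#include <stddef.h>

__attribute__((target("sse2")))
double dot_sse2(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];  /* auto-vectorized for SSE2 */
    return s;
}

__attribute__((target("avx2,fma")))
double dot_avx2(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];  /* auto-vectorized for AVX2 + FMA */
    return s;
}

__attribute__((target("avx512f")))
double dot_avx512(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += a[i] * b[i];  /* auto-vectorized for AVX-512 */
    return s;
}
```
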
So binary-size-wise, dynamic dispatch is more expensive than you might think (and as @dstufft explained, it’s the per-file wheel size that matters, not the cumulative size). Then there is implementation complexity, which is probably even more important. It’s incredibly complex to set up dynamic dispatching, since it requires not only CPU feature detection but also indirection at the level of individual functions. That is orders of magnitude more work than simply passing a flag at build time to compile for a different architecture. I’m not exaggerating - it’s involved enough that projects like SciPy and scikit-learn have looked at it and decided it was too complex to implement. Only a handful of projects (to my knowledge), like PyTorch, NumPy, OpenCV and Pillow-SIMD, actually use it.
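
Choosing between those compiled copies at run time is where the complexity comes in. As a rough sketch of the kind of plumbing involved (again hypothetical, and much simplified compared to what NumPy or OpenBLAS actually do), every dispatched function needs a resolver plus an indirection along these lines:

```c
/* Hypothetical runtime dispatcher for the kernels sketched above:
 * detect CPU features once, then call the chosen implementation
 * through a function pointer. Real projects need some variant of
 * this indirection for every dispatched function, across compilers
 * and platforms. */
#include <stddef.h>

double dot_sse2(const double *a, const double *b, size_t n);
double dot_avx2(const double *a, const double *b, size_t n);
double dot_avx512(const double *a, const double *b, size_t n);

typedef double (*dot_fn)(const double *, const double *, size_t);
static dot_fn dot_impl = NULL;

static void init_dispatch(void) {
    __builtin_cpu_init();                      /* GCC/Clang x86 feature detection */
    if (__builtin_cpu_supports("avx512f"))
        dot_impl = dot_avx512;
    else if (__builtin_cpu_supports("avx2"))
        dot_impl = dot_avx2;
    else
        dot_impl = dot_sse2;
}

double dot(const double *a, const double *b, size_t n) {
    if (!dot_impl) init_dispatch();            /* lazy init; real code also worries about thread safety */
    return dot_impl(a, b, n);
}
```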

tl;dr dynamic dispatch can be valuable, but due to both binary size and complexity/maintainability it can’t reasonably be considered as the only/standard solution.

***

As a meta comment: while new ideas are very welcome and will hopefully lead to a better design with fewer tradeoffs that is acceptable to everyone, the many folks involved in the wheel variants effort did already explore a number of different designs and looked at a lot of the prior art. “Rely on dynamic dispatch” and “use different package names” are two fairly obvious ideas with a good amount of prior art. In addition, there have been multiple previous Discourse threads on these topics with hundreds of posts (e.g., here, here, and here), so these particular ideas are well known. It’s therefore probably more constructive to ask whether your favorite solution X was considered and whether anyone can share a summary of the tradeoffs, rather than asserting that X is the most obvious, idiomatic, or only solution and arguing from that premise.