Standard library support for SIMD

Languages like Swift and Rust have built-in support for SIMD processing in their standard libraries.
Would it be desirable to have this kind of support (for various platforms) also in Python?
I personally miss this at the moment in certain pet projects (compute-intensive, with a high degree of implicit data-parallelism to be exploited). I can work around it by providing my own extension code, but it would be so much easier if the support was already there. (Also, this would help extension code authors by providing sample code to speed up their own learning.)
So, I wonder if more people would like to see something like this, or if this is pretty niche.

1 Like

Mojo is basically statically compiled Python + SIMD primitives.

1 Like

I really need to take a closer look at that then.
(Unfortunately no support yet for Apple Silicon…)

1 Like

This would be useful for numpy etc.?
If they support SIMD, then you have the advantage in Python already?

https://numpy.org/doc/stable/reference/simd/index.html

Edit: And I can confirm that these are active in the wheels posted to PyPI (I've had obscure bugs pop up on CI when a particular runner supported AVX512 extensions), though I don't know if they will be as well-optimized as a wheel built on your specific machine.
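For anyone curious, you can check which SIMD extensions your numpy build actually detected at runtime. A minimal sketch; note that __cpu_features__ is an internal detail, not a stable API, and its module moved from numpy.core to numpy._core in numpy 2.x:

    import numpy as np

    # Recent numpy versions include a "SIMD Extensions" section here,
    # listing the baseline and runtime-dispatched features of the build.
    np.show_config()

    # Internal detail (not a stable API): a dict mapping CPU feature
    # names to booleans, e.g. {"AVX2": True, "AVX512F": False, ...}.
    try:
        from numpy.core._multiarray_umath import __cpu_features__   # numpy 1.x
    except ImportError:
        from numpy._core._multiarray_umath import __cpu_features__  # numpy 2.x

    print(sorted(k for k, v in __cpu_features__.items() if v))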

2 Likes

In numpy this seems to be all under the hood; it's not exposed at the Python layer, or directly controllable at that level.

1 Like

Are you looking for an interface like simd · PyPI?

I'm curious what your use-case would be and what direct control you're interested in. Could you mock up the kind of code you would like to write?

The simd interface could work (though that package doesn't support NEON, and current Apple M1/M2 is my primary platform). The core code basically looks like this:

    xp = vp & zm
    d0 = (((x & xp) + xp) ^ xp) | x | vn
    hp = vn | ((d0 | vp) ^ mask)
    hn = vp & d0
    xp = (hp & zm) << 1
    xn = (hn & zm) << 1
    vp = xn | ((d0 | xp) ^ mask)
    vn = xp & d0

where all variables are bit vectors. The bit vectors are on the order of 1M bits and can be broken up into smaller vectors.
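Worth noting: the snippet runs unchanged on plain Python ints, which already act as arbitrary-width bit vectors (the + carries across the full width, which is exactly what gets lost if you naively chunk the vectors into fixed-size words). A runnable sketch with placeholder inputs; the initial values of x, vp, vn, and zm below are invented for illustration, not taken from the original code:

    W = 1 << 20          # vector width, ~1M bits
    mask = (1 << W) - 1  # all ones; "^ mask" acts as bitwise NOT within W bits
    zm = mask            # placeholder zone mask (all bits enabled)
    x = 0                # placeholder input vector
    vp, vn = mask, 0     # placeholder initial values

    xp = vp & zm
    d0 = (((x & xp) + xp) ^ xp) | x | vn
    hp = vn | ((d0 | vp) ^ mask)
    hn = vp & d0
    xp = (hp & zm) << 1
    xn = (hn & zm) << 1
    vp = (xn | ((d0 | xp) ^ mask)) & mask  # trim back to W bits after the shift
    vn = xp & d0

The catch, of course, is that CPython's big-int operations, while correct, aren't vectorized.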

What API would you expect, to allow high-level Python code to use SIMD?
Would that API look like numpy?

I do not see how Python could do what you want.
Both Swift and Rust are strongly typed and compiled, but Python is interpreted at runtime.

Yes, it could look like numpy in the sense that it would define a special vector type that would be mapped to SIMD registers (if present, and otherwise to a normal array). Whether the language is strongly typed is not really relevant, I think, since you would have this dedicated vector class in Python. Operators would then be defined on that class, so the basic API is just the vector class itself (with several int or float types for the contained items), plus the normal operators. So I do think this is possible, but it would require core Python support (or a special extension, as in the simd package or numpy).
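To make that concrete, a hypothetical sketch (all names invented for illustration) of what such a vector class could look like, with a pure-Python fallback where a real implementation would dispatch to SIMD registers:

    import array

    class SimdVec:
        # Hypothetical fixed-width vector of unsigned 64-bit lanes. A real
        # implementation would back this with SIMD registers where available;
        # this pure-Python fallback just loops over the lanes.

        def __init__(self, lanes):
            self._lanes = array.array("Q", lanes)  # "Q" = unsigned 64-bit

        def __and__(self, other):
            return SimdVec(a & b for a, b in zip(self._lanes, other._lanes))

        def __or__(self, other):
            return SimdVec(a | b for a, b in zip(self._lanes, other._lanes))

        def __xor__(self, other):
            return SimdVec(a ^ b for a, b in zip(self._lanes, other._lanes))

        def __lshift__(self, n):
            # Lane-wise shift: bits do not cross lane boundaries.
            return SimdVec((a << n) & 0xFFFFFFFFFFFFFFFF for a in self._lanes)

    a = SimdVec([0b1100, 0b1010])
    b = SimdVec([0b0110, 0b0011])
    c = (a & b) | (a ^ b)  # operators compose, as in the snippet above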

But I'm actually wondering now if there isn't an inherent contradiction in what I'm looking for. The main goal of supporting this would be getting maximum performance, given suitable data structures. But suppose all the variables in the code snippet that I posted earlier mapped to SIMD registers, and all the logical and arithmetic operations were accessible in Python: I wonder whether the Python function-call overhead, plus the object creation in new assignments, would not wipe out most (or all) of the performance boost. If so, then SIMD processing would really only make sense in extension modules (where you can run a block of code without having to go back and forth to and from Python), as done in numpy. Either that, or you'd need a special compiler or JIT compiler (like numba) that could optimize blocks of code (like the snippet I showed) while avoiding those overhead costs.
So, even if it is possible to add something like this, would it make sense?
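That overhead is easy to see with an illustrative (not rigorous) micro-benchmark: the same bitwise work done as one bulk C-level call versus many small Python-level operations:

    import timeit
    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2**63, size=1_000_000, dtype=np.uint64)
    b = rng.integers(0, 2**63, size=1_000_000, dtype=np.uint64)

    # One call: the elementwise loop runs in C (and can use SIMD).
    t_bulk = timeit.timeit(lambda: a & b, number=100)

    # Many calls: each 64-bit word is a separate Python-level operation,
    # paying dispatch and object-creation costs every time.
    wa = [int(v) for v in a[:10_000]]
    wb = [int(v) for v in b[:10_000]]
    t_words = timeit.timeit(
        lambda: [x & y for x, y in zip(wa, wb)], number=100
    )

    print(f"bulk 1M words: {t_bulk:.4f}s   10k Python-level ops: {t_words:.4f}s")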

1 Like

Yeah, I don't think this makes sense in pure Python, simply because Python allows for almost everything to change at almost any point. Without tracing all of the types and operations that are being used in a given piece of code [1], you can't reliably say "this function can be performed with SIMD instructions" (see the example below).


  1. sometimes referred to as "compiling", perhaps "just in time" :wink: ↩︎
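To illustrate: the same function body can mean entirely different operations depending on what flows in at runtime, so there is no single SIMD lowering to pick ahead of time:

    import numpy as np

    def f(a, b):
        return a & b

    print(f(0b1100, 0b1010))                  # ints: bitwise AND -> 8
    print(f({1, 2, 3}, {2, 3, 4}))            # sets: intersection -> {2, 3}
    print(f(np.array([12]), np.array([10])))  # arrays: elementwise AND -> [8]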

1 Like

Good point, I had not considered this…

In this series of posts the author translates a highly optimized C benchmark program (n-body-???) that uses explicit SIMD first into ugly, non-idiomatic, and unsafe Rust that does nearly the exact same thing in about the same time. He then modifies it into safer and more Rust-like Rust that runs just as fast. In the very last post, he starts over and writes (more quickly) a prettier, idiomatic, and 'safe' Rust equivalent that lets the compiler entirely handle SIMD. The result was substantially faster. It seems that some Rust compiler author(s) successfully embodied expert knowledge of how to best use SIMD into the compiler.

I would assume that the same is true of numpy.

11 Likes

There are some cases numpy does not cover. I'm not sure how useful adding this to the standard library would be, especially given existing libraries. I believe the current best options are Numba (LLVM-based compilation of a subset of valid Python via decorators which transform your code), CuPy (if you want to specifically do parallel work by writing small bits to run in parallel on the GPU), Cython, or a native extension. Given that the example here seems to be focused on direct use of SIMD intrinsics but mapping them to Python objects, I imagine that Numba or CuPy would suitably provide acceleration for the cases the original author is thinking about if numpy does not. I don't want to rule out that there could be a use in the standard library by saying this, but there's a lot in the third-party ecosystem that may be better positioned to help here.
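To make the Numba suggestion concrete, a minimal sketch (the toy function is invented for illustration): decorating a plain loop compiles it with LLVM on first call, and LLVM's auto-vectorizer can turn loops like this into SIMD instructions when the types are simple and regular:

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code on first call, per input type signature
    def saxpy(alpha, x, y):
        # A classic elementwise loop that the auto-vectorizer handles well.
        out = np.empty_like(x)
        for i in range(x.size):
            out[i] = alpha * x[i] + y[i]
        return out

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = saxpy(2.0, x, y)  # first call pays compilation cost; later calls are fast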

2 Likes

Thanks - I will look into numba. It was on my list, but I've never used it before.
Ultimately it will probably be easiest to just write my own extension code.

As long as you're staying inside what they support, I've found numba to be really nice for accelerating numerical functions with very little extra work.

If you need something more complex and you know Rust, maturin and pyo3 are straight-up magic in terms of how easy it is to write a Python extension.

3 Likes

I think XLA automatically applies SIMD operations, and Jax is a fantastically designed front-end to XLA. It's as easy as:

from jax import jit
import numpy as np

@jit  # traces f on first call and compiles it with XLA
def f(x, y):
    return x + y

print(f(np.zeros(100), np.ones(100)))

It also supports compiling to GPUs and TPUs.