Standard library support for SIMD

Languages like Swift and Rust have built-in support for SIMD processing in their standard libraries.
Would it be desirable to have this kind of support (for various platforms) also in Python?
I personally miss this at the moment in certain pet projects (compute-intensive, with a high degree of implicit data-parallelism to be exploited). I can work around it by providing my own extension code, but it would be so much easier if the support was already there. (Also, this would help extension code authors by providing sample code to speed up their own learning.)
So, I wonder if more people would like to see something like this, or if this is pretty niche.

1 Like

Mojo is basically statically compiled Python + SIMD primitives.

1 Like

I really need to take a closer look at that then.
(Unfortunately no support yet for Apple Silicon…)

1 Like

This would be useful for numpy etc.?
If they support SIMD, then you have the advantage in Python already?

https://numpy.org/doc/stable/reference/simd/index.html

Edit: And I can confirm that these are active in the wheels posted to PyPI (I've had obscure bugs pop up on CI when a particular runner supported AVX512 extensions), though I don't know if they will be as well-optimized as a wheel built on your specific machine.
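For anyone curious, you can check which SIMD extensions your numpy build actually detected at runtime. A minimal sketch; note that __cpu_features__ is an internal detail, not a stable API, and its module moved from numpy.core to numpy._core in numpy 2.x:

    import numpy as np

    # Recent numpy versions include a "SIMD Extensions" section here,
    # listing the baseline and runtime-dispatched features of the build.
    np.show_config()

    # Internal detail (not a stable API): a dict mapping CPU feature
    # names to booleans, e.g. {"AVX2": True, "AVX512F": False, ...}.
    try:
        from numpy.core._multiarray_umath import __cpu_features__   # numpy 1.x
    except ImportError:
        from numpy._core._multiarray_umath import __cpu_features__  # numpy 2.x

    print(sorted(k for k, v in __cpu_features__.items() if v))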

2 Likes

In numpy this seems to be all under the hood; it's not exposed at the Python layer, or directly controllable at that level.

1 Like

Are you looking for an interface like simd · PyPI?

I'm curious what your use-case would be and what direct control you're interested in. Could you mock up the kind of code you would like to write?

The simd interface could work (though that package doesn't support NEON, and current Apple M1/M2 is my primary platform). The core code basically looks like this:

    xp = vp & zm
    d0 = (((x & xp) + xp) ^ xp) | x | vn
    hp = vn | ((d0 | vp) ^ mask)
    hn = vp & d0
    xp = (hp & zm) << 1
    xn = (hn & zm) << 1
    vp = xn | ((d0 | xp) ^ mask)
    vn = xp & d0

where all variables are bit vectors. The bit vectors are on the order of 1M bits and can be broken up into smaller vectors.
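Worth noting: the snippet runs unchanged on plain Python ints, which already act as arbitrary-width bit vectors (the + carries across the full width, which is exactly what gets lost if you naively chunk the vectors into fixed-size words). A runnable sketch with placeholder inputs; the initial values of x, vp, vn, and zm below are invented for illustration, not taken from the original code:

    W = 1 << 20          # vector width, ~1M bits
    mask = (1 << W) - 1  # all ones; "^ mask" acts as bitwise NOT within W bits
    zm = mask            # placeholder zone mask (all bits enabled)
    x = 0                # placeholder input vector
    vp, vn = mask, 0     # placeholder initial values

    xp = vp & zm
    d0 = (((x & xp) + xp) ^ xp) | x | vn
    hp = vn | ((d0 | vp) ^ mask)
    hn = vp & d0
    xp = (hp & zm) << 1
    xn = (hn & zm) << 1
    vp = (xn | ((d0 | xp) ^ mask)) & mask  # trim back to W bits after the shift
    vn = xp & d0

The catch, of course, is that CPython's big-int operations, while correct, aren't vectorized.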

What API would you expect, to allow high-level Python code to use SIMD?
Would that API look like numpy?

I do not see how Python could do what you want.
Both Swift and Rust are strongly typed and compiled, but Python is interpreted at runtime.

Yes, it could look like numpy in the sense that it would define a special vector type that would be mapped to SIMD registers (if present, and otherwise to a normal array). Whether the language is strongly typed is not really relevant, I think, since you would have this dedicated vector class in Python. Operators would then be defined on that class, so the basic API is just the vector class itself (with several int or float types for the contained items), plus the normal operators. So I do think this is possible, but it would require core Python support (or a special extension, as in the simd package or numpy).
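To make that concrete, a hypothetical sketch (all names invented for illustration) of what such a vector class could look like, with a pure-Python fallback where a real implementation would dispatch to SIMD registers:

    import array

    class SimdVec:
        # Hypothetical fixed-width vector of unsigned 64-bit lanes. A real
        # implementation would back this with SIMD registers where available;
        # this pure-Python fallback just loops over the lanes.

        def __init__(self, lanes):
            self._lanes = array.array("Q", lanes)  # "Q" = unsigned 64-bit

        def __and__(self, other):
            return SimdVec(a & b for a, b in zip(self._lanes, other._lanes))

        def __or__(self, other):
            return SimdVec(a | b for a, b in zip(self._lanes, other._lanes))

        def __xor__(self, other):
            return SimdVec(a ^ b for a, b in zip(self._lanes, other._lanes))

        def __lshift__(self, n):
            # Lane-wise shift: bits do not cross lane boundaries.
            return SimdVec((a << n) & 0xFFFFFFFFFFFFFFFF for a in self._lanes)

    a = SimdVec([0b1100, 0b1010])
    b = SimdVec([0b0110, 0b0011])
    c = (a & b) | (a ^ b)  # operators compose, as in the snippet above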

But I'm actually wondering now if there isn't an inherent contradiction in what I'm looking for. The main goal of supporting this would be getting maximum performance, given suitable data structures. But suppose all the variables in the code snippet that I posted earlier mapped to SIMD registers, and all the logical and arithmetic operations were accessible in Python: I wonder whether the Python function-call overhead, plus the object creation in new assignments, would not wipe out most (or all) of the performance boost. If so, then SIMD processing would really only make sense in extension modules (where you can run a block of code without having to go back and forth to and from Python), as done in numpy. Either that, or you'd need a special compiler or JIT compiler (like numba) that could optimize blocks of code (like the snippet I showed) while avoiding those overhead costs.
So, even if it is possible to add something like this, would it make sense?
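That overhead is easy to see with an illustrative (not rigorous) micro-benchmark: the same bitwise work done as one bulk C-level call versus many small Python-level operations:

    import timeit
    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2**63, size=1_000_000, dtype=np.uint64)
    b = rng.integers(0, 2**63, size=1_000_000, dtype=np.uint64)

    # One call: the elementwise loop runs in C (and can use SIMD).
    t_bulk = timeit.timeit(lambda: a & b, number=100)

    # Many calls: each 64-bit word is a separate Python-level operation,
    # paying dispatch and object-creation costs every time.
    wa = [int(v) for v in a[:10_000]]
    wb = [int(v) for v in b[:10_000]]
    t_words = timeit.timeit(
        lambda: [x & y for x, y in zip(wa, wb)], number=100
    )

    print(f"bulk 1M words: {t_bulk:.4f}s   10k Python-level ops: {t_words:.4f}s")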

1 Like

Yeah, I don't think this makes sense in pure Python, simply because Python allows for almost everything to change at almost any point. Without tracing all of the types and operations that are being used in a given piece of code [1], you can't reliably say "this function can be performed with SIMD instructions" (see the example below).


  1. sometimes referred to as "compiling", perhaps "just in time" :wink: ↩︎
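To illustrate: the same function body can mean entirely different operations depending on what flows in at runtime, so there is no single SIMD lowering to pick ahead of time:

    import numpy as np

    def f(a, b):
        return a & b

    print(f(0b1100, 0b1010))                  # ints: bitwise AND -> 8
    print(f({1, 2, 3}, {2, 3, 4}))            # sets: intersection -> {2, 3}
    print(f(np.array([12]), np.array([10])))  # arrays: elementwise AND -> [8]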

1 Like

Good point, I had not considered this…

In this series of posts the author translates a highly optimized C benchmark program (n-body-???) that uses explicit SIMD first into ugly, non-idiomatic, and unsafe Rust that does nearly the exact same thing in about the same time. He then modifies it into safer and more Rust-like Rust that runs just as fast. In the very last post, he starts over and writes (more quickly) a prettier, idiomatic, and 'safe' Rust equivalent that lets the compiler entirely handle SIMD. The result was substantially faster. It seems that some Rust compiler author(s) successfully embodied expert knowledge of how to best use SIMD into the compiler.

I would assume that the same is true of numpy.

11 Likes

There are some cases numpy does not cover. I'm not sure how useful adding this to the standard library would be, especially given existing libraries. I believe the current best options are Numba (LLVM-based compilation of a subset of valid Python via decorators which transform your code), CuPy (if you want to specifically do parallel work by writing small bits to run in parallel on the GPU), Cython, or a native extension. Given that the example here seems to be focused on direct use of SIMD intrinsics but mapping them to Python objects, I imagine that Numba or CuPy would suitably provide acceleration for the cases the original author is thinking about if numpy does not. I don't want to rule out that there could be a use in the standard library by saying this, but there's a lot in the third-party ecosystem that may be better positioned to help here.
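To make the Numba suggestion concrete, a minimal sketch (the toy function is invented for illustration): decorating a plain loop compiles it with LLVM on first call, and LLVM's auto-vectorizer can turn loops like this into SIMD instructions when the types are simple and regular:

    import numpy as np
    from numba import njit

    @njit  # compiled to machine code on first call, per input type signature
    def saxpy(alpha, x, y):
        # A classic elementwise loop that the auto-vectorizer handles well.
        out = np.empty_like(x)
        for i in range(x.size):
            out[i] = alpha * x[i] + y[i]
        return out

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    z = saxpy(2.0, x, y)  # first call pays compilation cost; later calls are fast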

2 Likes

Thanks - I will look into numba. It was on my list, but I've never used it before.
Ultimately it will probably be easiest to just write my own extension code.

As long as you're staying inside what they support, I've found numba to be really nice for accelerating numerical functions with very little extra work.

If you need something more complex and you know Rust, maturin and pyo3 are straight-up magic in terms of how easy it is to write a Python extension.

3 Likes

I think XLA automatically applies SIMD operations, and Jax is a fantastically designed front-end to XLA. It's as easy as:

from jax import jit
import numpy as np

@jit  # traces f on first call and compiles it with XLA
def f(x, y):
    return x + y

print(f(np.zeros(100), np.ones(100)))

It also supports compiling to GPUs and TPUs.