Languages like Swift and Rust have built-in support in their standard libraries for SIMD processing.
Would it be desirable to have this kind of support (for various platforms) also in Python?
I personally miss this at the moment in certain pet projects (compute-intensive, with a high degree of implicit data parallelism to be exploited). I can work around it by providing my own extension code, but it would have been so much easier if the support was already there. (Also, this would help extension-code authors by providing sample code to speed up their own learning.)
So, I wonder if more people would like to see something like this, or if this is pretty niche.
I really need to take a closer look at that then.
(Unfortunately no support yet for Apple Silicon…)
Wouldn't this already be covered by numpy etc.?
If they support SIMD, then you already have the advantage in Python:
https://numpy.org/doc/stable/reference/simd/index.html
Edit: And I can confirm that these are active in the wheels posted to PyPI (I've had obscure bugs pop up on CI when a particular runner supported AVX512 extensions), though I don't know if they will be as well-optimized as a wheel built on your specific machine.
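If you want to check which SIMD extensions your installed numpy build actually detected and enabled, numpy can report this at runtime (a small check; np.show_runtime requires numpy >= 1.24):

# Print which CPU/SIMD features this numpy build found and enabled.
import numpy as np

np.show_runtime()  # output includes a "SIMD Extensions" section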
In numpy this seems to be all under the hood; it's not exposed on the Python layer, or directly controllable at that level.
Are you looking for an interface like simd · PyPI?
I'm curious what your use case would be and what direct control you're interested in. Could you mock up the kind of code you would like to write?
The simd interface could work (though that package doesn't support NEON, and Apple M1/M2 is currently my primary platform). The core code basically looks like this:
xp = vp & zm
d0 = (((x & xp) + xp) ^ xp) | x | vn
hp = vn | ((d0 | vp) ^ mask)
hn = vp & d0
xp = (hp & zm) << 1
xn = (hn & zm) << 1
vp = xn | ((d0 | xp) ^ mask)
vn = xp & d0
where all variables are bit vectors. The bit vectors are on the order of 1M bits and can be broken up into smaller vectors.
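For illustration, a minimal sketch (the helper names are hypothetical) of how such a bit vector, held as one big Python int, could be broken into fixed-width lanes and reassembled:

# Hypothetical helpers: split a large Python-int bit vector into 64-bit
# lanes (each of which could map to a SIMD lane) and join them back.
LANE = 64

def split_bits(value, count, width=LANE):
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(count)]

def join_bits(lanes, width=LANE):
    out = 0
    for i, lane in enumerate(lanes):
        out |= lane << (i * width)
    return out

v = (1 << 1000) | 0b1011              # small demo; real vectors are ~1M bits
lanes = split_bits(v, count=1000 // LANE + 1)
assert join_bits(lanes) == v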
What API would you expect, to allow Python at the high level to use SIMD?
Would that API look like numpy?
I do not see how Python could do what you want.
Both Swift and Rust are strongly typed and compiled, but Python is interpreted at runtime.
Yes, it could look like numpy in the sense that it would define a special vector type that would be mapped to SIMD registers (if present, and otherwise to a normal array). The strong typing or not is not really relevant, I think, since you would have this dedicated vector class in Python. Operators would then be defined on that class, so the basic API is just the vector class itself (with several int or float types for contained items), plus the normal operators. So, I do think this is possible, but it would require core Python support (or a special extension, as in the simd package or numpy).
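To make that concrete, here is a minimal pure-Python sketch of what such a vector class could look like. The name SimdVector and its whole interface are illustrative assumptions, not an existing API; a real implementation would execute the lane-wise operations with SIMD instructions in C:

from array import array

class SimdVector:
    """Hypothetical fixed-width vector of 64-bit unsigned lanes."""

    def __init__(self, lanes):
        self._lanes = array('Q', lanes)   # 'Q' = unsigned 64-bit

    def __and__(self, other):
        return SimdVector(a & b for a, b in zip(self._lanes, other._lanes))

    def __or__(self, other):
        return SimdVector(a | b for a, b in zip(self._lanes, other._lanes))

    def __xor__(self, other):
        return SimdVector(a ^ b for a, b in zip(self._lanes, other._lanes))

    def __lshift__(self, n):
        # Lane-local shift for brevity; a whole-bit-vector shift would
        # also have to carry bits across lane boundaries.
        return SimdVector(((a << n) & 0xFFFFFFFFFFFFFFFF) for a in self._lanes)

x = SimdVector([0b1010] * 4)
y = SimdVector([0b0110] * 4)
print((x & y)._lanes)  # array('Q', [2, 2, 2, 2])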
But I'm actually wondering now if there isn't an inherent contradiction in what I'm looking for, since the main goal of supporting this would be getting maximum performance given suitable data structures. Suppose all the variables in the code snippet that I posted earlier mapped to SIMD registers, and all the logical and arithmetic operations were accessible in Python; I wonder whether the Python function call overhead and the object creation in new assignments would not wipe out most (or all) of the performance boost. If so, then SIMD processing would really only make sense in extension modules (where you can run a block of code without having to go back and forth from and to Python), as done in numpy? Either that, or you'd need a special compiler or JIT compiler (like numba) that could optimize blocks of code (like the snippet I showed), trying to avoid those overhead costs?
So, even if it is possible to add something like this, would it make sense?
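As a rough illustration of that overhead concern, here is a small benchmark sketch using numpy (the numbers are machine-dependent): the same chain of bitwise operations pays a fixed per-call dispatch and temporary-array cost, which dominates for small vectors and amortizes for large ones.

# Time one update step from the snippet above on numpy uint64 arrays.
# Cross-lane carries are ignored; this only illustrates per-call
# overhead, not algorithmic correctness.
import timeit
import numpy as np

def step(vp, vn, x, zm, mask):
    xp = vp & zm
    d0 = (((x & xp) + xp) ^ xp) | x | vn
    hp = vn | ((d0 | vp) ^ mask)
    hn = vp & d0
    return hp, hn

rng = np.random.default_rng(0)
for n in (8, 1 << 14):  # 512 bits vs. ~1M bits, split into 64-bit lanes
    vp, vn, x, zm = (rng.integers(0, 1 << 62, n, dtype=np.uint64) for _ in range(4))
    mask = np.full(n, 2**64 - 1, dtype=np.uint64)
    t = timeit.timeit(lambda: step(vp, vn, x, zm, mask), number=10_000)
    print(f"{n * 64:>9} bits: {t / 10_000 * 1e6:8.2f} µs per step")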
Yeah, I don't think this makes sense in pure Python, simply because Python allows for almost everything to change at almost any point. Without tracing all of the types and operations that are being used in a given piece of code [1], you can't reliably say "this function can be performed with SIMD instructions".
[1] sometimes referred to as "compiling", perhaps "just in time"
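A tiny illustration of that point: the very same operator in the very same function can mean completely different things depending on the runtime types, so nothing about the function itself pins down a SIMD-friendly meaning.

# The same `&` is an integer bitwise AND or a set intersection.
def f(a, b):
    return a & b

print(f(0b1100, 0b1010))        # 8  (int: bitwise AND)
print(f({1, 2, 3}, {2, 3, 4}))  # {2, 3}  (set: intersection)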
Good point, I had not considered this…
In this series of posts the author translates a highly optimized C benchmark program (n-body-???) that uses explicit SIMD, first into ugly, non-idiomatic, and unsafe Rust that does nearly the exact same thing in about the same time. He then modifies it into safer and more Rust-like Rust that runs just as fast. In the very last post, he starts over and (more quickly) writes a prettier, idiomatic, and "safe" Rust equivalent that lets the compiler handle SIMD entirely. The result was substantially faster. It seems that some Rust compiler author(s) successfully embodied expert knowledge of how best to use SIMD into the compiler.
I would assume that the same is true of numpy.
There are some cases numpy does not cover. I'm not sure how useful adding this to the standard library would be, especially given existing libraries. I believe the current best options are Numba (LLVM-based compilation of a subset of valid Python via decorators which transform your code), CuPy (if you specifically want to do parallel work by writing small bits to run in parallel on the GPU), Cython, or a native extension. Given that the example here seems to be focused on direct use of SIMD intrinsics, but mapping them to Python objects, I imagine that Numba or CuPy would suitably provide acceleration for the cases the original author is thinking about if numpy does not. I don't want to rule out that there could be a use in the standard library by saying this, but there's a lot in the third-party ecosystem that may be better positioned to help here.
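For instance, a minimal sketch of the GPU route with CuPy (assuming a CUDA-capable GPU and the cupy package installed); it mirrors the numpy API, so a bitwise chain carries over almost verbatim:

# The same kind of bitwise chain, run elementwise on the GPU via CuPy.
import cupy as cp

a = cp.arange(1024, dtype=cp.uint64)
b = a[::-1].copy()
d = (((a & b) | a) ^ b) | a     # executes on the GPU
print(cp.asnumpy(d[:4]))        # copy a few results back to the host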
Thanks - I will look into numba. It was on my list, but I've never used it before.
Ultimately it will probably be easiest to just write my own extension code.
As long as you're staying inside what they support, I've found numba to be really nice for accelerating numerical functions with very little extra work; see the sketch below.
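For example, a minimal sketch with Numba's @njit decorator (the function itself is illustrative): Numba compiles the loop through LLVM, whose auto-vectorizer can emit SIMD instructions for a loop like this.

# Compile a plain loop over uint64 lanes; LLVM can auto-vectorize it.
import numpy as np
from numba import njit

@njit
def bitwise_chain(a, b):
    out = np.empty_like(a)
    for i in range(a.size):
        x = a[i] & b[i]
        out[i] = ((x | a[i]) ^ b[i]) | a[i]
    return out

a = np.arange(1024, dtype=np.uint64)
b = a[::-1].copy()
print(bitwise_chain(a, b)[:4])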
If you need something more complex and you know Rust, maturin and pyo3 are straight-up magic in terms of how easy it is to write a Python extension.
I think XLA automatically applies SIMD operations, and Jax is a fantastically designed front-end to XLA. It's as easy as:
from jax import jit
import numpy as np

@jit
def f(x, y):
    return x + y

print(f(np.zeros(100), np.ones(100)))
It also supports compiling to GPUs and TPUs.