Supporting SIMD on AArch64 Linux

Hey folks, I think it’s time we seriously consider adding SIMD support for AArch64 in our libraries. Here’s why this makes sense now.

ARM servers are everywhere these days

Look around and you’ll see ARM servers popping up everywhere. AWS has their Graviton chips, Azure is running Ampere processors, and honestly, the ARM server ecosystem has gotten pretty solid. It’s not some experimental thing anymore - companies are running real production workloads on these machines because they often get better bang for their buck.

The writing’s on the wall: ARM isn’t going anywhere, and we’re probably going to see even more of it.

We’ve already done the hard work for x86

Here’s the thing - we already know how to do SIMD optimization. Libraries like blake2 in Python already use AVX2 and other x86 SIMD instructions to speed things up significantly.

The cool part is we don’t have to start from scratch. There are tools like SIMD Everywhere (SIMDe) that basically let you write SIMD code once and run it on both x86 and ARM. It translates x86 intrinsics to ARM NEON instructions, so you can take existing optimized code and get it working on ARM without rewriting everything.

Plus, we already have some experimental support for AArch64 on macOS, which means we’ve got a head start on the ARM SIMD work. Getting this working on Linux should be much easier since we can build on that existing foundation and experience.

This makes prototyping and testing way easier than you’d expect.

ARM’s SIMD story keeps getting better

ARM’s SIMD capabilities used to be pretty basic with just NEON, but that’s changed a lot. The newer ARM architectures, especially ARMv9.2-A, now support 512-bit wide SIMD instructions. That’s the same width as some of the latest x86 stuff.

So we’re not talking about settling for worse performance on ARM - in many cases, we can match or even beat x86 performance with the right optimizations.

Which modules?

For now, I don’t think any API change is needed; the details are all under the hood.

I think we just need to support SIMD on aarch64 Linux for two modules:

  1. hamc
  2. blake2

Those two modules are very commonly used on Linux servers.

And those two modules already build on aarch64 Linux, so I think it will be easier to add SIMD support for them there.

cc @diegor


What exactly are you proposing? What concrete APIs/implementation changes/??? do you want implemented?

For now, I don’t think any API change is needed; the details are all under the hood.

I think we just need to support SIMD on aarch64 Linux for two modules:

  1. hamc
  2. blake2

Those two modules are very commonly used on Linux servers.

Ok, so your suggestion is to add specializations to these two modules? Have you tried just contributing these things directly?

Hi, I’m maintaining those modules (I wrote the HMAC module and it’s _hmac and not hamc).

Those modules are not very common; _hmac is only used if OpenSSL is not present and I think most Linux servers build Python with OpenSSL. The _blake2 module is present everywhere, though, but BLAKE-2 is not a very common hash function in general (usually, we rely on SHA when we use a Python script). If this is not the case, I’d be interested to see some evidence.

In addition, they use HACL* code, so this entirely depends on the HACL* part and we have this comment:

SIMD256 can’t be compiled on macOS ARM64, and performance of SIMD128 isn’t great.

So, Linux+ARM64 should be already using SIMD.

Thanks for the reply.

Those modules are not very common; _hmac is only used if OpenSSL is not present and I think most Linux servers build Python with OpenSSL.

Oops, I missed some details here. Thanks for the reminder.

The _blake2 module is present everywhere, though, but BLAKE-2 is not a very common hash function in general (usually, we rely on SHA when we use a Python script). If this is not the case, I’d be interested to see some evidence.

In GitHub code search, there are almost 4k results for “from hmac import” + “import hmac”, and 3k results for “from hashlib import blake2b”. I think the blake2 module is more popular than we think.

FYI Code search results · GitHub
and Code search results · GitHub

Yes, I’m not sure about this HACL* part. Thanks for the reminder.

  • 3k results is not much though. I don’t think Linux/aarch64 has an issue. There are clear reasons why we disabled SIMD on macOS/aarch64.
  • import hmac imports the public module, which is not the same thing; hmac delegates to OpenSSL or HACL* HMAC.

Thanks for creating the post and raising the issue. In the past there were discussions around this topic (Provide detection for SIMD features in autoconf and at runtime · Issue #125022 · python/cpython · GitHub)
It’s on my TODO list to look at this a little bit closer and come up with a tangible proposal to enable this. Things I have in mind are not just enabling modules in the standard library, but also looking at the JIT and providing a public API to see which SIMD features are present at runtime.
It is still very fuzzy to me, I hope to spend some time on it after EuroPython 2025.
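As a very rough sketch of what such a runtime API could look like (the name `simd_features` is hypothetical, and a real implementation would query CPUID on x86 and HWCAP on Linux/aarch64 rather than the machine string):

```python
import platform

def simd_features():
    """Hypothetical sketch of a runtime SIMD-feature query.

    A real implementation would read CPUID leaves on x86 and
    getauxval(AT_HWCAP) on Linux/aarch64; here we only map the
    machine type to the baseline extension it guarantees.
    """
    machine = platform.machine().lower()
    if machine in ("x86_64", "amd64"):
        return {"sse2"}   # baseline for every x86-64 CPU
    if machine in ("aarch64", "arm64"):
        return {"neon"}   # baseline for every Armv8-A CPU
    return set()

print(sorted(simd_features()))
```

Anything beyond the baseline (AVX2, SVE, …) genuinely requires runtime detection, which is exactly why a public API would be useful.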

So, Linux+ARM64 should be already using SIMD.

I don’t think Linux/aarch64 has an issue. There are clear reasons why we disabled SIMD on macOS/aarch64.

I have run a test for blake2s:

  1. We don’t enable SIMD128 for aarch64 by default now.
  2. I made a patch to force-enable the SIMD128 code path on Linux. The result is the same as on macOS: SIMD128 performance is not good, even worse than the scalar code.
  3. I will open a PR to document the result.
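For anyone who wants to reproduce the measurement, a minimal stdlib-only benchmark looks something like this (the buffer size and iteration count here are arbitrary choices, not the ones I used):

```python
import hashlib
import timeit

# Hash a 1 MiB buffer repeatedly; compare the numbers between a
# default build and one with SIMD128 force-enabled.
data = b"\x00" * (1 << 20)

for name in ("blake2s", "blake2b", "sha256"):
    h = getattr(hashlib, name)
    t = timeit.timeit(lambda h=h: h(data).digest(), number=100)
    print(f"{name}: {t:.3f}s for 100 x 1 MiB")
```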

We don’t enable SIMD128 for aarch64 by default now.

Where do you see this? It entirely depends on whether the compiler supports the flags, and on CPUID detection as well.

EDIT: in the configure script, we only disable it on macOS + aarch64. But otherwise, I don’t think the configure checks do anything specific for aarch64 on Linux?

Please correct me if I’m wrong.

I have checked the PR here. If the compiler does not support -msse -msse2 -msse3 -msse4.1 -msse4.2, then we don’t define HACL_CAN_COMPILE_SIMD128 here.

But on Linux, SIMD128 is provided by the argument -march=armv8-a+simd. So the config.status on Linux will look like:

S["LIBHACL_BLAKE2_SIMD256_OBJS"]=""
S["LIBHACL_SIMD256_FLAGS"]=""
S["LIBHACL_BLAKE2_SIMD128_OBJS"]=""
S["LIBHACL_SIMD128_FLAGS"]=""

On x86_64, it looks like:

S["LIBHACL_BLAKE2_SIMD256_OBJS"]="Modules/_hacl/Hacl_Hash_Blake2b_Simd256.o"
S["LIBHACL_SIMD256_FLAGS"]="-mavx2"
S["LIBHACL_BLAKE2_SIMD128_OBJS"]="Modules/_hacl/Hacl_Hash_Blake2s_Simd128.o"
S["LIBHACL_SIMD128_FLAGS"]="-msse -msse2 -msse3 -msse4.1 -msse4.2"

I don’t understand. Do you mean on Linux/aarch64? There are x86_64-linux-gnu and aarch64-linux-gnu, and those are different. The SIMD flags are conditioned on the following:

if test "$ac_sys_system" != "Linux-android" -a "$ac_sys_system" != "WASI" || \
   { test -n "$ANDROID_API_LEVEL" && test "$ANDROID_API_LEVEL" -ge 28; }
then
  AX_CHECK_COMPILE_FLAG([-mavx2],[
    [LIBHACL_SIMD256_FLAGS="-mavx2"]
    AC_DEFINE([HACL_CAN_COMPILE_SIMD256], [1], [HACL* library can compile SIMD256 implementations])

I don’t understand where -march=armv8-a+simd is coming from.

Yes, I mean linux/aarch64 and the aarch64-linux-gnu toolchain. I think we don’t check the target; we only check whether the compiler we use supports the -mavx2 or -msse -msse2 -msse3 -msse4.1 -msse4.2 arguments, right?

If this is right, aarch64-linux-gnu does not support -mavx2 or -msse -msse2 -msse3 -msse4.1 -msse4.2, so we don’t set HACL_CAN_COMPILE_SIMD128 and HACL_CAN_COMPILE_SIMD256.

Yes, and if the compiler itself doesn’t support these flags, then we don’t enable it. But that depends on your compiler. Have you tried checking whether your compiler supports those flags? Maybe that’s the issue.

Yes, this is the core point. aarch64-linux-gnu does not support -msse -msse2 -msse3 -msse4.1 -msse4.2 or -mavx2.

The SIMD on aarch64 is named Neon, or SME (after Armv9.2-A, for 512-bit). If we want to use SIMD on aarch64, the compiler argument will be something like -march=armv8-a+simd, not -msse -msse2 -msse3 -msse4.1 -msse4.2 or -mavx2.

FYI hacl-star/dist/configure at main · hacl-star/hacl-star · GitHub

Ah, I see. The flags for aarch64 and x86-64 are actually not the same. So we should add another kind of verification for aarch64 on Linux, because even if -mavx2 is supported by the compiler, it isn’t helpful for aarch64.

However, for that we should first check if HACL* has a correct aarch64 SIMD support.

They already support SIMD128 for aarch64.

Hello, there seems to be some confusion about SIMD technologies on Arm vs x86, let me give some clarification.

First, SSE and AVX are x86-specific SIMD instruction sets and they are not available on Arm CPUs. This means that any code that uses SSE/AVX intrinsics won’t run natively on Arm hardware.
While libraries like SIMDe can translate those intrinsics to Neon equivalents, this is a compatibility layer and it doesn’t mean that Arm supports SSE or AVX.

On the Arm architecture side, the SIMD landscape is quite different:

  • Neon: Introduced in Armv7-A and present in all Armv8-A cores onwards. It’s a fixed-width (128-bit) SIMD extension commonly used for multimedia, signal processing, and general-purpose vectorisation. More info: Neon
  • SVE (Scalable Vector Extension): Introduced in Armv8.2-A, SVE moves away from fixed vector sizes. Instead, it supports scalable vector lengths from 128 to 2048 bits. Code written with SVE intrinsics is vector-length agnostic, allowing it to adapt to different hardware configurations without recompilation.
  • SVE2: Added in Armv9-A, SVE2 builds on SVE and adds instructions that expand its use beyond HPC into DSP and more general-purpose workloads. Like SVE, it’s scalable and vector-length agnostic. More info: SVE
  • SME (Scalable Matrix Extension): Also in Armv9-A, SME introduces architectural support for matrix operations, with new registers and modes for efficient tile-based computation. It’s highly relevant for AI and ML workloads. More info: SME

These Arm SIMD extensions are accessed through C/C++ intrinsics (e.g., <arm_neon.h>, <arm_sve.h>) and require compiler support; Neon support is more mature than SVE’s.
If we need to write portable SIMD code across architectures, libraries like SIMDe and xsimd can help abstract over the differences.

Regarding Python, I still need to make up my mind about what makes sense to implement and how. We could start with the existing modules where Intel SIMD intrinsics are used and make the code portable to Neon, as it is the most widespread Arm SIMD instruction set.
As a subsequent step we could start introducing SVE. I still need to figure out whether it makes sense to have SME as well; I’ll defer that decision until later.
Also, I’d like to explore this topic around the JIT as well.

As I said in my previous post, I might spend some time on it later in the summer.
