Using Clang-CL for Better Performance on Windows on ARM

I am writing to discuss the potential of using clang-cl as the default compiler for Windows on ARM (WOA) devices to ensure consistent performance across platforms. Given clang's widespread adoption on Apple's ARM-based devices, I believe clang-cl is highly optimized for ARM64 and can significantly enhance performance on WOA devices.

My goal is to ensure that every end user experiences the same performance on WOA devices as they do on other platform devices.

In previous discussions, the primary concern raised was ABI compatibility. I have conducted a few experiments with clang-cl-compiled Python and have not encountered any compatibility issues with MSVC-compiled binaries. While there were issues in the past, the LLVM team has made considerable progress in addressing them. I am specifically focused on Windows on ARM, as it is fundamentally different from Windows on x64. Currently, MSVC for ARM64 does not come close to matching the performance of clang-cl.

I would greatly appreciate it if anyone could provide specific examples or insights into where clang-cl might be causing trouble in terms of ABI compatibility or other issues.


Do you have benchmarks showing that performance is better with clang-cl, or is this something that still needs to be done?


As this is clearly targeted at convincing me, I’ll just say that the absence of specific examples is not convincing (in either direction).

I’ve listed a number of suitable approaches that could do this for performance critical parts of CPython, and also pointed out that your benchmarks have been significantly outdated. I’ve also given things you could do that might produce convincing outcomes - none of those were “post on Discourse”.

The easiest way to prove it will be to release clang-cl binaries and watch what happens. I don’t have the bandwidth to provide support to people who suffer breakage, which is why I’m not considering doing it with the upstream releases. But if you want to make your own binaries and promote them as higher performance (and take the bug reports yourself, only passing them upstream if they reproduce with the standard compiler), you’re more than welcome. If your build becomes more popular, it’ll be pretty convincing proof (and there’s a number of popularity metrics - e.g. it might be popular with package developers who receive fewer complaints about broken builds from their users, or it might be popular with “news” websites - one of those is better than the other).

Good luck.

Prior context: python/cpython#134524, from the same author.


I have collected pyperformance data on Python 3.14.0b2. This is a comparison between the official beta release and the repository compiled with clang-cl (with PGO and computed gotos enabled). The clang-cl version I used is 20.1.4.

Platform: Snapdragon X-Elite

Benchmarks with tag ‘apps’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| 2to3 | 228 ms | 166 ms: 1.37x faster |
| docutils | 1.60 sec | 1.32 sec: 1.21x faster |
| html5lib | 38.3 ms | 27.9 ms: 1.37x faster |
| Geometric mean | (ref) | 1.32x faster |

Benchmarks with tag ‘math’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| float | 66.8 ms | 40.3 ms: 1.66x faster |
| nbody | 96.6 ms | 52.8 ms: 1.83x faster |
| pidigits | 172 ms | 211 ms: 1.22x slower |
| Geometric mean | (ref) | 1.35x faster |

Benchmarks with tag ‘regex’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| regex_compile | 89.6 ms | 55.0 ms: 1.63x faster |
| regex_dna | 122 ms | 130 ms: 1.06x slower |
| regex_effbot | 2.19 ms | 2.22 ms: 1.02x slower |
| regex_v8 | 17.0 ms | 15.1 ms: 1.13x faster |
| Geometric mean | (ref) | 1.14x faster |

Benchmarks with tag ‘serialize’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| json_dumps | 6.52 ms | 4.61 ms: 1.41x faster |
| json_loads | 16.9 us | 13.3 us: 1.27x faster |
| pickle | 8.08 us | 5.95 us: 1.36x faster |
| pickle_dict | 21.0 us | 13.3 us: 1.58x faster |
| pickle_list | 3.38 us | 2.30 us: 1.47x faster |
| pickle_pure_python | 250 us | 162 us: 1.54x faster |
| tomli_loads | 1.74 sec | 993 ms: 1.76x faster |
| unpickle | 8.83 us | 7.14 us: 1.24x faster |
| unpickle_list | 3.28 us | 2.45 us: 1.34x faster |
| unpickle_pure_python | 192 us | 112 us: 1.71x faster |
| xml_etree_iterparse | 71.5 ms | 65.6 ms: 1.09x faster |
| xml_etree_generate | 63.5 ms | 44.3 ms: 1.43x faster |
| xml_etree_process | 46.6 ms | 32.5 ms: 1.43x faster |
| Geometric mean | (ref) | 1.39x faster |

Benchmark hidden because not significant (1): xml_etree_parse

Benchmarks with tag ‘startup’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| python_startup | 24.2 ms | 19.7 ms: 1.23x faster |
| python_startup_no_site | 20.2 ms | 15.7 ms: 1.29x faster |
| Geometric mean | (ref) | 1.26x faster |

Benchmarks with tag ‘template’:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| genshi_text | 18.2 ms | 11.4 ms: 1.59x faster |
| genshi_xml | 37.9 ms | 24.7 ms: 1.53x faster |
| mako | 9.23 ms | 6.90 ms: 1.34x faster |
| Geometric mean | (ref) | 1.48x faster |

All benchmarks:

| Benchmark | py314b2_release | py314b2_clang |
|---|---|---|
| 2to3 | 228 ms | 166 ms: 1.37x faster |
| async_generators | 275 ms | 196 ms: 1.41x faster |
| asyncio_tcp | 398 ms | 357 ms: 1.12x faster |
| asyncio_tcp_ssl | 13.3 sec | 12.9 sec: 1.03x faster |
| chaos | 47.0 ms | 31.2 ms: 1.51x faster |
| comprehensions | 14.0 us | 8.40 us: 1.66x faster |
| bench_mp_pool | 81.8 ms | 62.1 ms: 1.32x faster |
| bench_thread_pool | 1.09 ms | 870 us: 1.25x faster |
| coroutines | 22.3 ms | 12.1 ms: 1.85x faster |
| coverage | 208 ms | 44.5 ms: 4.69x faster |
| crypto_pyaes | 58.0 ms | 42.2 ms: 1.38x faster |
| deepcopy | 185 us | 118 us: 1.57x faster |
| deepcopy_reduce | 1.91 us | 1.27 us: 1.51x faster |
| deepcopy_memo | 25.2 us | 14.4 us: 1.75x faster |
| deltablue | 3.21 ms | 1.75 ms: 1.84x faster |
| docutils | 1.60 sec | 1.32 sec: 1.21x faster |
| dulwich_log | 35.2 ms | 27.3 ms: 1.29x faster |
| fannkuch | 321 ms | 200 ms: 1.60x faster |
| float | 66.8 ms | 40.3 ms: 1.66x faster |
| create_gc_cycles | 909 us | 953 us: 1.05x slower |
| gc_traversal | 2.36 ms | 2.69 ms: 1.14x slower |
| generators | 28.7 ms | 18.8 ms: 1.52x faster |
| genshi_text | 18.2 ms | 11.4 ms: 1.59x faster |
| genshi_xml | 37.9 ms | 24.7 ms: 1.53x faster |
| go | 110 ms | 61.8 ms: 1.77x faster |
| hexiom | 5.69 ms | 3.11 ms: 1.83x faster |
| html5lib | 38.3 ms | 27.9 ms: 1.37x faster |
| json_dumps | 6.52 ms | 4.61 ms: 1.41x faster |
| json_loads | 16.9 us | 13.3 us: 1.27x faster |
| logging_format | 8.56 us | 5.71 us: 1.50x faster |
| logging_silent | 341 ns | 237 ns: 1.44x faster |
| logging_simple | 7.73 us | 5.28 us: 1.46x faster |
| mako | 9.23 ms | 6.90 ms: 1.34x faster |
| mdp | 910 ms | 622 ms: 1.46x faster |
| meteor_contest | 74.4 ms | 56.3 ms: 1.32x faster |
| nbody | 96.6 ms | 52.8 ms: 1.83x faster |
| nqueens | 67.5 ms | 44.6 ms: 1.51x faster |
| pathlib | 23.1 ms | 20.3 ms: 1.14x faster |
| pickle | 8.08 us | 5.95 us: 1.36x faster |
| pickle_dict | 21.0 us | 13.3 us: 1.58x faster |
| pickle_list | 3.38 us | 2.30 us: 1.47x faster |
| pickle_pure_python | 250 us | 162 us: 1.54x faster |
| pidigits | 172 ms | 211 ms: 1.22x slower |
| pprint_safe_repr | 573 ms | 343 ms: 1.67x faster |
| pprint_pformat | 1.15 sec | 711 ms: 1.63x faster |
| pyflate | 414 ms | 286 ms: 1.45x faster |
| python_startup | 24.2 ms | 19.7 ms: 1.23x faster |
| python_startup_no_site | 20.2 ms | 15.7 ms: 1.29x faster |
| raytrace | 231 ms | 129 ms: 1.79x faster |
| regex_compile | 89.6 ms | 55.0 ms: 1.63x faster |
| regex_dna | 122 ms | 130 ms: 1.06x slower |
| regex_effbot | 2.19 ms | 2.22 ms: 1.02x slower |
| regex_v8 | 17.0 ms | 15.1 ms: 1.13x faster |
| richards | 41.6 ms | 24.5 ms: 1.70x faster |
| richards_super | 46.7 ms | 27.6 ms: 1.70x faster |
| scimark_fft | 215 ms | 159 ms: 1.35x faster |
| scimark_lu | 93.3 ms | 57.7 ms: 1.62x faster |
| scimark_monte_carlo | 54.4 ms | 34.8 ms: 1.56x faster |
| scimark_sor | 107 ms | 61.6 ms: 1.73x faster |
| scimark_sparse_mat_mult | 3.19 ms | 2.73 ms: 1.17x faster |
| spectral_norm | 81.3 ms | 54.6 ms: 1.49x faster |
| sqlglot_normalize | 202 ms | 139 ms: 1.45x faster |
| sqlglot_optimize | 36.5 ms | 26.2 ms: 1.40x faster |
| sqlglot_parse | 1.01 ms | 620 us: 1.64x faster |
| sqlglot_transpile | 1.19 ms | 758 us: 1.57x faster |
| sqlite_synth | 1.83 us | 1.72 us: 1.06x faster |
| telco | 4.56 ms | 3.55 ms: 1.28x faster |
| tomli_loads | 1.74 sec | 993 ms: 1.76x faster |
| typing_runtime_protocols | 114 us | 78.9 us: 1.44x faster |
| unpack_sequence | 58.5 ns | 42.3 ns: 1.39x faster |
| unpickle | 8.83 us | 7.14 us: 1.24x faster |
| unpickle_list | 3.28 us | 2.45 us: 1.34x faster |
| unpickle_pure_python | 192 us | 112 us: 1.71x faster |
| xml_etree_iterparse | 71.5 ms | 65.6 ms: 1.09x faster |
| xml_etree_generate | 63.5 ms | 44.3 ms: 1.43x faster |
| xml_etree_process | 46.6 ms | 32.5 ms: 1.43x faster |
| Geometric mean | (ref) | 1.43x faster |


I’d like to reiterate here that I find the benchmarking results a little suspect. The OP says:

> It’s clear that x64 performance has improved with each release, while ARM64 performance has been inconsistent, with a noticeable regression in the latest version.

However, they benchmarked minor versions of 3.11 and 3.12. We did not do any major performance work between CPython 3.11 and 3.12. Any difference they’re seeing is likely down to noise. In fact, some benchmarks regressed slightly due to immortal objects.


My apologies if I am bothering the community regarding this topic once again, but WOA platforms are suffering significantly due to the immature software ecosystem available for Windows ARM64. My intention is not to move away from MSVC, but rather to ensure that mature and optimized binaries are available for WOA devices. One reason for advocating clang is its high efficiency on ARM64 SoCs; a great example is its widespread use on Mac ARM64 chips.

Although we are actively exploring it, our primary concern remains the Python interpreter. Unfortunately, we haven’t seen promising results so far. MSVC for ARM64 has not been able to effectively optimize the interpreter loop, which is why we’ve reverted to using clang-cl for better performance.

We’ve tried encouraging our target audience to use our locally compiled binaries for performance gains, but that hasn’t helped much unless the Python community officially accepts them. After all, the end goal is to improve the experience for everyday users; as things stand, the benchmark results have contributed to poor performance ratings for these devices.

I personally don’t believe that adopting clang-cl as the default for WOA would immediately trigger a flood of issue reports. Python on WOA still lacks a broad ecosystem of extensions compared to Windows on x64. For instance, numpy, one of the most widely used libraries, only released its first WOA version a few weeks ago. I am also seeing discussions on GitHub about major Python libraries (numpy, scipy, openblas) considering whether to adopt clang-cl for their Windows on ARM64 release binaries to enhance performance.

Also, I believe it’s worth considering the use of clang-cl for Windows on ARM64 beta releases, to assess whether it poses real risks or is a promising idea to adopt.


I can’t speak for the differences between ARM and x64 in MSVC, but we have known for some time that clang-cl can produce faster binaries than MSVC on x64. Scroll down to the “Effect of tail calling interpreter” graph in the faster-cpython/benchmarking-public repository on GitHub. The main reason, I suspect, is the lack of computed gotos in MSVC.

The problem is that any switch in building Python has to occur no later than the first beta of the in-development CPython version (currently 3.15). This is to avoid any possible problems arising from ABI incompatibilities between the different compilers (these should be minimal, but that's not fully ruled out).

@Akash9824 when you say “we” who are you referring to? Sorry I’m a little confused when you use “we” in that way.

Finally, I am willing to investigate ARM64 performance on Windows further, but I do not have an ARM laptop or desktop, so I can’t at the moment.

Initially, I commented on the performance of x64, but later I realized it would not be fair to directly compare MSVC x64 with MSVC ARM64 due to the different SoC architectures. However, if two devices have the same Geekbench scores, then we can compare them, I guess. Anyway, leaving x64 aside.


Even if I disable computed gotos while compiling, I still see a significant difference between the MSVC ARM64 and clang-cl ARM64 compiled binaries. I can share the data if you are interested.

My intention here is not just to discuss using clang-cl as the default compiler for WOA, but also to explore possible ABI-related issues we might face. Although I am investigating and trying to reproduce potential issues, I haven’t found any so far. If there are any known issues, they should be fixed by LLVM. I would love to work on those issues as well.

When I say ‘we,’ I mean the team working on the performance of Windows on ARM64 applications. We have largely adopted clang-cl for WOA apps.

To be more specific: compare the ARM64 binaries of 3.12.0 against the minor release 3.12.10, and you will see a visible regression in 3.12.10. If you consider that comparison outdated, you can also compare the major releases 3.12.0 and 3.13.0; 3.13.0 regresses relative to 3.12.0.

Note: I am talking about ARM64 binaries only, and I am running pyperformance to compare performance.

That’s not really any clearer. “The team working on WOA performance” at which company/organization, in which context?


Personally I think the main reason is that the people working on optimising Python are using GCC and just hoping that every other compiler will behave the same. I base this assumption on the complete surprise I hear from those people whenever they test with a different compiler[1] and get different results :wink: It looks like a classic case of using bias to reinforce bias (you know, that thing everyone works so hard to prevent ML from doing).

This is starting to feel deceptive. If you work for an organisation that doesn’t allow you to disclose that you work there, please use either “Three Letter Acronym” (if it’s one of those) or “Fruit” (if it’s that one) in your next post. But most companies have social media policies that require you to disclose, and we operate on openness here.

Either way, it sounds like you’re funded and in a position to demonstrate the advantages and reliability of an alternative compiler, and possibly to just benefit directly from using it, so don’t let the volunteers refusing to take on more work stop you from doing it.


  1. Or in one case, a different version of GCC.


This is not true, and this entire “who do you work for?” line of reasoning feels disingenuous. Let’s please not make such demands. Nobody here has to tell us publicly who they work for, and if we ask and don’t get an answer, or don’t want to believe it, we need to move on. It is gatekeeping and isn’t relevant.


Akash: From your side, the best way to read such lines of questioning is that people are searching for a reason to trust you. If you are willing both to contribute the work to get such clang-cl-based Windows release build infrastructure and automation set up, and to commit to ongoing support, maintenance, and bugfixing of it within the CPython project, that would be the most likely way to ever see it happen. Yet even that is no guarantee it would be accepted. As is, you’re being met with a cold reception because you are coming into a community with a presumed-real desire, but without offering signs that you can help do both the up-front and the long-term work to make it real.

People see no reason to rally around your, to them esoteric, cause.


Using clang+llvm as our compiler on Windows (all platforms) would be a good thing in general, IMNSHO. But to me the biggest obstacle is that, frankly, >90% of core devs do not care at all about working on the Windows platform. On top of that, Microsoft has proven they do not care about CPython. This leaves the very few core devs willing to spend their time caring, e.g. Steve, who is (unfortunately for him :pray:) heroically holding a lot of our Python on Windows support together.

If someone claims it works better for Windows on ARM, I'm inclined to believe them: there's no reason to tell us that if it weren't true, because it would quickly become obvious that it wasn't, if any of us cared to spend time on such an unusual platform. But we don't appear to have anyone associated with the core team who has the time or ability to care, so such a change is a non-starter until we do.


Why do I hold this opinion? Rust on Windows is LLVM-based and its use is rapidly increasing, including within existing C and C++ application stacks, and including with CPython (third-party extension modules today; eventually we will wind up with Rust bits within CPython core, it's just a matter of time and effort, but Rust is clearly the future of compiled languages). There is no reason I can see to believe clang has Windows compatibility issues other than FUD at this point. Meaning that FUD needs to be dispelled to change minds. But given that our limited core team dedicated to the Windows platform is understandably biased against spending their valuable time working through how to validate and guarantee that within our project, we cannot make such demands of them.

The entire clang-cl windows tool chain exists in significant part because Visual C++ was growing incapable of compiling and linking the Chromium codebase thus Google invested in making it a thing. ABI compatibility was a specific goal of the project. Rust’s reliance on LLVM and linkage into existing C/C++ applications makes this only more so. I’m not worried. But I’m not who needs convincing as I don’t do Windows.
