Some notable missing pieces ahead of freethreading

mikeshardmind · July 30, 2025, 9:21pm

Sorry if this comes off as “asking too much”, not my intent here, just putting together a list of things I’ve noticed absent that may be more relevant and worth considering including in the language itself. Currently limiting the scope to things that aren’t trivial to package with the current expectations people have about wheel availability. If it’s reasonable to make it third-party and “just work”, then that’s a more appropriate place to innovate first.

Atomic compare-and-exchange for integers and booleans.

These are a necessary building block for lock-free algorithms, and are non-trivial to package correctly for a wider audience, as the available atomics are platform-dependent in a way that cannot be properly captured in the wheel specification. In particular, most available 16B atomic operations are only guaranteed to be available on intel and amd with AVX, with the specifics being different across cpu manufacturers. Even relying on cmpxchg16b, which is generally available, isn’t universally safe without the lock-prefixed variant. Libraries providing even use of this require build tools or runtime dispatch. Many 8B instructions are more reasonable to assume, but there’s still no good way to guarantee their availability from a library perspective other than requiring build tools and erroring if the minimum cpu instructions are not available.

RWLocks.

Naive local testing showed implementing this in python ends up slower than having a more contested naive single lock for a much larger scale than I anticipated^[1] before the RWLock was better. Fast implementations that are either fair or write-preferring also run into the above problem of not having reliable atomics in wheels. It’s possible to implement read-preferring RWLocks performantly without the atomics.

There’s a few other things, but most of them are possible to build with the existing primitives reasonably performantly, or possible to implement natively within general user expectations of wheel availability.

Note: edited for formatting only, discourse’s new rich text editor breaks the ability to just write a footnote, and tried to “helpfully” escape markdown for me without providing a button for the same markdown in an unescaped form.

with an 80/20 reader/writer split, required 2 orders of magnitude increase in attempted concurrent access. With a 90/10 split, it was closer to 3 orders of magnitude. With a native, but non-portably packagable implementation, the RWlock outperformed at the same scale for both of the above splits. ↩︎

colesbury · July 30, 2025, 9:35pm

Just to clarify: do you want these constructs in Python code or from the C API?

ngoldbaum · July 30, 2025, 9:38pm

I ended up relying on C++ features for a place where we needed an RWLock in NumPy: https://github.com/numpy/numpy/blob/46268caa719db95ab2198bcd619f0e3120d7c52e/numpy/_core/src/umath/dispatching.cpp#L914

I agree it probably makes sense to expose the rwlock in the pycore_lock.h header. I also think it would be nice to have a one-time initialization API exposed.

Also for what it’s worth I think this stuff could also live in a (header-only? if that’s possible…) library that projects could depend on, if only to bypass getting something approved in the C API. It would also enable experimentation that would inform a future C API proposal.

The main reason the minimal header-only utility library didn’t get created to help with the NumPy port is we ended up using C++ features. IMO it’s still worth doing for projects that can’t or don’t want to use C++.

mikeshardmind · July 30, 2025, 9:43pm

I’d be fine with either personally. but I think “both” is the more appropriate answer for at least the RWLock.

A RWLock implementation should probably exist in the threading module accessible from python by python users.

I ran into this when trying to wrap some things I expect people to be missing when they try to write free-threading safe versions of their existing code, and want to ensure this is portable enough that it behaves well with current packaging limitations. I don’t have an issue with writing native code, but expecting users to have build tools or a rust toolchain, or any number of other things is at odds with the current push to make wheels available for everything.

For personal and professional use, I can ignore the packaging edges and just require buildtools. I don’t feel the same about my open source contributions as to the viability of this given the emphasis on wheel availability, and have a feeling that this (as well as a few things that are more asyncio-specific I intend to open a seperate thread for in that category) are going to be friction points to free-threading adoption.

colesbury · July 30, 2025, 9:46pm

Would you please clarify your request? I still don’t understand if you ran into this when writing code against the Python C API or when writing Python code. Those are two very different experiences with two different audiences.

mikeshardmind · July 30, 2025, 9:59pm

I think there should be a RWLock provided by python (the language) and by CPython, the implementation, and that this should be available to both people writing python code and people writing extension modules. (available in both the C-API and a python interface exposed in the threading module).

I’d prefer if some basic atomic values (bool, int32) were also available to both, but realistically, this would be fine in the C-API only if the interpreter wraps the behavior correctly for all platforms.

The concrete problem experienced was not with implementing this, but with implementing it in a way that meets current packaging ecosystem expectations, and my conclusion was that these are so fundamental that it makes more sense to put the building blocks in the language itself, and leave the more complex things that build on them to libraries that can use that, rather than have to do runtime feature detection and runtime dispatch to be safe in a wheel.

ngoldbaum · July 30, 2025, 10:08pm

Are you aware of cereggii?

GitHub - dpdani/cereggii: Thread synchronization utilities for Python

Some of the Python API you’re looking for is there. There’s also ft_utils:

I think there’s lots of room for experimentation on this stuff. I’m not sure putting something in the standard library immediately is the best choice. Eventually migrating stuff into the standard library does make sense, but that should be informed by real-world use IMO.

mikeshardmind · July 30, 2025, 10:11pm

I’m aware of both. Neither would be safe to distribute in a wheel publishable to pypi without runtime dispatch as implemented, specifically because of the atomics.

ngoldbaum · July 30, 2025, 10:16pm

I’m confused why atomics are a problem for distributing wheels. Why couldn’t another library more-or-less copy what’s in pyatomic.h and re-expose platform-specific atomic operations? There’s no need to support more than what python itself supports.

I also think it makes sense to expose pytomic.h publicly but also it would probably be a fair bit of work to argue that and see the change through. It also won’t be available until Python 3.15 at the earliest.

mikeshardmind · July 30, 2025, 10:26pm

As explained in the first post, the underlying cpu instructions for various atomic operations are not guaranteeable by the wheel spec or by limiting support to the platforms python itself supports. As wheels distribute compiled code, this leaves various options, none of which are satisfying.

Don’t use atomics
Only use a slow, internally locking version of atomic values.
Don’t distribute wheels.
Ship a fat runtime and do runtime dispatching.
Don’t provide this for other people at all.

Exposing the header does not solve this, there would need to be something provided by the c-api that abstracts it via interaction with the interpreter per-platform, or packaging needs to evolve to be able to package not just by platform, but by available instruction sets.

As for whether or not this should be included in the standard library, I don’t think a RWLock is exactly new or novel. It’s just not been something most people are going to have needed with the available parallelism previously available in python. It’s usually the best lock option for concurrency that requires explicit synchronization, and I don’t have high expectations on all existing python code being rewritten to be lock-free concurrency safe.

I don’t think atomic values are exactly new or novel either, but it’s arguable that providing the things people would build with them is sufficient.

colesbury · July 30, 2025, 10:51pm

You are not asking too much, but your requests and motivation are confusing to me. Adding anything to the Python standard library or the Python C API is difficult, and getting consensus requires concrete use cases. Real code and projects are helpful.

Atomic instructions are part of the base instructions set on x86-64 (and aarch64). cmpxchg16b is a (widely available) extension that is unrelated to AVX. Nothing forces you to use it. You can also use cereggii without relying on it.

We already expose atomic operations in C with things like _Py_atomic_add_int, but they are not documented and not considered public (hence the leading underscore). I don’t think we will ever make them public because it’s a pain to maintain a comprehensive atomic library and atomics are already part of modern C, C++, and Rust.

On RWLock, see add threading.RWLock · Issue #53046 · python/cpython · GitHub . There is a publicly available project on pypi. If this is a widely requested feature, than we could add it to the Python standard library, but that doesn’t seem to be the case so far.

We have a C implementation of a RWLock. RWLocks tend not to perform well – lots of small readers fundamentally doesn’t scale well, even in the absence of writers – so it only has one, somewhat awkward use case. You can use the shared_mutex from C++ or Rust. If there are enough C-only projects that need it, then we might be able to convince the C API WG that it’s worthwhile.

mikeshardmind · July 30, 2025, 11:04pm

Only some of the 8B instructions are atomic as part of the x86 ISA spec. cmpxchg16b is the only 16B atomic instruction that is widely available, however it’s not actually safe to use without the lock prefix. The other 16B instructions are only documented as supported and atomic by Intel and AMD on processors with AVX support. aarch64 support is not actually much better here. Recent processors will support it, but double word atomics are only available with LSE, and the equivalent comapre and swap requires v8.1 (LSE). v8.4 (LSE2) is needed for load/store. Wheels are not versionable by this, which means that leaving it to libraries is going to create problems. (For instance, it’s possible to get a SIGILL if you try to build a manylinux wheel using cereggii)

The wide availability of cmpxchange16b is also a bit of a trap when pre-building binaries, as there are platforms that using cmpxchange16b without the lock prefix will result in incorrect behavior.

ngoldbaum · July 30, 2025, 11:23pm

I’m still confused by this. To make that concrete, in NumPy we’re using this code:

github.com/numpy/numpy

numpy/_core/src/common/npy_atomic.h

46268caa7


      
          * Provides wrappers around C11 standard library atomics and MSVC intrinsics
          * to provide basic atomic load and store functionality. This is based on
          * code in CPython's pyatomic.h, pyatomic_std.h, and pyatomic_msc.h
          */
          
          #ifndef NUMPY_CORE_SRC_COMMON_NPY_ATOMIC_H_
          #define NUMPY_CORE_SRC_COMMON_NPY_ATOMIC_H_
          
          #include "numpy/npy_common.h"
          
          #ifdef __cplusplus
             extern "C++" {
                 #include <atomic>
             }
             #define _NPY_USING_STD using namespace std
             #define _Atomic(tp) atomic<tp>
             #define STDC_ATOMICS
          #elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L \
             && !defined(__STDC_NO_ATOMICS__)
             #include <stdatomic.h>
             #include <stdint.h>

Under what circumstances is the code in that header going to lead to miscompilation? I haven’t heard of anyone hitting issues related to our use of atomics in NumPy. We’re relying either on standard library support or compiler-specific intrinsics if that isn’t available. This code was adopted from the pyatomic.h header.

mikeshardmind · July 30, 2025, 11:55pm

It’s not going to lead to miscompilation, it’s going to lead to compiling to something that isn’t guaranteed to be as safely portable as the wheel declares it to be.

The C++11 atomics are stl implementation dependent, and may be emitted differently on different cpu targets and differently by different compilers. On different cpus, the underlying code emitted for atomics is different. This is based on instruction set. gcc documents this here clang documents this here. As these are emitted at compile time for the target, this is technically possible to cover under cases 1 or 2 from the list above in wheels trivially, but undermining the reason to write this as native code. Case 4 is possible to support with increased complexity. None of these seem reasonable for a library to need to do, but cpython does support platforms that may require this for the wheel to be safely distributed.

This is one of the problems with the push for wheels for everything and telling people they don’t need to have build tools, the wheels can’t safely express everything. This is a known wheel limitation and the current solution used by manylinux doesn’t adequately cover this case.

I could write this off as “not my problem”, require build tools, or just not provide something like this and keep all of my tools private until the issue is resolved in packaging, but I figured having something provided by the interpreter was likely more productive for things I would consider fundamental.

If there’s no appetite for this, I’ll spend my time on other things.

oscarbenjamin · July 31, 2025, 12:11am

As I understand it numpy are switching to -march=x86-64-v2 which I think means cmpxchg16b can be in the wheels. Or is that still not good enough?

Of course it would be better if wheel metadata could reflect instruction set metadata like this because those wheels would just crash on older hardware. For now though it is probably reasonable to set v2 as a baseline for wheels.

Liz · July 31, 2025, 12:11am

Any atomic int type in the standard library would have to be wrapped in some way. Python integers are arbitrary sized, but what you want requires cpu specific alignment and size restrictions.

I wouldn’t mind having a fast RWLock implementation, but that feels like a luxury compared to atomic values, lock-free datastructures, or immutable record types in terms of effort that could be spent improving the language for free-threading.

mikeshardmind · July 31, 2025, 12:36am

cmpxchg16b can be in the wheels, but it presents a problem if a compiler emits this instruction without using it as lock cmpxchg16b. Worse than crashing with a SIGILL on first use, this one’s going to be available, but not actually atomic on pre-AVX x86_64.

I can get specific updated reference links to this later (I’ll have to re-open a bunch of pdfs, and it’s getting late here)

oscarbenjamin · July 31, 2025, 12:41am

Is that not controllable at compile time though?

I don’t know enough about the stack (or the whole general topic) here to trace through from say the header linked by @ngoldbaum to where a compiler would generate the instruction…

mikeshardmind · July 31, 2025, 12:45am

if numpy is switching to -march=x86-64-v2, it’s up to the compiler. That’s targeting something that locking the entire cacheline isn’t supposed to be needed. I would have to check if the compiler then not only currently does this, but that it’s a behavior they guarantee in some way, which isn’t something I would expect, or that I found in my research of this.

mikeshardmind · July 31, 2025, 12:50am

Someone else has done a much better job of collecting relevant information. A notable detail:

Update 2020-07-08 : Travis Down suggested that I should make sure 16B unaligned L/S test crosses both the 16B and 32B alignment boundaries. This makes the previously succeeding 16B unaligned test fail on Zen 2.

zen 2 is only 6 years old