This is my last day at work before I go on vacation for a week, so apologies in advance if I take a while to get back to replies; my goal on this trip is to stay off my phone and computer as much as possible.
I’ve been working on community support for free-threaded Python for about a year now. I started with relatively little multithreaded programming experience, beyond a teeny bit of OpenMP coding via Cython; I’d never even used the Python `threading` module. It was daunting to have to learn an entirely new programming paradigm, but I’ve come out the other end convinced that it’s worth it, and that I can help others by communicating what I’ve learned and building community knowledge.
IMO, the raw computational power unlocked by the free-threaded build on modern processors more than justifies the work that will be necessary to get everything working and make multithreaded parallelism a first-class tool in Python.
As of now, we have the “base” of the scientific Python stack working in NumPy, SciPy, Matplotlib, pandas, and scikit-learn, and code generation and language interop tools like Cython, PyO3, pybind11, and f2py are all working as well. We’re still waiting on a Cython release, but the others all ship releases supporting the free-threaded build. For the past few months I’ve been focusing on projects that ship Rust native extensions and depend on PyO3, to help unblock that corner of the ecosystem.
I’m really excited about the possibilities available when people start migrating from process-pool-based parallelism to thread-pool parallelism. Take a look at this NumPy PR, which fixed a multithreaded scaling issue for ufunc evaluation that was reported against the first release of NumPy to support the free-threaded build. The details of the code are less important than the graph in that PR, which clearly demonstrates that more or less the same NumPy workflow is substantially faster with a free-threaded thread pool. Process-pool parallelism is leaving a ton of CPU performance on the table. Of course that was the primary motivation for starting the experiment, but it was really exciting to actually generate that graph on my laptop and see first-hand how much faster free-threaded Python can be.
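To make the migration concrete, here’s a minimal sketch of the thread-pool pattern using only the standard library (no NumPy, so the chunked dot product and the helper names are illustrative, not from the PR). On the free-threaded build the chunks can run truly in parallel; on the GIL-enabled build they interleave, but the threads still share memory without the pickling overhead a process pool pays.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(xs, ys):
    # CPU-bound work on one chunk. With the GIL this serializes;
    # on the free-threaded build each chunk runs on its own core.
    return sum(x * y for x, y in zip(xs, ys))

def parallel_dot(xs, ys, n_workers=4):
    # Split the inputs into contiguous chunks, one per worker.
    size = len(xs)
    chunk = (size + n_workers - 1) // n_workers
    bounds = [(i, min(i + chunk, size)) for i in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # No pickling: the threads read the shared lists directly.
        partials = pool.map(lambda b: dot(xs[b[0]:b[1]], ys[b[0]:b[1]]), bounds)
    return sum(partials)

xs = list(range(10_000))
ys = list(range(10_000))
result = parallel_dot(xs, ys)
```

Note that the threads only *read* the shared inputs; that read-only access pattern is what makes this safe without any locking.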
I’m also excited about the morning of the first day of talks at PyCon where there will be a series of talks in Hall A on free-threaded Python, including one from myself and @lys.nikolaou which aims to cover the content at https://py-free-threading.github.io in talk form, to the extent that’s possible. I’m hopeful that our talk will be a lasting community resource for those who learn about programming via videos and lectures rather than reading documentation.
This is a great point, one I fully agree with.
While up until yesterday the content at py-free-threading.github.io was very focused on native extensions, your reply in this thread prompted me to spend most of yesterday updating it with content that makes it clearer what users and project maintainers need to do to get their code working. See here: py-free-threading
I also split our original single porting-guide page into three pages: one focused on thread-safety issues in pure Python code, one on multithreaded testing, and a third focused on native extensions. That way, projects that don’t have native extensions don’t see that content and get scared away.
We’ve been working on py-free-threading.github.io since last summer and our hope is that it will be the go-to place for questions about free-threaded Python, at least for content that doesn’t make sense in the CPython documentation. Please please please open issues telling us what needs to be improved or letting us know about mistakes. Contributions are also very welcome.
It depends a little on what you mean by “work”. If the module already has extensive multithreaded tests under the GIL-enabled build, then it will be easy to see whether or not free-threading introduces new kinds of bugs.
If the code is for a CLI app, or really any kind of user-controlled application where the user decides whether or not to create a thread pool and use the code in a multithreaded context, the user or CLI app author can do testing to make sure their internal use of threads is safe. Tools like `pip` that do not have a public Python API also don’t need to worry about making internal Python code thread-safe, unless they would like to use threading internally.
Library authors will need to do a little more work. This is particularly acute for libraries that do not have good multithreaded tests. However, as @thomas points out later in the thread:
I ran into exactly this situation last week working on the `cryptography` library. It turns out the `_ANSIX923PaddingContext` is implemented in Python and uses an internal `bytes` buffer to store state. If two threads simultaneously update the context, they can race to update the bytestring, leading to silently incorrect results.
It’s easy to trigger this on the GIL-enabled build by calling e.g. `sys.setswitchinterval(0.000001)` before running a multithreaded test: `_ANSIX923PaddingContext` isn't thread-safe · Issue #12553 · pyca/cryptography · GitHub. That doesn’t mean it can’t happen under the default configuration, just that it requires an “unlucky” thread switch that happens to trigger the race. I wouldn’t be surprised to learn about rare crashes in production systems on the GIL-enabled build using thread pools due to issues like this.
This is also a good point. I’ve opened an issue to track adding examples demonstrating how the free-threaded build makes things that were previously impossible possible: Add a page to the guide showing why disabling the GIL is awesome · Issue #149 · Quansight-Labs/free-threaded-compatibility · GitHub
There’s subtlety here too. One example (at least as I understand @cfbolz’s reply in this thread) is that a change in Python 3.10 to how bytecodes release the GIL means that pure-Python code releases the GIL far less often than in Python 3.9 and earlier, as well as in PyPy: Races on appending bytes are much easier to trigger in PyPy than CPython · Issue #5246 · pypy/pypy · GitHub
There are likely more cases like this. We need better tests to find them all.
That’s true! IMO, adding multithreaded tests of objects with mutable state is important before you can call your module thread-safe.
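One pattern I like for such tests is to line all the threads up on a `threading.Barrier` so they hammer the object at the same moment, maximizing the chance of exposing a race. A minimal sketch, with a toy `Counter` standing in for whatever mutable-state object your module exposes:

```python
import threading

class Counter:
    # Toy object with mutable state; the lock is the thread-safety
    # fix under test. Removing it makes the stress test below flaky.
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:
            self.value += 1

def test_concurrent_increment(n_threads=8, n_iters=5_000):
    counter = Counter()
    barrier = threading.Barrier(n_threads)

    def worker():
        # All threads block here until everyone has arrived, then
        # mutate the shared object simultaneously.
        barrier.wait()
        for _ in range(n_iters):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Any lost update shows up as a too-small final count.
    assert counter.value == n_threads * n_iters

test_concurrent_increment()
```

The same shape works as a pytest test, and combining it with a tiny `sys.setswitchinterval` makes it effective on the GIL-enabled build too.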
I also want to point out that it’s completely valid (at least IMO) for a project to not guarantee any thread safety and leave it up to the user to evaluate whether or not their code is thread-safe.
For example, NumPy does not guarantee (and never has) the thread safety of ndarray: Thread Safety — NumPy v2.3.dev0 Manual. We’re also aware of a couple of free-threading-specific thread safety issues in NumPy that we’re not yet prioritizing because we haven’t gotten any reports about problematic uses yet.
IMO making ndarray thread-safe is not the correct way forward. We could add locking, but doing it in a way that allows scalable multithreaded performance (e.g. without causing any regressions to existing multithreaded workflows in the GIL-enabled build) would be a large engineering effort. NumPy also exposes a number of C API functions that allow direct access to all kinds of low-level memory buffers allowing unsafe mutation in a multithreaded context.
Even with all those issues though, people have been using multithreaded parallelism with NumPy for years. Dask, for example, supports thread pools as a parallelization strategy, and NumPy has fixed a number of issues seen by dask users over the years.
Even though users are allowed to do unsafe things, they are still productively using NumPy in a multithreaded context by avoiding multithreaded mutation. Many, many production workflows end up as an embarrassingly parallel operation that uses read-only access on shared ndarrays.
IMO, long term, the way forward is for NumPy to offer better support for immutable ndarrays. This may eventually involve adding a borrow checker to NumPy’s ndarray buffer to enforce, at runtime, a Rust-like write-XOR-read pattern on data owned by an ndarray.
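To illustrate what a runtime write-XOR-read check could look like, here’s a pure-Python sketch built on a condition variable. This is entirely hypothetical (`BorrowGuard` is not a NumPy API); it just shows the invariant: any number of concurrent readers, or exactly one writer, never both.

```python
import threading
from contextlib import contextmanager

class BorrowGuard:
    # Hypothetical runtime "borrow checker": enforces write XOR read
    # on some owned data. Not an actual NumPy API.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    @contextmanager
    def borrow(self):
        # Shared (read) borrow: blocks while a writer holds the data.
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def borrow_mut(self):
        # Exclusive (write) borrow: blocks until there are no readers
        # and no other writer.
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()

guard = BorrowGuard()
data = [0] * 16  # stand-in for an ndarray's buffer

with guard.borrow_mut():
    data[0] = 42      # exclusive: no reader can observe a half-done write
with guard.borrow():
    snapshot = data[0]  # shared: concurrent reads are fine
```

An actual implementation in NumPy would need to live in C and avoid contending on a single lock per array to keep scalable performance, which is part of why this is a long-term design discussion rather than a quick fix.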
Why would it be unpopular to have it in the stable ABI? In the GIL-enabled build, the begin and end critical section macros are just an open brace and a close brace. Of course they’d be more complicated in a free-threaded stable ABI, but they’d also be necessary for thread safety in many C extensions. I don’t see what the problem is.
Mutating a shared ndarray isn’t thread-safe and right now we don’t plan to make it safe, via Python or the C API.
We’re doing our best to report and fix crashes or memory unsafety due to mutation of shared arrays, but I don’t think it’s possible to fix all the data races one could trigger by mutating a shared array. The same is true on the GIL-enabled build, and is relatively easy to trigger because mutating ndarrays releases the GIL (except for object arrays and a few other limited exceptions).
What I would like to see is better support for immutable ndarrays or maybe an alternate array type that has better guarantees about that immutability and less legacy baggage.
If PyArrow has bigger needs for a thread-safe ndarray, or for making the NumPy C API thread-safe, then I’d love to start a discussion about that on the NumPy mailing list or issue tracker. I’m very interested in having some design discussions about this to inform long-term plans in NumPy.
This is true. I’ve been trying to add documentation and examples of how to write multithreaded stress tests.
Here are some recent examples I’ve come up with on projects I’ve added free-threaded support to:
I think we should make it easy to run a native profiler against CPython so you can see for yourself where the issue is. We have an open issue to better document this as well as to add tooling to discover multithreaded scaling bottlenecks: Documentation and tooling for detecting multithreaded scaling issues and regressions · Issue #117 · Quansight-Labs/free-threaded-compatibility · GitHub
I also agree with @steve.dower that a built-in way to identify contention on resources that are protected by PyMutex locks would be neat.
Finally, we also opened an issue to track the need for documentation on debugging multithreaded performance in the free-threading guide: Add a multithreaded performance section · Issue #151 · Quansight-Labs/free-threaded-compatibility · GitHub