What about adding a #define/-D that can be used to remove the unsafe APIs from the header files?
I could then opt into removing the unsafe APIs and get compiler errors when migrating to free-threading.
That implies your project doesn't compile until you fix all potential issues, which seems a bit too severe. Also, some APIs like `PyDict_Next` are apparently safe/unsafe depending on the context. A lint-like utility that would flag potentially dangerous constructs would be more flexible.
It would be a reliable way to find unsafe usage.
I can always remove the -D from the compile once I have the list of code to fix.
Then I can fix incrementally and retest with the -D to confirm I've fixed all unsafe calls.
I'd like to give an example of why I (as a "pure Python" programmer) find it so hard to understand what free-threaded Python means to me - and hence, what needs to be part of the documentation I'm referring to above. I apologise if this is a bit long, but I think context and a concrete example are important here.
I have a casual interest in simulating various dice and card games using Python. This typically takes the form of simulating a game turn a large number of times, and summarising the results. To give a relatively simple example, working out how many spaces you move on average in the game Monopoly. The simplest way is just to define a function that calculates the result, and run that a million or so times. That's often sufficient, but for more complex problems, it can be pretty slow. To speed it up, I thought of using threads, as that would share the work across my multiple cores. Unfortunately, as my calculations are CPU-bound (a number of calls to `random.randint()` plus arithmetic and conditional tests), the GIL prevents them being run in parallel, and the threaded version is significantly slower than the serial version (because of the overhead involved in setting up worker pools, etc.).

Given that the GIL is my problem here, I expected free-threaded Python would give me the speedup I was hoping for. However, that doesn't seem to be the case - in a simple test, my serial run took about 1 second, and the threaded (with GIL) run took 12 seconds. The free-threaded interpreter took 25 seconds - twice as long as with the GIL!
I'm not doing anything clever here; my thread loop is simply:
```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait

def main():
    results = Counter()
    with ThreadPoolExecutor() as exc:
        futures = [
            exc.submit(one_result, results)  # one_result is defined elsewhere
            for _ in range(1_000_000)
        ]
        wait(futures)
    print(results)
```
There are two fundamental problems I have here.
If we are switching to a free-threaded model, I think part of that switch has to include re-educating Python users who have internalised the "threads are for IO-bound workloads, the GIL makes threading useless for CPU-bound tasks" message of the past. And part of that re-education has to be for us to understand, and be able to articulate, what the new message actually is.
By the way - I don't want advice on how to improve my code above. It's a throwaway example, knocked up in a couple of minutes. It's intended as an example of sloppy code written with a mindset of "I have 8 cores, if I use them all, I'll get 8x the speed". Maybe that's not the target use case for the free-threading build - but if so, how do we make that clear to the average user?
First, thanks all for the pings in this thread. It's an honor that my opinion is valued.
I'm writing a longer reply to all the points brought up, but wanted to quickly reply to this:
Is there any chance you can share your code anyway? I'm curious to see where the performance bottleneck is.
IMO your simple example probably should have led to the speedup you were expecting, unless there's something else going on. The fact that it doesn't deserves investigation.
That'll likely have a global seed state that's shared between threads and needs locking. So if that dominates I wouldn't necessarily expect a speedup.
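To see that shared state for yourself: the module-level functions are bound methods of a single hidden `Random` instance, which a one-liner can confirm.

```python
import random

# Each module-level function is a bound method of the same shared
# Random() instance created at import time, so every thread using
# random.randint() etc. contends on the same internal state.
print(random.randint.__self__ is random.random.__self__)  # True
```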
You might be interested in this note in the `random` module documentation. This program, or one very similar to it, was discussed previously, and you received a number of helpful suggestions on how to write it in a way that it scales well.
Is it possible to detect improper concurrent access to resources in Python code programmatically? Having something like a `-X warnmultiaccess` flag that warned could be helpful for finding these bugs (in Python code itself).
Even looking around company code, I once in a while see things that work only because of the GIL - things like adding to lists in threads without an explicit lock.
Then we have a plethora of historical answers, external documentation, etc. that rely on it as well. People saying that we don't need a lock here because of …
Without some sort of detection or linting that sees it, I'm not really sure how we "stop the bleeding."
Some kind of performance counter to highlight contention (count of parking-lot uses? I trust Sam & co. to choose better than me) would be very useful. Even if it's just a printed statistic at the end of execution, it's at least a not-entirely-theoretical way to measure whether code is parallelisable or not.
For the stable ABI: I see a way forward, but with how things are interconnected I'm having trouble pinning it down and breaking it into pieces that are PEP-sized or smaller. See Pre-PEP: PyModuleExport -- a new export hook for modules for where I am now.
Given what I emphasized there, I think Phase II should be started in an early alpha release. That is, announce it now, but for 3.15.
Sorry for pushing back! I fully realize the sorry state of the stable ABI for free-threading is partly my fault; looking back, I should have made it a bigger priority.
Sure - it's at montecarlo.py · GitHub
I didn't share it because there's very little in it beyond the main function I posted.
I'm glad to hear that your intuition was the same as mine. That suggests that there's not as much education needed as I feared.
Yes! We've discussed this briefly a while ago in a Faster CPython Sync, but I hadn't written it up as an issue. I've filed Use pystats for free threading performance statistics · Issue #131253 · python/cpython · GitHub to outline some ideas.
Yes. I hadn't recalled that discussion, so thanks for the link, but the details there aren't really my point here. I routinely write this same bit of code from scratch, and it's never something I'm thinking of in terms of scalability. So I don't try very hard to optimise it, and that's really what I'm trying to get at here - how do we make sure that people don't do the wrong thing when they aren't thinking beyond "split the load across my cores"?
Some immediate takeaway thoughts:
1. Using the module-level `random` functions (which share a single `Random` instance) from multiple threads is OK, but scales badly under free threading. If it performed badly on threads under the GIL as well, I'd be OK with this, but having previously-working code slow down this much is far from ideal. Particularly when it's far from obvious how to use separate instances of `Random` per thread, for example when using a thread pool like I am. At a minimum, a recipe in the docs would be helpful, and maybe even a runtime warning[1].
2. As I said, I'm not demanding that this exists now; I'm simply saying it should be part of the criteria for when free threading is ready to be marked as "supported".
FWIW, I changed my code to put an RNG instance in thread-local storage on worker startup and use that. It didn't improve performance at all.
[1] For what it's worth, there's a whole bunch of other issues with sharing an RNG across threads, around potential degradation in randomness - but I don't care about that for this application, and I don't think we need to worry about educating anyone who is writing code that cares about those issues.
Also, I'll note that my main point then was that there's no information on how to write such code correctly under free threading, and that wasn't addressed - which is why I'm reiterating the same point now.
This is why I don't like offering code examples. People naturally advise on how to fix the code, and the point that I had no means to find out how to fix the code without asking people gets missed…
This is my last day at work before I go on vacation for a week, so apologies in advance if I take a while to get back to replies to this, but my goal on this trip is to not be on my phone or computer very much.
I've been working on community support for free-threaded Python for about a year now. I started with relatively little multithreaded programming experience, beyond a teeny bit of OpenMP coding via Cython. I'd never used the Python `threading` module. It was daunting to have to learn an entirely new programming paradigm, but I've come out the other end convinced that it's worth it and that I can help others by communicating what I've learned and building community knowledge.
IMO, the raw computational power unlocked by the free-threaded build on modern processors more than justifies the work that will be necessary to get everything working and make multithreaded parallelism a first-class tool in Python.
As of now, we have the "base" of the scientific Python stack working: NumPy, SciPy, matplotlib, pandas, and scikit-learn. Code generation and language interop tools like Cython, PyO3, pybind11, and f2py are all working too. We're still waiting on a Cython release, but the others all ship releases supporting the free-threaded build. For the past few months I've been focusing on projects that ship Rust native extensions and depend on PyO3, to help unblock that corner of the ecosystem.
I'm really excited about the possibilities available when people start migrating from process-pool-based parallelism to thread-pool parallelism. Take a look at this NumPy PR, which fixed a multithreaded scaling issue for ufunc evaluation that was reported against the first release of NumPy to support the free-threaded build. The details of the code are less important than the graph in that PR, which clearly demonstrates that doing more or less the exact same workflow using NumPy is substantially faster using a free-threaded thread pool. Process-pool parallelism is leaving a ton of CPU performance on the table. Of course that was the primary motivation for starting the experiment, but it was really exciting to actually generate that graph on my laptop and see first-hand how much faster free-threaded Python can be.
I'm also excited about the morning of the first day of talks at PyCon, where there will be a series of talks in Hall A on free-threaded Python, including one from myself and @lys.nikolaou which aims to cover the content at https://py-free-threading.github.io in talk form, to the extent that's possible. I'm hopeful that our talk will be a lasting community resource for those who learn about programming via videos and lectures rather than reading documentation.
This is a great point, one I fully agree with.
While up until yesterday the content at py-free-threading.github.io was very native-extension focused, I was prompted by your reply in this thread to spend most of yesterday updating it, adding content to make it clearer what users and project maintainers need to do to get their code working. See here: py-free-threading
I also split our original single porting guide page into three pages: one on thread-safety issues in pure Python code, one on multithreaded testing, and a third on native extensions. That way, projects that don't have native extensions don't see this content and get scared away.
We've been working on py-free-threading.github.io since last summer, and our hope is that it will be the go-to place for questions about free-threaded Python, at least for content that doesn't make sense in the CPython documentation. Please please please open issues telling us what needs to be improved or letting us know about mistakes. Contributions are also very welcome.
It depends a little on what you mean by "work". If the module already has extensive multithreaded tests under the GIL-enabled build, then it will be easy to see whether or not free-threading introduces new kinds of bugs.
If the code is for a CLI app or really any kind of user-controlled application where the user decides whether or not to create a thread pool and use code in a multithreaded context, the user or CLI app author can do testing to make sure their internal use of threads is safe. Tools like `pip` that do not have a public Python API also don't need to worry about making internal Python code thread-safe, unless they would like to use threading internally.
Library authors will need to do a little more work. This is particularly acute for libraries that do not have good multithreaded tests. However, as @thomas points out later in the thread:
I ran into exactly a situation like this last week working on the `cryptography` library. It turns out the `_ANSIX923PaddingContext` is implemented in Python and uses an internal `bytes` buffer to store state. If two threads simultaneously update the context, there is a possibility that the threads can race to update the bytestring, leading to silently incorrect results.
It's easy to trigger this on the GIL-enabled build by doing e.g. `sys.setswitchinterval(.000001)` before running a multithreaded test: `_ANSIX923PaddingContext` isn't thread-safe · Issue #12553 · pyca/cryptography · GitHub. That doesn't mean it can't happen on the default configuration, just that it requires an "unlucky" thread switch that happens to trigger the race. I wouldn't be surprised to learn about rare crashes in production systems on the GIL-enabled build using thread pools due to issues like this.
This is also a good point. I've opened an issue to track adding examples demonstrating how the free-threaded build makes things that were previously impossible possible: Add a page to the guide showing why disabling the GIL is awesome · Issue #149 · Quansight-Labs/free-threaded-compatibility · GitHub
There's subtlety here too. One example is that (at least according to @cfbolz, I think this is what he's getting at in his reply in this thread) a change in Python 3.10 to how bytecodes release the GIL means that pure Python code releases the GIL far less often than in Python 3.9 and earlier, as well as PyPy: Races on appending bytes are much easier to trigger in PyPy than CPython · Issue #5246 · pypy/pypy · GitHub
There are likely more cases like this. We need better tests to find them all.
That's true! IMO, adding multithreaded tests of objects with mutable state is important before you can call your module thread-safe.
I also want to point out that it's completely valid (at least IMO) for a project to not guarantee any thread safety and leave it up to the user to evaluate whether or not their code is thread-safe.
For example, NumPy does not guarantee (and never has) the thread safety of ndarray: Thread Safety – NumPy v2.3.dev0 Manual. We're also aware of a couple of free-threading-specific thread-safety issues in NumPy that we're not yet prioritizing because we haven't gotten any reports about problematic uses yet.
IMO making ndarray thread-safe is not the correct way forward. We could add locking, but doing it in a way that allows scalable multithreaded performance (e.g. without causing any regressions to existing multithreaded workflows in the GIL-enabled build) would be a large engineering effort. NumPy also exposes a number of C API functions that allow direct access to all kinds of low-level memory buffers allowing unsafe mutation in a multithreaded context.
Even with all those issues though, people have been using multithreaded parallelism with NumPy for years. Dask, for example, supports thread pools as a parallelization strategy, and NumPy has fixed a number of issues seen by dask users over the years.
Even though users are allowed to do unsafe things, they are still productively using NumPy in a multithreaded context by avoiding multithreaded mutation. Many, many production workflows end up as an embarrassingly parallel operation that uses read-only access on shared ndarrays.
IMO, long term, the way forward is for NumPy to offer better support for immutable ndarrays. This may eventually involve adding a borrow checker to NumPy's ndarray buffer to enforce, at runtime, a Rust-like "writes XOR reads" pattern on data owned by an ndarray.
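That borrow checker doesn't exist yet, but the read-only half of the idea is already expressible today with the `writeable` flag; a small sketch:

```python
import numpy as np

arr = np.arange(1_000_000)
arr.flags.writeable = False  # freeze the buffer before sharing it

# Any number of threads may now read arr safely; attempted mutation
# raises instead of silently racing:
try:
    arr[0] = 42
except ValueError as exc:
    print(exc)  # assignment destination is read-only
```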
Why would it be unpopular to have it in the stable ABI? In the GIL-enabled build, the begin and end critical section macros are just an open bracket and a close bracket. Of course they'd be more complicated in a free-threaded stable ABI, but they'd also be necessary for thread safety in many C extensions. I don't see what the problem is.
Mutating a shared ndarray isn't thread-safe, and right now we don't plan to make it safe, via Python or the C API.
We're doing our best to report and fix crashes or memory unsafety due to mutation of shared arrays, but I don't think it's possible to fix all the data races one could trigger by mutating a shared array. The same is true on the GIL-enabled build, and it is relatively easy to trigger because mutating ndarrays releases the GIL (except for object arrays or other limited exceptions).
What I would like to see is better support for immutable ndarrays or maybe an alternate array type that has better guarantees about that immutability and less legacy baggage.
If PyArrow has bigger needs for a thread-safe ndarray, or for making the NumPy C API thread-safe, then I'd love to start a discussion about that on the NumPy mailing list or issue tracker. I'm very interested in having some design discussions about this to inform long-term plans in NumPy.
This is true. I've been trying to add documentation and examples of how to write multithreaded stress tests.
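As a rough illustration of the shape such tests usually take (a sketch only, not taken from any particular project), the key trick is a `threading.Barrier` so all threads start hammering the shared object at once:

```python
import threading

def test_concurrent_appends():
    num_threads = 8
    iterations = 10_000
    shared = []  # swap in the object under test
    barrier = threading.Barrier(num_threads)

    def worker():
        barrier.wait()  # line the threads up so they overlap maximally
        for i in range(iterations):
            shared.append(i)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(shared) == num_threads * iterations
```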
Here are some recent examples I've come up with on projects I've added free-threaded support to:
I think we should make it easy to run a native profiler against CPython so you can see for yourself where the issue is. We have an open issue to better document this, as well as to add tooling to discover multithreaded scaling bottlenecks: Documentation and tooling for detecting multithreaded scaling issues and regressions · Issue #117 · Quansight-Labs/free-threaded-compatibility · GitHub
I also agree with @steve.dower that a built-in way to identify contention on resources that are protected by PyMutex locks would be neat.
Finally, we also opened an issue to track the need for documentation on debugging multithreaded performance in the free-threading guide: Add a multithreaded performance section · Issue #151 · Quansight-Labs/free-threaded-compatibility · GitHub
It might make sense to change the random module to use a thread-local Random instance by default, rather than a global one with thread-safe access, as a result of free threading, specifically to avoid this kind of contention - but I don't know if such a change would have unexpected or unacceptable side effects.
Generally speaking, any shared mutable state will continue to have contention, and code that has it should try to avoid sharing it across threads when possible to see the most benefits. This seems obvious to me, so I'm not the right person to know where explicitly stating this would be useful to other people, but this kind of guidance for those who haven't had to consider it before should definitely exist somewhere.
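As a purely hypothetical sketch of what a thread-local default could look like (this is not how the `random` module is actually implemented, and all names here are made up):

```python
import threading
from random import Random

_locals = threading.local()

def _thread_rng():
    # Lazily create one Random per thread instead of sharing one globally.
    rng = getattr(_locals, "rng", None)
    if rng is None:
        rng = _locals.rng = Random()
    return rng

def randint(a, b):
    # Module-level convenience function with no cross-thread contention.
    return _thread_rng().randint(a, b)
```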
In their current form, critical sections are stack-allocated, so they can't be opaque. That probably means you have to nail down at least some of the details of the current implementation. I'd be surprised if that didn't raise a few objections. It may be that the implementation is sufficiently obvious that people are happy to go with it, though.
That's certainly one approach. Like you, I don't know if this sort of change would be considered backward compatible, though.
An alternative that I'd be perfectly OK with is a twofold approach:

1. Reframe the `random` documentation to make using an explicit `Random` instance, and calling methods on it, the preferred approach, with the module-level functions being presented as convenience functions, explicitly not recommended in a multi-threaded context.
2. Make it easier to set up per-thread state (such as a per-thread `Random` instance) when using `concurrent.futures`.

Regarding the latter, the current pattern appears to be:
```python
import threading
from concurrent.futures import ThreadPoolExecutor
from random import Random

data = threading.local()

def init_state():
    data.rng = Random()

with ThreadPoolExecutor(initializer=init_state):
    ...
```
That's not exactly complex, but you need to look in a few places (the `threading` docs and the details of the `ThreadPoolExecutor` constructor) to find all the pieces. Plus, having to use a global variable to hold the thread-local values feels less than ideal. If moving shared global state to thread-local is likely to be a common pattern in free-threaded Python, it would be useful to make it easier. (It's a good practice even with the GIL, but having the GIL makes it less risky, so you can get away with not doing it - this is one of those "your code is already broken" cases that people with limited multi-threading experience won't necessarily think of.)
Some of these details don't have to be dealt with on day one, but IMO we should be striving to make people's initial impressions of free-threaded Python (once it's declared "ready to use") as good as we can manage. And a few quality-of-life changes could make a lot of difference.
For 2, I think contextvars – Context Variables – Python 3.13.2 documentation is preferred? (Although it's definitely a module I only became aware of recently.) It definitely feels like something that needs broader teaching around.
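For instance, a sketch of the per-thread RNG pattern using `contextvars` instead of `threading.local` (the `one_result` name just mirrors the earlier example; the worker body is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar
from random import Random

rng_var: ContextVar[Random] = ContextVar("rng")

def init_state():
    # Runs once per worker thread; each thread's context holds its own RNG.
    rng_var.set(Random())

def one_result():
    rng = rng_var.get()
    return rng.randint(1, 6) + rng.randint(1, 6)

with ThreadPoolExecutor(initializer=init_state) as exc:
    futures = [exc.submit(one_result) for _ in range(10)]
    print([f.result() for f in futures])
```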
I've been looking at reducing critical sections, system calls, and locking in BufferedIO (Ideas thread / proposal soon™, 3.15 timeline). That would mean a lot less overhead, and potentially no userspace / Python locking in read-only and write-only I/O (`stdin`, `stdout` and `stderr`). I know techniques and have tools like atomics to do what I need in the CPython C internals, but have no idea how I could implement the same thing in native Python (ex. `_pyio`). I know `_pyio` is contentious, but the delta in what can be expressed / capabilities is to me a problem. The ecosystem often grows faster once something is part of batteries-included, but it doesn't feel like the basics are there yet, and definitely not general directions / guidance with patterns for adopting.
Watching the PRs for free threading going through CPython modules, it's a lot of often intricate work relying on a lot of systems tools and general systems programming knowledge (ex. ThreadSanitizer, understanding lock contention, etc.). All of that is teachable if people really want to learn it, but for most Python code I write I really don't want to think about it. Recently things like the `cgi` vs. `http.server` modules have come up, and I don't know what the right thing for free threading is, nor how I'd measure why things are getting slower, or what tools would help me write better code and understand what is happening and why. It's great to have a range of options, but in terms of "I want to write simple Python code to handle HTTP requests" I don't have a clear picture of where I need my code to be headed, let alone where I would go from a simple "flask" app. CGI → WSGI I did quite a while ago. Do I do `asyncio` now? Free threading? Both at the same time? How do I make sure the performance I had before doesn't regress, especially in production code bases that often don't have comprehensive test and performance suites?
TL;DR: I think there need to be clear paths for why and how maintainers of systems get to free-threading (it might not be for a couple more versions), how ongoing app development can adopt it, and how new projects can write simpler/faster code from day one with it. All very solvable, but today both the tools and the education are at the very least not visible to me.
I know you don't want advice on how to improve your code, but that's not the pattern I would use. There are lots of valid ways to write most Python programs. Keeping data private to a thread doesn't require `threading.local`. In many cases it's simpler just to use local variables: montecarlo.py
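For illustration, a sketch of that approach (my numbers and helper names are invented, and this is not the linked montecarlo.py): give each worker a coarse chunk of iterations, and let it build its own `Random` and its own partial result as plain locals, merging at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from random import Random

def simulate(n):
    rng = Random()        # local to this call: nothing shared, no contention
    results = Counter()   # private partial result
    for _ in range(n):
        results[rng.randint(1, 6) + rng.randint(1, 6)] += 1
    return results

def main():
    total = Counter()
    with ThreadPoolExecutor(max_workers=4) as exc:
        # A few coarse-grained tasks rather than a million tiny ones.
        for partial in exc.map(simulate, [250_000] * 4):
            total += partial
    print(total)

if __name__ == "__main__":
    main()
```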
Regarding your other concerns about `Random`:

1. It's safe to call `random.randint` from multiple threads. (`random.gauss` is not thread-safe, but that's not new nor specific to free threading.)
2. I don't think we need to guarantee that `random.randint()` scales well across multiple threads before going from "experimental" to "supported". This same issue exists in Java. It also existed in Go for many years.

In general, if you're not thinking about scalability and don't try to optimize your code, I don't think you should expect your code to run efficiently.
> Reframe the `random` documentation to make using an explicit `Random` instance, and calling methods on it, the preferred approach, …
If we are considering reframing the official Python documentation so that the preferred approaches align with efficient free-threading practices, maybe we've gone beyond "Phase 1: experimental, not for production use"…