A fast, free threading Python

Ah, we may be talking at cross purposes. My understanding was that “people might get in trouble” by using raw threads, along with explicit locking, or more likely insufficient locking… And that the “new approaches” would offer safe alternatives that would avoid the need to use raw threads.

I’m very unclear where concurrent.futures and multiprocessing.dummy (note, dummy, which uses threads, not actual multiprocessing) fit into all this. They are higher level abstractions, sure, and they are what I use currently. But they aren’t new, and no-one seems to be saying “with nogil, I’ll be able to replace my current clumsy multiprocessing-based code with multiprocessing.dummy and get multi-core performance gains without the complexity” (which is what I meant by them being discussed as the nogil model).

1 Like

Ah, yeah. I suspect you are thinking more from the perspective of people writing extensions, and other C code? It seems likely that writing thread-safe C will be more complex, and higher-level tools would be very useful for those developers. But from my perspective, anyone doing this is a fairly advanced Pythonista.

Oh, I am absolutely saying this! I didn’t realize that wasn’t getting across. I use multiprocessing almost daily in my own code and via other tools [1], and it is clunky to process large/complex data that way. Every situation requires some experimentation to figure out if it’s even worth doing, and so I rely on intuition about whether I should even try (like, “does this likely save me X minutes for Y minutes of coding”?).

Most of my tasks are easily parallelizable, or at most involve some kind of scatter/gather pattern. Even then, subprocesses take a toll and I’d be glad to remove that complexity.


  1. edited to add: I use concurrent.futures these days when possible and it’s much nicer ↩︎

2 Likes

From what I understand about the Python ecosystem, a large proportion of users who use ProcessPoolExecutor will be able to switch to ThreadPoolExecutor with PEP 703 and get similar performance, or strictly better performance if they were passing a lot of data between processes. So new libs/high-level APIs might not be needed at all to do correct and safe multithreading.
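
To illustrate, a minimal sketch of that swap (the function and data are stand-ins; the parallel speedup assumes a free-threaded PEP 703 build):

from concurrent.futures import ThreadPoolExecutor

def do_some_stuff(data):
    return data * 2  # stand-in for real CPU-bound work

some_data = list(range(10))

# unlike with ProcessPoolExecutor, nothing is pickled:
# worker threads share the caller's objects directly
with ThreadPoolExecutor() as executor:
    results = list(executor.map(do_some_stuff, some_data))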

5 Likes

That’s my perspective as well. I’m no threading whiz, so I’ve always stuck with the simplest possible parts of the threading module to achieve my desired aims, generally higher-level constructs like queues. (I seem to recall Aahz wrote a “threading for dummies” essay years ago. Maybe I’m thinking of his OSCON slides from around 2000.) I think the main difference between current GIL-mediated threads and free threads is that you might discover bugs in your code sooner with the new stuff.

1 Like

In a pre-PEP 703 world, a common pattern for achieving parallelism is as follows:

from concurrent.futures import ProcessPoolExecutor

def do_some_stuff(data):
    ...
    return result

some_data = [...]
executor = ProcessPoolExecutor()

# the function and data are pickled and sent to child processes
results = list(executor.map(do_some_stuff, some_data))
# results are pickled and sent back to the main process

However, pickling and unpickling are slow, and many objects can’t be pickled at all. This is where a no-GIL Python can be beneficial.

Now that we have a per-interpreter GIL in 3.12, imagine if there were a way to send ownership of objects between interpreters, without pickling or memcpy:

from concurrent.futures import SubinterpreterPoolExecutor

def do_some_stuff(data):
    ...
    return result

some_data = [...]
executor = SubinterpreterPoolExecutor()  # each subinterpreter has a thread

# data is sent to subinterpreters without pickling or memcpy.
# the main interpreter is blocked in executor.map with the GIL held,
# ensuring that data won't be touched in the main interpreter
results = list(executor.map(do_some_stuff, some_data))
# executor.map returns, data and results are sent back to the main interpreter,
# so we can do anything with data again

# it works well:
print(some_data, results)

If such a mechanism could be implemented, IMO the no-GIL Python would become unnecessary, so I personally prefer PEP 684 to PEP 703.

It might even be possible to release the GIL in SubinterpreterPoolExecutor.map, but a new ob_interpreter_id field would be required in the PyObject struct (which breaks the ABI), and it would have to be checked in every function call and every C API, in case another thread has transferred ownership. If an object has been moved to another interpreter, any operation on it would raise an exception. This may be simpler than PEP 703 and friendlier to the Faster CPython project, but I don’t know how large the performance impact would be.

4 Likes

Isn’t this already the case for subinterpreters, because subinterpreters share C extensions?

Honestly for my use case I wouldn’t mind completely removing the GIL, but adding another (transitional?) feature where threads take a global lock by default anyway. Because then I can leave “old style” threads alone but avoid acquiring the global lock in specific cases where I need better latency.

However I think there are good opportunities around stuff like actor model in a free threaded world, where the ownership is simplified by keeping mutable data inside an actor. (It seems hard to do actor scheduling on subinterpreters, because you can’t easily migrate actors across interpreters)
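
For example, a minimal actor-style sketch (names invented; the point is just that mutable state stays inside one thread and other threads communicate via a queue):

import queue
import threading

class CounterActor:
    # mutable state lives only in the actor's own thread; other
    # threads communicate by putting messages on the inbox queue
    def __init__(self):
        self.inbox = queue.Queue()
        self._total = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:  # shutdown sentinel
                break
            self._total += msg  # safe: only this thread mutates _total

actor = CounterActor()
actor.inbox.put(5)
actor.inbox.put(None)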

The big draw for me is removing the GIL acquisition time in a realtime environment. I talked a bit about it here. Python is mostly fast enough for me, but it struggles with tasks like realtime audio, input processing, and rendering in a single process. Subinterpreters feel more like web workers to me, where you need to manually split up your workload. They’re a completely different programming model and not quite as flexible. I think nogil today would remove my GIL related stalls (which currently cause me to drop frames for both input and UI) without me making any other changes.

Subinterpreters feel to me like “multiprocessing API, in a single process”. You need to initialize a new interpreter, import all of your libraries again, etc. It seems slightly more efficient than multiprocessing, especially with some shared memory, but neither multiprocessing nor subinterpreters really helps my main use case at all. I’m not CPU bound, I’m latency bound, and it’s difficult enough today to prevent GIL contention even in extension code I completely control. I’ve had to do a lot of engineering just to work around various things that hold the GIL for more than a frame.

(I think “I’ve had to do a lot of engineering to work around the GIL” is a common theme among authors of some high-performance C extensions; see e.g. an earlier comment from OpenAI.)

I want to schedule semi-realtime work on 4-8 cores of a consumer machine (<16ms deadline for any given GIL acquisition). With subinterpreters, I’d need to pay the import cost for my whole tree 4-8 times, schedule all of my work manually on specific interpreters, and there will be significant abstraction leakage to the users scripting my app. With nogil I can use my existing locking and let the OS scheduler do what it does best.

With free threading you can also implement new paradigms for Python, like a multithread async scheduler. I don’t see that happening with subinterpreters, due to the cost of moving objects and code between interpreters.

7 Likes

As usual, I suspect Cython counts as “somewhere in between”. If you’re just using Cython to manipulate Python objects then all the same “atomicity” guarantees should hopefully apply (so you can totally have race conditions, but you shouldn’t corrupt the Python interpreter state). If you’re accessing C types from multiple threads (and relying on the GIL to ensure that only one thread does it at once) then you’re in for a world of excitement.

The only caveat is that the GIL appears to be a stronger lock in Cython, since Cython won’t release the GIL itself every few instructions (unlike the Python interpreter), so I suspect some people misuse it as one. It’s never recommended, though, since it’s incredibly easy to trigger a small chunk of actual Python code and end up releasing the GIL unexpectedly inside a supposedly “locked” block.

I think Sam ended up submitting one fairly small Cython change to support his nogil fork, and beyond that it largely seems to work as-is without huge complaints (which obviously isn’t proof there aren’t multiple threading bugs lurking).

3 Likes

I’d like to explore this point a little. In my experience, the cases where the GIL has been a problem for me are where a single-threaded solution is too slow to be practical, and the GIL makes multithreading no faster. So removing the GIL won’t “just expose threading bugs I already have”, as I don’t have any threaded code to begin with. In that situation, my problem is knowing whether code I’d naturally write is correct at all. And threading bugs like race conditions are often non-obvious, so I could easily just end up with subtly wrong answers, not easy-to-spot crashes.

The following use case is far from critical, but I think it illustrates the point nicely. Suppose I want to calculate the probability of rolling a total on some dice, by simulation:

import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def sum_ndm(n, m):
    return sum(random.randint(1, m) for _ in range(n))

def calculate(n, m, count=100_000):
    pmf = Counter(map(lambda _: sum_ndm(n, m), range(count)))
    return pmf

def calculate_threaded(n, m, count=100_000):
    with ThreadPoolExecutor() as executor:
        pmf = Counter(executor.map(lambda _: sum_ndm(n, m), range(count)))
    return pmf

This is a trivial conversion of a single-threaded approach to a thread pool, and if the single-threaded approach was unusably slow (for example, a count of a million, and a more complicated “calculate the result” function than sum_ndm), it’s very likely what I would use. Does it have any race conditions? I don’t know. And if I’m using it to estimate probabilities that I don’t know up front, I wouldn’t be able to say the answers “looked wrong”.

There’s no global state here, so my instinct says this is safe. But am I right? :sweat:

And what if instead of using a Counter I wanted to store the results in a data structure? For simplicity, say I had a global results dictionary, and sum_ndm was modified to

results = {}

def sum_ndm(n, m):
    result = sum(random.randint(1, m) for _ in range(n))
    results.get(result, 0) += 1

Is that safe? A developer not familiar with the collections module might pick this approach first.

This is where I imagine nogil tempting users into writing unsafe (i.e., wrong) code, and I’m not as comfortable as some people seem to be with classifying the problem in pure Python code as “just exposing bugs your code already has” or “not needing any extra expertise in threading”.

To be clear, this is not artificial - it’s a simplified example of a genuine real-world problem that I absolutely would have used threading for if the GIL hadn’t prohibited it. And I would have boasted to a friend about how Python made solving this really easy, compared to the complicated C++ program that they wrote to do the calculation fast (but single threaded). So we’re talking high stakes here - do I get publicly embarrassed by writing incorrect code? :smile:

8 Likes

You’re calling random.randint which has global state. It may be more efficient to have each thread establish its own random.Random instance, perhaps seeded from urandom, which I think is what happens after Python forks for multiprocessing.

Not entirely sure what your last line should be there, as it’s illegal in Python currently. But let’s say you do that with a defaultdict:

import random
from collections import defaultdict

results = defaultdict(int)

def sum_ndm(n, m):
    result = sum(random.randint(1, m) for _ in range(n))
    results[result] += 1

I would expect and assume that this is unable to corrupt internal state, and I would hope that the addition is performed atomically when using native integers. (What happens if you use dict[x] += [123] is a separate question and I wouldn’t demand atomicity there.) However, I would say that this is potentially a performance bottleneck; rapidly mutating global state is an inefficient way to collect thread data. So if this turns out to cause massive locking, I would be somewhat unsurprised.

An alternative strategy would be to isolate the threads a bit more: have each thread produce its own local counter, then sum those counters at the end. But your original (just count at the end) should be safe, with its only problem being that the final step of counting is single-threaded.

1 Like

As far as I can tell, that last operation is atomic in current Python threading, but not safe in nogil-3.12. I think these examples only show that we will need new and explicit contracts that define what is safe in a free-threading Python world.

2 Likes

I think it’s not thread safe today, as it’s not atomic.

The dis output below shows the BINARY_SUBSCR and STORE_SUBSCR.

It’s the classic read-modify-write that requires a lock.

>>> import dis
>>> def a(x):
...    x[0] += 1
...
>>> dis.dis(a)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (x)
              4 LOAD_CONST               1 (0)
              6 COPY                     2
              8 COPY                     2
             10 BINARY_SUBSCR
             20 LOAD_CONST               2 (1)
             22 BINARY_OP               13 (+=)
             26 SWAP                     3
             28 SWAP                     2
             30 STORE_SUBSCR
             34 LOAD_CONST               0 (None)
             36 RETURN_VALUE

8 Likes

Thanks. Doing that with concurrent.futures seems hard, in the sense that I don’t know how I’d do it. And managing my own pool of workers is a whole other level of messiness. That’s essentially my point here: you very quickly start to need to know how to do “clever threading stuff” to parallelise existing code, meaning nogil isn’t necessarily a “quick win” for non-experts.

Which suggests that straightforward attempts to exploit GIL-less threading in pure Python code will work well enough. That’s reassuring, as it means that trying to get a “quick win” does get you something that’s at least OK.

There’s actually a much more serious (potential) issue with this code, that I ignored for the purposes of the example. If I have 8 threads running, and they get exactly even access to the RNG, then each thread is effectively sampling every eighth value from the RNG. And without knowing the precise properties of the RNG you are using, there’s no assurance that every eighth value of a RNG will be statistically random. Combining the results into a single stream again might be enough to address this issue, and in any case there’s no guarantee that Python’s RNG is subject to this problem - or that the splitting is that even, for that matter. So for actual statistically-valid calculations, you do need at least an independent RNG per thread. Which means maintaining per-worker state, which isn’t something I’d expect the average user to know how to do (I had to look it up - I think I’d use a combination of the initializer argument to the executor, and threading.local for worker-local storage).
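
Concretely, I think it would look something like this (an untested sketch; tls and init_worker are names I’ve invented):

import random
import threading
from concurrent.futures import ThreadPoolExecutor

tls = threading.local()

def init_worker():
    # runs once per worker thread: each thread gets its own
    # independent, urandom-seeded generator
    tls.rng = random.Random()

def sum_ndm(n, m):
    return sum(tls.rng.randint(1, m) for _ in range(n))

with ThreadPoolExecutor(initializer=init_worker) as executor:
    results = list(executor.map(lambda _: sum_ndm(3, 6), range(1000)))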

This is an instance of a more general issue, which is that for most of the stdlib, there’s no documentation on how the functions behave in a threaded context. That question might become more important if threading becomes more attractive for non-specialists writing pure Python code.

In practice, the results of using a single RNG are likely good enough - but as with a lot of this, the problem is that the failure mode is “subtly wrong answers” rather than an easy-to-debug error. In my actual use case, though, I’d probably YOLO it and use a single RNG.

In principle, of course, but does CPython switch threads in the middle of these operations? Atomic might have been a bad choice of word, but this is exactly the kind of contract, currently enforced by the GIL, that needs to be spelled out in the future.

1 Like

I think that’s the problem with any free-threading approach. I don’t think Python should ever define a contract about what is and isn’t atomic. That kind of low-level detail should be considered an implementation detail, not something that every programmer would be required to understand in order to work on some codebase that depends on it.

Defining atomic operations contract has a lot of undesirable impacts:

  1. Making an atomicity contract restricts the kinds of optimisations and future changes that can be made by alternative Python implementations and by future CPython.
  2. I doubt that the vast majority of Python users, even those who regularly use multithreading, would ever understand all the subtleties of implicit atomicity contracts.
  3. Currently, bytecode is considered an implementation detail in Python. Any contract defined by how a current bytecode construct behaves is a non-starter.
  4. Code that relies on an atomicity contract for correctness is brittle. Just moving code around and performing common refactorings could break the thread correctness of the code.
  5. Such bugs are practically impossible to debug.

The vast majority of developers should really just use explicit synchronisation with futures, queues, or even mutexes and their friends, rather than learning what is and isn’t atomic. I don’t think atomic-operation contracts have a place in Python.
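
To make that concrete, this is the kind of explicit synchronisation I mean (a sketch; record is a made-up helper):

import threading
from collections import defaultdict

results = defaultdict(int)
results_lock = threading.Lock()

def record(result):
    # an explicit lock makes the read-modify-write correct under the GIL,
    # free threading, or any other implementation, with no atomicity contract
    with results_lock:
        results[result] += 1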

2 Likes

Yes and yes. From my understanding, a naive approach might result in inefficiencies (in the worst case, basically just serializing all your threads), but not major problems.

But maybe this would be a good use for thread-local storage. Python has this available, but IMO it’s not very well known about (see “threading — Thread-based parallelism” in the Python 3.12 documentation). The ThreadPoolExecutor can take an initializer callable, too, so that might be all that’s needed:

import random
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

counters = []
my_counter = threading.local()
def init():
    my_counter.counter = Counter()
    counters.append(my_counter.counter)

def sum_ndm(n, m):
    # each thread tallies into its own thread-local counter
    result = sum(random.randint(1, m) for _ in range(n))
    my_counter.counter[result] += 1

def calculate_threaded(n, m, count=100_000):
    with ThreadPoolExecutor(initializer=init) as executor:
        executor.map(lambda _: sum_ndm(n, m), range(count))
    pmf = sum(counters) # does this work? can't remember
    return pmf

That should give decent parallelism, with each one using its own separate counter.

That’s a very good point, but I don’t think it applies to the Mersenne Twister. However, proper use of thread-local should allow an easy conversion from random.randint to my_counter.randint with the initializer doing my_counter.randint = random.Random().randint, keeping the code almost as clean as it had been. (Though possibly, for performance reasons, it would be worth capturing that in a function-local - that wouldn’t affect correctness but I don’t know how fast thread-local storage is.)

2 Likes

All the ceval loop does is periodically check whether it should hand the GIL to another thread; in older CPythons that happened after every N bytecodes, and since 3.2 it is time-based (the switch interval). Unless it changed recently, there is no code in CPython to do what you have assumed, i.e. to ensure atomicity at the Python source level.

I assume that all the work in no-gil is to make each bytecode atomic, not things at the Python source level.
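
For reference, that switching behaviour is visible and tunable from Python:

import sys

print(sys.getswitchinterval())  # 0.005 seconds by default
sys.setswitchinterval(0.001)    # ask CPython to hand off the GIL more often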

Neat, I hadn’t thought of keeping a global list of the workers’ counter objects. That solves the question I had of how to get the results back.

Although in trying to think about this (and specifically about why my_counter.counter[val] += 1 is safe to use) I gave myself a really bad headache trying to understand what removing the GIL actually did :face_with_head_bandage: (Spoiler: the GIL doesn’t actually protect as much as you think it does, especially if you tend to think of individual bytecodes as indivisible chunks of work in the first place - which they aren’t, of course…)

2 Likes

It may be worth collecting some handy idioms and giving recommendations. I know there are some in the thread-local storage source code as linked from the docs, but if you don’t already know that thread-local is what you want, you wouldn’t think to look there.

This will work as long as you give it an initial empty Counter to start with (otherwise it starts with 0 and complains)
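
In other words (with Counter imported from collections):

pmf = sum(counters, Counter())  # an empty Counter as the start value, instead of the default 0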

1 Like

Ah thanks. Anyhow, the idea is to minimize the work done in the single-threaded “gather” phase at the end, by having each thread individually count in a lock-free way.

1 Like