How to know what is safe in threaded code

If you could guarantee that nobody else has a reference to the object, you could do this without any semantic changes. But that guarantee is impossible for a global, since - by definition - other code can always reach it through the module's namespace.

No, their example explicitly had a variable called local_var for the attribute lookup optimization.


Either way: the point is that the semantics of Python are pretty clear, all in all. A specializing interpreter/JIT can sometimes optimize within those bounds, but because of how dynamic Python is, this is potentially a lot of effort. It definitely isn't as easy as in statically typed languages, and trying to apply that framework is not going to work well.

It’s worth noting, btw, that none of this requires threads. You can just as easily have this sort of behaviour using any other way of interrupting the code, such as signals:

import signal

go = True

def stop(*a):
    # Signal handler: flip the flag so the loop below exits.
    global go
    go = False

signal.signal(signal.SIGUSR1, stop)

# Busy-wait until the handler runs, e.g. after `kill -USR1 <pid>`
# is sent to this process from another shell.
while go: pass

print("stopping")

A JIT needs some leeway to actually optimize the code and get the desired benefit. If there is no such leeway, then the optimization potential of the JIT is (very) limited. All that this Discuss thread and the parent thread are asking for is that this be a conscious decision. With no-GIL adoption (if it happens), the implementation details regarding these things will surface and become much more important and relied upon in real code. I think it could still be OK to specify them properly now (with potential "breaking" changes), but not once lots of real code starts depending on them.

The signal handler is dispatched to by the Python runtime at well-specified points. A Python JIT compiler will have to emit "is a signal handler pending?" checks and then dispatch to the handler. As part of the dispatch, it can "repair" any temporary inconsistencies or re-read any state that may have changed.
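
In pseudo-Python, the compiled form of the busy loop above might conceptually look like the following. This is only a sketch of the idea; signal_pending and dispatch_signal_handlers are invented names standing in for the runtime's internals:

cached_go = go                       # value hoisted into a register
while cached_go:
    if signal_pending():             # check emitted by the JIT
        dispatch_signal_handlers()   # may run stop() and rebind go
        cached_go = go               # "repair": re-read hoisted state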

Ok, what would you actually suggest then? I haven’t seen an example from you where specifying more lenient behavior is a good idea.

  • No, you can't hoist global/attribute lookups out of a loop body without continually checking that nothing relevant has changed (see the sketch after this list). This matches the current behavior. Would you argue that this shouldn't be the case? If yes, how exactly would you formulate the new semantics?
  • No, you can't inline a function without continually checking that it hasn't changed.
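
To make the first point concrete, here is the kind of transformation that today's semantics forbid without a guard. The names are made up for illustration:

limit = 100

def count_up():
    i = 0
    while i < limit:    # current semantics: a fresh global lookup on
        i += 1          # every iteration, so a concurrent rebinding
    return i            # of limit is observed

def count_up_hoisted():
    _limit = limit      # invalid optimization: read once and cached,
    i = 0               # so a rebinding of limit by another thread
    while i < _limit:   # or a signal handler goes unnoticed
        i += 1
    return i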

Note that I wouldn't be opposed to a decorator or something like that which specifies a change in semantics for that one function (similar to numba.jit/numba.njit). But these changes shouldn't be the default.

Well, yes. That well-specified point is “between any two Python bytecode instructions”. At least, that’s how it is in CPython. (I think it only checks every 100 instructions or something, but it could happen between any pair.)

So how does it know what inconsistencies to "repair" in that way? How would it know what might have changed? Whatever strategy you use, this basically amounts to the same thing as any other definition of volatility: you have to check whether it has changed. There's no getting away from that.

I wouldn't be sure that this isn't already the behavior with the soon-to-be-merged templated JIT. If the global read is compiled down to some checks plus an actual memory read, and the loop into a simple conditional jump, then the CPU is free to reorder the memory read and is not required to keep the value it reads from its L1 cache consistent with other CPUs' caches. That applies regardless of GIL or no GIL. No, I take that back: a GIL switch implies a memory barrier.

Yes, you must make sure that it does not change. But you do not have to keep checking; it can be done by other means, e.g., if a function changes, you discard any compiled code that relies on it being constant. If it changes in another thread, you wait for the other threads to reach a point where they can safely transition from the compiled code that makes this assumption back to the interpreter, and only then change the function. While Java may seem less dynamic than Python, this scenario applies to it too (functions can be changed at runtime), and that's how it's solved there.
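
A minimal sketch of that invalidation idea, in Python terms. The version-counter scheme and all the names here are made up for illustration and are not CPython's actual mechanism:

current_version = {}    # name -> version, bumped on every rebinding
compiled_for = {}       # name -> version the fast path was built against

def rebind(namespace, name, new_func):
    # The "write" side: bump the version so every piece of compiled
    # code that assumed name was constant fails its next guard.
    namespace[name] = new_func
    current_version[name] = current_version.get(name, 0) + 1

def call(namespace, name, fast_path, *args):
    # The "read" side: one cheap guard at entry instead of a
    # re-check at every use inside the compiled code.
    if compiled_for.get(name) == current_version.get(name, 0):
        return fast_path(*args)        # assumption holds: stay fast
    return namespace[name](*args)      # deoptimize: generic call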

Yes, the question is what the default should be. Strict memory ordering is a strong requirement needed only by a tiny fraction of code. Should that be the default, or should the relaxed behavior be the default, with decorators to signal otherwise?

The default should be no surprises. Functions being called out-of-order [1] is a surprise. Attribute and variable access are just fancy function calls.

So I guess yes, the default should require memory barriers, unless the JIT can prove that omitting a memory barrier can never have an observable effect, for example because global changes are only checked after a complete JITed sequence.


  1. the order is well defined in the spec, although probably not clearly/concisely

Modern CPUs already do things out of order. They ensure that on the current CPU the state appears consistent, but they do not make that guarantee for other CPUs sharing the same memory, because the CPUs do not share L1 caches and it would be too expensive to always synchronize them. To avoid this, one must use instructions that force the CPU to synchronize the caches after every memory access, which is significantly more expensive. CPUs chose this design for a reason, and many language runtimes follow it for a reason. I am not saying that Python must follow too, but it should seriously consider the pros and cons. "We want no surprises" is a valid option, but the benefits and costs should be weighed carefully. I would say that "no surprises" wasn't applied to the C API, for example, because it was deemed to be only for experts who know what they are doing. I think that shared mutable state is also only for experts, not only in Python but in other languages too.

AFAIK function calls (incl. indirect ones) have no special meaning in this equation; they do not imply memory barriers, so we can just forget about them. There is a stream of instructions that one CPU executes, and at some point it will write something to memory location A and then later to memory location B. Another CPU may be reading A and B around the same time. The other CPU can see those writes in any order unless the CPUs use fences (memory barriers). As @encukou pointed out, a GIL switchover implies a barrier, so for now that's fine. With no GIL it may not be, and the problem will be intensified by specialization in the interpreter and by JIT compilation: right now the code executing between "write A" and "write B" is probably still rather complex and will likely contain some unrelated memory barriers, but once it's more specialized and even JIT-compiled it may be just a few instructions, which is the whole point of JIT compilation in the first place. This doesn't even consider any optimizations where the JIT moves anything around; a simple templated non-optimizing JIT can already have this problem.
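
In Python terms, the scenario is the classic two-writes example below. Today the GIL's implied barriers mean the reader cannot observe the writes out of order; on a free-threaded build running on hardware with relaxed ordering and no barriers, it could:

import threading

a = 0
b = 0

def writer():
    global a, b
    a = 1    # write A
    b = 1    # write B; program order says A comes first

def reader():
    # Without a barrier, observing b == 1 would not guarantee
    # that the write to a is visible yet.
    if b == 1:
        print("a =", a)    # relaxed ordering could print 0

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()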

I think the discussion of optimizers and JITs is misplaced.
The zeroth rule of optimization is “don’t change the behavior”, where behavior means semantics and observable output. Making it faster is allowed, obviously.

As for behavior with free threading, things may change a bit.
I think the only sensible semantics for free threading is sequential consistency.
Sequential consistency says that the observable behavior is consistent with all threads being executed by a single CPU, interleaved in some order.
Note that this is not the same as the observable behavior with the GIL, as the GIL is very coarse-grained; rather, it matches a variant of the GIL interpreter that switches threads as randomly as it can, with as fine a granularity as the semantics allow.

To specify sequential consistency, you need a complete list of which operations are atomic, and which are not, so we know where it is legal for the hypothetical single CPU to switch threads. That’s the hard part in terms of specifying the semantics.
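
For example, such a list would have to settle whether something as simple as counter += 1 is atomic: it compiles to a separate load, add, and store, so a hypothetical single CPU that switches threads between those steps can lose updates. (Whether the race is observable in practice depends on where the current implementation places its switch points.)

import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1    # load counter, add 1, store counter

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If the increment were guaranteed atomic, this would always
# print 400000; with switches allowed mid-increment, it can
# come up short.
print(counter)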

There is also the issue of timeliness, or fairness: how long can one thread run before it is expected to switch to another thread?
Given the example, and assuming reading and writing global variables is atomic:

from time import sleep

a = False

def thread1():
    while not a:    # spins until it observes thread2's write
        pass

def thread2():
    global a
    sleep(1)
    a = True

Sequential consistency alone would allow thread1 to run forever. So we need some degree of timeliness, or fairness. This is harder to specify, but as long as no thread starves it shouldn’t matter too much.
Given all that, we would expect thread1 to terminate after about 1 second.

As for moving variable reads out of a loop, or function inlining, it is legal as long as it doesn’t change observable behavior. Like any other optimization.

Python already supports instrumentation (PEP 669 and sys.settrace) which can be turned on at any time.
We need to be able to handle instrumentation being turned on in one thread and immediately (within the bounds of allowed sequential consistency and fairness) see events for other threads.

If we can handle that, we can handle global variables, function inlining and loop invariant code motion.
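
As an illustration, with PEP 669's sys.monitoring (Python 3.12+), enabling LINE events from the main thread has to start producing callbacks for a loop that is already running in another thread. A rough sketch, assuming a current GIL build:

import sys
import threading
import time

events_seen = 0

def on_line(code, line_number):
    # Fires for every executed line once LINE events are enabled,
    # including lines of the worker's already-running loop.
    global events_seen
    events_seen += 1

def worker():
    deadline = time.monotonic() + 0.5
    while time.monotonic() < deadline:
        pass

t = threading.Thread(target=worker)
t.start()
time.sleep(0.1)    # let the worker start spinning first

TOOL = sys.monitoring.DEBUGGER_ID
sys.monitoring.use_tool_id(TOOL, "demo")
sys.monitoring.register_callback(TOOL, sys.monitoring.events.LINE, on_line)
sys.monitoring.set_events(TOOL, sys.monitoring.events.LINE)

t.join()
sys.monitoring.set_events(TOOL, sys.monitoring.events.NO_EVENTS)
sys.monitoring.free_tool_id(TOOL)

print("line events observed:", events_seen)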

Why do you think that it is the only sensible semantics? Other approaches are used in practice elsewhere.

Because none of the others are sensible :smile:

Seriously though, a few reasons:

  • I think other approaches put too much burden on the programmer. Anything more complex than sequential consistency is too hard to reason about for most of us, most of the time.
  • Anything else would be too big a jump from the current semantics, and would break too much code.
  • Traditionally, Python has always favored ease of use and simplicity over performance. Sequential consistency fits better into that tradition than something like the Java memory model.

Note: All the above assumes we aren't considering anything even more restrictive than sequential consistency, like CSP, but the acceptance of PEP 703 suggests to me that we aren't.

Yes, agreed, but people should just avoid it unless necessary. That's been the case in systems that do not provide sequential consistency. Many Java or C# developers are oblivious to these intricacies; they just know that using shared state is a no-no in general, and if they really need to use it, they remember a few easy-to-understand patterns. Only a few people (mainly the VM devs) need to really understand the Java memory model, and C# didn't have a proper memory model for a long time and got by with informal patterns that are safe.

CPython developers will have to deal with this anyway, because CPUs do not provide sequential consistency, so they'll have to map the CPUs' model to whatever CPython wants to have, be it sequential consistency or not.

In some sense, I think that the pragmatic approach of C# would be simpler for both sides: CPython development and Python users. Ensuring sequential consistency will be hard, and there will be subtle bugs in the implementation. Learning a few safe patterns that are very well implemented may not be such a big ask for Python users, compared to days of debugging what turns out to be an internal bug. Those safe patterns are more likely to be implemented correctly, because they map reasonably well to the CPU model, as opposed to emulating full sequential consistency.
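
For example, the kind of pattern people actually memorize is "don't share state, communicate through a thread-safe queue". A minimal version using queue.Queue, which is documented as thread-safe:

import queue
import threading

q = queue.Queue()    # safe to use from multiple threads

def producer():
    for item in range(5):
        q.put(item)
    q.put(None)       # sentinel telling the consumer to stop

def consumer():
    while (item := q.get()) is not None:
        print("got", item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()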

Are you aware of any "free threading" language that guarantees sequential consistency? The performance implications of having to ensure it are huge. I think it basically means disabling the L1 cache, at least for data that can potentially be visible to other Python threads, and proving that something is thread-local is hard in a language with mutable objects and calls with unknown effects, such as uninlined Python calls, calls to complex runtime functions (unless somehow annotated), or calls into extensions (I am looking at you, numpy.array.__getitem__ and friends that like to be used in hot loops).

Yes, "favored", but I don't think it has always been a hard requirement, so alternatives are a possibility. As long as the alternatives are seriously considered and the answer is still sequential consistency, then it's all fine. Have you considered whether the current soon-to-be-merged JIT plus no-GIL combination provides sequential consistency, for example? Also note that using shared state without high-level synchronization is something that the majority of users should avoid even with sequential consistency; it does not prevent you from getting other subtle things wrong.

So, to summarize my conclusions from this thread so far:

  1. There is no specification or documentation for how threads and concurrency work in Python.
  2. People have very different views on what can be considered correct and allowed by the implementation.
  3. The current implementation's behaviour can make long bytecode sequences atomic, removing many race conditions, even though the core developers do not want to guarantee this, and these sequences might stop being atomic at any point in the future.
  4. All of the above is true even if we completely ignore the future free-threading/no-GIL work.

Given this, I think it’s fair to say that it’s very challenging for an average user to write correct concurrent code using threads. It seems like threading should be regarded as a somewhat experimental feature and that it’s mainly useful for two categories of users:

  • People who are intimately familiar with both the CPython implementation and the internals of the standard library (and can keep up with any relevant changes in these).
  • People for whom correctness is not critical and some amount of concurrency bugs can be accepted.

This leaves out a large part of the user base and I think it would be great if the usability of this feature could be improved in the future.

Well, that’s just nonsense. Threading has been in Python for an incredibly long time and is used in production by people all over the world.

That reading is much too harsh to be accurate. It’s relatively easy for moderately experienced Python programmers to write correct threaded code today.
An awareness of the GIL is important, perhaps essential, but for simple real world problems with IO-bound tasks to which Python threading is well suited today, I’ve never found that the semantics were unpredictable.

This thread has delved into discussions of free threading, which is a possible future behavior and is not the current behavior of the interpreter.

All of the original examples in this thread were clearly fine, and nothing should have given you the impression that they were in any way unclear in behavior. One of them (a lock as an attribute of a shared object) could in theory become interesting to discuss if there were a custom __getattr__ in use, but as presented in the example it’s 100% clear what the behavior will be.
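
For reference, that pattern is presumably along these lines (a reconstruction for illustration; the original example isn't quoted in this thread):

import threading

class SharedThing:
    def __init__(self):
        self.lock = threading.Lock()    # a lock stored as a plain attribute
        self.value = 0

    def increment(self):
        # self.lock is an ordinary attribute lookup; with no custom
        # __getattr__ in play, the behavior is entirely unambiguous.
        with self.lock:
            self.value += 1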

Well, for them what matters much more is what all the other libraries are doing (that is, the glue should have at least one easily usable level) :rofl:

If you use 3rd party libraries, then yes, of course you need to care about what they do. That’s the case irrespective of any questions about threading.

This whole thread has departed significantly from the OP’s original question about what thread safety the language offers, which is an even tighter set of things than the full stdlib.

Limiting ourselves to the core language + the threading stdlib package, I’m not aware of any ambiguous cases. If there are any, they can be clarified. The original three which were asked about are borderline trivial to reason about, and I’m really confused about why so much ink is being spilled over this.