And it’s no harder to write correct code, it’s just no longer as easy to get away with “mostly correct” code.
I absolutely agree with that. But it requires that it is well documented and well understood what you can rely on and what you have to take care of yourself. Without that, exactly what you already mentioned in your post happens:
It’s way too easy to get the impression that you should just slap locks round everything to ensure safety. But then, you’re just reimplementing a GIL for yourself (and almost certainly implementing it far worse than the actual GIL did).
Usually with locks or (preferably) higher-level concurrency APIs but otherwise, yeah, one of those two options.
Strictly speaking there is no answer, since there is no memory model. IMO that isn’t very useful though, since as a practical matter, in fact almost by definition, you can rely on all these.
Locks must guarantee writes are seen atomically, in order to work at all. Similarly for synchronisation primitives built on top of them such as Event, and higher-level APIs such as mp.Queue. So YMMV, but I really wouldn’t worry if you use them.
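As a sketch of that advice (the payload and names here are illustrative, not from the discussion): handing a result between threads through `queue.Queue` gives you both the hand-off and the visibility guarantee in one step, since the queue's internal lock orders the `put` before the matching `get`.

```python
import queue
import threading

q = queue.Queue()  # the queue's internal lock orders the put before the get

def producer():
    result = {"answer": 42}   # arbitrary illustrative payload
    q.put(result)             # everything written before put() ...

t = threading.Thread(target=producer)
t.start()
item = q.get()                # ... is visible after the matching get()
t.join()
print(item["answer"])
```

The same reasoning applies to `mp.Queue` across processes, where the serialization step makes the ordering question moot.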
These claims confuse me quite a bit. I understand there are no guarantees, but then I’m still missing how to properly implement the check in this scenario:
x = 1
T1: x = 2
T2: waits in a loop until x == 2
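To make the scenario concrete, here is a hedged sketch (names are illustrative) that replaces the bare polling loop with a `threading.Event`, since `set()`/`wait()` carries exactly the visibility guarantee discussed above:

```python
import threading

x = 1
changed = threading.Event()   # signals "x has been updated"

def t1():
    global x
    x = 2
    changed.set()             # publishes the write of x along with the flag

def t2(out):
    changed.wait()            # blocks instead of spinning on x == 2
    out.append(x)             # observes x == 2 here

seen = []
b = threading.Thread(target=t2, args=(seen,))
a = threading.Thread(target=t1)
b.start()
a.start()
a.join()
b.join()
print(seen)                   # [2]
```

With a plain `while x != 2: pass` loop instead, the question of whether T2 ever sees the write is exactly the memory-model question this thread is circling around.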
That was a variation of what I have asked in my initial post. So you can also look at the beginning of this thread.
From what I understood, there is no problem with your code without any additional measures on your part. If x is, for example, a class variable and you assign a value to it in one thread and check its value from another thread, the only outcomes you should get are that T2 sees either the modified or the unmodified value, depending on timing. So if you missed the changed value in one loop iteration, you should get it in the next one.
This is at least also my experience from extensive testing. However, it has to be said that proving something correct in a multi-threaded context by testing is rather difficult, or even almost impossible, if the “duration of the race” is very small.
What has confused me lately in the discussion are the emerging doubts about visibility. I know this problem from other languages, but I have no clue how Python behaves here. Problems of this kind are even harder to test because, to my understanding, you will see them more often on systems with more than one CPU socket, due to caching and memory architecture.
But why does this work?
In C or Rust, x is typically a single memory location.
But in Python, x and the int() that is assigned to it involve many memory locations. The CPU can and does reorder reads and writes, so other threads may see only some of the changes but not all of them.
How come this does not lead to crashes?
I have to assume that memory barriers are used internally to prevent crashes, since crashes are not being reported.
In the end I think Python must publish a memory model to allow reasoning about this.
That is indeed an interesting question which I’m unable to answer, because knowledge of the internal implementation details of Python (CPython) would be necessary.
The only reassuring thing is that there are similar discussions for C++; however, as that is much more low-level, finding the answer there is a lot easier IMHO.
In that SO question the OP is hoping that a single instruction turns into an atomic operation, and it does not. `x++` is a read-modify-write sequence and is not synchronized across CPU cores.
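The Python analogue of that read-modify-write problem is `counter += 1` from several threads. A sketch of the straightforward fix with `threading.Lock` (thread and iteration counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # makes the read-modify-write one atomic step
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)              # 40000 with the lock; without it, updates can be lost
```

Removing the `with lock:` line can silently lose increments, because two threads may both read the old value before either stores its incremented result.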
Yes, but the interesting things are the explanations for that. The thread gives a good impression of the complexity of the topic, as in the following quotes. For Python, I think it is assured that a single bytecode operation is atomic, and since thread switching, which can only occur at bytecode boundaries, involves storing and restoring the current execution state, I would assume this also ensures visibility between threads (at least with the GIL).
So `lock add dword [num], 1` is atomic. A CPU core running that instruction would keep the cache line pinned in the Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.
If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don’t misalign your atomic variables!
The BINARY_OP bytecode (+=) cannot be assumed to be atomic. It calls the `__iadd__` or possibly `__add__` method of the given object, or possibly `__radd__` and so on. If implemented in Python, that means it would execute many bytecodes, but these methods can also be implemented in C. Even just looking up which of the methods to call is not atomic, since it involves multiple lookups. With the GIL, it might be the case that the GIL is held while doing those lookups, but I doubt you will find written guarantees about that, and it certainly does not apply under free-threading.
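You can see that a single `+=` is not a single interpreter step by disassembling it. A small sketch (CPython-specific; exact opcode names vary between versions, e.g. BINARY_OP in 3.11+ vs. INPLACE_ADD earlier):

```python
import dis

class Box:
    def __init__(self):
        self.n = 0

def bump(box):
    box.n += 1   # one source line, several interpreter steps

dis.dis(bump)    # shows separate load / add / store instructions

# The one-line += expands into multiple opcodes: load the object, load the
# attribute, perform the add (which may call back into __iadd__/__add__),
# then store the attribute again. A thread switch can land between any two.
ops = [i.opname for i in dis.get_instructions(bump)]
print(len(ops))
```

Any of the gaps between those opcodes is a place where another thread can read or write `box.n`, which is why `+=` on shared state needs a lock.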
With proper OS threads there is no assurance that thread switching only occurs at bytecode boundaries, with or without the GIL. The += operation above, if called with e.g. a numpy array as an argument, could release the GIL and allow another thread to run in the middle of that bytecode.
Any talk of bytecodes is CPython-specific, and CPython does not even make implementation-specific guarantees for anything about bytecodes (a future version could remove them entirely). Python code typically calls into code from other languages (C etc.), which can do anything, so I don’t think it is useful to think in terms of bytecodes for thread guarantees here at all.
The GIL is tightly integrated into CPython’s interpreter loop. The loop processes Python bytecode and periodically checks for events such as thread switches or signals. In Python 2 this periodic check was controlled by a “check interval,” a counter that decremented with each bytecode instruction executed; since Python 3.2 it is a time-based “switch interval” (5 ms by default). When it expires, the current thread releases the GIL and another thread gets a chance to be scheduled.
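The switch interval is exposed through the `sys` module, so you can inspect it or shrink it to provoke more frequent thread switches when stress-testing racy code. A minimal sketch (the 0.005 s value is CPython's documented default; 0.0005 is just an illustrative stress value):

```python
import sys

default = sys.getswitchinterval()
print(default)                  # typically 0.005 s in CPython 3.x

sys.setswitchinterval(0.0005)   # request switches ~10x more often (stress testing)
sys.setswitchinterval(default)  # restore the default afterwards
```

Shrinking the interval does not create races, but it makes existing ones much more likely to fire during a test run.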
The discrepancy here is that a bytecode can itself require executing multiple bytecodes. The GIL may guarantee that it only switches at a boundary, but when you write a function, you can’t assume the boundary it switches at is one within your function, rather than in a data-model method of something passed to and used by your function.
There are some non-guaranteed things that make this less visible and “sometimes just work” (or at least not fail visibly immediately), but the part that’s actually safe to rely on here isn’t changing.
Yes, the implementation of namespaces (dicts, in this case) takes great care to use appropriate memory fences and atomic operations, such that you can never see a partial write or stale memory from mixing reads and writes in multiple threads using the same dict. The same is true for lists and other built-in types. The C-level mechanisms are the atomic operations declared in Include/cpython/pyatomic.h (and defined in platform-specific headers like Include/cpython/pyatomic_gcc.h, with convenience wrappers in Include/internal/pycore_pyatomic_ft_wrappers.h). If you look through Objects/dictobject.c you can see all the places where acquire/release (like _Py_atomic_load_ptr_acquire and _Py_atomic_store_ptr_release) or sequential consistency (like _Py_atomic_store_ptr) are used.
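A small sketch of what that buys you at the Python level: concurrent writers to one dict never produce a torn or corrupted structure. With distinct keys per thread (counts here are illustrative), the final contents are fully deterministic:

```python
import threading

shared = {}

def writer(tag, count):
    for i in range(count):
        shared[(tag, i)] = i   # each thread writes its own keys

threads = [threading.Thread(target=writer, args=(t, 1000)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared))             # 4000: every write landed, structure intact
```

Note what this does and does not guarantee: the dict itself stays consistent, but compound operations on it (read-check-write sequences) still need your own locking.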