How to know what is safe in threaded code

Hi,

I’m trying to figure out where in the documentation I should look to know what guarantees I can rely on when writing threaded code.

As examples of questions I want to be able to find answers for:

Example1:

stuff = {}


def thread1(event: Event):
    sleep(0.5)
    stuff[5] = "cool"
    event.set()


def thread2(event: Event):
    event.wait()
    print("got some new stuff!", stuff[5])

Can I rely on that event.set() implies a memory barrier so that stuff[5] is always visible and consistent when read from thread2?

Example 2:

class Thing:

    def __init__(self):
        self._thing_lock = Lock()

    def maybe_do_scary_things(self):
        with self._thing_lock:
            self._scary()

    def _scary(self):
        # do things with self and attributes/member of self not accessed elsewhere
        pass

If maybe_do_scary_things is called from multiple threads, under what assumptions is it safe to access self in maybe_do_scary_things to get the _thing_lock attribute and lock it? Can code in _scary make access to self a race condition here? For example, what if other attributes are added or removed from self? Is this a bad pattern in general and should I always add one layer of indirection here and keep the lock on a non-shared object to be sure I don’t have a race?

Example 3:

a = False
b = 1

def thread1():
    sleep(1)
    b = 2
    a = True

def thread2():
    while not a:
        pass
    print(b)

I think this is obviously a race and I should not expect thread2 to do anything meaningful, certainly I should not expect it to always print 2. The problem I have is that I have not been able to find something in the documentation that clearly states that this is indeed a race and what can happen if this code were to be executed.

I’m mostly interested in pointers to where I can find answers to questions like these in the future. I’ve read the threading module documentation but it unfortunate says very little about how threads actually interact and what concurrency guarantees different python objects and synchronization primitives offer.

AFAIK, in current CPython, all of the above is correct, works as you’d expect, and has no race conditions.
At most one thread is active at once, and switching the active thread invokes a memory barrier (through e.g. pthread_mutex or Windows CriticalSection). The GIL makes things safe.

(If the GIL is removed, the documentation you ask for will need to be written. That would be a development question about current state of an experimental feature, not about using Python; best discussed in the parent topic.)

2 Likes

I don’t really follow why the exact implementation choice removes the need for documenting the current behaviour? I still need to know what is safe and what is not and that seems difficult to do without documentation as highlighted by the “AFAIK” in your answer.

Re. the detailed examples, if example3 is considered valid code then that would imply that all assignments are atomic. Is that really true? Others have said that depends on exact bytecode sequences, presence of exceptions and maybe other things I have forgotten.

If assignments really are always atomic then where does it break down exactly? What operations are guaranteed to be atomic?

If you assume that the C level code of python works correctly then what is needed is to know what operations are atomic at the python level.

I wonder if the answer is there are no atomic operations in python.

Objects shared between threads will always need a lock protecting read-modify-write and read-compare. Only objects that are never changed while the threads are running can be read without a lock.

Some objects may implement the locking internally and free the python programer from needing a lock, Queue() being an example and its documented as being thread safe already.

Its false I would expect.
Its clearly false if you mean assignment statements.
For example will not treat the call to the loading of c to call func and tuple assign to a and b atomic.

a, b = func(c)

Example 3 works because it reads a often to notice a become false.

However if thread1 was this:

def thread1():
    global a # missing in original version
    sleep(1)
    b = 2
    a = True
    a = False

It is not possible to know if thread2 will read a at a time it is True.
So long as the 2 assignments to a are not optimised away then thread2 must have had both values visible to it.

Common optimization compilers do is to move reads out of loops. It is important classical compiler optimization technique. Can an optimizing JIT compiler for Python move the read of a out of the loop, turning the code into:

    value = not a
    while value:
        pass

GCC, Clang, Java JITs, .NET JIT, all would do this optimization. Will ever Python want to be able to do this optimization? Note that this is illustrative example, take this code:

for i in range(len(my_list)):
    if my_list[i] == GLOBAL_CONSTANT:
        # ...

can an optimizing JIT compiler move the read of global variable GLOBAL_CONSTANT out of the loop? I think it is clear how that can be beneficial.

You are considering optimization here. Why not in the other cases? Standard optimizing compilers would be not only free to remove them, but also to reorder them (and so is the CPU). Classical example is:

a = 0
b = 0

def thread1():
    global a
    global b
    a = 1
    b = 2

def thread2():
    print(f"{a=}, {b=}")

can this print “a=0, b=2”? On Java, .NET, C/C++ it can. Those runtimes chose to specify this as permitted behavior, so that they can emit fast and cheap CPU instructions that store into memory. If “a=0, b=2” was a disallowed result, then they would have to emit way more costly CPU instructions that would ensure the order of the writes (and the reads too)

Probably, at least in theory. But the reference implementation of Python is nothing like an optimizing JIT compiler. Maybe PyPy can do it. But usually alternative implementations of Python try fairly hard to get the same behaviour as the reference implementation, as much as possible.

And so they are stuck with whatever happened to be the reference implementation behavior at the point when people started adopting some feature en masse. That behavior is arbitrary. I argue that it should be thought through and carefully designed with the future in mind and independently from whatever CPython happens to be doing at this point in time. Some time from now, the new CPython JIT will be in the very same position as alternative Pythons with JIT compilation are now. JIT compilation is nothing new; there is no magic that will allow CPython JIT to get away with this when numerous JIT compilers for various programming languages (including Python!) could not in the last decades. Having to ensure order of all (possibly) shared memory reads and writes (global variables, attribute accesses, dict lookups, …) would prevent many important compiler optimizations and would require expensive CPU instructions.

While at first sight, this looks like an issue that the no GIL folks should be concerned about, I believe that it should be of concern to the JIT folks as well /cc @markshannon

There are real issues with the C API for extensions exposing implementation details that it didn’t have to. I assume it was just the most straightforward approach to say that the specification is what the reference implementation happens to be doing. Now no GIL needs ABI4 because of that. That’s the same story. Does it make sense to repeat it again?

The GIL does not have to be removed for this to be an issue. Take current GIL threading + optimizing JIT compiler that can move reads out of loop and this example:

a = False
def thread1():
    while not a:
        pass

def thread2():
    global a
    a = True

it will already behave differently to what many people would intuitively expect and most crucially to how the current reference implementation happens to behave, which is all people have in the absence of documentation/specification.

Note that I am not advocating that Python users should study complex memory models and start writing some complex lock free algorithms in Python. It should be made clear that shared state is dangerous and should be avoided. Regular developers should use locks and other higher level abstractions. Few high performance/concurrency libraries authors should get some basic primitives to force some guarantees for reads/writes – there are such things in other languages, so it doesn’t need to be reinvented.

You keep assuming that “moving reads out of the loop” is a perfectly acceptable optimization. I don’t see why. Lookups of globals are NOT able to be optimized in that way. It would break a huge amount of code if you start making assumptions and then “optimizing” on that basis.

There is already a well-known idiom for triggering such an optimization, and that’s stashing the global in a local.

a = False
def thread1(a=a):
    while not a:
        pass

If the programmer did not say this, then Python should repeatedly check the global. I don’t see why you think that this is a mere optimization, when it’s such a massive semantic change.

2 Likes

What about this:

for i in range(len(my_list)):
    if my_list[i] == local_bar.foo:
    # rest of the loop does not touch local_bar.foo and doesn't call any user functions

can the attribute read be moved out of the loop? (Given that my_list is builtin list)

It is in many other systems (C#, Java, C++, …), so I think it is reasonable option to consider. In Python we don’t know, because there’s no specification.

Take this example:

def helper(l, i, x):
    return l[i] == x.foo

# ...
for i in range(len(my_list)):
    if helper(my_list, i, x):
    # rest of the loop does not touch local_bar.foo

should people be manually inlining the helper function, so that they can move the read out of the loop? That is typically the job of on optimizing compiler. Compilers end up doing many obvious optimizations that “one could just do manually”, because those optimization opportunities arose only once functions are inlined.

Can you point to any documentation that says so?

No, it can’t.

class Bar:
    _val = 0
    @property
    def foo(self):
        self._val += 1
        return self._val

def search():
    my_list = [42, 7, 3, 123, 4]
    local_bar = Bar()
    for i in range(len(my_list)):
        if my_list[i] == local_bar.foo:
            print("Found one!", my_list[i])

While this isn’t technically part of the discussion, it is notable that you’re using very non-Pythonic style here. Perhaps you’re expecting Python to behave like C does, with all its foibles, instead of treating Python as Python? for item in my_list would be a much better way to do this than iterating over range of length.

In any case, though: no, you absolutely can NOT move the attribute read above the loop. If the programmer wants to do that, it can be done with, you guessed it, an explicit pre-loop lookup.

Is it really? Do none of those languages support dynamic attribute lookups? Because that’s the only way to be sure. So either that statement is wrong, or those languages simply lack a feature that Python does… and, guess what, is very much part of the specification.

Lookups into globals are implemented as subscripting of the namespace dictionary. And you can use any object you like as your namespace (though a plain dict is, of course, the most common). Go read the docs on how namespaces work. It is entirely legal to put whatever code you like into that lookup.

There is not a document saying “you may not do X, Y, or Z”, because the things you’re asking are consequences of other facts. You will not understand this unless you first take the time to actually understand the way Python works.

Stop assuming that Python is just some other language with different syntax. It isn’t, and you’re coming at things completely the wrong way. Your starting assumptions are the problem here, not the documentation.

The difference is that those language are statically typed (mostly). You can very reasonable collect a lot of knowledge of what variables can be and what can change those variables. This just is not a valid assumption for python. Maybe you should look into something like mypyc, numba or cython, those redefine some semantics of python to make assumptions like this valid.

I wouldn’t be surprised if the specializing interpreter one day would want specialize the lookup to something like a class check, perhaps a dictionary version check and then a read from the dictionary. Then the template JIT would compile these bytecodes to simple CPU instructions for doing just that and then the CPU could move them around, unless the compiler emits expensive instructions that would prevent the CPU from doing that. Optimizing JIT, which is afaik planned, may also want to move them around.

The idea is that abstractions like attribute lookups are removed during the specialization and then compilation with inlining. If you overload array subscription operator in C++, then the C++ compiler won’t compile it like a read from builtin array. In C++ this is known statically. With dynamic languages this is done through runtime specialization.

But you can specialize to some known implementations as explained above.

In orther languages I have a way to tell the compiler that its not safe to optimise in this way. For example marking a variable as volatile or atomic. Python does not have this so the optimisation will be invalid.

Doesn’t it already do that? Or does it only do that for globals? Either way, this is only a valid assumption if the corresponding objects don’t overwrite __getattr__ or have a special __dict__ mapping. You can’t make this assumption before actually executing the code, only for the specializing interpreter. A JIT compiler compliant with python also has to respect this.

yes, but you can move (if it was permitted) that assumption check and the read before the loop

Assuming there is no other thread that magically modifies the object. You would have to somehow guarantee that there are no other referrers to the object, and the underlying class object. So I don’t see how that could be permitable without major changes to python’s semantics.

At that point, it’s actually solved the threading problem too. The benefits become far less (you can check the dict version and then reuse what you have, rather than looking up every time - but you still have to check every time), but if it’s worth the complexity, it can be done - as long as semantics don’t change.

This is true. But the nearest equivalent in C++ would be to have all Python object references done with a PyObject *thing and using thing->whatever for everything… and we’re right back where we started, since you can’t know what it’s capable of. (Every method is virtual, in C++ terms.)

Optimizing compiler could do escape analysis to find out that the object does not escape to other code or thread. Classes can have something like version. These are common approaches that, for example, high performance JavaScript VMs do.