What is the Long-Term Vision for a Parallel Python Programming Model?

First, let me just say that I don’t write a lot of concurrent code; it’s not what I do and I’m not very good at it. Because of this, I have very little faith in my own ability to understand what’s going on, and I need a very clear specification to have any chance of understanding what I’m doing. For example, let’s say I’ve come up with this code:


from queue import SimpleQueue
from time import sleep

stuff = {}


def e1_thread1(work_q: SimpleQueue):
    sleep(0.5)
    stuff[5] = "cool"  # write to shared state...
    work_q.put(5)      # ...then publish via the queue


def e1_thread2(work_q: SimpleQueue):
    while (item := work_q.get()) is not None:
        print("got some new stuff!", stuff[item])

Can I be certain that sufficient synchronization has happened so that the new item is present and consistent in the stuff dict? I would guess so, but theoretically the queue could have been written to only synchronize the items in the queue and nothing else. In reality it’s difficult to write lock-free queues, and I would assume that it’s very unlikely that a lock was not taken and released while adding to the queue. That should be sufficient to make the update to the stuff dict visible as well. So while I can guess that this is intended to work, I can’t actually point to a specification that says it’s guaranteed; I can only test it and hope for the best.
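
To be concrete, by “test” I mean running a throwaway harness along these lines (the None sentinel is just there to stop the consumer):

from threading import Thread

q = SimpleQueue()
t1 = Thread(target=e1_thread1, args=(q,))
t2 = Thread(target=e1_thread2, args=(q,))
t2.start()
t1.start()
t1.join()
q.put(None)  # sentinel: tells e1_thread2 to stop looping
t2.join()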

While the above is probably fairly clear-cut, we can make incremental changes to the code and try to figure out when exactly I would run into dragons.


from threading import Event
from time import sleep

stuff = {}


def e2_thread1(event: Event):
    sleep(0.5)
    stuff[5] = "cool"
    event.set()


def e2_thread2(event: Event):
    event.wait()
    print("got some new stuff!", stuff[5])

Let’s swap the queue for an Event. Here it’s not as obvious to me that some underlying lock is being taken. Maybe it’s lock-free? Maybe it’s some OS thingamajig? I really don’t have a clue. So can I still rely on the dict update always being visible? I guess I can test the code and hope for the best…

from time import sleep

stuff = {}
flag: bool = False


def e3_thread1():
    global flag
    sleep(0.5)
    stuff[5] = "cool"
    flag = True


def e3_thread2():
    while not flag:
        pass  # spin until the write (hopefully) becomes visible
    print("got some new stuff!", stuff[5])

Let’s go further and just use a boolean flag; what more could we need, right? Is this still correct, and can I rely on the dict update always being visible? I guess I can test the code and hope for the best…

Today, on my machine, using Linux with Python 3.10.12, temperature of 20C and a relative humidity of 32%, all of the above “works”. Will any of them work tomorrow? Who can say? Since I don’t have any clear specification available, I can only conclude that all of these examples are equally good (or bad).
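
For completeness, the only variant I would feel somewhat comfortable writing guards all the shared state with an explicit lock, in the shape of a Condition. A sketch (even here I can’t point to a documented visibility guarantee):

from threading import Condition
from time import sleep

stuff = {}
cond = Condition()
ready = False


def e4_thread1():
    global ready
    sleep(0.5)
    with cond:  # writes happen while holding the lock
        stuff[5] = "cool"
        ready = True
        cond.notify()


def e4_thread2():
    with cond:  # reads happen while holding the same lock
        cond.wait_for(lambda: ready)
        print("got some new stuff!", stuff[5])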

Just for fun we can go further and do:

text = ""


def e5_thread1():
    global text
    for i in range(100000):
        text += "a"


def e5_thread2():
    global text
    for i in range(100000):
        text += "b"

and when I test this it does indeed produce a string of length 200000. Amazing! I guess I can conclude that I never need any form of synchronization in Python; everything just works! That must be why none of this seems to be documented anywhere…
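
(“Test” here again just means a harness like this:)

from threading import Thread

t1 = Thread(target=e5_thread1)
t2 = Thread(target=e5_thread2)
t1.start()
t2.start()
t1.join()
t2.join()
print(len(text))  # 200000 on my machine today; nothing I can point to guarantees it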

So, to summarize: in other languages (I have mainly done C++) concurrency is still very hard, but at least I have a specification that tells me what I need to do and what I can expect. If I keep things simple enough, I can just about figure out how to write a working program and convince myself that I know why it works.

In Python I need to rely purely on intuition and testing. I am not smart enough to produce working threaded code this way, and there is no way I can convince myself that I know why something works or that it will keep working tomorrow.

(Sorry for the long post…)

1 Like

This is getting off topic. If you have questions about how to write concurrency-safe code, then ask in Python Help. I’m actually not sure what the “topic” of this “idea” is in general, so unless someone wants to clear that up and keep things on track, it’s probably time to close this discussion. The general answer seems to be “we don’t even have the stuff merged yet, let’s figure that out first” and “it’s fine for libraries to experiment and provide frameworks, as we discover what is or is not needed in Python itself”.

3 Likes

I was thinking mostly the same.

The problems discussed with the GIL are because it hampers parallelism with the current threading model.

If we set the threading module aside for a bit (that is, how things have been done so far), we could come up with a parallelization model that keeps the VM state consistent and also allows saturating the available cores with parallel computations.

1 Like

The really key part missing from that idea is the enormous amount of expertise and work necessary to make it feasible.

Apologies for the off-topic post. I have posted a question in Python Help if anyone is interested in discussing this further (How to know what is safe in threaded code)

Please don’t close. The question in the title is still being discussed here, and still relevant.

1 Like

Right, this “bubbling up” is unclear to me too.
To me it seems that the idea behind PEP 703 is to only avoid the fences/atomics when it can provide sequential consistency in another way. But if that’s the case, it would be nice to say so explicitly. @colesbury?

For what it’s worth, I consider the examples you posted right on the money.
These questions cannot be answered without a clear understanding of what the Python memory model should be in a free threaded world.

The reason I opened this discussion was to understand whether people had previously thought about this, and what their vision is for bringing the ecosystem from Python-with-a-GIL to a free-threaded Python.

There are other issues raised here, like parallel loops etc., which I consider somewhat less important because they can indeed be solved by the ecosystem. The memory model, though, i.e. the underlying considerations of what constitutes a correct program, is something that needs to be solved at the language level. And yes, this will take a lot of expertise and time to get right.

Given that many people in the Python community will not have a good understanding of the many issues which can arise in a free-threaded environment, I think it’s best to start from two ends:

1. Teach the community about common pitfalls when dealing with multiple threads

Some examples:

  • need for locks around public variables and adding getters/setters to handle concurrent access (see the sketch after this list)
  • need for code to be reentrant
  • being aware of deadlocks and lock contention
  • dealing with new types of exceptions related to not being able to acquire a lock
  • clearly documenting threading capabilities of the code, e.g. which objects can be shared between threads
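
As a sketch of the first bullet (Counter is just an illustrative name, not a proposed API):

import threading

class Counter:
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    @property
    def value(self):
        with self._lock:  # reads go through the lock too
            return self._value

    def increment(self):
        with self._lock:  # the read-modify-write is one critical section
            self._value += 1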

2. Gather feedback about what kind of tooling and syntax would improve working in a free-threaded environment

Some examples:

  • adding support and syntax for managing locks at the statement level, e.g. for critical sections
  • adding data types which support automatic locking, e.g. a thread-shared dictionary and list (see the sketch after this list)
  • adding support for marking functions/methods/objects as only usable by a single thread (private to that thread)
  • adding syntax for parallel loops, hiding away the threading logic (perhaps using configurable executors and for x in data using executor: ...)
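
As a sketch of the automatic-locking idea from the second bullet (LockedDict is a hypothetical name; a real design would need to cover the full mapping API and define iteration semantics):

import threading

class LockedDict:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def __getitem__(self, key):
        with self._lock:
            return self._data[key]

    def __setitem__(self, key, value):
        with self._lock:
            self._data[key] = value

    def setdefault(self, key, default=None):
        with self._lock:  # atomic check-then-set, unlike a bare dict plus an if
            return self._data.setdefault(key, default)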

Whatever is added to Python should then also become available as part of the C API, since extension writers will have the same issues to handle.

Since 2 can only happen once there’s a way to run free-threaded Python, perhaps it’s good to focus on 1 at this point and collect this information somewhere (e.g. the Python wiki or a separate repo on GitHub).

2 Likes

I’d also add here something along the lines of

3. Reassure the community that parallel programming is no harder than it was previously

All of this discussion, and the original debates over removing the GIL, demonstrate that a significant proportion of the wider Python community really doesn’t think very hard about multithreading. And things genuinely do (mostly) “just work”. Providing more information on threading pitfalls, discussing memory models and code reordering guarantees, and telling people that the things they fear GIL removal might expose them to are actually existing problems they could hit right now, will discourage a lot of people from parallel programming.

To give a concrete example, the following is a (somewhat stripped down) version of a program I’ve written. It follows a very common pattern that I use - split up my task into batches and run them in a concurrent.futures thread pool.

import time
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

import httpx

LOC = Path("metadata")    # output directory; set up elsewhere in the full script
ALL_TODO: list[str] = []  # the wheel URLs to fetch; built elsewhere in the full script

def get(url):
    tries = 0
    while True:
        try:
            r = httpx.get(url)
            return r.content
        except httpx.TimeoutException:
            # Retry on timeout, giving up after 10 attempts
            tries += 1
            if tries > 10:
                print(f"Failed to get {url}: skipping")
                return None
            time.sleep(0.5)


def get_metadata(url):
    filename = url.rpartition('/')[-1]
    content = get(url)
    if content is None:
        return
    with ZipFile(BytesIO(content)) as z:
        for n in z.namelist():
            if n.endswith(".dist-info/METADATA"):
                (LOC / filename).write_bytes(z.read(n))
                break
        else:
            (LOC / filename).touch()

def main():
    BATCH = 10000
    print(f"Downloading {len(ALL_TODO)} files")
    with ThreadPoolExecutor(max_workers=30) as executor:
        for start in range(0, len(ALL_TODO), BATCH):
            TODO = ALL_TODO[start:start+BATCH]
            for _ in executor.map(get_metadata, TODO):
                pass

if __name__ == "__main__":
    main()

And to be absolutely clear - this program works. It does the job I want, and multithreading speeds it up significantly. If there are any threading issues or race conditions, I’ve never hit them, and even if I did, this is an ad-hoc script and I can just rerun it.

Should I have thought about threading issues? Absolutely not! I didn’t even think much about the risks of a HTTP call failing - the retry feature was added because the first version kept failing with HTTP timeouts and retrying fixed the issue quickly and simply. Why should I think more about threading than about the core functionality of the code, which is fetching a URL?

I’m very much in favour of adding more precision over what operations in Python are thread safe. And I’m absolutely on board with the idea of developing new, safer and easier to use parallel processing models. But for casual use, Python is already simple and convenient for parallel processing, and it’s important to keep that in mind as well.

6 Likes

There is no way to teach about pitfalls if there is no specification. The canonical example:

import time

class Worker:  # hypothetical enclosing class, added for completeness
    def __init__(self):
        self.keep_looping = True

    def thread1(self):
        while self.keep_looping:
            pass
        print("Done")

    def thread2(self):
        time.sleep(10)
        self.keep_looping = False

may be a pitfall or not, depending on the specification of what the Python language guarantees or doesn’t guarantee w.r.t. the visibility of writes made on one thread to other threads. If there is no specification, the behavior may depend on the current HW architecture, whether the function was JIT-compiled, whether it was running in the tier 0 or tier 1 interpreter, etc.

Based on that, Python may need to expose some low-level facilities like volatile variables and/or explicit barriers, and then users will have to be taught about them. However, in the spirit of @pf_moore’s post: I think that practice from other languages shows that “ordinary” users prefer to rely on higher-level constructs and libraries/frameworks that are simple to use and reason about. These low-level constructs will still be necessary for the authors of those higher-level libraries and frameworks: they will need them, and/or will need to know what guarantees Python the language, as opposed to CPython X.Y.Z on x64, gives them.
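
For instance, a higher-level rewrite of the canonical example above could replace the plain flag with threading.Event (a sketch, keeping the same 10-second shape):

import threading
import time

class Worker:
    def __init__(self):
        self.stop = threading.Event()

    def thread1(self):
        self.stop.wait()  # blocks instead of spinning on a plain attribute
        print("Done")

    def thread2(self):
        time.sleep(10)
        self.stop.set()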

3 Likes

I’m sort of in the opposite boat. I avoid threading as much as possible because I worry too much about details that probably don’t matter, but which I can’t ignore because there is no documentation telling me what’s safe. If I just write some one-off script for myself where I can easily verify the result, then that doesn’t really matter. But when I write any sort of serious code that others will use, I really don’t want them to have to debug strange threading issues, or worse, get incorrect results that they don’t notice.

I would agree with regards to multiprocessing; that is usually convenient and easy to understand (provided one avoids the whole fork-with-threads debacle and sets a working spawn method), so for all problems that map to it, the user experience is quite good. For problems that need to be handled by threads in the same process, I’m much less convinced…

1 Like

1. Done is printed after 10 seconds.
2–5. thread1 and thread2 run in parallel without any need for synchronization.
6. You should handle IndexError exceptions because of popping the list items in thread4.
7. You should handle KeyError exceptions because of deleting the dict keys in thread4.

That’s the intended behavior as described in the code. If there’s a difference in behavior, it indicates a flaw in the implementation.

I believe the best practice is to avoid making explicit synchronization promises and instead guide the user to handle exceptions where necessary.
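
For the list example, that guidance could look something like this sketch (process() is a stand-in for real per-item work):

def consume(shared_list):
    while True:
        try:
            item = shared_list.pop()  # EAFP: don't check len() first, that races
        except IndexError:
            break                     # another thread emptied the list
        process(item)                 # hypothetical per-item work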

I think that there are 3 memory models that need to be defined:

1. The memory model seen by the Python programmer. @smarr has been asking about this model.

2. The memory model seen by the C-API programmer. We also need to make C-API users aware of the rules for using the C-API in a free-threaded environment.

3. The memory model that the CPython implementation itself implements. Is this the same as the C-API memory model?

That’s all reasonable and good to have.

But I don’t think that endlessly discussing memory models and race conditions will lead to a no-GIL CPython any time soon.

I think that if the change were just an option to disable the GIL (or make it a no-op), then many developers and researchers could start experimenting with an effective and safe parallelism model for Python.

In other words, we don’t need to define everything to know how a no-GIL CPython behaves. It’s possible to disable the Python-program-facing GIL and just “see what happens”, through experimentation.

I have no idea what such a change entails, but even just a python --no-gil option would enable everyone’s experiments and findings.

Again, we don’t need to understand all the possible concurrency/parallelism models, scenarios, and outcomes before being able to experiment with a no-GIL CPython.

Corrupting the Python VM with a few IPython instructions would be an interesting and important outcome on the way to a model of what’s needed.

P.S. Linux resolved significant amounts of contention in the kernel core using spinlocks a long time ago.

I’m confused. Are you unaware of PEP 703, linked in the first post? This isn’t a theoretical discussion about what a disabled GIL might look like. There’s a prototype, a provisionally accepted PEP, and PRs have been merged to make progress on implementing it.

1 Like

Today, spinlocks in Linux user space do not work well. A spinning thread can be descheduled whenever the scheduler decides it is time to switch to another thread, which means the spinlock can then be held for a very long time.

There is ongoing work to allow spinlocks to work in Linux user space, but it’s not available today.

I’m not confused. I thoroughly read the original post and the accompanying blog post.

Or maybe I am confused by all the discussion that followed…

There are several parallelization models that can be tried in Python programmer space once the GIL is gone.

I mentioned spinlocks as an example of some success in the history of an operating system.

There’s an interesting proposal for locks with local and shared state in the original post.

But not in the user space where Python runs.

1 Like