What is the Long-Term Vision for a Parallel Python Programming Model?

With PEP 703 and various ongoing projects to bring just-in-time compilation to CPython, I would like to ask what the long-term vision for a parallel Python programming model is. For context, I am asking this as an academic programming-language implementation researcher who has been working on enabling efficient implementations of dynamic languages that support shared-memory parallelism and concurrency.

Having very recently looked at the precise details around the global interpreter lock, I would be very curious to hear what people think a no-GIL programming model should look like. To my understanding, PEP 703 takes a very implementation-oriented perspective and gives only a few details on the supported programming model. Throwing things like just-in-time compilation into the mix, I could imagine a number of different directions the Python community might want to take.

One direction might be similar to the likes of Java and C++, where the language specifications define memory models that focus on low-level memory accesses, the semantics of memory operations, and their observability. The main benefit of this approach, to me, is that it enables aggressive optimizations, which might be desirable especially if there is at some point a desire for a highly optimizing just-in-time compiler.

Another direction could be more focused on high-level abstractions and the guarantees one might want to have. For instance, PEP 703 ensures a certain level of container thread-safety, e.g., by making sure that the length of a list accurately matches the number of elements in the list. Here, I would imagine that the goal would be to maintain some level of guarantees to ease the migration of existing Python code to a no-GIL world. It might also be a way to be more precise about what high-level “pythonic”/idiomatic code in a no-GIL world is supposed to be. As one might imagine, giving high-level guarantees will come at a potential cost, either in terms of performance or in terms of the implementation complexity needed to ensure these guarantees.
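For illustration, here is my own sketch (not an example from the PEP) of what that container-level guarantee means in practice:

import threading

lst = []

def add_many():
    for i in range(1_000):
        lst.append(i)  # append itself never corrupts the list

threads = [threading.Thread(target=add_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The reported length matches the actual number of elements, under both
# the GIL and a free-threaded build; only the element order is unspecified.
assert len(lst) == 4_000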

With the complexity of the Python language itself, around objects, built-in collection types, and so on, I would be curious to hear which “guarantees” and which “idioms” the community thinks might be desirable for a parallel Python programming model.

6 Likes

All the race conditions that free-threaded Python (not called no-GIL) has are also present in GIL builds of Python, but they are harder to trigger.
As I understand it, free-threaded Python is designed not to corrupt its state when more than one thread is active.

2 Likes

Given our high-level language nature and Zen of Python ethos, we already lack a spec for things like our memory model, let alone the language itself. The guiding mantra is more of an implementation one: pure Python code should not cause the VM to crash. Very much “don’t corrupt the Python VM state”, as Stefan said. I do expect whatever we wind up with to be based on what makes the transition easy.

Much of what you don’t see laid out in PEP 703 is already left up to the implementors, based on real-world testing of what is practical to achieve.

The reality we’ve already been living with as we evolve the project for a very long time is that existing code must be assumed to depend on all existing behaviors, documented or not. (Hyrum’s Law)

All the race conditions that free-threaded Python (not called no-GIL) has are also present in GIL builds of Python, but they are harder to trigger.

Well, that’s part of the problem. Some of them are impossible to trigger. The implementation details give “guarantees” that a “specification” might not, but because it’s observable behavior people may intentionally or unintentionally rely on them.

And my question is basically: which of the race conditions that free-threaded Python has are desirable, and which are not?

Do you have a specific example, or is this speculation?
(I don’t doubt that such a use case may exist)

Do you have a specific example, or is this speculation?
(I don’t doubt that such a use case may exist)

Since Python 3.10, only function calls and backwards jumps may lead to a GIL release, which means all other bytecode sequences are guaranteed to be atomic.

Please see this for a detailed account:

This is a misleading phrasing. There are no guarantees here. Just observations of how the runtime appears to work. A guarantee would be something that the language specification (or even the CPython implementation) promises to do in future versions. There is no such guarantee here, explicit or implied.

2 Likes

I missed that change.
Yes, that makes a lot of races harder or impossible to hit.
Does that imply async code and free threading will need special care in ceval.c, I wonder?

No, ceval.c is fine – or at least, don’t worry about it, it’s under control.

It’s user code that may have accidentally started relying on these “guarantees” (which are not guarantees at all) that needs special care – latent race conditions will become actual, common race conditions if you enable free-threading for your code.
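To make that concrete, here is an illustrative sketch of such a latent race: a single statement that, per the 3.10 observation quoted above, runs atomically under the GIL, but that is still a read-modify-write and loses updates under free-threading.

import threading

counter = 0

def worker():
    global counter
    for _ in range(100_000):
        counter += 1  # no calls, no backward jumps within the statement,
                      # so no GIL release mid-increment; free-threaded,
                      # two threads can read the same old value and both
                      # write back old + 1

t1 = threading.Thread(target=worker)
t2 = threading.Thread(target=worker)
t1.start(); t2.start(); t1.join(); t2.join()
print(counter)  # 200_000 under the GIL; possibly less when free-threaded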

Agreed, this change has sidestepped the need for locks for some code, which was not the case before 3.10.

Not quite. The need is the same (if you care about forward or backward compatibility, which you should). You can’t so easily prove that need by experimenting. But there are many things that aren’t easily ascertained by experiments. (Stefan’s blog post actually mentions some things that aren’t so easily borne out experimentally, in sections 4 and 5.)

2 Likes

For traditional, run-of-the-mill scientific computing [1], experience from OpenMP, Cython, and numba tells us that we should be able to cover a large part of use cases with just a few high-level constructs: a “parallel for” that is reasonably flexible (static/dynamic schedule, block sizes, number of threads), critical sections, and barriers. So I would hope that we could have a similarly limited/simple/easy high-level interface for free-threaded processing right in the core language.
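For concreteness, here is roughly what those three constructs look like when spelled out by hand with today’s threading module (my own sketch; the hope is that a high-level interface would hide all of this):

import threading

N_THREADS = 4
lock = threading.Lock()                 # critical section
barrier = threading.Barrier(N_THREADS)  # barrier

def worker(chunk):
    partial = sum(i * i for i in chunk)  # the "parallel for" body
    barrier.wait()                       # all threads meet here
    with lock:                           # one thread at a time
        results.append(partial)

results = []
# Static schedule: one strided chunk per thread.
chunks = [range(tid, 1_000, N_THREADS) for tid in range(N_THREADS)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))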


  1. i.e. excluding ML/AI and GPGPU ↩︎

2 Likes

Whoa. How would you add a parallel for to Python? Syntactically, of course,

parallel for var in seq:
    <block>

is simple to add (using a new “soft” keyword parallel).

But how would you implement it? Would it have to turn <block> into a function that is somehow executed in parallel? And how would you specify block sizes, thread count, etc.?

Or are you just proposing some new stdlib functions (possibly in the threading module), where the user is responsible for converting <block> into a function? (I wouldn’t call that “the core language” though, hence my – hopefully premature – worry about language proposals.)

3 Likes

I would like to see such implicitly parallel blocks, yeah. There is a lot of prior art in OpenMP’s #pragma directives and, closer to home, in Cython’s parallel() block and prange(). Those are a good gauge for the complexity that inline parallelisation would bring, which is considerable. On the other hand, this is a very successful model of parallelism because it makes writing parallel code almost as easy as writing serial code, with far fewer mental shifts than explicitly splitting out and joining threads.

So it would be my hope that OpenMP-style directives could be a stretch goal for the far future. In terms of style, they feel very “Pythonic” to me. There is probably a better line of sight on e.g. a native parallel map primitive, which would also be vastly useful. Any primitive that says “do this, but in parallel”, without having to explicitly think about creating, starting, and joining threads.

1 Like

The closest there would be concurrent.futures.ThreadPoolExecutor.map(), which takes a function and maps it over some data. It is currently limited by the GIL, but once we have free-threading, it won’t be, so you get instant parallelization without having to care about threads. (There’s a multiprocessing variant too, but it pickles the data.)
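A minimal usage sketch (crunch is a placeholder workload I made up):

from concurrent.futures import ThreadPoolExecutor

def crunch(n):
    # Placeholder CPU-bound work; this is what only truly runs in
    # parallel on a free-threaded build.
    return sum(i * i for i in range(n))

# "Do this, but in parallel": results come back in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(crunch, [10_000] * 32))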

In the pragma science-fiction department, you could propose to take a page from Chris Lattner’s Mojo. In Mojo, you can apply certain (fixed, compile-time) “decorators” to certain statements, e.g.

    @parameter
    if size == 1:
        return x[0].to_int()
    elif size == 2:
        return x[0].to_int() + x[1].to_int()

Similarly, one day maybe we will be able to write something like

@parallel(max_workers=20)
for url in get_list_of_urls():
    data[url] = fetch(url)

It’s going to be an uphill battle to get this kind of thing designed until we’re all a lot more comfortable with free-threading though – we only get one chance to design that.

9 Likes

I think numba’s prange might make for a good candidate to compare behavior on. The internals in Python would need to do things a little differently, but marking that something should be parallelized if possible, and letting the specializing interpreter watch for when there’s no interdependence between separate iterations, could lead somewhere productive.
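For reference, the numba version of that decoration looks roughly like this (a sketch using numba’s public njit/prange API):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum(a):
    total = 0.0
    # prange marks the loop as parallelizable; numba verifies the
    # iterations are independent (here it recognizes a reduction).
    for i in prange(len(a)):
        total += a[i]
    return total

print(parallel_sum(np.ones(1_000_000)))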

Perhaps it can have a slicker name like pmap then :slight_smile: Right now, buried three modules deep, it’s not exactly an obvious solution. [1]

Just for the record, I didn’t mean to suggest any specific syntax or semantics here, only that there are existing approaches to parallelism in the scientific space (OpenMP, Cython, numba) that cover a lot of ground with exactly that sort of built-in “decoration” of the usual single-threaded program flow.
And because the people who use this language are scientists more than software engineers, their actual workflow might be exactly that: Write the single-threaded code first, then decorate with parallel sections.


  1. And now I still need to figure out if ThreadPoolExecutor is what I want – if that wasn’t in the name, ignorance would be bliss. ↩︎

5 Likes

It might just be familiarity, but I think rayon.rs provides a nice model for how this could work. Rather than adding syntax, it basically provides a parallelized version of itertools.

This does require rewriting some code to take more of a functional style, but it’s a simple way to provide drop-in parallelism for people who want to try it.

I tested a small example of this in an early PEP 703 thread and it provided the expected speedup on the nogil branch. Pretty nifty.
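For a sense of the shape such code takes, here is an illustrative sketch (not the snippet from that thread; par_map is a hypothetical name):

from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x  # placeholder for an expensive pure function

data = list(range(10_000))

# Serial, functional style:
total = sum(map(work, data))

# Hypothetical rayon-style drop-in (par_map does not exist today):
# total = sum(par_map(work, data))

# Closest spelling with the current stdlib:
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(work, data))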

4 Likes

What I was thinking about was the sort of help questions we will find posted by people whose code stops working under free-threading.

Not what I personally need to write robust threaded code.

Ok, perhaps I can give a few examples to better illustrate the problems I am interested in.

1. Memory Model and Classic Loop Optimizations

Let’s start with a classic one for optimizing VMs. For the following code, let’s assume the functions thread1 and thread2 are executed in parallel on two different threads.

import time

class Example1:
    def __init__(self):
        self.keep_looping = True
  
    def thread1(self):
        while self.keep_looping:
            pass
        print("Done")
  
    def thread2(self):
        time.sleep(10)
        self.keep_looping = False

Which of the following behaviors would you ideally want to see:

  • 1.1 Done is never printed.
  • 1.2 Done is printed after 10 seconds.
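For reference, outcome 1.1 corresponds to the classic load-hoisting transformation. Hypothetically, an optimizing JIT could compile thread1 as if it had been written like this (my sketch):

def thread1(self):
    keep_looping = self.keep_looping  # load hoisted out of the loop,
    while keep_looping:               # since the loop body never writes it;
        pass                          # thread2's update is never observed
    print("Done")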

2. Trivial Parallel Access to Objects

Again, let’s assume the functions thread1 and thread2 are executed in parallel:

class Example2:
    def __init__(self):
        self.x1 = 0
        # ...
        self.x16 = 0
    
    def thread1(self):
        for _ in range(100_000):
            self.x1 += 1
    
    def thread2(self):
        for _ in range(100_000):
            self.x16 += 1

What would you ideally want?

  • 2.1 thread1 and thread2 run in parallel without any need for synchronization.
  • 2.2 thread1 and thread2 need to synchronize somehow before accessing self.x1 and self.x16 respectively.
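To see why the question arises at all: self.x1 += 1 is a read-modify-write, not a single operation, and the same decomposition applies to the list and dictionary examples below. A quick way to see it (opcode names vary by CPython version):

import dis

class Example2b:
    def bump(self):
        self.x1 += 1

dis.dis(Example2b.bump)
# Roughly: LOAD_ATTR x1 (read), BINARY_OP += (modify), STORE_ATTR x1
# (write). Two threads interleaving between the read and the write
# can lose an increment; that is the synchronization question here.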

3. Trivial Parallel Access to Lists

Let’s assume two parallel threads as before:

class Example3:
    def __init__(self):
        self.list = [0] * 1_000

    def thread1(self):
        for _ in range(100_000):
            self.list[0] += 1

    def thread2(self):
        for _ in range(100_000):
            self.list[999] += 1

What would you ideally want?

  • 3.1 thread1 and thread2 run in parallel without any need for synchronization.
  • 3.2 thread1 and thread2 need to synchronize somehow before accessing self.list[0] and self.list[999] respectively.

4. Trivial Parallel Access to Dictionaries

As before:

class Example4:
    def __init__(self):
        self.dict = {'x1': 0, 'x2': 0}

    def thread1(self):
        for _ in range(100_000):
            self.dict['x1'] += 1

    def thread2(self):
        for _ in range(100_000):
            self.dict['x2'] += 1

What would you ideally want?

  • 4.1 thread1 and thread2 run in parallel without any need for synchronization.
  • 4.2 thread1 and thread2 need to synchronize somehow before accessing self.dict['x1'] and self.dict['x2'] respectively.
  • 4.3 thread1 and thread2 only need some minimal check that the dictionary is “consistent”.

5. Parallel Access to Objects While Objects Change Shape

Expanding on Example2, let’s define a class without any fields, and let’s add the fields dynamically in the parallel threads:

class Example5:
    def __init__(self):
        pass

    def thread1(self):
        setattr(self, "x1", 0)
        for _ in range(100_000):
            self.x1 += 1

    def thread2(self):
        setattr(self, "x2", 0)
        for _ in range(100_000):
            self.x2 += 1

Let’s change the question.
What do you think is a reasonable goal for a practical implementation?

  • 5.1 thread1 and thread2 run their loop, without any need for synchronization, and only need some form of synchronization to add the fields.
  • 5.2 thread1 and thread2 need to synchronize every time before accessing self.x1 and self.x2 respectively.

6. Parallel Access to Lists While They Change Length

Expanding on Example3, let’s add two more threads that change the length of the list:

class Example6:
    def __init__(self):
        self.list = [0] * 1_000

    def thread1(self):
        for _ in range(100_000):
            self.list[0] += 1

    def thread2(self):
        for _ in range(100_000):
            self.list[999] += 1
    
    def thread3(self):
        for _ in range(500):
            self.list.append(0)

    def thread4(self):
        for _ in range(500):
            self.list.pop()

What do you think is a reasonable goal for a practical implementation?

  • 6.1 thread1 and thread2 run their loop without any need for synchronization; only thread3 and thread4 need some form of synchronization to change the length of the list.
  • 6.2 all threads need to synchronize every time before list[0], list[999], list.append(0), and list.pop() are executed.
  • 6.3 thread1 and thread2 only need some minimal check that the list is “consistent”.

7. Parallel Access to Dictionaries Under Arbitrary Access

Adding two more threads to Example4:

class Example7:
    def __init__(self):
        self.dict = {'x1': 0, 'x2': 0}

    def thread1(self):
        for _ in range(100_000):
            self.dict['x1'] += 1

    def thread2(self):
        for _ in range(100_000):
            self.dict['x2'] += 1

    def thread3(self):
        for i in range(100_000):
            self.dict[f"x{i}"] = 0

    def thread4(self):
        for i in range(100_000):
            del self.dict[f"x{i}"]

What do you think is a reasonable goal for a practical implementation?

  • 7.1 thread1 and thread2 run their loop without any need for synchronization; thread3 and thread4 need to synchronize somehow before executing self.dict[f"x{i}"] = 0 and del self.dict[f"x{i}"], respectively.
  • 7.2 all threads need to synchronize before accessing the dictionary.
  • 7.3 thread1 and thread2 only need some minimal check that the dictionary is “consistent”, but thread3 and thread4 need to synchronize somehow before accessing the dictionary.

Back to My Original Question

So, with these examples in mind, let me restate my original question:

Where do you see Python going or want it to go with respect to its programming model and the guarantees it should provide in the presence of optimizing JIT compilers and free threading?

5 Likes