What is the Long-Term Vision for a Parallel Python Programming Model?

These are very thought-provoking questions! I will try to ponder them before adding any more of my opinion, but one thing occurs to me based on your examples about parallel access to collections.

Would it be interesting to others to have something like concurrent.builtins or collections.concurrent, providing basic thread-safe versions of the built-in collections? Somewhat like the Java or .NET model.

In Python, perhaps losing the display/comprehension syntax if you do from concurrent.builtins import list might be much too limiting. But it’s a route that leaves the current collection types as they are while providing synchronized drop-in replacements (sort of).
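To make the idea concrete, here’s a rough sketch of what such a drop-in type could look like. Note that neither concurrent.builtins nor a SynchronizedList class exists anywhere; both names are hypothetical, and this is just the simplest lock-per-object approach, not a proposal for the actual design:

```python
import threading

class SynchronizedList:
    """Hypothetical sketch of a 'concurrent.builtins'-style list.

    Every operation takes an internal lock, so concurrent mutation
    from multiple threads can never corrupt the underlying storage.
    """

    def __init__(self, iterable=()):
        self._lock = threading.Lock()
        self._items = list(iterable)

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self, index=-1):
        with self._lock:
            return self._items.pop(index)

    def __len__(self):
        with self._lock:
            return len(self._items)

    def snapshot(self):
        # A consistent copy, so callers can iterate without holding the lock.
        with self._lock:
            return list(self._items)
```

The obvious downside, as with any wrapper like this, is that you lose the literal syntax and comprehensions, and compound operations (check-then-append, etc.) still need external locking.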

I don’t know what those are – in this context, what would a “threadsafe” list or dict implementation offer that the builtin list and dict implementations, as modified for the no-GIL work, don’t? Take two threads that each concurrently append an item to a list: this is going to result in one of two outcomes – the list grows by [a, b] or it grows by [b, a]. I don’t see what a separate “threadsafe” implementation could improve over that.
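That claim is easy to demonstrate with the stock threading module. On today’s GIL builds the atomicity of a single append comes from the GIL; the point under discussion is that the no-GIL work intends to preserve it via per-object locking:

```python
import threading

items = []

def add(value):
    # A single list.append is atomic from Python code's point of view,
    # so concurrent appends never lose elements; only their relative
    # order is unspecified.
    items.append(value)

t1 = threading.Thread(target=add, args=("a",))
t2 = threading.Thread(target=add, args=("b",))
t1.start(); t2.start()
t1.join(); t2.join()

# Exactly one of the two interleavings happened: [a, b] or [b, a].
assert sorted(items) == ["a", "b"]
```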

I burnt a few motherboards bringing all cores to 100% while processing large input batches. The multiprocessing implementations did it, despite the pickling (which can be limited by design).

I couldn’t get the threading alternatives to 100% CPU use, even using equivalent logic (probably a mistake of mine).

Note that I’m not talking about throughput (outputs/time), but about hardware utilization. I could not burn a motherboard with Python threading.

This was over eight years ago. I should run new tests now.

Actually, you should probably wait until the no-GIL build-time option is open for business. Until then, the GIL will indeed prevent burning motherboards. :slight_smile:

I’m getting 100% core utilization (and an overheat hang) with ThreadPoolExecutor. I’ll take a look at the throughput when the batch finishes.

I have no idea what you’re doing, but presumably you’re using a library that releases the GIL and possibly creates extra threads. There’s a danger in that (other than frying motherboards): say you know you have 8 cores and you create a thread pool of size 8, and the tasks you’re running also know you have 8 cores and each fork 8 threads, now you have 64 threads competing for 8 cores.
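The arithmetic of that failure mode can be sketched in pure Python with nothing but concurrent.futures. The core count is an assumption for the example; in real code the inner pools would typically be hidden inside a numeric library rather than written out like this:

```python
from concurrent.futures import ThreadPoolExecutor

CORES = 8  # assumed machine size for the example

def inner_task(_):
    return 1

def outer_task(_):
    # Each outer task naively spawns its own machine-sized pool,
    # unaware that 7 sibling tasks are doing exactly the same thing.
    with ThreadPoolExecutor(max_workers=CORES) as inner:
        return sum(inner.map(inner_task, range(CORES)))

with ThreadPoolExecutor(max_workers=CORES) as outer:
    results = list(outer.map(outer_task, range(CORES)))

# 8 outer workers, each with 8 inner workers: up to 64 threads
# competing for the same 8 cores.
```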

According to my colleague @mdboom (who might provide a link to the talk about real world concurrency that he once showed me) this is a real thing in the numeric world. :frowning:


I used to be a Java dev in a past life (almost a decade ago now); let me see what I can remember.

I think in Java when using a java.util.ArrayList from two separate threads there’s a third outcome: elements get swallowed. The operation of adding an element isn’t wholly atomic, hence all the articles on the Internet screaming about ArrayList not being thread-safe. The Javadoc also dedicates a bunch of space to pointing this out (https://github.com/openjdk/jdk/blob/e47cf611c9490225e50a548787cbba66ab147058/src/java.base/share/classes/java/util/ArrayList.java#L61).

Are you saying a Python list in a free-threaded interpreter has all atomic operations? That’s pretty cool then.

I think that’s more of a hypothetical statement about how free-threaded Python should operate, not a guarantee about how it operates now, given that free-threaded Python hasn’t yet landed.

I believe the answer to Tin’s question is yes, and while it hasn’t landed yet, there’s a working prototype in Sam Gross’s git repo, and merging of PRs from there into main (to become 3.13) has definitely begun.

Now, an important caveat is that += and friends in Python are not atomic, so two threads doing a[i] += 1 concurrently may definitely lose values. And honestly I’m not sure about the semantics of operations that affect some or all existing elements, such as sorting or slice assignment/deletion. If two threads concurrently do lst.pop(-1), the list will definitely be exactly two items shorter. I’m not sure (and too lazy to look in the code) about things like del lst[-1:] or del lst[0:1].
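A minimal sketch of that hazard and the usual fix, using only the stock threading module. The lost-update outcome is timing-dependent (it can bite even on GIL builds, since the interpreter may switch threads between the read and the write), so only the locked version is asserted here:

```python
import threading

N_THREADS = 4
N_INCREMENTS = 100_000

counts = [0]
lock = threading.Lock()

def unsafe_worker():
    for _ in range(N_INCREMENTS):
        # Read-modify-write in three steps (load, add, store): another
        # thread can run in between, so increments may be lost.
        counts[0] += 1

def safe_worker():
    for _ in range(N_INCREMENTS):
        with lock:  # makes the whole read-modify-write atomic
            counts[0] += 1

threads = [threading.Thread(target=safe_worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, no update is ever lost.
assert counts[0] == N_THREADS * N_INCREMENTS
```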


I’ll reference the .NET designs for example, though my C# has gotten quite rusty :slightly_smiling_face:

There are three namespaces containing collections with different implementation properties: System.Collections.Generic, System.Collections.Concurrent, and System.Collections.Immutable:

The Generic versions are optimized for single-threaded use and are considered “not thread-safe.” To mutate from multiple threads you have to provide your own synchronization, as the classes themselves don’t provide any. The Concurrent versions are implemented using atomic operations and a handful of other synchronization primitives, and are “thread-safe” without any external synchronization. The Immutable versions use copy-on-write, so they’re “thread-safe” but much less efficient.

The reason to have both (as far as I understand it) is that individual mutating operations on Concurrent collections are less efficient than the same operations on the non-thread-safe equivalent. If you use a Concurrent collection without multiple threads you lose performance for no benefit.

I don’t know to what degree this applies to Python. (I see in a newer post you’ve referenced existing work for 3.13 that sounds quite interesting.) My original assumption was that any changes to the built-in collections to provide thread-safety guarantees would add overhead that would hurt performance for single-threaded use. These other platforms seem to have “solved” that by providing multiple implementations and leaving it up to the developer to (hopefully) pick the right one. And to be clear, I’m not advocating for this approach; it just seemed worth mentioning as a kind of “prior art”. (It’s certainly less sophisticated than the kind of research @smarr is interested in!)
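The single-threaded cost is easy to see with a toy benchmark. LockedList below is hypothetical (nothing like it ships with Python); it only guards append, just enough to show where the overhead comes from:

```python
import threading
import time

class LockedList(list):
    """Hypothetical synchronized list; only append is guarded,
    to keep the sketch short."""

    def __init__(self, *args):
        super().__init__(*args)
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:  # acquire/release on every call, even single-threaded
            super().append(item)

def bench(lst, n=100_000):
    t0 = time.perf_counter()
    for i in range(n):
        lst.append(i)
    return time.perf_counter() - t0

plain, locked = list(), LockedList()
plain_time, locked_time = bench(plain), bench(locked)
# The locked version does strictly more work per call, so purely
# single-threaded code pays for thread safety it never uses.
```

Exact ratios vary by platform, which is presumably why .NET leaves the choice to the developer rather than making every collection pay the locking cost.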


Your questions are not clear to me. When you say “thread1 and thread2 need to synchronize somehow before X”, do you mean

  • that the user must write synchronization code (to achieve what?), or
  • that Python would do synchronization (presumably defeating the benefit of free-threading)?

Right, good question. So, I posed these questions trying to use relatively “primitive” constructs only, and my intention was that when people look at them, they would have an intuition of “this should probably just work”. Though, I am sure opinions may differ, which is something I tried to hint at by having two styles of questions “what would you ideally want” vs. “what should a reasonable implementation do”.

In the end, I would really like to know what you/the community think is a useful/desirable vision. There’s a bit of tension between what one would want and what is practically achievable; however, as a researcher, I am of course interested in shifting the boundary of what is practically achievable.

So, my answer is really the question of what you think would be desirable.

Interesting. I suppose lists have internal locks or some other synchronization mechanism? I wonder how much faster a list without locking would be. That said, I 100% think that the safe version should be and stay the default (as opposed to Java and C# where it appears the non-thread-safe versions are “default”).


There are some details in Sam’s PEP 703 in the section on Container Thread-Safety.


I made a mistake yesterday. Sorry!

I am not able to saturate the available cores using Python threading. Rather, I get less than 50% utilization.

When you say “need to synchronize” or “need for synchronization,” should this be interpreted to mean “the Python code must explicitly synchronize”? In other words, if the interpreter is implicitly synchronizing access under the hood to prevent some categories of problems, do you consider that a “need for synchronization” or “no need for synchronization”? I assume the latter, since you are talking about “goals for a practical implementation,” and implicit synchronization is part of the implementation?

If that’s correct, what is the implied failure mode if the Python code “needs to synchronize” but fails to do so? IMO whether “need to synchronize” is acceptable in any of these cases may depend quite a bit on whether the result of failing to do so would be “corrupted internals / interpreter crash” vs “data race” (e.g. missed or duplicate updates).


I tried to leave that open on purpose. The only thing I assume is that internals should never be corrupted, and the interpreter should generally not crash uncontrolled no matter what code it is executing. So, no matter what, there should be a minimum level of safety. At least that’s my naive understanding of the current promise of Python.

Whether one wants to ensure stronger properties, such as no updates being lost, or no out-of-thin-air values (which I would assume to be a necessary requirement for a safe language), well, that’s really the question. If you tell me that to make any of these things correct and reliable, the pythonic way is to require a lock or similar, that would certainly be a valid answer to my question.

Oh, I missed that you’d already asked Stefan this same question before I did.

I don’t think “defeating the benefit of free-threading” follows here. It’s a question of the granularity of the synchronization. You can have fine-grained synchronization of accesses to data structures (and this is what PEP 703 in fact does) to provide thread safety, and still gain quite a lot of benefit from free-threading, compared to the much less granular “single global interpreter lock” we have currently.

There is also some cost to this synchronization; this is discussed extensively in the text of PEP 703. And PEP 703 goes to some lengths to avoid the need for this synchronization (at least usually) in a few common high-traffic cases.
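The granularity point can be illustrated with a toy sketch. This is not how PEP 703 actually implements per-object locking (that happens in C, with optimizations to skip the lock in common cases); it only shows why fine-grained locks still allow parallelism where a single global lock would not:

```python
import threading

class LockedDict:
    """Toy fine-grained design: each container carries its own lock,
    so threads touching *different* containers never contend. A single
    global interpreter lock would instead serialize all of them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data[key]

a, b = LockedDict(), LockedDict()

def fill(d, prefix):
    for i in range(1_000):
        d.put(f"{prefix}{i}", i)

t1 = threading.Thread(target=fill, args=(a, "a"))
t2 = threading.Thread(target=fill, args=(b, "b"))
t1.start(); t2.start()
t1.join(); t2.join()
# Both structures end up fully and correctly populated; the two
# threads never waited on each other's lock.
```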


I agree. (I’m sure someone will be happy to point out that Python already doesn’t entirely meet that promise, but practically it mostly does unless you try pretty hard.)

This still leaves me confused about what sort of reply you expect to your long post with example cases. The post is structured to imply that one could just select some multiple-choice options (e.g. “7.1”) and that would be meaningful. But it seems you’ve left so much “open on purpose” that I don’t see how selecting from the given options is meaningful, without writing a short essay in addition.


Worth noting two other implementations of the same ideas, which I believe rayon was inspired by:

  1. The original implementation, which was Clojure transducers.
  2. The JuliaFolds ecosystem, which implements the same features in a high-level language (with no performance penalty, thanks to JIT).

JAX and Polars use a similar approach, albeit in a somewhat different form.