Pre-PEP: Safe Parallel Python

Indeed. Python and Java have very different runtime designs and uses of data structures. For example, Java has no list or dict “builtin” (java.lang.*): collection classes like java.util.ArrayList and HashMap are plain old Java classes. Since they are not special, providing separate concurrent versions is not a problem. In Python, by contrast, the collection classes pervade the language and are literally built into the interpreter itself: module globals are a dict, an object’s __dict__ is a dict, and the same pervasiveness goes for lists and tuples. In making Python concurrent we may already have to make some of these builtins concurrent as well, which changes what we need. On the other hand, there are Python-specific constructs (perhaps those related to caching?) for which the need to provide concurrent versions may arise in the future.

Though it is quite early to tell what concurrency models and constructs we will need in the future, I do agree that the concepts and experience of shared-execution-context concurrency in Python (i.e. multithreading, subinterpreters) are relatively underdeveloped at the moment. Now that free-threading has become a reality, what is needed IMO is a survey of its usage to identify Python-specific needs and see how they align with our proposed functionality. Doing so takes time and community experience: despite the Python language itself being notionally compatible with multithreading, CPython’s GIL effectively suppressed the development and adoption of multithreading paradigms in the ecosystem over the years; other forms of concurrency such as asyncio and multiprocessing only work around the limitations. It will take some time for the community to fully embrace shared-execution-context concurrency and outline the possibilities it enables.

2 Likes

I don’t see why the memory model would be related to concurrent data structures.
Their usefulness does not depend on the MM.
As in the example I linked above, there are real speedups a parallel Python program can gain right now by using a concurrent data structure. If Python had a memory model, that speedup would still be there.
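This isn’t the example I linked, but a minimal sketch of one classic reason a concurrent structure can outperform a dict behind a single lock: lock striping. StripedDict and its stripe count are hypothetical names for illustration only; on a free-threaded build, threads that touch different stripes never contend on the same lock.

import threading

class StripedDict:
    # Toy concurrent dict: one lock per stripe instead of one global lock.
    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._maps = [{} for _ in range(stripes)]

    def _stripe(self, key):
        # Keys hash to a stripe; different stripes can be used in parallel.
        return hash(key) % len(self._maps)

    def __setitem__(self, key, value):
        i = self._stripe(key)
        with self._locks[i]:
            self._maps[i][key] = value

    def __getitem__(self, key):
        i = self._stripe(key)
        with self._locks[i]:
            return self._maps[i][key]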

Quite often I see memory models coming up in multithreading/PEP 703-related discussions, but I don’t see why. The absence of a memory model for Python does not make Python worse in my eyes. If anything, it makes Python much more approachable to newcomers, which is a very good thing.

The CPython implementation essentially has this model: your code will run in the sequence specified by the source program, under all circumstances. (Note that for C and Java this does not hold.)
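A minimal sketch of what that strictly sequential model guarantees (message passing through two plain variables; under sequential consistency the assert below can never fail, whereas a C or Java compiler would be free to reorder the two writes absent explicit synchronization):

import threading

data = 0
ready = False

def writer():
    global data, ready
    data = 42
    ready = True   # in program order, after the write to data

def reader():
    if ready:
        # Seeing ready == True implies the write to data already happened.
        assert data == 42

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()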

It might be overly restrictive, sure, but I don’t see that as a big problem right now. If we had evidence of important speedups in the JIT that would be lost for lack of a MM, then sure, we can discuss the pros and cons and which model would be best. But there is no such evidence right now.

Concurrent data structures are useful at the present time, too.

We don’t have the history of C, where having a common MM across compilers and hardware architectures was tremendously useful. MMs are complicated things, but having a MM in C is better than the big mess of relying on hardware- or compiler-specific guarantees.
We can choose a different path for Python, maybe one where we keep the strictly sequential model above.

The information that a C or Java compiler crucially does not have when reading a source file is which variables are shared across threads, which is the big motivation for MMs. Coming back to the PEP, it proposes having Python programs explicitly define which objects can be shared across threads, while most objects can’t by default. I think these could be some ground rules to avoid the need for a MM entirely. If this PEP was accepted, I’d see a Python MM as less likely to exist in the future. Which I believe would be a good thing for the Python community.

2 Likes

You’re right that Java (and many other concurrency-focused programming languages) having a memory model does not mean that Python must necessarily have one. The goal of this PEP is to eliminate unnecessary synchronization and to help JIT implementation. If by JIT we mean things like type specialization in bytecode rather than machine-code generation, most likely we won’t ever need a memory model like Java’s (which, granted, is confusing enough to read on its own). And if that’s the case, we have no need to directly mirror the concurrent constructs that Java offers.

Keeping things simple at first, without prejudicing the ability to add functionality in the future, would mean starting with whatever we already have in the Python data model, which is much more dynamic than Java object fields (those essentially point to somewhere on the JVM heap): the Java object model describes what can be done, while the memory model restricts the timing. The Python data model is very different from the former and cannot (and doesn’t need to) afford the guarantees of the latter. Hence my concern with auxiliary objects in Python like __dict__: when we access an attribute stored in an instance dictionary from a thread, which objects need to be locked? This draft PEP has not yet addressed this issue. How far do concurrency constructs go if we end up locking the __dict__ on every access anyway? Alternatively, if we make the instance dictionary (and by extension every builtins.dict and its subclasses, since code like foo.__dict__ = {"bar": 42} is and should remain legal) concurrency-safe (barring the potential costs), will we still need a separate ConcurrentDict implementation?

2 Likes

That’s the thing. It would be prohibitively expensive. This is the reason why PEP 703 goes to great lengths to avoid a big performance penalty for dicts and lists, including changing the memory allocator to allow lock-free reads.

Furthermore, instance attribute hash tables have been removed by default for a few Python versions now, AFAIK. When you access __dict__, a dictionary is created on the fly from values stored in a specialized structure. (Ask Mark, not me, if you have questions about that.)

Yes, I think so.
IIUC, under this PEP dict cannot be in the synchronized state, which I think would be a good thing, since it doesn’t provide strong concurrent-consistency guarantees and probably never will.
Thus, it would be desirable to have another dict that does live in the synchronized state (putting it in the terms of this PEP) and provides stronger guarantees, such as AtomicDict or ConcurrentDict, while leaving non-concurrent use cases to dict.

Note that, usually, stronger concurrency guarantees go hand in hand with slower single-threaded performance.

Correct me if I’m wrong, but I think it has. That’s what the proposed object states amount to.

  • x is immutable: x.y = 42 is not allowed
  • x is local: x.y = 42 is only allowed when performed by the owner thread
  • x is protected: x.y = 42 is only allowed when performed by a thread that holds the mutex that protects x
  • x is synchronized: x.y = 42 is allowed, by any thread
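A hypothetical sketch of those states in code, using the dunders that appear elsewhere in this thread (__freeze__, __protect__, __mutex__). None of this runs on today’s CPython; treat it as pseudocode for the draft PEP’s semantics:

class Box:
    def __init__(self):
        self.y = 0

local_box = Box()
local_box.y = 42                   # local: allowed from the owner thread only

frozen_box = Box()
frozen_box.__freeze__()            # transition to immutable
frozen_box.y = 42                  # raises: immutable objects reject writes

guard = Box()
shared = guard.__protect__(Box())  # protected: guarded by guard.__mutex__
with guard.__mutex__:
    shared.y = 42                  # allowed: this thread holds the mutex
shared.y = 0                       # raises: the mutex is not held here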

Now, I had a few points I wanted to make after reading the PEP a few times.

Optimistic lock avoidance for lists and dicts

Under this proposal, lists and dicts can be immutable (frozen), local, stop-the-world mutable, and protected, but not synchronized, right? If so, does that mean the optimistic reads detailed in PEP 703 would no longer be necessary? I reckon that would simplify a lot of code inside CPython.

SuperThreads

The name SuperThread doesn’t make me think of GIL-like serialization of threads. Maybe we can bikeshed a better name? I found SerializedThreadGroup in §With the GIL enabled; that sounds like a better name to me.

About __protect__()

I tried coming up with a simple pure-Python mutex implementation based on this PEP that would also resemble a Rust mutex. (Granted, it’s very easy to implement a mutex when you already have *.__mutex__.)
I stumbled on this problem: is a previously protected object allowed to no longer be protected? I think the code below illustrates why that might be useful.

import contextlib

# A Rust-style mutex built on the draft PEP’s __protect__ / __freeze__ /
# __mutex__ dunders (not valid on today’s CPython).
class Mutex:
    def __init__(self, obj):
        self.protected = self.__protect__(obj)

    class UnlockedMutex:
        def __init__(self, mx: "Mutex"):
            self.mx = mx
            self.__freeze__()

        def get(self):
            return self.mx.protected

        def set(self, obj):
            # previous = self.mx.protected
            self.mx.protected = self.mx.__protect__(obj)
            # self.__unprotect__(previous)  <---- is this needed? is it implicit?

    @contextlib.contextmanager
    def lock(self):
        with self.__mutex__:
            yield self.UnlockedMutex(self)

Given the above, what would be the result of this code?

my_mutex = Mutex({})

with my_mutex.lock() as unlocked_mx:
    current = unlocked_mx.get()
    updated = current | {"spam": True}
    unlocked_mx.set(updated)
    # current is now the sole reference to the empty dict used above

print(current)  # is this ok? was it implicitly un-protected? or does it raise?

Another question. This table says “no” for immutable × __protect__().
Does that mean the following code would raise an exception?

my_mutex = Mutex(0)

Yet another question. What does this do?

self.__protect__(self)

Do I take it correctly that it means that, from now on, in order to change the attributes of self you’ll need to hold self.__mutex__? If so, I guess the Mutex.__init__() above needs it, right?

On the benefits of this proposal

I would like to point out a few characteristics that make this simple Mutex worthwhile, because I think they really are benefits of this PEP.

my_list = []
my_mutex = Mutex(my_list)  # raises
# as desired: prevents the reference from escaping the mutex's scope.
# => can't mutate the list without calling my_mutex.lock()
my_mutex = Mutex([])

with my_mutex.lock() as unlocked_mx:
    lst = unlocked_mx.get()
    lst.append("spam")

lst.append("nope!")  # raises: this thread does not hold my_mutex.__mutex__
# again, prevents the reference from escaping the mutex.

I think it would be tremendously useful to have the guarantee of holding the sole reference to an object, especially in a multithreading context.

Furthermore, incrementally opting out of the GIL with “SuperThreads” definitely sounds less scary than switching from all-GIL to no-GIL, for anybody, beginners and experts alike. It would probably help the adoption of free-threading in the long run.

6 Likes

Revisiting the data model is quite necessary. Maybe we should document and emphasize that __dict__ and many other dunders are synthesized objects (IIRC this is currently the case for some objects) and that upon assignment to __dict__ (foo.__dict__ = {"bar": 42}) the interpreter is at liberty to transform the assigned value into whatever internal representation it likes. This would relieve us of many assumed obligations; for example, we could just expose a MappingProxyType or similar when an object is frozen.
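For instance, a read-only namespace view is already expressible today; a frozen object could hand out something like the following instead of a mutable dict:

from types import MappingProxyType

ns = {"bar": 42}
view = MappingProxyType(ns)   # read-only wrapper around the real storage
print(view["bar"])            # 42: lookups work as usual
view["bar"] = 0               # raises TypeError: mutation is blocked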

A similar story goes for the module global namespace, which from the language’s point of view has corresponded to the module object’s __dict__. IIRC locals() has already been made a proxy of some sort recently, though that tends to be an easier story, since locals() doesn’t correspond to an object’s __dict__ except at the module level. OTOH, doing things with an object’s __dict__ is a common operation.

I’m not suggesting that all these correspondences between objects need to be severed. But in general I think we have stopped promising that things will always be real dicts and lists in new APIs; many of the existing choices were made before ABCs were a thing, though even then there were wordings like “dict-like object” that gave some freedom to how things need to behave. Changes to existing promises are unlikely, as they tend to be backwards incompatible, but there are a lot of other things and assumptions in the data model that we may have to break or tweak if we move forward with the proposal for shared states. Mutable types having frozen instances is one, and I have a feeling it will generate a lot of discussion within the typing community.

Would existing threaded code with the form:

my_mutex = threading.Lock()

with my_mutex:
    my_shared_variable += 1

still work?

1 Like

Just a short note for the “bike shedding” there: that these are dunders is a minor thing.
But we could keep an eye on it as the PEP discussion evolves, since adding
“protocol helper” functions to call these dunders where applicable should be a non-issue.

Those should be listed in the PEP, though,
e.g.

from bobydrakelib import get_mutex, protect, freeze, share_status
1 Like

Python already has a memory model, not in the language reference where you’d expect to find it, but in the Library and Extension FAQ:

What kinds of global value mutation are thread-safe?

[…] In practice, it means that operations on shared variables of built-in data types (ints, lists, dicts, etc) that “look atomic” really are.

For example, the following operations are all atomic (L, L1, L2 are lists, D, D1, D2 are dicts, x, y are objects, i, j are ints):

L.append(x)
L1.extend(L2)
x = L[i]
x = L.pop()
L1[i:j] = L2
L.sort()
x = y
x.field = y
D[x] = y
D1.update(D2)
D.keys()

…followed by a warning about __del__ re-entrancy (but not more common sources of re-entrancy like comparisons) and the exhortation “When in doubt, use a mutex!” (but with no mention of deadlocks).

While not documented, CPython issue bpo-13521 / gh-57730 made dict.setdefault atomic because it “was intended to be atomic” and deployed code depended on it.[1]
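For illustration, this is the kind of lock-free pattern that atomicity enables (get_or_create is just a hypothetical name):

registry = {}  # shared across threads

def get_or_create(key):
    # If two threads race on the same missing key, setdefault’s atomicity
    # guarantees both receive the same stored list back.
    return registry.setdefault(key, [])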

PEP 703 (free-threading) contains statements like

Still, the GIL ensures that some operations are effectively atomic. For example, the constructor list(set) atomically copies the items of the set to a new list, and some code relies on that copy being atomic (i.e., having a snapshot of the items in the set). This PEP preserves that property.

which is also not documented, but is a similar “built-in types, looks atomic, is atomic” operation. Unfortunately PEP 703 makes handwavy statements like “per-object locking aims for similar protections as the GIL” instead of listing, even incompletely, the protections it aims to provide, though the thorough implementation details do implicitly list some of them.
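As a concrete illustration of the snapshot property that quote describes:

import threading

s = set(range(1000))

def mutator():
    for i in range(1000):
        s.discard(i)
        s.add(i + 1000)

t = threading.Thread(target=mutator)
t.start()
snapshot = list(s)        # atomic copy: a consistent snapshot of s
for item in snapshot:     # iterating s directly while it mutates
    pass                  # could raise RuntimeError instead
t.join()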

I have three points to make from all this:

  • Python already makes some guarantees that deployed code relies upon, but there isn’t a single complete list of them.
  • Python either must uphold these guarantees, or ought to explicitly repudiate them (so code relying on them can be fixed).
  • Given that Python provides these guarantees, if most user code can live with shared-implies-immutable, Python may not need special concurrent data structures similar to Java’s.

For completeness, I feel I should mention PEP 583 A Concurrency Memory Model for Python from 2008. It surveyed the options as understood at the time rather than making a concrete proposal. One notable option is unconditional[2] sequential consistency, which most languages would reject out-of-hand for performance reasons.

PEP 583 contains one occurrence of the word “mutable” and zero instances of the word “immutable”, so it doesn’t have much bearing on the current proposal.


Personally, I think a memory model is essential, because without one it’s impossible to determine whether code is correct or not.[3] Python has a high abstraction level, so its memory model can be much simpler than Java’s or C++'s. Python is very dynamic, so the model can[4] be enforced at runtime rather than relying on documentation, linters and sanitizers. But that we already have one, even with the GIL, shows that we can’t really get away without one.


While I hope Python doesn’t have to get down in the details, if you’re curious about them, I enjoyed Russ Cox’s articles on hardware memory models (sequential consistency, x86, ARM, DRF-SC) and programming language memory models (Java, C++, and JavaScript).


  1. And I’ve since written code that depends on it, so please don’t break it. :slight_smile: ↩︎

  2. not the “… for data-race free programs” often found in hardware memory models ↩︎

  3. and I expect PEP 703 to fail in practice because it does not attempt to define one ↩︎

  4. I hope, and this pre-PEP seems to propose to ↩︎

6 Likes

Great research on the Python literature. The implicit memory model is really “whatever CPython happened to implement at the time PEP 703 was written”. By promising that “per-object locking aims for similar protections as the GIL”, it implies a sequential memory model with atomic behavior for builtin objects. Only when free-threading gains widespread use will we see how well that promise holds up.

I don’t think this would necessarily be the case, but it does seem that the various projects to enhance Python (free-threading, JIT, etc.) are sometimes not quite tuned to the same wavelength. And here comes the classic open-source project-management problem: “organization A has successfully forked the software to implement feature A for their own use cases” and “organization B has successfully forked the software to implement feature B for their own use cases” do not together mean that the mainline can implement features A and B simultaneously without hassle. This is especially the case in Python, where enhancements are often driven by enterprise- and domain-specific needs. The current proposal focuses on enabling JIT specialization (for which __shared__ state is useful) but doesn’t go as far as writing machine-specific assembly on the fly; in any case we still need to state what actually prevents a future JIT from going that low-level. (In other words, write down why a low-level JIT is a bad idea in Python.)

I really like how PEP 810 touts four attributes—Local, Explicit, Controlled, Granular—for lazy imports. I feel they’re also the guardrails that make Safe Parallel Python attractive: the move to free-threading stays local, visible, and opt-in, thus lowering the transition risk and pain.

For the record, PEP 795 has just been rewritten and reposted with an alternative approach. I have yet to read the changes there, but I favor the approach described here (PEP 805) more.

(Also, I just dropped by here because someone posted a Stack Overflow question with minimal reproducible code which is really impaired by not having proper explicit immutable (or local mutable) data structures: https://stackoverflow.com/questions/79851420/multithreading-becomes-much-slower-than-multiprocessing-in-free-threaded-python#79851420 )

Thanks for linking that. I think the person on StackOverflow discovered some scaling issues we can fix: Multithreaded scaling bottlenecks on free-threaded Python · Issue #30494 · numpy/numpy · GitHub.

I also included some instructions there for getting profiles, which allow a better understanding of what causes a slowdown. In this case all the speculation about refcounts is incorrect; instead it’s a mutex in the tracemalloc implementation and a critical section in numpy’s array creation utilities. There is probably a different way to set up both so that they scale better.
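As an aside, for anyone who wants to check whether tracemalloc is implicated in their own runs, a quick sanity check (this is not the profiling setup from the issue):

import tracemalloc

# True means allocations go through tracemalloc’s hooks and its lock,
# which can serialize otherwise-parallel, allocation-heavy threads.
print(tracemalloc.is_tracing())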

Reports like this are really useful because it’s still early days with free-threading and it’s hard to know where the scaling bottlenecks will show up.

4 Likes