I'd like to try to tackle a lot of things in a single post. Maybe this post is too long, but I don't see how to fit everything into separate replies in a meaningful way.
First, it is becoming pretty clear to me that we should have scoped this PEP better. Thanks especially to Paul Moore for making that clear. We are updating the PEP to include the ability to share immutable objects across subinterpreters by reference, rather than deferring that to a separate PEP (which we intended for the same Python release anyway). This means that two things, which we have already implemented and will make sure end up in the prototype, have to be added as well: support for managing immutable cyclic garbage without cycle detection, and support for atomic reference counting for immutable objects. We will ping here as soon as the PEP text is updated.
Below I will try to explain the big picture for the immutable objects in PEP795, why we need strictly enforced immutability to be able to share immutable objects across subinterpreters, and what subinterpreters and free-threaded Python stand to gain from immutability from a performance standpoint (and more).
Big picture of PEP795
I'll start with the big picture. PEP795 builds towards a data-race-free programming model for Python. What we envision beyond PEP795 is a Python that does not permit concurrent mutation, meaning it should not be possible for two threads to mutate an object at the same time. The same property falls out of properly synchronised programs; we simply want to enforce it and throw an exception if the program fails in this regard. Ensuring that a program is properly synchronised is hard, and debugging things like data races is often very challenging. Our goal is to ensure that programs that could data-race throw an exception instead.
The way we propose to enforce data-race freedom is through the ability to create groups of mutable objects which we call regions. Regions are isolated, which is a bit hard to define precisely at this point as I haven't introduced all the moving pieces yet. For now, we can define isolation as "objects inside the region only point to other objects inside the same region". Note also that, for now, all objects in a Python program belong to some region.
A region is owned by a thread and unless a thread owns a region, it cannot access the objects inside it. If you want to share a region between threads it must be wrapped inside a lock, which ensures that only one thread at a time owns the region. Trying to release a lock for a region while hanging on to direct references into the region throws an exception. (Otherwise, two threads could end up mutating the same object at the same time.) In the rest of this text, I'll use "lock" to refer to a lock object that manages the ownership of a single region. (It is possible to extend this model with a read-write lock which permits multiple threads to read objects at the same time, but not mutate them.)
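As a rough illustration, the lock-owned-region pattern could look something like the sketch below. Region, RegionLock and every method name here are invented for this post; this is not the PEP795 API, and the sketch does not enforce the rule that releasing while holding direct references into the region throws.

```python
import threading

class Region:
    """Invented stand-in: a group of mutable objects."""
    def __init__(self):
        self.objects = []

class RegionLock:
    """Invented stand-in: wraps a region so only one thread at a time owns it."""
    def __init__(self, region):
        self._region = region
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()   # wait until no other thread owns the region
        return self._region    # the owning thread may now touch the contents

    def __exit__(self, *exc):
        # The real model would raise here if direct references into the
        # region were still held; this sketch does not enforce that.
        self._lock.release()

shared = RegionLock(Region())

def worker(n):
    with shared as region:         # take ownership of the region
        region.objects.append(n)   # safe: no other thread can mutate it now

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with shared as region:
    print(sorted(region.objects))  # [0, 1, 2, 3]
```

Because every access goes through the lock, the four appends cannot race even though the region is shared between threads.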
PEP795-style immutability extends this model by introducing immutable objects that live outside regions. Since they live outside regions, they can be accessed by any thread at any time, and importantly, objects in regions are permitted to reference immutable objects.
Analogous to PEP795, if you have a class Foo and you want to have instances of Foo in different regions, then you must first make Foo immutable. To simplify this, we propose that certain types of objects should support implicit (automatic) freezing. For example, if a type Foo is shared across two regions, it becomes immutable automatically. Objects that seem sensible to freeze automatically include types (modulo possible opt-in), strings, tuples, and numbers. (Note that the read-write lock mentioned above does not allow sharing of types between regions.)
Regions with locks add an escape mechanism to immutability that PEP795 does not discuss: freezing does not propagate through locks. Thus, if you want an immutable object to contain some mutable state (say, a counter inside an immutable class that counts how many instances of the class have been created), you can do this by placing a lock, whose region holds your counter, inside the immutable class.
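A minimal sketch of that counter pattern, using a plain threading.Lock as a stand-in for a lock-wrapped region. The freezing machinery itself is not modelled here, and all names are invented; the point is only that the frozen object carries a lock, and the mutable count lives "inside" that lock.

```python
import threading

class Frozen:
    """Imagined as immutable; freezing would stop at the lock below."""
    _count_lock = threading.Lock()   # stand-in for a lock-wrapped region
    _count = 0                       # mutable state guarded by the lock

    def __init__(self):
        cls = type(self)
        with cls._count_lock:        # take ownership of the mutable state
            cls._count += 1          # count instances of this class

for _ in range(3):
    Frozen()
print(Frozen._count)  # 3
```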
Regions are created explicitly, like r = Region(). Additionally, each thread has a "default region". The default region is where objects reside after being created. This region is essentially thread-local storage, and it also contains all the stack frames of that thread. Variables and objects in the default region are permitted to point into the regions that the thread owns. This means that objects are thread-local when created.
Objects live in the default region until some other region creates a reference to an object; the object is then moved from the default region into the other region. (This operation may propagate if the moved object contains references into the default region.) Moving an object out of a region other than the default region is supported but is slightly more involved. (It involves disconnecting the object or objects you want to extract from the remaining objects in the region.) A region can be merged into another region, and regions can also be nested.
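The move-on-reference behaviour can be sketched with a toy placement model. Region, adopt, and the explicit origin argument are all invented for illustration; the real mechanism would track placement implicitly.

```python
class Region:
    """Toy placement model: tracks which objects currently live here."""
    def __init__(self, name):
        self.name = name
        self.members = set()

    def adopt(self, obj, origin):
        """Take a reference to obj: it moves out of its origin region."""
        origin.members.discard(obj)
        self.members.add(obj)

default = Region("default")   # the thread's default region
r = Region("r")               # an explicitly created region

obj = object()
default.members.add(obj)      # objects start life in the default region

r.adopt(obj, default)         # r references obj, so obj moves into r
print(obj in r.members, obj in default.members)  # True False
```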
During the Python Language Summit in May this year, we presented this bigger picture. The slides can be found here and we also have a recorded version of the presentation here. Slides 9-17 in particular show much of what is described above, with some minimal syntactic "examples". I should point out that the move and share keywords used in the presentation were mock syntax that we invented for that presentation.
Interaction with the subinterpreters model
I will begin by describing subinterpreters, because they exemplify some of the problems of concurrency when there isn't a single GIL. This will also show some of the things that free-threaded Python has had to solve.
A big benefit of our design is that it permits subinterpreters to directly share objects by reference. The subinterpreters model currently relies on the subinterpreters being isolated: if you want to share an object, you have to pickle it and send it across. This programming model is warranted by the fact that each subinterpreter uses a GIL to ensure that it can manipulate reference counts safely (in a way that makes them appear atomic in the program), and that its GC will only encounter objects created by that subinterpreter.
Unpacking subinterpreter isolation
Let's unpack this further with an example. Assume that we were able to directly share an object O by reference across two subinterpreters, and that both subinterpreters call a function F on O. Now, both subinterpreters will execute something like the following:
_tmp = O.F # get a reference to the function object for F
_tmp.RC += 1 # increase the reference count on the function object
_tmp(...) # call the F function
_tmp.RC -= 1 # decrease the reference count
If we zoom in on the first reference count operation, it really looks something like this:
_tmp_rc = _tmp.RC
_tmp_rc = _tmp_rc + 1
_tmp.RC = _tmp_rc
That is to say, this operation is not atomic. Both subinterpreters use their own GIL to ensure that no other thread on the same subinterpreter can interfere with this critical section. However, since they use different GILs, they do not synchronise with each other. Let's assume that the reference count of O.F is 1. I'll use a _1 or _2 suffix on the _tmp variables to distinguish the variables of the two subinterpreters, and prefix each line with (1) or (2). It is possible that the two subinterpreters effectively execute their statements in the following global order.
(1) _tmp_rc_1 = _tmp.RC # _tmp_rc_1 holds 1
(1) _tmp_rc_1 = _tmp_rc_1 + 1 # _tmp_rc_1 holds 2
(2) _tmp_rc_2 = _tmp.RC # _tmp_rc_2 holds 1
(2) _tmp_rc_2 = _tmp_rc_2 + 1 # _tmp_rc_2 holds 2
(1) _tmp.RC = _tmp_rc_1 # writes 2 into _tmp.RC
(2) _tmp.RC = _tmp_rc_2 # writes 2 into _tmp.RC
As the example above demonstrates, because both subinterpreters read the same initial RC from O.F, one RC increment gets lost. Since we decrement the function's RC after the call, we might bring the RC down to zero (unless we suffer the same race coming back).
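The lost update can be replayed deterministically in plain Python by simulating the two subinterpreters' local variables:

```python
# rc stands in for the shared reference count of O.F; tmp_rc_1 and tmp_rc_2
# are the two subinterpreters' local copies, executed in the order above.
rc = 1

tmp_rc_1 = rc            # (1) reads 1
tmp_rc_1 = tmp_rc_1 + 1  # (1) local copy is now 2
tmp_rc_2 = rc            # (2) also reads 1: the race
tmp_rc_2 = tmp_rc_2 + 1  # (2) local copy is now 2
rc = tmp_rc_1            # (1) writes 2
rc = tmp_rc_2            # (2) writes 2 again

print(rc)  # 2, not 3: one increment was lost
```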
When I said it was possible for the above to happen, I was not referring to a remote possibility. This is actually quite likely to happen because reference count manipulations happen so frequently. (Indeed, we suffered this bug early on to the point of hello world-size programs not being able to run.)
Does atomic reference counting fix the problem?
No. While making the _tmp.RC += 1 (etc.) operations atomic would address the incorrect count in the example above, this happens to be the case only because both threads are merely reading O.F. (There is also an additional problem with respect to cycle detection/GC that we will return to later.) To illustrate why atomic reference counting does not work in all cases, let's change the example and see what happens if two subinterpreters share an object O, where one writes to the field O.F at the same time as the other reads O.F.
Subinterpreter 1 does:
O.F = y
Subinterpreter 2 does:
x = O.F
Like before, let's zoom into what happens in Python due to these two statements:
Subinterpreter 1:
y.RC += 1
_tmp = O.F
O.F = y
_tmp.RC -= 1
Subinterpreter 2:
x = O.F
x.RC += 1
Let's assume that the RC manipulations are atomic, so there are no problems there. Assume the reference count of O.F is 1 from the start. Here is a situation that could occur, again writing statements in temporal order, and prefixing each line with (1) or (2) to show which subinterpreter is running.
(1) y.RC += 1
(1) _tmp = O.F
(2) x = O.F # x is a valid pointer here (*)
(1) O.F = y # (**)
(1) _tmp.RC -= 1 # the old O.F's RC hits zero, the object is deallocated, and x is no longer valid
(2) x.RC += 1 # this write may land in memory that is no longer a reference count, corrupting the heap (use-after-free)
So, even though the reference count manipulations were atomic, this program just wrote a value into memory that could be in the middle of a newly allocated object, etc. If we ran the above example in free-threaded Python, it would avoid the use-after-free thanks to the per-object locks, which ensure that the operation marked (*) cannot happen before the line marked (**) when the two statements x = O.F and O.F = y are racing.
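The same interleaving can be replayed against a toy heap model, where deallocation removes the entry entirely, so the dangling access raises instead of silently corrupting memory. The addresses and helper names here are invented for illustration.

```python
# Toy heap: address -> reference count. Deallocation deletes the entry, so
# touching a freed "object" raises KeyError instead of scribbling on memory.
heap = {"obj_F": 1, "obj_y": 1}

def incref(addr):
    heap[addr] += 1

def decref(addr):
    heap[addr] -= 1
    if heap[addr] == 0:
        del heap[addr]       # deallocate

O = {"F": "obj_F"}           # O.F initially points at obj_F

incref("obj_y")              # (1) y.RC += 1
tmp = O["F"]                 # (1) _tmp = O.F
x = O["F"]                   # (2) x = O.F -- x still looks valid here (*)
O["F"] = "obj_y"             # (1) O.F = y (**)
decref(tmp)                  # (1) _tmp.RC -= 1 -- old O.F hits zero, freed

dangling = False
try:
    incref(x)                # (2) x.RC += 1 -- use-after-free in real memory
except KeyError:
    dangling = True
print("dangling:", dangling)  # dangling: True
```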
PEP795 will permit immutable objects to be shared across subinterpreters. By enforcing immutability rather than relying on programmer discipline, we can be sure that the 2nd example (the read-write race on O.F) will not happen ā it will result in an exception being thrown rather than the use-after-free. Thus, from a reference counting standpoint, we can safely share PEP795-style immutable objects across subinterpreters as long as we ensure that reference count manipulations on immutable objects are atomic.
For mutable objects the situation is even simpler. Because the contents of a region are only accessible to one thread at a time, neither the read-read race nor the read-write race is possible. So it will be possible to share mutable objects directly across subinterpreters using regions without any changes to Python's reference counting.
Now, let's bring back the foreshadowed additional problem with cycle detection/GC. Each subinterpreter expects to be able to manage its own heap in isolation. Every Python object is part of a doubly-linked list which is used by the GC. When we start sharing objects across subinterpreters, concurrent mutation could corrupt a subinterpreter's list. Free-threaded Python avoids this by switching out Python's allocator and relying on mimalloc's heap walk. In PEP795, we address this problem by removing frozen objects from cycle detection. This will not leak memory because we can use a clever trick where each strongly connected component of immutable objects is managed by a single reference counter. Thus, we regain the ability to detect cycles in immutable garbage without tracing the immutable objects.
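To illustrate the graph-theoretic observation behind the per-component counting (this is only a sketch, not the PEP795 implementation): the condensation of strongly connected components (SCCs) is acyclic, so one counter per component suffices, and the immutable cycle below never needs a cycle detector.

```python
def sccs(graph):
    """Tarjan's strongly connected components (recursive; fine for small graphs)."""
    index, low = {}, {}
    stack, on_stack, result = [], set(), []
    next_index = [0]

    def visit(v):
        index[v] = low[v] = next_index[0]
        next_index[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            result.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return result

# Immutable cycle a <-> b, with b also referencing c. External references
# into {a, b} would all bump one shared counter for that component.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
components = sccs(graph)
print(sorted(sorted(c) for c in components))  # [['a', 'b'], ['c']]
```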
In our proposal, because of region isolation, (almost) every cycle will be contained within a single region. (This is analogous to, but more fine-grained than, how subinterpreters work today: each subinterpreter can collect cyclic garbage in its isolated heap individually from the others.) Thus, we can make cycle detection incremental by checking individual regions, and in principle even without stopping the program if we are doing cycle detection in regions which are not currently in use.
Interaction with free-threaded Python
Free-threaded Python has already fortified the interpreter against both read-read and read-write races, so with free-threaded Python our proposal does not enable new things that a program can do, but it does unlock many performance optimisations. Fortifications like atomic reference counting and per-object locking all come at a cost (hence the biased locking and deferred RC), and it would be possible to lower these costs by relying on the invariants that regions bring. With respect to cycle detection/GC, stop-the-world (STW) pauses could be made shorter (or in the best cases avoided) by operating on individual regions instead of on the whole program. Also, when operating inside a region, per-object locks and atomic reference counting are not needed. When operating on immutable objects, per-object locks are similarly unnecessary, since no write can race with a read.
Why immutability through static typing is not enough for this PEP
Currently, the Python type system is not really expressive enough to express the immutability that this PEP proposes. But even if it were, since Python's type system is (intentionally) weak, there is no guarantee that just because a program passes type checking, it behaves according to the types. Some parts might drop out of type-checked land, some parts might use reflection, etc. Bottom line: checking immutability through types is, from the perspective of an implementation on top of subinterpreters, no different from simply trusting that programmers do not mutate shared objects. The second that trust is broken, we are wide open to the kinds of problems discussed above. (We are back to why free-threaded Python adds e.g. per-object locks.)
Benefits outside of new capabilities or performance
Above, we focused a lot on new "capabilities" (e.g. sharing objects by reference across subinterpreters) and performance (e.g. removing unnecessary locks and atomics in free-threaded Python). Ultimately, we believe that our proposal will make it easier to write concurrent Python programs, and some of that is already enabled by PEP795. We have carefully designed it so that single-threaded programs will not observe any changes (all objects are inside the default region) unless they start explicitly creating multiple regions.