This is somewhat me thinking out loud about a slightly different way of removing the GIL, whilst avoiding some of the downsides mentioned for @colesbury's PEP703.
I'm trying to propose a design, not an implementation. There are many ways of building something like this, and these things are typically complex in reality and require many hours to get working well.
Also worth noting that this post is from a random guy rambling on about performance, without any code or numbers backing him up. I somewhat want to try building this and seeing how it performs, though I doubt I'll get it done in a timely fashion. So if anyone else wants to, then go for it.
Advantages over PEP703:
- The fast path needs only an atomic read & branch per object (compared to a per-object mutex, or rewriting the object to atomic thread-safe code).
- Does not break any existing C module or the stable API.
- Allows for consistency across multiple objects at the same time.
- Allows for guard → action JIT optimizations.
Disadvantages over PEP703:
- Stealing an object off a running thread would be slower.
- Entering legacy C extensions requires picking up a lock, so they do not run in parallel.
- Python must use the slower atomic reference counting for objects which have ever been handed to a legacy C extension.
- The new C API would do concurrency differently than typical concurrent C code, potentially giving it a steeper learning curve.
Compatibility issues with existing python/C code:
- If a C extension tries to directly read the reference count, then it may get a lower value than it should.
Things I’m not addressing:
- Maintenance complexity.
- What the new C API to get parallelism in a C extension would look like.
High level overview
Adding to all objects:
- owner_id: Optional[ThreadID]
- atomic_refcnt: Atomic[int]
- in_C: Atomic[bool]
Adding to thread structure:
- steal_requests: AtomicQueue[Tuple[Thread, object]]  # Implementable as CAS on a linked list
- epoch: Atomic[int]
- deep_sleep: Atomic[bool]
All objects have a modified version of RW locks. However, instead of lock/unlock, there is steal/preempt.
If a thread has its thread_id in the owner field of an object, then it may safely read/write that object without any atomic instructions. If an object has no owner, then any thread can read from it.
Threads have preemption points. When one is executed, objects may be stolen from the thread. To prevent deadlocks, object stealing is itself a preemption point (unless the thread is holding the CLock). A thread may continue to use an object until it is stolen (which will always happen at a preemption point).
Sleeping is also a preemption point, and a deep sleep requires a global_barrier() before waking up.
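To make the fast path concrete, a guarded attribute read could look roughly like this (read_field is an illustrative helper I'm making up here, not part of the proposal):
def read_field(self: Thread, o: Object, name: str):
    owner = atomic_read(o.owner_id)
    if owner == self.ID or owner is None:
        # Fast path: we own the object, or it is ownerless and therefore
        # readable by anyone. No further atomics needed.
        return getattr(o, name)
    # Slow path: become a reader. This is a preemption point and may sleep.
    steal_read(self, o)
    return getattr(o, name)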
Reference counting of objects uses two fields: an atomic one (atomic_refcnt), and a non-atomic one (ob_refcnt). The real reference count is ob_refcnt + atomic_refcnt. ob_refcnt must be greater than 0 if the object is alive. If atomic_refcnt goes below 0, then we move some of the count over from ob_refcnt.
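A quick worked example of the rebalancing (numbers illustrative; the matching pseudo-code is Py_DECREF_slow below):
# An object alive with 3 references might look like:
#   ob_refcnt = 3, atomic_refcnt = 0      -> real count = 3
# A Py_DECREF_slow takes atomic_refcnt to -1, which is not allowed to
# persist, so the caller steals write ownership and moves count across:
#   steal_amount = 3 // 2 = 1             -> ob_refcnt = 2, atomic_refcnt = 0
# The real count is now 2: exactly one reference was dropped.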
If an object has ever been passed to a C extension using the limited C API < 3.12, then it gets the in_C flag set.
A thread may use the non-atomic RC field if either (sketched below):
- The thread is holding the CLock (in_C would already be set)
- The in_C flag is False and the thread is the owner of the object
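As a predicate, that rule might look like this (may_use_fast_refcount is a hypothetical helper, not a proposed API):
def may_use_fast_refcount(self: Thread, o: Object) -> bool:
    if self.holds_CLock():
        # Legacy C code is serialized behind CLock (in_C is already set)
        return True
    # Never seen by a legacy extension, and currently owned by us
    return not atomic_read(o.in_C) and o.owner_id == self.ID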
Python-like pseudo-code for key operations
def global_barrier(self: Thread):
    """
    Returns after every other thread has hit a preemption point.
    """
    global epoch
    original_epoch = atomic_read(epoch)
    atomic_add(epoch, 1)
    for thread in running_threads:
        if thread == self:
            continue
        if atomic_read(thread.epoch) <= original_epoch:
            # Wait till the thread preempts
            # TODO: signal thread correctly
            thread.steal_requests.push((self, None))
            self.sleep()
def deep_sleep(o_thread: Thread):
    if not o_thread.is_sleeping():
        return False
    atomic_write(o_thread.deep_sleep, True)
    return o_thread.is_sleeping()

def wakeup(self: Thread):
    self.sleeping = False
    if self.deep_sleep:
        self.deep_sleep = False
        # TODO: global_barrier must not go into deep_sleep
        global_barrier()
def preempt(self: Thread, keep_objects: Set[Object] = set()):
    global epoch
    if self.holds_CLock():
        # C will only preempt if it drops CLock
        return
    if epoch != self.epoch and len(keep_objects) == 0:
        # Acknowledge we have seen the epoch change
        atomic_write(self.epoch, epoch)
    delayed_steal_requests = []
    while not self.steal_requests.is_empty():
        thread, obj = self.steal_requests.pop()
        if obj is None:
            # A global_barrier wakeup request; there is nothing to hand over
            thread.wake()
            continue
        if obj in keep_objects:
            # multi_steal in progress.
            # We can't deny the CLock thread from stealing objects
            if not thread.holds_CLock():
                delayed_steal_requests.append((thread, obj))
                continue
        # Check if we have already handed off ownership of the object before
        if obj.owner_id == self.ID:
            atomic_write(obj.owner_id, thread.ID)
            thread.wake()
    # Schedule checking of delayed_steal_requests on the next preempt
    for thread, obj in delayed_steal_requests:
        self.steal_requests.push((thread, obj))
def steal_write(self: Thread, o: Object):
    if o.owner_id == self.ID:
        return
    while True:
        self.preempt()
        if o.owner_id is not None:
            # The owner thread may have terminated already
            owner = all_threads.get(o.owner_id, None)
            if owner is None or deep_sleep(owner):
                # Since the owner is not running, we can just grab the object
                if CAS(o.owner_id, o.owner_id, self.ID):
                    return
            else:
                # A different thread has write access
                owner.steal_requests.push((self, o))
                self.sleep()
                if o.owner_id == self.ID:
                    return
        else:
            # Upgrading read access to write access
            if not CAS(o.owner_id, None, self.ID):
                continue
            # We will be the next owner, but must let all potential
            # concurrent readers preempt
            global_barrier()
            return
def steal_read(self: Thread, o: Object):
    if o.owner_id == self.ID or o.owner_id is None:
        return
    while True:
        self.preempt()
        # The owner thread may have terminated already
        owner = all_threads.get(o.owner_id, None)
        if owner is None or deep_sleep(owner):
            # Since the owner is not running, we can just grab the object
            CAS(o.owner_id, o.owner_id, self.ID)
        else:
            owner.steal_requests.push((self, o))
            self.sleep()
        if o.owner_id is None:
            return
        elif o.owner_id == self.ID:
            if heuristic_is_read_contended(o):
                o.owner_id = None
            return
def multi_steal(self: Thread, check_list: List[Tuple[Object, bool]]):
    # Fast path
    for o, write_access in check_list:
        if o.owner_id == self.ID:
            continue
        elif not write_access and o.owner_id is None:
            continue
        else:
            break
    else:
        # We already have the required access
        return
    # Do a preempt without any keep_objects, to allow ownerless objects
    # to be stolen
    self.preempt()
    # Sort check_list by object id to prevent deadlocks
    check_list.sort(key=lambda a: id(a[0]))
    # To prevent livelocks, we limit which objects are stealable whilst
    # running multi_steal. access_granted_objects may only grow in size,
    # guaranteeing progress
    access_granted_objects = set()
    while len(access_granted_objects) < len(check_list):
        o, write_access = check_list[len(access_granted_objects)]
        if o.owner_id == self.ID or (o.owner_id is None and not write_access):
            access_granted_objects.add(o)
            continue
        owner = all_threads.get(o.owner_id, None)
        if owner is None or deep_sleep(owner):
            # Since the owner is not running, we can just grab the object
            CAS(o.owner_id, o.owner_id, self.ID)
        else:
            self.preempt(keep_objects=access_granted_objects)
            owner.steal_requests.push((self, o))
            self.sleep()
    # A thread holding CLock may have stolen objects even though we said
    # not to. Rely on the fast path checking if this was the case
    # (an actual implementation would use a GOTO statement)
    return self.multi_steal(check_list)
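As an example of the multi-object consistency this enables (the move_item helper is purely illustrative):
def move_item(self: Thread, src: Object, dst: Object, key):
    # Take write access to both containers in one call, so no other
    # thread can observe a state where the item is in neither (or both)
    self.multi_steal([(src, True), (dst, True)])
    # Both objects now stay ours until our next preemption point
    dst[key] = src.pop(key)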
def Py_INCREF_fast(o: Object):
    # Intentionally unchanged from current Python
    if _Py_IsImmortal(o):
        return
    o.ob_refcnt += 1

def Py_INCREF_slow(o: Object):
    if _Py_IsImmortal(o):
        return
    atomic_add(o.atomic_refcnt, 1)

def Py_DECREF_fast(o: Object):
    # Intentionally unchanged from current Python
    if _Py_IsImmortal(o):
        return
    o.ob_refcnt -= 1
    if o.ob_refcnt == 0:
        _Py_Dealloc(o)
def _Py_Dealloc(o: Object):
    # Can we steal some from the atomic field?
    atomic_refcnt = atomic_read(o.atomic_refcnt)
    while True:
        if atomic_refcnt == 0 and o.ob_refcnt == 0:
            # Actually free the object
            _Real_Py_Dealloc(o)
            return
        elif o.ob_refcnt > 0:
            # Successfully stole some reference count
            return
        # Steal half the reference count.
        # Adding 1 here makes sure o.ob_refcnt ends up > 0
        o.ob_refcnt = atomic_refcnt // 2 + 1
        atomic_refcnt = atomic_add(o.atomic_refcnt, -o.ob_refcnt)
        # Atomic add returns the original value, so turn it into the new value
        atomic_refcnt -= o.ob_refcnt
        while atomic_refcnt < 0:
            # We stole too much due to concurrent modifications
            o.ob_refcnt += atomic_refcnt
            prev_refcnt = atomic_add(o.atomic_refcnt, -atomic_refcnt)
            atomic_refcnt = prev_refcnt - atomic_refcnt
def Py_DECREF_slow(o: Object):
    if _Py_IsImmortal(o):
        return
    original_id = id(o)
    if atomic_add(o.atomic_refcnt, -1) != 0:
        return
    # We have made o.atomic_refcnt go less than 0
    steal_write(o)
    if id(o) != original_id:
        # Object has already had _Real_Py_Dealloc called
        # See `Object free race` below
        return
    needs_clock = o.in_C and not CLock.held()
    if needs_clock:
        CLock.lock()
    while True:
        if o.ob_refcnt == 1:
            # o.atomic_refcnt may have gone up while waiting
            if atomic_read(o.atomic_refcnt) < 0:
                if needs_clock:
                    CLock.unlock()
                _Real_Py_Dealloc(o)
                return
            else:
                # Some other thread will drop o.atomic_refcnt back past 0 later
                if needs_clock:
                    CLock.unlock()
                return
        # Steal half of ob_refcnt.
        # Since o.ob_refcnt > 1, this will leave o.ob_refcnt with at least 1
        steal_amount = o.ob_refcnt // 2
        if o.owner_id is None and not o.in_C:
            # Optimization: steal all of the reference count
            steal_amount = o.ob_refcnt - 1
        o.ob_refcnt -= steal_amount
        if atomic_add(o.atomic_refcnt, steal_amount) >= -steal_amount:
            # o.atomic_refcnt is now >= 0
            if needs_clock:
                CLock.unlock()
            return
        # We didn't steal enough (due to concurrent ref count drops). Retry
Object free race
If there are two threads holding an object o, with ThreadA being the owner and both holding one reference count each, then:
- ThreadB calls Py_DECREF_slow(o), decrements o.atomic_refcnt to 0, and sleeps in steal_write(o)
- ThreadA calls Py_DECREF_fast(o), sees o.atomic_refcnt is 0, and calls _Real_Py_Dealloc(o)
- ThreadB wakes up and continues executing with the de-allocated o
Two solutions to the problem:
Having _Real_Py_Dealloc wait on a barrier before releasing the memory (all threads that were running Py_DECREF_slow when the barrier started must have finished Py_DECREF_slow before the barrier completes). Reusing the memory as a Python object of the same size won't need the barrier.
Or using a CAS loop instead of atomic_add, allowing Py_DECREF_slow to steal_write(o) before o.atomic_refcnt reaches 0.
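A minimal sketch of that second solution (Py_DECREF_slow_cas is a hypothetical variant, assuming it is safe to take write ownership before publishing the decrement):
def Py_DECREF_slow_cas(self: Thread, o: Object):
    if _Py_IsImmortal(o):
        return
    while True:
        current = atomic_read(o.atomic_refcnt)
        if current > 1:
            # Plenty of count left; a plain decrement is safe
            if CAS(o.atomic_refcnt, current, current - 1):
                return
        else:
            # Our decrement would exhaust atomic_refcnt. Take write
            # ownership *before* publishing the decrement, so no other
            # thread can free o while we are asleep in steal_write
            steal_write(self, o)
            if CAS(o.atomic_refcnt, current, current - 1):
                break
    # We own o here: rebalance from ob_refcnt or dealloc, as in
    # Py_DECREF_slow above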
Legacy Extensions
There exists a global lock called CLock. Any time a "legacy extension" is executed, the CLock is picked up. The CLock then acts for the C code like the GIL used to.
As long as a thread holds the CLock, it will never be preempted. (However, calling Py_BEGIN_ALLOW_THREADS will drop the CLock, causing a preemption point, just like how the GIL used to work.)
A thread holding the CLock may steal objects from other threads (potentially having to wait till they are at a preemption point). But other threads must wait until the CLock is dropped to steal them back. Since only one thread can hold the CLock at any one point in time, this cannot deadlock.
All functions from the legacy C API steal ownership of the objects before accessing them, providing the illusion that no other threads are running.
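For illustration, entering a legacy extension could be wrapped roughly like this (call_legacy is a hypothetical helper; in reality this logic would live inside the C API functions themselves):
def call_legacy(self: Thread, func, *objects):
    # Serialize all legacy C code behind CLock, and steal the arguments
    # so the extension sees what looks like a single-threaded world
    CLock.lock()
    try:
        for o in objects:
            # Assuming a limited C API < 3.12 extension, so the objects
            # get flagged and switch to atomic refcounting from now on
            atomic_write(o.in_C, True)
            steal_write(self, o)
        return func(*objects)
    finally:
        CLock.unlock()  # Dropping CLock acts as a preemption point
        self.preempt()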
A C extension using the limited C API < 3.12 directly accesses the reference counter field, executing:
if (--op->ob_refcnt == 0) {
_Py_Dealloc(op);
}
We modify _Py_Dealloc to potentially move counts across from atomic_refcnt to ob_refcnt, and to only actually do the deallocation if there are no references left to move. The exact behavior of _Py_Dealloc is not part of the stable API, so this is a safe change to make.
If the extension can be recompiled, then this flagging can be removed without changing the extension’s code.
JIT
Mentioned a couple of times by Mark Shannon, Guido [PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) - #118 by guido] and others: the Faster CPython project wants a JIT with separate checks and actions.
This design plays nicely with that, as a check of "is my thread the owner of this object?" remains valid until a preemption point is executed.
By the nature of preemption points: if one is hit, then any object (which is accessible to other threads) may have been modified, causing almost all checks to become invalid; they must be re-checked against the new state of the world.
A preemption point can measure whether any objects were stolen, allowing for generated code like:
def super_useful_function(objectA):
    objectA.c = objectA.inner.c
    objectA.inner.c = None

def super_useful_function_JITv1(objectA):
    # Since we are writing later, do a steal_write instead of a steal_read
    steal_write(objectA)
    inner = objectA.inner
    # Preemption point
    steal_write(inner)
    if objectA.owner_id != current_thread_id:
        goto slow_path
    # Since objectA was not stolen, it has not been modified, and we may
    # still write to it
    c = inner.c
    objectA.c = c
    inner.c = None
    if objectA is inner:
        # Self-referential edge case. We are deleting inner.c
        # Preemption point
        Py_DECREF(c)
    else:
        # inner.c needs no reference count modifications
        pass
Or (taking a page from the “Ask forgiveness not permission” book):
def super_useful_function_JITv2(objectA):
    if objectA.owner_id != current_thread_id:
        goto slow_path
    inner = objectA.inner
    if inner.owner_id != current_thread_id:
        goto slow_path
    if objectA is inner:
        # Assuming the function typically isn't called with self-referential objects
        goto slow_path
    objectA.c = inner.c
    inner.c = None
super_useful_function_JITv2 now no longer contains a preemption point in the fast path, which could play nicely with other optimizations.
Single threaded performance slowdowns
If an object never leaves a thread, then that thread will always be its owner. This means there are no stealing costs for it, and we never have to use CAS or atomic_add on the object.
This increases the size of all Python objects, increasing memory pressure etc. I think you could get away with packing the fields into 64 bits of space per object.
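One entirely speculative way of doing that packing (the layout and constants below are my assumptions, not part of the design):
# Hypothetical layout of a single 64-bit word per object:
#   bits  0..30  atomic_refcnt (31 bits; a real implementation would need
#                a signed bias, since the count can briefly dip below 0)
#   bit   31     in_C flag
#   bits 32..63  owner_id (32-bit thread ID; all ones meaning "no owner")
ATOMIC_REFCNT_MASK = (1 << 31) - 1
IN_C_BIT = 1 << 31
OWNER_SHIFT = 32
NO_OWNER = (1 << 32) - 1

def unpack(word: int):
    atomic_refcnt = word & ATOMIC_REFCNT_MASK
    in_C = bool(word & IN_C_BIT)
    owner_id = word >> OWNER_SHIFT
    return atomic_refcnt, in_C, owner_id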
Python
All objects that get accessed need an atomic read of the ownership field, followed by a branch-predictable check. These reads & checks are something an out-of-order CPU is fairly good at hiding, but they have some overhead.
Barrier calls after every op code are somewhat analogous to the work done in _Py_HandlePending, and shouldn't increase overhead.
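For intuition, that per-opcode work might look roughly like this (eval_loop and execute are stand-ins, not real CPython internals):
def eval_loop(self: Thread, code):
    for instruction in code:
        # Cheap per-opcode check, in the same spirit as the existing
        # eval breaker handled by _Py_HandlePending
        if atomic_read(epoch) != self.epoch or not self.steal_requests.is_empty():
            self.preempt()
        execute(instruction)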
Reference counting
For objects owned by the thread, there is an extra check on object ownership (small to no slowdown).
Objects which are globally readable (and not immortal) require atomic instructions on the reference count, causing a slightly larger slowdown.
Objects that have ever been handed to a C extension using the limited C API < 3.12 must also use atomic instructions for the reference count, causing the same slightly larger slowdown.
Legacy C extensions
A mutex must be picked up before entering the legacy C code, on top of the same overheads that Python code has.
Multi threaded performance
Very hard to comment on without working code. If the code requires no object stealing, then it would see a linear improvement in performance. But then such code would work about as well under the multiple-interpreter model.
The main question is: how slow is object stealing? We could make some objects (dicts, arrays, etc.) internally thread safe to reduce stealing, but that may reduce the available JIT optimizations.
Other optimizations
- A JIT could remove redundant ownership checks
- Immutable objects could be accessed without stealing
- Stealing could busy wait a bit before sleeping
- Stealing could be implemented without signals