Alternate design for removing the GIL

This is me thinking out loud about a slightly different way of removing the GIL, one that avoids some of the downsides mentioned for @colesbury's PEP 703.

I’m trying to propose a design, not an implementation. There are many ways of building something like this. These things are typically complex in reality, and require many hours to get working well.

Also worth noting: this post is from a random guy rambling on about performance, without any code or numbers backing him up. I somewhat want to try building this and seeing how it performs, though I doubt I’ll get it done in a timely fashion. So if anyone else wants to, then go for it.

Advantages over PEP703:

  • fast path needs an atomic read & branch per object (compared to a per object mutex, or rewriting the object to atomic thread safe code).
  • Does not break any existing C module, or stable API.
  • Allows for consistency across multiple objects at the same time.
  • Allows for guard → action JIT optimizations.

Disadvantages over PEP703:

  • Stealing an object off a running thread would be slower.
  • Entering legacy C extensions requires picking up a lock, causing it to not be parallelized.
  • Python must use the slower atomic reference counting for objects which have ever been handed to a legacy C extension.
  • The new C API would do concurrency differently than typical concurrent C code, potentially giving it a steeper learning curve.

Compatibility issues with existing python/C code:

  • If a C extension tries to directly read the reference count, then it may get a lower value than the real one.

Things I’m not addressing:

  • Maintenance complexity.
  • What the new C API to get parallelism in a C extension would look like.

High level overview

Adding to all objects:

  • owner_id: Optional[ThreadID]
  • atomic_refcnt: Atomic[int]
  • in_C: Atomic[bool]

Adding to thread structure:

  • steal_requests: AtomicQueue[Tuple[Thread, object]] # Implementable as CAS on a linked list
  • epoch: Atomic[int]
  • deep_sleep: Atomic[bool]

All objects have a modified version of RW locks. However instead of lock/unlock, there is steal/preempt.

If a thread has its thread ID in the owner_id of an object, then it may safely read/write that object without any atomic instructions. If an object has no owner, then any thread can read from it.
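
The ownership rule above can be sketched as two small checks. This is only an illustration, not part of any implementation: ToyObject, can_read_fast and can_write_fast are hypothetical names, and a plain attribute read stands in for the atomic read of the owner field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyObject:
    # Hypothetical stand-in for a Python object header.
    # None means "no owner": the object is readable by every thread.
    owner_id: Optional[int] = None


def can_read_fast(obj: ToyObject, thread_id: int) -> bool:
    """Reading needs no stealing if we own the object or nobody does."""
    return obj.owner_id is None or obj.owner_id == thread_id


def can_write_fast(obj: ToyObject, thread_id: int) -> bool:
    """Writing needs no stealing only if we are the owner."""
    return obj.owner_id == thread_id
```

Anything that fails these checks would have to go through the stealing slow path.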

Threads have preemption points. When one is executed, objects may be stolen from the thread. To prevent deadlocks, object stealing is itself a preemption point (unless the thread is holding the CLock). A thread may continue to use an object until it is stolen (which will always happen at a preemption point).

Sleeping is also a preemption point, and a deep sleep requires a global_barrier() before waking up.

Reference counting of objects uses two fields: an atomic one (atomic_refcnt) and a non-atomic one (ob_refcnt). The real reference count is ob_refcnt + atomic_refcnt. ob_refcnt must be greater than 0 while the object is alive. If atomic_refcnt goes below 0, then we move some of the count across from ob_refcnt.
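
As a rough illustration of the two-field count, here is a toy model. It is a sketch under loud assumptions: SplitRefcount and its method names are made up for this example, a threading.Lock stands in for hardware atomics, and the rebalance is done under that lock, whereas the design itself would require owning the object first.

```python
import threading


class SplitRefcount:
    """Toy model: real count = ob_refcnt + atomic_refcnt."""

    def __init__(self):
        self.ob_refcnt = 1       # non-atomic field, owner only
        self.atomic_refcnt = 0   # shared field, may briefly dip below 0
        self._lock = threading.Lock()  # assumption: stands in for atomics

    def real_count(self):
        with self._lock:
            return self.ob_refcnt + self.atomic_refcnt

    def incref_owner(self):
        # Owner fast path: plain, non-atomic increment
        self.ob_refcnt += 1

    def incref_shared(self):
        with self._lock:
            self.atomic_refcnt += 1

    def decref_shared(self):
        with self._lock:
            self.atomic_refcnt -= 1
            if self.atomic_refcnt < 0 and self.ob_refcnt > 1:
                # Move half of the owner's count across,
                # always leaving ob_refcnt at 1 or more
                moved = self.ob_refcnt // 2
                self.ob_refcnt -= moved
                self.atomic_refcnt += moved
```

The point of the model is only the invariant: the sum stays correct while ob_refcnt never reaches 0 for a live object.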

If an object has ever been passed to a C extension using the limited C API < 3.12, then it gets the in_C flag set.

A thread may use the non-atomic RC field if either:

  • The thread is holding the CLock (in_C would already be set)
  • The in_C flag is False and the thread is the owner of the object
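
The two conditions above boil down to a single predicate. A minimal sketch, where FlaggedObject and may_use_nonatomic_rc are hypothetical names and holds_clock is assumed to be supplied by the caller:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlaggedObject:
    # Hypothetical stand-in carrying only the fields the rule needs
    owner_id: Optional[int] = None
    in_C: bool = False


def may_use_nonatomic_rc(obj: FlaggedObject, thread_id: int, holds_clock: bool) -> bool:
    """True if this thread may touch ob_refcnt directly."""
    if holds_clock:
        # Rule 1: the CLock holder behaves like legacy GIL-protected code
        return True
    # Rule 2: never exposed to legacy C, and we are the owner
    return (not obj.in_C) and obj.owner_id == thread_id
```

In every other case the thread would fall back to atomic_refcnt.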

Python-like pseudo-code for key operations

def global_barrier(self: Thread):
    """Returns after every other thread has hit a preemption point."""
    global epoch

    original_epoch = atomic_read(epoch)
    atomic_add(epoch, 1)

    for thread in running_threads:
        if thread == self:
            continue

        if atomic_read(thread.epoch) <= original_epoch:
            # wait till thread preempts

            # TODO: signal thread correctly
            thread.steal_requests.push((self, None))

def deep_sleep(o_thread: Thread):
    if not o_thread.is_sleeping():
        return False

    atomic_write(o_thread.deep_sleep, True)

    return o_thread.is_sleeping()

def wakeup(self: Thread):
    self.sleeping = False

    if self.deep_sleep:
        self.deep_sleep = False

        # TODO: global_barrier must not go into deep_sleep

def preempt(self: Thread, keep_objects: Set[Object] = set()):
    global epoch

    if self.holds_CLock():
        # C will only preempt if it drops CLock
        return

    if epoch != self.epoch and len(keep_objects) == 0:
        # Acknowledge we have seen the epoch change
        atomic_write(self.epoch, epoch)

    delayed_steal_requests = []

    while not self.steal_requests.is_empty():
        thread, obj = self.steal_requests.pop()

        if obj in keep_objects:
            # multi_check in progress

            # We can't deny the CLock thread from stealing objects
            if not thread.holds_CLock():
                delayed_steal_requests.append((thread, obj))
                continue

        # Check if we have already handed off ownership of the object before
        if obj.owner_id == self.ID:
            atomic_write(obj.owner_id, thread)


    # Schedule checking of delayed_steal_requests on next preempt
    for thread, obj in delayed_steal_requests:
        self.steal_requests.push((thread, obj))

def steal_write(self: Thread, o: Object):
    if o.owner_id == self.ID:
        # Already have write access
        return

    while True:

        if o.owner_id is not None:
            # The owner thread may have terminated already
            owner = all_threads.get(o.owner_id, None)

            if owner is None or deep_sleep(owner):
                # since owner is not running, we can just grab the object
                if CAS(o.owner_id, o.owner_id, self):

                # A different thread has write access
                owner.steal_requests.push((self, o))

                if o.owner_id == self.ID:

            # Upgrading read access to write access
            if not CAS(o.owner_id, None, self):

            # We will be the next owner. But must let all potential concurrent readers preempt

def steal_read(self: Thread, o: Object):
    if o.owner_id == self.ID or o.owner_id is None:
        # Already have read access
        return

    while True:

        # The owner thread may have terminated already
        owner = all_threads.get(o.owner_id, None)

        if owner is None or deep_sleep(owner):
            # since owner is not running, we can just grab the object
            CAS(o.owner_id, o.owner_id, self)

            owner.steal_requests.push((self, o))

        if o.owner_id is None:

        elif o.owner_id == self.ID:
            if heuristic_is_read_contended(o):
                o.owner_id = None


def multi_steal(self: Thread, check_list: List[Tuple[Object, bool]]):
    # fast path
    for o, write_access in check_list:
        if o.owner_id == self.ID:
        elif not write_access and o.owner_id is None:
        # We already have required access

    # Do a preempt without any locked_objects to allow for ownerless objects
    # to be stolen

    # Sorting check_list by the object id to prevent deadlocks
    check_list.sort(key = lambda a: id(a[0]))

    # To prevent livelocks, we limit which objects are stealable whilst
    # running multi_steal. access_granted_objects may only grow in size,
    # guaranteeing progress
    access_granted_objects = set()

    while len(access_granted_objects) < len(check_list):
        o, write_access = check_list[len(access_granted_objects)]

        if o.owner_id == self.ID or (o.owner_id is None and not write_access):

        owner = all_threads.get(o.owner_id, None)

        if owner is None or deep_sleep(owner):
            # since owner is not running, we can just grab the object
            CAS(o.owner_id, o.owner_id, self)


            o.Owner.steal_requests.push((self, o))

    # A thread holding CLock may have stolen objects even though we said not to.
    # Rely on the fast path checking if this was the case
    # (actual implementation to use a GOTO statement)
    return self.multi_steal(check_list)

def Py_INCREF_fast(o: Object):
    # Intentionally not changed from current python

    if _Py_IsImmortal(o):
        return

    o.ob_refcnt += 1

def Py_INCREF_slow(o: Object):
    if _Py_IsImmortal(o):
        return

    atomic_add(o.atomic_refcnt, 1)

def Py_DECREF_fast(o: Object):
    # Intentionally not changed from current python

    if _Py_IsImmortal(o):
        return

    o.ob_refcnt -= 1

    if o.ob_refcnt == 0:
        _Py_Dealloc(o)

def _Py_Dealloc(o: Object):
    # Can we steal some from the atomic field?
    atomic_refcnt = atomic_read(o.atomic_refcnt)

    while True:
        if atomic_refcnt == 0 and o.ob_refcnt == 0:
            # Actually free the object
            return _Real_Py_Dealloc(o)
        elif o.ob_refcnt > 0:
            # Successfully stole some reference count
            return

        # Stealing half the reference counts
        # Adding 1 here to make sure o.ob_refcnt ends up > 0
        o.ob_refcnt = atomic_refcnt // 2 + 1
        atomic_refcnt = atomic_add(o.atomic_refcnt, -o.ob_refcnt)

        # Atomic add returns the original value. So turn it into the new value
        atomic_refcnt -= o.ob_refcnt

        while atomic_refcnt < 0:
            # We stole too much due to concurrent modifications
            o.ob_refcnt += atomic_refcnt
            prev_refcnt = atomic_add(o.atomic_refcnt, -atomic_refcnt)
            atomic_refcnt = prev_refcnt - atomic_refcnt

def Py_DECREF_slow(o: Object):
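    if _Py_IsImmortal(o):
        return
    original_id = id(o)
    if atomic_add(o.atomic_refcnt, -1) != 0:
        # atomic_add returns the original value. It was not 0, so we are
        # not the thread that took the count below 0
        return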

    # We have made o.atomic_refcnt go less than 0


    if id(o) != original_id:
        # Object has already had _Real_Py_Dealloc called
        # See `Object free race` below

    needs_clock = o.in_C and not CLock.held()

    if needs_clock:

    while True:
        if o.ob_refcnt == 1:
            # o.atomic_refcnt may have gone up while waiting
            if atomic_read(o.atomic_refcnt) < 0:
                if needs_clock:


                # Some other thread will drop o.atomic_refcnt back past 0 later
                if needs_clock:


        # Steal half of ob_refcnt
        # Since o.ob_refcnt > 1, this will leave o.ob_refcnt with at least 1

        steal_amount = o.ob_refcnt // 2

        if o.owner_id is None and not o.in_C:
            # Optimization: Steal all the reference count
            steal_amount = o.ob_refcnt - 1

        o.ob_refcnt -= steal_amount
        if atomic_add(o.atomic_refcnt, steal_amount) >= -steal_amount:
            # o.atomic_refcnt is now >= 0
            if needs_clock:


        # We didn't steal enough (due to concurrent ref count drops). Retry

Object free race

Suppose two threads share an object o, with ThreadA being the owner and each thread holding one reference. Then:

  • ThreadB calls Py_DECREF_slow(o). Decrements o.atomic_refcnt to 0 and sleeps on steal_write(o)
  • ThreadA calls Py_DECREF_fast(o). Sees o.atomic_refcnt is 0, and calls _Real_Py_Dealloc(o)
  • ThreadB wakes up and continues executing with the de-allocated o

Two solutions to the problem:

Having _Real_Py_Dealloc wait on a barrier before releasing the memory: every thread that was running Py_DECREF_slow when the barrier started must have finished Py_DECREF_slow before the barrier completes. (Reusing the memory as a Python object of the same size won’t need the barrier.)

Or using a CAS loop instead of atomic_add, allowing Py_DECREF_slow to steal_write(o) before o.atomic_refcnt is 0.
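
The CAS-loop option might look roughly like the sketch below. Everything here is an assumption made for illustration: AtomicInt emulates an atomic word with a lock, try_decref_shared is a hypothetical helper, and min_after is a knob for "the lowest value we are willing to publish without owning the object", since the exact threshold depends on details not settled above.

```python
import threading


class AtomicInt:
    """Lock-based stand-in for an atomic integer (assumption, not real atomics)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def try_decref_shared(counter, min_after=0):
    """Decrement only if the result stays >= min_after.

    Returns True if the decrement was published. Returns False without
    changing the counter if the caller should first take ownership
    (e.g. via steal_write) and finish the decrement as the owner.
    """
    while True:
        current = counter.load()
        if current - 1 < min_after:
            return False
        if counter.compare_and_swap(current, current - 1):
            return True
        # Lost a race with another thread: reload and retry
```

Because the risky decrement is never published before ownership is acquired, the owner cannot observe the count that triggers the free while the other thread is still waiting.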

Legacy Extensions

There exists a global lock called CLock. Any time a “legacy extension” is executed, the CLock is picked up. The CLock now acts like the GIL used to for the C code.

As long as the thread holds the CLock, it will never be preempted. (However, calling Py_BEGIN_ALLOW_THREADS will drop the CLock, causing a preemption point, just like how the GIL used to work.)

A thread holding the CLock may steal objects from other threads (potentially having to wait till they are at a preemption point). But other threads must wait until CLock is dropped to steal back. Since only one thread can hold CLock at any one point in time, this cannot deadlock.

All functions from the legacy C API steal ownership of the objects before accessing them. Providing the illusion that no other threads are running.

A C extension using the limited C API < 3.12 directly accesses the reference counter field, executing:

    if (--op->ob_refcnt == 0) {

We modify _Py_Dealloc to potentially move across atomic_refcnt to ob_refcnt, and only actually do the deallocation if there are no references to move. The exact behavior of _Py_Dealloc is not part of the stable API. So this is a safe change to make.

If the extension can be recompiled, then this flagging can be removed without changing the extension’s code.


Mentioned a couple of times by Mark Shannon, Guido [PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) - #118 by guido] and others: the Faster CPython project wants to have a JIT with separate checks and actions.

This design plays nicely with that, as a check of “Is my thread the owner of this object” will remain valid until a preemption point is executed.

By the nature of preemption points: if one is hit, then any object (which is accessible to other threads) may be modified, causing nearly all checks to become invalid. They must be re-checked with the new state of the world.

A preemption point can measure if any objects were stolen. Allowing for generated code like:

def super_useful_function(objectA):
    objectA.c = objectA.inner.c
    objectA.inner.c = None

def super_useful_function_JITv1(objectA):
    # Since we are writing later, do a steal_write instead of a steal_read
    inner = objectA.inner

    # preemption point

    if objectA.owner_id != current_thread_id:
        goto slow_path

    # Since objectA was not stolen, then it has not been modified, and also we may still write to it
    c = inner.c
    objectA.c = c
    inner.c = None

    if objectA is inner:
        # Self referential edge case. We are deleting inner.c

        # preemption point

        # inner.c needs no reference count modifications

Or (taking a page from the “Ask forgiveness not permission” book):

def super_useful_function_JITv2(objectA):
    if objectA.owner_id != current_thread_id:
        goto slow_path

    inner = objectA.inner

    if inner.owner_id != current_thread_id:
        goto slow_path

    if objectA is inner:
        # Assuming function typically isn't being called with self referential objects
        goto slow_path

    objectA.c = inner.c
    inner.c = None

super_useful_function_JITv2 no longer contains a preemption point in the fast path, which could play nicely with other optimizations.

Single threaded performance slowdowns

If an object does not leave a thread, then that thread will always be its owner, meaning there are no stealing costs for it. We also never have to use CAS or atomic_add on the object.

This increases the size of all python objects. Increasing memory pressure etc. I think you could get away with packing the fields into 64 bits of space per object.
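
To make the 64-bit packing idea concrete, here is one possible layout. The bit widths are purely an assumption picked for illustration (1 bit for in_C, 23 bits for the owner ID with 0 meaning "no owner", 40 bits for a signed atomic_refcnt); pack and unpack are hypothetical helpers.

```python
OWNER_BITS = 23    # assumption: thread IDs fit in 23 bits, 0 = no owner
REFCNT_BITS = 40   # assumption: signed atomic_refcnt in two's complement
NO_OWNER = 0

_REFCNT_MASK = (1 << REFCNT_BITS) - 1
_OWNER_MASK = (1 << OWNER_BITS) - 1


def pack(owner_id, in_c, atomic_refcnt):
    """Pack the three per-object fields into one 64-bit word."""
    assert 0 <= owner_id <= _OWNER_MASK
    assert -(1 << (REFCNT_BITS - 1)) <= atomic_refcnt < (1 << (REFCNT_BITS - 1))
    refcnt_field = atomic_refcnt & _REFCNT_MASK  # two's complement encoding
    return (int(in_c) << 63) | (owner_id << REFCNT_BITS) | refcnt_field


def unpack(word):
    """Inverse of pack: returns (owner_id, in_c, atomic_refcnt)."""
    refcnt = word & _REFCNT_MASK
    if refcnt >= 1 << (REFCNT_BITS - 1):
        refcnt -= 1 << REFCNT_BITS  # restore the sign
    owner_id = (word >> REFCNT_BITS) & _OWNER_MASK
    in_c = bool(word >> 63)
    return owner_id, in_c, refcnt
```

A nice side effect of a single word is that the owner check and the in_C check can come from one load.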


All objects that get accessed need an atomic read on the ownership field, followed by a branch-predictable check. These reads and checks are something that an out-of-order CPU is fairly good at hiding, but they have some overhead.

Barrier calls after every op code are somewhat analogous to the work done in _Py_HandlePending, and shouldn’t increase overhead

Reference counting

For objects owned by the thread, there is an extra check on object ownership (small to no slowdown)

Objects which are globally readable (and not immortal) require atomic instructions on the reference count. Causing a slightly larger slowdown

Also, for objects that have ever been handed to a C extension using the limited C API < 3.12, atomic instructions must be used for the reference count, causing the same slightly larger slowdown.

Legacy C extensions

A mutex must be picked up before entering legacy C code. Beyond that, it has the same overheads Python code does.

Multi threaded performance

Very hard to comment on without working code. If the code requires no object stealing, then it would have a linear improvement in performance. But also, it would work ~ just as well in the multiple interpreter model.

The main question is: How slow is object stealing? We could make some objects (dicts, arrays etc) internally thread safe to reduce stealing. But that may reduce available JIT optimization.

Other optimizations

  • A JIT could remove redundant ownership checks
  • Immutable objects could be accessed without stealing
  • Stealing could busy wait a bit before sleeping
  • Stealing could be implemented without signals

If one thread is reading an object you must guarantee that it is not stolen and modified while it is being read. If that involves a lock the code will be very slow. I am not sure your design does this.

Use of atomic CPU operations has an impact on CPU performance that will slow single-threaded and uncontended object use. It may even slow all processes on a system if atomics are used too frequently, but I’m not sure of this.

Just fixed a minor typo in steal_write.

This is handled. (The top code block is longish, and has a lot of details in code rather than English)

Summarizing relevant code:

def steal_write(self: Thread, o: Object):
    if o.owner_id == self.ID:

    while True:

        if o.owner_id is not None:
            # Upgrading read access to write access
            if not CAS(o.owner_id, None, self):

            # We will be the next owner. But must let all potential concurrent readers preempt

So it will: Mark the object as the current thread having write access (to stop future readers/writers), and then issue a global_barrier(). The global_barrier() waits for all running threads to preempt.

This is fast for the fast path (reading an object to which you have read access), and slow for object stealing.

(A detail that I didn’t mention is that the object cannot be stolen whilst that global_barrier() call is ongoing. Something that the deep_sleep stealing optimization needs to take into account.)

It does mean the Python runtime has a trade-off when deciding to make an object read-only: stealing from write access requires waiting for one thread to preempt (or be asleep), whereas stealing from read access needs all threads to preempt (much slower).
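
That asymmetry can be spelled out as a small helper. This is a sketch only: threads_to_wait_for is a hypothetical function, thread IDs are plain integers, and the deep_sleep shortcut is deliberately ignored.

```python
def threads_to_wait_for(owner_id, running_thread_ids, self_id):
    """Which threads must hit a preemption point before self_id may write?"""
    if owner_id == self_id:
        # Fast path: we already own the object
        return set()
    if owner_id is not None:
        # Stealing from a writer: only that one thread matters
        return {owner_id}
    # Ownerless (read-shared) object: any other running thread may be
    # reading it, so all of them have to preempt first
    return {t for t in running_thread_ids if t != self_id}
```

With four running threads, stealing from a single owner waits on one thread, while upgrading a read-shared object waits on the other three.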

Details change between CPU architectures. But atomic_read and atomic_write slowness comes from requiring memory fences, which prohibit the CPU from reordering memory accesses. They are the faster of the atomic instructions. atomic_add, compare-and-swap and atomic exchange require the CPU to pin the cache line to the core, which is slower.

I could be wrong in this, but off the top of my head, (on x86,) all word aligned reads are atomic. And a memory fence is only required in the object stealing code. So the fast path (which single threaded code is guaranteed to hit) boils down to: Do an ordinary read from a cache line that you are about to read from anyway, and an ordinary branch-predictable-branch. Which is. Well. Fast.

How fast/slow exactly? idk. Would need to build it up and test.

This depends on undefined behaviour of the C language.
The Linux kernel code goes to great lengths to stop the compiler splitting the reads into multiple instructions.

I was more describing the performance of the final assembly. Not what you must do to get C to output that assembly. So yes, you do need to use the correct compiler intrinsic to generate the correct assembly. Even if it is just for the purpose of making architecture porting easier. But that is also very much an “implementation detail”. Which. Well, requires there to be an implementation.

In my pseudo-code, I skipped writing atomic_read(o.owner_id) because things were getting a bit noisy. The final code will likely merge owner_id, in_C and atomic_refcnt into one word, adding more complexity which isn’t relevant to describing the underlying idea. Same with correct memory barrier placements.

The issues with the compiler I am thinking about are discussed here and other related articles.
I assume that you will need solutions just like the linux kernel does.

OK, but given that projects to eliminate the GIL have been ongoing for many years, and in particular @colesbury has spent a huge amount of time getting his current work to its current state, do you consider offering a new design at this point, with no implementation to back it up, to be a realistic suggestion?

What do you expect to happen here? Should the SC put PEP 703 on hold while someone tries to implement your idea? Should we implement PEP 703 and then rewrite it if your idea turns out to be a better approach? What about the other internals work (notably the “Faster CPython” work) that’s trying to get to grips with how PEP 703 affects their plans, approaches and timescales? Should they factor in yet another possibility that may or may not work when the details of implementation start getting tackled?

Talking about alternative approaches is fine, and very interesting (if you like interpreter internals :wink:) but this list is for proposals which are intended to be added to Python. I’m not sure your idea fits that classification. As a general “bouncing ideas around” conversation, I guess it’s fine, but don’t be surprised if the work on deciding whether PEP 703 gets accepted simply ignores it…


I somewhat tried to say this in my preamble. Sorry if I wasn’t too clear.

But I don’t really have any expectations.

Yes, this sort of code takes a while to build, and an order of magnitude or two more time to debug. And yes, it would be very disingenuous to have a random person just show up and expect someone else to do a lot of hard work. And that’s not even starting to talk about how it impacts other projects.

I did intend it to be more of a “bouncing of ideas”. I guess the complexity of something like this made me want to specify it more formally, (to say that it is possible for this to work).

I would expect a group like “Faster CPython” to not even consider a large change without good performance numbers. Let alone having no numbers at all.

Good question.

I guess looking at the 3 main up sides:

  • Not breaking the ABI
  • JIT optimizations
  • maybe performs better?

Assuming PEP 703 gets accepted, then breaking backwards compatibility would be a sunk cost. Also, I’m guessing the “Faster CPython” would be happy with what they can JIT. So the top two no longer really matter. And at a guess, I don’t think PEP 703 is leaving enough performance on the table to be able to justify a change in implementation.

So if PEP 703 does get accepted then I doubt there would be any reason to accept this into Python.

Why bother talking about this then? 2x points:

  1. If PEP 703 does not get accepted then most of your arguments don’t apply anymore.

  2. There may be some ideas in this that Sam could use to improve on PEP 703 (eg, like how I’m abusing that _Py_Dealloc isn’t part of the stable API to get backwards compatibility of the ABI). I don’t know his design well enough to make any comments tho.

I would be somewhat surprised if this isn’t ignored.

I’m sorry. From the informal description on the ideas category I was under the impression that this post would have been OK.

Is there somewhere better that something like this should be moved to? Or maybe should the title be changed?


The NoGIL movement definitely needs a better design for removing the GIL. It seems unlikely that users would be willing to accept an XX% performance penalty for releasing the GIL.

On the other hand, in modern times, the cost of inter-process communication (IPC) and data serialization between processes in real-world applications is unlikely to account for a 10% reduction in processing time. If this is the case, maybe the algorithm is inherently serial and parallelization is not possible.

This is totally dependent on a) what the users are doing and b) what X is. Lots of users would gladly make that trade, at the right number.


Have you not seen the extensive discussions between the Faster CPython people and the PEP 703 people? There’s already a lot of debate about how 703 impacts the performance work. Your proposal is way behind where they are on that.

You don’t need to guess, there’s an extensive discussion here. And PEP 703 will be a bunch of extra work for the faster CPython team. So it’s far from no longer mattering.

Fair. But I doubt there would be any appetite for another run at the GIL for quite some time.

It’s not so off-topic that it’s inappropriate, I didn’t mean to give that impression. But it is unlikely to generate anything but speculation. If that’s all you want, then fine.

I still have to see a real-world use case where using threads (plus nogil overhead) is faster than using multiprocessing. I have been following the discussions about nogil, I didn’t see any number, only a fib() algorithm which can be done the same using multiprocessing.

What I’m trying to say is that the “parallelization problem” would still be there in the nogil world. You still have to use sync primitives (queue, locks, etc…). I believe that free threading would give developers false hopes.

Today’s your lucky day, there’s already an example linked in the other thread. He even compares it directly to using multiprocessing.

From what I can see, the people who are looking forward to free threading in Python are well aware of what that entails and how it works, because they’re already going into other languages to accomplish it. The point is that we’d rather not have to do that.


I did see this use case, it can be done the same (using the same code) using multiprocessing.

What about the well-established user base, are they willing to pay a x% nogil tax? The way I see it, they will need x% more CPU cores to offer the same service, i.e., x% cost increase in CPU power. To think that many services operate on a 0.X% net revenue.

Note that I’m not against the free threading, but we have to think about its effects on the current user base.

Hmm, did you see the line where the author says:

Using a multiprocessing.Queue in the same place results in degraded performance (approximately -15%).

I prefer not to comment on the work of others, but theoretically speaking, that is not a parallel problem at all (multiprocessing.Queue is redundant). Just download and write in drive.

The PEP has a section on multiprocessing, and the thread PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) has had numerous people explaining why they would benefit from nogil while they don’t benefit from multiprocessing. If you want to comment constructively on this topic, you have to address those. It sounds like you’re ignoring them, which is not productive. Thank you.


I don’t want to fight with you, but if you actually read the backblaze post it clearly explains why this is not so.

I don’t believe that asking for real-world use case comparison between threads and multiprocessing is nonconstructive.

Nonconstructive is hoping that removing GIL will solve parallelization problems, or thinking that GIL is stopping us from solving parallel problems.

Read Mark Shannon’s comment: PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) - #9 by markshannon

For the purposes of this discussion, let’s categorize parallel application into three groups:

  1. Data-store backed processing. All the shared data is stored in a data store, such as a Postgres database; the processes share little or no data, communicating via the data store.
  2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
  3. General data processing where the shared data is in the form of an in-memory object graph.

Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.

How common is category 3?
All the motivating examples in PEP 703 are in category 2.

Look closely at the performance overhead numbers. Also, we don’t know what the overhead will be in the algorithms that we hope will benefit from GIL removal, because I haven’t seen any numbers yet (if I have missed any, please show me). In other words, would the benefit outweigh the performance overhead?

Frankly speaking, why should I have to pay more for something I don’t use? It’s like paying a toll for a road I never use. If it weren’t for the performance overhead, we wouldn’t be discussing this at all, and the PEP would have been accepted quietly.

If you can demonstrate a GIL removal design without any overhead, I would gladly put ‘NOGIL’ on a t-shirt and wear it proudly all year round.

But if you’re just ignoring the existing examples and comments from people who work on various packages because you don’t believe them, that isn’t a recipe for a productive discussion.