Alternate design for removing the GIL

These problems can be addressed in the same way using multiprocessing. I am referring specifically to problems that cannot be solved by multiprocessing alone.

Because there’s only one road, and all CPython users are on it. We have to make decisions and trade-offs for everyone, not just you. (Or me: removing the GIL won’t benefit me in my current job.)

Will removing the GIL from CPython cause a slowdown for single-threaded users? Almost certainly. Will it be better for the community as a whole? That’s the hard decision.

12 Likes

The PEP author and numerous people on the various discussion threads put effort into explaining in detail why nogil would work for them and multiprocessing would not work, or would work less well. You make this sweeping claim without backing it up in any way. I’m not going to try to argue about this or respond further; just be aware that claims without justification are rarely convincing.

6 Likes

I’m literally asking for help here; I’m not making any claims. I’m trying to implement a real-life instance of “3. General data processing where the shared data is in the form of an in-memory object graph” using both free threads and multiprocessing, where IPC is not the bottleneck.

If free threading provides better performance results, I will also embrace it.

Please keep this topic focused on the alternative proposal to PEP 703, not on the merits of removing the GIL as a whole, as that would be considered off-topic. If the conversation continues to veer off, I will go through and hide all posts not related to the direct merits of this alternative proposal.

14 Likes

it can be done the same (using the same code) using multiprocessing

I am the author of that article. When you try to download one huge object from a cloud at high speed, the writer thread is busy enough - if you add the multiprocessing queue deserialization effort, it won’t be able to submit buffers to write() as fast as when it didn’t have to carry that extra burden. This is my common experience with multiprocessing - when I have a thread that is already trying to do more than a core’s worth of work, trying to move part of it to separate cores ends up costing more in multiprocessing serialization/deserialization than it saves on processing. YMMV; it depends on how much you do with a chunk of data relative to the communication effort. If you’d like to discuss that further, let’s do it somewhere else. I might still have the multiprocessing branch somewhere; maybe you’ll want to run it yourself, or maybe I could take a look at the problem you mentioned.

Now back to the alternative design:

I actually have a project where, in order to serve queries, several gigabytes of tree-structured data need to be loaded into memory. Extracting all of it in a flat format and writing it to object storage can be done either via multiprocessing (then you are stuck with one process that holds the memory, serving queries from the other processes) or via threading (then you are stuck on the GIL, since only one thread can actively take in and serialize the data for sending). It ends up being faster to read on a single thread and put just the I/O writes in threads. PEP-703 would allow all the threads to traverse the tree and prepare the data for sending.

I wonder how that would work with the discussion-28234 design (the one in this thread). I think slower reference counting for the nodes of that huge tree, as well as the need to steal the tree or node objects between threads, could potentially make it slower than the single-threaded version.

Does in_C need to be Atomic? I think it might only switch from false to true, so perhaps if there is no risk of overwriting/data loss, it could just be a bool.

For me the most fascinating part of the proposed alternative is no ABI shift, but how would that work exactly? I see that you add three fields to all objects, so the structure of the python object changes. Does ABI stay stable because fields are added at the end and none of the fields are modified? If yes, then maybe the same trick can be used in PEP-703 to keep the thread-local reference counter where it used to be and only add the extra stuff at the end? It might not be as memory efficient as reusing the existing field, but not having to recompile every native extension for a while might be worth it.

5 Likes

Hi @Pawel Polewicz, your implementation serves as a perfect example of utilizing free threads (detailed answer in the Summary). Thank you for your time!

Summary

I generally refrain from commenting on other people’s work due to ethical reasons.

The implementation varies depending on the use case, so I’m not in a position to comment on your work (or anyone else’s work).

Theoretically speaking, you can create multiple parallel jobs (threads/multiprocessing) to download and write files, without any inter-process communication (IPC) between these processes. Considering that both downloading and writing are blocking and inherently sequential operations, there is no need to execute them in parallel. Using a multiprocessing.Queue to send data from the “downloader thread” to the “writer thread” is excessive. Writing cannot occur without first downloading the data, and you don’t have to download data faster than the write buffer can absorb it.

Implementation in pseudo-code for the job to download a list of URLs in a single process:

for URL in URLs:
    filename = URL + '.data'
    with open(filename, 'wb', buffering=...) as w:
        r = requests.get(URL, stream=True)
        for chunk in r.iter_content(...):  # blocking
            w.write(chunk)  # blocks on buffer full

If you spawn this job on as many cores as you have, you will inevitably encounter a bottleneck, whether it is in the drive or the network, depending on which one is slower.

Moreover, you can use a multiprocessing.Queue to distribute the load evenly across processes by sending lists of URLs. For instance, once a process has finished downloading all the URLs it was given, you can use the Queue to send it additional URLs.

The primary reason I chose to return to Python for production purposes is “Faster CPython”, and for research purposes is Codon.

It’s time to dissociate Python from the notion of being ‘slow’. Slowing down single-threaded performance (and, by extension, multiprocessing) would, in my opinion, do more harm than good to the Python community.

My concern lies with algorithms that are IPC bound rather than perfectly parallel. Slowing down Python in a single thread inevitably affects its IPC processing as well. If I were to utilize 128 cores, I would need approximately 10% of these cores just to enable effective multithreading. While it is known that threading does not scale well due to hardware limitations, the exact overhead in this scenario is still uncertain.

I have consulted people smarter than me, and it looks like changing the object struct in any way will break the stable ABI. Native extensions that use the old reference-counting macros will not know about a thread’s owner_id, and when atomic_refcnt >= 1 but ob_refcnt == 0, they will deallocate the memory anyway.
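For illustration, the failure mode is roughly this (a sketch only; atomic_refcnt and owner_id are fields from this proposal, not from today’s CPython): an extension built against the old stable ABI decrements only ob_refcnt and tears the object down the moment that hits zero.

/* Roughly what an already-compiled stable-ABI extension does on Py_DECREF. */
static inline void old_abi_decref(PyObject *op)
{
    if (--op->ob_refcnt == 0) {
        /* It knows nothing about the proposal's atomic_refcnt or owner_id
         * fields, so even if atomic_refcnt >= 1 the object is freed here. */
        _Py_Dealloc(op);
    }
}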

On the other hand Sam Gross says the stable ABI is not being used by many packages. So maybe it’s not that big of a deal?

If we take away this proposal’s advantage of not breaking the ABI, the other major one is JIT optimizations, but as @pf_moore mentioned, the Faster CPython people have mostly figured that out already. There is no implementation. There are also concerns about atomic reads on multi-CPU (as opposed to merely multicore) systems, similar to what happened with the Gilectomy.

I personally like the idea of stealing and of JIT specialization pinned to the thread that owns the object; it’s a pretty cool way of dealing with the guard/run problem Guido wrote about in the PEP-703 thread. It also allows data to be handed off between threads if necessary, while nogil doesn’t allow changing the ownership of an object. I think we should revisit this if nogil is accepted, but for now I can’t see discussion-28234 as feasible.

1 Like

Ah, sorry, I missed your previous reply.

Same disclaimer about performance still applies: I have no actual numbers backing up any claim I’m making.

You’re right, it doesn’t.

The “slower reference counting” is, I think, roughly the same atomic instructions as the reference counting PEP-703 would do for all threads that didn’t create the object. The main difference is that this design can change which thread gets to use the fast non-atomic reference counting.

If your tree nodes are read-only, then I would expect this design to perform similarly to, or slightly better than, PEP-703. The main performance trade-off this design makes is with “mostly read-only” data: the mutating threads have to wait for all running threads to hit a preempt() when they want to write.

Stealing from write access is faster than stealing from read access (as you wait on only one thread hitting a preempt()). Stealing write access from a sleeping thread is roughly equivalent to picking up an uncontended mutex. An object going read-only is technically an optimization that could be disabled/enabled at run time (even disabled per object or class).

I believe I handle this case correctly.

In particular: when the C extension drops ob_refcnt to 0, it calls _Py_Dealloc. But there is nothing saying that _Py_Dealloc must deallocate the memory. _Py_Dealloc can notice that atomic_refcnt is not 0 yet, move some of the reference counts over to ob_refcnt, and return without deallocating the object. C can then drop ob_refcnt back to 0 again, repeating the process.

If the other edge case (Python runs out of atomic_refcnt) is hit, then the CLock is picked up and reference counts can be moved back. There is also no concurrent access to the field, since with in_C set, neither C nor Python touches ob_refcnt without holding CLock.
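A minimal sketch of that idea, using the field and lock names from this proposal (atomic_refcnt, in_C, CLock); none of this is how CPython’s _Py_Dealloc works today:

void _Py_Dealloc(PyObject *op)
{
    /* Sketch: rebalance instead of freeing when other threads still
     * hold references through the proposal's atomic counter. */
    Py_ssize_t shared = atomic_exchange(&op->atomic_refcnt, 0);
    if (shared > 0) {
        /* in_C is set, so both sides only touch ob_refcnt under CLock. */
        op->ob_refcnt += shared;         /* hand the counts back to the C side */
        return;                          /* object stays alive */
    }
    (*Py_TYPE(op)->tp_dealloc)(op);      /* really dead: free as usual */
}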

There could very well be other reasons why the object struct can’t be changed without breaking the ABI in subtle ways. I’m very much not familiar with the code base.

Yeah. I may play around with it for a little bit anyway, because I think it is nifty/cool. And I’m curious.

Anyway, thank you for spending time reading over it.

Perhaps in a post-PEP-703 future we’d like to find ourselves in a place where implementing stealing is possible without another major hiccup.

The implementation of stealing doesn’t really need to go into 3.12, but maybe it requires some preparation like adding a thief_id: Optional[ThreadID] to all objects?

GC in nogil is stop-the-world, so maybe the ownership of objects could change after a GC run… but then if we have three threads, the first two can both try to steal the same object from the third one. What should happen in this case? Maybe explicit stealing should return a boolean like success = steal(obj), so the caller might need to loop until this succeeds, or it could abort if it finds someone else is trying to get it.

My point is that you brought up the discussion about stealing objects in a GIL-less world, but it hasn’t really been discussed for nogil itself, and it looks like PEP-703 might not support it and might not reserve memory to add it in the future without breaking the ABI again. Someone at some point should start thinking about it, or else we might see a Python 5.0.0 release a year after 4.0.0 ships.

In your design steal_requests is an AtomicQueue[Tuple[Thread, object]]. Why? If that queue grows, the object ownership is going to fly around threads like a roulette ball. I don’t think this should be the low-level behavior - the user should be able to try to steal it, but if someone else attempts it in the same epoch, it should either end up in an error or a no-op. It looks like a no-op (returning False) would be easier to handle.

Do you feel strongly about it being a queue? Why?

Do you think your implementation of stealing could be portable between nogil and 28234?

I don’t think your design is ABI compatible with existing abi3 extensions, simply because it is not possible to add a new field to all objects.
If you add a new field at the end of PyObject, that breaks existing extensions that rely on the compiled value of sizeof(PyObject), e.g. because they allocate a struct starting with a PyObject field (as is common for extension classes).
If you add the new field before the start of PyObject (as happens with the magic gc-tracking fields), then only objects allocated by Python will have those fields. But there are also objects directly allocated by extensions (e.g. directly with malloc instead of PyObject_Malloc, or maybe as a static variable). This is currently valid for types without gc tracking. But it means there’s no way to add new fields to every Python object without recompiling extension modules.
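As a concrete (hypothetical) example, both of the following common extension patterns bake the object layout into the compiled module:

/* Extension class: the compiled struct size and the offset of `value`
 * both depend on sizeof(PyObject) at build time. */
typedef struct {
    PyObject ob_base;
    double value;
} MyExtObject;

/* Statically allocated object: it never goes through PyObject_Malloc,
 * so the runtime gets no chance to prepend hidden per-object fields. */
static PyTypeObject MyExt_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "myext.MyExt",
    .tp_basicsize = sizeof(MyExtObject),
};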

1 Like

According to its text, PEP-703 seems to be stricter about how a C extension may allocate an object. By my current reading, PEP-703 makes it possible to add extra fields before the start of any object:

Python objects must be allocated through the standard APIs, such as PyType_GenericNew or PyObject_Malloc.

PEP-703 already stores all the things 28234 wants in the object anyway: a thread owner, an atomic RC field and a non-atomic RC field.

I’m somewhat more worried about PEP-703 leaking its current representation into the stable ABI (the same way the reference count was leaked). E.g., the PEP has the code Py_BEGIN_CRITICAL_SECTION(a->ob_mutex); in it. This leaks that a mutex must exist. Sam has a nogil-3.12 repo which does Py_BEGIN_CRITICAL_SECTION(self);. Better, but the macro is defined as:

#define Py_BEGIN_CRITICAL_SECTION(op) {         \
    struct _Py_critical_section _cs;            \
    _Py_critical_section_begin(&_cs, &_PyObject_CAST(op)->ob_mutex)

which still leaks the existence of the mutex. If _Py_critical_section_begin took in a raw PyObject, then the existence of ob_mutex would no longer be leaked.
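A sketch of what that could look like (hypothetical, not what the nogil branch does today), so only the runtime needs to know where the mutex lives:

#define Py_BEGIN_CRITICAL_SECTION(op) {         \
    struct _Py_critical_section _cs;            \
    _Py_critical_section_begin(&_cs, _PyObject_CAST(op))
/* _Py_critical_section_begin() would then find ob_mutex internally,
 * keeping the field out of the ABI visible to extensions. */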

Using the same terminology as I’ve been using: the reason a stop-the-world GC can safely traverse objects is that a thread will only be stopped at a preemption point. Preemption points are significantly more frequent than GC runs.

I’m not too sure what you are trying to gain by limiting stealing to only when the GC runs.

The pseudo-code I posted above internally does the retrying for you. I would be more worried about the OS descheduling the owner thread you are stealing from than about internal contention. So maybe a raw timeout would suit better? I don’t know what use case you are thinking of.

I would imagine the final code would have a return code of “did an exception occur whilst waiting?”.

The code was already very complex, and I thought that putting in an abstract queue gets across what is happening with the stealing parts, whilst leaving out unnecessary implementation details.

One such implementation would be to add three extra pointers to the thread struct: steal_head, steal_next and steal_object. A thread can then use a CAS loop to push itself onto the wait list of a different thread (and a CAS loop to pop off the wait list). So no memory allocations. Real-world code would also have to deal with sleeping correctly, etc.
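A rough sketch of that push (C11 atomics; the thread struct and field names are hypothetical, taken from the description above):

#include <stdatomic.h>

typedef struct ThreadStateX {
    _Atomic(struct ThreadStateX *) steal_head;   /* threads waiting on us */
    struct ThreadStateX *steal_next;             /* intrusive list link */
    PyObject *steal_object;                      /* object we want to steal */
} ThreadStateX;

/* Push ourselves onto victim's wait list without any allocation. */
static void request_steal(ThreadStateX *self, ThreadStateX *victim,
                          PyObject *obj)
{
    self->steal_object = obj;
    ThreadStateX *head = atomic_load(&victim->steal_head);
    do {
        self->steal_next = head;
    } while (!atomic_compare_exchange_weak(&victim->steal_head, &head, self));
    /* victim drains this list at its next preemption point */
}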

I wrote it as a queue because I think it has the best single-threaded performance, and it is conceptually simple. My main gripe about a queue like that is you can only steal one object per preemption point. Sure, preemption points happen often, but it puts a rate limit on stealing.

A different approach would be for each thread to have an epoch. A thread then only owns an object if both its thread_id matches the object’s owner_id and its current epoch matches the object’s owner_epoch field. If you want to steal an object from a thread, you do a CAS on the (owner_id, owner_epoch) fields to (self.thread_id, -1) (-1 to say a steal is in progress), then send a signal to the thread to increment its epoch at the next preemption point. Since the thread’s epoch has changed, all objects are no longer owned by that thread, and it must use a CAS to re-steal access. You can then write your epoch into owner_epoch, claiming full ownership.

Requiring a thread to re-steal all objects would slow the thread down. But if the thread stops mutating shared state, then there would be no reason to increase its epoch (and hence the slowdowns would stop). And maybe objects which are only accessible from the current thread don’t need re-stealing.
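In code, the steal itself could look roughly like this (a sketch; the owner word packs the hypothetical owner_id/owner_epoch fields, and signal_epoch_bump is a made-up helper):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack (owner_id, owner_epoch) into one word so a single CAS claims it.
 * Epoch UINT32_MAX plays the role of -1: "steal in progress". */
#define PACK_OWNER(tid, epoch)  (((uint64_t)(tid) << 32) | (uint32_t)(epoch))

void signal_epoch_bump(uint32_t tid);   /* hypothetical: victim bumps its
                                           epoch at its next preempt() */

static bool try_steal(_Atomic uint64_t *owner, uint32_t victim_tid,
                      uint32_t victim_epoch, uint32_t my_tid)
{
    uint64_t expected = PACK_OWNER(victim_tid, victim_epoch);
    if (!atomic_compare_exchange_strong(owner, &expected,
                                        PACK_OWNER(my_tid, UINT32_MAX)))
        return false;               /* another thief got there first */
    signal_epoch_bump(victim_tid);  /* invalidates all its fast ownership */
    return true;                    /* later: write our real epoch in */
}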

Somewhat unrelated, but another way of reducing stealing is to implement a Go-channel-like object in a C extension. The make_channel function returns two objects: a sender and a receiver. Two different threads can own the two sides of the channel independently. Then internally implement a wait-free ring buffer. As send() is called, if the object being passed is owned by the current thread, then you swap its (owner_id, owner_epoch) fields to be those of the receiver side. This allows objects to be sent between threads without stealing.

Good question. I’ve not thought about it much, and probably need to re-read the nogil details again. Short answer: yes. It is a matter of whether the performance gains/losses are worth the code complexity. So, thinking a bit out loud here:

We can rename our steal_write function to steal_exclusive, and steal_read to steal_shared. When some C code has “shared” access, then it must manage the object’s consistency itself. (An easy way of achieving this would be for the C code to not write to the object).

PEP-703’s Py_BEGIN_CRITICAL_SECTION becomes a steal_exclusive. Py_END_CRITICAL_SECTION becomes a preempt(), and resteals if you have nested critical sections.

Then I think things are now compatible. And we can now do things that are a hybrid between PEP-703 & 28234. Picking on a list implementation:

Reading an element from the list requires no stealing at all. (Assuming you do the shenanigans which are in PEP-703).

For __set__ you first call steal_shared(self), which may give you exclusive access anyway (in fact, this is guaranteed for objects which have never been handed to another thread). If you get “exclusive” access, then you do a non-atomic swap. But if you get “shared” access, you use an atomic_exchange.

But now something like the append function can do a steal_exclusive and not have to worry about concurrent mutations whilst resizing. A clone function could also call steal_exclusive to implement atomic cloning.
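Very roughly, that hybrid store could look like this (a sketch; steal_shared/STEAL_EXCLUSIVE are the hypothetical primitives and return values from this thread, and `v` is assumed to be a reference handed over to the list):

static void list_set_item(PyListObject *self, Py_ssize_t i, PyObject *v)
{
    PyObject *old;
    if (steal_shared((PyObject *)self) == STEAL_EXCLUSIVE) {
        /* No other thread can touch the list right now: plain store. */
        old = self->ob_item[i];
        self->ob_item[i] = v;
    }
    else {
        /* Other threads may be reading concurrently: atomic swap. */
        old = atomic_exchange((_Atomic(PyObject *) *)&self->ob_item[i], v);
    }
    Py_XDECREF(old);
}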

Ah, no, you are right. I completely forgot about static variables. :frowning:

Requiring re-stealing (in your design) would make stealing an O(N) operation, as the number of operations required to complete the whole thing increases with the number of objects owned by the thread we are stealing from. This would be a big problem for a thread that allocates a lot of objects that other threads take over and perform calculations on, as the shared state may never end up being mutated.

In your design of stealing, when you write (self.thread_id, -1) to an object… let’s say the old thread-local refcount is 5 and the shared refcount is 10. The old thread can add the thread-local refcount to the shared refcount and zero out the thread-local one, but when the stealing thread gets to write in its current epoch, it should update the thread-local refcount - but how does it know how many references to that object it is holding? At this point the local counter shows 0 and the shared one shows 15, 5 of which were held by the old owner a millisecond ago, but the other 10 could be held by anyone.

For cases where I would require re-stealing, the equivalent code in PEP-703 would require a mutex lock(). For the uncontended path, they are both a CAS. This design would be faster if it doesn’t have to (re)steal, but slower if the object is contended. Algorithmic complexity is the same.

The whole point of using a per-thread epoch is that you only have to wait for the thread to hit a preempt() once. After that, all of its objects are available for any thread to take with a CAS.

Reference counting and the thread epoch are independent of each other. When an object is stolen, the two counters are left as-is. The reference counters don’t have an individual meaning; it’s more like “this one is faster, use it if you can”. This reference counting setup does lead to a bit of a weird situation though:

  • A thread obtains an object it doesn’t own
  • It saves it to a local register (doing an atomic reference count increase)
  • It steals mutual access, becoming the owner
  • < Object is processed in some way >
  • It erases the object from the local register (doing a non-atomic reference count decrease)

The atomic reference count is being incremented, and the non-atomic one is being decremented. I solve this by:

  • detecting if the reference count is < 0 (<1 for non-atomic RC)
  • stealing exclusive access (so the non-atomic RC can be changed by us)
  • If there are no counts left, then dealloc.
  • otherwise re-balance the two counters

My understanding of PEP-703 is that it can also run into this issue (though it is harder to trigger). Its solution is to mark the object as never using the non-atomic RC after this is detected, merging the two fields into one (with an edge case that it leaves to the tracing GC to fix).

There is a bit of ambiguity during the time an object is being stolen, in that after a different thread sets (self.thread_id, -1), you can no longer tell whether or not you own the object. But even though you may not know for sure that you own the object, your ownership still lasts until your next preemption point. The rules for reference counting are:

  1. If you don’t own the object, then you must use the atomic reference count.
  2. If you do own the object, then you may use either atomic or non-atomic reference count.

So the ambiguity doesn’t affect correctness.

1 Like

If my algorithm requires multiple objects to change under a lock, then the stealing logic just slows me down.

I lock, then stealing locks multiple times, then I unlock.

Is that right?

I don’t have any code or numbers backing me up, so everything I’m saying is entirely theoretical. And to be clear, we are talking about the performance of C extensions (not Python code).

I’ve been calling it steal/preempt rather than lock/unlock because they behave somewhat differently. Once you have done a steal() to get access, then you keep your access until a different thread steals from you (which will always happen at a preemption point). Whereas, with a classical mutex, after you lock(), then you have access until you unlock(). The stealing design somewhat matches what the GIL currently does. “Everything is static until you call something that drops the GIL”.

Compared to a mutex, I think that stealing is faster if you were the last thread to access the object, and slower if you weren’t. This is nice for the “thread mutating local state” case, as you would always hit this fast path.

To prevent deadlocks, stealing itself is a preemption point. In my first post, I did sketch out a way of stealing multiple objects at the same time without deadlocking or livelocking. In summary:

  • Fast path: loop over all objects. If you already have access to every object, then you are done
  • Sort the object list by memory address
  • preempt()
  • Steal each object in turn, disallowing other threads from stealing back

The fast path is faster. The slow path is slower. Whether or not this performs better or worse depends on how contended your locks are.
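A sketch of that multi-object steal, assuming the hypothetical owns()/steal_exclusive()/preempt() primitives from this thread:

#include <stdint.h>
#include <stdlib.h>

static int cmp_by_address(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(PyObject *const *)a;
    uintptr_t y = (uintptr_t)*(PyObject *const *)b;
    return (x > y) - (x < y);
}

static void steal_all(PyObject **objs, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)                        /* fast path */
        if (!owns(objs[i]))
            break;
    if (i == n)
        return;                                    /* we already have everything */
    qsort(objs, n, sizeof *objs, cmp_by_address);  /* force a global order */
    preempt();                                     /* stealing is itself a preemption point */
    for (i = 0; i < n; i++)
        steal_exclusive(objs[i]);                  /* and block steal-back until done */
}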

1 Like

That does work with classical locking and deadlock-avoidance methodology.
Classically, if you need two locks, A and B, a deadlock happens if
thread-1 locks A then B while thread-2 locks B then A.

With stealing, how do you prevent the reversal of A and B?

If you are preempting after each steal, won’t my threads never get both A and B?

Hang on, I’m confused here. What does “preempt” mean, and what is a preemption point? Normally, that’s one of the key differences between “preemptive” and “cooperative” systems - in a preemptive system, any thread/process can take control at any time, but with cooperative ones, it only happens at well-defined yield points. (Compare multiprocessing with asyncio/await for a clear demonstration of that difference.) So what precisely is a preemption point, and does it relinquish access or is access taken from you?

There is a multi-steal primitive (what I was describing in my last post) which you can use to steal two objects. Internally it would sort the objects by address to force a global order. For two-object stealing, sorting is trivial and fast.

It looks like this is semantically equivalent to what is done in PEP-703. (If I had realized that earlier, I would have worded my first post in terms of PEP-703.) In PEP-703, there are both one-object and two-object lock primitives, and picking up a lock may temporarily drop other locks held by the current thread. It also has some nice macros to conveniently do the re-picking-up of locks for you. See PEP 703 – Making the Global Interpreter Lock Optional in CPython | peps.python.org
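For reference, my reading of the PEP’s two-object form is roughly (the exact macro spelling may differ):

/* Lock two objects at once; the runtime orders the acquisitions and may
 * temporarily suspend this section while other locks are being taken. */
Py_BEGIN_CRITICAL_SECTION2(a, b);
/* ... mutate both objects while both are locked ... */
Py_END_CRITICAL_SECTION2();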

Ah, I think you are right. I’ve abused terminology here :frowning:.

It is a cooperative system. Only at fixed points in execution may an object be stolen from you. Such a point is semantically equivalent to dropping all held locks in a mutex system. It behaves like dropping and picking up the GIL in current CPython. (And for a thread which happens to not access global state, it becomes a no-op.)

I’m not overly sure what terminology should be used here then. “Yield” is more giving up execution time. “Await” is more waiting for something to complete. “Unlock” implies you are putting the object back into a neutral state. Any better ideas?

I’d go with “yield”, even though you aren’t necessarily giving up execution time, since another thread that wants your locks is waiting for you to relinquish them. You have to yield to the other thread in order to get anywhere.

But, this sounds like a fairly big problem. If you don’t have a single global lock, how many locks are you going to have to release and reacquire every time you hit one of these points? And presumably these points will need to be fairly frequent. That’s a lot of locking overhead.