Alternate design for removing the GIL

I did see this use case, it can be done the same (using the same code) using multiprocessing.

What about the well-established user base, are they willing to pay a x% nogil tax? The way I see it, they will need x% more CPU cores to offer the same service, i.e., x% cost increase in CPU power. To think that many services operate on a 0.X% net revenue.

Note that I’m not against the free threading, but we have to think about its effects on the current user base.

Hmm, did you see the line where the author says:

Using a multiprocessing.Queue in the same place results in degraded performance (approximately -15%).

1 Like

I prefer not to comment on the work of others, but theoretically speaking, that is not a parallel problem at all (multiprocessing.Queue is redundant). Just download and write in drive.

The PEP has a section on multiprocessing, and the thread PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) has had numerous people explaining why they would benefit from nogil while they don’t benefit from multiprocessing. If you want to comment constructively on this topic, you have to address those. It sounds like you’re ignoring them, which is not productive. Thank you.

7 Likes

I don’t want to fight with you, but if you actually read the backblaze post it clearly explains why this is not so.

1 Like

I don’t believe that asking for real-world use case comparison between threads and multiprocessing is nonconstructive.

Nonconstructive is hoping that removing GIL will solve parallelization problems, or thinking that GIL is stopping us from solving parallel problems.

Read Mark Shannon comment: PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) - #9 by markshannon

For the purposes of this discussion, let’s categorize parallel application into three groups:

  1. Data-store backed processing. All the shared data is stored in a data store, such a Postgres database, the processes share little or no data, communicating via the data store.
  2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
  3. General data processing where the shared data is in the form of an in-memory object graph.

Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.

How common is category 3?
All the motivating examples in PEP 703 are in category 2.

Look closely at performance overhead numbers. Also, we don’t know what the overhead will be in the algorithms that we hope to benefit from GIL removal, because I haven’t seen any number yet (If I have missed any number, please show me). In other words, would this benefit outnumber the performance overhead.

Frankly speaking, why should I have to pay more for something I don’t use? It’s like paying a toll for a road I never use. If it weren’t for the performance overhead, we wouldn’t be discussing this at all, and the PEP would have been accepted quietly.

If you can demonstrate a GIL removal design without any overhead, I would gladly put ‘NOGIL’ on a t-shirt and wear it proudly all year round.

But if you’re just ignoring the existing examples and comments from people who work on various packages because you don’t believe them. This isn’t a recipe for a productive discussion.

5 Likes

These problems can be addressed in the same way using multiprocessing. I am referring specifically to problems that cannot be solved by multiprocessing alone.

Because there’s only one road, and all CPython users are on it. We have to make decisions and trade offs for everyone, not just you. (Or me: removing the GIL won’t benefit me in my current job.)

Will removing the GIL from CPython cause a slowdown for single threaded users? Almost certainly. Will it be better for the community as a whole? That’s the hard decision.

12 Likes

The PEP author and numerous people on the various discussion threads spent effort into explaining the details of why nogil would work for them and multiprocessing would not work or work less well. You make this sweeping claim but without backing it up in any way. I’m not going to try to argue about this or respond further; just be aware that claims without justification are rarely convincing.

6 Likes

I’m literally asking for help here, I’m not making any claims. I’m trying to implement a “3. General data processing where the shared data is in the form of an in-memory object graph.” problem with real-life usage using free threads and multiprocessing, but IPC is not the bottleneck.

If free threading provides better performance results, I will also embrace it.

Please keep this topic focused on the alternative proposal to PEP 703, not on the merits of removing the GIL as a whole as that would be considered off-topic. If the conversation continues to veer off I will go through and hide all posts not related to the direct merits to this alternative proposal.

14 Likes

it can be done the same (using the same code) using multiprocessing

I am the author of that article. When you try to download one huge object from a cloud at high speeds, the writer thread is busy enough - if you add the multiprocessing queue deserializaiton effort, it won’t be able to submit buffers to write() as fast as when it didn’t have to carry that extra burden. This is my common experience with multiprocessing - when I have a thread that is already trying to do more than a core of work, then trying to move part of it to separate cores ends up costing more on multiprocessing serialization/deserialization than it saves on processing. YMMV, it depends how much you do with a chunk of data relative to the communication effort. If you’d like to discuss that further, lets do it somewhere else. I might still have the multiprocessing branch somewhere, maybe you’ll want to run it yourself or maybe I could take a look at the problem you mentioned.

Now back to the alternative design:

I actually have a project where in order to serve queries, several gigabytes of tree-structured data needs to be loaded into memory. Extracting all of it in a flat format and writing it to an object storage can either be done via multiprocessing (then you are stuck on one process that holds the memory serving queries from other processes) or via threading (then you are stuck on GIL since only one thread can actively take in and serialize the data for sending it out). Ends up reading faster on a single thread and putting just the IO writes in threads. PEP-703 would allow all the threads to traverse the tree and prepare the data for sending.

I wonder how that would work with the discussion-28234 design (the one in this thread). I think slower reference counting for nodes of that huge tree as well as the necessity to steal the tree or node objects between threads could potentially make it slower than when single threaded.

Does in_C need to be Atomic? I think it might only switch from false to true, so perhaps if there is no risk of overwriting/data loss, it could just be a bool.

For me the most fascinating part of the proposed alternative is no ABI shift, but how would that work exactly? I see that you add three fields to all objects, so the structure of the python object changes. Does ABI stay stable because fields are added at the end and none of the fields are modified? If yes, then maybe the same trick can be used in PEP-703 to keep the thread-local reference counter where it used to be and only add the extra stuff at the end? It might not be as memory efficient as reusing the existing field, but not having to recompile every native extension for a while might be worth it.

5 Likes

Hi @Pawel Polewicz, your implementation serves as a perfect example of utilizing free threads (detailed answer in the Summary). Thank you for your time!

Summary

I generally refrain from commenting on other people’s work due to ethical reasons.

The implementation varies depending on the use case, so I’m not in a position to comment on your work (or anyone else’s work).

Theoretically speaking, you can create multiple parallel jobs (threads/multiprocessing) to download and write files, without any inter-process communication (IPC) between these processes. Considering that both downloading and writing operations are blocking and inherently sequential, there is no need to execute them in parallel. Using a multiprocessing.Queue to send data from the “downloader thread” to the “writer thread” is excessive. Writing cannot occur without first downloading the data, and you don’t have to download data at a rate faster than the size of the write buffer.

Implementation in pseudo-code for the job to download a list of URLs in a single process:

for URL in URLs:
    filename = URL + '.data'
    open(filename, 'wb', buffering...) as w:
        r = requests.get(url)
        for chunk in r.iter(...):  # blocking
            w.write(chunk)  # blocks on buffer full

If you spawn this job on as many cores as you have, you will inevitably encounter a bottleneck, whether it is in the drive or the network, depending on which one is slower.

Moreover, you can utilize the multiprocessing.Queue to evenly distribute the load on the processes by sending a list of URLs. For instance, if a process downloads all the given URLs, you can use the Queue to send additional URLs to that process.

The primary reason I chose to return to Python for production purposes is “Faster CPython”, and for research purposes is Codon.

It’s time to dissociate Python from the notion of being ‘slow’. Slowing down performance in a single thread (and by extension in multiprocessing) would, in my opinion, do more harm than good to the Python community.

My concern lies with algorithms that are IPC bound rather than perfectly parallel. Slowing down Python in a single thread inevitably affects its IPC processing as well. If I were to utilize 128 cores, I would need approximately 10% of these cores just to enable effective multithreading. While it is known that threading does not scale well due to hardware limitations, the exact overhead in this scenario is still uncertain.

I have consulted people smarter than me and it looks like if you’d change Object struct in any way, it will break the stable ABI. The native extensions that use old garbage collecting macros will not recognize owner_id of a thread and when atomic_refcnt >= 1 and ob_refcnt == 0, they will deallocate memory.

On the other hand Sam Gross says the stable ABI is not being used by many packages. So maybe it’s not that big of a deal?

If we take out the advantage from this proposal of not breaking ABI, then the other major one is JIT optimizations, but as mentioned by @pf_moore Faster CPython people mostly figured it out already. There is no implementation. There are some concerns about atomic reads on multi-CPU (not multicore) systems similar to what happened with Gilectomy.

I personally like the idea of stealing and JIT specializaton pinned to the thread that owns the object, it’s a pretty cool way of dealing with the guard/run problem Guido wrote about in the PEP703 thread. It also allows for data to be handed off between threads if necessary while nogil doesn’t allow changing the ownership of the object. I think we should think about this again, if nogil is accepted, but for now, I can’t see discussion-28234 as feasible.

1 Like

Ah, sorry, I missed your previous reply.

Same disclaimer about performance still applies: I have no actual numbers backing up any claim I’m making.

Your right, it doesn’t.

The “slower reference counting” I think is ~ the same atomic instructions as what the reference counting that PEP-703 would do for all threads that didn’t create the object. The main difference this design has it that this design can change which thread can use the fast non-atomic reference counting.

If your tree node is read only, then I would expect this design to perform similar or slightly better than PEP-703. The main performance trade off that this design makes is when you have “mostly read only” data. Then the mutating threads have to wait on all running threads to hit a preempt() when they want to write.

Stealing from write access is faster than stealing from read access (as you wait on only one thread hitting a preempt()). Stealing write access from a sleeping thread is ~ equivalent to picking up an uncontended mutex. An object going read only is technically an optimization that could be disabled/enabled at run time (even disabled per object or class).

I believe I’m handle this case correctly.

In particular: When the C extension drops ob_refcnt to 0, it calls _Py_Dealloc. But there is nothing saying that _Py_Dealloc must deallocate the memory. _Py_Dealloc can realizes that the atomic_refcnt is still not 0 yet, and move some of the reference counts to ob_refcnt. _Py_Dealloc then returns without deallocating the object. C can then drop ob_refcnt back to 0 again, repeating the process.
is
If the other edge case (Python runs out of atomic_refcnt) is hit, then the CLock is picked up, and reference counts can be moved back. There are also no concurrent access to the field as since in_C is set, neither C nor Python would touch ob_refcnt without holding CLock.

There could very well be other reasons why the object struct can’t be changed without breaking ABI in subtle ways. I very much not familiar with the code base.

Yeah. I may play around with it for a little bit anyway, cause I think it is nifty/cool. And I’m curious.

Anyway, thank you for spending time reading over it.

Perhaps in a post-PEP-703 future we’d like to find ourselves in a place where implementing stealing is possible without another major hiccup.

The implementation of stealing doesn’t really need to go into 3.12, but maybe it requires some preparation like adding a thief_id: Optional[ThreadID] to all objects?

GC in nogil is stop-the-world, so maybe the ownership of objects could change after a GC run… but then if we have three threads, the first two can both try to steal the same object from the third one. What should happen in this case? Maybe explicit strealing should return a boolean like success = steal(obj), so the caller might need to loop until this succeeds or it could also abort if it finds there is someone else trying to get it.

My point is that you brought up the discussion about stealing objects in a GIL-less world, but it hasn’t really been discussed for nogil itself and it looks like PEP-703 might not support it and not reserve memory to do so in the future without breaking ABI again. Someone at some point should start thinking about it or else we might see a python 5.0.0 release a year after 4.0.0 ships.

In your design steal_requests is a AtomicQueue[Tuple[Thread, object]]. Why? If that queue grows, the object ownership is going to fly around threads like a roulette ball. I don’t think this should be the low-level behavior - user should be able to try to steal it but if someone else attempts it in the same epoch, it should either end up in an error or a no-op. It looks like a no-op (returning False) would be easier to handle.

Do you feel strongly about it being a queue? Why?

Do you think your implementation of stealing could be portable between nogil and 28234?

I don’t think your design is ABI compatible with existing abi3 extensions, simply because it is not possible to add a new field to all objects.
If you add a new field at the end of PyObject, that breaks existing extensions that rely on the compiled value of sizeof(PyObject), e.g. because they allocate a struct starting with a PyObject field (as is common for extension classes).
If you add the new field before the start of PyObject (as happens with the magic gc-tracking-fields), then only objects allocated by Python will have those fields. But there’s also objects directly allocated by extensions (e.g. directly with malloc instead of PyObject_Malloc; or maybe as a static variable). This is currently valid for types without gc tracking. But it means there’s no way to add new fields to every Python object without recompiling extension modules.

1 Like

According to the text in PEP-703, PEP-703 seems to be stricter with how a C extension may allocate an object. By my current reading, PEP-703 makes it possible to add extra fields before the start of any object:

Python objects must be allocated through the standard APIs, such as PyType_GenericNew or PyObject_Malloc.

PEP-703 already stores all the things 28234 wants in the object anyway: a thread owner, an atomic RC field and a non-atomic RC field.

I’m somewhat more worried of PEP-703 leaking it’s current representation into the stable ABI (the same way reference count was). Eg, the PEP has the code: Py_BEGIN_CRITICAL_SECTION(a->ob_mutex); in it. This leaks that a mutex must exist. Sam has a nogil-3.12 repo which does: Py_BEGIN_CRITICAL_SECTION(self);. Better, but the macro is defined as:

#define Py_BEGIN_CRITICAL_SECTION(op) {         \
    struct _Py_critical_section _cs;            \
    _Py_critical_section_begin(&_cs, &_PyObject_CAST(op)->ob_mutex)

which still leaks the existence of the mutex. If _Py_critical_section_begin took in a raw PyObject, then the existence of ob_mutex would no longer being leaked.

Using the same terminology as what I’ve been using: The reason a stop the world GC can safely traverse objects is that a thread will only stop on a preemption point. Preemption points are significantly more frequent than GCs.

I’m not too sure what you are trying to gain by limiting stealing to only when the GC runs.

The pseudo-code I posted above internally does the retrying for you. I would more worried about the OS descheduling the owner thread you are stealing from over internal contention. So maybe a raw timeout would better suit? Idk what use case you are thinking of.

I would imagine the final code would have a return code of “did an exception occur whilst waiting?”.

The code was already very complex, and I thought that putting in an abstract queue gets across what is happening with the stealing parts. Whilst leaving out unnecessary implementation details.

One such implementation would to add 3x extra pointers to the thread struct: steal_head, steal_next and steal_object. A thread can then use a CAS loop to push itself onto the wait list of a different thread (and a CAS loop to pop off the wait list). So no memory allocations. Real world code would also have to deal with sleeping correctly etc too.

I wrote it as a queue because I think it has the best single threaded performance, and is conceptually simple. My main gripe about a queue like that is you can only steal one object per preemption point. Sure preemption points happen often. But it puts a rate limit on stealing.

A different approach would be for each thread to have an epoch. A thread then only owns an object if both it’s thread_id is in the object’s owner_id and the thread’s current epoch is in an object’s owner_epoch field. If you want to steal an object from a thread, you: Do a CAS on the (owner_id,owner_epoch) fields to (self.thread_id, -1). (-1 to say a steal is in progress). Then send over a signal to the thread to increment it’s epoch at the next preemption point. Since the thread’s epoch has changed, all objects are no longer owned by that thread, and it must use a CAS to resteal access. You can then write your epoch into owner_epoch claiming full ownership.

Requiring a thread to re-steal all objects would slow the thread down. But if the thread stops mutating shared state, then there would be no reason to increase it’s epoch. (And hence, the slowdows would stop). And maybe objects which are only accessible from the current thread don’t need re-stealing.

Somewhat unrelated, but another way of reducing stealing is to implement a golang’s channel like object in a C extension. The make_channel function returns two objects: A sender and a receiver. Two different threads can own the two sides of the channel independently. Then internally implement a wait-free ring buffer. As send() is being called, if the passing object is owned by the current thread, then you swap the (owner_id,owner_epoch) fields to be that of the receiver side. This allows for objects to be sent between threads without stealing.

Good question. I’ve not overly thought of it, and probably need to re-read the nogil details again. Short answer: Yes. It is a matter of if performance gains/losses is worth the code complexity. So thinking a bit out loud here. But:

We can rename our steal_write function to steal_exclusive, and steal_read to steal_shared. When some C code has “shared” access, then it must manage the object’s consistency itself. (An easy way of achieving this would be for the C code to not write to the object).

PEP-703’s Py_BEGIN_CRITICAL_SECTION becomes a steal_exclusive. Py_END_CRITICAL_SECTION becomes a preempt(), and resteals if you have nested critical sections.

Then I think things are now compatible. And we can now do things that are a hybrid between PEP-703 & 28234. Picking on a list implementation:

Reading an element from the list requires no stealing at all. (Assuming you do the shenanigans which are in PEP-703).

For __set__ you first call steal_shared(self). (which may give you exclusive access anyway. (In fact, this is guaranteed for objects which have never been handed to another thread)). If you get “exclusive” access, then you do a non-atomic swap. But if you get “shared” access, you use an atomic_exchange.

But now something like the append function can do a steal_exclusive, and not have to worry about concurrent mutations whilst resizing. Also a clone function could also call steal_exclusive to implement atomic cloning.

Ah, no, you are right. I completely forgot about static variables. :frowning:

Requiring restealing (in your design) would make stealing an O(N) operation as the number of operations required to complete the whole thing increases with the amount of objects owned by the thread we are stealing from. This would be a big problem for a thread that allocates a lot of objects that other threads take over and perform calculations on as the shared state may never end up being mutated.

In your design of stealing, when you write (self.thread_id, -1) to an object… lets say the old thread-local refcount is 5 and shared refcount is 10. The old thread can add the thread-local refcount to the shared refcount and zero out the thread-local one, but when the stealing thread gets to write in its current epoch, it should update the thread-local refcount, but how does it know how many references to that object it’s holding? At this point local is showing 0 and shared is showing 15, 5 of which have been owned by the old owner a millisecond ago, but the other 10 can be owned by anyone.