I still have to see a real-world use case where using threads (plus nogil overhead) is faster than using multiprocessing. I have been following the discussions about nogil, I didn’t see any number, only a fib() algorithm which can be done the same using multiprocessing.
What I’m trying to say is that the “parallelization problem” would still be there in the nogil world. You still have to use sync primitives (queue, locks, etc…). I believe that free threading would give developers false hopes.
From what I can see, the people who are looking forward for free threading in python are well aware of what that entails and how it works, because they’re already going into other languages to accomplish it. The point is that we’d rather not have to do that.
I did see this use case, it can be done the same (using the same code) using multiprocessing.
What about the well-established user base, are they willing to pay a x% nogil tax? The way I see it, they will need x% more CPU cores to offer the same service, i.e., x% cost increase in CPU power. To think that many services operate on a 0.X% net revenue.
Note that I’m not against the free threading, but we have to think about its effects on the current user base.
The PEP has a section on multiprocessing, and the thread PEP 703: Making the Global Interpreter Lock Optional (3.12 updates) has had numerous people explaining why they would benefit from nogil while they don’t benefit from multiprocessing. If you want to comment constructively on this topic, you have to address those. It sounds like you’re ignoring them, which is not productive. Thank you.
For the purposes of this discussion, let’s categorize parallel application into three groups:
Data-store backed processing. All the shared data is stored in a data store, such a Postgres database, the processes share little or no data, communicating via the data store.
Numerical processing (including machine learning) where the shared data is matrices of numbers.
General data processing where the shared data is in the form of an in-memory object graph.
Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.
How common is category 3?
All the motivating examples in PEP 703 are in category 2.
Look closely at performance overhead numbers. Also, we don’t know what the overhead will be in the algorithms that we hope to benefit from GIL removal, because I haven’t seen any number yet (If I have missed any number, please show me). In other words, would this benefit outnumber the performance overhead.
Frankly speaking, why should I have to pay more for something I don’t use? It’s like paying a toll for a road I never use. If it weren’t for the performance overhead, we wouldn’t be discussing this at all, and the PEP would have been accepted quietly.
If you can demonstrate a GIL removal design without any overhead, I would gladly put ‘NOGIL’ on a t-shirt and wear it proudly all year round.
The PEP author and numerous people on the various discussion threads spent effort into explaining the details of why nogil would work for them and multiprocessing would not work or work less well. You make this sweeping claim but without backing it up in any way. I’m not going to try to argue about this or respond further; just be aware that claims without justification are rarely convincing.
I’m literally asking for help here, I’m not making any claims. I’m trying to implement a “3. General data processing where the shared data is in the form of an in-memory object graph.” problem with real-life usage using free threads and multiprocessing, but IPC is not the bottleneck.
If free threading provides better performance results, I will also embrace it.
Please keep this topic focused on the alternative proposal to PEP 703, not on the merits of removing the GIL as a whole as that would be considered off-topic. If the conversation continues to veer off I will go through and hide all posts not related to the direct merits to this alternative proposal.
it can be done the same (using the same code) using multiprocessing
I am the author of that article. When you try to download one huge object from a cloud at high speeds, the writer thread is busy enough - if you add the multiprocessing queue deserializaiton effort, it won’t be able to submit buffers to write() as fast as when it didn’t have to carry that extra burden. This is my common experience with multiprocessing - when I have a thread that is already trying to do more than a core of work, then trying to move part of it to separate cores ends up costing more on multiprocessing serialization/deserialization than it saves on processing. YMMV, it depends how much you do with a chunk of data relative to the communication effort. If you’d like to discuss that further, lets do it somewhere else. I might still have the multiprocessing branch somewhere, maybe you’ll want to run it yourself or maybe I could take a look at the problem you mentioned.
Now back to the alternative design:
I actually have a project where in order to serve queries, several gigabytes of tree-structured data needs to be loaded into memory. Extracting all of it in a flat format and writing it to an object storage can either be done via multiprocessing (then you are stuck on one process that holds the memory serving queries from other processes) or via threading (then you are stuck on GIL since only one thread can actively take in and serialize the data for sending it out). Ends up reading faster on a single thread and putting just the IO writes in threads. PEP-703 would allow all the threads to traverse the tree and prepare the data for sending.
I wonder how that would work with the discussion-28234 design (the one in this thread). I think slower reference counting for nodes of that huge tree as well as the necessity to steal the tree or node objects between threads could potentially make it slower than when single threaded.
Does in_C need to be Atomic? I think it might only switch from false to true, so perhaps if there is no risk of overwriting/data loss, it could just be a bool.
For me the most fascinating part of the proposed alternative is no ABI shift, but how would that work exactly? I see that you add three fields to all objects, so the structure of the python object changes. Does ABI stay stable because fields are added at the end and none of the fields are modified? If yes, then maybe the same trick can be used in PEP-703 to keep the thread-local reference counter where it used to be and only add the extra stuff at the end? It might not be as memory efficient as reusing the existing field, but not having to recompile every native extension for a while might be worth it.
Hi @Pawel Polewicz, your implementation serves as a perfect example of utilizing free threads (detailed answer in the Summary). Thank you for your time!
I generally refrain from commenting on other people’s work due to ethical reasons.
The implementation varies depending on the use case, so I’m not in a position to comment on your work (or anyone else’s work).
Theoretically speaking, you can create multiple parallel jobs (threads/multiprocessing) to download and write files, without any inter-process communication (IPC) between these processes. Considering that both downloading and writing operations are blocking and inherently sequential, there is no need to execute them in parallel. Using a multiprocessing.Queue to send data from the “downloader thread” to the “writer thread” is excessive. Writing cannot occur without first downloading the data, and you don’t have to download data at a rate faster than the size of the write buffer.
Implementation in pseudo-code for the job to download a list of URLs in a single process:
for URL in URLs:
filename = URL + '.data'
open(filename, 'wb', buffering...) as w:
r = requests.get(url)
for chunk in r.iter(...): # blocking
w.write(chunk) # blocks on buffer full
If you spawn this job on as many cores as you have, you will inevitably encounter a bottleneck, whether it is in the drive or the network, depending on which one is slower.
Moreover, you can utilize the multiprocessing.Queue to evenly distribute the load on the processes by sending a list of URLs. For instance, if a process downloads all the given URLs, you can use the Queue to send additional URLs to that process.
The primary reason I chose to return to Python for production purposes is “Faster CPython”, and for research purposes is Codon.
It’s time to dissociate Python from the notion of being ‘slow’. Slowing down performance in a single thread (and by extension in multiprocessing) would, in my opinion, do more harm than good to the Python community.
My concern lies with algorithms that are IPC bound rather than perfectly parallel. Slowing down Python in a single thread inevitably affects its IPC processing as well. If I were to utilize 128 cores, I would need approximately 10% of these cores just to enable effective multithreading. While it is known that threading does not scale well due to hardware limitations, the exact overhead in this scenario is still uncertain.
I have consulted people smarter than me and it looks like if you’d change Object struct in any way, it will break the stable ABI. The native extensions that use old garbage collecting macros will not recognize owner_id of a thread and when atomic_refcnt >= 1 and ob_refcnt == 0, they will deallocate memory.
On the other hand Sam Gross says the stable ABI is not being used by many packages. So maybe it’s not that big of a deal?
If we take out the advantage from this proposal of not breaking ABI, then the other major one is JIT optimizations, but as mentioned by @pf_moore Faster CPython people mostly figured it out already. There is no implementation. There are some concerns about atomic reads on multi-CPU (not multicore) systems similar to what happened with Gilectomy.
I personally like the idea of stealing and JIT specializaton pinned to the thread that owns the object, it’s a pretty cool way of dealing with the guard/run problem Guido wrote about in the PEP703 thread. It also allows for data to be handed off between threads if necessary while nogil doesn’t allow changing the ownership of the object. I think we should think about this again, if nogil is accepted, but for now, I can’t see discussion-28234 as feasible.
Same disclaimer about performance still applies: I have no actual numbers backing up any claim I’m making.
Your right, it doesn’t.
The “slower reference counting” I think is ~ the same atomic instructions as what the reference counting that PEP-703 would do for all threads that didn’t create the object. The main difference this design has it that this design can change which thread can use the fast non-atomic reference counting.
If your tree node is read only, then I would expect this design to perform similar or slightly better than PEP-703. The main performance trade off that this design makes is when you have “mostly read only” data. Then the mutating threads have to wait on all running threads to hit a preempt() when they want to write.
Stealing from write access is faster than stealing from read access (as you wait on only one thread hitting a preempt()). Stealing write access from a sleeping thread is ~ equivalent to picking up an uncontended mutex. An object going read only is technically an optimization that could be disabled/enabled at run time (even disabled per object or class).
I believe I’m handle this case correctly.
In particular: When the C extension drops ob_refcnt to 0, it calls _Py_Dealloc. But there is nothing saying that _Py_Dealloc must deallocate the memory. _Py_Dealloc can realizes that the atomic_refcnt is still not 0 yet, and move some of the reference counts to ob_refcnt. _Py_Dealloc then returns without deallocating the object. C can then drop ob_refcnt back to 0 again, repeating the process.
If the other edge case (Python runs out of atomic_refcnt) is hit, then the CLock is picked up, and reference counts can be moved back. There are also no concurrent access to the field as since in_C is set, neither C nor Python would touch ob_refcnt without holding CLock.
There could very well be other reasons why the object struct can’t be changed without breaking ABI in subtle ways. I very much not familiar with the code base.
Yeah. I may play around with it for a little bit anyway, cause I think it is nifty/cool. And I’m curious.
Anyway, thank you for spending time reading over it.
Perhaps in a post-PEP-703 future we’d like to find ourselves in a place where implementing stealing is possible without another major hiccup.
The implementation of stealing doesn’t really need to go into 3.12, but maybe it requires some preparation like adding a thief_id: Optional[ThreadID] to all objects?
GC in nogil is stop-the-world, so maybe the ownership of objects could change after a GC run… but then if we have three threads, the first two can both try to steal the same object from the third one. What should happen in this case? Maybe explicit strealing should return a boolean like success = steal(obj), so the caller might need to loop until this succeeds or it could also abort if it finds there is someone else trying to get it.
My point is that you brought up the discussion about stealing objects in a GIL-less world, but it hasn’t really been discussed for nogil itself and it looks like PEP-703 might not support it and not reserve memory to do so in the future without breaking ABI again. Someone at some point should start thinking about it or else we might see a python 5.0.0 release a year after 4.0.0 ships.
In your design steal_requests is a AtomicQueue[Tuple[Thread, object]]. Why? If that queue grows, the object ownership is going to fly around threads like a roulette ball. I don’t think this should be the low-level behavior - user should be able to try to steal it but if someone else attempts it in the same epoch, it should either end up in an error or a no-op. It looks like a no-op (returning False) would be easier to handle.
Do you feel strongly about it being a queue? Why?
Do you think your implementation of stealing could be portable between nogil and 28234?
I don’t think your design is ABI compatible with existing abi3 extensions, simply because it is not possible to add a new field to all objects.
If you add a new field at the end of PyObject, that breaks existing extensions that rely on the compiled value of sizeof(PyObject), e.g. because they allocate a struct starting with a PyObject field (as is common for extension classes).
If you add the new field before the start of PyObject (as happens with the magic gc-tracking-fields), then only objects allocated by Python will have those fields. But there’s also objects directly allocated by extensions (e.g. directly with malloc instead of PyObject_Malloc; or maybe as a static variable). This is currently valid for types without gc tracking. But it means there’s no way to add new fields to every Python object without recompiling extension modules.