PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

Hi Mark,

  • You are comparing to a different base than the PEP. As stated in the PEP and at the language summit, the comparisons are to the implementation of immortal objects from PR 19474, specifically commit 018be4c. I think that’s an appropriate comparison because supporting immortal objects is a performance cost shared between this PEP and 3.12. That commit was chosen because it was recent and the base of that PR has the same performance as 3.12.0a4.
  • I believe you are using an old commit for nogil-3.12, from before the language summit and PEP update when nogil-3.12 was still in active development. I am comparing d595911 with 018be4c on Ubuntu 22.04 (AWS c5n.metal) with GCC 11 and PGO+LTO. I am using d595911 because it’s the commit right before disabling the GIL; that avoids any complications from bm_concurrent_imap using threads.
  • You linked to a document that reports 10% overhead, but your post says 11%.
  • The GC numbers are a useful data point. Thanks for providing them; they were larger than I expected.
  • Mimalloc does not provide a 1% performance improvement. To the extent that it provided a performance benefit in nogil-3.9, that was because it offset the cost of removing freelists, but nogil-3.12 keeps freelists as per-thread state. You can compare a595fc1593 with d13c63dee9 (the commits that integrate mimalloc). I measured a tiny (<1%) regression from adding mimalloc.
  • Locking is not 5% overhead; it’s about 1%. You can measure it by compiling the critical sections API down to no-ops and benchmarking against the normal build (see the sketch after this list).
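
For anyone who wants to see what that measurement looks like in miniature, here is a self-contained sketch. It is not the actual nogil/CPython critical-section API (the macro names, the `FakeObject` type, and the pthread mutex are all stand-ins); the real measurement rebuilds the interpreter with its critical-section macros defined to no-ops and reruns the benchmark suite. The sketch only illustrates the idea of timing the same code path with the lock acquisition compiled in and compiled out.

```c
/* Illustrative only: time the same loop with and without per-object locking.
 * Build normally, then with -DNO_LOCKING, and compare the timings. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

typedef struct {
    pthread_mutex_t mutex;   /* stand-in for a per-object lock */
    volatile long value;     /* volatile so the no-op build keeps the loop */
} FakeObject;

#ifdef NO_LOCKING
/* "No-op" build: critical sections compile away entirely. */
#define BEGIN_CRITICAL_SECTION(op) ((void)0)
#define END_CRITICAL_SECTION(op)   ((void)0)
#else
/* Normal build: acquire and release the object's lock. */
#define BEGIN_CRITICAL_SECTION(op) pthread_mutex_lock(&(op)->mutex)
#define END_CRITICAL_SECTION(op)   pthread_mutex_unlock(&(op)->mutex)
#endif

int main(void) {
    FakeObject obj = { PTHREAD_MUTEX_INITIALIZER, 0 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 100000000L; i++) {
        BEGIN_CRITICAL_SECTION(&obj);
        obj.value += i;               /* stand-in for the protected operation */
        END_CRITICAL_SECTION(&obj);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("value=%ld, total: %.3f s\n", (long)obj.value, secs);
    return 0;
}
```

Compiling once with `gcc -O2 -pthread bench.c` and once with `-DNO_LOCKING` added, then comparing the two timings, isolates the cost of the lock acquisition for this toy workload; the real number quoted above comes from doing the equivalent with the interpreter itself.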

Regarding parallel applications, I disagree with both the general categorization and the specific claims about those categories. Many Python applications do not fall neatly into a single category. Outside of simple examples, machine learning applications are rarely “just” machine learning. They frequently include things like data processing, logging, model serving, I/O, and network communication. From what I’ve seen of Meta’s Python web servers, you might think from a high-level description that they fall neatly into category 1, but if you look more deeply, there is more shared state than you might expect because the underlying C++ libraries have their own thread pools and caches. (A lot of that infrastructure is designed to be used from both C++ and Python services.)

I also disagree with the notion that machine learning is well served by multiple interpreters. PyTorch has had a multiple-interpreters solution (multipy/torch::deploy) for about two years. Based on that experience, I don’t think many PyTorch developers would consider multiple interpreters to have solved the multi-core story once and for all. The challenges include both usability issues due to the lack of sharing in Python (even for read-only objects) and additional technical complexity due to the potential 1:N mapping from C++ objects to Python wrappers (sketched below).
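
To make the 1:N mapping concrete, here is a hedged sketch in C of the bookkeeping a binding layer ends up carrying when one native object is visible from several interpreters. Everything here is hypothetical (the `SharedNativeObject` struct, the fixed-size table, the `make_wrapper` callback); it is not how multipy/torch::deploy is implemented, only an illustration of why a single C++ object can fan out to N Python wrappers.

```c
/* Illustrative only: caching one Python wrapper per interpreter for a single
 * shared native object. Real bindings are far more involved. */
#include <Python.h>

#define MAX_INTERPRETERS 16

typedef struct {
    void *native_data;                          /* the one underlying C/C++ object */
    PyInterpreterState *interps[MAX_INTERPRETERS];
    PyObject *wrappers[MAX_INTERPRETERS];       /* one wrapper per interpreter */
    int count;
} SharedNativeObject;

/* Return (lazily creating) the wrapper owned by the current interpreter.
 * With free threading there is exactly one wrapper; with multiple
 * interpreters the same native object fans out to N of them. */
static PyObject *
get_wrapper_for_current_interpreter(SharedNativeObject *obj,
                                     PyObject *(*make_wrapper)(void *))
{
    PyInterpreterState *interp = PyInterpreterState_Get();
    for (int i = 0; i < obj->count; i++) {
        if (obj->interps[i] == interp) {
            Py_INCREF(obj->wrappers[i]);        /* new reference for the caller */
            return obj->wrappers[i];
        }
    }
    if (obj->count >= MAX_INTERPRETERS) {
        PyErr_SetString(PyExc_RuntimeError, "too many interpreters");
        return NULL;
    }
    PyObject *wrapper = make_wrapper(obj->native_data);
    if (wrapper == NULL) {
        return NULL;
    }
    obj->interps[obj->count] = interp;
    obj->wrappers[obj->count] = wrapper;        /* cache keeps its own reference */
    obj->count++;
    Py_INCREF(wrapper);                         /* new reference for the caller */
    return wrapper;
}
```

With free threading there is a single wrapper and none of this lookup exists; with multiple interpreters, every crossing of the C++/Python boundary first has to answer “which interpreter’s wrapper?” and keep all of the wrappers consistent with each other.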

Regarding multiple interpreters, there are some implicit assumptions that I don’t think are correct:

  • That free threading is a world of pain and that multiple interpreters somehow avoid it. The isolation from multiple interpreters is only on the Python side, but to the extent that there are hard-to-debug thread-safety bugs, they are pretty much exclusively on the C/C++ side of things. Multiple interpreters do not isolate C/C++ code.
  • That nogil is going to break the world, when there is an implementation that works with a large number of extensions (NumPy, SciPy, scikit-learn, PyTorch, pybind11, and Cython, to name a few) and extension developers have repeatedly expressed interest in supporting it. This is in contrast to multiple interpreters, which require more work from extension developers to support. For example, it took only a few lines of code and less than a day to get PyTorch working with nogil-3.9; it would be a very large undertaking (weeks or months) to get it working with multiple interpreters. This is also in contrast to the 3.11 release: it has taken over a month of engineering work for PyTorch to support 3.11.