PEP 703: Making the Global Interpreter Lock Optional (3.12 updates)

markshannon · June 2, 2023, 2:39pm

We (the Faster CPython team) have taken a careful look at the PEP and I have written up a summary. It is mainly about performance, but also compares PEP 703 to multiple interpreters.
I have attempted to be objective; please say if you think I am being unfair.

Performance assessment

Performance comparison of NoGIL with the status quo

PEP 703 claims that NoGIL has a 6% execution overhead on Intel Skylake machines, and 5% on AMD Zen 3.
Our benchmarking shows an 11% overhead on our (Cascade Lake) machine.

The NoGIL branch includes some major changes to the cycle GC and the memory allocator. Comparing NoGIL with our (relatively low effort) attempt to add those changes to the base commit shows 14% overhead.
Our attempt to isolate the changes to the cycle collector consists of reducing the number of generations to one, and setting the threshold of that generation to 7000.
A perfect comparison would be a lot more work, as the changes to the cycle GC in the NoGIL branch are hard to isolate.

Earlier experiments with mimalloc showed a small ~1% speedup. Adding that to the 14% above gives a best-guess number for the overhead of 15%.

A couple of things to note:

The overhead is not evenly spread. Some programs will show a much larger overhead, exceeding 50% in some cases (e.g. the MyPy benchmark), and some will have negligible overhead.
There are no benchmarks that have the workload spread across multiple threads, as Sam mentions in the PEP. Such benchmarks should show speedups with NoGIL.

Future and ongoing impact on performance

It is not clear how the overhead of NoGIL will change as CPython gets faster.
A large part of the overhead of NoGIL is in the more complex reference counting mechanism.
Reducing the number of reference counting operations is part of our optimization strategy,
which would reduce the absolute overhead of reference counting.
However, we expect large gains elsewhere, so the proportional overhead would probably increase.
There is also the overhead of synchronizing and locking. This is unlikely to change much in absolute terms, so would get considerably larger as a ratio.

Assuming a 10% overhead from reference counting and a 5% overhead from locking:
If we double the speed of the rest of the VM and halve the reference counting overhead, the overhead of NoGIL rises to 20%.
If we double the speed of the VM without changing reference counting, the overhead of NoGIL rises to 30%.
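
The arithmetic behind those two figures can be sketched as below. This is only an illustration of the estimate, not a measurement: the function name and its parameters are made up for this example, and it assumes the locking cost stays fixed in absolute terms while the baseline runtime shrinks.

```python
def projected_overhead(refcount, locking, vm_speedup, refcount_factor):
    """Return the NoGIL overhead as a fraction of the sped-up baseline runtime.

    refcount, locking: today's overheads, as fractions of today's baseline runtime.
    vm_speedup: how many times faster the rest of the VM becomes.
    refcount_factor: multiplier applied to the absolute reference counting cost.
    Assumption: the locking cost does not change in absolute terms.
    """
    baseline = 1.0 / vm_speedup                       # new runtime without NoGIL
    extra = refcount * refcount_factor + locking      # absolute NoGIL cost
    return extra / baseline


# Today: 10% reference counting + 5% locking.
today = projected_overhead(0.10, 0.05, vm_speedup=1.0, refcount_factor=1.0)
# VM twice as fast, reference counting overhead halved.
guess_1 = projected_overhead(0.10, 0.05, vm_speedup=2.0, refcount_factor=0.5)
# VM twice as fast, reference counting unchanged.
guess_2 = projected_overhead(0.10, 0.05, vm_speedup=2.0, refcount_factor=1.0)

print(f"today: {today:.0%}, guess 1: {guess_1:.0%}, guess 2: {guess_2:.0%}")
```

In both projections the absolute cost of NoGIL stays the same or falls, but it is divided by a baseline that is half as long, which is why the ratio goes up.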

Summary of overhead estimates

All overheads are given as a percentage of the runtime without the NoGIL changes.

Measurement	Overhead
PEP 703 claimed	6%
Our unadjusted measurement	11%
Adjusted for cycle GC changes	14%
Overall impact of NoGIL on 3.12*	15%
Guess 1 for 3.13/3.14	20%
Guess 2 for 3.13/3.14	30%

* Adjusted for cycle GC plus 1% estimated speedup from mimalloc integration.

Opportunity costs

The adaptive specializing interpreter relies on the GIL; it is not thread-friendly.
If NoGIL is accepted, then some redesign of our optimization strategy will be necessary.
While this is perfectly possible, it does have a cost.
The effort spent on this redesign and resulting implementation is not being spent on actual optimizations.

The interactions between instrumentation (PEP 669, sys.settrace, etc), the cycle GC, and optimization, are already subtle.
Ensuring that all the parts work correctly together takes considerable effort, and slows down work on speeding up CPython.
Adding free-threading to that mix will increase the complexity considerably, resulting in a lot of effort being spent on handling corner cases that simply do not occur with the GIL.

Comparison to multiple interpreters

Python 3.12 offers support for executing multiple Python interpreters in parallel in the same process.

For the purposes of this discussion, let’s categorize parallel applications into three groups:

1. Data-store backed processing. All the shared data is stored in a data store, such as a Postgres database. The processes share little or no data, communicating via the data store.
2. Numerical processing (including machine learning) where the shared data is matrices of numbers.
3. General data processing where the shared data is in the form of an in-memory object graph.

Python has always supported category 1, with multiprocessing or through some sort of external load balancer.
Category 2 is supported by multiple interpreters.
It is category 3 that benefits from NoGIL.

How common is category 3?
All the motivating examples in PEP 703 are in category 2.