There are several subsystems where we have struggled with performance and thread scaling. I will talk about the three main ones:
We have a NoSQL graph database that stores all the data for the system. Originally it was pure Python; now it is mostly C++ for performance, but some critical parts are still Python. The C++ is very heavily multi-threaded so that it can scale across CPUs, which makes calls into Python code a significant choke point: every such call must hold the GIL, so the threaded C++ serializes whenever it touches Python. Both the C++ and Python parts have a great deal of state that must be shared between threads. If the Python parts could run truly concurrently, that would remove one of the most significant bottlenecks in the database. Using multiple processes is not plausible here for the Python parts, and I do not think multiple interpreters would be possible either, because of the shared state. The long-term thinking is that we will end up replacing all the Python parts with C++, but we would be able to reevaluate that if the Python could scale across CPUs.
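To make this bottleneck concrete, here is a minimal Python-side sketch (invented workload, not BMC code) of what happens when several threads all need to run CPU-bound Python at once: the results come out correct, but because the GIL lets only one thread execute bytecode at a time, the threads make essentially no use of additional CPUs.

```python
import threading

def cpu_bound_work(n):
    # Stand-in for the Python parts of the database that the C++ threads
    # call into; pure bytecode execution, so it holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total

results = []
lock = threading.Lock()

def worker():
    r = cpu_bound_work(100_000)
    with lock:
        results.append(r)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Eight threads produce eight correct results, but on a GIL-based CPython
# the computation effectively ran one thread at a time, not eight-wide.
```

The same serialization happens when the callers are C++ threads rather than Python ones: each must acquire the GIL before executing any Python, so the heavily threaded C++ queues up behind a single lock.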
The second part of the system of interest is the part that connects to all the computers in the environment and interrogates them; for example, it runs commands on remote computers and parses the results that come back. It spends most of its time waiting for results, so to get good throughput it uses hundreds of threads to talk to many targets simultaneously. Inevitably, multiple results then arrive at the same time and require parsing in Python code, and the GIL becomes a significant bottleneck. This is a difficult workload because it flips between long periods of blocking and sudden spikes of processing: at any given moment, hundreds of threads are blocked while a handful have CPU-intensive work to do, and we cannot predict which threads will complete at which times. Memory usage means we can’t run hundreds of processes. Clearly we could run a number of multi-threaded processes, but any one of them could still be unlucky enough to receive a sudden spike of CPU load. If the parsing could scale across CPUs, that would definitely be a significant benefit.
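The shape of that workload can be sketched as follows (a toy simulation, with invented names and a trivial "parse" step standing in for the real result parsing): each thread blocks for a while, during which it releases the GIL and waits cheaply, then does a burst of CPU-bound parsing, during which threads that wake together must take turns on the GIL.

```python
import queue
import threading
import time

def interrogate(target_id, out):
    # Simulate waiting on a remote command. Blocking I/O releases the
    # GIL, so hundreds of threads can wait in parallel at low cost.
    time.sleep(0.05)
    # Simulate the CPU-bound parse of the returned output. Under the
    # GIL, threads that finish waiting together serialize here.
    payload = f"output-from-target-{target_id}" * 200
    parsed = sum(ord(c) for c in payload)
    out.put((target_id, parsed))

out = queue.Queue()
threads = [threading.Thread(target=interrogate, args=(i, out))
           for i in range(200)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Collect every (target_id, parsed) pair produced by the 200 threads.
results = {}
while not out.empty():
    tid, parsed = out.get()
    results[tid] = parsed
```

The waiting phase scales fine today; it is only the unpredictable parsing spikes that a no-GIL Python would let us spread across CPUs.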
The third part to mention is what we call the “engine”, which is responsible for the main meat of the data processing, including running all the Python code generated from our in-house language. Here we do run multiple processes to scale across CPUs, but that brings quite a lot of complexity in coordinating the actions of the processes, and we still see situations where one engine process happens to get unlucky and has too much work to do while others sit idle. A single, truly multi-threaded engine would be more efficient and easier to manage.
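The "unlucky process" effect is easy to illustrate with a toy scheduling model (illustrative numbers only, nothing to do with the real engine): when work is statically partitioned across fixed processes, one partition can end up with all the expensive items, whereas threads pulling from a shared queue naturally balance the load.

```python
import heapq

# 32 work items where every fourth one is a cost spike.
task_costs = [10 if i % 4 == 0 else 1 for i in range(32)]
n_workers = 4

# Static partitioning, as with pre-assigned engine processes: round-robin
# happens to send every expensive task to the same worker.
static_load = [0] * n_workers
for i, cost in enumerate(task_costs):
    static_load[i % n_workers] += cost

# Dynamic scheduling, as a single multi-threaded engine pulling from a
# shared queue would behave: each task goes to the least-loaded worker.
dynamic_load = [0] * n_workers
heap = [(0, w) for w in range(n_workers)]
heapq.heapify(heap)
for cost in task_costs:
    load, w = heapq.heappop(heap)
    dynamic_load[w] = load + cost
    heapq.heappush(heap, (dynamic_load[w], w))

# Makespan = time until the slowest worker finishes. Static assignment
# piles all eight spikes onto one worker (makespan 80); dynamic
# assignment spreads them out, finishing much sooner.
static_makespan = max(static_load)
dynamic_makespan = max(dynamic_load)
```

In practice we mitigate this with coordination logic between the engine processes, but that is exactly the complexity a single multi-threaded engine would remove.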
In summary, BMC Discovery is a large real-world Python application with several areas in which a NoGIL Python would make a substantial difference. Faster processing of Python code is extremely valuable too, but given a choice between single-threaded performance and scaling across CPUs, the CPU scaling is more valuable to us. Customers run this product on machines with 32 or 64 CPUs, so we will happily take a 10% hit in single-threaded performance if it means we can get 30 or 60 times more performance through CPU scaling.