PyGILState_Ensure (take_gil) too much overhead

Hi, I noticed that PyGILState_Ensure() calls take_gil(tstate) internally. In some cases this call is very slow, taking on the order of seconds. I only have 2 threads running: one is the CPython main thread, which runs the Python program (including GC, etc.). The other is a helper thread that tries to acquire the GIL.
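
To make the setup concrete, here is a minimal pure-Python sketch of the same two-thread situation. It is not the actual C helper: `measure_gil_waits` is a hypothetical name, and the sketch assumes a GIL-enabled CPython build. The helper thread calls `time.sleep(0)`, which releases the GIL and then has to re-acquire it, while the main thread runs a busy loop. Each timing therefore includes one wait for the GIL.

```python
import sys
import threading
import time


def measure_gil_waits(samples=20):
    """Time how long a helper thread takes to get the GIL back.

    Hypothetical helper, for illustration only. time.sleep(0) releases
    the GIL and then re-acquires it, so every timing below includes one
    GIL re-acquisition while the main thread is busy.
    """
    waits = []
    done = threading.Event()

    def helper():
        for _ in range(samples):
            start = time.perf_counter()
            time.sleep(0)  # drop the GIL, then wait to get it back
            waits.append(time.perf_counter() - start)
        done.set()

    worker = threading.Thread(target=helper)
    worker.start()

    # Main thread: pure-Python busy loop that keeps running bytecode
    # until the helper has collected all of its samples.
    counter = 0
    while not done.is_set():
        counter += 1

    worker.join()
    return waits


if __name__ == "__main__":
    waits = measure_gil_waits()
    print("switch interval: %.4f s" % sys.getswitchinterval())
    print("longest wait:    %.4f s" % max(waits))
```

With a pure-Python main loop like this one, the waits should stay roughly in the range of `sys.getswitchinterval()`, because the interpreter asks the running thread to drop the GIL after that interval. A wait of seconds would more likely mean the main thread is inside a long C call that does not release the GIL, since the helper cannot get the GIL until that call returns or releases it.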

I didn't read the implementation of take_gil() line by line, but I assume it just manipulates thread state. How can it become so slow? Is it because of the number of PyObjects in some Python programs, or even GC? Thanks.