Running this through a profiler meant solely for C code seems like the wrong approach here. You’ll get all sorts of noise from the eval loop (namely, it will appear that 99% of the program is stuck in `_PyEval_EvalFrameDefault`), which is going to make it very difficult to find where the slowdown actually is.
However, I do see a separate cause for the slowdown: your version isn’t exactly the same as the built-in version. You’re not using the private API, but `_json` is, so you lose any internal optimization benefits that CPython might be getting (such as PGO, as previously mentioned). Similarly, private APIs generally don’t have as many checks as the public API, so they are slightly faster.
The main slowdown in your code, other than the lack of PGO, is probably due to the use of `PyUnicode_FromString` over `_Py_ID` inside `_encoded_const`, which affects every `True`, `False`, and `None` that gets encoded. The former has to measure the length of the C string and then run it through the large UTF-8 decoder on every call, whereas `_Py_ID` is quite literally a constant-time pointer lookup.
It’s possible to use the private API (simply define `Py_BUILD_CORE`, as you’ve already done, and then switch the internal headers to paths that work for extensions, e.g. `pycore_object.h` in CPython becomes `internal/pycore_object.h` in an extension), but there are some downsides to using it. In this case, I wouldn’t try to deal with `_Py_ID`; instead, just make a global (or better yet, use module state to support subinterpreters) that holds the cached strings, and then return them.