I’m working on a fork of the json module, but for some reason even with the same code (GitHub - nineteendo/jsonc), I’m seeing a difference in performance ranging from 30% slower to 70% faster:
| encode | json | jsonc | unit (μs) |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1.00 | 0.93 | 1568.74 |
| List of 16,384 ASCII strings | 1.00 | 1.12 | 2830.72 |
| List of 4,096 floats | 1.00 | 1.00 | 2996.67 |
| List of 4,096 dicts with 1 int | 1.00 | 1.05 | 1369.86 |
| Medium complex object | 1.00 | 0.99 | 140.46 |
| List of 4,096 strings | 1.00 | 0.93 | 5979.93 |
| Complex object | 1.00 | 0.98 | 1590.89 |
| Dict with 256 lists of 256 dicts with 1 int | 1.00 | 1.05 | 22387.50 |
| decode | json | jsonc | unit (μs) |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1.00 | 1.30 | 1177.90 |
| List of 16,384 ASCII strings | 1.00 | 0.89 | 1679.06 |
| List of 4,096 floats | 1.00 | 1.04 | 1086.71 |
| List of 4,096 dicts with 1 int | 1.00 | 1.08 | 1270.86 |
| Medium complex object | 1.00 | 1.13 | 100.01 |
| List of 4,096 strings | 1.00 | 0.60 | 1615.06 |
| Complex object | 1.00 | 0.87 | 1237.70 |
| Dict with 256 lists of 256 dicts with 1 int | 1.00 | 1.06 | 29709.79 |
Could someone explain to me how this is possible, and how I can improve it? Am I missing compiler flags? @ZeroIntensity, have you got a clue?
At a glance, there’s no -O3 in your compile settings (though I’m not sure that matters – setuptools might apply it by default). That, and CPython applies all sorts of fancy optimizations (LTO and PGO are the two I’m aware of) – it makes sense that a version compiled by setuptools is slower.
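If setuptools isn’t passing -O3, it can be set explicitly per extension. A minimal sketch, assuming a setuptools build; the module name and source path here are placeholders, not jsonc’s actual layout:

```python
# Sketch: passing optimization flags explicitly through setuptools.
# "jsonc._speedups" and "src/jsonc.c" are placeholder names.
from setuptools import Extension

ext = Extension(
    "jsonc._speedups",
    sources=["src/jsonc.c"],
    # -O3 enables aggressive optimization; -flto enables link-time
    # optimization, similar to what CPython's own optimized builds use.
    extra_compile_args=["-O3", "-flto"],
    extra_link_args=["-flto"],
)

# In a real setup.py this would then be passed to setup():
# setup(name="jsonc", ext_modules=[ext])
```

Note this only covers the compiler side; PGO is a different story, since it needs a training run and is normally something only CPython’s own `--enable-optimizations` build does.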
That being said, I’m not 100% sure whether things like PGO are applied to extension modules – they might only apply to the interpreter core. (@eclips4 is pretty knowledgeable about the builds; he probably has more insight than I do.)
I would suggest changing the compiler (e.g., perhaps you compiled your version with gcc, but json was built with clang), and also seeing if the build system affects it (maybe things like scikit-build-core or meson-python have better optimizations than setuptools does).
I don’t think there’s support on macOS, although you could use it inside a Linux VM.
On Mac, I think the only way to do this is with the Instruments app. I’ve never had much success getting it working on a Mac. Thankfully I have a trusty x86 Linux laptop for stuff like this…
Oh right, you are on macOS.
I assume you have Xcode installed, which gives you the LLVM C compiler (clang).
Do a web search for “Xcode profile” and “LLVM profile”.
That will give you pointers to the tools you have available.
Trying to go through a profiler solely meant for C code seems like the wrong approach here. You’ll get all sorts of noise from the eval loop (namely, it will appear that 99% of the program is stuck in _PyEval_EvalFrameDefault), which is going to make it very difficult to find where the slowdown actually is.
However, I do see a separate cause for the slowdown: your version isn’t exactly the same as the built-in version. You’re not using the private API, but _json is, so you lose any internal optimization benefits that CPython might be getting (such as PGO, as previously mentioned). Similarly, private APIs generally don’t have as many checks as the public API, so they are slightly faster.
The main slowdown in your code, other than the lack of PGO, is probably the use of PyUnicode_FromString instead of _Py_ID inside _encoded_const, which affects every use of True, False, and None. The former has to count the length of the string and then run it through the large UTF-8 decoder, whereas _Py_ID is quite literally a constant-time pointer lookup.
It’s possible to use the private API (simply define Py_BUILD_CORE, as you’ve already done, and then adjust the internal header paths so they resolve from an extension – e.g., pycore_object.h in CPython would be internal/pycore_object.h in an extension), but there are downsides to using it. In this case, I wouldn’t try to deal with _Py_ID; instead, make a global (or better yet, use module state to support subinterpreters) that contains the cached constant strings, and return those.
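For illustration, here is the shape of that suggestion as a pure-Python analogue (the real fix belongs in jsonc’s C code, and the names here are made up): the constants are built once at import time – the analogue of storing PyObject * pointers in a module-state struct – and every call just returns the same cached object.

```python
# Illustrative Python analogue of caching the encoder's constant
# strings once, instead of rebuilding them on every call.
# (Names are hypothetical; the actual fix would live in jsonc's C code.)

# Built a single time at import -- the analogue of PyObject* fields
# filled in during module init and kept in module state.
_TRUE, _FALSE, _NULL = "true", "false", "null"

def encoded_const(obj):
    """Return the cached JSON literal for True, False, or None."""
    # Identity checks mirror the pointer comparisons the C code does.
    if obj is True:
        return _TRUE
    if obj is False:
        return _FALSE
    if obj is None:
        return _NULL
    raise ValueError(f"not a JSON constant: {obj!r}")
```

The point of the pattern is that repeated calls return the identical cached object with no string construction at all, which is what makes _Py_ID (or a module-state cache) so much cheaper than PyUnicode_FromString.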