Performance difference on fork of json module

I’m working on a fork of the json module, but for some reason even with the same code (GitHub - nineteendo/jsonc), I’m seeing a difference in performance ranging from 30% slower to 70% faster:

encode                                        json  jsonc  unit (μs)
List of 65,536 booleans                       1.00   0.93    1568.74
List of 16,384 ASCII strings                  1.00   1.12    2830.72
List of 4,096 floats                          1.00   1.00    2996.67
List of 4,096 dicts with 1 int                1.00   1.05    1369.86
Medium complex object                         1.00   0.99     140.46
List of 4,096 strings                         1.00   0.93    5979.93
Complex object                                1.00   0.98    1590.89
Dict with 256 lists of 256 dicts with 1 int   1.00   1.05   22387.50

decode                                        json  jsonc  unit (μs)
List of 65,536 booleans                       1.00   1.30    1177.90
List of 16,384 ASCII strings                  1.00   0.89    1679.06
List of 4,096 floats                          1.00   1.04    1086.71
List of 4,096 dicts with 1 int                1.00   1.08    1270.86
Medium complex object                         1.00   1.13     100.01
List of 4,096 strings                         1.00   0.60    1615.06
Complex object                                1.00   0.87    1237.70
Dict with 256 lists of 256 dicts with 1 int   1.00   1.06   29709.79

Could someone explain to me how this is possible, and how I can improve it? Am I missing compiler flags?
@ZeroIntensity, have you got a clue?

Are the times stable or do they vary a lot from run to run? And what if you switch the order of the two?


The 30% and 70% differences are fairly stable (the rest is mostly noise).
Changing the order doesn’t remove the discrepancy:

decode                    jsonc  json  unit (μs)
List of 65,536 booleans    1.30  1.00    1184.16
List of 4,096 strings      0.51  1.00    1879.38

At a glance, there’s no -O3 in your compile settings (though I’m not too sure that it matters; setuptools might apply it by default). On top of that, CPython itself is built with all sorts of fancy optimizations (LTO and PGO are the two I’m aware of), so it makes sense that a version compiled via setuptools is slower.

That being said, I’m not 100% sure whether things like PGO are applied to extension modules; they could apply only to the interpreter core. (@eclips4 is pretty knowledgeable about the builds; he probably has more insight than I do.)

I would suggest changing the compiler (e.g., perhaps you compiled your version with gcc, but json was built with clang), and also seeing whether the build system affects it (maybe things like scikit-build-core or meson-python have better optimizations than setuptools does).


-O3 is enabled by default, so it’s not that:

clang
-fno-strict-overflow
-Wsign-compare
-Wunreachable-code
-fno-common
-dynamic
-DNDEBUG
-g
-O3
-Wall
-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk
-I/usr/local/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/include/python3.12
-c jsonc/_speedups.c
-o build/temp.macosx-13.0-x86_64-cpython-312/jsonc/_speedups.o

Try profiling the C code to see how the two versions differ.

How can I profile compiled C code? The source code is the same.

I tried with gcc and with another build system, but the performance remains the same.

scikit-build-core         json  jsonc  unit (μs)
List of 65,536 booleans   1.00   1.34    1245.57
List of 4,096 strings     1.00   0.80    1699.13

meson-python              json  jsonc  unit (μs)
List of 65,536 booleans   1.00   1.29    1158.28
List of 4,096 strings     1.00   0.78    1620.05

In my experience, austin is pretty nice.

Then I would imagine PGO is helping CPython’s version. I’ll wait for Kirill’s input, though.

How do I get austinp on macOS? There are only instructions for Linux.

But the machine code is not likely to be the same, and that is what you can profile.

Isn’t that a Python profiler?
Does it also know how to profile extension code?

Yeah, but which tool can I use for that? I only know of cProfile for Python code.

It does support extensions… or at least I think it does. I tried a number of extension profilers a while back; I think that was one of them.

What OS and what compiler are you using?
gcc has gprof. I’d have to look up what LLVM uses.

I don’t think there’s support on macOS, although you could use it inside a Linux VM.

On Mac, I think the only way to do this is with the Instruments app. I’ve never had much success getting it to work on a Mac. Thankfully, I have a trusty x86 Linux laptop for stuff like this…

Oh right, you’re on macOS.
I assume you have Xcode installed, to get the LLVM C compiler.
Do a web search for “Xcode profile” and “LLVM profile”.
That should give you pointers to the tools you have.

macOS 13.6.9 (22G830)
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Can you use a C profiler with a Python script?

I don’t have Xcode installed, at least not anymore. It’s a very big application, and this is the first time I’m compiling C in any form.

Yes. What you end up with is details of the C functions that are called in Python itself and in your json extension.

I suggest you install Xcode again, once you’ve looked up on the web how to profile code with it.

Trying to go through a profiler solely meant for C code seems like the wrong approach here. You’ll get all sorts of noise from the eval loop (namely, it will appear that 99% of the program is stuck in _PyEval_EvalFrameDefault), which is going to make it very difficult to find where the slowdown actually is.

However, I do see a separate cause for the slowdown: your version isn’t exactly the same as the built-in version. You’re not using the private API, but _json is, so you lose any internal optimization benefits that CPython might be getting (such as PGO, as previously mentioned). Similarly, private APIs generally don’t have as many checks as the public API, so they are slightly faster.

The main slowdown in your code, other than the lack of PGO, is probably the use of PyUnicode_FromString instead of _Py_ID inside _encoded_const, which affects every occurrence of True, False, and None. The former has to count the size of the string and then go through the large UTF-8 decoder, whereas _Py_ID is quite literally a constant-time pointer lookup.
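To illustrate the difference, here is a hypothetical reconstruction of the slow path being described; it is based only on this discussion, not on the fork’s actual code:

#include <Python.h>

/* Hypothetical reconstruction of a public-API _encoded_const;
 * the fork's actual code may differ. */
static PyObject *
_encoded_const(PyObject *obj)
{
    if (obj == Py_None) {
        /* Scans for the terminating NUL, then runs the UTF-8 decoder,
         * allocating a fresh string object on every call. */
        return PyUnicode_FromString("null");
    }
    if (obj == Py_True) {
        return PyUnicode_FromString("true");
    }
    if (obj == Py_False) {
        return PyUnicode_FromString("false");
    }
    PyErr_SetString(PyExc_ValueError, "not a const");
    return NULL;
}

/* CPython's _json can instead do:
 *     return Py_NewRef(&_Py_ID(null));
 * a pointer lookup plus an incref, with no allocation or decoding. */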

It’s possible to use the private API (simply define Py_BUILD_CORE, as you’ve already done, and then switch the internal headers to something that works for extensions, e.g. pycore_object.h in CPython would be internal/pycore_object.h in an extension), but there are some downsides to using it. In this case, I wouldn’t try to deal with _Py_ID, but instead just make a global (or better yet, use module state to support subinterpreters) that contains the cached names, and return them.
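A minimal sketch of that suggestion, using module state; the names here (module_state, jsonc_exec, encoded_const) are illustrative, not taken from the fork:

#include <Python.h>

/* Per-module cache for the three JSON constant strings. */
typedef struct {
    PyObject *str_null, *str_true, *str_false;
} module_state;

/* Runs once at import: build (and intern) each constant a single time. */
static int
jsonc_exec(PyObject *mod)
{
    module_state *st = PyModule_GetState(mod);
    st->str_null  = PyUnicode_InternFromString("null");
    st->str_true  = PyUnicode_InternFromString("true");
    st->str_false = PyUnicode_InternFromString("false");
    if (st->str_null == NULL || st->str_true == NULL || st->str_false == NULL) {
        return -1;
    }
    return 0;
}

/* Hot path: no string construction, just a pointer read and an incref. */
static PyObject *
encoded_const(module_state *st, PyObject *obj)
{
    if (obj == Py_None)  return Py_NewRef(st->str_null);
    if (obj == Py_True)  return Py_NewRef(st->str_true);
    if (obj == Py_False) return Py_NewRef(st->str_false);
    PyErr_SetString(PyExc_ValueError, "not a JSON constant");
    return NULL;
}

For this sketch to work, the PyModuleDef needs m_size = sizeof(module_state) and the exec function registered in a Py_mod_exec slot; the cached strings should also be visited and cleared in m_traverse/m_clear/m_free so the module cleans up properly.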
