Performance difference on fork of json module

I’m working on a fork of the json module, but for some reason even with the same code (GitHub - nineteendo/jsonc), I’m seeing a difference in performance ranging from 30% slower to 70% faster:

encode                                        json  jsonc  unit (μs)
List of 65,536 booleans                       1.00   0.93    1568.74
List of 16,384 ASCII strings                  1.00   1.12    2830.72
List of 4,096 floats                          1.00   1.00    2996.67
List of 4,096 dicts with 1 int                1.00   1.05    1369.86
Medium complex object                         1.00   0.99     140.46
List of 4,096 strings                         1.00   0.93    5979.93
Complex object                                1.00   0.98    1590.89
Dict with 256 lists of 256 dicts with 1 int   1.00   1.05   22387.50

decode                                        json  jsonc  unit (μs)
List of 65,536 booleans                       1.00   1.30    1177.90
List of 16,384 ASCII strings                  1.00   0.89    1679.06
List of 4,096 floats                          1.00   1.04    1086.71
List of 4,096 dicts with 1 int                1.00   1.08    1270.86
Medium complex object                         1.00   1.13     100.01
List of 4,096 strings                         1.00   0.60    1615.06
Complex object                                1.00   0.87    1237.70
Dict with 256 lists of 256 dicts with 1 int   1.00   1.06   29709.79

Could someone explain to me how this is possible, and how I can improve it? Am I missing compiler flags?
@ZeroIntensity, have you got a clue?

Are the times stable or do they vary a lot from run to run? And what if you switch the order of the two?


The 30% and 70% differences are fairly stable (the rest is mostly noise).
Changing the order doesn’t remove the discrepancy:

decode                    jsonc  json  unit (μs)
List of 65,536 booleans    1.30  1.00    1184.16
List of 4,096 strings      0.51  1.00    1879.38

At a glance, there’s no -O3 in your compile settings (though I’m not too sure that it matters; setuptools might apply it by default). On top of that, CPython itself is built with all sorts of fancy optimizations (LTO and PGO are the two I’m aware of), so it makes sense that a version compiled via setuptools is slower.

That being said, I’m not 100% sure whether things like PGO are applied to extension modules; they could apply only to the interpreter core. (@eclips4 is pretty knowledgeable about the builds; he probably has more insight than I do.)

I would suggest changing the compiler (e.g., perhaps you compiled your version with gcc, but json was built with clang), and also seeing whether the build system affects it (maybe things like scikit-build-core or meson-python have better optimizations than setuptools does).


-O3 is enabled by default, so it’s not that:

clang
-fno-strict-overflow
-Wsign-compare
-Wunreachable-code
-fno-common
-dynamic
-DNDEBUG
-g
-O3
-Wall
-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk
-I/usr/local/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/include/python3.12
-c jsonc/_speedups.c
-o build/temp.macosx-13.0-x86_64-cpython-312/jsonc/_speedups.o

Try profiling the C code to see how the two versions differ.

How can I profile compiled C code? The source code is the same.

I tried with gcc and with another build system, but the performance remains the same.

scikit-build-core         json  jsonc  unit (μs)
List of 65,536 booleans   1.00   1.34    1245.57
List of 4,096 strings     1.00   0.80    1699.13

meson-python              json  jsonc  unit (μs)
List of 65,536 booleans   1.00   1.29    1158.28
List of 4,096 strings     1.00   0.78    1620.05

In my experience, austin is pretty nice.

Then I would imagine PGO is helping CPython’s version. I’ll wait for Kirill’s input, though.

How do I get austinp on macOS? There are only instructions for Linux.

But the machine code is not likely to be the same, and that is what you can profile.

Isn’t that a Python profiler?
Does it also know how to profile extension code?

Yeah, but which tool can I use for that? I only know of cProfile for Python code.

It does support extensions… or at least I think it does. I tried a number of extension profilers a while back; I think that was one of them.

What OS and what compiler are you using?
gcc has gprof. I’d have to look up what LLVM uses.

I don’t think there’s support on macOS, although you could use it inside a Linux VM.

On Mac, I think the only way to do this is with the Instruments app. I’ve never had much success getting it to work on a Mac. Thankfully, I have a trusty x86 Linux laptop for stuff like this…

Oh right, you’re on macOS.
I assume you have Xcode installed, to get the LLVM C compiler.
Do a web search for “Xcode profile” and “LLVM profile”.
That should give you pointers to the tools you have.

macOS 13.6.9 (22G830)
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Can you use a C profiler with a Python script?

I don’t have Xcode installed, at least not anymore. It’s a very big application, and this is the first time I’m compiling C in any form.

Yes. What you end up with is details of the C functions that are called in Python itself and in your json extension.

I suggest you install Xcode again, once you’ve looked up on the web how to profile code with it.

Trying to go through a profiler solely meant for C code seems like the wrong approach here. You’ll get all sorts of noise from the eval loop (namely, it will appear that 99% of the program is stuck in _PyEval_EvalFrameDefault), which is going to make it very difficult to find where the slowdown actually is.

However, I do see a separate cause for the slowdown: your version isn’t exactly the same as the built-in version. You’re not using the private API, but _json is, so you lose any internal optimization benefits that CPython might be getting (such as PGO, as previously mentioned). Similarly, private APIs generally don’t have as many checks as the public API, so they are slightly faster.

The main slowdown in your code, other than the lack of PGO, is probably the use of PyUnicode_FromString instead of _Py_ID inside _encoded_const, which affects every occurrence of True, False, and None. The former has to count the size of the string and then go through the large UTF-8 decoder, whereas _Py_ID is quite literally a constant-time pointer lookup.
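To illustrate the difference, here is a hypothetical reconstruction of the slow path being described; it is based only on this discussion, not on the fork’s actual code:

#include <Python.h>

/* Hypothetical reconstruction of a public-API _encoded_const;
 * the fork's actual code may differ. */
static PyObject *
_encoded_const(PyObject *obj)
{
    if (obj == Py_None) {
        /* Scans for the terminating NUL, then runs the UTF-8 decoder,
         * allocating a fresh string object on every call. */
        return PyUnicode_FromString("null");
    }
    if (obj == Py_True) {
        return PyUnicode_FromString("true");
    }
    if (obj == Py_False) {
        return PyUnicode_FromString("false");
    }
    PyErr_SetString(PyExc_ValueError, "not a const");
    return NULL;
}

/* CPython's _json can instead do:
 *     return Py_NewRef(&_Py_ID(null));
 * a pointer lookup plus an incref, with no allocation or decoding. */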

It’s possible to use the private API (simply define Py_BUILD_CORE, as you’ve already done, and then switch the internal headers to something that works for extensions, e.g. pycore_object.h in CPython would be internal/pycore_object.h in an extension), but there are some downsides to using it. In this case, I wouldn’t try to deal with _Py_ID, but instead just make a global (or better yet, use module state to support subinterpreters) that contains the cached names, and return them.
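A minimal sketch of that suggestion, using module state; the names here (module_state, jsonc_exec, encoded_const) are illustrative, not taken from the fork:

#include <Python.h>

/* Per-module cache for the three JSON constant strings. */
typedef struct {
    PyObject *str_null, *str_true, *str_false;
} module_state;

/* Runs once at import: build (and intern) each constant a single time. */
static int
jsonc_exec(PyObject *mod)
{
    module_state *st = PyModule_GetState(mod);
    st->str_null  = PyUnicode_InternFromString("null");
    st->str_true  = PyUnicode_InternFromString("true");
    st->str_false = PyUnicode_InternFromString("false");
    if (st->str_null == NULL || st->str_true == NULL || st->str_false == NULL) {
        return -1;
    }
    return 0;
}

/* Hot path: no string construction, just a pointer read and an incref. */
static PyObject *
encoded_const(module_state *st, PyObject *obj)
{
    if (obj == Py_None)  return Py_NewRef(st->str_null);
    if (obj == Py_True)  return Py_NewRef(st->str_true);
    if (obj == Py_False) return Py_NewRef(st->str_false);
    PyErr_SetString(PyExc_ValueError, "not a JSON constant");
    return NULL;
}

For this sketch to work, the PyModuleDef needs m_size = sizeof(module_state) and the exec function registered in a Py_mod_exec slot; the cached strings should also be visited and cleared in m_traverse/m_clear/m_free so the module cleans up properly.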
