Profiling CPython with "perf"


(Neil Schemenauer) #1

This is a summary into some investigation into this topic raised by Raymond on python-dev. I’m using Discourse so I can embed some screenshots. Benchmarks on modern machines are hard, as explained by Victor. He has put a lot of work into pyperformance to try to get useful measurements.

I used Linux “perf” to do some profiling while running Raymond’s benchmarks (and some variations of them). It is interesting to see the annotated assembly code as it gives insight into where time is spent during execution.

First, I think the most likely change to have impacted performance is ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465:
Factor out a private, per-interpreter _Py_AddPendingCall(). Some git bisection work by others has also found that.

Looking at the change, I think it is possible that changes to machine code for the ceval main loop are the cause of benchmark changes. Maybe the change to the DISPATCH() macro?

@@ -687,7 +722,7 @@ _PyEval_EvalFrameDefault(PyFrameObject *f, int throwflag)
 
 #define DISPATCH() \
     { \
-        if (!_Py_atomic_load_relaxed(&_PyRuntime.ceval.eval_breaker)) { \
+        if (!_Py_atomic_load_relaxed(&tstate->interp->ceval.eval_breaker)) { \
		     FAST_DISPATCH(); \
	 } \
	 continue; \

To exercise the DISPATCH() code path, I changed var_access_benchmark.py to run only the following function (BINARY_ADD ends with DISPATCH()):

def add_local(trials=trials, a=1, b=2):
    for t in trials:
	a + b; a + b; a + b; a + b; a + b; a + b
	a + b; a + b; a + b; a + b; a + b; a + b
	a + b; a + b; a + b; a + b; a + b; a + b
	a + b; a + b; a + b; a + b; a + b; a + b
	a + b; a + b; a + b; a + b; a + b; a + b
	a + b; a + b; a + b; a + b; a + b; a + b

To use “perf”, you should compile with special options. I used gcc-8 and changed OPT to:

-DNDEBUG -g  -O3 -Wall -fno-omit-frame-pointer -ggdb

You need the frame pointer so that perf can decode the stack frames. The -ggdb outputs more debugging symbol information (I think). To run with perf:

perf record  -F999 --call-graph lbr ./python bench_local_add.py

You want the frequency (999 in the example) to be high enough that you collect enough profile samples but not too high that the profiling impacts performance too much. It helps to have a long running benchmark so that -F can be low while still getting a lot of samples.

To view the call graph:

perf report -g

From that we see that essentially all the time is spend in _PyEval_EvalFrameDefault or functions it calls. 58% of the time is spent inside _PyEval_EvalFrameDefault itself. That is a reason why tiny changes to the code in that function can have big performance impacts. The function is huge and complicated and a small code change can cause the compiler to generate different code with fairly big performance differences.

You can bring up annotated functions from within “perf report”. Or, you can run “perf annotate”, e.g.

perf annotate -M intel _PyEval_EvalFrameDefault

Pressing ‘H’ will highlight the hottest instruction in the function. Hitting ‘TAB’ will go to the next hottest. The % values are based on 100% (i.e. they are fraction of time spend in function, not of total execution time). The hottest instruction for the add_local() benchmark is the INCREF of the variable pushed on the stack after the local variable lookup.

Next hottest is the DECREF of the argument of BINARY_ADD

image

I’m not going to show the details but I believe the DISPATCH() change has no significant impact. After looking at the annotated assembly code for bytecode instructions using DISPATCH(), it looks like the machine code is the same. So the source code change seems to have no impact on the machine code generated.

One observation: reference counting is quite expensive, presumably due to memory latency and branch prediction. Perhaps we could dust off the idea of changing ceval to use a register machine rather than a stack machine. Couldn’t the local variables become registers? I think that would eliminate the INCREF/DECREF hot instructions shown above.


Use register-based VM rather than stack-based
(Antoine Pitrou) #2

In my experience, the line-by-line annotation isn’t reliable in “perf”. It may depend on your processor. I have a AMD Ryzen CPU.

If you want to check whether the DISPATCH change is responsible for some slowdown, the easiest way is to undo it, i.e. manually hoist the &tstate->interp->ceval.eval_breaker in a local variable at the beginning of _PyEval_EvalFrameDefault (since this is a pointer value that won’t change).


(Brett Cannon) split this topic #3

4 posts were split to a new topic: Creating a “python-dev” equivalent category