Hi all,
I’m trying to use `perf stat` hardware counters (e.g. `instructions`, `L1-icache-misses`) together with pyperformance to understand micro-architectural effects (e.g. code layout / I-cache behavior). I noticed some experiments use commands like:
```
perf stat -e instructions,L1-icache-misses -- python -m pyperformance run
```
However, my understanding is that pyperformance runs benchmarks via pyperf, which (by design) performs a calibration run and uses multiple worker processes with warmups before collecting measured values. That means wrapping the whole `pyperformance run` command in `perf stat` will likely count:
- the calibration run (“Run 1”)
- process start-up / orchestration
- per-worker warmups
- result aggregation / JSON writing
- etc.
So the counters may reflect a mix of harness + benchmark code, and this could dilute the signal (especially for shorter benchmarks) or make it harder to interpret ratios (e.g. I-cache misses per instruction).
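To make the dilution concern concrete, here is a toy calculation (all counts are invented for illustration): cold harness/warmup code with a high I-cache miss rate can dominate the aggregate ratio of a short, I-cache-friendly benchmark.

```python
# Hypothetical counts: harness phases (calibration, warmups, orchestration)
# vs. the measured region of a short benchmark. All numbers are made up.
harness = {"instructions": 8e9, "L1-icache-misses": 40e6}   # cold code paths
measured = {"instructions": 2e9, "L1-icache-misses": 2e6}   # hot steady-state loop

def mpki(misses, instructions):
    """Misses per 1000 instructions."""
    return misses / instructions * 1000

# What `perf stat` around the whole process would report:
total_mpki = mpki(harness["L1-icache-misses"] + measured["L1-icache-misses"],
                  harness["instructions"] + measured["instructions"])
# What we actually want:
bench_mpki = mpki(measured["L1-icache-misses"], measured["instructions"])

print(f"whole-process MPKI:   {total_mpki:.2f}")  # 4.20
print(f"measured-region MPKI: {bench_mpki:.2f}")  # 1.00
```

With these (made-up) numbers the whole-process ratio is 4x the benchmark’s own ratio, so conclusions about code layout drawn from the aggregate could be misleading.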
Questions
- Is it considered acceptable/expected to use `perf stat ... python -m pyperformance run` as-is for counter-based analysis? Any known pitfalls?
- If we want counters only for the measured region (excluding calibration/warmups/harness), what is the recommended approach?
  - pyperf has `--hook perf_record`, which enables perf only while the benchmark is running (great for profiling), but is there an equivalent/recommended workflow for `perf stat` counters?
  - Should we run individual benchmark scripts directly (bypassing the runner) for microarch counter collection?
- If there is no standard solution today: would it make sense to add a `perf_stat`-style hook to pyperf (analogous to `perf_record`) that enables/disables counting only around the measured region? (e.g. using `perf stat --control` enable/disable, or in-process `perf_event_open` counters)
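For concreteness, here is a minimal sketch of what the benchmark-side half of such a hook could look like, assuming a perf new enough to support `perf stat --control` (the FIFO paths, event list, and script name below are placeholders, not an existing pyperf API):

```python
import os
from contextlib import contextmanager

@contextmanager
def perf_counting(ctl_path="perf.ctl", ack_path="perf.ack"):
    """Enable perf counters only around the body of this `with` block.

    Assumes the process was launched under perf with counting initially
    disabled, e.g.:

        mkfifo perf.ctl perf.ack
        perf stat -e instructions,L1-icache-misses --delay=-1 \
            --control fifo:perf.ctl,perf.ack -- python my_bench.py

    (--delay=-1 starts with events disabled; writing "enable"/"disable"
    to the control FIFO toggles counting, and perf replies "ack\n" on
    the ack FIFO.)
    """
    ctl = os.open(ctl_path, os.O_WRONLY)
    ack = os.open(ack_path, os.O_RDONLY)
    try:
        os.write(ctl, b"enable\n")
        os.read(ack, 5)           # wait for perf's "ack\n"
        yield                     # measured region runs here
    finally:
        os.write(ctl, b"disable\n")
        os.read(ack, 5)
        os.close(ctl)
        os.close(ack)
```

A hook integrated into pyperf could wrap just the timed inner loop in such a context manager, so the reported counters cover only the measured region.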
What I’m trying to achieve
- Per-benchmark counters such as: `instructions`, `cycles`, `branch-misses`, `L1-icache-load-misses`, `L1-dcache-load-misses`, TLB misses
- Derived metrics like miss rates / MPKI (misses per 1000 instructions)
- Preferably measured only during the benchmark’s steady-state measured loop, not including harness phases.
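For the derived-metrics part, the post-processing is straightforward once per-benchmark counts exist. A sketch that parses `perf stat -x,` CSV output — I’m assuming the usual field order (counter value first, event name third), which matches recent perf but is worth verifying against your version:

```python
def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: count}.

    Assumes field 0 is the counter value and field 2 is the event name;
    skips comment lines, blanks, and "<not counted>" entries.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].strip():
            continue
        try:
            counts[fields[2]] = int(fields[0])
        except ValueError:
            continue  # e.g. "<not counted>" or "<not supported>"
    return counts

def derived_metrics(counts):
    """Compute MPKI (misses per 1000 instructions) for each non-instruction event."""
    insn = counts.get("instructions")
    if not insn:
        return {}
    return {f"{ev}_mpki": val / insn * 1000
            for ev, val in counts.items() if ev != "instructions"}
```

For example, feeding in a run with 2e9 instructions and 2e6 `L1-icache-load-misses` yields an MPKI of 1.0 for that event.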
Thanks for any guidance or existing best practices!
Reference
https://github.com/faster-cpython/ideas/issues/224#issuecomment-1022371595