Best practice for collecting perf stat counters for pyperformance benchmarks without including harness overhead?

Hi all,

I’m trying to use perf stat hardware counters (e.g. instructions, L1-icache-misses) together with pyperformance to understand micro-architectural effects (e.g. code layout / I-cache behavior). I noticed some experiments use commands like:

perf stat -e instructions,L1-icache-misses -- python -m pyperformance run

However, my understanding is that pyperformance runs benchmarks via pyperf, which (by design) performs a calibration run and uses multiple worker processes with warmups before collecting measured values. That means perf stat around the whole pyperformance run command will likely count:

  • calibration run (“Run 1”)

  • process start-up / orchestration

  • per-worker warmups

  • result aggregation / JSON writing

  • etc.

So the counters reflect a mix of harness and benchmark code, which can dilute the signal (especially for shorter benchmarks) and makes ratios such as I-cache misses per instruction harder to interpret.
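To make the dilution concrete, here is a quick arithmetic sketch with invented counter values (harness code is typically cold and miss-heavy relative to a warmed-up steady-state loop):

```python
# All counter values below are invented, purely for illustration.
bench_instructions = 10_000_000_000   # steady-state measured loop
bench_icache_misses = 5_000_000       # true rate: 0.5 MPKI

harness_instructions = 4_000_000_000  # calibration, warmups, orchestration
harness_icache_misses = 12_000_000    # cold code paths miss far more often

true_mpki = bench_icache_misses / bench_instructions * 1000
observed_mpki = ((bench_icache_misses + harness_icache_misses)
                 / (bench_instructions + harness_instructions) * 1000)

print(f"true MPKI:     {true_mpki:.2f}")      # 0.50
print(f"observed MPKI: {observed_mpki:.2f}")  # 1.21 -- more than 2x off
```

With these (made-up) numbers the whole-process measurement reports more than double the benchmark's actual I-cache MPKI.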

Questions

  1. Is it considered acceptable/expected to use perf stat ... python -m pyperformance run as-is for counter-based analysis? Any known pitfalls?

  2. If we want counters only for the measured region (excluding calibration/warmups/harness), what is the recommended approach?

    • pyperf has --hook perf_record which enables perf only while the benchmark is running (great for profiling), but is there an equivalent/recommended workflow for perf stat counters?

    • Should we run individual benchmark scripts directly (bypassing the runner) for microarch counter collection?

  3. If there is no standard solution today: would it make sense to add a perf_stat-style hook to pyperf (analogous to perf_record) that enables/disables counting only around the measured region? (e.g. using perf stat --control enable/disable, or in-process perf_event_open counters)
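On question 3, here is a minimal sketch of what the Python side of such a hook could look like, assuming perf stat is launched with --control fd:ctl,ack and --delay=-1 (start with events disabled). The "enable"/"disable" command strings and the "ack\n" reply are part of perf's documented control protocol; the function names and the launch command in the comment are hypothetical:

```python
import os

def perf_control(ctl_fd, ack_fd, command):
    """Send 'enable' or 'disable' to a perf process started with
    --control fd:ctl,ack and wait for its 'ack\\n' reply."""
    os.write(ctl_fd, command.encode() + b"\n")
    reply = os.read(ack_fd, 4)  # perf answers "ack\n" once the command applied
    assert reply == b"ack\n"

def measured_region(ctl_fd, ack_fd, run_benchmark):
    """Count events only while run_benchmark() executes."""
    perf_control(ctl_fd, ack_fd, "enable")
    try:
        return run_benchmark()
    finally:
        perf_control(ctl_fd, ack_fd, "disable")

# Corresponding launch side (shell), roughly:
#   mkfifo ctl.fifo ack.fifo
#   perf stat -e instructions,L1-icache-misses --delay=-1 \
#       --control fd:${ctl_fd},${ack_fd} -- python bench_worker.py
```

An in-process perf_event_open alternative would avoid the FIFO plumbing but is Linux-specific and needs per-event ioctl enable/disable around the same region.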

What I’m trying to achieve

  • Per-benchmark counters such as:

    • instructions, cycles, branch-misses, L1-icache-load-misses, L1-dcache-load-misses, dTLB-/iTLB-load-misses

  • Derived metrics like miss rates / MPKI (misses per 1000 instructions)

  • Preferably collected only during the benchmark’s steady-state inner loop, excluding harness phases.
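For the derived metrics, a small post-processing step is probably all that's needed once per-benchmark counters exist. A sketch assuming perf's CSV mode (perf stat -x,), where the leading fields of each row are value, unit, and event name; the sample numbers are invented:

```python
import csv
import io

# Invented sample of `perf stat -x,` output; leading fields are
# value, unit, event name.
sample = """\
10400000000,,instructions,1200000000,100.00,,
5200000,,L1-icache-load-misses,1200000000,100.00,,
"""

counters = {}
for row in csv.reader(io.StringIO(sample)):
    value, _unit, event = row[0], row[1], row[2]
    counters[event] = int(value)

icache_mpki = counters["L1-icache-load-misses"] / counters["instructions"] * 1000
print(f"L1-icache MPKI: {icache_mpki:.2f}")  # 0.50
```

The same dictionary feeds any other ratio (miss rate, misses per cycle, etc.) without re-running the benchmark.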

Thanks for any guidance or existing best practices!

Reference

https://github.com/faster-cpython/ideas/issues/224#issuecomment-1022371595