Python benchmarking in unstable environments

FWIW, a comprehensive analysis of those factors (in all their configuration/OS/hardware permutations) would be very useful even if it were the only outcome, as it would necessarily produce either a clear enumeration of those factors (or a meaningful subset of them) or a clear conclusion that the space is generally intractable.

Assuming a positive outcome, such an enumeration (even if limited to a single OS/hardware platform) would enable (and direct) subsequent analysis of strategies that could mitigate the high degree of variance in benchmark results from unstable workers.


That said, the above factor analysis would also benefit (in the same way) the more traditional approach of benchmarking on a “stable” worker.

The A/A testing done by @mdroettboom (already mentioned above) indicates a much wider confidence interval than we might expect or hope for:

> On bare metal, a performance improvement needs to be at least 2.5% for it to be an improvement 90% of the time.
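To make that 2.5% figure concrete: one way to arrive at such a threshold from A/A data is to look at how large the relative difference between two runs of the *same* build tends to be. The sketch below is only illustrative; the function name, input format, and timings are hypothetical, not how @mdroettboom's analysis was actually implemented.

```python
def aa_threshold(times_a: list[float], times_b: list[float], quantile: float = 0.90) -> float:
    """Estimate the smallest relative change distinguishable from noise.

    times_a / times_b: per-benchmark mean timings from two runs of an
    *identical* build (hypothetical input format).  Returns the
    `quantile`-th percentile of the absolute relative differences --
    roughly, the change a real improvement must exceed in order to beat
    the noise `quantile` of the time.
    """
    rel_diffs = sorted(abs(a - b) / a for a, b in zip(times_a, times_b))
    idx = min(int(quantile * len(rel_diffs)), len(rel_diffs) - 1)
    return rel_diffs[idx]


# Made-up timings (seconds) for four benchmarks, same build run twice.
run1 = [1.00, 2.00, 0.50, 3.10]
run2 = [1.02, 1.97, 0.51, 3.05]
print(f"noise threshold ~ {aa_threshold(run1, run2):.1%}")
```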

We normally talk about results in terms of whole percentages (for now), so we would want that “2.5%” to be smaller than 1%, and ideally closer to 0.1%. The insight provided by the proposed research would undoubtedly help us get closer to that.

Furthermore, it would probably also help, at least somewhat, with the varying stability we regularly see in results for specific benchmarks (also discussed in faster-cpython/ideas#480).
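On that last point, per-benchmark instability of the kind discussed in faster-cpython/ideas#480 could also be surfaced mechanically, e.g. by flagging benchmarks whose coefficient of variation across repeated runs of the same build exceeds some limit. Again, this is only a hypothetical sketch (the names, input format, and 1% cutoff are all made up):

```python
import statistics


def flag_unstable(results: dict[str, list[float]], cv_limit: float = 0.01) -> list[str]:
    """Return the benchmarks whose timings vary more than `cv_limit`
    (coefficient of variation) across repeated runs of the same build.

    `results` maps benchmark name -> list of mean timings, one per run
    (hypothetical input format).
    """
    unstable = []
    for name, times in results.items():
        cv = statistics.stdev(times) / statistics.mean(times)
        if cv > cv_limit:
            unstable.append(name)
    return unstable


# Made-up example: nbody is steady, regex_dna drifts between runs.
runs = {
    "nbody": [0.1000, 0.1001, 0.1002],
    "regex_dna": [0.210, 0.232, 0.198],
}
print(flag_unstable(runs))  # ['regex_dna']
```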