Python benchmarking in unstable environments

As mentioned in this issue, Python benchmarks are currently executed on a specific machine. The whole issue discusses running the benchmarking suite in the cloud. It would be really nice to run the benchmarks after the build and test steps in Azure Pipelines.

The preferred way of running performance benchmarks is by executing the pyperformance test suite. However, when I execute the benchmarks on my own machine while it is not doing anything resource-heavy, I get completely different results compared to when I execute them while my machine is completely stressed out. My machine is a 2018 13" MacBook Pro with a quad-core i5.

For the whole of next semester I will be working on my master's thesis, and I was thinking about using profiling tools (for instance Scalene) to somehow try to account for these differences.

I would be interested in your insights and remarks on this topic.


Isn’t this to be expected? If the computer is stressed out, you won’t get clean benchmarks.

In the worst case, if the machine’s virtual memory is thrashing badly enough, or if the CPU load is increasing exponentially, the benchmarks may never get a chance to run no matter how long you leave them to run.

Multicore CPUs and multitasking operating systems try to hide the fact that, ultimately, they are sequential machines only capable of doing one operation at a time. (Or at least only a small number of operations at a time.) But if the machine is stressed enough, the abstraction leaks.

What is your master's thesis? It’s not clear whether your thesis is related to “benchmarking on stressed machines” or whether you want to benchmark code relating to your thesis on stressed machines.

Getting reproducible benchmarks takes careful setup.

You cannot expect to be able to run reproducible benchmarks in a virtual machine environment, as is typical in the cloud. The performance will depend on what the other virtual machines you are sharing the hardware with are doing. It’s also common that, over time, the VMs are not running on the same class of hardware.

To avoid caching effects it is often a good idea to reboot the machine and then run the benchmarks a few times to prime the caches before taking readings.
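As a minimal sketch of that warm-up idea using Python’s `timeit` module (the statement and the warm-up/run counts here are arbitrary illustrative choices, not a recommendation):

```python
import statistics
import timeit

def benchmark(stmt, warmups=3, runs=10, number=1000):
    """Time `stmt`, discarding a few warm-up runs that prime the
    caches, then report mean and stdev of the remaining timings."""
    timer = timeit.Timer(stmt)
    for _ in range(warmups):
        timer.timeit(number)  # discarded: primes caches and allocator state
    results = [timer.timeit(number) for _ in range(runs)]
    return statistics.mean(results), statistics.stdev(results)

mean, stdev = benchmark("sorted(range(500))")
print(f"mean {mean:.6f}s, stdev {stdev:.6f}s per 1000 loops")
```

This does not replace a reboot, of course; it only stops the first (cold-cache) runs from polluting the reported numbers.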

For example, the machine that is running the benchmark needs to have no extraneous processes running.

That is what I would expect. You are not controlling the benchmarking environment.

You did not say what OS you are benchmarking on. With Windows and macOS there are always lots of background services running. With Linux you would be able to boot into a benchmarking environment that only starts the minimal setup of services needed to run the benchmarks.

I do performance analysis as part of my day-job that includes a big python application.

Yes, it is, but I would like to account for this somehow, to minimise the deviation from the clean benchmarks.

My master's thesis topic is “Benchmarking in an environment of changing/stressed resources”.

The War Owl was asked “How do I deal with AWPs without utility?” The short answer is: Don’t deal with AWPs without utility.

You’re trying to figure out how to benchmark when resources are in contention. The short answer is: Don’t benchmark when resources are in contention.

I’m sure there are ways to get some sort of useful data, but fundamentally, you’re going to be fighting against that. So the obvious follow-up question is: WHY are you trying to benchmark on a stressed-out computer? What do you learn from it that you can’t learn from a quiet computer?

And I’m extremely curious as to what the answers to those questions are.

macOS Ventura.

But my naïve assumption would be that, once you move to the cloud, using a full-fledged version of Linux versus a minimal Linux setup would not make a huge difference. I will try to look into this a bit more in the thesis.
The pyperformance benchmark suite reports the mean time of executing the benchmarks and the standard deviation of the runtimes. To my understanding, pyperf tries to spread the runs out over different cores.
But altogether, the value returned by the benchmark is in seconds. I am considering looking into something other than seconds, for instance the number of instructions executed. Of course, not all instructions take the same time to execute these days, so I would somehow try to account for this. I would also have to exclude some instructions from the count, because the number of sections the program is split into will differ between a stressed and a non-stressed machine.
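As a toy sketch of what an instruction-level metric could look like in pure Python, counting executed bytecode instructions via the `f_trace_opcodes` tracing hook (Python 3.7+). This is only an illustration of the idea, not what pyperformance measures, and `work` is a made-up example workload:

```python
import sys

def count_opcodes(func, *args):
    """Count bytecode instructions executed by func(*args).

    Unlike wall-clock time, the count is deterministic, so it is
    unaffected by other load on the machine. Work done inside
    C-level builtins is not counted, though.
    """
    count = 0

    def tracer(frame, event, arg):
        nonlocal count
        frame.f_trace_opcodes = True  # request per-opcode trace events
        if event == "opcode":
            count += 1
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return count

def work(n):  # made-up workload for illustration
    total = 0
    for i in range(n):
        total += i
    return total

# Identical runs give identical counts, regardless of machine load.
assert count_opcodes(work, 1000) == count_opcodes(work, 1000)
```

Tracing this way is far too slow for real benchmarking; tools like Cachegrind or hardware performance counters get comparable numbers much more cheaply.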

This may be of interest: time — Time access and conversions — Python 3.11.1 documentation

But that doesn’t solve all the problems; if it did, performance testing would be a solved problem itself.
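Of the clocks in that module, `time.process_time()` is worth a look for this: it counts only CPU time consumed by the current process, so competing processes inflate it less than they inflate wall-clock time. A small sketch (the workload is an arbitrary example):

```python
import time

def timed(func, *args):
    """Measure one call with both wall-clock and process-CPU clocks.

    process_time() excludes time spent while other processes hold
    the CPU, so on a loaded machine the wall-clock figure can be
    much larger than the CPU figure. It is not a cure-all: cache
    pressure from neighbours still slows the CPU time down.
    """
    w0, c0 = time.perf_counter(), time.process_time()
    func(*args)
    w1, c1 = time.perf_counter(), time.process_time()
    return w1 - w0, c1 - c0

wall, cpu = timed(sum, range(10**6))
```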


I am not familiar with the term AWP :smiley: unless it means a sniper rifle.

The main point is not running benchmarks on a stressed-out computer; however, in a shared environment such as the cloud, one must also consider this possibility. The stressed-out computer is only a way to replicate unstable environments.

It depends what you mean by a quiet computer.
If you mean the one machine which is currently used for performance benchmarks, then I cannot access that computer, so I can learn almost nothing from it.
If I compare my machine to your machine, we would both get different results, because we do not have the same baseline, right? If the benchmarks were run on shared hardware in the cloud, then all developers would be able to run them on comparable machines.

Especially now, when the performance of CPython is being discussed a lot, a way for developers to compare performance against the same baseline would, in my opinion, be helpful.

I don’t think he’s trying to learn something that he cannot learn from a quiet computer. He’s trying to learn something despite not having access to a quiet computer.


There is a blog post here that describes using Cachegrind, which gives some reliable metrics that can be used for benchmarking without depending on the performance of the underlying hardware:


Yes, the War Owl was talking about Counter-Strike gaming, and the difficulty of dealing with an enemy sniper without any sort of help. It’s hard. How do you do it? Don’t.

I simply mean a computer on which little or nothing else is running. In perfect theory, benchmarking would always be done on a perfectly clean computer with nothing else installed, much less running; in practice, that’s not possible, but anything else that’s running has to be accounted for. So a “quiet” computer is just one with an absolute minimum of other things running, where there are plenty of resources (CPU, RAM, whatever you’re testing for) available.

You’ll never get away from the “my machine vs your machine” problem, so yes, there’s a lot to be said for running benchmarks on something consistent. But let’s suppose that someone has a proposed change to the CPython source code, and wants to benchmark it. (This happens often and is hardly hypothetical.) Should this person:

  1. Be given access to a single, communal computer, on which to run arbitrary and probably-buggy C code?
  2. Be given instructions on how to rent a server of absolutely precisely the correct specification (eg from Amazon EC2 or Azure or Digital Ocean)?
  3. Be given instructions on how to run the benchmark on their own computer at home?

None of these options is perfect. I think it’s pretty obvious that the communal option, while great for benchmark consistency, is horrific for security. The second is, on the face of it, decent; but you can’t get a perfectly clean virtual server, since details of the hardware it’s running on will affect things, so it still won’t eliminate all the problems. So ultimately, no matter what happens, variance cannot be eliminated, and must be managed.

I don’t think it is ever possible for developers to access the same baseline. Every performance test has to establish its own baseline. While it would be lovely to be able to say “CPython 3.11.1 scores 142857 on the Unified Performance Benchmark, and my proposed change increases that to 142858, which is a larger number and therefore better”, I don’t think it’s ever going to be that simple.

Mitigating the problem in various ways will certainly be interesting, though.

Thanks. Yep, that makes sense. I think that’s an admirable goal, but ultimately, the solutions will only ever be partial.


Do you mean that you are benchmarking in an environment that is busy on purpose? How fast can we do operation X on a busy production machine, for example?

Do you mean you cannot get the resources to benchmark in a controlled environment?

The design of the OS has a huge impact on the amount of work that an application can get done on the same hardware.

In the case of Linux the choice of kernel features that are compiled into the code and how they are configured can make a big difference.
I think that for Windows, the Server vs. Desktop versions make a difference to the way the scheduler works.

My intuition is that you cannot determine the factors that are taking resources away from the benchmarking processes.

For example, you would need to account for CPU architecture:
not just ARM vs. Intel, but the fact that each model of CPU has different performance.

As you run more code on a machine, the way that CPU caching helps code run faster is impacted.

As you run more processes and threads, the cost of context switching will impact the benchmark results. macOS, for example, is considered to have very expensive context switching.

The compiler used and its options will impact what performance you get on the same CPU model. When you have runs with different compile options on different CPU models, then it’s a big ask to be able to figure out the impact.

Ah, I see. I suggest you wait for Samuel Branisa to finish his master’s thesis, and read that. wink

Seriously, I think this is something you will need to discover for yourself. Have you done a literature search to see if anyone else has researched this? I imagine you must have.

One thing which I sometimes do is try to account for machine differences by calculating benchmarks relative to pass. Benchmark a bare pass statement:

python -m timeit "pass"

to get an idea of how fast the machine is capable of running as a baseline, then time the thing I really care about:

python -m timeit "do_something()"

and report the benchmark timing as relative to the baseline.

In theory that relative benchmark should be roughly constant no matter the performance of the machine or the load on the machine, provided the load is the same when the two benchmarks run.
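A minimal version of that idea using the `timeit` module directly (the target statement and loop counts are arbitrary illustrative choices):

```python
import timeit

LOOPS = 200_000

# Baseline: the cost of a bare pass on this machine, taking the
# minimum over several repeats to reduce noise.
baseline = min(timeit.repeat("pass", repeat=5, number=LOOPS))

# The code we actually care about (an arbitrary example workload).
target = min(timeit.repeat("sum(range(100))", repeat=5, number=LOOPS))

# Report the relative cost; ideally raw machine speed mostly cancels out.
ratio = target / baseline
print(f"target is {ratio:.1f}x the cost of a bare pass")
```

Taking the minimum of several repeats, rather than the mean, is one common way to discount interference from other processes; either way, the two measurements should run under the same load for the ratio to mean anything.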

Maybe the timeit of pass fits in the CPU cache but your target code does not.
Maybe some other hardware has different cache sizes and the code fits in both cases.
That will lead to inaccurate ratios.

You may find this book interesting:

Systems Performance by Brendan Gregg.

It has sections on the pitfalls of performance measurement that apply generally.
The rest of the book is Linux based.


Indeed, that is the difference between “in theory” and “in practice”.

Just catching up on this after being out on holidays.

I’m pretty excited @Cupprum will be exploring this. I think the impact of continuous integration systems for testing on open source have had an enormous impact on software quality – it would be great to move benchmarking into a similar place, but currently there’s an order of magnitude more “administration fiddliness” involved. I agree with others here that this is a challenging problem, and if the goal is to just get good data the easier path today is to have a dedicated, stripped down machine. But understanding exactly how shared infrastructure behaves for benchmarking, what kinds of workloads do better / worse, and then exploring what remediations are possible would really move things forward. Maybe there won’t be any reliable remediations possible, but negative results–understanding exactly why something isn’t possible–can still be really valuable in science.

I’m looking forward to seeing your findings.


FWIW, a comprehensive analysis of those factors (in all their configuration/OS/hardware permutations) would be very useful, if that is the only outcome, as it would necessarily result in either a clear enumeration of those factors (or a meaningful subset) and/or a clear conclusion that the space is generally intractable.

Assuming a positive outcome, such an enumeration (even if limited to a single OS/hardware platform) would enable (and direct) subsequent analysis of strategies that could mitigate the high degree of variance in benchmark results from unstable workers.

That said, the above factor analysis would also benefit (in the same way) the more traditional approach of benchmarking on a “stable” worker.

The A/A testing done by @mdroettboom (already mentioned above) indicates a much wider confidence interval than we might expect or hope for:

On bare metal, a performance improvement needs to be at least 2.5% for it to be an improvement 90% of the time.

We normally talk about results in terms of whole percentages (for now), so we would want that “2.5%” to be smaller than 1% and ideally closer to 0.1%. The insight provided by the proposed research would undoubtedly help us get nearer to that.

Furthermore, it would probably also help at least somewhat with the variability in stability we regularly see in results for specific benchmarks (also discussed in faster-cpython/ideas#480).

This is a great (and timely) topic.

I already mentioned in a reply how useful it would be to at least have available a comprehensive analysis of what factors contribute to the instability of benchmark results. This applies to dedicated, “stable” workers just as much as to “unstable” ones.

The topic is timely because there is already some effort going into getting useful benchmark results from cloud workers (at least with the “faster-cpython” team). Getting stable (enough) results from the cloud would be a game changer because it would greatly open up collaboration, reproducibility, and the availability of results. For example, we could run the pyperformance benchmarks nightly or even for every CPython PR.

FWIW, running benchmarks in CI (i.e. the cloud) would be a huge improvement, assuming stable-ish results, and would benefit most projects (not just CPython). It gets substantially easier if someone were to add a GitHub action or Azure Pipelines task to the respective marketplace. With something like that, it would effectively eliminate what I would argue is the main obstacle maintainers face when considering adding benchmarks to their project. It’s likely we would see a huge burst of new project-specific benchmarks (especially since pyperformance supports running custom suites as of the last year+). That’s an exciting prospect, which is potentially enabled by your research.

One notable point is that cloud workers are probably less “unstable” than folks think, with instability focused in fewer factors (and thus result variability will be more manageable). Of course, that’s exactly the sort of thing that needs to be researched. :smile: Do consider that the A/A testing @mdroettboom did indicated effectively double the variance relative to a “stable” worker, which is a lot smaller than I anticipated.

One final thought: keep in mind the difference between feature-oriented (“micro”) and workload-oriented (“macro”) benchmarks. (See the faster-cpython wiki.) I expect that workload-oriented benchmark results will be less affected by “unstable” workers, though that hypothesis would need to be proven.


I’m aware of multiple teams at large enterprise companies using instruction counts in their profiling and benchmarking efforts, instead of time intervals and sample counts. The approach also lends itself to accurate results while capturing a full representation of execution (whereas sampling is lossier, even if faster). I expect there will be a good selection of discussion of this approach in the literature.
