Where a, b, and c are typically NumPy arrays of about 100k floats each. Because this is the main place we spend time, we experimented a bit and split the code into:
# Version B
tmp = a / b
x = tmp * c
Unexpectedly, this gave a 2.6x speedup (over 2k runs) of B over A, and only for array sizes between 50k and 128k. The speedup only holds for Python up to and including 3.12; on 3.13 the performance of B degrades to that of A, and not the other way around.
I am trying to understand two things:
1. Why does the performance improve? As I understand it, it should essentially be the same operation, with the exception that NumPy has more information when the computation is chained. I suspect it has something to do with how NumPy handles the cache?
2. What changed from 3.12 to 3.13 so that the improvement goes away?
Here is some code to reproduce the experiment (arr1, arr2 and arr3 are fixed in this experiment; I get similar results if I change them during the for loop):
import timeit
import numpy as np
n = 100_000
repeats = 1000
arr1, arr2, arr3 = np.random.rand(n), np.random.rand(n), np.random.rand(n)
def A():
    for _ in range(repeats):
        x = arr1 / arr2 * arr3

def B():
    for _ in range(repeats):
        inter = arr3 / arr2
        x = arr1 * inter
t_one = timeit.Timer(A).timeit(number=10)
t_two = timeit.Timer(B).timeit(number=10)
print("\nTIMEIT")
print(f"A: {t_one / repeats:.4g} s")
print(f"B: {t_two / repeats:.4g} s")
# TIMEIT
# A: 0.007657 s
# B: 0.001167 s
My cache sizes are: L1 1.1MB, L2 9.0MB, L3 18.0MB.
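(For scale: each array is 100_000 float64 values, i.e. about 0.8 MB, so the three inputs together come to roughly 2.4 MB, plus about 0.8 MB per temporary.)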
Your text version does (a / b) * c both ways. Your code version does (arr1 / arr2) * arr3 and arr1 * (arr3 / arr2), which are not quite equivalent. I doubt it matters, but I’d still do the same both ways, so that the variable use is the only difference.
On some platforms NumPy may do a bit of magic and execute the first as:
x = arr1 / arr2
x *= arr3
but that's all the magic NumPy does. That seems like a rather large difference, but cache effects are weird, so I'll just leave you with that bit of (likely unnecessary) information.
I would caution against reading too much into the result of timeit or of running this operation in a loop the way you do in the benchmark. You don't really want to compute a / b * c for the exact same arrays a, b and c 10,000 times in a loop. That means the benchmark is not realistic, and if cache effects are significant then you should not assume that any differences would carry over to a real situation. In a real situation you must either be modifying the arrays a, b and c or using different arrays each time, and either of those can be expected to affect the sort of timing that you have measured.
Also, you should not time B immediately after A, especially when they both access the same memory. Either alternate the measurements (A, B, A, B, …) or time them in separate processes.
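For example, a rough way to alternate the measurements, reusing A and B from your code (the loop count of 20 is arbitrary):

import timeit

# Alternate single timed calls of A and B instead of timing one after the other.
t_a = t_b = 0.0
for _ in range(20):
    t_a += timeit.timeit(A, number=1)
    t_b += timeit.timeit(B, number=1)
print(f"A: {t_a / 20:.4g} s per call, B: {t_b / 20:.4g} s per call")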
Ultimately the thing to do here is to measure the effect of the different versions in the context of a real program with a real workload rather than a microbenchmark. Even then, bear in mind that it could vary by hardware, operating system, how NumPy was built, Python version/implementation, etc., so it is very much possible that you see differences on one machine that don't carry over to another.
If you do really need to optimise this one operation as hard as possible then there are many possible things that might speed it up but it really depends on the context. You could try numexpr or something that uses GPU. If you often use the same array b you could precompute b1 = 1 / b to use multiplication instead of division. It might be that you can get away with 32-bit floats. It might make sense to have globally allocated arrays that get reused. The operation is possibly too short to benefit from using threads internally but if you are doing it many times with different arrays then you could distribute the operations over multiple threads at a higher level.
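For illustration only, the reciprocal idea combined with a reused output buffer could look roughly like this; div_mul and the out argument are just names for this sketch, not anything from your code:

import numpy as np

def div_mul(a, b1, c, out=None):
    # Caller precomputes b1 = 1.0 / b once if b rarely changes.
    out = np.multiply(a, b1, out=out)  # a / b expressed as a multiplication
    out *= c                           # multiply in place, no extra temporary
    return out

# Usage: reuse the same output buffer across calls to avoid reallocations.
a, b, c = np.random.rand(100_000), np.random.rand(100_000), np.random.rand(100_000)
b1 = 1.0 / b
buf = np.empty_like(a)
x = div_mul(a, b1, c, out=buf)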
Note that doing the operations in place is not equivalent, because it mutates a. That might be fine in your application, but these two functions are not interchangeable:
def f(a, b, c):
    return a / b * c

def g(a, b, c):
    x = a
    x /= b
    x *= c
    return x
I have seen many people get into trouble with this subtle difference, so a function like g is dangerous if it is not obvious to the caller what it is doing.
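To make the difference concrete, here is a tiny example with made-up values:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 2.0])
c = np.array([3.0, 3.0])

f(a, b, c)
print(a)  # [1. 2.]   -- f leaves a untouched
g(a, b, c)
print(a)  # [1.5 3. ] -- g has overwritten a in place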
Thank you for the detailed reply. You’re right, the microbenchmark is not very realistic.
We measured the effect on real data first (it was faster, about 1.8x) and then tried to come up with a microbenchmark that replicated the effect (in hindsight a slightly more detailed benchmark would have been better). In production the code is something like:
for group in groups:
    a = mapper_a[f1(group)]
    b = mapper_b[f2(group)]
    c = mapper_c[f3(group)]
    x = a / b * c
where we have many groups, but they map to only a few distinct a, b and c arrays (i.e. f1, f2 and f3 can each take only a few values). So it is possible that we reuse a, b and c in multiple iterations. We also considered precomputing the combinations of a, b and c, which comes with its own advantages and downsides.
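For what it's worth, the precompute idea we considered is roughly the following (combo_cache and get_x are made-up names for this sketch):

# Cache a / b * c per (f1, f2, f3) combination, since only a few combinations occur.
combo_cache = {}

def get_x(group):
    key = (f1(group), f2(group), f3(group))
    if key not in combo_cache:
        a = mapper_a[key[0]]
        b = mapper_b[key[1]]
        c = mapper_c[key[2]]
        combo_cache[key] = a / b * c
    return combo_cache[key]

# Inside the loop: x = get_x(group) instead of recomputing a / b * c each time.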
It is not so much about further optimising this specific operation; I was more interested in why we see the effect.
When I alternate the measurements, or when I generate new arrays for every calculation, the measurements are about equal, with some variation. So you are right that the memory access does have some influence there.
So is the multiplication in (a / b) * c, under some conditions. If NumPy can correctly figure out that an array has only one reference and is going to be deleted right after an operation, it instead modifies it in place and avoids the deallocation (NumPy's so-called temporary elision). But exactly when this optimization can be applied is very unreliable and you should never rely on it. As you can see, it can easily change between Python versions, in this case probably because of changes in the bytecode interpreter that mean the optimization is no longer as applicable.
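If you want the in-place behaviour deterministically rather than hoping for temporary elision, you can spell it out with out= (reusing arr1, arr2, arr3 and np from the question; buf is a name I made up):

buf = np.empty_like(arr1)       # reusable output buffer

np.divide(arr1, arr2, out=buf)  # arr1 / arr2 written directly into buf
buf *= arr3                     # multiply in place, no temporary array
x = buf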