I have many hundreds of Python functions, and I would like to optimize the following code so that there are as few intermediate variable assignments as possible.
def convert_temps1(celsius_list):
    # Two separate steps with intermediate lists
    doubled = [c * 9/5 for c in celsius_list]
    fahrenheit = [d + 32 for d in doubled]
    return fahrenheit
def convert_temps2(celsius_list):
    # Fused into one operation
    return [(c * 9/5 + 32) for c in celsius_list]
I’m aware of the py_compile API, which allows setting an optimization level and can perform basic optimizations like dead-code elimination. However, my optimization needs are more specialized, and I’m trying to understand the different options available.
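For reference, the py_compile route looks roughly like this. It is only a sketch: the module source is a throwaway temp file, and optimize=2 corresponds to running under -OO (asserts and docstrings are dropped), so it does nothing about intermediate stores.

```python
import os
import py_compile
import tempfile

# Write a tiny module to a temp file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("def f(x):\n    assert x > 0\n    return x * 2\n")
    src = f.name

# optimize=2 corresponds to -OO: asserts and docstrings are removed,
# but no structural rewriting of the bytecode happens.
cached = py_compile.compile(src, optimize=2)
print(cached)  # path of the compiled .pyc file
```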
There’s the disassembler API where I can inspect and transform bytecode directly, but there’s also CPython’s frame evaluation hook which operates closer to the VM. What would be the differences between using these two APIs? I notice that PyTorch’s compiler (Dynamo) utilizes the evaluation hook to rewrite the bytecode.
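For concreteness, here is what the disassembler side looks like on the first function (a sketch; the PEP 523 frame evaluation hook is a C-level API set per interpreter, which is why it isn’t shown here). dis exposes the compiled bytecode as a stream of Instruction objects, which is enough for offline inspection and pattern matching:

```python
import dis

def convert_temps1(celsius_list):
    doubled = [c * 9/5 for c in celsius_list]
    fahrenheit = [d + 32 for d in doubled]
    return fahrenheit

# Each Instruction carries the opcode name and its resolved argument,
# e.g. the local variable name for STORE_FAST / LOAD_FAST.
for ins in dis.get_instructions(convert_temps1):
    print(ins.opname, ins.argval)
```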
If you’re attempting to optimize your script, specifically the one shown, you will want to reduce the number of division operations as much as possible, since they tend to have the highest latency in cycles. For example, for the 9/5 division, just use its decimal equivalent, 1.8, since this is a known value (a constant).
Here is a thread that provides a relatively general breakdown of the latency in cycles for each of the most common arithmetic operations.
If you’re using a microcontroller, consult its programming manual, since it usually lists the latency in cycles per arithmetic operation.
If you have large numbers, and they’re being divided by a power of two, and you don’t mind losing a little bit of precision, you can right-shift as an alternative. This might take one or two cycles … reference the related manual.
For example:
    >>> div_by_8 = 23445 >> 3  # since 2**3 == 8, you shift by 3
    >>> div_by_8
    2930
Why are we looking at C information? I don’t see a clear indication for that. The name celsius_list and the use of a list comprehension suggest to me that c is a Python int or float, in which case the C-level speed of the division is rather irrelevant. I measured, and // 8 was indeed ~10% faster than >> 3.
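For anyone who wants to reproduce that measurement, a minimal timeit sketch; absolute numbers (and which variant wins) will vary by CPython version and hardware, so no expected output is claimed here:

```python
import timeit

# Compare a right shift by 3 against floor division by 8 on a Python int.
t_shift = timeit.timeit("x >> 3", setup="x = 23445", number=1_000_000)
t_div = timeit.timeit("x // 8", setup="x = 23445", number=1_000_000)
print(f">> 3: {t_shift:.4f}s   // 8: {t_div:.4f}s")
```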
I am falling back on my PID days, when we actually wrote the PID loop in assembly. We measured the latency as it entered and exited the loop by toggling an I/O pin and monitoring it on an o-scope. Right-shifting was definitely shorter than normal division by many cycles (don’t ask, I don’t remember the exact value, but it was large enough that we collectively employed right-shifts over normal division in C).
I’m trying to collapse intermediate variable assignments (STORE) rather than ALU ops here. Is this kind of optimization called function or assignment inlining?
Here’s another pattern:
def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

def a_2(vals1, vals2):
    return sum([(x - y)**2 for x, y in zip(vals1, vals2)])
Measuring via tracemalloc:
    Peak memory usage of a_1: 66722
    Peak memory usage of a_2: 34229
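For reference, figures like those above can be reproduced with something along these lines (a sketch; the input size here is an arbitrary choice, and exact numbers depend on list length and interpreter version):

```python
import tracemalloc

def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

def a_2(vals1, vals2):
    return sum([(x - y)**2 for x, y in zip(vals1, vals2)])

vals = list(range(1000))

# a_1 keeps two lists alive at its peak; a_2 only one.
tracemalloc.start()
a_1(vals, vals)
_, peak_a1 = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
a_2(vals, vals)
_, peak_a2 = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak memory usage of a_1: {peak_a1}")
print(f"Peak memory usage of a_2: {peak_a2}")
```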
Understood if this is for a microcontroller, where memory can be in the low kilobytes. But if this is for a desktop application, which can have many gigabytes of memory, why? Is this for a microcontroller?
I’m actually working at a lower level - at the Python VM’s bytecode level. Rather than using language features like generator expressions, I’m trying to detect patterns in the bytecode itself and rewrite them into more efficient versions right before they execute.
Think of it more like what PyTorch’s Dynamo or Numba does - actual code transformation at runtime, not just using different Python constructs.
If you observe the bytecode for a_1, you’ll see it has two separate loops, as opposed to a_2, which has one. However, that’s not the pattern I think I want to observe. There’s something else I see here: there is a STORE_FAST followed by LOAD_FAST for the same variable diff. What this tells us is that we’re storing a value (STORE_FAST diff), and the only use of this value is loading it (LOAD_FAST diff) for the next loop. After the second loop, it’s never used again.
Maybe I need to do some kind of dataflow analysis to see how long these variables “live” in the function’s stack?
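A cheap first approximation of that liveness question can be sketched with dis: count stores and loads per local, and flag locals that are stored exactly once and loaded exactly once. This is an illustration, not a real dataflow analysis (it ignores control flow), and the opname/argval handling is deliberately loose so it also covers the fused *_FAST superinstructions newer CPythons emit:

```python
import dis
from collections import Counter

def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

# Count stores and loads per local variable. Fused superinstructions
# (e.g. LOAD_FAST_LOAD_FAST) carry a tuple argval, so opname and argval
# are unpacked in parallel.
stores, loads = Counter(), Counter()
for ins in dis.get_instructions(a_1):
    if "_FAST" not in ins.opname:
        continue
    kinds = [k for k in ins.opname.split("_") if k in ("STORE", "LOAD")]
    names = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
    for kind, name in zip(kinds, names):
        (stores if kind == "STORE" else loads)[name] += 1

# A local stored once and loaded once is a candidate for forwarding the
# value on the stack instead of a STORE_FAST/LOAD_FAST pair.
candidates = [v for v in stores if stores[v] == 1 and loads[v] == 1]
print(candidates)
```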
This is my first reaction when reading OP’s code. Replacing square brackets with round ones might be all they need.
def convert_temps1(celsius_list):
    # Two separate steps, now with lazy generators instead of intermediate lists
    doubled = (c * 9/5 for c in celsius_list)
    fahrenheit = (d + 32 for d in doubled)
    return fahrenheit  # Use list(fahrenheit) if you need a list returned.
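A quick sanity check that the generator version yields the same values once materialized (function body copied from the post above):

```python
def convert_temps1(celsius_list):
    # Two separate steps with lazy generators instead of intermediate lists
    doubled = (c * 9/5 for c in celsius_list)
    fahrenheit = (d + 32 for d in doubled)
    return fahrenheit

# Nothing is computed until the caller consumes the generator.
result = list(convert_temps1([0, 100]))
print(result)
```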