I have many hundreds of Python functions, and I would like to optimize the following code so that there are as few intermediate variable assignments as possible.
def convert_temps1(celsius_list):
    # Two separate steps with intermediate lists
    doubled = [c * 9/5 for c in celsius_list]
    fahrenheit = [d + 32 for d in doubled]
    return fahrenheit
def convert_temps2(celsius_list):
    # Fused into one operation
    return [(c * 9/5 + 32) for c in celsius_list]
I’m aware of the py_compile API, which allows setting an optimization level and can perform basic optimizations like dead-code elimination. However, my optimization needs are more specialized, and I’m trying to understand the different options available.
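For reference, the py_compile route looks roughly like this. It is only a sketch: the module source is a throwaway temp file, and optimize=2 corresponds to running under -OO (asserts and docstrings are dropped), so it does nothing about intermediate stores.

```python
import os
import py_compile
import tempfile

# Write a tiny module to a temp file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("def f(x):\n    assert x > 0\n    return x * 2\n")
    src = f.name

# optimize=2 corresponds to -OO: asserts and docstrings are removed,
# but no structural rewriting of the bytecode happens.
cached = py_compile.compile(src, optimize=2)
print(cached)  # path of the compiled .pyc file
```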
There’s the disassembler API where I can inspect and transform bytecode directly, but there’s also CPython’s frame evaluation hook which operates closer to the VM. What would be the differences between using these two APIs? I notice that PyTorch’s compiler (Dynamo) utilizes the evaluation hook to rewrite the bytecode.
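For concreteness, here is what the disassembler side looks like on the first function (a sketch; the PEP 523 frame evaluation hook is a C-level API set per interpreter, which is why it isn’t shown here). dis exposes the compiled bytecode as a stream of Instruction objects, which is enough for offline inspection and pattern matching:

```python
import dis

def convert_temps1(celsius_list):
    doubled = [c * 9/5 for c in celsius_list]
    fahrenheit = [d + 32 for d in doubled]
    return fahrenheit

# Each Instruction carries the opcode name and its resolved argument,
# e.g. the local variable name for STORE_FAST / LOAD_FAST.
for ins in dis.get_instructions(convert_temps1):
    print(ins.opname, ins.argval)
```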
If you’re attempting to optimize your script, specifically the one shown, you will want to reduce the number of division operations as much as possible, since they tend to have the highest latency in cycles. For example, for the 9/5 division, just use its decimal equivalent, 1.8, since this is a known value (a constant).
Here is a thread that provides a relatively general breakdown of the latency in cycles for each of the most common arithmetic operations.
If you’re using a microcontroller, consult its programming manual, since it usually lists the latency in cycles per arithmetic operation.
If you have large numbers, and they’re being divided by a power of two, and you don’t mind losing a little bit of precision, you can right-shift as an alternative. This might take one or two cycles … reference the related manual.
For example:
    >>> div_by_8 = 23445 >> 3  # since 2**3 == 8, you shift by 3
    >>> div_by_8
    2930
Why are we looking at C information? I don’t see a clear indication for that. The name celsius_list and the use of a list comprehension suggest to me that c is a Python int or float, in which case the C-level speed of the division is rather irrelevant. I measured, and // 8 was indeed ~10% faster than >> 3.
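For anyone who wants to reproduce that measurement, a minimal timeit sketch; absolute numbers (and which variant wins) will vary by CPython version and hardware, so no expected output is claimed here:

```python
import timeit

# Compare a right shift by 3 against floor division by 8 on a Python int.
t_shift = timeit.timeit("x >> 3", setup="x = 23445", number=1_000_000)
t_div = timeit.timeit("x // 8", setup="x = 23445", number=1_000_000)
print(f">> 3: {t_shift:.4f}s   // 8: {t_div:.4f}s")
```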
I am falling back on my PID days, when we actually wrote the PID loop in assembly. We measured the latency as it entered and exited the loop by toggling an I/O pin and monitoring it on an o-scope. Right-shifting was definitely shorter than normal division by many cycles (don’t ask, I don’t remember the exact value, but it was large enough that we collectively employed right-shifts over normal division in C).
I’m trying to collapse intermediate variable assignments (STORE) rather than ALU ops here. Is this kind of optimization called function or assignment inlining?
Here’s another pattern:
def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

def a_2(vals1, vals2):
    return sum([(x - y)**2 for x, y in zip(vals1, vals2)])
Measuring via tracemalloc:
    Peak memory usage of a_1: 66722
    Peak memory usage of a_2: 34229
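For reference, figures like those above can be reproduced with something along these lines (a sketch; the input size here is an arbitrary choice, and exact numbers depend on list length and interpreter version):

```python
import tracemalloc

def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

def a_2(vals1, vals2):
    return sum([(x - y)**2 for x, y in zip(vals1, vals2)])

vals = list(range(1000))

# a_1 keeps two lists alive at its peak; a_2 only one.
tracemalloc.start()
a_1(vals, vals)
_, peak_a1 = tracemalloc.get_traced_memory()
tracemalloc.stop()

tracemalloc.start()
a_2(vals, vals)
_, peak_a2 = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak memory usage of a_1: {peak_a1}")
print(f"Peak memory usage of a_2: {peak_a2}")
```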
Understood if this is for a microcontroller, where memory can be in the low kilobytes. But if this is for a desktop application, which can have many gigabytes of memory, why? Is this for a microcontroller?
I’m actually working at a lower level - at the Python VM’s bytecode level. Rather than using language features like generator expressions, I’m trying to detect patterns in the bytecode itself and rewrite them into more efficient versions right before they execute.
Think of it more like what PyTorch’s Dynamo or Numba does - actual code transformation at runtime, not just using different Python constructs.
If you observe the bytecode for a_1, you’ll see it has two separate loops, as opposed to a_2, which has one. However, that’s not the pattern I think I want to observe. There’s something else I see here: there is a STORE_FAST followed by LOAD_FAST for the same variable diff. What this tells us is that we’re storing a value (STORE_FAST diff), and the only use of this value is loading it (LOAD_FAST diff) for the next loop. After the second loop, it’s never used again.
Maybe I need to do some kind of dataflow analysis to see how long these variables “live” in the function’s stack?
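A cheap first approximation of that liveness question can be sketched with dis: count stores and loads per local, and flag locals that are stored exactly once and loaded exactly once. This is an illustration, not a real dataflow analysis (it ignores control flow), and the opname/argval handling is deliberately loose so it also covers the fused *_FAST superinstructions newer CPythons emit:

```python
import dis
from collections import Counter

def a_1(vals1, vals2):
    diff = [(v1 - v2) for v1, v2 in zip(vals1, vals2)]
    diff_sq = [d**2 for d in diff]
    return sum(diff_sq)

# Count stores and loads per local variable. Fused superinstructions
# (e.g. LOAD_FAST_LOAD_FAST) carry a tuple argval, so opname and argval
# are unpacked in parallel.
stores, loads = Counter(), Counter()
for ins in dis.get_instructions(a_1):
    if "_FAST" not in ins.opname:
        continue
    kinds = [k for k in ins.opname.split("_") if k in ("STORE", "LOAD")]
    names = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
    for kind, name in zip(kinds, names):
        (stores if kind == "STORE" else loads)[name] += 1

# A local stored once and loaded once is a candidate for forwarding the
# value on the stack instead of a STORE_FAST/LOAD_FAST pair.
candidates = [v for v in stores if stores[v] == 1 and loads[v] == 1]
print(candidates)
```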
This is my first reaction when reading OP’s code. Replacing square brackets with round ones might be all they need.
def convert_temps1(celsius_list):
    # Two separate steps, now with lazy generators instead of intermediate lists
    doubled = (c * 9/5 for c in celsius_list)
    fahrenheit = (d + 32 for d in doubled)
    return fahrenheit  # Use list(fahrenheit) if you need a list returned.
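A quick sanity check that the generator version yields the same values once materialized (function body copied from the post above):

```python
def convert_temps1(celsius_list):
    # Two separate steps with lazy generators instead of intermediate lists
    doubled = (c * 9/5 for c in celsius_list)
    fahrenheit = (d + 32 for d in doubled)
    return fahrenheit

# Nothing is computed until the caller consumes the generator.
result = list(convert_temps1([0, 100]))
print(result)
```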