Compiler-Level const-Optimization

The Python interpreter does not check for unused (or write-only) variables, and it introduces unnecessary local variables (temporary values). At least for numbers (integers and floats), it should be possible to improve things at the bytecode level (regarding performance and efficiency).

  1. I want to propose a new pseudo-type “const” (acting like a C #define).
  2. Also, “const” auto-detection at the compiler/interpreter level, so the compiler can generate better bytecode.
  3. Skipping write-once-read-immediately temp values.
# check the interpreter bytecode here: godbolt.org

c1 : int = 40 # unused in this file (but external usage possible)
c2 : const = 41 # new pseudo-type should act like C-define (at least within file-scope)
# focus here on numbers (int, float), and perhaps "constant-strings".

def prod1(num):
    a,b = 10,11 # write once (no modify)
    return num * (a * b) # 10*11=110

def prod2(num):
    a,b = 20,21 # local + unused (still put on stack)
    return num * (20*21)

def prod3(num):
    _,_ = 30,31 # never used (still put on stack)
    return num * (30*31)

def fnkt(num):
    i = num+2
    return i # why not skip in-between value automatically?

#EoF.
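
The same bytecode can also be inspected locally with the dis module instead of godbolt.org; a minimal sketch, assuming the functions above are available in the session:

import dis

# exact opcodes depend on the CPython version, but the stores and loads
# for the write-once locals are visible either way
dis.dis(prod1)
dis.dis(prod2)
dis.dis(prod3)
dis.dis(fnkt)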

Before we could even consider using annotations for this, we would have to break the seal on the idea of using annotations for any kind of optimization. Right now, Python does not care about your type annotations; the only observable runtime effect is to add __annotations__ attributes to some objects, and the annotations exist only for use by third-party tools (and human code reviewers).

Have you actually studied what bytecode is produced by your examples? Can you show what the optimized results should look like instead?

Please also keep in mind that this would be a breaking semantic change. For example, if I have

def prod1(num):
    a, b = 10,11
    return num * (a * b)

then I can subsequently do

>>> type(prod1)(prod1.__code__.replace(co_consts=(None, (3, 4))), {}, '')(1)
12

I assume you did study what the bytecode looks like, since you referred to godbolt - your comments also suggest this. But as you must have seen then, the bytecode shows that type hints play no role in the runtime code or even in code generation. Changing this would make Python a fundamentally different language.

Leaving aside the ‘const’ idea, it seems that what you want is a kind of “(more) optimizing bytecode compiler”? I think that’s an interesting idea… I always assumed that the current bytecode compilation was already (about) as optimized as possible, given the general constraints of the language. It seems that you’re trying to find examples where this is not the case. I don’t know enough to really evaluate this - but it seems to me that just pointing to a few very small examples is not enough argument to make any changes to bytecode compilation.

I think the folks at faster-cpython would disagree with that, but they’re working on it :joy:

As far as I know there haven’t been proposals to incorporate type-hinting into that project, but at the point that a JIT becomes feasible that might make sense. But I don’t know if const would help, as a JIT is already going to trace the execution and might be able to figure out that something is a constant.


Thanks for that link – should also be interesting to @alexander ! (Glad I hedged my statement with “(about) as fast as” :slight_smile: )

Hi Hans, thanks for your comments.

  1. Yes, I know CPython, but Python 3.x up to 3.11 still has significant overhead (factor >10).
  2. Yes, I checked the bytecode via godbolt.org (and saw useless variables, reads, and writes), most likely each one using a hash table to find the variable space (which should be significantly slower than C or CPython).
  3. I understand that some changes are more complicated than others (API, etc.).
  4. Still, I think it should be possible to improve the current interpreter/compiler.
  5. If variables hold numbers and never change (because they are local and written once), then avoiding them would reduce wasted resources (energy, etc.).
  6. At the moment many users don’t care (as they underestimate the aggregate consequences for resource consumption), or they are forced to choose between readable code and a faster version.
  7. I know __annotations__ attributes are only hints. But as long as it is man-made and backward compatible, a new special keyword in annotations is not unthinkable…
  8. I will try to benchmark within one week.
    I think we should consider such low-hanging-fruit optimizations!
    Thanks for reading and thinking about it.

I would like to see a kind of “named value” that behaves like a literal. However, annotations don’t seem to be the right way to do this. It would make more sense to have a different kind of notation, such as:

constant vector = 3+4j

and then, any time the name vector is used, the compiler would substitute in the actual literal value of whatever it calculated on the RHS.

But there are a lot of problems with this sort of proposal, such as exactly when it gets evaluated.
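
For comparison, recent CPython versions already fold arithmetic on literals at compile time, while a name is always looked up at runtime; a minimal sketch (function names are made up) using dis to see the difference:

import dis

def with_literal():
    # the whole expression is constant-folded into a single LOAD_CONST
    return 2 * (3+4j)

VECTOR = 3+4j

def with_name():
    # VECTOR is fetched at runtime (LOAD_GLOBAL); nothing is folded
    return 2 * VECTOR

dis.dis(with_literal)
dis.dis(with_name)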

Oh, there are definitely improvements possible, especially if the bytecode standard is tweaked (as tends to happen with most minor versions anyway). Code that unpacks and re-packs sequences is especially bad:

>>> def x():
...     a, b, c, d, e = 1, 2, 3, 4, 5
...     return a, b, c, d, e
... 
>>> import dis
>>> dis.dis(x)
  2           0 LOAD_CONST               1 ((1, 2, 3, 4, 5))
              2 UNPACK_SEQUENCE          5
              4 STORE_FAST               0 (a)
              6 STORE_FAST               1 (b)
              8 STORE_FAST               2 (c)
             10 STORE_FAST               3 (d)
             12 STORE_FAST               4 (e)

  3          14 LOAD_FAST                0 (a)
             16 LOAD_FAST                1 (b)
             18 LOAD_FAST                2 (c)
             20 LOAD_FAST                3 (d)
             22 LOAD_FAST                4 (e)
             24 BUILD_TUPLE              5
             26 RETURN_VALUE

Even if we treat it as a requirement to set the values of the locals (say, for the benefit of introspective, multithreaded code), the re-loading of what should have already been on the stack seems pretty awful.
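
For comparison, if the locals are not actually required, returning the tuple directly already compiles down to (effectively) a single constant load; a minimal sketch using the same dis approach:

import dis

def y():
    # the tuple of literals is folded into one code constant:
    # effectively LOAD_CONST + RETURN_VALUE
    return (1, 2, 3, 4, 5)

dis.dis(y)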


Do you have an example of “real” code like that? To me, that looks like the author’s fault, and would mean putting effort into optimizing/encouraging something that shouldn’t have been written in the first place. Same with the first post’s examples.


Just some Python samples (which, translated 1:1 to C, would be optimized by the compiler).

  1. a = mm.sin(x); return a*a + 3.0*x + 2.2
  2. mm.cos(math.pi*x) # there is no comfortable way (except writing “3.1415926…”) to avoid the lookup for math.pi, which is often used multiple times (see the sketch after this list)
  3. _,a,_ = f3(x) # afaik it also stores the “_” values (I have to check later)
  4. param1, param2 = 3.3, 4.4 # followed by other code within the same function that uses these values many times (hard-coding the numbers would make the code unchangeable and unreadable)
    Edit: just checked: “_,_ = 30,31” produces 4 bytecode instructions!
  5. I also teach computer science, and I know what code students write. People use Python because it is easy and you don’t have to think about details. Sadly, this usually results in code far from what experts with knowledge of the internals would have written.
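
Regarding point 2: the usual workaround today is to bind the value to a local name (or a default argument) once, so the attribute lookup is not repeated in the hot path; a minimal sketch with made-up helper names:

import math

def cos_scaled(xs):
    pi = math.pi      # one attribute lookup, then a fast local read
    cos = math.cos    # same trick for the function itself
    return [cos(pi * x) for x in xs]

# alternative idiom: bake the lookups into default arguments at definition time
def cos_scaled_default(xs, _pi=math.pi, _cos=math.cos):
    return [_cos(_pi * x) for x in xs]

Both variants replace a global lookup of math plus an attribute lookup of pi on every iteration with a single load of a local (or default) value.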

Novice: writes the simple and obvious code, because it’s good enough.
Mid-level: No no, I must optimize, I must optimize!
Expert: writes the simple and obvious code, because it’s good enough.

Have you benchmarked any of these issues to find out whether they make a measurable difference?
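
For example, the math.pi claim can be checked quickly with timeit; a minimal sketch (absolute numbers will vary by machine and Python version):

import timeit

# literal vs. module-attribute lookup, 1,000,000 iterations each
t_literal = timeit.timeit("s = 3.141592653589793", number=1_000_000)
t_lookup = timeit.timeit("s = math.pi", setup="import math", number=1_000_000)
print(f"literal: {t_literal*1000:.1f} ms  math.pi: {t_lookup*1000:.1f} ms")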


Simple tests show about 15…45 percent difference on Python 3.11.5 (Windows 10, i7-11gen).

# Python Testfile 2023
import math # math.pi
import time

# global parameter
p0,p1,p2,p3,p4 = 0.1,1.1,2.2,3.3,4.4


# Block 0: Reference
dt = time.time()
# nothing (NOP)
dt1 = (time.time() - dt)*1000
dt = time.time()
for i in range(1000*1000):
    s = 1
dt2 = (time.time() - dt)*1000
dt,dtr = dt2-dt1,-1.0
print("t_nop=%.6f, t_for=%.6f, d=%.6f(--) [ms]" % (dt1,dt2,dt)) # 0.0 + 60 (d=100%) ms
tfor=dt2

# Block 1: unused (forgotten) variables
dt = time.time()
for i in range(1000*1000):
    _,_,_ = 1.1,2.2,3.3 # not used
    # something else independent
dt1 = (time.time() - dt)*1000
dt = time.time()
for i in range(1000*1000):
    a,b,c = 1.1,2.2,3.3
    # something else independent
dt2 = (time.time() - dt)*1000
dt,dtr = dt2-tfor,100.0*(dt2-tfor)/dt2
print("t_uu_3=%.6f, t_uu3a=%.6f, d=(%.6f)(%.1f%%) [ms]" % (dt1, dt2,dt,dtr)) # 100+100 (d=40% vs t_for) ms
# compare to t_for=60 gives >50% time loss

# Block 2: PI
s = 0.0
dt = time.time()
for i in range(1000*1000):
    s = 3.141592654
dt1 = (time.time() - dt)*1000
dt = time.time()
for i in range(1000*1000):
    s = math.pi
dt2 = (time.time() - dt)*1000
dt,dtr = dt2-dt1,100.0*(dt2-dt1)/dt2
print("t_3141=%.6f, t_m.pi=%.6f, d=%.6f(%.1f%%) [ms]" % (dt1,dt2,dt,dtr)) # 60 + 85 (d=30%) ms

# Block 3: many used temp
dt = time.time()
for i in range(1000*1000):
    # a,b,c = s*0.1,s*0.2,s*0.3
    s = ((1.1+(s*0.1)*1.2)*(s*0.2)-1.3)*(s*0.3)-1.4
dt1 = (time.time() - dt)*1000
dt = time.time()
for i in range(1000*1000):
    a,b,c = s*0.1,s*0.2,s*0.3
    s = ((1.1+a*1.2)*b-1.3)*c-1.4
dt2 = (time.time() - dt)*1000
dt,dtr = dt2-dt1,100.0*(dt2-dt1)/dt2
print("t_mutD=%.6f, t_mutV=%.6f, d=%.6f(%.1f%%) [ms]" % (dt1,dt2,dt,dtr)) # 220 + 330 (d=30%) ms

# Block 4: global parameters (polynomial x**3)
dt = time.time()
for i in range(1000*1000):
    s = ((1.1+p0*1.2) *p0-1.3) *p0-1.4 # p0 avoids compile-time-number
dt1 = (time.time() - dt)*1000
dt = time.time()
for i in range(1000*1000):
    s = ((p1+p0*p2) *p0-p3) *p0-p4
dt2 = (time.time() - dt)*1000
dt,dtr = dt2-dt1,100.0*(dt2-dt1)/dt2
print("t_polD=%.6f, t_polG=%.6f, d=%.6f(%.1f%%) [ms]" % (dt1,dt2,dt,dtr)) # 183 + 211 (d=30) ms

# Block 5: temp (single usage var.)

# Sample output (run again for averages)
# Win10 + Python 3.11.5 + mobile Intel i7-11xxx
# t_nop=0.000000, t_for=54.032087, d=54.032087(--) [ms]
# t_uu_3=117.016554, t_uu3a=99.645853, d=(45.613766)(45.8%) [ms]
# t_3141=50.169706, t_m.pi=80.700159, d=30.530453(37.8%) [ms]
# t_mutD=236.411810, t_mutV=328.007460, d=91.595650(27.9%) [ms]
# t_polD=178.054333, t_polG=207.845211, d=29.790878(14.3%) [ms]

#EOF.

I would say: the novice writes simple and obvious code because they don’t know better, the expert writes it when they know it is better.

The thing is that a novice often doesn’t write obvious code, but messy and unnecessarily complicated code, less efficient than simpler code. Novices also have no concept at all of maintainability or technical debt – it took me years before I saw how important that is, and CS courses mostly seem to ignore it. But anyway - optimizing the compiler will not help with this :slight_smile:

True. In any case, the point is, the expert knows that it’s often not worth optimizing a tiny little bit of runtime away, if it comes at the cost of making the code harder to read.


This looks a lot like Final. Final communicates the same intent, but has no runtime effect.
As a result, any suggestion for a new keyword in this space introduces two ways to spell the same thing, one for typing and one for runtime. I think that’s generally a bad idea.
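
For reference, this is roughly what the existing typing spelling looks like; a minimal sketch (the constant name is made up):

from typing import Final

MAX_RETRIES: Final = 5   # a type checker flags any reassignment of this name

MAX_RETRIES = 6          # ...but at runtime this is an ordinary rebind; nothing is enforced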

(Oh, and don’t reuse Final for this – being able to monkeypatch Final variables in tests is very handy.)

The goal of faster cpython, as I understand it, is to make the interpreter faster without significant changes to the language.

I’m fine with any name (final, const, define), but I would like positive runtime effects. Why not consider “final” (at least for numbers within functions) as a define-like object at the compiler level?

Mojo lets you efficiently define write-once constants with the let keyword. You can give Mojo a shot if performance is your concern, though it still has some rough edges.