Function call elision

kfsone · February 7, 2022, 9:38pm

Function call overhead has improved greatly in Py 3, but I still regularly encounter hot paths that accidentally go function-call crazy for trivial operations that are intended to be “cheap” or no-op in production.

Think:

In [24]: DEBUG = False
    ...: ITERS = int(1e8)
    ...: def debug_only_work():
    ...:     if not DEBUG: return
    ...:     time.sleep(99)
    ...: def hotpath_plus():
    ...:     for i in range(ITERS):
    ...:         debug_only_work()
    ...:         pass
    ...: def hotpath():
    ...:     for i in range(ITERS):
    ...:         pass
    ...:

In [25]: %timeit hotpath_plus()
6.9 s ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [26]: %timeit hotpath()
1.34 s ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Could the runtime avoid calling functions that have no body?

Obviously, in the above example, the interpreter can’t know whether DEBUG is a runtime constant or not, but if the interpreter skipped the cost of invoking empty-body functions, a decorator could be employed to achieve the equivalent of:

if  DEBUG:
  def report_host(ip): print(f"{ip} = {resolve(ip)}")
else:
  def report_host(ip): pass

I wondered about the ability to tag functions with a simple predicate to allow the runtime to skip calling it:

class Logger:
    enable_debug = False
    enable_warning = False
    # ...
    def set_level(level):
        if level >= logging.WARNING: self.enable_warning = True
        if level >= logging.DEBUG: self.enable_debug = True
        # ...

    @call_only_if(self.enable_debug)  # some kind of built-in
    def debug(fmt, *args, **kwargs):
        # ...

but it would need to refuse "@property"s, or people would just be switching one hole for another:

In [40]: class A:
     ...:     _name = "world"
     ...:     @property
     ...:     def name(self):
     ...:         return self._name
     ...:

In [41]: a = A()

In [42]: %timeit a.name
77.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [43]: %timeit a._name
28.6 ns ± 0.222 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

but that said, a lot of the big performance hits I end up squishing are from people assuming that logger.debug costs nothing when the debug level is higher.

But then they end up with constructs like the following:

logger.debug("hosts: %s", list(resolve(ip) for ip in host_ips))

Their reasoning? ‘%s’ printed <generator …>, so they had to turn it into a list, but only when we log debug… (and I’ve seen this repeatedly over the last 10 years from some very smart engineers)

I’m also seeing this with format masks:

message: bytes
logger.debug("%s:%s:%s:%s message %s", time, str(ip), str(port), user_name, str(message))
ourlog.debug(f"{time}:{ip}:{port}:{user} message {message}")

where ‘message’ can be 1000s of bytes long, and they think that this is “free” when wrapped in a logger.debug call.

In the last 5 years, looking at tools with seconds-to-minutes of startup time, huge quantities of it boil down to expensive no-ops that “we can’t do anything about”, because of incorrect perceptions about no-ops.

In [136]: import logging ; logger = logging.getLogger("timing")
     ...: big_text = ("x" * 16385).encode('utf-8')
     ...: def f1(text: bytes) -> None:
     ...:     if logger.isEnabledFor(logging.DEBUG):
     ...:         logger.debug("%s", str(text))
     ...: def f2(text: bytes) -> None:
     ...:     logger.debug("%s", str(text))
     ...: def f3(text: bytes) -> None:
     ...:     logger.debug(f"{text}")
     ...:

In [137]: %timeit f1(big_text)
198 ns ± 0.983 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [138]: %timeit f2(big_text)
57.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [139]: %timeit f3(big_text)
57.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(In-case you missed it: that went from nanoseconds for f1 to microseconds for f2 and f3)

what people seem to expect is that

    def f(user, message):
        logger.debug("%s:%s: messsage %s", user.name, user.ip, str(message))

rather than producing:

  2           0 LOAD_GLOBAL              0 (logger)
              2 LOAD_METHOD              1 (debug)
              4 LOAD_CONST               1 ('%s:%s: messsage %s')
              6 LOAD_FAST                0 (user)
              8 LOAD_ATTR                2 (name)
             10 LOAD_FAST                0 (user)
             12 LOAD_ATTR                3 (ip)
             14 LOAD_GLOBAL              4 (str)
             16 LOAD_FAST                1 (message)
             18 CALL_FUNCTION            1
             20 CALL_METHOD              4
             22 POP_TOP
             24 LOAD_CONST               0 (None)
             26 RETURN_VALUE

will instead produce something equivalent to:

  2           0 LOAD_GLOBAL              0 (logger)
              2 LOAD_ATTR                1 (is_enabled_for_debug)
              4 POP_JUMP_IF_FALSE       17 (to 34)
              6 LOAD_GLOBAL              0 (logger)
              8 LOAD_METHOD              2 (debug)
             10 LOAD_CONST               1 ('%s:%s: messsage %s')
             12 LOAD_FAST                0 (user)
             14 LOAD_ATTR                3 (name)
             16 LOAD_FAST                0 (user)
             18 LOAD_ATTR                4 (ip)
             20 LOAD_GLOBAL              5 (str)
             22 LOAD_FAST                1 (message)
             24 CALL_FUNCTION            1
             26 CALL_METHOD              4
             28 POP_TOP
             30 LOAD_CONST               0 (None)
             32 RETURN_VALUE
        >>   34 LOAD_CONST               0 (None)
             36 RETURN_VALUE

because here, in the no-op case, marshalling the arguments are the real hidden cost.

– related anecdote –

During these investigations, the engineers who created the problem typically tell me “well, python is just slow”. They’d told this to end-users who had complained of 1-5 minute start-up times of critical tools, or it’s because it’s reading yaml files.

In one case, a tool for configuring a checkout directory and interacting with pulls, took nearly 7 minutes to startup sometimes.

6 minutes and 43 seconds of that time was a combination of the diligence in logging and testing-support features in the 1st-party code, and logging and optional-validation features in the yaml library, all of which was no-op activity.

Just by changing the following pattern in both

def do_a_thing(config, args):
    logger.debug("%s: %s: %s: says %s", ...)
    expensive_validation(config, args)

def expensive_validation(config, args):
    if not config.enable_validation(): return
    ...

to

def do_a_thing(config, args, log_debug: bool=False):
    if log_debug: logger.debug(...)
    if config.validation_enabled: expensive_validation(args)

I was able to reduce the startup time to 58 seconds, and applying similar across the 1st-party code and a 2nd library brought us down to 13s.

On hearing the speed-up was not the result of converting to go, most people said: “Huh, this will save me hours of time, I thought Python was just slow”

Other - actual performance tuning - eventually got it down to 4 seconds.

steven.daprano · February 8, 2022, 2:58pm

Hi Oliver,

You ask some interesting questions, and since nobody has responded, I

wanted to step in and say that you are not being ignored. There’s a lot

to digest.

One of the problems is that it is hard (impossible?) for the interpreter

to tell at runtime that a function is empty without actually calling it.

Another thought comes to mind. Do you know about the magic value

__debug__ and assertions?

When you run Python normally, __debug__ is True, and assert statements

are run. When you run Python with the -O switch, __debug__ is False,

and assertions are removed from your code. But so are code blocks

guarded with if __debug__: ....

So one trick is:

def hotpath():

    for i in range(ITERS):

    if __debug__:

        debug_only_work()

    pass

When you run Python normally, debug_only_work() is called. When you run

under -O, it isn’t called.

But it’s cooler than that: either way, the if __debug__ test is

skipped, and the code is compiled as either unconditionally calling

debug_only_work(), or unconditionally skipping it.

I don’t know if that helps you or not, but it’s something to keep in

mind.

kfsone · February 9, 2022, 7:45pm

Holy hot-diggity-cow! I’m … having an actual moment From the C code, I “had known” that. Any yet, from printed text elsewhere, including

       -O     Turn on basic optimizations.  This changes the filename extension for compiled (bytecode) files from .pyc to .pyo.  Given twice, causes docstrings to be discarded.

       -OO    Discard docstrings in addition to the -O optimizations.

I had willfully mentally reduced it to “-O disables docstrings”. No - I see that’s not what it says, and I’ve seen it every time I’ve read it, and then on the next mental clock cycle resumed believing that it just disables docstrings.

If I can ask a speculative rather than “under the current implementation”: I’d thought it might be relatively easy for the interpreter to recognize an empty function based on the op-pair:

The first, obvious thought being a glance at empty string (“perhaps all empty functions are the same empty function”). Obviously not the function itself, but perhaps fn.code.co_code? Alas, pass vs docstring makes a difference.

Just to clarify, clearly this is different than a function that opens with an "if’ and bails out early. This was for the ability to do:

if FeatureFlags.enabled("send-packets-to-bing-for-giggles"):  # one-time evaluation
    def nothing_to_see_here():
        ...
else:
    def nothing_to_see_here():
        pass

but I guess the problem is that outside these trivial cases, you still end up doing a lot of work:

I guess this would need a (given significantly more thought than I’m about to use to describe it) proposal to replace that leading LOAD_GLOBAL with a similar op that knows you’re looking up a callable so the vm could potentially skip to the call function and, well, not.

Given that there are side-effects to be considered

call_possibly_empty_function(cake.pop())  # LIES!

and hidden side-effects:

class Party:
    cake: List[Cake]
    @property
    def fun(self):
        return cake.pop()

call_possibly_empty_function(party.fun)  # LIESENCAKE! (portmanteau of lies, heisen bug, and cake, but also probably very rude in some language)

I realize it’s not going to be as simple as my spitball outline But perhaps not anything significantly worse that the above?

kfsone · February 9, 2022, 7:49pm

PS it may sound like a limited gain, if I go to a proposal, I’ll gather/present stats on the places I’ve seen littered with def <impl> if enabled else def pass and how they can impact startup of most of the python tools we use as part of it to answer that…

But it acts as a step to a decorator to do def-if-else-pass which might encourage people to use it more.