Function call overhead has improved greatly in Python 3, but I still regularly encounter hot paths that accidentally go function-call crazy with trivial operations that are intended to be “cheap” or no-ops in production.
Think:
In [24]: import time
    ...: DEBUG = False
    ...: ITERS = int(1e8)
    ...: def debug_only_work():
    ...:     if not DEBUG: return
    ...:     time.sleep(99)
    ...: def hotpath_plus():
    ...:     for i in range(ITERS):
    ...:         debug_only_work()
    ...:         pass
    ...: def hotpath():
    ...:     for i in range(ITERS):
    ...:         pass
    ...:
In [25]: %timeit hotpath_plus()
6.9 s ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %timeit hotpath()
1.34 s ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Could the runtime avoid calling functions that have no body? (That difference works out to roughly (6.9 s − 1.34 s) / 1e8 ≈ 55 ns of pure call overhead per iteration.)
Obviously, in the above example, the interpreter can’t know whether DEBUG is a runtime constant or not, but if the interpreter skipped the cost of invoking empty-body functions, a decorator could be employed to achieve the equivalent of:
if DEBUG:
    def report_host(ip): print(f"{ip} = {resolve(ip)}")
else:
    def report_host(ip): pass
I wondered about the ability to tag functions with a simple predicate to allow the runtime to skip calling it:
class Logger:
    enable_debug = False
    enable_warning = False
    # ...
    def set_level(self, level):
        if level <= logging.WARNING: self.enable_warning = True
        if level <= logging.DEBUG: self.enable_debug = True
    # ...
    @call_only_if(self.enable_debug)  # some kind of built-in
    def debug(self, fmt, *args, **kwargs):
        # ...
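No such built-in exists, but a pure-Python approximation is easy to sketch, and it illustrates why only the runtime can really fix this: the wrapper is itself a function call, i.e. exactly the overhead we want removed. Here the hypothetical `call_only_if` takes an attribute name, since `self` is not available at class-definition time:

```python
import functools
import logging

def call_only_if(attr_name):
    # Pure-Python stand-in for the proposed built-in: skip the body
    # unless the named instance attribute is truthy.  The wrapper is
    # still a call per invocation -- the cost a runtime feature would
    # need to eliminate.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if getattr(self, attr_name):
                return fn(self, *args, **kwargs)
            return None
        return wrapper
    return decorator

class Logger:
    def __init__(self):
        self.enable_debug = False

    def set_level(self, level):
        if level <= logging.DEBUG:
            self.enable_debug = True

    @call_only_if("enable_debug")
    def debug(self, fmt, *args):
        return fmt % args  # stands in for real formatting/emission

log = Logger()
log.debug("%s", "hi")           # skipped: returns None
log.set_level(logging.DEBUG)
log.debug("%s", "hi")           # runs: returns 'hi'
```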
but it would need to refuse `@property`, or people would just be swapping one hole for another:
In [40]: class A:
    ...:     _name = "world"
    ...:     @property
    ...:     def name(self):
    ...:         return self._name
    ...:
In [41]: a = A()
In [42]: %timeit a.name
77.5 ns ± 0.336 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [43]: %timeit a._name
28.6 ns ± 0.222 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
But that said, a lot of the big performance hits I end up squishing come from people assuming that logger.debug costs nothing when the debug level is higher.
But then they end up with constructs like the following:
logger.debug("hosts: %s", list(resolve(ip) for ip in host_ips))
Their reasoning? ‘%s’ printed <generator …>, so they had to turn it into a list, but only when we log debug… (and I’ve seen this repeatedly over the last 10 years from some very smart engineers)
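An idiom that keeps the deferral they were reaching for is to hand the logger an object whose `__str__` does the expensive work: `logging` only applies `%` formatting when a record is actually emitted, so the list is never built at disabled levels. A sketch, where `LazyStr` and `resolve` are illustrative names, not stdlib:

```python
import logging

class LazyStr:
    # Defers a computation until the log record is formatted, which
    # only happens when the level is actually enabled.
    def __init__(self, fn):
        self.fn = fn
    def __str__(self):
        return str(self.fn())

logger = logging.getLogger("demo")
host_ips = ["10.0.0.1", "10.0.0.2"]

def resolve(ip):  # stands in for an expensive lookup
    return f"host-{ip}"

# Constructing LazyStr is cheap; resolve() only runs if DEBUG is enabled:
logger.debug("hosts: %s", LazyStr(lambda: [resolve(ip) for ip in host_ips]))
```

The arguments to `logger.debug` are still evaluated eagerly, but here that evaluation is just building one tiny wrapper object.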
I’m also seeing this with format masks:
message: bytes
logger.debug("%s:%s:%s:%s message %s", time, str(ip), str(port), user_name, str(message))
ourlog.debug(f"{time}:{ip}:{port}:{user_name} message {message}")
where ‘message’ can be thousands of bytes long, and they think that this is “free” when wrapped in a logger.debug call.
In the last 5 years, looking at tools with seconds-to-minutes of startup time, much of that time boils down to expensive no-ops that “we can’t do anything about”, because of incorrect perceptions about no-ops.
In [136]: import logging ; logger = logging.getLogger("timing")
     ...: big_text = ("x" * 16385).encode('utf-8')
     ...: def f1(text: bytes) -> None:
     ...:     if logger.isEnabledFor(logging.DEBUG):
     ...:         logger.debug("%s", str(text))
     ...: def f2(text: bytes) -> None:
     ...:     logger.debug("%s", str(text))
     ...: def f3(text: bytes) -> None:
     ...:     logger.debug(f"{text}")
     ...:
In [137]: %timeit f1(big_text)
198 ns ± 0.983 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [138]: %timeit f2(big_text)
57.5 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [139]: %timeit f3(big_text)
57.4 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(In case you missed it: that went from nanoseconds for f1 to microseconds for f2 and f3.)
What people seem to expect is that
def f(user, message):
    logger.debug("%s:%s: message %s", user.name, user.ip, str(message))
rather than producing:
2 0 LOAD_GLOBAL 0 (logger)
2 LOAD_METHOD 1 (debug)
4 LOAD_CONST 1 ('%s:%s: message %s')
6 LOAD_FAST 0 (user)
8 LOAD_ATTR 2 (name)
10 LOAD_FAST 0 (user)
12 LOAD_ATTR 3 (ip)
14 LOAD_GLOBAL 4 (str)
16 LOAD_FAST 1 (message)
18 CALL_FUNCTION 1
20 CALL_METHOD 4
22 POP_TOP
24 LOAD_CONST 0 (None)
26 RETURN_VALUE
will instead produce something equivalent to:
2 0 LOAD_GLOBAL 0 (logger)
2 LOAD_ATTR 1 (is_enabled_for_debug)
4 POP_JUMP_IF_FALSE 17 (to 34)
6 LOAD_GLOBAL 0 (logger)
8 LOAD_METHOD 2 (debug)
10 LOAD_CONST 1 ('%s:%s: message %s')
12 LOAD_FAST 0 (user)
14 LOAD_ATTR 3 (name)
16 LOAD_FAST 0 (user)
18 LOAD_ATTR 4 (ip)
20 LOAD_GLOBAL 5 (str)
22 LOAD_FAST 1 (message)
24 CALL_FUNCTION 1
26 CALL_METHOD 4
28 POP_TOP
30 LOAD_CONST 0 (None)
32 RETURN_VALUE
>> 34 LOAD_CONST 0 (None)
36 RETURN_VALUE
because here, in the no-op case, marshalling the arguments is the real hidden cost.
– related anecdote –
During these investigations, the engineers who created the problem typically tell me “well, Python is just slow”. They had told the same to end-users who complained of 1-5 minute start-up times for critical tools, or blamed it on reading YAML files.
In one case, a tool for configuring a checkout directory and interacting with pulls took nearly 7 minutes to start up sometimes.
6 minutes and 43 seconds of that time was a combination of the diligence in logging and testing-support features in the 1st-party code, and logging and optional-validation features in the yaml library, all of which was no-op activity.
Just by changing the following pattern in both
def do_a_thing(config, args):
    logger.debug("%s: %s: %s: says %s", ...)
    expensive_validation(config, args)
def expensive_validation(config, args):
    if not config.enable_validation(): return
    ...
to
def do_a_thing(config, args, log_debug: bool = False):
    if log_debug: logger.debug(...)
    if config.validation_enabled: expensive_validation(config, args)
I was able to reduce the startup time to 58 seconds, and applying similar changes across the 1st-party code and a second library brought us down to 13s.
On hearing the speed-up was not the result of converting to go, most people said: “Huh, this will save me hours of time, I thought Python was just slow”
Other work (actual performance tuning) eventually got it down to 4 seconds.