Full disclosure: I’m currently paid to contribute to PyTorch, so my views may be biased towards their use case. Please take this into consideration when forming your own views. None of the views here represent or are endorsed by the PyTorch team or my employer; they are made in my own technical capacity.
Some time back there was a discussion about adding an
inline keyword. While it was generally rejected by me and the other core devs commenting on it, I have since observed a genuine use case for a C API function that “suggests” to CPython that a function is safe to inline.
The context behind this is that since CPython 3.11, we have performed stack frame inlining by checking whether the PEP 523 pointer
tstate->interp->eval_frame is set to
NULL (i.e., there is no custom frame evaluation function). We are likely going to perform further optimizations in CPython 3.13 that rely on this, such as tracing through function calls, and truly inlining bytecode by removing frame pushes and pops entirely. While that check is correct, it leaves no way for custom frame evaluation functions to selectively let the CPython interpreter inline code.
// Sets a flag in PyFunctionObject
void PyFunction_SetInlineAble(PyObject *func, bool can_inline)
The semantics for CPython are as follows: as in C, this is only a suggestion that something is inlineable. If CPython deems it not inlineable, it is free not to inline. Other implementations are free to ignore this suggestion as well.
At runtime, if the interpreter sees that the flag is set, it will ignore the PEP 523 check and just enter the code object. The bytecode will store the current frame evaluation function (if any) and restore it after returning from the function. This keeps it somewhat compatible with PEP 523 - that is, marking a function as inlineable has no side effects on frame evaluation outside of the function it is set on.
PyFunction_SetInlineAble is a stronger guarantee than PEP 523, and as such overrides it. When CPython sees a function marked via
PyFunction_SetInlineAble, it will try to inline it regardless of the current frame evaluation function.
PyTorch’s TorchDynamo uses PEP 523 to trace through CPython bytecode and feed information to its JIT compiler backend. Once it JIT-compiles code, the compiled function is stored as a global, and new bytecode is written that just loads the global and calls it. Thus a whole function is reduced to a simple load-and-call.
This is of course an oversimplification. When TorchDynamo cannot trace and JIT compile certain code, it will do what’s termed there a “graph break”. This entails returning to standard CPython bytecode. When it finds something it can JIT compile again, it compiles yet another global and writes that to the new bytecode, as a resume/continuation function of sorts. So for example, 2 compileable regions and 1 graph break will look like this:
__compiled_fn_0() # Compiled region 1
<Standard CPython bytecode> # The break
__resume_fn_1() # Compiled region 2 (the resume/continuation function)
The JIT compiled functions themselves are simple Python functions that wrap compiled CPU/GPU code. I would like to reduce the overhead of calling these compiled functions. The problem is that at a graph break, TorchDynamo has to start tracing again, because the graph break could contain a call to another function that it needs to trace.
The current solution to work with CPython optimizations would be to do something like this:
__compiled_fn_0() # eval_frame must be NULL for CPython to inline this call
<Set the PEP 523 frame evaluation function again>
<Standard CPython bytecode> # The break, traced by TorchDynamo
<Unset the PEP 523 frame evaluation function>
__resume_fn_1()
As can be observed, this is extremely clunky, and also defeats the performance gain from CPython’s inlining optimizations. With
PyFunction_SetInlineAble, we can selectively mark
__resume_fn_1 as inlineable.