Full disclosure: I’m currently paid to contribute to PyTorch, so my views may be biased toward its use case. Please take this into consideration when forming your own views. None of my views here represent or are endorsed by the PyTorch team or my employer; they are made in my own technical capacity.
Background
Some time back there was a discussion about adding an `inline` keyword. While it was generally rejected by me and the other core devs commenting on it, I have since observed a genuine use case for a C API function that “suggests” to CPython that a function is safe to inline.
The context behind this is that since CPython 3.11, we have performed stack frame inlining by checking whether the PEP 523 `state->interp->eval_frame` is set to `NULL` (i.e., there is no custom frame evaluation function). We will likely perform further optimizations in CPython 3.13 that rely on this, such as tracing through function calls and truly inlining bytecode by completely removing frame pushes and pops. While that check is correct, it leaves out the option of custom frame evaluation functions selectively letting the CPython interpreter inline code.
Specifications
```c
// Sets a flag in PyFunctionObject
void PyFunction_SetInlineAble(PyObject *func, bool can_inline);
```
The semantics for CPython are as follows: as with `inline` in C, this is only a suggestion that something is inlineable. If CPython deems it not inlineable, it is free not to inline. Other implementations are free to ignore the suggestion as well.
At runtime, CPython will store the current frame evaluation function (if any) before entering an inlineable function, and restore it after returning. This keeps the feature somewhat compatible with PEP 523: marking a function as inlineable has no side effects on frame evaluation outside of that function. When CPython sees that the flag is set, it will ignore the PEP 523 frame evaluation function and enter the code object directly.
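The save/restore behavior described above can be modeled in plain Python. This is only an illustrative simulation of the proposed semantics; `current_eval_frame`, `call_inlineable`, and `compiled_fn` are hypothetical names standing in for CPython internals, not real APIs:

```python
# Hypothetical model of the proposed semantics. `current_eval_frame`
# stands in for state->interp->eval_frame; none of these names are real
# CPython APIs.
current_eval_frame = "custom_tracer"   # a PEP 523 hook is installed

def call_inlineable(func, *args):
    """Simulate calling a function marked inlineable: the current
    frame-evaluation hook is saved, the code object is entered directly,
    and the hook is restored on return."""
    global current_eval_frame
    saved = current_eval_frame
    current_eval_frame = None          # inlined code runs without the hook
    try:
        return func(*args)
    finally:
        current_eval_frame = saved     # restored after the call

def compiled_fn(x):
    # Inside the inlineable function, the hook is bypassed.
    assert current_eval_frame is None
    return x * 2

result = call_inlineable(compiled_fn, 21)
```

The key property the sketch demonstrates is that the hook is untouched outside the marked function: after the call, `current_eval_frame` is back to its original value.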
Note that `PyFunction_SetInlineAble` is a stronger guarantee than PEP 523 and as such overrides it: when CPython sees a function marked inlineable, it will try to inline it regardless of the current `eval_frame`.
PyTorch’s use case
PyTorch’s TorchDynamo uses PEP 523 to trace through CPython bytecode and feed information to its JIT compiler backend. Once it JIT-compiles code, the compiled function is stored as a global, and new bytecode is written that just loads the global and calls it. A whole function is thus reduced to a simple `LOAD_GLOBAL` and `CALL`.
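At the source level, the rewrite described above looks roughly like the following sketch. The names and bodies are illustrative stand-ins, not real TorchDynamo output:

```python
# Illustrative sketch of the bytecode rewrite, expressed as source code.
# `__compiled_fn_1` here is a plain Python stand-in for the real JIT
# artifact, which would wrap compiled CPU/GPU code.
def original(x, y):
    return x * y + x            # region that gets traced and compiled

# The compiled artifact is stored as a module-level global...
__compiled_fn_1 = lambda x, y: x * y + x

# ...and the rewritten body reduces to LOAD_GLOBAL + CALL:
def rewritten(x, y):
    return __compiled_fn_1(x, y)
```

Both functions compute the same result; the difference is only in how much bytecode the interpreter executes per call.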
This is of course an oversimplification. When TorchDynamo cannot trace and JIT-compile certain code, it performs what it terms a “graph break”: it returns to standard CPython bytecode. When it finds something it can JIT-compile again, it compiles yet another global and writes a call to it into the new bytecode, as a resume/continuation function of sorts. For example, two compilable regions and one graph break will look like this:
```
CALL __compiled_fn_1
<Standard CPython bytecode>  # The break
CALL __resume_fn_1
```
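A source-level picture of the same shape may help. All names below are hypothetical stand-ins (the real compiled regions would wrap CPU/GPU kernels):

```python
# Hypothetical source-level view of one graph break.
__compiled_fn_1 = lambda x: x + 1      # first JIT-compiled region
__resume_fn_1 = lambda x: x * 10       # continuation compiled after the break

def unsupported_op(x):
    return -x                          # something Dynamo could not trace

def f(x):
    a = __compiled_fn_1(x)             # CALL __compiled_fn_1
    b = unsupported_op(a)              # the break: standard CPython bytecode
    return __resume_fn_1(b)            # CALL __resume_fn_1

out = f(4)
```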
The JIT-compiled functions themselves are simple Python functions that wrap compiled CPU/GPU code. I would like to reduce the overhead of calling these compiled functions. The problem is that after a graph break, TorchDynamo has to start tracing again, because the break could contain a call to another function that it needs to trace.
The current solution to work with CPython optimizations would be to do something like this:
```
CALL clear_eval_frame
CALL __compiled_fn_1
CALL set_eval_frame
<Standard CPython bytecode>
CALL clear_eval_frame
CALL __resume_fn_1
CALL set_eval_frame
```
As can be observed, this is extremely clunky, and it also defeats the performance gain from CPython’s inlining optimizations. With `PyFunction_SetInlineAble`, we can instead selectively mark `__compiled_fn_1` and `__resume_fn_1` as inlineable.