Proposal for a `PyFunction_SetInlineAble` C API function

Full disclosure: I’m currently paid to contribute to PyTorch, so my views may be biased towards their use case. Please take this into consideration when forming your own views. None of my views here represent or are endorsed by the PyTorch team or my employer; they are made in my own technical capacity.

Background

Some time back there was a discussion about adding an inline keyword. While that idea was generally rejected by me and the other core devs commenting on it, I have since observed a genuine use case for a C API function that “suggests” to CPython that a function is safe to inline.

The context behind this is that since CPython 3.11, we have performed stack frame inlining by checking whether the PEP 523 tstate->interp->eval_frame hook is set to NULL (i.e. there is no custom frame evaluation function). We are likely going to perform further optimizations in CPython 3.13 that rely on this, such as tracing through function calls and truly inlining bytecode by completely removing frame pushes and pops. While that check is correct, it leaves out the option of custom frame evaluation functions selectively letting the CPython interpreter inline code.
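
For reference, the check in question is essentially a NULL test on the interpreter’s frame evaluation hook. A rough C paraphrase follows (illustrative only; the real guard lives in the specialized CALL instructions in Python/bytecodes.c, and tstate here is the current thread state):

/* Rough paraphrase of the PEP 523 guard, not the literal CPython source. */
if (tstate->interp->eval_frame != NULL) {
    /* A custom frame evaluation function is installed, so the C stack
     * frame cannot be skipped: fall back to the generic call path. */
    goto fall_back_to_generic_call;   /* in CPython this is a deoptimization/miss */
}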

Specifications

// Sets a flag in PyFunctionObject
void PyFunction_SetInlineAble(PyObject *func, bool can_inline)

The semantics for CPython are as follows: as with inline in C, this is only a suggestion that the function can be inlined. If CPython deems it not inlineable, it is free not to inline. Other implementations are free to ignore this suggestion as well.

At runtime, CPython will store the current frame evaluation function (if any) before executing the marked function, and restore it after returning from the function. This keeps it somewhat compatible with PEP 523: setting a function to be inlineable has no side effects on frame evaluation outside of the function it marks. In short, the interpreter ignores the PEP 523 check and enters the code object directly when it sees that the flag is set.

Note that PyFunction_SetInlineAble is a stronger guarantee than PEP 523, and as such overrides it. When a function is marked via PyFunction_SetInlineAble, CPython will try to inline it regardless of the current eval_frame.
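
A minimal sketch of what the setter itself could look like, assuming a new flag field on PyFunctionObject (func_flags and FUNC_INLINEABLE are hypothetical names; no such field or flag exists in CPython today):

#include <Python.h>
#include <stdbool.h>

/* Hypothetical sketch only: func_flags and FUNC_INLINEABLE are made-up names. */
void
PyFunction_SetInlineAble(PyObject *func, bool can_inline)
{
    assert(PyFunction_Check(func));
    PyFunctionObject *f = (PyFunctionObject *)func;
    if (can_inline) {
        f->func_flags |= FUNC_INLINEABLE;     /* interpreter may skip the PEP 523 check */
    }
    else {
        f->func_flags &= ~FUNC_INLINEABLE;
    }
}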

PyTorch’s use case

PyTorch’s TorchDynamo uses PEP 523 to trace through CPython bytecode to feed information to its JIT compiler backend. Once it JIT compiles code, the compiled function is treated as a global, and new bytecode is written that just loads the global and calls it. Thus a whole function is just reduced to a simple LOAD_GLOBAL and CALL.

This is of course an oversimplification. When TorchDynamo cannot trace and JIT compile certain code, it does what is termed there a “graph break”. This entails returning to standard CPython bytecode. When it finds something it can JIT compile again, it compiles yet another global and writes that into the new bytecode, as a resume/continuation function of sorts. So, for example, two compilable regions and one graph break will look like this:

CALL __compiled_fn_1
<Standard CPython bytecode> # The break
CALL __resume_fn_1

The JIT compiled functions themselves are simple Python functions that wrap compiled CPU/GPU code. I would like to reduce the overhead when calling these compiled functions. The problem is that at a graph break, TorchDynamo has to start tracing again, because the graph break could contain a call to another function that it needs to trace.

The current solution to work with CPython optimizations would be to do something like this:

CALL clear_eval_frame
CALL __compiled_fn_1
CALL set_eval_frame
<Standard CPython bytecode>
CALL clear_eval_frame
CALL __resume_fn_1
CALL set_eval_frame

As can be observed, this is extremely clunky, and also defeats the performance gain from CPython’s inlining optimizations. With PyFunction_SetInlineAble, we can selectively mark __compiled_fn_1 and __resume_fn_1 as inlineable.
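
For readers unfamiliar with how such helpers would work: a rough sketch of what clear_eval_frame and set_eval_frame would have to do with the existing PEP 523 C API (illustrative only, not TorchDynamo’s actual code; in practice these would be wrapped as Python callables visible to the generated bytecode, and depending on the CPython version these calls may need internal headers):

/* Illustrative sketch, not TorchDynamo's actual code. */
static _PyFrameEvalFunction saved_eval_frame = NULL;   /* assumed module-level storage */

static void
clear_eval_frame(void)
{
    /* Temporarily hand frame evaluation back to the default interpreter
     * so that CPython's call optimizations can kick in. */
    PyInterpreterState *interp = PyInterpreterState_Get();
    saved_eval_frame = _PyInterpreterState_GetEvalFrameFunc(interp);
    _PyInterpreterState_SetEvalFrameFunc(interp, _PyEval_EvalFrameDefault);
}

static void
set_eval_frame(void)
{
    /* Re-install the custom tracing evaluator saved above. */
    PyInterpreterState *interp = PyInterpreterState_Get();
    if (saved_eval_frame != NULL) {
        _PyInterpreterState_SetEvalFrameFunc(interp, saved_eval_frame);
    }
}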

@markshannon @guido @iritkatriel @brandtbucher


Hi Ken, I’m trying to wrap my head around your proposal. It may take a while because I actually have no idea how PyTorch or TorchDynamo works, and your explanation confused me a bit (but see below).


I’m going to start by objecting to the term “inlining” for what we currently do with functions. We have a bit of a history of abusing the term “inline” – in 3.10 and early versions of the adaptive specializing interpreter for 3.11 we had “inline” cache entries that weren’t in-line at all (they were kept in a separate array of cache data that was indexed somewhat cleverly based on the current instruction index). The final 3.11 version does have proper inline cache entries, and we’ve kept that in 3.12 and main: the cache data for a particular bytecode instruction immediately follows that instruction in the bytecode array, which makes the cache in-line.

For functions, proper inlining to me would mean that a call to e.g.

def add(a, b):
    return a + b

in the context of

x = add(a, b)

expands the function in-place so it will generate code as if the latter was written as

x = a + b

This is what C/C++ static inline functions do, AFAIK. We’re also planning to do this in 3.13 for Tier 2, but we’re only halfway there.

What we do, since 3.11, is more modest, and not something I’d call inlining. Instead I think of it as eliminating C stack frames. Where in 3.10 and before the CALL opcode would always end up calling PyObject_Vectorcall(), pushing several C stack frames onto the C stack for every Python stack frame, in 3.11, in most cases we just jump back into the interpreter loop without pushing any C stack frames. We do push a Python frame, however.

The current state of the Tier 2 interpreter is that, if we’re lucky, we do expand the called function’s instructions inline in the caller’s Tier 2 (micro) instruction sequence, but right now we still push and pop a Python stack frame – it’s as if the code was something like this:

_PUSH_FRAME()
__tmp = a + b
_POP_FRAME()
x = __tmp

After that considerable detour I think I understand your proposal better. The issue is actually your final code snippet, which calls helper functions clear_eval_frame and set_eval_frame, which (I’m guessing) actually clear and restore the tstate->interp->eval_frame field. The effect of this (if I’m guessing right) is that the whole thing is specialized to something like this:

CALL_NO_KW_BUILTIN_FAST clear_eval_frame
CALL_PY_EXACT_ARGS __compiled_fn_1
CALL_NO_KW_BUILTIN_FAST set_eval_frame
<Standard CPython bytecode>
CALL_NO_KW_BUILTIN_FAST clear_eval_frame
CALL_PY_EXACT_ARGS __resume_fn_1
CALL_NO_KW_BUILTIN_FAST set_eval_frame

Expanding to Tier 2 micro-code turns CALL_PY_EXACT_ARGS into the following sequence of micro-ops:

_CHECK_PEP_523
_CHECK_CALL_BOUND_METHOD_EXACT_ARGS
_INIT_CALL_BOUND_METHOD_EXACT_ARGS
_CHECK_FUNCTION_EXACT_ARGS
_CHECK_STACK_SPACE
_INIT_CALL_PY_EXACT_ARGS
SAVE_IP
SAVE_CURRENT_IP
_PUSH_FRAME

Here, the presence of _CHECK_PEP_523 is crucial: if tstate->interp->eval_frame != NULL this bails (“deoptimizes”) back to the Tier 1 interpreter. Hence the clear_eval_frame call.

If I understand your proposal correctly, you would like to modify the _CHECK_PEP_523 micro-op to not deoptimize if the function being called has a special flag set. The flag then basically means “ignore the PEP 523 evaluator present in the interpreter”. The remaining micro-ops in the expansion of CALL_PY_EXACT_ARGS don’t care about the presence of a PEP 523 evaluator, so they would “just work”.
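
Concretely, the change being discussed could look roughly like this (a plain-C paraphrase of the guard; the real _CHECK_PEP_523 lives in Python/bytecodes.c and is written in the interpreter DSL, and func_flags / FUNC_IGNORES_PEP_523 are hypothetical names used only for illustration):

/* Today (roughly): deoptimize whenever a PEP 523 evaluator is installed. */
if (tstate->interp->eval_frame != NULL) {
    DEOPT();   /* bail back to the Tier 1 interpreter */
}

/* Proposed (sketch): additionally consult a per-function opt-out flag.
 * func_flags and FUNC_IGNORES_PEP_523 are made-up names. */
PyFunctionObject *func = (PyFunctionObject *)callable;
if (tstate->interp->eval_frame != NULL &&
    !(func->func_flags & FUNC_IGNORES_PEP_523)) {
    DEOPT();
}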

You could then generate simpler (Tier 1) bytecode in TorchDynamo, like this:

CALL __compiled_fn_1
<Standard CPython bytecode>
CALL __resume_fn_1

provided of course you set the special flag for __compiled_fn_1 and __resume_fn_1.

I think it’s a reasonable proposal, but I worry a bit that it will slow down all code (Tier 1 and Tier 2) by making the _CHECK_PEP_523 uop check a flag in the function object. Perhaps we could alleviate this by having two variants of the specialized Tier 1 opcode CALL_PY_EXACT_ARGS, one that includes the _CHECK_PEP_523 uop and one that doesn’t, and have only the specializer check the flag in the function object. Since CALL_PY_EXACT_ARGS is already a macro in the DSL, the variant would just be another macro, so there’s not much code duplication – just an extra check in the specializer. This would require that the flag is set before the bytecode containing the CALL is specialized, but I think that’s not a problem for TorchDynamo (since it has to construct the compiled functions and emit the bytecode calling them anyway).
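
The specializer-side check described above might look something like the following (specialize.c-flavoured pseudocode; CALL_PY_EXACT_ARGS_NO_PEP_523, func_flags and FUNC_IGNORES_PEP_523 are made-up names, not existing CPython identifiers):

/* Made-up sketch: pick between two specializations at specialization time,
 * so neither Tier 1 nor Tier 2 has to check the flag on every call. */
if (PyFunction_Check(callable) &&
    (((PyFunctionObject *)callable)->func_flags & FUNC_IGNORES_PEP_523)) {
    /* Hypothetical variant that omits the _CHECK_PEP_523 uop. */
    instr->op.code = CALL_PY_EXACT_ARGS_NO_PEP_523;
}
else {
    instr->op.code = CALL_PY_EXACT_ARGS;
}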

What remains is bikeshedding over the name, and lobbying Mark, who generally has a somewhat different mental model than I have about how the interpreter works and what makes it fast (and I defer to him for that).

Bikeshedding over the name: may I recommend using the “unstable” API tier (PEP 689)? And avoiding loaded terms like “inline”.


I’m going to start by objecting to the term “inlining” for what we currently do with functions.

Yep, I was careful to call what we do in 3.11 “stack frame inlining” and what is planned for 3.13 “true inlining” in the original post. I still think PyFunction_SetInlineAble is the correct name, though, because we do need to check for PEP 523 (as you pointed out) before executing a truly inlined function in the Tier 2 optimized bytecode.

If I understand your proposal correctly, you would like to modify the _CHECK_PEP_523 micro-op to not deoptimize if the function being called has a special flag set. The flag then basically means “ignore the PEP 523 evaluator present in the interpreter”.

Yes you are 100% correct.

but I worry a bit that it will slow down all code (Tier 1 and Tier 2) by making the _CHECK_PEP_523 uop check a flag in the function object.

I don’t think it will be measurable. We already check that Py_TYPE(func) == &PyFunction_Type in every one of these stack-frame-skipping bytecodes. At worst I expect a 0.3% slowdown (which, again, is not measurable by pyperformance). Just to be sure, I will benchmark this.

Nevertheless, I would like a different name than “SetInlineAble”, because the only functionality really is to override/force the PEP 523 check. Something that makes one think of “eval_frame” would be ideal? (And what do you think about using “PyUnstable”? As I’m sure you know, Mark isn’t keen on PEP 523 and might want to push for deprecating it in favor of other – yet to be designed – functionality that’s more to the point.)

Alright. I’ll think of another name. I did not know PyUnstable was a thing now, thanks for pointing that out!

I’ll wait for Mark’s proposal on the new machinery before pushing for this.

Short answer: Let’s not add more features for PEP 523 when we should be removing it eventually.

Long answer:
I think you need to clearly state the fundamental issue, not your proposed solution.
Let me take a stab at doing that:
PyTorch dynamo wants to select regions of code to optimize. These are usually branchless and can start or end mid-function. It wants other regions of code to be optimized by CPython.

The PyUnstable_SetOptimizer API seems a much better fit for this than PEP 523, both in terms of flexibility and of performance.
PyTorch Dynamo can implement its own optimizer, delegating to the built-in optimizer when it chooses to do so.

PyTorch dynamo wants to select regions of code to optimize. These are usually branchless and can start or end mid-function. It wants other regions of code to be optimized by CPython.

No. TorchDynamo generally only traces from function entry to exit (treat it as a method-at-a-time tracer). It optimizes all regions of code; it does not rely on CPython optimizing other regions.

The main problem is that during execution it would be better to let CPython handle some parts, as CPython can execute them better (for example via its stack frame optimizations), while other parts should be handled by PyTorch. In this case, these parts are entire functions, and there is currently no way to selectively mark them. PyUnstable_SetOptimizer does not work for this, as it generates traces from backwards jumps (IIUC), which is incompatible with TorchDynamo’s model of tracing from the start to the end of a function.

E.g.

# The following entire function is compiled

@torch.compile
def foo():
    ...  # some torch operations here

I used the term “wants” not “does”. Don’t get stuck on the present design which is forced on you by PEP 523.
If PyTorch optimized whole functions, there would be no need for graph breaks.

At a graph break, PyTorch should return control to the VM. PyUnstable_SetOptimizer allows you to do that, PEP 523 does not.

You say that “It optimizes all regions of code” and that “it would be better to let CPython handle execution of some parts”. That is a contradiction.

You say that “It optimizes all regions of code” and that “it would be better to let CPython handle execution of some parts”. That is a contradiction.

I think we have different definitions of “optimization” and “execution”. I meant optimization loosely, in the sense that TorchDynamo will just generate new bytecode, even for graph breaks. It just so happens that the graph break bytecode is usually similar to the original bytecode regions. I understand that my definition of optimization is probably different from yours, or from what TorchDynamo would define optimization as. So I’ll just go with your definition for easier communication, and also in the (likely) event that I am wrong.

At a graph break, PyTorch should return control to the VM. PyUnstable_SetOptimizer allows you to do that, PEP 523 does not.

I agree with that.

However, I feel that the optimizer doesn’t have all the features of PEP 523. For one thing, PEP 523 allows users to define how they want to create bytecode regions. Essentially, PyUnstable_SetOptimizer makes tracing the default (IIRC) and does not allow other forms (such as method-at-a-time). I would caution against deprecating PEP 523 too soon, while the new alternative is not yet a sufficient replacement.

Also, I want to caution that I am not a TorchDynamo expert by any means. So I may not be seeing the alternative ways that it could be implemented using the PyUnstable_SetOptimizer API. Sorry if I get anything wrong above.

Ken, could you give a more detailed example of what TorchDynamo does in various cases?

I’m puzzled by your comment that PEP 523 allows users to define how they want to create bytecode regions.

PEP 523 forces the eval function to take over all execution from exactly the point of function entry, until that function exits. No flexibility whatsoever unless you modify the function, which has its own set of problems.

PyUnstable_SetOptimizer allows the optimization to stop at any instruction it wants without changing the function(s) optimized.

If you need to insert entry points into optimized code at places other than RESUME and JUMP_BACKWARD then we can fairly easily add an API for that.

If you need to insert entry points into optimized code at places other than RESUME and JUMP_BACKWARD then we can fairly easily add an API for that.

Thanks, that sounds like a better solution than what I’m proposing here. I’m +1 for this! I definitely didn’t think of this. Sorry.

Ken, could you give a more detailed example of what TorchDynamo does in various cases?

Take the following code (note: the operations don’t matter; we just care that print cannot be compiled):

import torch

@torch.compile
def forward(inputs):
    # Compiled into __compiled_fn_0
    x = torch.cosh(input=inputs)
    # Graph break
    print("A graph break!")
    # Compiled into __resume_at_72_1
    return x + inputs * 2


input_tensor = torch.randn([1])

result = forward(input_tensor)

On first entry, the decorator sets the PEP 523 frame evaluation function to a custom tracing one that TorchDynamo uses. With the collected trace, TorchDynamo then abstractly interprets/symbolically evaluates each bytecode instruction, producing an “FX graph”. This graph can then be lowered further and passed to a compiler backend. The compiler backend generates CPU/GPU code (usually in C++ or Triton), and new bytecode is then generated containing stubs that call these compiled functions.

For example, the code above generates the following modified bytecode:

 MODIFIED BYTECODE forward /home/ken/Documents/GitHub/test3.py line 3 
   3           0 RESUME                   0
               2 PUSH_NULL
               4 LOAD_GLOBAL              6 (__compiled_fn_0)
              16 LOAD_FAST                0 (inputs)
              18 PRECALL                  1
              22 CALL                     1
              32 STORE_FAST               2 (___graph_out_0)
              34 PUSH_NULL
              36 LOAD_GLOBAL              4 (print)
              48 LOAD_CONST               2 ('A graph break!')
              50 LOAD_FAST                2 (___graph_out_0)
              52 LOAD_CONST               4 (0)
              54 BINARY_SUBSCR
              64 STORE_FAST               1 (x)
              66 PRECALL                  1
              70 CALL                     1
              80 PUSH_NULL
              82 SWAP                     2
              84 LOAD_GLOBAL              8 (__resume_at_72_1)
              96 SWAP                     2
              98 LOAD_FAST                0 (inputs)
             100 LOAD_FAST                1 (x)
             102 PRECALL                  3
             106 CALL                     3
             116 RETURN_VALUE

__compiled_fn_0 and friends are Python wrappers/stubs over the compiled code. You can see that the bytecode contains the “graph breaks” (things that cannot be compiled), while anything it can compile is loaded and called as a stub.

Interestingly, this new bytecode is cached as a copy inside the code object’s co_extra field (again from PEP 523), so the old bytecode remains where it is. The decorator checks whether there is a compiled artefact and calls that. Again, this currently uses PEP 523.
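
For context, co_extra is the PEP 523 “extra” slot on code objects. A minimal sketch of how a tool can cache something there (illustrative, not TorchDynamo’s actual code; in recent CPython versions these functions are spelled PyUnstable_Eval_RequestCodeExtraIndex / PyUnstable_Code_SetExtra / PyUnstable_Code_GetExtra):

#include <Python.h>

/* Illustrative sketch of caching data in a code object's co_extra slot (PEP 523). */
static Py_ssize_t extra_index = -1;

static void
free_cached(void *cached)
{
    Py_XDECREF((PyObject *)cached);
}

static int
init_cache(void)
{
    /* Reserve a co_extra slot once, at startup. */
    extra_index = _PyEval_RequestCodeExtraIndex(free_cached);
    return extra_index < 0 ? -1 : 0;
}

static int
cache_artifact(PyObject *code, PyObject *artifact)
{
    Py_INCREF(artifact);
    if (_PyCode_SetExtra(code, extra_index, artifact) < 0) {
        Py_DECREF(artifact);
        return -1;
    }
    return 0;
}

static PyObject *
lookup_artifact(PyObject *code)
{
    void *cached = NULL;
    if (_PyCode_GetExtra(code, extra_index, &cached) < 0) {
        return NULL;
    }
    return (PyObject *)cached;   /* may be NULL if nothing is cached yet */
}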

If print were not a builtin but instead a user-defined function, TorchDynamo would trace that function as well, and store the compiled artefact in that function’s code object’s co_extra, again through PEP 523.

The conclusion:

I’m self-rejecting my idea, in favour of Mark’s proposal to modify PyUnstable_SetOptimizer instead, as that seems like a better way forward.

Thank you Mark and Guido for your patience in working this out with me. I really appreciate it. Also reading back on this thread, I apologise if my replies seemed terse.


No apologies necessary, Ken! You never know how much context your audience has or is missing until some kind of dialog is happening.
