Inline Python functions and methods

The Python community has a five-year plan to push the limits of speed in Python. One of the things that reduces Python’s execution speed is calling methods or functions that are not in the nearest scope.

My suggestion is to introduce inline functions, just as they exist in C. They could be defined as global functions or class methods, but would be created in the scope of the calling function at parse time. I also don’t think it would be hard to adapt to. Inline functions might look something like:

inline def func():
    pass

inline async def func():
    pass

inline class Executor:
    inline def func():
        pass

For an inline method, the class must be defined as inline as well, in order to bring into scope whatever class variables or methods the inline function might rely on.

This is just a suggestion; what do you all think about it?


My first thought is that this is exactly the kind of low-level detail that I don’t want to have to think about when writing Python code.

Also, I would think this would add all kinds of complications to introspection and runtime modifications. Or at least prohibit some operations. But I’ll admit I haven’t thought it all the way through. For example, could @dataclass work with an inline class?

Sorry to sound so negative. I’m not trying to shoot down the idea, just giving my first impression.


Note that this is also being discussed on Python-Ideas.

https://mail.python.org/archives/list/python-ideas@python.org/message/7XSAC2STMEU6WJSLP3DGOGJJGG4IUO3L/

Anyone who cares a lot about this issue may want to follow both discussions.

There’s no negativity here, all suggestions are welcome :sweat_smile:.
Actually, what I was thinking is that a function declared as inline would be fully copied to the place where it’s called, at compile time.

So if I had a function in a module somewhere and called it, that function would be copied into the place of the call rather than using dots to search for the function across vast scopes.
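As a rough illustration of what that substitution would mean, here is a hand-“inlined” version of a hypothetical helper (all names here are made up for the example; this is what the proposal imagines a compiler doing automatically):

```python
# Original code: a module-level helper called inside another function.
def multiply(a, b):
    return a * b

def total_original(pairs):
    # Each call to multiply() involves a name lookup and a function call.
    return sum(multiply(a, b) for a, b in pairs)

# Hand-"inlined" equivalent: the helper's body is substituted at the
# call site, removing the lookup and call overhead entirely.
def total_inlined(pairs):
    return sum(a * b for a, b in pairs)

pairs = [(1, 2), (3, 4)]
assert total_original(pairs) == total_inlined(pairs) == 14
```

The two functions are observably identical; the difference is only in how much work the interpreter does per call.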

I don’t think this would affect runtime behavior or introspection, since at the end of the day it would just be normal Python code, only with the function created in the same scope as the caller.

Honestly, the only thing I’d worry about is the compiled code being bigger than the code actually written, since every place the function was called would be replaced with the function body. So inline functions would need to be used with caution.

For dataclasses and decorators in general, it’ll need a bit of thought, because both the decorated function and the decorator itself would need to be copied to the call site. That also raises the question of nested functions, which is why I suggested that only global functions should be inlinable.

It is difficult to implement due to the dynamic nature of Python. For example:

if os.environ.get('TERM') == 'xterm':
    inline def func():
        return 1
else:
    inline def func():
        return 2

Which function would be inlined at compile time?
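The same point can be made without the hypothetical `inline` keyword: `def` is an ordinary runtime statement, so which body a name ends up bound to is not knowable at compile time.

```python
import os

# Which body `func` gets depends on the environment at the moment the
# branch executes, not on anything a compiler could determine statically.
if os.environ.get("TERM") == "xterm":
    def func():
        return 1
else:
    def func():
        return 2

print(func())  # 1 or 2, depending on the environment at import time
```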

Other example:

try:
    from _mod import func
except ImportError:
    inline def func():
        ...

Yet another common example:

inline def func():
    ...

def use_func():
    func()

try:
    from _mod import func
except ImportError:
    pass

Should an inline function be inlined in other modules in which it is used? If yes, what will happen if we change the function definition after compiling the modules that use it? What will happen if we change PYTHONPATH so that another module with the same name is loaded?

How will it work with monkeypatching? How can method inlining work with method overriding in subclasses?

Basically, I think that due to the compile-time nature inline functions would have if implemented, there would have to be restrictions on how they are created, and compile-time exceptions would be raised if something went wrong.

First, I think they’d have to be short and precise; secondly, they would have to be declared globally to avoid dynamic creation.

It’s still just a thought, but more ideas could be generated around this.

The link below has workarounds for some of the speed problems related to Python. I would like you to consider the points about “reducing dots” and “local variables”, which help speed up Python by bringing most functions and variables into local scope.

https://wiki.python.org/moin/PythonSpeed/PerformanceTips
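For instance, the “reducing dots” / “local variables” trick from that page can be sketched as a small `timeit` micro-benchmark (the exact timings will vary by machine, but the local alias is typically faster):

```python
import math
import timeit

def dotted(n=10_000):
    # Looks up the global `math`, then the attribute `sqrt`, on every iteration.
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total

def local_alias(n=10_000):
    sqrt = math.sqrt  # bind the function into a local variable once
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total

# Both compute the same result; the alias skips the repeated
# global/attribute lookups inside the loop.
assert dotted(1_000) == local_alias(1_000)
print("dotted:     ", timeit.timeit(dotted, number=50))
print("local alias:", timeit.timeit(local_alias, number=50))
```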

So I went back and revised my idea for inline functions in Python, and I realized that it would be harder to implement them the way I had originally imagined, due to Python’s dynamic nature. However, the idea itself doesn’t seem so original, as Cinder already implements inline bytecode caching, which significantly boosts its performance. I don’t think they ever upstreamed these changes, however.

So a modification of my idea goes like this: inline functions can still be written with the inline keyword, but evaluated at runtime. This is because there are cases, such as inside conditionals, where a function might never be called at all and so wouldn’t need to be optimized. So in cases where someone writes code like:

if True:
    inline def func_version_one(*args):
        # func body
        pass
else:
    inline def func_version_two(*args):
        # func body
        pass

The right function to inline would be determined at runtime and cached in the same scope as the caller, for cases where the program performs large iterations, or even infinite loops, and other cases that need optimization.
Like I said before, it’s still a work in progress, and I’m still taking lots of factors into consideration, including your incredible insight as the core dev team.

Hi Tobias, could you present a more complete example of an inline function with some actual code in it (even if it’s just a+b) and another function that calls it with actual values (even just f(2,2)) and talk us through how it would work? No need to make the example use two different versions of the inline function.


Alright, so here goes.
I imagined an inline function could be defined with the inline keyword. So let’s take the example of a function that performs arithmetic, as you suggested above.

inline def multiply(a, b):
    return a * b

Now the function can be called like any normal Python function:

def continuous_multiply(first_iterable: list, second_iterable: list) -> None:
    for x, y in zip(first_iterable, second_iterable):
        print(multiply(x, y))  # inline function called here

Now, what I imagine is that the multiply function would be cached within the continuous_multiply function, for quick access in such a repetitive task.
If the inline function were defined in a class, the class itself would have to be declared inline too. This is to bring into the cache anything within the class that the inline functions rely on.

inline class Arithmetics:
    inline def multiply(self, a, b):
        return a * b

Using the same continuous_multiply function, the call to the multiply method would look something like this:

def continuous_multiply(first_iterable: list, second_iterable: list) -> None:
    multiplier = Arithmetics()
    for x, y in zip(first_iterable, second_iterable):
        print(multiplier.multiply(x, y))  # inline method called here

So instead of using dots to reference the function on every iteration, the inlined multiply function would be cached within the same scope as the continuous_multiply function, and as a result would speed up execution. The dots, of course, would still be present in the code; the optimizations would be done at runtime within the interpreter.

Basically, what I was looking at are cases where lots of iterations are performed and it would be expensive to gather the data, and the instructions to operate on it, throughout the entire iteration process. Bringing whatever is needed closer for the duration of the iteration, through caching, would, I presume, be a step toward increasing execution speed.
The cache could be emptied once the loop, or whatever else is keeping the cache active, completes (so maybe there could be an internal reference count of sorts).

The reason for such a feature would be to replace, and probably optimize, something many Pythonistas and I have been doing to create our own caches, which goes something like:

def continuous_multiply(first_iterable: list, second_iterable: list) -> None:
    cached_multiply = Arithmetics().multiply
    # or cached_multiply = multiply  # the one declared globally
    for x, y in zip(first_iterable, second_iterable):
        print(cached_multiply(x, y))

So, as you can see, the idea is still vague, and I’m not conversant enough with the Python internals to know how hard or easy it would be to implement such a feature (though I’m studying the CPython source files in my free time, so I think I’ll get the hang of it pretty soon). However, I think it would be a pretty good optimization point. Perhaps the idea doesn’t end here, but could be explored and thought of in different ways, and could also be applied to async functions if possible.
So, that’s it for me.
Thanks, Mr Rossum.

Inlining Python functions is already partially done automatically at runtime by the CPython 3.11 interpreter. Admittedly, it only works for Python-to-Python function or method calls at the moment, and doesn’t work for generators or coroutines.

My explanation of how it works in CPython:

When the CALL_FUNCTION opcode detects a Python function, it doesn’t call _PyEval_Vector and, eventually, _PyEval_EvalFrameDefault (the eval loop). Instead, it sets up a new interpreter frame inline and then jumps to the start of _PyEval_EvalFrameDefault.

[Link to code], implemented by Pablo, Mark and other contributors (cpython/Python/ceval.c at 69806b9516dbe092381f3ef884c7c64bb9b8414a · python/cpython · GitHub).

Note that in the implementation above, even setting up a new interpreter frame is usually cheap, as old frames can be reused. I don’t see how we can avoid copying and recalculating args and kwargs even with the explicit inlining suggestion, so I’m not sure there are any benefits of this over the current approach.


Ken is mostly right, and the advantage of doing things this way is that we don’t have to bother users with changing their programs to make them faster; we just make them faster without their help.

I can imagine one scenario where it may be advantageous to mark functions explicitly as ‘inline’ – when optimizing we often want to “trace” code through a series of instructions that will preserve the type of some data that flows through it. Currently (in 3.11, the main branch) we use “inline” caches a lot, where for a single opcode we use a shortcut that is possible because the types of the input values are always the same. (E.g. the expression a+b when a and b are observed to be always strings.)

But sometimes we have a series of instructions that could be optimized this way together, for example a+b*x might always involve three floating point values. Such a sequence of instructions can be called a “trace”. The holy grail of trace-based optimizations is tracing through function calls (e.g. a+multiply(b, x)). I think it’s still an open question whether we can do this based on the current approach or whether it would be advantageous to have the user mark certain functions as ‘inline’.

But given the big downside of the inline idea (requiring users to change their code) I think we should let it rest until we know for sure that tracing through Python calls does not work and we think inline functions could help.


Another approach is to reconsider the mutability of modules. If a module can declare that it doesn’t change at runtime (after module load), then any lookup into that module can be cached.

I have not thought this through particularly deeply, but it could be achieved by replacing the module’s __dict__ with a mapping proxy. Given a.b.c, both a and b would have to be read-only for the lookup to be cached.
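A minimal sketch of the read-only-namespace idea, using only the standard library’s `types.MappingProxyType`. (Real modules don’t let you replace their `__dict__`, so the dict below just stands in for a module namespace; this illustrates the freezing, not an actual interpreter change.)

```python
import types

# A plain dict standing in for a module's namespace.
namespace = {"ANSWER": 42, "multiply": lambda a, b: a * b}

# MappingProxyType is a read-only view: lookups work normally,
# but any mutation raises TypeError. A namespace frozen like this
# could, in principle, have its lookups cached safely.
frozen = types.MappingProxyType(namespace)

print(frozen["multiply"](6, 7))  # prints 42

try:
    frozen["ANSWER"] = 0
except TypeError:
    print("namespace is read-only")
```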

Note that in C, the inline keyword is “a hint for the compiler to perform optimizations”, and the compiler is not required to perform that optimization.
It’s also worth noting that C++ is moving away from the programmer specifying the inlining of functions for performance/optimization reasons, leaving it to the compiler to decide when or whether to inline a function (see also Greg Ewing’s comment in the other thread):
https://en.cppreference.com/w/cpp/language/inline

The inline keyword in C++ is still used, but now has a slightly different meaning:
“the meaning of the keyword inline for functions came to mean “multiple definitions are permitted” rather than “inlining is preferred””

I like what @ericvsmith had to say about “exactly the kind of low-level detail that I don’t want to have to think about”.


Indeed. To note, if I recall correctly, this came up in several issues related to Victor Stinner’s work on PEP 670 and others. Typically, the compiler made a better choice than manual macros or forced inlining, at least in terms of overall performance. That PEP and its various related issues would be a useful read for some perspective on this, at least at the C level.

Ah. I hadn’t seen that PEP - thanks.

Perhaps. But see https://bugs.python.org/issue45116 — MSVC doesn’t do so well.

(Also, your first sentence needs some copy-editing. :slight_smile: )


Hmm, very interesting—hadn’t seen that. Quite the investigation, that—certainly seems worth a read, thanks.

Hey, I never said I was any good at copyediting my own writing :stuck_out_tongue: (In all seriousness, that whole post was atrocious; should be much better now—I’d previously drafted a reply discussing the topic in a bit more detail and linking to some related issues, but I’d lost the draft and ended up just banging out something forgettable)

If you need a C-level inline keyword, you can always use Cython (which explicitly supports it, although only for functions called from other Cython code).

To me this sounds like the most sensible solution: if you want to force inlining, you’re probably better off doing it with one of the various Python accelerator tools available.

Obviously, if Python wants to implement or improve bytecode inlining to speed itself up, that’s great, but it’s probably not something the user wants to be thinking about.

I came across this inline topic because I am working with pyspark.

Using a Python user-defined function for a column transformation in PySpark is extremely expensive because of the cost of 1. setting up a Python job for each transformation call, and 2. the need to marshal an RDD’s column data to and from JVM and Python serialization formats (the marshalling issue has only been somewhat addressed).

I thought that user-defined column transformation functions, implemented as inline functions in the C sense, might, if they existed, offer a solution to this problem. Which is how I got here.

I am new to Python, and to PySpark for that matter. But I do know, and have seen, that using user-defined functions in PySpark is indeed very costly.