Explicit tail calls

Rosuav · July 15, 2025, 6:07pm

Ah. Same difference - you have the dictionary of locals, but nothing about parameters.

alwaysmpe · July 15, 2025, 6:59pm

Agreed, my thought being if this were implemented and gave benefit it could be used everywhere instead of applying specific user opt in. Text tracebacks would probably require keeping code objects and last instructions which probably makes up a reasonable portion of the memory use of the frame. So, it would likely need a flag or env variable, and would be a tradeoff between debugging and performance. Optimization has a cost.

Worth noting - this could probably be applied for all tail calls, not just recursion. In the likes of C++ or Rust where a call stack needs to allocate a fixed size of memory for local variables and specific registers point to the arguments, there’s a benefit because you can skip the malloc for each function call. That’s where a lot of the performance benefit comes from. Python arguments and local variables are all stored in the locals dictionary, so the benefit would be creating fewer objects.

jsbueno · July 16, 2025, 2:20pm

I don’t believe that post Python 3.11 creating a new execution frame is a problem anymore - they are lazily created under demand -
so, the requirement for “same signature” would be bogus, and the implementation of this could be vastly simplified.

On the other hand, the gains would be minimal - with the post Python 3.11 lazy Frame creation strategy, tens of thousands (possibly hundreds - I had played around a little when 3.11 was out) of recursion levels are achievable in a transparent way.

And then, about new syntax, if this goes forward: I’d prefer some internal mechanism to mark a call expressions in a return statement as “tail optimizable”, and have a specialized decorator to mark functions that allow this, rather than new syntax. (The decorator would change the function flags/markers so that bytecode would do so) - that would require a bit more logic on the VM, but keep the syntax clean.

Overall, I am -0 on this.

(Maybe it could be nice when thinking about it as a language spec useful for other Python implementations, since the lazy-Frame optimizations are a cPython detail)

Rosuav · July 16, 2025, 3:36pm

When we look at processes, there’s the concept of exec for replacing the current process with another. It generally isn’t considered an optimization, it’s a specific feature. Perhaps tail calls should be viewed the same way? You aren’t optimizing away a stack frame; you’re explicitly replacing the current stack frame with this new one. That way, the lack of traceback isn’t a bug, it’s a feature. It’s not one that I would personally have much use for, but if you wanted it, it would be there.

I’d be -0 on adding such a feature, but if someone sees this as worthwhile, it could be done.

Nineteendo · July 16, 2025, 4:08pm

I use this code when I call a recusive function, because the traceback gets too long otherwise:

try:
    rval = f() # calls g() and then f() again
except Exception as exc:
    if (tb := exc.__traceback__) is not None:
        exc.__traceback__ = tb.tb_next

    raise

If Python could figure out that there’s an indirect loop in the traceback I wouldn’t have to do that, but here we are. You don’t get the traceback anyway when calling the C version of the function.

ryanhiebert · July 16, 2025, 10:44pm

I’d expect that the parameters need not match at all. Conceptually. it’s only the return type that matters. Since Python is dynamic, there’s no runtime requirement for the return type to match, and it’s always only one value returned (though that value might be a tuple).

I expect that it would drop the current stack frame and replace it with the stack frame from the single, unmodified function call that was specified as the tail call. Preferably with some marker in the new stack frame that the current stack from was elided. Perhaps a counter that tells us how many frames were elided when a traceback happens.

carljm · July 16, 2025, 11:36pm

It seems like some assumptions are being made in this thread about what benefits are expected from this feature, that I’m not sure apply to modern CPython. Or put differently: what actual problem in modern Python is this proposal intended to solve?

As @jsbueno noted above, since CPython 3.11, the interpreter already handles Python-to-Python function calls by simply updating the interpreter state and jumping to the top of the interpreter loop; there is no additional C stack frame used.

There is still a new Python-frame struct created and placed on the Python call stack, but this struct is quite minimal, and the Python stack is heap-allocated in chunks, and effectively infinite in size (bounded by available memory), unlike the C stack.

So it is already the case today that in pure Python you can recurse extremely deeply without stack exhaustion, and quite cheaply.