Inlining comprehensions -- what are the compatibility constraints?

carljm · January 6, 2023, 6:18pm

Hello!

Currently comprehensions are compiled as nested functions and then called. This provides some isolation (comprehension iteration vars are not visible outside the comprehension) but it’s also inefficient: every time a comprehension runs, we have to allocate a brand-new PyFunctionObject just so that we can call it exactly once and then immediately throw it away.

I’m working on addressing this inefficiency in GH-97933, but the approach we take (and how much we can improve perf) depends on how tightly we construe the compatibility constraints. So I’m looking for opinions on what those constraints should be!

These are the (IMO) non-negotiable constraints, which I intend to keep:

The comprehension iteration variable(s) (i.e., local variables in what is now the “comprehension function”) must neither be visible after the comprehension, nor stomp on the value of a local variable of the same name in the function containing the comprehension.
Other bindings within the comprehension (e.g. walrus operator) should behave exactly as they do now.

Here are some behaviors that I would prefer to change in order to get faster comprehensions, and I am wondering if anybody would mind, or considers these compatibility-protected:

Is it important that calling locals() inside a comprehension (does anyone do this?) does not also include local variables from the function containing the comprehension?
Is it important that in a traceback, the comprehension itself should show up as a separate Python stack frame, rather than just a line number in the containing function?
Is it important that a reference to a comprehension’s iteration variable outside the comprehension (assuming no defined local var of the same name) raises NameError rather than UnboundLocalError?

My personal feeling is that most Python developers don’t think of themselves as writing and calling a nested function when they write a comprehension, and so the proposed new behaviors here would be at least as intuitive as the current behavior, arguably better. But it would be a change.

Thoughts or concerns?

barry · January 6, 2023, 6:28pm

I agree with your list of non-negotiable constraints, and have never relied on the things you’d like to change. I can’t think of any code that I’ve seen (e.g. at work) that does. But you know Python, “Just because you can, doesn’t mean you shouldn’t (sic)” .

You’re probably right about that!

guido · January 6, 2023, 6:43pm

Not much to add here, except that in my experience the extra frame can be annoying when debugging with pdb – you can’t print local variables when you’re in the comprehension’s frame.

Nevertheless, for best compatibility I’m hesitant to change too much. Could we define a bytecode or internal API that executes a code object directly, so it still creates a frame, but the detour via the function object isn’t necessary? (I wonder how cells for nonlocals would be passed in then.) Frames are supposed to be much cheaper than objects, and you need to do something to handle loop control variables that shadow locals in the surrounding scope.

Have you measured where the cost goes? (I see in Inline dict/list/set comprehensions in the compiler for better performance · Issue #97933 · python/cpython · GitHub you have a bit about this, but it’s a bit hand-wavy when it comes to 3.11+.)

carljm · January 6, 2023, 6:49pm

Yep, this is the alternative approach that should be almost entirely invisible. The trick here, as you observed, is closures: I think this approach either requires refusing to optimize comprehensions with a closure, or requires adding a new f_closure pointer to _PyInterpreterFrame (so we no longer rely on the function holding the closure.)

It also may give up some wins in those same closure cases. True inlining allows just sharing locals for those cases and avoids cells and closures entirely, which for a heavily-used closed-over var potentially means a lot of LOAD_FAST/STORE_FAST vs LOAD_DEREF/STORE_DEREF in the outer function. But I haven’t quantified this at all in 3.12, which brings me to your second point:

I haven’t, in 3.12, but I should. I will do some work to get better numbers here.

guido · January 6, 2023, 7:04pm

That’s unfortunate. I searched a tiny code base I am currently working on and found this example:

member = [member for member in family.members if member in self.instrs]

Here self would be a closure. This doesn’t seem a very far-fetched pattern.

Rosuav · January 6, 2023, 7:34pm

Question: Will this affect genexps and comprehensions differently? A genexp could be returned from a function, so it’ll still need to be a full closure and everything. Will the behaviour of list(... for ... in ...) be different from [... for ... in ...] in any way that matters?

carljm · January 6, 2023, 8:02pm

I don’t currently have any plans to change anything for generator expressions. That would introduce a lot of new complexities.

It would be different in that the latter would be faster, and potentially in the 3 ways I mentioned in the OP, because the latter would be a candidate for inlining and the former would not.

Rosuav · January 6, 2023, 8:57pm

Cool cool. I don’t think any of those ways will be particularly notable (certainly not in my personal code), so I’d be fine with that. Not that my word counts for anything much.

pf_moore · January 6, 2023, 9:07pm

I can’t imagine any case where I’d object to the change in behaviour. And if someone did need the old behaviour, it sounds like replacing [...] with list(...) would provide it - although I wouldn’t advertise that fact too widely, as someone is bound to complain that we’ve “broken” the equivalence between genexps and comprehensions

guido · January 6, 2023, 11:01pm

That surprises me at least a little bit. True, comprehensions don’t escape the scope, but the semantics in terms of scopes are supposed to be the same (as of Python 3.0), and I’ve started using “comprehensions” to refer to both (e.g. in PEP 572 – walrus). I’m guessing you have an implementation in mind that reuses the locals of the parent scope?

carljm · January 6, 2023, 11:21pm

Yes, the implementation that I think will give the most benefit is one where the comprehension bytecode is fully inlined into the containing function, with care taken to maintain isolation of the comprehension iteration vars. This is not that hard to do, but it leads to the compatibility tradeoffs in the OP. I don’t think this will work for generator expressions, though, because the need to resume execution pretty much dictates a separate code object and frame.

If we keep a separate code object and frame and just optimize away the single-use function object, presuming a solution to the closure problem (probably just biting the bullet and making every _PyInterpreterFrame one word larger?), it’s more feasible to do this for genexps also. That is a point in its favor.

It seems like the compatibility tradeoffs of full inlining may be acceptable but aren’t ideal, which means it really depends how much we gain from it. I may end up building both versions so we can get the truest comparison.

guido · January 7, 2023, 12:59am

That sounds like a plan. The compatibility trade-offs concern me as a matter of principle — things get ever so much messier, debuggers and other tooling must learn about it, etc.

storchaka · January 7, 2023, 8:17am

stacklevel. The result of warnings.warn(stacklevel=...) in a comprehension (directly or indirectly) will be changed if inline the code of the comprehension.

To me, the largest concern about this idea is that it will add a completely different path in the compiler, and it will be much more complex (and therefore less maintenable) than the code for generator expressions.

gpshead · January 8, 2023, 8:38am

There’s a solution in the works for that specific warnings.warn() API flaw: Migrate off of stacklevel=. Counting runtime stack frames above you which can change based upon language VM implementation details just as well as code refactoring happening elsewhere to determine where a warning should show up attached to in the call stack was always a fragile hack.

My PR #100840 adds a skip_file_prefixes= alternative enhancement to stacklevel=.

How common do we expect code that relies on a specific stack depth to be? I don’t think that is a good assumption to make. I’d never have assumed a comprehension created a new frame myself, any reliance on that would’ve been something unfortunate by observation because it appeared to be implemented that way.

This is not important to me as worded (just the traceback question)… the larger question of if removing an implicit frame counts as breakage remains a compatibility question but I’m not intuitively sure it should. Code gets refactored all the time in ways that changes its stack profile.

carljm · January 8, 2023, 4:11pm

This is true, but it seems unlikely in practice to matter. Typically stacklevel is used in eg a deprecated API to ensure the responsible caller whose code needs to be changed is correctly identified by the warning. In such a scenario this change would typically mean that the comprehension-containing function, on the line of the comprehension, is identified as responsible, rather than the comprehension itself. That seems… totally fine?

I can construct a scenario where it might cause wrong results, but it requires imagining a deprecated API that is always called through a comprehension in the same library, so the stacklevel is intended to bypass the comprehension (and probably another frame or several) to reach client code. I don’t think I’ve ever seen such a case, and it’s already inherently fragile to refactoring (and raises the question why the warning wouldn’t be raised at an api layer closer to client code in the first place) but I suppose it could happen.

That’s a reasonable concern, thanks for the feedback! I suspect the implementation will not be quite as complex as you suggest, and it seems that we’ve recently become more open to accepting some complexity in the interest of performance, but I think this is a trade off that has to be evaluated concretely with actual code and performance results. So I appreciate all the feedback in this thread, and I think the next step is for me to come back with both of those.

tekknolagi · January 10, 2023, 10:58pm

Had a chat with Carl and another approach is to hoist the comprehensions to top-level functions. This is nice because:

No more function object allocation on each comprehension
You can promote freevars to function parameters
Keeps frames the same, etc

It does come with some caveats:

It pollutes the global namespace
Can’t optimize comprehensions that write to cellvars
Function calls are still expensive, unless some external force makes them cheaper (like a JIT)

steven.daprano · January 10, 2023, 11:28pm

If I understand the situation correctly, I think you are saying that:

[func(x) for x in items]

would overwrite or create a global x. Is that correct?

If so, I think that’s much worse than merely changing the traceback slightly in the event of an exception.

How does the performance loss due to the cellvars compare to the performance gain from avoiding the comprehension function object?

Are the function calls more expensive than they currently are? If not, I think that’s a moot point.

tekknolagi · January 10, 2023, 11:35pm

Ah, no. What I mean to say is that

def foo(x):
  return [item for item in x]

might create a new function called <_comprehension_hoisted_0> or similar that is called from foo, as if written as:

def <_comprehension_hoisted_0>(x):
  result = []
  for item in x:
    result.append(x)
  return result

def foo(x):
  return <_comprehension_hoisted_0>(x)

and then <_comprehension_hoisted_0> is in globals() of the module.

steven.daprano · January 11, 2023, 12:07am

Ah, that’s not as bad as I thought, but it’s still pretty invasive,
especially for those who use lots of list comps, and even more so for
anyone that processes globals() looking for functions.

If you have a bunch of list comps that get run once, their hoisted
functions will live forever, in globals.

carljm · January 11, 2023, 12:09am

I think the hoisting approach in some ways ends up pretty similar to the “just optimize out the new function object and don’t change anything else” approach (you’re still making a call but you don’t have to create a new function object each time), but I don’t really like it in terms of user expectations. I think it’s somewhat surprising that [x for x in y] currently creates and calls a nested function; it would be arguably less surprising if it didn’t, and it would be much more surprising if it not only created a function but also hoisted it to module global level under some auto-generated name. Since these are observable, not transparent optimizations, this is a relevant factor.