Improve the accuracy of output from inspect.getsource for expression-based code objects

blhsing · March 20, 2024, 9:23am

Abstract

This idea proposes adding the end line number, as well as the start and end column offsets, of the lines that define a function scope, to the line table entry of the RESUME bytecode that signals the start of the scope. This information will be used to improve the accuracy of the source code returned by inspect.getsource in order to reduce the complexity involved in runtime profiling and code transformations.

Motivation

The primary motivation of this idea is to improve the end user’s ability to obtain the precise defining source code of a given code object at runtime.

Python currently offers the co_firstlineno attribute to a code object so that runtime inspection tools such as inspect.getsource can extract the code block that defines a given code object by tracking indentations with a tokenizer.

However, for code objects not defined by a code block, such as those defined by a lambda or a generator expression, the co_firstlineno attribute alone is not enough to determine where the object is defined: as an expression, it can appear anywhere in a line, possibly alongside other lambda and/or generator expressions on the same line.
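To make the ambiguity concrete, here is a minimal sketch (the names `fns` and `first_lines` are made up for illustration) showing that two lambdas written on one line report the same co_firstlineno:

```python
# Two lambdas defined on the same physical line.
fns = [lambda x: x < 2, lambda x: x % 2]

# co_firstlineno is the only definition-site hint a code object carries by default.
first_lines = [f.__code__.co_firstlineno for f in fns]

# Both entries are the same line number, so the line alone cannot tell them apart.
print(first_lines)
```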

Consider the following code of a runtime call profiler:

import sys
from inspect import getsource
from itertools import takewhile

def profiler(frame, event, arg):
    if event == 'call':
        print(getsource(frame.f_code))
sys.setprofile(profiler)

for i in takewhile(lambda x: x < 2, filter(lambda x: x % 2, (
    i for i in range(1)
))):
    pass

It currently produces the following output:

for i in takewhile(lambda x: x < 2, filter(lambda x: x % 2, (
    i for i in range(1)
))):

for i in takewhile(lambda x: x < 2, filter(lambda x: x % 2, (
    i for i in range(1)
))):

for i in takewhile(lambda x: x < 2, filter(lambda x: x % 2, (
    i for i in range(1)
))):

Because the same entire for statement is returned as the source code for all 3 calls that take place, it is impossible to determine which of the 3 reported calls corresponds to which of the three functions in the returned source code (two lambda functions and one generator expression).

With the implementation of PEP 657, each bytecode is now accompanied by a fine-grained position, including both the start and end column offsets, exposed via a public code object method, co_positions, which in theory can be used to improve the accuracy of inspect.getsource’s output.

The problem is, however, that the start and end of a lambda function or a generator expression do not correspond to a bytecode with a meaningful position. For example, using co_positions on a lambda function’s code object:

import dis
a = lambda x: (
    x
)
print(*a.__code__.co_positions(), sep='\n')
dis.dis(a)

produces the output:

(2, 2, 0, 0)
(3, 3, 4, 5)
(2, 2, 0, 0)
  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (x)

  2           4 RETURN_VALUE

From the output, all that can be determined is that the definition of the code object starts somewhere on line 2, and that the portion of its body that produces bytecode starts at column 4 of line 3 and ends at column 5 of line 3. The end line number of 4, the start column of 4 and the end column of 1 are all missing from the output. It therefore remains tricky for inspect.getsource to extract the more precise source code that defines the lambda function, in this case:

lambda x: (
    x
)

Rationale

As the output of dis.dis in the last code example shows, the position information of the RESUME bytecode, except for its start line number, is currently meaningless: the end line number is always equal to the start line number, and the start and end column offsets are always 0.

It would therefore be reasonable to store the precise position of the source code that covers the definition of the code object, in the line table entry for the RESUME bytecode, a no-op instruction meant to signal the start of a function when its arg is 0. That RESUME is always there at or near the start of a scope also means that inspect.getsource can efficiently obtain the position information of the scope in constant time without having to iterate through all the positions generated from the co_positions method.
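As a rough sketch of the lookup described above, the position of the first RESUME can be read with the existing dis API. The helper name `resume_position` is hypothetical, and the sketch assumes Python 3.11 or later, where instructions carry a `positions` attribute. On current interpreters it only returns the placeholder (lineno, lineno, 0, 0) entry; under the proposal the same lookup would return the full span.

```python
import dis

def resume_position(code):
    """Return the positions of the first RESUME instruction, or None.

    dis.get_instructions is lazy, so the scan stops after the first few
    instructions of the scope instead of walking the whole code object.
    """
    for instr in dis.get_instructions(code):
        if instr.opname == "RESUME":
            # `positions` only exists on Python 3.11+; fall back to None otherwise.
            return getattr(instr, "positions", None)
    return None

f = lambda x: (
    x
)
pos = resume_position(f.__code__)
print(pos)
```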

Since the additional position information is already available internally in the AST for the compiler core, all that needs to be done is for the compiler to pass the additional position information when entering a scope, and to replace the hard-coded position of (lineno, lineno, 0, 0) with (lineno, end_lineno, col_offset, end_col_offset) for the RESUME instruction.

Specification

With the proposed change, the line table entry for the RESUME bytecode, when its arg is 0, will cover the entire source code that defines the scope, except when the scope is that of a module, in which case the entry will remain (1, 1, 0, 0) as it is now.

The position covering the entire source code that defines the scope will come from lineno, end_lineno, col_offset and end_col_offset of the statement and expression AST structs, after applicable decorators are taken into account for the start line number as they are now.
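For reference, the same four attributes are visible from Python through the ast module. The sketch below parses an illustrative snippet (the `src` string and variable names are made up for this example) and shows the span the compiler would have on hand for a lambda:

```python
import ast

# Illustrative source; the lambda starts at column 4 of line 1 and ends on line 3.
src = "a = lambda x: (\n    x\n)\n"
tree = ast.parse(src)

# Pick out the single Lambda node in the tree.
lam = next(node for node in ast.walk(tree) if isinstance(node, ast.Lambda))

# The order mirrors the proposed (lineno, end_lineno, col_offset, end_col_offset).
pos = (lam.lineno, lam.end_lineno, lam.col_offset, lam.end_col_offset)
print(pos)

# ast.get_source_segment applies that span to the source text.
segment = ast.get_source_segment(src, lam)
print(segment)
```

Note that this starts from the source string, not from a live code object; it only illustrates where the numbers would come from.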

Backwards Compatibility

Code analysis and coverage tools that currently presume a no-op bytecode to have zero width in the line table should be refactored to special-case the RESUME instruction to take its position information with the new meaning.
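As a rough illustration of what such special-casing might look like, the sketch below collects the lines spanned by instructions while ignoring RESUME. The helper `covered_lines` and its `skip` parameter are hypothetical and not taken from any existing coverage tool; the sketch assumes Python 3.11+ for the `positions` attribute and returns an empty set on older versions.

```python
import dis

def covered_lines(code, skip=("RESUME",)):
    """Collect line numbers spanned by instructions, ignoring opcodes in `skip`."""
    lines = set()
    for instr in dis.get_instructions(code):
        pos = getattr(instr, "positions", None)  # attribute exists on 3.11+
        if instr.opname in skip or pos is None or pos.lineno is None:
            continue
        # Some entries may lack an end line; treat them as single-line spans.
        end = pos.end_lineno if pos.end_lineno is not None else pos.lineno
        lines.update(range(pos.lineno, end + 1))
    return lines

f = lambda x: (
    x
)
print(sorted(covered_lines(f.__code__)))
```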

Reference Implementation

A reference implementation can be found in the implementation fork, where the supporting functions for inspect.getsource are refactored to make use of the new semantics.

Performance Implications

The compiler will spend minimally more time to pass the 3 integers end_lineno, col_offset and end_col_offset when entering a scope. The efficiency gain for inspect.getsource from not having to reparse the source code to find the code block given a start line number should be significant.

iritkatriel · March 20, 2024, 4:34pm

Doesn’t code.co_positions() do what you need? See Lib/dis.py for an example how to use it.

blhsing · March 21, 2024, 2:45am

Ah very cool. Thank you. Not sure how I missed or forgot about this exciting blurb about co_positions in the What’s New of Python 3.11.

This means that inspect.getsource can be trivially reimplemented with co_positions instead of the current approach of re-parsing the lines with a tokenizer, only to end up with inaccurate results for expression-based code objects. Will attempt a reimplementation now. Thanks again.

blhsing · March 21, 2024, 4:18am

Well, at first glance co_positions has all I needed, but as soon as I started experimenting with it I realized that its bytecode-driven output is missing crucial positions for what I would like to achieve, namely to generically obtain from any given code object its defining source code that can be directly recompiled back into the object that contains the same code object.

For example, given an assignment from a lambda expression:

import dis
a = lambda c: (
    1
)
print(*a.__code__.co_positions(), sep='\n')
dis.dis(a)

My goal would be to extract:

lambda c: (
    1
)

that can be directly recompiled back into a function object containing the same code object.

However, the above code outputs:

(2, 2, 0, 0)
(3, 3, 4, 5)
(2, 2, 0, 0)
  2           0 RESUME                   0

  3           2 LOAD_CONST               1 (1)

  2           4 RETURN_VALUE

Note the lack of a proper col_offset to pinpoint the start of the lambda expression, as well as the lack of end_lineno and end_col_offset to identify the end position.

Would it be reasonably feasible to include such information in the RESUME no-op instruction that apparently currently always has the same lineno and end_lineno, and always has 0s for both col_offset and end_col_offset? In the example above, co_positions() should ideally yield (2, 4, 4, 1) for the RESUME bytecode.

blhsing · March 21, 2024, 5:30am

Also note that if we are to enable inspect.getsource to trivially apply the line numbers and column offsets yielded by co_positions() to the source lines without any re-parsing efforts, it may be necessary to include comments in the AST to handle the corner case where the code block ends with a comment at the same indentation level:

import dis
import inspect
if 1:
    def a(c):
        return 1 + 2
        # comment
print(*a.__code__.co_positions(), sep='\n')
dis.dis(a)
print(inspect.getsource(a))

This outputs:

(4, 4, 0, 0)
(5, 5, 15, 20)
(5, 5, 15, 20)
  4           0 RESUME                   0

  5           2 LOAD_CONST               1 (3)
              4 RETURN_VALUE
    def a(c):
        return 1 + 2
        # comment

Note the maximum end_lineno of all bytecodes being 5 rather than 6 here.

But adding a comment node to the AST may be overkill just to handle such a corner case. Perhaps inspect.getsource should continue to use a tokenizer to re-parse the source unless it’s determined that the code object is derived from an expression.

Adding more meaningful positional information to the RESUME bytecode as proposed in my previous post on the other hand feels like a much bigger bang for the buck.
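For illustration, applying such a position directly to the source lines could look roughly like the sketch below. The helper `slice_source` is hypothetical, and the span (2, 4, 4, 1) is the value the RESUME entry would ideally carry for the earlier lambda example, not something current CPython yields. Since the column offsets are UTF-8 byte offsets, the sketch slices encoded bytes rather than str indices.

```python
def slice_source(lines, lineno, end_lineno, col_offset, end_col_offset):
    """Cut a (lineno, end_lineno, col_offset, end_col_offset) span out of `lines`.

    `lines` is a list of source lines with line endings kept; line numbers
    are 1-based and column offsets are UTF-8 byte offsets.
    """
    span = [line.encode("utf-8") for line in lines[lineno - 1:end_lineno]]
    # Trim the tail before the head so a single-line span is still cut correctly.
    span[-1] = span[-1][:end_col_offset]
    span[0] = span[0][col_offset:]
    return b"".join(span).decode("utf-8")

source = "import dis\na = lambda c: (\n    1\n)\n"
snippet = slice_source(source.splitlines(keepends=True), 2, 4, 4, 1)
print(snippet)
```

The extracted text compiles on its own as an expression, which is the property I am after.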

iritkatriel · March 21, 2024, 8:45am

You can get full location info from the AST.

Making resume appear as if it’s covering the whole function can confuse things like code coverage tools.

blhsing · March 21, 2024, 9:59am

By AST do you mean the ast module? If so, the ast module can build an AST from source code but not from a live code object. Or if you mean the AST for the compiler core, it just doesn’t currently expose the precise position of a code object to end users, which is the point of this proposal.

My goal is to make inspect.getsource return a compilable piece of source code from a given code object for runtime reporting and transformation.

I understand that with the current definition, a no-op bytecode such as RESUME is not supposed to cover any width. But if we document the change, code coverage tools can special-case RESUME accordingly.

RESUME is IMHO the best place to store such info, being a no-op and always at the top of a scope. Or would you help recommend a better approach to exposing the precise position of an expression-based code object?

I’ve gone ahead to prototype the necessary changes to make RESUME include precise positions and modified inspect.getsource-related functions accordingly (by applying the precise position from RESUME directly to source lines). Works well in my limited testing so far, where:

import inspect
if 1:
    def a(c):
        return 1 + 2
b = (
    1 for _ in range(2)
)
c = lambda c: (
    1
)
print(inspect.getsource(a))
print(inspect.getsource(b))
print(inspect.getsource(c))

would output:

    def a(c):
        return 1 + 2
(
    1 for _ in range(2)
)
lambda c: (
    1
)

Anyone interested is welcome to try it out at:

The diff to f4cc77d494ee0e10ed84ce369f0910c70a2f6d44 is as follows:

diff --git a/Lib/inspect.py b/Lib/inspect.py
index 7336cea0dc..a639625ccc 100644
--- a/Lib/inspect.py
+++ b/Lib/inspect.py
@@ -923,6 +923,8 @@ def getfile(object):
         object = object.tb_frame
     if isframe(object):
         object = object.f_code
+    if isgenerator(object):
+        object = object.gi_code
     if iscode(object):
         return object.co_filename
     raise TypeError('module, class, method, function, traceback, frame, or '
@@ -1107,7 +1109,7 @@ def get_lineno(self):
             return self.lineno_found[-1][0]
 
 
-def findsource(object):
+def findsource(object, precise_position=False):
     """Return the entire source file and starting line number for an object.
 
     The argument may be a module, class, method, function, traceback, frame,
@@ -1153,20 +1155,21 @@ def findsource(object):
         object = object.tb_frame
     if isframe(object):
         object = object.f_code
+    if isgenerator(object):
+        object = object.gi_code
     if iscode(object):
-        if not hasattr(object, 'co_firstlineno'):
+        if not hasattr(object, 'co_positions'):
             raise OSError('could not find function definition')
-        lnum = object.co_firstlineno - 1
-        pat = re.compile(r'^(\s*def\s)|(\s*async\s+def\s)|(.*(?<!\w)lambda(:|\s))|^(\s*@)')
-        while lnum > 0:
-            try:
-                line = lines[lnum]
-            except IndexError:
-                raise OSError('lineno is out of bounds')
-            if pat.match(line):
-                break
-            lnum = lnum - 1
-        return lines, lnum
+        for lineno, end_lineno, col_offset, end_col_offset in object.co_positions():
+            if None not in (lineno, end_lineno, col_offset, end_col_offset):
+                lnum = lineno - 1
+                if precise_position:
+                    # keep indentation for function and class definitions
+                    if re.match(r'\s+(?:(?:async\s+)?def|class|@)\b', lines[lnum]):
+                        col_offset = 0
+                    return lines, lnum, end_lineno - 1, col_offset, end_col_offset
+                else:
+                    return lines, lnum
     raise OSError('could not find code object')
 
 def getcomments(object):
@@ -1298,7 +1301,7 @@ def getsourcelines(object):
     original source file the first line of code was found.  An OSError is
     raised if the source code cannot be retrieved."""
     object = unwrap(object)
-    lines, lnum = findsource(object)
+    lines, lnum, end_lnum, col_offset, end_col_offset = findsource(object, precise_position=True)
 
     if istraceback(object):
         object = object.tb_frame
@@ -1307,8 +1310,10 @@ def getsourcelines(object):
     if (ismodule(object) or
         (isframe(object) and object.f_code.co_name == "<module>")):
         return lines, 0
-    else:
-        return getblock(lines[lnum:]), lnum + 1
+    lines = lines[lnum: end_lnum + 1]
+    lines[-1] = lines[-1][:end_col_offset]
+    lines[0] = lines[0][col_offset:]
+    return lines, lnum + 1
 
 def getsource(object):
     """Return the text of the source code for an object.
diff --git a/Python/compile.c b/Python/compile.c
index 3291d31a5c..2874f8c379 100644
--- a/Python/compile.c
+++ b/Python/compile.c
@@ -1256,9 +1256,10 @@ codegen_addop_j(instr_sequence *seq, location loc,
 
 static int
 compiler_enter_scope(struct compiler *c, identifier name,
-                     int scope_type, void *key, int lineno)
+                     int scope_type, void *key, int lineno, int end_lineno,
+                     int col_offset, int end_col_offset)
 {
-    location loc = LOCATION(lineno, lineno, 0, 0);
+    location loc = LOCATION(lineno, end_lineno, col_offset, end_col_offset);
 
     struct compiler_unit *u;
 
@@ -1766,7 +1767,7 @@ compiler_enter_anonymous_scope(struct compiler* c, mod_ty mod)
     _Py_DECLARE_STR(anon_module, "<module>");
     RETURN_IF_ERROR(
         compiler_enter_scope(c, &_Py_STR(anon_module), COMPILER_SCOPE_MODULE,
-                             mod, 1));
+                             mod, 1, 1, 0, 0));
     return SUCCESS;
 }
 
@@ -2196,7 +2197,8 @@ compiler_type_params(struct compiler *c, asdl_type_param_seq *type_params)
             if (typeparam->v.TypeVar.bound) {
                 expr_ty bound = typeparam->v.TypeVar.bound;
                 if (compiler_enter_scope(c, typeparam->v.TypeVar.name, COMPILER_SCOPE_TYPEPARAMS,
-                                        (void *)typeparam, bound->lineno) == -1) {
+                        (void *)typeparam, bound->lineno, bound->end_lineno, bound->col_offset,
+                        bound->end_col_offset) == -1) {
                     return ERROR;
                 }
                 VISIT_IN_SCOPE(c, expr, bound);
@@ -2269,7 +2271,8 @@ compiler_function_body(struct compiler *c, stmt_ty s, int is_async, Py_ssize_t f
     }
 
     RETURN_IF_ERROR(
-        compiler_enter_scope(c, name, scope_type, (void *)s, firstlineno));
+        compiler_enter_scope(c, name, scope_type, (void *)s, firstlineno, s->end_lineno,
+                             s->col_offset, s->end_col_offset));
 
     Py_ssize_t first_instr = 0;
     PyObject *docstring = _PyAST_GetDocString(body);
@@ -2331,7 +2334,9 @@ compiler_function(struct compiler *c, stmt_ty s, int is_async)
     asdl_type_param_seq *type_params;
     Py_ssize_t funcflags;
     int annotations;
+    expr_ty first_deco;
     int firstlineno;
+    int firstcol_offset;
 
     if (is_async) {
         assert(s->kind == AsyncFunctionDef_kind);
@@ -2355,8 +2360,11 @@ compiler_function(struct compiler *c, stmt_ty s, int is_async)
     RETURN_IF_ERROR(compiler_decorators(c, decos));
 
     firstlineno = s->lineno;
+    firstcol_offset = s->col_offset;
     if (asdl_seq_LEN(decos)) {
-        firstlineno = ((expr_ty)asdl_seq_GET(decos, 0))->lineno;
+        first_deco = (expr_ty)asdl_seq_GET(decos, 0);
+        firstlineno = first_deco->lineno;
+        firstcol_offset = first_deco->col_offset;
     }
 
     location loc = LOC(s);
@@ -2385,7 +2393,8 @@ compiler_function(struct compiler *c, stmt_ty s, int is_async)
             return ERROR;
         }
         if (compiler_enter_scope(c, type_params_name, COMPILER_SCOPE_TYPEPARAMS,
-                                 (void *)type_params, firstlineno) == -1) {
+                                 (void *)type_params, firstlineno, s->end_lineno,
+                                 firstcol_offset, s->end_col_offset) == -1) {
             Py_DECREF(type_params_name);
             return ERROR;
         }
@@ -2471,7 +2480,8 @@ compiler_class_body(struct compiler *c, stmt_ty s, int firstlineno)
     /* 1. compile the class body into a code object */
     RETURN_IF_ERROR(
         compiler_enter_scope(c, s->v.ClassDef.name,
-                             COMPILER_SCOPE_CLASS, (void *)s, firstlineno));
+                             COMPILER_SCOPE_CLASS, (void *)s, firstlineno,
+                             s->end_lineno, s->col_offset, s->end_col_offset));
 
     location loc = LOCATION(firstlineno, firstlineno, 0, 0);
     /* use the class name for name mangling */
@@ -2589,9 +2599,13 @@ compiler_class(struct compiler *c, stmt_ty s)
 
     RETURN_IF_ERROR(compiler_decorators(c, decos));
 
+    expr_ty first_deco;
     int firstlineno = s->lineno;
+    int firstcol_offset = s->col_offset;
     if (asdl_seq_LEN(decos)) {
-        firstlineno = ((expr_ty)asdl_seq_GET(decos, 0))->lineno;
+        first_deco = (expr_ty)asdl_seq_GET(decos, 0);
+        firstlineno = first_deco->lineno;
+        firstcol_offset = first_deco->col_offset;
     }
     location loc = LOC(s);
 
@@ -2605,7 +2619,8 @@ compiler_class(struct compiler *c, stmt_ty s)
             return ERROR;
         }
         if (compiler_enter_scope(c, type_params_name, COMPILER_SCOPE_TYPEPARAMS,
-                                 (void *)type_params, firstlineno) == -1) {
+                                 (void *)type_params, firstlineno, s->end_lineno,
+                                 firstcol_offset, s->end_col_offset) == -1) {
             Py_DECREF(type_params_name);
             return ERROR;
         }
@@ -2688,7 +2703,8 @@ compiler_typealias_body(struct compiler *c, stmt_ty s)
     location loc = LOC(s);
     PyObject *name = s->v.TypeAlias.name->v.Name.id;
     RETURN_IF_ERROR(
-        compiler_enter_scope(c, name, COMPILER_SCOPE_FUNCTION, s, loc.lineno));
+        compiler_enter_scope(c, name, COMPILER_SCOPE_FUNCTION, s, loc.lineno,
+                             loc.end_lineno, loc.col_offset, loc.end_col_offset));
     /* Make None the first constant, so the evaluate function can't have a
         docstring. */
     RETURN_IF_ERROR(compiler_add_const(c->c_const_cache, c->u, Py_None));
@@ -2723,7 +2739,8 @@ compiler_typealias(struct compiler *c, stmt_ty s)
             return ERROR;
         }
         if (compiler_enter_scope(c, type_params_name, COMPILER_SCOPE_TYPEPARAMS,
-                                 (void *)type_params, loc.lineno) == -1) {
+                                 (void *)type_params, loc.lineno, loc.end_lineno,
+                                 loc.col_offset, loc.end_col_offset) == -1) {
             Py_DECREF(type_params_name);
             return ERROR;
         }
@@ -3001,7 +3018,8 @@ compiler_lambda(struct compiler *c, expr_ty e)
     _Py_DECLARE_STR(anon_lambda, "<lambda>");
     RETURN_IF_ERROR(
         compiler_enter_scope(c, &_Py_STR(anon_lambda), COMPILER_SCOPE_LAMBDA,
-                             (void *)e, e->lineno));
+                             (void *)e, e->lineno, e->end_lineno,
+                             e->col_offset, e->end_col_offset));
 
     /* Make None the first constant, so the lambda can't have a
        docstring. */
@@ -5769,7 +5787,7 @@ compiler_comprehension(struct compiler *c, expr_ty e, int type,
     }
     else {
         if (compiler_enter_scope(c, name, COMPILER_SCOPE_COMPREHENSION,
-                                (void *)e, e->lineno) < 0)
+                                (void *)e, e->lineno, e->end_lineno, e->col_offset, e->end_col_offset) < 0)
         {
             goto error;
         }

iritkatriel · March 21, 2024, 11:25am

I mean the AST of the source code. For your use case, you must have access to the source code, so you can calculate its AST.

blhsing · March 21, 2024, 11:30am

Again, the issue is, given a code object at runtime, how do you extract its defining source code?

AST from source code doesn’t help because it bears no direct relationship to a live code object.

pf_moore · March 21, 2024, 11:35am

You can’t because it may not even have “defining source code”. Consider a function loaded from a .pyc file that has no corresponding .py file. Or just a function defined in the REPL. The alternative answer is inspect.getsource - in what situation does that not give an answer where it could do?

iritkatriel · March 21, 2024, 11:47am

You wrote:

so, you want the code object to give you the location but presumably you have the source code. Otherwise how would you do this?

MegaIng · March 21, 2024, 12:02pm

See this slightly modified example:

import inspect
if 1:
    def a(_):
        return 1 + 2

b = (
    1 for _ in range(2)
)
c = lambda _: (
    1
)
print(inspect.getsource(a))
print(inspect.getsource(b.gi_code))
print(inspect.getsource(c))

This prints

    def a(_):
        return 1 + 2

    def a(_):
        return 1 + 2

c = lambda _: (
    1
)

First one is correct, second one is completely wrong, and third includes extra code that should be stripped away.

@blhsing is talking about improving getsource to work more reliably.

blhsing · March 21, 2024, 2:53pm

The possibility of source-less code objects doesn’t stop the stdlib from offering inspect.getsource. The point of this proposal is to improve the accuracy of the output of inspect.getsource for the vast majority of cases where the source code is available.

I think my original post can use an example for better clarity indeed. Please see my updated original post.

blhsing · March 21, 2024, 2:55pm

I need a way for inspect.getsource to give me the exact source code of a given live code object. I don’t need it to give me an entire statement that contains many other tokens when it could really just give me the very expression that defines the code object. Please also see my updated original post for a better example.

Also note that with my prototype, the example code in the OP would then output:

(
    i for i in range(1)
)
lambda x: x % 2
(
    i for i in range(1)
)

iritkatriel · March 21, 2024, 4:33pm

You’re not responding to my question: why can’t you get exact locations from the AST?

MegaIng · March 21, 2024, 4:36pm

How would you relate an AST node and a code object if you don’t have the exact location of the code object? Guesstimating is of course possible, but is probably always going to have edge cases.

iritkatriel · March 21, 2024, 4:51pm

You have the line number.

MegaIng · March 21, 2024, 4:54pm

Yes, but the line number is not enough information in cases of nested (or even just adjacent) codeblocks, like lambdas and/or generator expressions in the same line. That is the entire point of this suggestion, see the updated examples in OP.

iritkatriel · March 21, 2024, 6:01pm

You also have location information for what’s inside the body of the lambdas (which can be matched to locations of instructions in the code object), as well as the location of the lambda expression itself. For example:

>>> src = "a,b = lambda c: 123, lambda d: 456"
>>> pp(ast.dump(ast.parse(src), include_attributes=True))
("Module(body=[Assign(
  [snipped]

Lambda(
  args=arguments(args=[arg(arg='c', lineno=1, col_offset=13, end_lineno=1, end_col_offset=14)]),
  body=Constant(value=123, lineno=1, col_offset=16, end_lineno=1, end_col_offset=19), 
lineno=1, col_offset=6, end_lineno=1, end_col_offset=19),

Lambda(
  args=arguments(args=[arg(arg='d', lineno=1, col_offset=28, end_lineno=1, end_col_offset=29)]),      
  body=Constant(value=456, lineno=1, col_offset=31, end_lineno=1, end_col_offset=34), 
lineno=1, col_offset=21, end_lineno=1, end_col_offset=34)]

[snipped]
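A rough sketch of that kind of matching, under stated assumptions: the helper names `span` and `match` are made up, the source string is illustrative, and the heuristic simply keeps the Lambda nodes whose span contains every non-empty instruction position of the code object. It assumes Python 3.11+ for co_positions and falls back to returning all candidates otherwise. It is a heuristic, not how inspect works today, and a code object whose instructions all have empty positions would match every candidate.

```python
import ast
import types

src = "a, b = lambda c: c + 1, lambda d: d * 2\n"
tree = ast.parse(src)
# Candidate AST nodes, ordered by where they start on the line.
lambdas = sorted(
    (n for n in ast.walk(tree) if isinstance(n, ast.Lambda)),
    key=lambda n: (n.lineno, n.col_offset),
)
# The lambdas' code objects live in the module code object's constants.
module_code = compile(src, "<demo>", "exec")
codes = [c for c in module_code.co_consts if isinstance(c, types.CodeType)]

def span(node):
    """Return ((start line, start col), (end line, end col)) of an AST node."""
    return (node.lineno, node.col_offset), (node.end_lineno, node.end_col_offset)

def match(code):
    """Return the Lambda nodes whose span contains all non-empty positions."""
    if not hasattr(code, "co_positions"):  # Python older than 3.11
        return list(lambdas)
    body = [
        (l, el, c, ec)
        for l, el, c, ec in code.co_positions()
        # Skip entries with missing fields and zero-width placeholder entries.
        if None not in (l, el, c, ec) and (l, c) != (el, ec)
    ]
    return [
        n for n in lambdas
        if all(span(n)[0] <= (l, c) and (el, ec) <= span(n)[1]
               for l, el, c, ec in body)
    ]

for code in codes:
    print([ast.get_source_segment(src, n) for n in match(code)])
```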

BrenBarn · March 21, 2024, 6:32pm

In that example you are again working from the string source code. The question is can you get that information from the code object (or from the function object containing it).
