Ast library - reading the statements in an ordered manner

Idea

To have a function in the ast standard library to iterate only the nodes of type “statement” in an ordered manner, which basically would mean to easily iterate the nodes representing the codelines.

Motivation

The ast library allows to use ast.walk to traverse the tree and “unroll” all its nodes. There is also NodeVisitor, and they are really useful.

But there can be some use cases in which one may want to just iterate through the nodes of type “Statement”, i.e. the actual codelines, in an ordered manner.

For example, at times I shall develop some simple static code analyzers and I want to iterate over the lines of a function in an ordered manner.

Now, as a regular developer I would say that this is my need and I would develop a function like the one proposed and use it for myself.
Still, the function I’m proposing came out after some experience of mine working with the ast library, but at the beginning I mainly relied on iterating through statements in the .body attribute that several types of nodes have. Which I personally consider a wrong practice, since there can be other attributes for some node types which lead to lists of statements, like attributes representing else or finally inside ast.Try types of nodes.

So I would propose the function below, providing developers who approach the Abstract Syntax Tree for the first time with a standardized function to iterate statements/codelines.

Proposal

A function like the following one:

def iter_stmts(node: ast.AST) -> list[ast.stmt]:
    stmts: list[ast.stmt] = [subnode for subnode in ast.walk(node) if isinstance(subnode, ast.stmt)]
    stmts.sort(key = lambda n: n.lineno)
    return stmts

which would actually take a node from the Abstract Syntax Tree and return all its descendants of type ast.stmt once sorted.

Considerations

I propose this but of course I have some doubts on my own. I mean adding additional stuff to a codebase could have a price, it’s always additional code to mantain.

I wouldn’t have a clear idea for the name of the function, something like iter_stmts, iter_codelines or iter_statements.

Could a function like this be useful to other developers? I don’t know, but I use it across some projects both personal and both for work, so speaking for myself I use this.

It would be useful also for me if somebody really expert in the ast library evaluated this function for me.

I’m sharing this idea of mine in case it could benefit other people and, as I already said, developers who are approaching the ast and have a similar use case, before they just iterate over .body forgetting about other attributes. I feel like a standardized function could be useful, but maybe I’m wrong.

Thanks in advance to the people who are reading this, and for all the work done on the Python programming language.

You can build such a utility yourself if you want, but I strongly doubt that “iterate through all statements, ignoring context” is all that common of a requirement…

I am honestly confused what your usecase for this is? (not like “iterate over all statements”, but why you want to do this)

2 Likes

You can crawl a tree and sort the statements in it by line number, if you do want this.

I’m also curious as to why you might want this – perhaps there are lints that are easier to write like this, but I’ve written a lot of linting code and not run into it. Maybe I’m habitually ruling things out because I’m not considering the possibilities!

Regarding the use case, I sometimes want to track assignments inside a function, to “keep track” of the variable’s type while I read my code.

Of course while doing this I make all the necessary assumptions regarding the functions, so I htpothetize them to be “simple” (no dynamic type changes across iterations of a for loop, no nested function definitions and so on).

I totally comprehend if this is considered some “not that important” type of function, I was just wondering if it could have been useful.

The implementation I proposed I think perfectly represents the suggestion, it just selects all the statements coming from walk and then sorts them by line number.

Thank you very much for the attention!

Aha yeah. That is exactly the kind of assumptions I think the stdlib should not further encourage. If this works for you, great, but if anything, more complex and correct helpers might be useful for the stdlib (e.g. “all assignments in this scope”, which also tracks function boundaries, nonlocal, global, read-only accesses, …)

4 Likes

Sorting is relatively inefficient. You don’t need to sort the statement nodes by line number if you don’t use ast.walk in the first place.

Statements follow the order of a depth-first traversal, while ast.walk is implemented with a breadth-first traversal.

You can simply implement iter_stmts as a recursive generator instead:

import ast
from itertools import chain

def iter_stmts(node):
    if isinstance(node, ast.stmt):
        yield node
    yield from chain.from_iterable(map(iter_stmts, ast.iter_child_nodes(node)))

with open(__file__) as file:
    for stmt in iter_stmts(ast.parse(file.read())):
        print(stmt.lineno, ast.unparse(stmt).splitlines()[0])

This outputs:

1 import ast
2 from itertools import chain
4 def iter_stmts(node):
5 if isinstance(node, ast.stmt):
6 yield node
7 yield from chain.from_iterable(map(iter_stmts, ast.iter_child_nodes(node)))
9 with open(__file__) as file:
10 for stmt in iter_stmts(ast.parse(file.read())):
11 print(stmt.lineno, ast.unparse(stmt).splitlines()[0])
1 Like

Thank you very much for the code, but I have a doubt on the ordering: when you use ast.iter_child_nodes(node) in the map, is there any guarantee that the nodes returned by iter_child_nodes are ordered?

I mean, let’s say that the statements are in the .body list ok, they are going to be ordered. But let’s say that we go through a node like ast.Try, which could have several attributes being lists of “substatements”, such as .body, .orelse, .handlers and .finalbody. Is it guaranteed that iter_child_nodes will return these children in the same order as they appear in the code.

Honestly I have no idea, and that’s why I’m asking.

In the docs of the function:

it do not see any mention regarding the ordering of the children

Yes. Note that there is a strict ordering of these attributes, defined by the grammar: body, handlers, orelse, finalbody. This is true for all statements. If you find that iter_child_nodes doesn’t respect this ordering, that would be a bug in the function.

1 Like