I’m going to have to ponder all of this. In the meantime if you want to work on a specific design please continue the conversation!
Disappointed how? Why?
There are two different NAME = <expr> constructs in Python:
- An assignment
- A kwarg
A top-down parser will know the context and be able to resolve them, because assignments on right-hand sides (wisely) require := to be unambiguous.
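A quick illustrative snippet (the names are made up) showing the same surface syntax resolved by context:

data = ['a', 'bb', 'ccc']
result = sorted(data, key=len)  # NAME = <expr> as a kwarg inside a call
key = len                       # NAME = <expr> as an assignment statement
if (n := len(data)) > 2:        # assignment in expression position requires :=
    print(n)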
Off the top of my head (because I have no proof but my own experience), Python is so small, and so functional, that most of it can be resolved syntactically over right-hand-side expressions.
There’s probably no ambiguity in the language because core devs have rejected every syntactic and/or semantic change that would produce even a seemingly ambiguous construct in Python when seen from a human perspective (that’s how we got to :=).
On my part, I’m unwilling to commit to any work until the discussion about the desired changes is stable, in part because I’ve been doing PEG (to be able to parse ambiguous COBOL and/or Java), and Python is small and certainly LL(k<4), which is something efficiently, cleanly, and clearly approachable with top-down parser generators, handcrafting, etc.
Python-as-a-formal-language is so small that an intern or TA should be able to do the experimental work on the syntax in a month or so.
The aim, as I understand it, is to clean up the syntax and the tooling required for bootstrapping a program written in Python. Python-as-a-programming-language will remain defined by the semantics in libpython.
The Python expression containing the eval() call is, yes, but the eval() function will have to compile the string passed in. Guido is referring to that string being parsed; that string contains the 1000 levels of nesting.
Please explain to me one thing. This is valid:
foo().bar = 1
but this is not:
foo() = 1
OTOH this is also valid:
foo()
How would you structure the grammar? We currently use (I simplify):
expression_statement: expression ('=' expression)*
and then pass 2 has a function that restricts any expression on the left of '=' to disallow things that aren’t assignment targets.
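A rough sketch of what such a pass-2 restriction can look like, using today's ast module for illustration (the helper name is made up, not CPython's):

import ast

def check_target(node):
    # Accept only expressions that can be assignment targets.
    if isinstance(node, (ast.Name, ast.Attribute, ast.Subscript, ast.Starred)):
        return
    if isinstance(node, (ast.Tuple, ast.List)):
        for elt in node.elts:
            check_target(elt)
        return
    raise SyntaxError(f"cannot assign to {type(node).__name__}")

check_target(ast.parse('foo().bar', mode='eval').body)  # ok: attribute target
# check_target(ast.parse('foo()', mode='eval').body)    # raises SyntaxError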
I was disappointed because you appeared to say this would still require a pass 2 check.
Assignments are statements, not expressions, so cases like the above get resolved with:
assignment_statement = lhs_expression (',' lhs_expression)* ('=' expression)+
expression_statement = expression  # does not include assignments
And lhs_expression gets restricted to the subset of expressions that produce assignables (left-hand sides).
A more interesting case, but one that gets dealt with similarly, is the new := assignment expression:
assignment_expression = lhs_expression ':=' proper_precedence_expression
What I meant is that some errors can only be detected semantically, either with a semantic pass or at runtime. Here the parser considered the assignment syntactically valid both times:
In [1]: x = list(range(3))
In [2]: x[1] = 'ok'
In [3]: x = tuple(range(3))
In [4]: x[1] = 'ok'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-e360ee0163fb> in <module>
----> 1 x[1] = 'ok'
TypeError: 'tuple' object does not support item assignment
With typing and type inference making their way into the language, the possibility opens of detecting errors like the one above at compile time.
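For instance, a static checker such as mypy can already flag the tuple case above without running the code (the exact message may vary by version):

x = tuple(range(3))
x[1] = 'ok'  # mypy: error: Unsupported target for indexed assignment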
We’ll know exactly only when the grammar is written and tested. I just thought that because of the idiosyncrasies of Python we must handle a variety of cases like this:
In [1]: (a) = 1
In [2]: a
Out[2]: 1
The worst that will happen, in my experience, is that parts of the grammar may look “repeated” to the uninitiated, as one of the normal ways to resolve all of the above is to create separate sub-grammars for left-hand-side and right-hand-side expressions (with as much reuse as possible).
To answer your concern: Yes. Everything that can be reasonably resolved syntactically should be resolved syntactically, as that produces the fastest and earliest error reporting.
The semantic validity and meaning of some syntactically-valid expressions and statements cannot be resolved syntactically:
In [1]: [a] = [2]
In [2]: a
Out[2]: 2
In [3]: [1, [a]][1] = [3]
In [4]: a
Out[4]: 2
In [5]: [a, [b]] = [4, [5]]
In [6]: a, b
Out[6]: (4, 5)
OK, then one level up we have
statement = assignment_statement | expression_statement | ...
It would seem the parser would require full backtracking capability in order to tell whether something that starts with e.g. foo(expr1, expr2, ...).bar is an assignment_statement or an expression_statement – it won’t know until it sees either the '=' token or the end of the line (or ';').
Actually, we solved that one in PEP 572 by restricting the LHS to a single NAME token.
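For reference, the effect of that restriction in Python 3.8:

if (n := 10) > 5:   # ok: single NAME target
    print(n)
# (x.y := 10)       # SyntaxError: cannot use assignment expressions with attribute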
Errors that are currently detected at runtime are out of scope. The parser must work without any knowledge of the values or types of identifiers occurring in the code. But my holy grail is avoiding the need for a pass 2 that does additional syntactic checks.
That’s the current situation, and one that defeats my other holy grail of a simple grammar that is easy to understand for humans. The assignment/expression issue above is actually an example: writing it that way would be the easiest to understand, but IIUC no k for LL(k) will suffice (since the function argument list can be arbitrarily long, as can other subexpressions of lhs_expression).
Of course, that’s why it’s out of scope. (Maybe that’s something you have to explain to your students, but it feels to me like you’re lecturing me.)
Oops! My apologies. I missed that!
There will need to be a depth limit in the parser. As I mentioned before, the same is true of the stack that a table-driven parser requires, as it is impossible to know what the nesting means until it has been fully parsed:
from random import randint

def bad_exp(n):
    pre = ''.join(f'{randint(2, 5)} * (' for i in range(n))
    pos = ')' * n
    return f'({pre}1 + {n}{pos})'

print(eval(bad_exp(1000)))
Currently:
s_push: parser stack overflow
Traceback (most recent call last):
File "/Users/apalala/tmp/p.py", line 12, in <module>
print(eval(bad_exp(1000)))
MemoryError
It seems that the maximum depth that works in Python 3.7 is 91?
print(eval(bad_exp(91)))
Even if the parser was a table-driven LR(1), which wouldn’t require the deep stack, the resulting AST for a deeply nested construct may still blow some memory limit.
But no worse than some shallow construct with a lot of children. The point here is that we can implement heap data structures that only run out of memory when the whole process runs out of memory (e.g. at 2 GB, or at 64 GB if you have it :-), while the C call stack (and the Python call stack, in CPython) are limited by the OS in a way that is hard to control (and even hard to test for without resorting to OS- and CPU-specific code).
Of course, the bytecode generator currently also recurses over the AST using the C call stack – but we could fix that if we wanted to. I don’t want our new parser technology to be limited by the C call stack, especially not in a way that can cause segfaults when confronted with outsize input.
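A small sketch of that idea (illustrative only): traversing a deeply nested AST with an explicit list as the stack, so the limit is heap size rather than the C or Python call stack:

import ast

def count_nodes(tree):
    stack = [tree]  # grows on the heap, not the call stack
    count = 0
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(ast.iter_child_nodes(node))
    return count

print(count_nodes(ast.parse('(1 + (2 + (3 + 4)))')))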
I agree. If the AST is constructed in a functional style (each parsing function returns an AST composed of the ASTs returned by the functions it calls), then the only parameter needed for parsing functions, if the language remains LL(k), is pos, the current position in the input stream, and that gives room for a lot of nesting in the C stack.
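A minimal sketch of that functional style, for a toy grammar (all names here are illustrative, not a proposal for CPython):

# expr := term ('+' term)* ; term := NUMBER | '(' expr ')'
def parse_expr(tokens, pos):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == '+':
        right, pos = parse_term(tokens, pos + 1)
        node = ('add', node, right)  # AST composed from the child ASTs
    return node, pos

def parse_term(tokens, pos):
    if tokens[pos] == '(':
        node, pos = parse_expr(tokens, pos + 1)
        assert tokens[pos] == ')'
        return node, pos + 1
    return ('num', int(tokens[pos])), pos + 1

print(parse_expr(['2', '+', '(', '3', '+', '4', ')'], 0))
# (('add', ('num', 2), ('add', ('num', 3), ('num', 4))), 7)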
That too requires only a single parameter (the pointer to the current AST) per nested call.
For the freaky “1000 nested parenthesized expressions” example, it would be one word per nesting level, and perhaps another word for context, or 2000 words of RAM on the C stack, about 16 KiB at 8 bytes per word?
Most C compilers put local variables there too, so the needed memory might be a lot more than two words per call nesting level.
That’s true. But I’d rather not speculate, and instead build something and measure. The code generation strategy can make an effort to reduce C-stack use. Off the top of my head, Python has at most ternary operators, and everything else can be held in a list or a tuple. bad_exp, for example, should require two additional words per level to hold the left and right operands before returning.
As in the example above, Python 3.7 currently supports a depth of 91 for bad_exp, so a new parser can start with something like that.
Keep in mind that when designing the tooling for a single language (instead of a generic compiler-compiler), one is free to make compromises and place restrictions on the source grammar and/or on the generated parser. It will be fine if what gets built is capable of parsing Python efficiently, and nothing more (given any LL(k) grammar, the lookaheads required for decisions in a rule may appear in totally different rules; not so in Python).
You started speculating. Feel free to measure.
I’ll try to find some time on weekends to help with solutions.
I studied the current parser more, and my impression is that everything’s tightly bound. There are validations taking place in every part of the process, so I think that getting rid of the syntax tree to go directly to the AST is probably not a goal achievable in the short term.
As to the type of parser, the current, table-driven, LL(1) parser will be hard to beat, because Python, though small, is expression-based, and things get nested. LL is capable of dispatching the next parse level by examining a few lookahead tokens.
I examined the Comparison of parser generators page on Wikipedia looking for something that could help bootstrap an LL(2<k<4) parser in C for libpython, and there are abandoned tools, and few of them target C.
Something I thought of while experimenting with the Python grammar is that things would be easier downstream if there was at least one grammar rule per AST node type. Things would also be easier if the first stage named the elements of interest in each rule (a good way to bypass the syntax tree).
I’ll take another look at things next weekend.
Thanks for considering this.
Can you give examples of this? Unfortunately every change in the CST requires corresponding changes in the ast.c file, which is large and pretty fragile.
This is an example of a rule that maps to several AST node types (there are several others):
expr_stmt: testlist_star_expr (annassign | augassign (yield_expr|testlist) |
           [('=' (yield_expr|testlist_star_expr))+ [TYPE_COMMENT]] )
In that example, having one rule lead to more than one AST node type is probably a consequence of the LL(1) parser.
One of the strategies to parse directly to AST is to have a map of what AST node type each rule in the grammar corresponds to. You end up with more rules than there are AST node types, but the logic for assembling the AST becomes much simpler.
Changes in the CST require changes in ast.c because there is a CST! With rules matching AST node types, the result of a parse step would be the correct AST node.
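A rough sketch of that mapping strategy (illustrative names, not CPython's):

import ast

# One entry per grammar rule; having more rules than node types is fine.
RULE_TO_NODE = {
    'assignment_statement': ast.Assign,
    'augmented_assignment': ast.AugAssign,
    'annotated_assignment': ast.AnnAssign,
    'expression_statement': ast.Expr,
}

def make_node(rule_name, **fields):
    # The result of a successful parse step is directly the right AST node.
    return RULE_TO_NODE[rule_name](**fields)

stmt = make_node('expression_statement', value=ast.Constant(value=1))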
As to the measuring we talked about, I think that this rough measure means that the current parser will be hard to beat:
~/cpython$ wc -l **/*.py
...
782154 total
~/cpython$ time python3.8 -O -c 'import compileall; compileall.compile_dir(".", force=1)' > /dev/null
real 0m5.824s
user 0m4.410s
sys 0m1.038s
I’ll keep exploring ideas, time permitting. PEG provides for flexible and clear grammars and easy-to-debug/maintain parsers, and it doesn’t require separate tokenization. Yet PEG parsing requires memoization to be performant. Could not building a CST compensate for the cost of memoization, since the memos would already be AST nodes? A good experiment would be to produce a separate Python->AST parser with PEG just to look at the times and memory; if they’re not way off, it could be an approachable strategy.
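A toy illustration of the packrat idea (assumed structure, not a proposal for CPython): caching each (rule, position) result keeps backtracking linear, and if rules built AST nodes directly, the memo table would already hold AST fragments:

def memoize(rule):
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(self, pos)
        return self.memo[key]
    return wrapper

class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.memo = {}

    @memoize
    def number(self, pos):
        if pos < len(self.tokens) and self.tokens[pos].isdigit():
            return ('num', int(self.tokens[pos])), pos + 1
        return None, pos

p = ToyParser(['42'])
print(p.number(0))  # (('num', 42), 1)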
I missed PackCC in my previous search. I’m leaving a note here: the tool has the correct set of features, and it’s cool that it’s implemented as a single .c file. The license is MIT: PackCC download | SourceForge.net