See Python Developer’s Guide which has guides for the grammar and compiler which should cover the parser.
I did a first pass over the parser documentation and the code.
- The Python grammar is so small and so LL(k) that any parsing approach will work, and be efficient.
- I see no reason for most of the parsing (not only `pgen`) be done with Python (and some sort of bootstrapping).
Questions (because I haven’t explored the code deeper):
- Is the `Parser/tokenizer.h` used outside of compilation? `Lib/tokenizer.py` seems to indicate it is not.
- Is the parse tree (`Include/node.h`) needed outside of compilation? It is `#include`d all over the place, but the docs say that the parse tree is an intermediate step towards the abstract tree.
That’s good news. I guess for the specific issue of recognizing keyword args in function calls and for recognizing assignment expressions, LL(2) would be sufficient. Are there situations where you think we’ll have to read ahead more than 2 tokens?
This confused me – are you saying that the parsing should be done with Python, or that it shouldn’t be done with Python?
A quick grep shows that it is not. (I think you meant `Lib/tokenize.py`.)
There is still some use for this API, and the intermediate parse tree is exposed to Python code by the (ancient) `parser` module. But since it is not considered stable (the structure of some nodes changes with each change to the grammar), and since we strongly encourage using the `ast` module instead, I think we can deprecate it quickly if we no longer have a need for it internally.
I do have one suggestion. If you’re planning a big overhaul of the parser, please don’t work on your own until you have a perfect patch ready. Nobody will want to take the risk of accepting that patch (since such a drastic change will inevitably break lots of unexpected things). Instead, I recommend that you work in the open, involve other core devs in planning the changes you are considering, and show all sorts of intermediate stages and prototypes of your work to the core dev community so we develop some expertise in understanding the new structure. Also, be prepared that not everybody is going to like change.
TL;DR
The `k` in LL(k) doesn’t matter any more, because we’re past the times of severe constraints on RAM. The `k` in ANTLR is `*`, as in LL(*) (really!), and that’s one of the best parser generators around. The `k` equivalent in PEG parsers is also `*`. It’s irrelevant whether the `k` required for good Python grammars and parsing is 2, 4, or 8 (`k` just means the maximum number of lookahead tokens allowed when making a parsing decision, so think `lookahead[:2]` vs `lookahead[:4]` in Python; it’s the same).
What part of the parsing?
The current workflow, with each step described as a `(runtime, name, source-language, target-language)` tuple, is:

1. `(Python, pgen, ebnf-variant-LL(1), PARSER)`: compile the grammar and generate a table-based LL(1) parser in C-libpython.
2. `(C-libpython, ?, Python, CST)`: parse Python and generate a Concrete Syntax Tree (CST).
3. `(C-libpython, ?, CST, AST)`: take a CST and produce an AST.
4. `(C-libpython, ?, AST, CFG)`: take an AST and produce a Control Flow Graph (CFG).
5. `(C-libpython, ?, AST+CFG, BYTECODE)`: generate bytecode.
Step 1 is only performed during `make`. The other steps must be bundled with CPython distributions.
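For concreteness, steps 2-5 can be observed from Python itself (the `parser` module exposes the step-2 CST; it was still available at the time of this discussion, and was later removed):

```python
import ast
import dis
import parser  # exposes the intermediate CST (deprecated, later removed)

source = "x = 1"

cst = parser.suite(source).tolist()     # step 2: Python -> CST
tree = ast.parse(source)                # steps 2+3: Python -> AST
code = compile(tree, "<demo>", "exec")  # steps 4+5: AST -> CFG -> bytecode

print(cst[0])          # grammar-symbol number of the CST root node
print(ast.dump(tree))  # Module(body=[Assign(...)])
dis.dis(code)          # the generated bytecode
```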
I think that the target of your quest is to change steps 1-3:

1. `(Python, ?, ebnf-variant, PARSER)`: compile a more expressive grammar and generate an efficient Python->AST parser.
2. `()`: gone…
3. `(?, ?, Python, AST)`: parse Python and generate an AST.
The `?` on step 3 is because it would be great if that and other steps were written in Python, but it’s not reasonable to target that nicety until there are measurements of how many MLOC/min are parsed currently, and some tests of how much performance we’d lose by not using `C-libpython` for parsing Python (Python runs in very tight environments, and users are used to its performance).
I’d opt for the output of step 1 to be a table-less, LL(k), top-down parser (see the last chapter of Wirth’s), as if it had been handcrafted, because that would meet every criterion of efficiency and maintainability/debuggability. It would be easy to understand how each parse function maps to a given grammar expression (generated comments would help), and the parser would be very efficient.
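To illustrate the intended style, here is a sketch of what such generated-but-readable output might look like for a toy rule (the rule and class names are invented; this is not actual pgen output):

```python
# Hypothetical generated output for the toy rule:
#     atom ::= NAME | NUMBER | '(' atom ')'
# One readable function per grammar rule: table-less, top-down, and
# traceable back to the grammar via generated comments.
class AtomParser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_atom(self):
        # atom ::= NAME | NUMBER | '(' atom ')'
        if self.peek() == "(":
            self.advance()                 # consume '('
            inner = self.parse_atom()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        return self.advance()              # a NAME or NUMBER token

print(AtomParser(["(", "(", "x", ")", ")"]).parse_atom())  # -> x
```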
As powerful and likable as PEG is, I’d discard it just because it requires discipline over the ordering of grammar rules and the semantics that ordering entails. PEG is more powerful, but LL(k) is much, much easier for everyone to understand. It’s better to place more effort on having a pretty grammar: one that can be both turned into an efficient parser and copied verbatim into the documentation.
I don’t dare estimate on which Python version this may be ready. A new Python->CST parser generator may be a required and reasonable first step, while the unwanted dependencies on CSTs are sorted out. Incremental is good.
Quick reply:
By this I meant that I literally couldn’t tell what your sentence said. Removing the parenthetical remarks I see “I see no reason for most of the parsing [XXX] be done with Python”. There’s a grammatical error at [XXX], and I can’t tell if you left out “to” or “not to”. I am literally just asking, did you mean by that “most of the parsing can be done in Python” or “most of it should not be done in Python”?
I’ll study the rest of your message carefully later – it looks like we’re mostly missing some common vocabulary (I haven’t kept up with current trends).
PS. I see another downside for large k in LL(k) – in the REPL I want the parser to emit an error as soon as possible – I don’t want to enter several lines only to find that there was a mistake in the first line.
Another issue: what can we do about rules with different complex alternatives to the left of a distinguishing token? E.g. we accept `<expr>` and `<var> += <expr>` and `<var>, <var> = <expr>` but not `<var>, <var> += <expr>`, where `<var>` is a nonterminal that could have an arbitrary number of tokens (since it could have the form `func(<expr>, ...).attr` or `dict[<expr>]`), and `<expr>` is an arbitrary expression.
I recall from my college days that LALR(1) can handle this (though when the input fits neither alternative, the errors are atrocious). In Python we currently do an additional check in a later pass, but this is suboptimal, since the grammar accepts bad code like `x+1 += 1` or `foo() += 1`.
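Both cases are observable from the REPL (exact messages vary across CPython versions):

```python
# Both fragments pass the LL(1) grammar; the SyntaxError is raised by a
# later validation pass, not by the parser proper.
for bad in ("x+1 += 1", "foo() += 1"):
    try:
        compile(bad, "<demo>", "exec")
    except SyntaxError as e:
        print(bad, "->", e.msg)
```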
Sorry for the barrage of replies here. I’ve now digested the rest of your post.
Yes, you’re right, I am after replacing steps 1-3. My intuition tells me that even optimized Python can’t beat C’s performance for steps 2-3, and people expect the parser to be really fast (e.g. Dropbox servers have millions of lines of code).
I am somewhat surprised that you propose to generate a recursive-descent parser rather than a table-based parser. The execution stack will expose us to new failure modes when parsing deeply nested code. (Execution stack overflow is harder to detect than overflow of a data structure – and CPython consumes several C stack frames for each Python stack frame.)
I do place a high value on a readable grammar (hence my experiments 30 years ago with a unique EBNF variant), so I’m glad we agree here.
I don’t know what just a new Python->CST generator would buy us. Yes, this would leave step 3 in place, which makes the task simpler (there’s a lot of code in step 3, which is the giant hand-written file `Python/ast.c`), but it limits our chances of improving the situation, since step 3 makes many assumptions about the precise structure of the CST (things like “if the node has 3 children, it must be of the form `<var> = <expr>`”).
I would be much happier with something that targets the existing AST (described by `Parser/Python.asdl`). We can just delete the `parser` module and the C API for the old CST, and damn the torpedoes.
What would it cost to have arbitrary backtracking in the parser? ISTM that’s the only way to really improve the grammar for things like `<expr>` vs. `<var> = <expr>`, where `<var>` and `<expr>` have a lot of overlap (possibly `<var>` is a subset of `<expr>`).
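For reference, “arbitrary backtracking” here is what PEG-style ordered choice does: try `<var> = <expr>` first, and rewind to re-parse the same tokens as a bare `<expr>` if it fails. A minimal sketch, with all names invented:

```python
class ParseError(Exception):
    pass

class Backtracker:
    """Skeleton of PEG-style ordered choice with unlimited backtracking."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def choice(self, *alternatives):
        start = self.pos
        for alt in alternatives:
            try:
                return alt()        # first alternative to succeed wins
            except ParseError:
                self.pos = start    # rewind: this is the backtracking
        raise ParseError("no alternative matched at %d" % start)

# usage inside a rule:
#     node = self.choice(self.parse_assignment, self.parse_expression)
```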
In short: I stated that in my opinion it is better if most or all of the parsing is done in Python, but that it is not practical to aim for that until there’s a cleaner first-stage parser and no dependencies on the concrete syntax tree.
The precision and timeliness of error reporting are unrelated enough to the `k` in LL(k) to be of no concern. It is precisely table-driven parsers that are prone to reporting that “something” happened “somewhere”. A top-down parser can be directed to report errors where it is most convenient for the user.
Modern PEG parsers (including TatSu) report the error that occurred furthest in the input stream, as that “missing semicolon” is what will make the parse of a whole method or class fail, and is what the user needs to know to fix the input.
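The technique is easy to sketch (invented names; TatSu’s actual implementation differs):

```python
class FurthestFailure:
    """Track the failure that got deepest into the input, and report that
    one, not the position where backtracking finally gave up."""

    def __init__(self):
        self.pos = 0
        self.expected = set()

    def record(self, pos, expected):
        if pos > self.pos:
            self.pos, self.expected = pos, {expected}
        elif pos == self.pos:
            self.expected.add(expected)

    def message(self):
        return "at position %d: expected %s" % (
            self.pos, " or ".join(sorted(self.expected)))
```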
I didn’t mention it in the previous message, but yes, it would be good if the generated, handcrafted-like parser were Turbo Pascal-like, and failed as fast and as accurately as possible on the first parse error.
Emphasizing: the `k` is unrelated enough to good error reporting and to time complexity, and the `k` for Python will be 2-4 (or `*`) as required.
p.s. Yes. Fixing the `k` to a low value will keep the language simple for humans to parse.
It’s a killer to try to resolve everything about parsing at the syntactical level, as many, many rules of a language are semantic (think of what it takes to render a Web page).
The Python->AST part of the parse must preserve enough `#line` information so that a semantic phase can reject a parse at a precise position in the input.
Yes?
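For what it’s worth, CPython’s AST already carries position information suitable for this:

```python
import ast

# Every AST node records where it came from in the source.
call = ast.parse("x = foo(1, 2)\ny = bar()\n").body[1].value
print(call.lineno, call.col_offset)  # 2 4 -- points exactly at bar()
```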
Sure, so you think that’s not possible. I’m disappointed. Hopefully we can at least do better for `NAME := <expr>` vs. `NAME = <expr>` vs. `<expr>` in argument lists.
Old myths there?
The depth of the stack required for the parse is proportional to the depth of the nested constructs in the source (Python) input, not to the parser’s design or structure (think xFAs with stack memory), and Python discourages deep nesting (although `module`, `class`, `method`, `for`, `if` is already >4 levels deep).
The depth is the same whether it is kept on the call stack or in a separate data structure. Using the C call stack to keep at least part of the transient parse information should be much, much more efficient and easier to manage than using `libpython` Arenas, etc.
The heap memory required in a top-down parser will be for delivering the resulting AST, as the call chain’s memory goes away after a successful parse. I hadn’t thought about it till now, but it could be made so that heap memory is only consumed on success…?
As I recall, converting a grammar in one grammar language to another has complexity close enough to `length-of-input` that having a pretty grammar for Python should be a must, independently of the other decisions being discussed.
IOW, Python can have its “official” grammar language, which gets translated into the grammar language of the first-stage tool as step 0, if need be.
Problem is that people sometimes generate code that is insanely nested, e.g. as an attack or because their code generator is data-driven and the data is a large graph. We occasionally get crash reports from people who do things like `eval('('*1000 + ')'*1000)`.
(And yes, I understand that the stack depth is the same regardless of the parser implementation. My point is that if the parser uses the C call stack it can lead to crashes – e.g. on some systems threads are given only 64k of call stack space – while the current parsing engine doesn’t consume C call stack space: its stack is all heap data structures.)
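One common mitigation (not necessarily what CPython would adopt) is an explicit depth guard in the generated functions, so pathological inputs fail with a clean SyntaxError instead of overflowing the C stack. A sketch, with invented names:

```python
MAX_PARSE_DEPTH = 200  # illustrative limit; would be tuned per platform

def guarded(parse_fn):
    """Wrap a recursive parse function with an explicit depth check."""
    def wrapper(self, *args):
        self.depth += 1
        if self.depth > MAX_PARSE_DEPTH:
            raise SyntaxError("too many nested constructs")
        try:
            return parse_fn(self, *args)
        finally:
            self.depth -= 1
    return wrapper
```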
It’s a safe step to do Python->CST first, because there’s no deadline.
As to backtracking, it is implicit in the lookahead in LL(k), up to `k` tokens.
That said, I beg you to forget about backtracking and think about clear Python programs. PEG can easily parse COBOL, Java, NaturalAG, … (haven’t tried C++) because of unlimited backtracking, but those languages verge on the syntactically absurd (my apologies).
Python has done well with a table-based LL(1) parser so far, and just LL(2) is another dimension. Seriously.
The idea of converting the “official” grammar into the stage 1 tool input is neat. Though I’m not sure how it would deal with pretty grammars that are ambiguous (as the grammar shown in the docs occasionally is).
Why would one care where the reporting of an error comes from if it is timely and accurate?
Again, think about an HTML grammar that only allowed `<td>` after `<tr>` after `<table>`, etc. It would be a nightmare.
And as to the documentation, it would likely be the simplest grammar fragment, with the semantics of it explained in plain words.
I hadn’t considered ambiguity in the grammar so far. I don’t remember any in Python. Can you point to a fragment in the current grammar that is ambiguous?
Ambiguity has a formal definition. Practically, it is dependent on the semantics of the grammar language/parser-generator. If it is deterministic, it’s not ambiguous.
Because in the current scheme it requires a lot of effort to do the “pass 2” checks for things that can’t be expressed in the grammar.
From most Python users’ POV, there are only two places where errors are detected: (a) syntax, (b) runtime. And for any type of error it’s always entirely clear in which category it falls. So e.g. `len(2+2)` is a runtime error, and `foo(]` is a syntax error.
Now `foo(1=2)` is considered a syntax error, but in our grammar it’s not an error. This is painful. (And you can tell the difference because we lose the column number, so the formatting of the error is different.)
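The difference is visible in the error objects themselves (behavior as of the CPython versions under discussion; messages vary by version):

```python
# foo(]    fails in the parser proper and keeps its position;
# foo(1=2) is rejected in a later pass, where the column is lost.
for src in ("foo(]", "foo(1=2)"):
    try:
        compile(src, "<demo>", "exec")
    except SyntaxError as e:
        print(repr(src), "->", e.msg, "| offset:", e.offset)
```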
That has a parse depth of about 6? (parentheses plus operators by precedence) It may crash the interpreter at runtime, yeah, but certainly not a parser.
Parse depth is control-structure depth, plus expression-nesting depth. It is the same depth if it’s on the stack or on the heap, and Python guidelines encourage low depths. The depth will not increase by going from LL(1) to LL(2), nor by going from table-based to runtime-stack-based.
Is it not reasonable to require that people working on embedded systems not write deeply nested Python?
Understood!
Those cases should be handled syntactically (that particular case is a non-assignment within a function call, so…).
The grammar may get a little more complicated as part of this effort, as the above case is usually rejected syntactically by making `assignment_expression` different from, and dependent on, `expression`, and by only allowing `expression` on right-hand sides, like function arguments.
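A sketch of that split (rule and helper names invented): roughly `argument ::= NAME '=' expression | expression`, where `expression` can never derive `=`, so `foo(1=2)` fails inside the parser, at the offending token:

```python
from collections import namedtuple

Token = namedtuple("Token", "type value")

def parse_argument(tokens, parse_expression):
    """argument ::= NAME '=' expression | expression

    Because `expression` cannot derive '=', a bad keyword argument is
    rejected right here, with the column still known."""
    if (len(tokens) >= 2 and tokens[0].type == "NAME"
            and tokens[1].value == "="):               # an LL(2) decision
        return ("keyword", tokens[0].value, parse_expression(tokens[2:]))
    return ("positional", parse_expression(tokens))
```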
This is an (untested) PEG grammar for Python translated by TatSu from the Python LL(1) grammar for ANTLR4 written by Bart Kiers.
Just examples of what can be done. I already argued against PEG or LL(*) for Python (as those allow for really weird stuff).