Switch Python's parsing tech to something more powerful than LL(1)

guido · November 4, 2018, 5:13pm

It’s election year! Here’s a controversial trial balloon.

There are a number of places in Python’s parser that really aren’t LL(1). E.g. I wish we could have a set of rules describing simple statements as follows (simplified):

simple_statement : expression | assignment
assignment: target '=' expression
target: atom trailer*
atom: NAME | literal | '(' expression ')
trailer: '[' expression ']' | '(' expression (',' expression)* ')' | '.' NAME
expression: ...

Unfortunately the FIRST set for target and expression overlap and the rule for simple_statement is invalid – we have to replace it with

simple_statement: [expression '='] expression

and then in a second pass of the parser we analyze the expression to the left of the ‘=’ to ensure that it doesn’t contain illegal things like function calls or operators. (To see for yourself, observe the difference between these:

>>> f() = 1
  File "<stdin>", line 1
SyntaxError: can't assign to function call
>>> f/ = 1
  File "<stdin>", line 1
    f/ = 1
       ^
SyntaxError: invalid syntax
>>>

There are a number of other examples too (e.g. keyword arguments).

So I propose to look for a better parsing technique (maybe LALR(1) or LR(k)) that can handle such grammargs directly.

Thoughts? Suggestions? Anyone with understanding of current compiler-compiler technology interested in working on this? I think the output of the compiler (an AST) could remain the same, so the API of the ast should not have to change, but the parser (already not that loved) module would probably have to be replaced or just killed off.

vsajip · November 4, 2018, 8:33pm

What about LL(k)? I believe Jython2 uses JavaCC which uses LL(k), and Jython3 uses ANTLR which also uses LL(k).

njs · November 4, 2018, 9:58pm

The two-pass handling needed to catch f() = 1 is definitely ugly, but it does produce a better error message here :-). I don’t have an opinion on whether or not to change the parsing tech, but if we do then I hope high-quality error messages will be given serious weight when picking the replacement.

Alex_Martelli · November 5, 2018, 5:28pm

If we’re considering changing the grammar, it makes me wonder whether we could also consider a recent suggestion by Peter Norvig. Peter says:
“”"
The sequence

      atom binop unaryop atom

has exactly one parse, regardless of the precedence of the operators.
“”"
and is consequently annoyed that e.g. 1 + not 0 is a syntax error. Could it be OK in this case to allow an “incompatibility”… by giving its obviously-intended meaning to a sequence that is currently a syntax error?

guido · November 5, 2018, 6:04pm

consider a recent suggestion

Sorry, that seems off-topic for this thread. My proposal is not to change the language but just the implementation of the parser. (Of course you can start a new thread with this idea.)

Could someone with knowledge of modern parsing systems explain LR(k) and LL(k)?

pablogsal · November 5, 2018, 9:36pm

There is also the fact that LL(1) grammars prevent simple implementations of parenthesised with statements. The simple rule:

with_stmt: 'with' ( with_item (',' with_item)* | '(' with_item (',' with_item)* [','] ')' ) ':' suite

is not “valid” as FIRST(with_item) == FIRST( '(' with_item (',' with_item)* [','] ')') = '('. I said that is not “valid” because the parser still has certain degree of freedom to disambiguate simple cases. In particular it can prioritize some paths in the DFA. For example, is able to parse this last rule without problems. In particular, it generates:

DFA for with_stmt                                                                                                                                 [512/2103]
  State 0
    'with' -> 1
  State 1
    '(' -> 2
    with_item -> 3
  State 2
    with_item -> 4
  State 3
    ',' -> 5
    ':' -> 6
  State 4
    ')' -> 7
    ',' -> 8
  State 5
    with_item -> 3
  State 6
    suite -> 10
  State 7
    ':' -> 6
  State 8
    ')' -> 7
    with_item -> 4
  State 9
    ',' -> 5
    ':' -> 6
  State 10 (final)
  State 11
    ')' -> 7
    ',' -> 8

It works because the transition from State 1 into a “(” is going to prioritize the path:

0 -> 1 -> "(" -> 2

instead of

0 -> 1 -> with_item -> 3

This parses correctly:

         with (manager() as x, manager() as y):
             pass
         with (manager() as x, manager() as y,):
             pass
         with (manager()):
             pass
         with (manager() as x):
             pass
         with (((manager()))):
             pass
         with ((((manager()))) as x):

but not this one:

         with (((manager()))) as x:

Also, one bigger problem is that it will never parse correctly with (yield something) as x as the hard limitations of the freedom to choose a particular branch are very limited. This situation may be possible to solve using a similar approach as with the keyword argument (argument: ( test [comp_for] | test '=' test | '**' test | '*' test )) but it feels that the LL(1) grammar is limiting a lot of cases like these.

Unless I am missing something, the Python grammar also has a nice property that is that is a ε-free LL(1) grammar. This means that it also an SLR(1) grammar and an LR(1) so it can be parsed by an SLR(k) or LR(k) parser. One of the reasons LR(1) parsers are more powerful is because they defer the recognition of which production is being chosen until the production is complete (to be a bit more precise, they stop when 1 token of lookahead after the end of the production has been inspected), so this allows more complex grammars to be parsed by them than LL(1). They are also not very complex to implement and they run in linear time, but they can consume more space. LARL parser is usually based on LR parser but consuming less memory at the cost of not able to parse some subset of grammars.

I think that using an LR(k) parser will allow simplifying a lot all of these bad edge cases in the current grammar and it will allow implementing others like the parenthesized with statements in a much easier way. It will also involve a significant amount of work but maybe we can leverage yacc or bison (both LARL parser generators) to do the heavy lifting. These are just my impressions, maybe some expert in modern parser systems could give more insight into the fine details and what the best strategy would be given the current properties of the Python grammar.

ambv · November 7, 2018, 10:48pm

I like the idea a lot. This doesn’t sound like a small project but there’s definitely nice things to be gained (like parentheses in the with statement, parsing f-strings in the same pass, removing the necessity for parentheses around yield from expressions).

jackjansen · November 11, 2018, 12:04am

There is a side effect of LL(1) that I personally really like: it keeps languages readable. People tend to read left to right (and I think this also hold for people whose native language is right to left when they’re reading code) and LL(1) and variants push the language design people towards putting the branch point in the grammar on the left side.

This helps reading the code later: you don’t get into a situation as often where you have to restart reading because you’re suddenly seeing a symbol that changes the meaning of everything that came before.

guido · November 21, 2018, 12:38am

Yes, I’ve always used this argument to defend the crappy LL(1) parser. But I think this benefit has run its course. I somehow don’t think that at this point the (in)abilities of the parser will be much of a factor when new language features are considered – the PEP process and the weight of any change to the language will be sufficient to keep rampant new features from destroying the language’s elegance.

rbtcollins · November 21, 2018, 2:04am

I think its well worth doing; my personal fav parser tech these days
is OMeta, though PEG which it builds from is also very nice.

I don’t have time to offer to contribute such a (set of) patches though, sorry.

-Rob

guido · November 26, 2018, 4:23pm

I think what we need here most is someone with the guts, time and energy to make it happen.

njs · November 27, 2018, 8:45pm

Query for the parsing wonks here: would an alternative parser make it more viable to support “context-dependent keywords”, like async/await were in 3.5/3.6, and like JS uses for all new keywords?

I’m wondering because @guido raised the concern that inability to add new keywords might become a blocker for pattern-matching syntax. And in retrospect, making async a keyword in 3.7 turned out to be pretty painful.

pablogsal · November 27, 2018, 8:54pm

Only if the grammar is extended to a non context free one. One thing is that we use a non LL(1) parser to parse an LL(1) grammar, which basically translates into less hacks because we have a more powerful parser but a very simple grammar still. LL(k) grammars have the very nice property of being context free, which is where a lot of the simplicity emanates because the parser does not need to keep state. Python chooses to dump all the state of the language into the tokenizer (like balancing parenthesis and emitting INDENT and DEDENT tokens), keeping the language grammar and the parser very clean.

Allowing context dependent grammars will lift off a really nice property of the language (being context free) that is even more general than being LL(1). This basically means that a whole new lot of complicated rules will be possible under such a grammar. Of course, this does not mean that we will implement them, just that they will be possible while before they weren’t, so an extra effort would be needed to keep the language simple.

njs · November 27, 2018, 9:17pm

Do all pseudo-keywords make the language non-context-free, or does it depend on their syntax? I can see how await's syntax might violate context-freeness (it’s special iff inside the body of an async def or generator expression), but I’m surprised at the idea that async might violate context-freeness. It can only occur in 3 contexts, each of which involves an existing keyword: async def, async with, async for.

I guess a more interesting example is with, which had a very long road to becoming a keyword. Before with became a keyword, with EXPR: and with EXPR as BINDABLE: weren’t valid syntax, so in theory they could be handled without with being a keyword, and they aren’t as context-dependent as await, but the patterns are a little more complex than async's two-adjacent-tokens.

Note that I’m not actually arguing that using more pseudo-keywords would be a good idea. But it’d be interesting to know what’s possible, and I had the impression that newer parsing tech doesn’t necessarily rely on keywords as much as the traditional tech does. It’s possible I just made that up though

pablogsal · November 27, 2018, 9:46pm

I think I did not express myself correctly. The context-free property is a property of the grammar itself. The same rule could be expressed in multiple ways, making different grammars and some of them can be context free or not. The current situation is that the parser can only parse LL(1) grammars and therefore no matter what you write, if the parser parses it, the grammar must be LL(1) and therefore context free as all LL(k) grammars are context free. That is a good limitation because the parser is here the blocking piece that prevents to lift the context-free part.

You can implement pseudo-keywords in different ways and the context-free property will depend on the grammar rule you write unless there are more restricted properties at hand (like some constructs having more limitations which you could use to prove that to implement them you need a context-free language or the lack of it). Context-free here is a property of the grammar, not of the language. Python uses indentation and that is context (the level of the indentation is a state it needs to be saved to understand if something is in some scope or other) but the parser does not care about that indentation as it only will consume INDENT and DEDENT tokens without needing to know the actual “number”. Another example is parenthesis: the parser does not care if they are balanced, is the tokenizer the one that does that check.

I am sure that some pseudo-keywords can be implemented without lifting context-freeness, but others will not. In the current implementation is very easy to conclude that they are implemented in a context-free way because of the previous argument: async production rules can be parsed by LL(1) parser --> LL(1) parser parses LL(1) grammar -> LL(1) grammar is always context-free, ergo the async productions are implemented in a context-free way.

guido · December 5, 2018, 4:08pm

I don’t have time for a long diatribe here, but Pablo is repeating a falsehood. The parser does track balancing of parentheses! The lexer independently also tracks this so that it can decide whether to emit newline tokens. But the lexer just uses a counter that doesn’t care whether you use (, [ or { – it leaves that important task to the parser.

Also, in the terminology that I learned, it’s the language (i.e., the set of all valid programs) that is context-free (“cf”) or not – and this term means “can it be parsed using a cf grammar”. A grammar is a set of rules for parsing and there are many different grammars that parse the same language. A parsing algorithm like LL(k) imposes additional constraints on the grammar, and most parsing algorithms cannot parse all cf grammars.

A separate concern is ambiguity of the grammar. If there exists a program that can be parsed multiple ways using a grammar, that grammar is ambiguous. Some parsing algorithms change the meaning of “grammar” to disambiguate such cases using precise rules.

I don’t believe that all uses of context-dependent keywords automatically mean the language is not context-free – the two terms are not as connected as they sound.

apalala · April 21, 2019, 11:20pm

As others have mentioned, PEG grammars can handle very complex languages efficiently. Moving from LL(1) to, maybe LL(2) or LL(3) would work, and it would be as efficient as it is now, without changing much (but PEG is larger than LL(k)). More powerful algorithms (like ANTLR’s LL(*)) are not needed for Python.

I implemented TatSu (previously Grako, on which Guido tweeted once), a PEG/Packrat parser generator, in Python, for Python. There are good PEG parsers in almost every language, including C.

I’d like to take a look at the current parser, and would appreciate any hints as to where to start in the source code and the documentation.

It would also be good to know the timeframe or deadline for a change in the parser, considering the discussion period that must come before a draft implementation.

If a change in Python parsing is going to be made, I’d consider maintainability (as in the ease of documenting the grammar that started this topic) the main concern.

guido · April 22, 2019, 3:26pm

Perhaps there’s something about the parser in the devguide. If not, I would start with the Parser directory (the parser generator is now in the pgen subdirectory). There’s no timeframe – this won’t make it into Python 3.8. Good luck!

apalala · April 22, 2019, 3:52pm

Thanks!

BTW, another advantage of PEG is that it doesn’t require a separate tokenizer, as lexical elements are defined within the same grammar.

Also, (as with LL(k), but not LR), PEG allows for parsing to start on any production, so subsets of the grammar can be used for different purposes.

Finally, a question. How should typed-ast be involved in changes to the parser?

guido · April 22, 2019, 5:39pm

I’d like to keep the tokenizer out of it.

The existing pgen also supports grammar subsets (and we use these).

typed-ast is not involved, we’ll not add Python 3.8 or later to it (instead, we’ve added type comment support to stdlib ast).