Hello! I'm studying CPython's old parser for educational purposes. I'm familiar with the traditional PDA-based LL(1) parser, and I understand how CPython uses DFAs for rules, instead of plain sequences, to support the extended grammar notation. My question is: should pgen check whether its input (Grammar) is actually LL(1)? Right now it doesn't seem to do so.
If we use a traditional formal grammar, then to see whether it is LL(1) we check whether the SELECT sets of any two productions of the same non-terminal intersect, where SELECT(A -> alpha) is FIRST(alpha), extended with FOLLOW(A) when alpha can derive the empty string.
If we do not use empty production rules, then we only need to look at FIRST sets. But CPython uses extended notation, and in that case looking at FIRST sets alone is not enough to decide whether a grammar is LL(1). Consider an example:
A -> B 'a'
B -> 'a'+
This is not an LL(1) grammar. In traditional notation we would write it as:
A -> B 'a'
B -> 'a' C
C -> 'a' C | ''
and 'a' would be in both SELECT(C -> 'a' C) and SELECT(C -> '').
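To make the conflict concrete, here is a minimal sketch that computes FIRST, FOLLOW and SELECT sets for the expanded grammar and shows that the two productions of C clash on 'a'. This is an illustrative textbook-style computation, not pgen's code; the grammar encoding and helper names are mine.

```python
EPS = "<eps>"  # marker for the empty string

# A -> B 'a' ; B -> 'a' C ; C -> 'a' C | ''   (empty production = ())
grammar = {
    "A": [("B", "a")],
    "B": [("a", "C")],
    "C": [("a", "C"), ()],
}
NTS = set(grammar)  # the non-terminals

def seq_first(seq, first):
    """FIRST set of a symbol sequence, given FIRST sets of non-terminals."""
    out = set()
    for sym in seq:
        syms = first[sym] if sym in NTS else {sym}
        out |= syms - {EPS}
        if EPS not in syms:
            return out
    out.add(EPS)  # every symbol in the sequence was nullable
    return out

def first_sets():
    first = {nt: set() for nt in NTS}
    changed = True
    while changed:  # fixed-point iteration
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                f = seq_first(prod, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

def follow_sets(start, first):
    follow = {nt: set() for nt in NTS}
    follow[start].add("$")  # end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in NTS:
                        continue
                    rest = seq_first(prod[i + 1:], first)
                    add = rest - {EPS}
                    if EPS in rest:  # rest is nullable: inherit FOLLOW(nt)
                        add |= follow[nt]
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow

def select(nt, prod, first, follow):
    f = seq_first(prod, first)
    return (f - {EPS}) | (follow[nt] if EPS in f else set())

first = first_sets()
follow = follow_sets("A", first)
s1 = select("C", ("a", "C"), first, follow)
s2 = select("C", (), first, follow)
print(s1 & s2)  # prints {'a'}: the productions conflict, so not LL(1)
```

Running the check confirms that 'a' lands in both SELECT sets, which is exactly the conflict described above.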
pgen accepts a non-LL(1) grammar without complaint. I tweaked Grammar's pass_stmt a little:
pass_stmt: 'pass' | myrule 'a'
Then, on the input a a a, it gives a SyntaxError even though the input is in the grammar's language. It is fine that an LL(1) parser cannot parse a non-LL(1) grammar. But is it intended that no error is raised when such a grammar is provided? Should the author of a grammar manually check that it can be parsed by the parser? Or am I missing something?