Should there be a check whether Grammar is LL(1)?

r4victor · September 12, 2020, 4:14pm

Hello! I’m studying the CPython’s old parser for educational purposes. I’m familiar with the traditional PDA-based LL(1) parser and I got an idea of how CPython uses DFAs for rules instead of just sequences to support extended notation. The question I have is, should pgen check whether its input (Grammar) is indeed LL(1)? Because right now it doesn’t seem to do so.

If we use a traditional formal grammar, then to see whether it is LL(1) we look whether any SELECT sets for the rules of the same non-terminal intersect, where SELECT is:

If we do not use empty production rules, then we only need to look at FIRST sets. But CPython uses extended notation. In this case it’s not enough to look at the FIRST sets to decide whether a grammar is LL(1). Consider an example:

A –> B’a’
B –> ‘a’+

This is not an LL(1) grammar. In a traditional notation we would write it as:

A –> B’a’
B –> 'a’C
C –> 'a’C | ‘’

and ‘a’ would be in both SELECT(C->‘a’C) and SELECT(C->’’).

pgen allows to provide a non-LL(1) grammar. I tweaked Grammar’s pass_stmt a little:

pass_stmt: ‘pass’ | myrule ‘a’
myrule: ‘a’+

Then, on input a a a, it gives a SyntaxError while the input is in the grammar’s language. This is okay that LL(1) parser doesn’t parse a non-LL(1) grammar. But is it intended that no error on providing such a grammar is raised? Should a writer of a grammar manually check whether it can be parsed by the parser? Or am I missing something?

guido · September 25, 2020, 3:32am

That’s clever. You’re right that grammar can’t work, because the + operator is implemented eagerly. I guess in 30 years nobody ever tried to add such a silly rule to Python, and pgen doesn’t really get used for anything but Python, so I never got to implementing such a check. Do you see a problem in principle with adding it?

r4victor · September 25, 2020, 10:07am

I think that such a check can be implemented. The problem arises when an arc in a DFA of some nonterminal is labeled by a terminal which can also follow that nonterminal. So, we need to calculate FOLLOW sets for nonterminals and, for each nonterminal, a set of terminals that can start an optional end of that nonterminal (let’s call this set OPT_END_FIRST). We would then check whether FOLLOW and OPT_END_FIRST sets for any nonterminal intersect.

To calculate these sets when the rules are represented by DFAs, I think, we can use the following method:

OPT_END_FIRST(A) = {terminals that label arcs from terminal states in the A’s DFA}

FOLLOW(A) = {terminals that label arcs from states in which A leads in B’s DFA} + {FOLLOW(B) if any of these states are final}, where B goes over all nonterminals.

This could be a possible solution. Is there a need to fix the problem? As far as I know, pgen will be removed in 3.10.

guido · September 25, 2020, 2:38pm

There is no need, pgen will retire soon.

Have you looked at its replacement, pegen, yet?

r4victor · September 25, 2020, 3:18pm

Yes, I have. I’m working on the series of blog posts on the CPython internals and I’ve made an overview of both parsers.

I wasn’t familiar with PEG before and your series on it served as an excellent introduction, thanks! I like the new PEG parser and I think the idea to use an analytic grammar is the right one. Nevertheless, I’m a bit skeptical about PEG. I couldn’t find other large projects that use PEG. So, the question is why. As far as I understand, one of the problems is that the language PEG defines is not as clear as, for example, those defined by EBNF [1]. It’s certainly a concern in theory, but I’m not sure whether it can be a problem in practice.

[1] https://www.researchgate.net/publication/208034009_Some_Aspects_of_Parsing_Expression_Grammar