Preparing for new Python parsing

This is just to jot down some design ideas before they slip from my memory into oblivion. Please ignore at will.

Base criteria

Bootstrapping Python will depend on nothing but C, and on the Python that can be bootstrapped from C.

PGL

There needs to be a Python Grammar Language (PGL from now on) to bootstrap Python parsing. The main purpose of PGL is:

  • To be understandable by humans.

The only requirement for PGL is that:

  • It is powerful enough to describe LL(k)

Since small variations of EBNF are good enough for PEG, and PEG is larger than LL and LR (and maybe also CFG), the above is a done deal, so human understandability of something that leads to determinism is the only criterion.

Step 0

Given PGL, we use “anytool” to produce a top-down, recursive-descent, library-independent parser (call it pglc) for PGL in Python and/or C (preferably Python). That done, we ditch “anytool” and carry on maintaining the very small pglc by hand.

pglc is now a tool that can read the syntax of Python described in PGL and output whatever the next step of parsing needs. It is a grammar-to-grammar translator, and/or grammar-to-parser compiler. The language for defining the Python language is now fully defined.

The zeroth part of bootstrap is done.

Step 1

At the next step, pglc takes a description of the Python syntax in PGL and produces a recursive-descent parser for Python source as a C-libpython program.

Steps 2-3

The produced C-libpython parser can take text written in Python and translate it into an AST with enough #line information for further stages to report parsing, semantic, and runtime errors accurately.

Steps 1-3 of Python parsing are complete.

That step is done here, using TatSu as the “anytool”.

The next steps on this phase are:

  1. Build a model of the parsed grammar.
  2. Make the model generate a Python->AST translator in Python.

With that, there will be a base implementation of the new parser. After that:

  1. Ditch TatSu by refactoring the Python code for the PGL parser (although the Python grammar may change, we may safely assume that the language to describe the grammar, PGL, will not).
  2. Make the model generate a C-libpython parser.

This is the TatSu grammar for PGL:

# This grammar describes the Grammar/Grammar grammar 
# used to describe the Python language
@@grammar :: PGLC
@@whitespace :: /(?s)[\s\r\n]+/
@@eol_comments :: /#([^\n]*?)(?=\n)/
@@left_recursion :: False

grammar = @:{rule}+ $ ;
rule = name:name ':' exp:exp EOL ;
exp = (choice | sequence | subexp) ;
choice = choice:'|'.{sequence|subexp} ;
sequence = seq+:subexp {seq+:subexp}+ ;
subexp = closure | atom ;
closure = closure:atom ctype:('*' | '+') ~  ;
atom = !(name ':') (group | optional | ref | token) ;
group = '(' ~ @:exp ')' ;
optional = '[' ~ opt:exp ']' ~ ;
ref = ref:name ;
name = /\w+/ ;
token = "'" ~ @:/([^'\n]|\\'|\\\\)*/ "'" ~ ;
EOL = /\s*\n|$/ ;
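
For completeness, exercising that grammar with TatSu looks roughly like this (the file names are illustrative, not fixed):

import tatsu

# Compile the PGL grammar above (here assumed to be saved as pgl.ebnf)
with open('pgl.ebnf') as f:
    pgl_parser = tatsu.compile(f.read())

# Parse a Grammar/Grammar-style rule file into a TatSu AST
with open('Grammar/Grammar') as f:
    rules = pgl_parser.parse(f.read())

for rule in rules:
    print(rule['name'])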

After finding the PackCC PEG/packrat parser generator for C, a path opens to test in the short term whether the grammar for Python (in PGL) can be PEG (runtime-wise):

  1. ✓ Create a TatSu parser to parse Grammar/Grammar
  2. ✓ Parse the Grammar/Grammar using the above parser
  3. ✓ Generate a draft PEG grammar for Python from the above using TatSu
  4. ✓ Debug the PEG grammar using TatSu (PEG semantics require rule-choice ordering, etc.)
  5. Generate AST from Python source using the above (at this point, the grammar is debugged and the parser complete)
  6. Measure parser performance (it should be within the expected Python vs C range; a quick benchmark sketch follows this list). Pass, or abort
  7. Automatically generate a PackCC grammar for Python from the above
  8. Debug the PackCC grammar
  9. Ditch TatSu
  10. Instrument the PackCC grammar to generate AST (like TatSu, PackCC allows naming parse subexpressions)
  11. Measure, and pass or abort
  12. Customize PackCC and the PackCC grammar so it is libpython compatible (PackCC provides for this).
  13. Add a node visitor to translate the PackCC grammar to “documentation grammar”.
  14. The current Python parser can be replaced by a PEG parser that is easy to maintain and covers source->AST.
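
For item 6 above, something as simple as this sketch should be enough to pass or abort (peg_parser is a stand-in for whatever the generated parser exposes, and the paths assume a CPython checkout):

import ast
import time
from pathlib import Path

def bench(parse, files):
    start = time.perf_counter()
    for path in files:
        parse(path.read_text())
    return time.perf_counter() - start

files = list(Path('Lib').rglob('*.py'))
print('current C parser:', bench(ast.parse, files))
# print('PEG parser:     ', bench(peg_parser.parse, files))  # hypothetical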

A draft/undebugged PEG/TatSu grammar for Python is here.

PackCC does thorough maintenance of the memoization cache, and the cache would contain ASTs for successful parses, so it’s very possible that the memory requirements of Python can be met (there would be neither a tokenizer nor a CST).

As to grammars more powerful than LL(1) for Python, PEG is as powerful as it gets.


That’s all for now. Thanks for allowing me to talk to myself!

PackCC-generated parsers are DFMs. They use C goto to get where they want quickly.

It’s not worth testing PackCC right now, as there are other known PEG parser generators for C, like peg (apt-get install peg).

It’s good to first test what a PEG grammar for Python looks like, and how maintainable it is.

PackCC is self-contained, and easy to port to libpython (unlike peg, which seems bootstrapped from yacc). PackCC is a good target for adoption into the core of Python, no matter what (NDA, etc.)

@guido, I haven’t been able to find Arihiro Yoshida, original author of PackCC. Is that a concern somehow? How to solve it if so?

Thanks for these detailed notes.

Not sure what you mean by the latter (“bootstrapped from C”). We already have many dependencies on an existing Python binary – quite a few parts of CPython (not just the parser) are generated by Python scripts. So I wouldn’t worry about having the tool written in Python. (The bootstrapping happens by checking the output of these scripts into the repo, so a clean checkout can be used to build a Python binary which you can then use to regenerate the generated files after editing any of their inputs.)

In fact, having peeked only briefly at PackCC, I’m not at all sure why you would use that. (Who uses SourceForge in 2019???) In Python 3.8, we removed the C-based pgen and replaced it with an equivalent Python package.

If I were to design the workflow here I would suggest writing a new parser generator in Python. I do like the idea of using a translator from the current Grammar/Grammar file to something else. But I’m okay with being proven wrong, and if PackCC satisfies all constraints, I’m not ruling it out.

Finally, I had an idea. How hard would it be to change the existing pgen to support LL(2)? That would address two nasty bits that currently require the parser proper to accept a larger language and check for special cases in the second pass (Python/ast.c).

The first case is that of assignment expressions. The Grammar says:

namedexpr_test: test [':=' test]

but what we really want to say is:

namedexpr_test: [NAME ':='] test

Unfortunately that’s not LL(1) since NAME is also in the FIRST set of test. (You can think of test as the pre-PEP-572 top-level expression rule, named test for historical reasons.)
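
With two tokens of lookahead the ambiguity disappears, since only the second token distinguishes the alternatives. As a sketch of what an LL(2) parser could do here (peek, next_token, parse_test, and NamedExpr are hypothetical helpers of a token-based parser, not pgen’s API; NAME is the token type):

def parse_namedexpr_test(self):
    # LL(2): inspect two tokens before committing to a branch
    if self.peek(0).type == NAME and self.peek(1).string == ':=':
        name = self.next_token()     # consume the NAME
        self.next_token()            # consume ':='
        return NamedExpr(name, self.parse_test())
    return self.parse_test()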

The other case is that of keyword args to function calls. Here the Grammar file has this (comment elided):

arglist: argument (',' argument)*  [',']
argument: ( test [comp_for] |
            test ':=' test |
            test '=' test |
            '**' test |
            '*' test )

Again, the test occurring in front of '=' and ':=' should really be NAME, but can’t be for the same reasons. Ideally it would be something like this:

argument: ( namedexpr_test
          | test comp_for     # ~~ test 'for' targetlist 'in' test ...
          | NAME '=' test
          | '*' test
          | '**' test
          )

In fact, there’s another pass 2 check: positional args may never appear after keyword args. Additionally, test comp_for can only occur (unparenthesized) if it’s the only argument. So an even better rewrite might be this (it’s a little weird because of optional trailing commas):

arglist: ( posarg (',' posarg)* (',' kwarg)* [',']
         | kwarg (',' kwarg)* [',']
         | test comp_for
         )
posarg: namedexpr_test | '*' test
kwarg: NAME '=' test | '**' test

(Hm, I don’t actually think this is LL(2), because of the test comp_for alternative. Ignore that part. This may be a reason why we need an even more powerful algorithm.)

I actually think there are similar cases in function definitions but they present nothing new.

But I don’t know how easy it would be to change pgen from LL(1) into LL(2). (Note that it’s not exactly LL(1) – there’s an additional constraint on its algorithm that rules cannot have empty productions. But even so.)

Missed this post while I was writing my own. It is only a problem if the open source status of PackCC is in doubt (e.g. if there’s no license file or it’s not an existing open source license).

As I wrote, I do find it questionable to rely on a tool maintained in SourceForge – and not having the author around to answer questions is even more of a problem. But if it’s got a true open source license we may be able to fork it, assuming we can find a volunteer to be the new maintainer.

So, ignoring test comp_for, we could simplify this to

arglist: ( ( posarg (',' posarg)* ) | kwarg ) (',' kwarg)* [',']

or

arglist: (posarg ',')* (posarg | kwarg) (',' kwarg)* [',']

I’ve forked it already, but I have yet to evaluate it for:

  • correctness
  • performance
  • maintainability and customizability

It depends on what you mean by “simplify”, and I’m afraid that may lead to arguments exactly like those about “which is the simplest equivalent expression/program”.

For people used to LL, PEG means that:

  1. No left recursion (nowadays several PEG parser generators support left recursion, but the one in PackCC leads to unintuitive cases).
  2. Choices that match the expected longest input must come first. This is because backtracking in PEG is unlike Prolog’s, etc.: the parser commits to the first choice that succeeds (sketched just below).
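
Point 2 can be made concrete with a toy combinator, where a parser is any function that returns a result or None on failure (nothing here is from an actual tool):

def ordered_choice(*alternatives):
    """PEG choice: try the alternatives in order and commit to the
    first one that succeeds."""
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result   # committed: later alternatives are never tried
        return None             # every alternative failed: the choice fails
    return parse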

In my experience, the grammars that are clearest for humans, both developers and readers, are those that make development of the parser easier, especially in the semantic actions.

TatSu mimics the args/kwargs of Python for grammar rules, and the productions used there look more like this:

arglist: (
              kwparams
            | params ',' kwparams
            | params
        ) [',']

params: name (',' name)* (',' '*' name?)? | '*' name?
kwparams: kwparam (',' kwparam)* (',' '**' name)? | '**' name
kwparam: name '=' value

Memoization takes away the fear of repetition, which is something to watch out for in LL(k).

In the strategy of TatSu, PackCC, and other PEG parsers, the parts of interest in each rule are named so they are readily accessible to the semantic actions, without having to generate a CST, or having to access them by position in the rule (something that changes even with minor changes to the rules).

arglist: (
              k=kwparams
            | p=params ',' k=kwparams
            | p=params
        ) [',']
{ return ArglistAST(p, k) } # <- this is a semantic action
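
In TatSu, such an action would typically live in a semantics class whose methods are named after the rules; a sketch (ArglistAST is hypothetical, and parser is the compiled grammar):

class PythonSemantics:
    def arglist(self, ast):
        # the named subexpressions (p, k) arrive as attributes of `ast`
        return ArglistAST(ast.p, ast.k)

model = parser.parse(source, semantics=PythonSemantics())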

Note: In future examples I’ll try to use standard PEG syntax for grammars (although TatSu uses an extended-extended BNF) to avoid confusion over their meaning.

I worked backwards. The Python parser that the bootstrap/make process generates must be C-libpython, because it must integrate with the parsing infrastructure (especially the AST), and the expectation is that it be comparably performant to the current one.

There are only a few PEG->C parser generators, and PackCC offers exactly what’s needed, except that it’s an abandoned project, by an author I haven’t been able to get hold of, and something I haven’t tested. With an MIT license, I think that testing would clear the risks.

A better known PEG parser generator is peg (installable in Ubuntu), and I’m sure the researchers that meet at the PEG group at MIT would help with knowledge and/or tools.

At this stage it is all about feasibility:

  1. Can a PEG grammar for Python meet the committers’/board’s criteria for clarity and power?
  2. Can a PEG->C parser generator generate a parser that is performant, well integrated into CPython, and generates ASTs directly?

If that succeeds, the pieces can be replaced by other existing ones, or ones created anew.

You answered your own question :slight_smile:

I don’t know pgen, and I found it complicated, but LL(k) is LL(k): same algorithms, and parser generators like ANTLR can deal with LL(*), generating Python or C. But ANTLR is in Java, so it cannot be brought into the bootstrap. There should be other tools around, but I haven’t researched them.

I think exploring LL(k>=2) is worthwhile. I’ll be experimenting with PEG because the grammars are very powerful, and because I know PEG well enough, so experiments can be done over weekend hours using existing tools. Remember that your query was not just about the power of the grammar, but about investigating what it will take (mostly in changes to the libraries) for a Python parser to directly produce an AST. Since the timeframe is for after 3.8, there’s time to do research.

I mentioned before that LL(1) has the quality of restricting languages to a set that tends to be easy for humans to understand. LL(2) is a tiny departure from that. Grammars are less convoluted if the k is allowed to be whatever is needed for different parts of a language. Yet Python, even in its more complex parts, remains a simple language, not because of LL, but because of its design. Parsing Python can be approached well in different ways.

Right. In the past, the fact that the grammar must be LL(1) was credited (in part) with it being a simple language, but at this point I think there are other safeguards in place, and LL(1) is holding us back (named expressions and keyword args being just two of several examples). IIRC that’s why I started this thread. :-))

Right. I still don’t know enough about PEG grammars to answer (1), and for (2) we’d need to do a serious experiment, right? Assuming (1) is answered in the affirmative (as we both expect), how many shortcuts can we take to answer (2) without doing nearly all the work? (BTW the PSF board has no input here – they deal with the community, but this is purely a technical issue to be decided by the committers, or if no consensus can be obtained by the Steering Council. :-))

What exactly is being memoized?

Interesting. This sounds like a generalization of LL(1), where the parser commits as soon as it has a 1-token match. I am going to have to read up on PEG grammars, because I am curious how and when the parser decides that it is seeing params rather than kwparams. I guess if the input is foo, it first attempts kwparams but realizes there’s no match when it sees the comma. [UPDATE: I now understand this. No need to respond.]

(BTW since Python 3.5 you can mix arbitrary positional arguments with multiple *args, and arbitrary keyword arguments with multiple **kwds.)

PS. Is there a way to not use PEG for lexical analysis? Python’s lexer must be special in order to handle various rules related to indents.

The more I read about PEG grammars and packrat parsing, the more I am thinking that it would be a fun project to create a tool similar to pgen that however uses a PEG algorithm and generates C code (+ tables?) that takes a token stream produced by CPython’s tokenizer and directly produces an AST that matches the AST described by Parser/Python.asdl. The generated C code should use the packrat algorithm (using backtracking and memoization). The parser generator ought to be written in modern Python. (To my dismay I found that pgen is written in a Python style that closely resembles the C code of the original pgen generator I wrote 30 years ago.)

From my experience, the shortcuts are in the plan I described. All of that research plan has to be done to have answers. Having the answers should provide information about what it will take to have all of the implementation.

Thanks for the clarification. My meaning was that, whatever the design and the success or failure of the prototypes, it’s up to committers what makes it into Python/CPython.

PEG parsers memoize the result of each (rule, input_position) pair during the parse, which allows for grammars that are clear because there is no fear of repeating subexpressions.
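
A minimal sketch of that memoization as a decorator over rule methods (the Parser shape here is an assumption, not any particular tool’s API):

def memoize(rule):
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.memo[key] = rule(self, pos)  # cache success or failure
        return self.memo[key]                 # repeats become O(1) lookups
    return wrapper

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.memo = {}

    @memoize
    def expression(self, pos):
        ...  # try the rule's alternatives starting at pos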

For complicated languages, memoization can store a lot of rubbish (though I think I remember that the amount is bounded, linear at worst? I would review the papers, or Wikipedia). For Python, because of its design, and because it’s currently parsed with LL(1), I expect every memo will be an AST that will be valid input to the final AST, with no waste.

Oops! I missed that!

Did I post the paper about PEG probably being more expressive than CFG, and maybe Turing-complete? It doesn’t matter, really, because the query is still about:

  • A more expressive grammar
  • A grammar compiler that can produce a Python parser that can directly produce AST on runtimes comparable to the current ones

Thanks! During some spare moments at PyCon US I have started hacking on a simple parser generator for PEG grammars that would be suitable as a replacement for pgen. I now have a much better understanding of packrat parsers.

I am slightly less optimistic about the performance of memoization for Python, because my ideal would be to get rid of the extra “pass 2” syntactic checks, and that removes the LL(1) property from assignments (both ‘=’ and ‘:=’) and keyword arguments. (See my earlier posts about the issues here that require an extra pass due to the LL(1) requirement.)

But only slightly. We might be able to flush the cache as soon as we successfully hit a NEWLINE token.
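
A sketch of that flushing idea (the class is illustrative, not from any existing tool; it assumes the parse is committed once a statement-ending NEWLINE is consumed):

import tokenize

class StatementFlusher:
    def __init__(self, tokens):
        self.tokens = tokens    # a list of tokenize.TokenInfo
        self.index = 0
        self.memo = {}          # packrat memo table

    def consume(self, expected_type):
        tok = self.tokens[self.index]
        if tok.type != expected_type:
            return None
        self.index += 1
        if expected_type == tokenize.NEWLINE:
            # a statement just completed; backtracking can never reach
            # back across it, so the memo table can be emptied
            self.memo.clear()
        return tok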

I argued before that feasibility could mean living with great parts of the existing parsing, not just the tokenizer, but even the CST.

I haven’t talked about mixing the current tokenizer with PEG only because PEG normally does not use a tokenizer, and because I’ve never done it before. I’ll research it.

Good point.

I don’t know.

I’m researching this at the moment.

Second answer.

After a moment’s thought, I would not bother.

PEG has the power to parse grammars within grammars, which is something that solves things like f-strings as a matter of course.

The questions of how many parts of CPython depend on the tokenizer, and how much, and whether a PEG parser can feed those needs remain, and I think they can only be answered by prototyping.

I’ve done this in my toy example and it’s simple enough. I wrote a packrat parser that stores an array of TokenInfo objects and uses the index into this array for memoization. All the other things that make PEG nice still work.
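
A minimal reconstruction of that idea (not the actual toy code; the point is that the memo key is just a plain list index):

import io
import tokenize

class TokenStream:
    """Tokenize the source once; the parser then moves a cursor over
    the resulting list, and (rule, index) pairs key the memo table."""

    def __init__(self, source):
        readline = io.StringIO(source).readline
        self.tokens = [t for t in tokenize.generate_tokens(readline)
                       if t.type not in (tokenize.COMMENT, tokenize.NL)]
        self.index = 0

    def mark(self):
        return self.index    # cheap to save...

    def reset(self, index):
        self.index = index   # ...and cheap to backtrack to

    def peek(self):
        return self.tokens[self.index]

    def next(self):
        tok = self.tokens[self.index]
        self.index += 1
        return tok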