Preparing for new Python parsing

A few messages earlier, I stated that tokenize.py can tokenize 6000 lines/sec. Your numbers suggested that your parser processed 60 lines/sec, so my conclusion was that tokenize.py would account for only about 1% of the total time. (Update: Maybe you misread this as an assessment of the native tokenizer? That’s much faster, of course. Nevertheless, I believe tokenize.py is nowhere near being the long pole here.)
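A quick way to reproduce that kind of measurement; a self-contained sketch using synthetic input (numbers will of course vary by machine and by input shape):

```python
# Rough benchmark of tokenize.py throughput on synthetic input.
import io
import time
import tokenize

def lines_per_second(source: str) -> float:
    """Tokenize `source` with tokenize.py and return lines processed per second."""
    n_lines = source.count("\n") + 1
    start = time.perf_counter()
    for _ in tokenize.generate_tokens(io.StringIO(source).readline):
        pass
    elapsed = time.perf_counter() - start
    return n_lines / elapsed

sample = "x = 1\n" * 5000  # 5000 trivial lines
print(f"{lines_per_second(sample):,.0f} lines/sec")
```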

Please do! I’m not at all convinced that the toy project is useful for anything besides teaching myself about PEG. I’d like to turn it into a series of blog posts, but I’ve got limited time. Also, I’m unclear on what kind of debugging features to add; I’ve got some tracing prints, but they are not very helpful for debugging deficiencies in the grammar (like the problem I posted a few weeks ago).

Traces from PEG can be analyzed for debugging.

At least for me, non-k conformance from an LL(k) parser is very difficult to deal with.

I missed that. The 1% estimate makes a lot of sense.

I could use some suggestions on how to generate such traces, and what the most useful information is to put in traces. (If you’re referring to a specific behavior of some specific software, a link to an example in some docs or even source code would be appreciated.)

Also, I’m going to be off-line most of the weekend through Monday. TTYL!

Traces from PEG, everything being top down, read just as if a Python program had been instrumented to log the entry, result, and exit of each function call.
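For concreteness, a toy sketch of that kind of instrumentation (the `Parser` class and `atom` rule here are invented stand-ins, not anyone's actual code): a decorator logs entry, result, and exit of each rule, indented by recursion depth.

```python
# Minimal PEG tracing sketch: log entry/result/exit of each rule method,
# indented by recursion depth, exactly like instrumented function calls.
import functools

INDENT = 0

def traced(rule):
    """Wrap a rule method to log entry, result, and exit."""
    @functools.wraps(rule)
    def wrapper(self, *args):
        global INDENT
        print(f"{'  ' * INDENT}> {rule.__name__} at {self.pos}")
        INDENT += 1
        result = rule(self, *args)
        INDENT -= 1
        verdict = "ok" if result is not None else "fail"
        print(f"{'  ' * INDENT}< {rule.__name__} -> {verdict} ({result!r})")
        return result
    return wrapper

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    @traced
    def atom(self):
        if self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return None

Parser(["1"]).atom()
```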

A sample of a trace from TatSu:

Aha, I’ve seen that. I think we can even improve on that. :slight_smile:

Opinions are welcome! :wink:

Because there’s no hurry, it may be worth going forward and evaluating the performance after the debugged PEG grammar is translated to one that a C parser generator (like peg) accepts.

An article titled "Why Python is Popular Despite Being (Super) Slow" just hit my inbox. I haven’t read it, but you can see my point.

A good prior step is to test with PyPy. The failures are probably due to PyPy being at 3.6 compatibility. The improvement in LOC/sec is only 50%, which probably means that the grammar needs optimization:

       5,480   files input
       5,480   files parsed
   1,954,995   lines input
   1,954,995   lines parsed
       5,451   successes
          29   failures
       99.5%   success rate
     0:57:41   elapsed time
     5:46:03   runtime
          94   lines/sec
          341   MiB max memory

The project is generating PEG grammars for both PackCC and peg:

The Makefile generates the C programs, and they compile.

Going forward, I decided to use peg as the C parser generator because the tool is debugged and maintained, and its author is available and welcomes feedback.

The next step should be to incorporate the Python tokenizer (I need a lookback data structure).

OK, sounds good. One concern I have with using peg is that it doesn’t seem to support left recursion in the grammar directly. I read in TatSu’s manual that it supports it, and I added it to the toy project. I’m now quite enamored of this approach; see e.g. the example below, which reads naturally and generates correct nodes for left-binding binary operators. (Though I could change my opinion if it turns out to be too slow. :slight_smile: )

I found some time to work on the toy project: I added support for semantic actions, with a twist – the semantic action must be an expression representing the (memoizable!) return value of the rule. Example:

start: expr NEWLINE? ENDMARKER  { ast.Expression(expr, lineno=1, col_offset=0) }
expr: ( expr '+' term           { ast.BinOp(expr, ast.Add(), term) }
      | expr '-' term           { ast.BinOp(expr, ast.Sub(), term) }
      | term                    { term }
      )
term: ( term '*' factor         { ast.BinOp(term, ast.Mult(), factor) }
      | term '/' factor         { ast.BinOp(term, ast.Div(), factor) }
      | factor                  { factor }
      )
factor: ('(' expr ')'           { expr }
        | atom                  { atom }
        )
atom: ( NAME                    { ast.Name(id=name.value, ctx=ast.Load()) }
      | NUMBER                  { ast.Constant(value=ast.literal_eval(number.value)) }
      )

This particular grammar parses some simple Python expressions directly into AST nodes which can be compiled into executable code. The main problem right now is that the line numbers and column offsets are lost.
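The key idea, spelled out as a toy sketch (the helper name `run_action` is invented): since each action is a plain expression, the generated parser can evaluate it with the matched sub-results bound to their rule names, and the result is an ordinary value that can be memoized.

```python
# Sketch of "action as expression": evaluate the action with the matched
# sub-rule results in scope, yielding a memoizable return value.
import ast

def run_action(action_src: str, bindings: dict):
    """Evaluate an action expression with sub-rule results as local names."""
    return eval(action_src, {"ast": ast}, bindings)

# Simulate having matched `expr '+' term`, with expr -> Name, term -> Constant:
left = ast.Name(id="x", ctx=ast.Load())
right = ast.Constant(value=1)
node = run_action("ast.BinOp(expr, ast.Add(), term)",
                  {"expr": left, "term": right})
print(type(node).__name__)  # BinOp
```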

I’m not at all sure that I want to follow this approach for the full grammar, but I’m pretty sure that it could be made to work. Of course the C actions would need to be entirely different so it’s not clear how helpful this actually is. Then again I believe you’re not generating AST nodes in your C code at all, right?

I didn’t prioritize left recursion because the current Python grammar doesn’t have any (I imagine that later stages rework the AST into the correct associativity). PackCC’s docs say that it does support left recursion.

The time I spent on this project recently went into refactoring TatSu to simplify its parsing algorithm into something one can think about implementing in C. I also spent time thinking about how to approach C, as testing the speed of the deep recursion seems like the best next step:

  • peg (still supported, but fewer features, and less flexible)
  • PackCC (unsupported, a wildcard)
  • Roll my own (more on this below)

That’s very nice! And yes, using recursion instead of iteration ( expr: term (('+'|'-') term)* ) is very helpful when building the AST directly.

Perhaps the grammar language will need a way to name or index the right-hand-side elements to resolve rules like:

comparison: (expr comp_op expr)
arguments: (arg (',' arg)*)
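One hypothetical convention (not an existing feature of either tool, just a sketch of what the grammar language could offer) would be either indexed suffixes for repeated names, or explicit labels:

```
# indexing repeated occurrences:
comparison: expr comp_op expr         { ast.Compare(expr_1, [comp_op], [expr_2]) }

# or explicit labels:
arguments: first=arg rest=(',' arg)*  { [first] + rest }
```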

The approach does work. It is the approach I’ve used in the past, and the one I intended to use in this experiment.

Why? You’d only need to make sure there’s cleanup of the memoization cache at the right times. The resulting AST is the desired result of the parse, so its ownership has to pass downstream.

I’m not currently generating any C, but the intention has always been to generate Python AST directly from the rules. That’s why some refactoring of the Python grammar will be necessary, as it currently doesn’t support the approach in your example in all places (like typedarglist).

That said, the approach your project uses to generate parsers is nice and simple, and exactly the one I planned to use if I rolled my own PEG->C parser generator. In that part, you’re ahead of me, so I may end up borrowing some of your code.

In that sense, it would be necessary to reimplement some features, like memoization through decorators, in a way that’s approachable from C. One way to do that is to invoke the implementation of rules through an intermediary like def call(self, rule):, which is what is done in TatSu, and something that has an easy equivalent in C.
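A toy sketch of that shape (names invented, not TatSu's actual code): all rule invocations funnel through one `call` method that owns the memo table, so in C the decorator disappears and the memo becomes, say, an array indexed by (rule id, position).

```python
# Memoization routed through a single `call` entry point instead of
# per-rule decorators -- the shape that translates most directly to C.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.memo = {}  # (rule_name, start_pos) -> (result, end_pos)

    def call(self, rule):
        """Invoke `rule`, caching (result, end position) per start position."""
        key = (rule.__name__, self.pos)
        if key in self.memo:
            result, self.pos = self.memo[key]
            return result
        result = rule(self)
        self.memo[key] = (result, self.pos)
        return result

    def number(self):
        if self.pos < len(self.tokens) and self.tokens[self.pos].isdigit():
            tok = self.tokens[self.pos]
            self.pos += 1
            return int(tok)
        return None

p = Parser(["42"])
assert p.call(Parser.number) == 42
p.pos = 0
assert p.call(Parser.number) == 42  # second time served from the memo table
```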

Again, I’m certain that direct generation of AST is possible at a cost less than the current one because there will be no CST parsing. At the moment I want to measure the speed of PEG for Python in C with just the CST.

If I roll my own C, I’ll write the data structures from scratch (basically a deque that supports some stacks and the token lookback cache), knowing that they can be replaced by libpython equivalents in time.

Finally, I found consensus that there are optimizations to make in PEG for a language known to be LL(k), like placing the k lookaheads as PEG lookaheads in front of choices. That means there’s room for optimization if the initial tests look good.

The Grammar file doesn’t have left-recursion because Python’s pgen doesn’t support it. It would be much more natural to write some rules, like various parts of expressions, with left-recursion – for those the AST does have left-recursion.

Yes, I intend to implement this. But first I need to decide how to address looping elements like (',' arg) in the above.

Yes, but it will be a lot of work to write all the rules and actions.

Because there’s no reasonable way to make the actions from my example work as C syntax, e.g. { ast.BinOp(expr, ast.Add(), term) }.

I still want proof. :slight_smile:

This occurred to me too. There are parts of the grammar where a large number of alternatives have to be tried and rejected (e.g. statements – each statement type except assignment/expression starts with a unique keyword, like if, del, class etc.). It should be easy enough to combine the pgen approach (which requires this) with the PEG approach.

It should be OK to add left recursion, as the grammar would anyways need refactoring to support actions that directly create AST nodes.

Yet, from experience, the way to do this is to first get a debugged grammar working, as close to the current one as possible, and then start the refactoring, one rule at a time (otherwise one goes from 100% parse coverage to 60%, without a clue about what exactly broke). The only problem is that associativity will be wrong in places until the work is finished.

The work need only be done once, and the resulting grammar would become a great candidate for “Official Python Grammar” (which is one of the original goals). In fact, even if the final solution is LL(small-k), having a PEG/leftrec grammar for exactly the same language would help plenty (or the PEG may drive the LL?).

Because the AST contains per-operator node types, the grammar will need to have a rule per operator, and { return ast.Add({left}, {right}); } or whatever.

PEG, with its lookaheads, allows for lots of “cheating” that can’t be done with LL. For example, one can define fake rules or sub-rules to be used in strategic places to optimize the top-down descent (I think I read that PEG may be Turing-complete). For example, a keyword rule may be used to guard compound_stmt, so the parser doesn’t need to check for every keyword when none apply:

compound_stmt = &keyword ~ (if_stmt|for_stmt|...) ;

@name
name = NAME ; # this is checked to _not_ be a keyword

keyword = !name NAME ; # or something like this
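The guard amounts to one cheap membership test before the cascade of alternatives. A toy Python rendering (token representation and function names invented; the actual dispatch is elided):

```python
# Sketch of the &keyword guard: one membership test before trying the
# compound-statement alternatives, instead of probing every alternative.
import keyword

def parse_compound_stmt(tokens, pos):
    # Guard: bail out immediately unless the next token is a keyword at all.
    if pos >= len(tokens) or not keyword.iskeyword(tokens[pos]):
        return None
    # ... dispatch to if_stmt / for_stmt / while_stmt / etc. (elided)
    return ("compound", tokens[pos])

assert parse_compound_stmt(["if"], 0) == ("compound", "if")
assert parse_compound_stmt(["x"], 0) is None
```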

So do I! :slight_smile:

A rename of the repo for my experiments, just on the grounds that I always got the previous, cryptic name… wrong!

I didn’t know where else to write, so, @guido, I’ve left traces for you about some changes in TatSu, that I think is what you need to move pegen forward regarding left recursion.

I thought this was a good place to leave a trace, because it’s something you’ll read when you have the time.

Thanks. I’m not sure what you mean by “traces” or where to look. But I have been struggling with left recursion in pegen too (it’s kept me up at night :-). I have two unsolved problems: (a) how to detect indirect left recursion; (b) how to implement my left-recursion strategy for indirect left recursion.

Right now pegen can only handle direct and “hidden” left recursion. (The latter being when there are some optional clauses before a self-reference in an alternative, e.g. expr: [sign] expr or even expr: sign expr where sign: ['-'], whereas indirect left recursion would be A: B ... and B: A....)
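For reference, the standard treatment of direct left recursion in packrat parsers is “seed growing” (Warth et al.). A toy sketch, not pegen’s actual code, for the grammar `expr: expr '+' NUM | NUM`:

```python
# "Seed growing" for direct left recursion: memoize a failure as the seed,
# then re-run the rule body, letting the left-recursive call consume the
# previous (shorter) result from the memo, until the parse stops growing.
def parse_expr(tokens, pos, memo):
    key = ("expr", pos)
    if key in memo:
        return memo[key]
    memo[key] = (None, pos)            # seed: assume failure
    while True:
        result = expr_body(tokens, pos, memo)
        if result[0] is None or result[1] <= memo[key][1]:
            break
        memo[key] = result             # the seed grew; try again
    return memo[key]

def expr_body(tokens, pos, memo):
    # alternative 1: expr '+' NUM  (the left-recursive call hits the memo)
    left, p = parse_expr(tokens, pos, memo)
    if left is not None and p + 1 < len(tokens) and tokens[p] == "+":
        return (("+", left, tokens[p + 1]), p + 2)
    # alternative 2: NUM
    if pos < len(tokens):
        return (tokens[pos], pos + 1)
    return (None, pos)

tree, end = parse_expr(["1", "+", "2", "+", "3"], 0, {})
print(tree)  # ('+', ('+', '1', '2'), '3')  -- left-associative, as desired
```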

I believe my code works for direct and hidden left recursion, though. I did copy my “is nullable” algorithm from TatSu, so maybe it needs some kind of update?

PS. If you want to communicate with me without bugging this list, just write to guido@python.org (it’s no secret :-).

It may be that you expected it to be simpler than LL(k) analysis? The PEG nullability/leftrec test is also a recursive definition, so it requires an algorithm that either recurses or iterates until there is convergence (until the result of the current iteration is the same as the last).

The code for nullability/leftrec in TatSu is complex. The insights and algorithms are from people in academia. It’s up to us, hackers, to hack it :wink:

The traces I left are because I think that nullability/leftrec can be checked over a simple change made to LL(k) FIRST/FOLLOW/LOOKAHEAD computations. If a ref(rulename) token is added to the sets when a rule is mentioned on a right-hand side, then the checks become:

def is_leftrec(rule):
    return ref(rule.name) in rule.ll_first()

The check for is_nullable() is not required after I fixed the bug in the LL(k) concatenation (def dot()) in TatSu, but it’s equivalent to:

def is_nullable(exp):
    return () in exp.ll_first()
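A toy sketch of that idea end to end (grammar encoding and helper names invented for illustration, not TatSu’s code): compute FIRST sets by fixed-point iteration, seeding rule mentions with a ref(name) marker and using the empty tuple () for nullability.

```python
# Fixed-point FIRST computation that also records ref(name) markers,
# so leftrec and nullability fall out as simple membership tests.
def ref(name):
    return ("ref", name)

# Toy grammar: rule -> alternatives; each alternative is a sequence of
# terminals (plain strings) or rule names (keys of `rules`).
rules = {
    "expr": [["expr", "+", "term"], ["term"]],  # directly left-recursive
    "term": [["NUM"]],
    "sign": [["-"], []],                        # nullable (empty alternative)
    "hexpr": [["sign", "hexpr", "NUM"]],        # hidden left recursion
}

def first_sets(rules):
    """Iterate to convergence, as the recursive definition requires."""
    first = {name: set() for name in rules}
    changed = True
    while changed:
        changed = False
        for name, alts in rules.items():
            for alt in alts:
                add = set()
                nullable_prefix = True
                for item in alt:
                    if not nullable_prefix:
                        break
                    if item in rules:
                        add.add(ref(item))
                        add |= first[item] - {()}
                        nullable_prefix = () in first[item]
                    else:
                        add.add(item)
                        nullable_prefix = False
                if nullable_prefix:
                    add.add(())  # the whole alternative can match empty
                if not add <= first[name]:
                    first[name] |= add
                    changed = True
    return first

first = first_sets(rules)
assert ref("expr") in first["expr"]    # is_leftrec("expr")
assert ref("hexpr") in first["hexpr"]  # hidden leftrec is detected too
assert () in first["sign"]             # is_nullable("sign")
assert () not in first["expr"]
```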

Once more, the pull request (that I’ll merge after some more unit tests), is this. Take a look at how firstset() is calculated.

CAVEAT: I still have no feedback from the academics on this approach.

Thank you!

P.S. I totally understand being kept up at night.

Exciting news: I have been working on generating C code, and a preliminary test shows that it is in the right ballpark. See the README. The C generator doesn’t yet support the same feature set as the Python generator (missing are left recursion, optional, loops, lookahead, cut), but I’m excited. CC: @apalala, @pablogsal .

Outstanding!

(My apologies for my AWOL of the past few weeks…)