Preparing for new Python parsing

It seems the confusion comes from each one of you talking about a different part of the grammar.
When defining a function, you can only have one **. But when calling a function, you can pass more:

f(*[1], *[2], **{'c': 3}, **{'d': 4})


That’s the tutorial for defining functions. Tutorials leave things out on purpose. You’ll want the reference docs for the whole picture.
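To make the distinction concrete, here is a minimal sketch (the function name `f` is just for illustration): a call may unpack any number of iterables and mappings, while a def may declare at most one `**kwargs` parameter.

```python
import ast

# Calls may unpack any number of iterables and mappings:
def f(a, b, c, d):
    return (a, b, c, d)

result = f(*[1], *[2], **{'c': 3}, **{'d': 4})  # -> (1, 2, 3, 4)

# But a def may declare at most one ** parameter; a second one
# is rejected at parse time:
try:
    ast.parse("def g(**a, **b): pass")
    ok = False
except SyntaxError:
    ok = True
```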


It’s clear now. I’ll check the grammar.

That seems parseable by PEG! It doesn’t contain the TYPE_COMMENT part, but I’ll try to make the grammar match the reference as much as possible.

Actually, do we need to keep TYPE_COMMENT for Python >3.8? (@pablogsal, @encukou, @guido)

I’m confused here. In which case is the tokenizer not there to help?
The tokenizer emits NEWLINE tokens, except within matching brackets. In your second example, the extra NEWLINE token makes it invalid.
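You can see this directly with the stdlib `tokenize` module. A rough sketch: a line break inside brackets is reported as an insignificant NL token, while the same break at the top level is a significant NEWLINE.

```python
import io
import tokenize

def tok_names(src):
    """Return the token-type names tokenize produces for src."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    return [tokenize.tok_name[t.type] for t in toks]

# Line break inside brackets: reported as NL, not NEWLINE
inside = tok_names("x = (1 +\n2)\n")

# The same break at top level: a significant NEWLINE per statement
outside = tok_names("x = 1\ny = 2\n")
```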

TYPE_COMMENT is new in 3.8. I guess the docs haven’t caught up yet?

Whoops, I lost the Discourse tab and didn’t watch while this topic grew from 42 to 64 posts. @apalala, if you still have questions about the grammar or the tokenizer, please add me (@guido) explicitly to your post so I’ll see it.

The only thing I did over the weekend was do some comparative timings. I found that my toy project is 25 times as slow as the native parser; and apparently TatSu is 30 times slower than that.

I shared my toy project (“pegen”, pronounced “pagan”) with @apalala, @pablogsal and @emily; please don’t share further, I am not ready for public scrutiny. If you have specific questions about this, the best way is to use the GitHub issue tracker for the pegen project.


Yes, we need to keep it. Note that it is a special token that is only generated by the tokenizer if a certain flag is specified. This is documented in the ast docs. (This should perhaps also be added to the docs for built-in compile()?)
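From the Python side, the documented way to turn the flag on is the `type_comments` argument to `ast.parse()` (new in 3.8). A small sketch:

```python
import ast

src = "def f(a):  # type: (int) -> int\n    return a\n"

# Without the flag, the type comment is ignored:
plain = ast.parse(src)

# With type_comments=True, the tokenizer emits TYPE_COMMENT tokens
# and the AST records them on the function node:
typed = ast.parse(src, type_comments=True)

func_plain = plain.body[0]
func_typed = typed.body[0]
```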



I integrated tokenize into the parser. I still need to check whether TYPE_COMMENT is honored; I don’t see how to enable it in tokenize.tokenize().

I’m somewhat surprised that there wasn’t a huge improvement in speed with the Python tokenizer. I guess I’d better optimize the grammar before going for AST.

I did notice increased memory use, so I wrote this to hold the look-back for the PEG parsers:

from collections import deque
from collections.abc import MutableSequence


class Tail(MutableSequence):
    """Sequence that retains only the last maxlen items while keeping
    the indices of the full (virtual) sequence valid."""

    def __init__(self, maxlen):
        self._tail = deque(maxlen=maxlen)
        self._start = 0  # virtual index of the oldest retained item

    def _tailfull(self):
        return len(self._tail) == self._tail.maxlen

    def start(self):
        return self._start

    def insert(self, i, x):
        if self._tailfull():
            # a bounded deque raises IndexError on insert() when full,
            # so evict the oldest item explicitly
            self._tail.popleft()
            self._start += 1
        self._tail.insert(i - self._start, x)

    def __getitem__(self, i):
        return self._tail[i - self._start]

    def __setitem__(self, i, x):
        self._tail[i - self._start] = x

    def __delitem__(self, i):
        del self._tail[i - self._start]

    def __len__(self):
        return len(self._tail) + self._start

Oh, tokenize.py isn’t what the native parser uses; it’s a reimplementation of the native lexer in pure Python using gnarly regexps (the native lexer doesn’t have an API that’s easily exported to Python). To look for type comments you’d have to intercept tokens of type COMMENT and check whether they start with something like #\s*type:. (And if the first word of what follows is ignore then it’s a type-ignore, which is not a real type comment and can occur anywhere.)
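The interception described above could be sketched like this; the helper name `find_type_comments` is just for illustration.

```python
import io
import re
import tokenize

TYPE_COMMENT_RE = re.compile(r"#\s*type:\s*(.*)")

def find_type_comments(src):
    """Return (row, body, is_ignore) for each type comment in src."""
    results = []
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    for tok in toks:
        if tok.type == tokenize.COMMENT:
            m = TYPE_COMMENT_RE.match(tok.string)
            if m:
                body = m.group(1)
                # '# type: ignore' is a type-ignore, not a real type
                # comment, and can occur anywhere
                is_ignore = body.split()[:1] == ["ignore"]
                results.append((tok.start[0], body, is_ignore))
    return results
```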

Cool that you’ve got this working!

Probably because it’s not the real lexer. :slight_smile:

Yeah, the list of tokens contains all of the input, one line at a time, plus most of the input again, one token at a time, plus several tuples and integers per token (line and col start and end). An implementation in C could be much more memory-efficient.

A safer way to limit the look-back might be to toss the look-back after a NEWLINE token has been accepted. Python’s grammar is designed so that this is safe. (Note that “insignificant” newlines return a NL token, which should always be ignored.)

Is there currently no way to use the C tokenizer from Python?

The reason it matters is that the test cases should reach millions of LOC before moving forward, so any speedup would be welcome (or I can always offload the runs to something that withstands the heat).

Really?! That’s a beautiful design!

and it can be used in PEG by placing a cut on or after each NEWLINE! (got to try this right now)

No. (I suppose you could hack it with ctypes but it’d be fragile and I can’t help you there.)

Why do you need to test that many LoC regularly?

Or maybe with the help of Cython, or some of the other C->Python tools? I’ll definitely take a look into this.

It’s from my experience building parsers for commercial use. The parser must parse all valid Python, and the only way to have 99.99% confidence in that it does that is to regularly parse as much Python as one can get hold of.

A small glitch in the grammar may render some percentage of the valid Python in the wild unparseable, and any percentage is unacceptable.

Thus, it has to be millions of lines, but the regularity can be staged:

  • Always test problematic test cases
  • Test ~/cpython/**/*.py at least once per session (under 6 min on 8 cores)
  • Test the source of most major libraries regularly, and always before declaring a milestone

Because I offload large test-sets to a well-ventilated but cheap (except for the case) floor computer, they can run over each merge to master or so, continuous-integration style (until someone finances doing the likes on CircleCI).

Another explanation is “Hyrum’s Law” (which I learned about on this forum):

As much as one would like the Python language to be defined by the reference documentation, the reality is that it is defined by its current implementation, and that one will likely be surprised at how many important modules in the wild depend on what one would have expected to be invalid syntax.

Dumping the token lookback cache on NEWLINE did not work, I don’t know why. I have to think about it (hint: if a module contains, for example, just a very long class definition, the parser may want to see the initial tokens when reporting weird syntax near the end, as the rule for classdef must fail).

Currently, a cache of 8K failed two cases in ~/cpython/**/*.py, so I set it to 32K until there’s analysis to back something else.

Hm, I now recall that in my toy project I cleared the memoization caches, not the token cache. But the code that I added (in a comment) doesn’t make sense, since it doesn’t actually check for NEWLINE. I do recall it worked, but I haven’t set up the infrastructure to test it over millions of lines, so maybe I was just lucky.

Yeah (I’ve yet to look into your project; it’s on my short-list). The memoization cache can be partially cleared after each cut, and a cut after each NEWLINE will have the same effect as the NEWLINE-based flush.
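A minimal sketch of that idea, under the assumption that the memo cache is keyed on (position, rule): once a cut commits the parser past a position, entries before it can never be consulted again, so they are safe to evict. The class name `MemoCache` is hypothetical.

```python
class MemoCache:
    """Hypothetical packrat memo cache that supports partial clearing
    after a cut (e.g. a cut placed after each NEWLINE)."""

    def __init__(self):
        self._memo = {}  # (pos, rule) -> cached parse result

    def get(self, pos, rule):
        return self._memo.get((pos, rule))

    def put(self, pos, rule, result):
        self._memo[(pos, rule)] = result

    def flush_before(self, cut_pos):
        """Evict entries the parser can no longer backtrack into."""
        stale = [key for key in self._memo if key[0] < cut_pos]
        for key in stale:
            del self._memo[key]
```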

I changed the rules for typedargslist and varargslist to match the reference documentation, and it worked. That’s interesting in that the reference grammar is mostly PEG already, and it is a much clearer and more intuitive grammar.

I got the million-plus LOC parsed that I wanted, and the failed cases are mostly Python 2.7 source, or test cases meant to fail a parse:

64000 recursionlimit
tatsu/parsers/codragon/cobolresq/parser/                      IndexError:
data/cpython/Lib/lib2to3/tests/data/                          FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
tatsu/parsers/naturalresq/naturalresq/oldguts/models/java/    FailedParse:
data/cpython/Lib/test/                                        FailedParse:
data/cpython/Lib/lib2to3/tests/data/                          FailedParse:
data/cpython/Lib/lib2to3/tests/data/                          FailedParse:
data/twisted/docs/historic/2003/pycon/pb/                     FailedParse:
data/numpy/numpy/core/tests/                                  IndexError:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
data/twisted/docs/historic/2003/pycon/pb/                     FailedParse:
data/cpython/Lib/lib2to3/tests/data/                          FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
tatsu/parsers/cobolryuu/cobolryuu/parser/                     IndexError:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:
data/cpython/Lib/lib2to3/tests/data/                          FailedParse:
tatsu/parsers/javaryuu/testdata/javasrc/src/main/java/beans/  FailedParse:
data/twisted/docs/historic/2003/pycon/deferex/                FailedParse:

       5,480   files input
       5,480   files parsed
   1,954,995   lines input
   1,954,995   lines parsed
       5,459   successes
          21   failures
       99.6%   success rate
     2:25:52   elapsed time
     8:58:25   runtime
          60   lines/sec
         127   MiB max memory

The libraries I chose for that file-based test were:

  • cpython
  • ipython
  • matplotlib
  • numpy
  • scrapy
  • tatsu
  • twisted

It’s way beyond my current knowledge to bring the C tokenizer to Python, so I’m writing a version of tokenize that doesn’t process the input char by char, as that’s too inefficient in Python (I got inconsistent, but much better, performance from the PEG tokenization).

Cool results!

Yeah, that’s why I want to go to a different parsing technology – to bring the reference grammar and the implementation’s grammar much closer.

Not sure what you mean. I’m sure the stdlib’s isn’t the fastest possible without resorting to C, but it’s very true to the native tokenizer, and it’s pretty fast. I have a test specifically for this in my toy project; it clocked at over 6000 lines/sec, so tokenizing would account for only about 1% of your parser’s time, which you report as 60 lines/sec.
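A throughput test of that kind could be sketched as follows; the helper name `tokenize_rate` is hypothetical, and actual numbers will vary by machine and input.

```python
import io
import time
import tokenize

def tokenize_rate(src):
    """Return lines tokenized per second for src (a rough benchmark)."""
    n_lines = src.count("\n")
    start = time.perf_counter()
    for _ in tokenize.generate_tokens(io.StringIO(src).readline):
        pass
    elapsed = time.perf_counter() - start
    return n_lines / elapsed

# A synthetic sample; real-world code mixes longer and shorter lines
sample = "x = 1\n" * 10_000
rate = tokenize_rate(sample)
```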

FWIW I just added left-recursion to the toy project – the link to Pegged that you gave in your docs was very helpful.

Please recruit someone knowledgeable to make the C-libpython tokenizer available from Python?

I’m certain that the effort will more than pay off in many ways in the near future.


That’s by Nicolas Laurent et al, yes. The guys that meet at PEG/MIT are the best of the best.

My expectation is to make enough progress on the toy projects to make them interesting (not-toy) enough for them to become actively involved. So far they’ve helped plenty with the theoretical aspects, especially on boundaries and expectations.

PEG is theoretically O(g*n) in the worst case (invalid input on a long module, probably), and that’s not good enough for Python. It’s much better with memoization/selective memoization, and cut. It has to be profiled, probably over a C implementation.

This paper proves that there are trivial transformations from LL(k) grammars to PEG. The problem with that is that setting a theoretical boundary gives no clues about the engineering/practical boundaries.

I’m on the lookout for a paper that proves that some PEG can be automatically transformed into LL(small-k), as that might solve all the requirements.

We’re all volunteers here. Nobody gets paid to work on this.

Are you disagreeing with my assessment of the speed of tokenize compared to the speed of your parser?

Regarding the involvement of academics, they will have to volunteer as well. There is a certain type of academic paper about language or compiler technology that I am unable to read due to its use of formalism I am not familiar with. (Fortunately I still extracted enough value from the Packrat paper.)

Since you seem to have contacts there you are in the best position to reach out.

That’s so obvious it didn’t need stating. My appeal was to your leverage as leader.

My apologies. Which assessment?

At this point in my project I’m inclined to bring in a profiler to measure whether PEG can cut it on deep-expression Python, or whether the move needs to be towards LL(small-k). Is the time being spent recursing and backtracking?

I’ll keep the TatSu-based Python project going because it’s always been a pet project of mine, but I’ll take it off this forum if it’s not something helpful for CPython parsing.

I would not call them “contacts”, but yeah. I agree.