Preparing for new Python parsing

Aw! That’s a shame… :frowning:

I can work on my own toy project only during parts of some weekends. But that should be fine, as there’s no deadline in sight, and any progress should be good, even if the only usable product is information to feed into the decision-making for Python parsing :slight_smile:

Question about this (it’s keeping me up :-). Does failure to match what follows ~ cause the entire parse to fail (a panic if you will), or does it just skip further alternatives of the current rule, while still allowing further alternatives of “callers” to be probed?

A cut acts locally. After it is seen, regardless of whether what follows succeeds or fails, the parser will skip further alternatives of the current choice, which may be the choice for the whole rule, or a choice within a parenthesized group.

work = preamble (paragraph)* ('->' ~ epilogue | end) | paragraph 

After the cut following -> is seen, epilogue must succeed for the choice to succeed, and end will never be tried; but the rule may still succeed through the second top-level choice, if paragraph succeeds.
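
A minimal sketch of that behavior (toy code, not this project’s implementation; CutSeen, ParseError, and choice are hypothetical names): an alternative containing a ‘~’ raises CutSeen instead of ParseError for any failure after the cut, and the enclosing choice converts that into an ordinary failure so that callers can still backtrack.

class ParseError(Exception):
    """Ordinary failure: the enclosing choice may try other alternatives."""

class CutSeen(Exception):
    """Failure after a '~': abandon the current choice's alternatives."""

def choice(state, *alternatives):
    for alt in alternatives:
        saved = dict(state)              # snapshot for backtracking
        try:
            return alt(state)
        except CutSeen:
            # The cut fired inside this alternative and what followed it
            # failed: skip the remaining alternatives of THIS choice only,
            # reporting an ordinary failure so callers can still backtrack.
            raise ParseError('no match after cut')
        except ParseError:
            state.clear()
            state.update(saved)          # backtrack, try the next alternative
    raise ParseError('no alternative matched')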

Results so far for most of cpython/**/*.py:

--------------------------------------------------------------------------------
         184   files input
         184   files parsed
     126,160   lines input
     126,160   lines parsed
          86   successes
          98   failures
       46.7%   success rate
     0:02:26   elapsed time
     0:29:29   runtime
          71   LOC/s

(runtime differs from elapsed time because of the 8 cores involved)

This is good progress for a weekend project.

My apologies, @guido, but it seems obvious that TYPE_COMMENT was hacked into the grammar:

typedargslist
    =
        '**' tfpdef [','] [TYPE_COMMENT]
    |
        '*' [tfpdef] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
        (
                TYPE_COMMENT
            |
                [',' [TYPE_COMMENT] ['**' tfpdef [','] [TYPE_COMMENT]]]
        )
    |
        tfpdef ['=' test] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
        (
                TYPE_COMMENT
            |
                [
                    ',' [TYPE_COMMENT]
                    [
                        |  '**' tfpdef [','] [TYPE_COMMENT]
                        |  '*' [tfpdef] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
                            (
                                  TYPE_COMMENT
                                | [',' [TYPE_COMMENT] ['**' tfpdef [','] [TYPE_COMMENT]]]
                            )
                    ]
                ]
        )
    ;

I cannot simplify the grammar at this point, or I’d lose the connection to Grammar/Grammar. The only changes allowed are reordering of choices and lexical tweaks. Experiments with rewriting some rules can come later.

With coverage at 46.7%, the expectation is that progress will be exponential, as a single fix in the grammar makes N more files parseable. Reaching 100% coverage is a matter of more hours of work PEGifying the LL(1) original.

I think that I’ll integrate the Python tokenizer before going for 100% coverage, as 71 lines/sec is too slow, and I’m certain that most of the time is being spent keeping track of Python’s lexical idiosyncrasies (newlines allowed within some expressions but not others, multi-line strings, newline escaping, INDENT/DEDENT, etc.). In fact, most of the time I’ve spent so far on the PEG parser has gone into figuring out the lexical details.
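
For what it’s worth, the stdlib tokenize module already copes with all of those idiosyncrasies, which is the main attraction of delegating to it. A quick illustration (standard library only, not this project’s code):

import io
import tokenize

# Implicit line joining inside parentheses, NEWLINE vs NL, and
# INDENT/DEDENT all come out of the box:
source = 'def f(x,\n      y):\n    return (x +\n            y)\n'
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))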

After 100% coverage, and the tokenizer, the target will be AST!

AFTERTHOUGHT: Mmm. Maybe not hacked, but trying everything to get past the limitations of LL(1)? (Some grammar writers call that “hell”.)

On cut, see the paper “Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space” by Kota Mizushima et al.

I haven’t tried implementing it.

The “calling” rule may also be affected by the cut if there are no choices in the callee:

work = preamble (paragraph)* (long_ending | end) | paragraph 
long_ending = '->' ~ epilogue
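
In terms of the toy sketch earlier in the thread: long_ending contains no choice of its own, so a CutSeen raised after '->' propagates out of long_ending unhandled and is caught by the choice inside work, which then gives up without trying end.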

I ended up rewriting the rule for typedargslist. I just couldn’t understand the original. There’s now 88% parse coverage of cpython/**/*.py.

typedargslist
    =
    (
    | [typedargs] (vararg|typedarg) [',' _ [typedargs] (kwvararg|typedarg)]
    | [typedargs] (kwvararg|typedarg)
    | (kwvararg | vararg) [[','] TYPE_COMMENT]
    | typedargs
    )
    ;


typedarg
    =
    tfpdef ['='  _ test] &[[','] TYPE_COMMENT]
    ;


typedargs
    =
    {typedarg ',' [TYPE_COMMENT] _ }+
    ;

vararg
    =
    '*' !'*' ~ [tfpdef ~ &[[','] TYPE_COMMENT] ]
    ;


kwvararg
    =
    '**' ~ tfpdef ~ &[[','] TYPE_COMMENT]
    ;

It seems that this new rule does not include PEP 570. After the implementation I added some comments to explain where the fully expanded rule comes from:

https://github.com/python/cpython/blob/master/Grammar/Grammar#L26

That is correct. It seems I didn’t bootstrap from the latest Grammar/Grammar.

There seem to be no instances of that syntax in the Python source code. I’ll create a unit test and fix the grammar.

There are several cases in the latest master.

I’ll check.

I also need to include the files under cpython/test/data, which I excluded initially because some of them are intentionally bad Python.

I added PEP 570 to the grammar, and refactored varargslist (in the style of typedargslist). There’s now 100% parse coverage for the 184 chosen Python files (126 KLOC).

I’ll work on the tokenization before adding any more test cases, because the current parsing speed is neither good enough nor fun.

Also, although you probably already know, take into account that the actual grammar is more restricted than the Grammar/Grammar file, as there is a second pass that further restricts what Grammar/Grammar allows. I think it would be great if we could collect all of these secondary restrictions and push them into the PEG specification.

This is especially tricky, as there is no way of proving that the new parser parses exactly the same programs as the old one (it is equivalent to the halting problem, if I am not mistaken), so to verify that we need to rely on the test suite (as in, we need to execute it).
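
A sketch of what that corpus-level comparison could look like, assuming some entry point new_parse for the new parser (a hypothetical name, stubbed here with ast.parse just so the snippet runs):

import ast
from pathlib import Path

new_parse = ast.parse  # stand-in for the hypothetical new parser's entry point

def disagrees(path: Path) -> bool:
    """True when the two parsers produce different ASTs (or only one raises)."""
    src = path.read_text(encoding='utf-8', errors='replace')
    try:
        reference = ast.dump(ast.parse(src))
    except SyntaxError:
        reference = None
    try:
        candidate = ast.dump(new_parse(src))
    except SyntaxError:
        candidate = None
    return reference != candidate

mismatches = [p for p in Path('cpython').rglob('*.py') if disagrees(p)]
print(f'{len(mismatches)} files disagree')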

Yes. I’ve seen that the grammar is too permissive in places (probably because of the intention to keep it simple).

As to secondary restrictions, only the ones that can be expressed syntactically can make it into the grammar; the rest must go into semantic checks.
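
As a concrete example of a restriction that cannot be expressed syntactically, consider duplicate parameter names: Grammar/Grammar happily accepts def f(a, a), but CPython rejects it in a later pass. A toy version of such a check over the AST (my own illustration, not CPython’s code, and it only looks at plain positional parameters):

import ast

class DuplicateParamCheck(ast.NodeVisitor):
    # A semantic check that the grammar itself cannot express.
    def visit_FunctionDef(self, node):
        names = [a.arg for a in node.args.args]
        duplicates = {n for n in names if names.count(n) > 1}
        if duplicates:
            raise SyntaxError(f'duplicate argument(s) {duplicates} in {node.name!r}')
        self.generic_visit(node)

DuplicateParamCheck().visit(ast.parse('def f(a, b): pass'))  # passes quietly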

I’ll first work on tokenization, for speed (testing takes too long), and because it’s a required component in the experiment/plan.

After the tokenization I’ll work on generating the AST. Since the grammar must be modified so that each AST node type is paired with at least one grammar rule, there will be an opportunity to restrict the accepted syntax to something closer to the actual Python language definition.

Because there will not be a CST, I think it will be inevitable that some of the semantic checks migrate into either the AST or the AST maker.

As a reminder, the grammar and code for this PGLC (PEG spec for the Python language) project are here:

The grammar for typedarglist after adding support for PEP 570:

typedargslist
    =
    [typedposonlyargs ',']
    (
        | [typedargs] (typedvararg | typedarg) [',' _ [typedargs] (typedkwvararg | typedarg)]
        | [typedargs] (typedkwvararg | typedarg)
        | typedkwvararg
        | typedvararg [[','] TYPE_COMMENT]
        | typedargs
    )
    |
        typedposonlyargs [[','] TYPE_COMMENT]
    ;


typedarg
    =
    tfpdef ['='  _ test] &[[','] TYPE_COMMENT]
    ;


typedargs
    =
    {typedarg ',' [TYPE_COMMENT] _ }+
    ;

typedposonlyargs
    =
    [typedargs ','] '/'
    ;

Mmm. The current PEG grammar assumes single *args and **kwargs, yet it passed all the ~/cpython/**/*.py I gave it.

I need to review this: Ensure multiple `*args` and `**kwargs` are allowed in `typedargslist` · Issue #3 · neogeny/pygl · GitHub

In the Python 3.8 docs for defining functions, it says:

When a final formal parameter of the form **name is present,

Emphasis mine. I’m confused, @guido, @pablogsal.

I think this is referring to the fact that:

>>> def f(a,b,*args,c,**kwargs):
...   pass

is legal in Python 3 but not in Python 2. But `*args` and `**kwargs` can only appear once in the declaration, and `**kwargs` must be at the end.
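
Both restrictions are easy to confirm mechanically, since CPython rejects them at parse time:

for src in ('def f(*a, *b): pass',      # more than one *args
            'def f(**k, c=1): pass'):   # **kwargs not at the end
    try:
        compile(src, '<test>', 'exec')
        print('accepted:', src)
    except SyntaxError as e:
        print('rejected:', src, '->', e.msg)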

That makes sense. But I don’t think it’s about multiple *args; rather, that keyword-only args may appear after it? How to deal with this otherwise?

def f(a,b,*,c,**kwargs):
    pass

Something that’s been interesting is finding out where Python will allow newlines when the tokenizer is not there to help. My conclusion is that it is all logical, and at the same time arbitrary.

This is valid:

def f() -> int: 
    return 0

This is not:

def f() 
-> int: 
    return 0
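
Both snippets can be probed mechanically with compile(), which is handy when mapping out where line breaks are tolerated. Note that an explicit backslash continuation makes the second form legal again:

ok = 'def f() -> int:\n    return 0\n'
bad = 'def f()\n-> int:\n    return 0\n'
also_ok = 'def f() \\\n-> int:\n    return 0\n'  # explicit line joining

compile(ok, '<ok>', 'exec')        # accepted
compile(also_ok, '<ok2>', 'exec')  # accepted: the backslash joins the lines
try:
    compile(bad, '<bad>', 'exec')
except SyntaxError as e:
    print('rejected:', e.msg)      # the bare newline before '->' breaks the def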

I’m almost certain that there is syntax that is admitted in file mode but not in single-line mode (I regret that I didn’t take note of the weird cases).

In the PEG grammar, the _ rule marks parts where newlines must be permitted, and EQDENT marks where the indent must be the same.