Results so far for most of cpython/**/*.py:
--------------------------------------------------------------------------------
184 files input
184 files parsed
126,160 lines input
126,160 lines parsed
86 successes
98 failures
46.7% success rate
0:02:26 elapsed time
0:29:29 runtime
71 LOC/s
(runtime differs from elapsed time because the work was spread over 8 cores)
This is good progress for a weekend project.
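For the curious, here is a minimal sketch (not the actual test harness) of how numbers like the above can be collected: parse the files on a pool of 8 workers, keep the wall clock as elapsed time, and sum the per-file times into runtime, which is also what the LOC/s figure is measured against. The builtin compile() stands in for the PEG parser here.

import glob
import time
from multiprocessing import Pool

def parse_file(path):
    # Stand-in worker: compile() plays the role of the PEG parser in this sketch.
    with open(path, encoding="utf-8", errors="replace") as f:
        source = f.read()
    start = time.perf_counter()
    try:
        compile(source, path, "exec")
        ok = True
    except SyntaxError:
        ok = False
    return ok, source.count("\n") + 1, time.perf_counter() - start

if __name__ == "__main__":
    paths = glob.glob("cpython/**/*.py", recursive=True)
    wall = time.perf_counter()
    with Pool(8) as pool:                     # the 8 cores mentioned above
        results = pool.map(parse_file, paths)
    elapsed = time.perf_counter() - wall
    runtime = sum(t for _, _, t in results)   # cumulative time across workers
    lines = sum(n for _, n, _ in results)
    successes = sum(1 for ok, _, _ in results if ok)
    print(f"{successes}/{len(paths)} files, {lines:,} lines, "
          f"{elapsed:.0f}s elapsed, {runtime:.0f}s runtime, "
          f"{lines / runtime:.0f} LOC/s")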
My apologies, @guido, but it seems obvious that TYPE_COMMENT was hacked into the grammar:
typedargslist
    =
    '**' tfpdef [','] [TYPE_COMMENT]
    |
    '*' [tfpdef] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
        (
            TYPE_COMMENT
            |
            [',' [TYPE_COMMENT] ['**' tfpdef [','] [TYPE_COMMENT]]]
        )
    |
    tfpdef ['=' test] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
        (
            TYPE_COMMENT
            |
            [
                ',' [TYPE_COMMENT]
                [
                    | '**' tfpdef [','] [TYPE_COMMENT]
                    | '*' [tfpdef] {',' [TYPE_COMMENT] tfpdef ['=' test]}*
                        (
                            TYPE_COMMENT
                            | [',' [TYPE_COMMENT] ['**' tfpdef [','] [TYPE_COMMENT]]]
                        )
                ]
            ]
        )
    ;
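For context, the TYPE_COMMENT occurrences after each comma are there to accept PEP 484 per-argument type comments, as in this hypothetical example:

def send(self, data,    # type: bytes
         flags=0,       # type: int
         ):
    # type: (...) -> int
    return 0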
I cannot simplify the grammar at this point, or I’d lose the connection to Grammar/Grammar. The only changes allowed are reordering of choices and lexical tweaks; experiments with rewriting some rules can come later.
Starting from 46.7% coverage, the expectation is that progress will be exponential, as a single fix in the grammar makes N more files parseable. Reaching 100% coverage is a matter of more hours of work on PEGifying the LL(1) original.
I think I’ll integrate the Python tokenizer before going for 100% coverage, as 71 lines/sec is too slow, and I’m certain that most of the time is being spent keeping track of Python’s lexical idiosyncrasies (newlines allowed within some expressions and not others, multi-line strings, newline escaping, INDENT/DEDENT, etc.). In fact, most of the time I’ve spent so far on the PEG parser has gone into figuring out the lexical details.
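As a quick illustration of what the stdlib tokenizer already takes care of, the sketch below feeds it a small snippet and prints the token stream: the newline inside the parentheses comes out as NL rather than NEWLINE, and the block structure shows up as INDENT/DEDENT tokens.

import io
import tokenize

src = (
    "def f(x):\n"
    "    return (x +\n"   # this newline is inside parentheses
    "            1)\n"
)
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))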
After 100% coverage, and the tokenizer, the next target will be producing an AST!
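For reference, the target is the same kind of tree the stdlib produces, something along the lines of:

import ast

# The exact output shape varies slightly across Python versions, roughly:
# Module(body=[Assign(targets=[Name(id='x', ...)], value=Constant(value=1))], ...)
print(ast.dump(ast.parse("x = 1")))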
AFTERTHOUGHT: Mmm. Maybe not hacked, but rather trying everything to get past the limitations of LL(1)? (Some grammar writers call that “hell”.)