Yes to the first three.
I don’t want to do anything about the tokenizer, since it works fine. It is full of arcana, and I worry that replacing it with a different mechanism would break corner cases.
In contrast, the grammar-to-CST mechanism we currently use is straightforward, and I don’t expect nasty corner cases there if we replace it with e.g. a PEG-based parser.
The current CST-to-AST translation code is too ad hoc and complex, because it has to make up for restrictions in pgen, and I want to replace it. If we couldn’t replace it, I think the whole project would not be worth it.
Finally, the memory we would save by skipping the CST could be put toward the memoization cache. (The CST is quite large, due to the explicit presence of “trivial” nodes with a single child.)
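For a concrete sense of that overhead, here is a quick probe (my illustration, not from the original discussion) using the stdlib parser module (removed in Python 3.10), which exposes pgen’s CST as nested lists; for a lone NAME, nearly every level is a single-child wrapper:

    import parser  # stdlib module exposing pgen's CST as nested lists

    def singleton_depth(node):
        # Follow the first child at each level; each hop through a list
        # is one CST node, almost all of them single-child wrappers.
        depth = 0
        while isinstance(node, list):
            depth += 1
            node = node[1]
        return depth

    # For the expression "x" this prints well over a dozen levels,
    # all to represent a single NAME token.
    print(singleton_depth(parser.expr("x").tolist()))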
In my toy project, I want to get to the point where, in pure Python, I can go from a stream of tokens to an AST. (This is feasible because the ast module lets you create and combine AST nodes.)
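As a minimal sketch of what that looks like (my example, not code from the toy project), here is the AST for x = y built by hand and validated with compile():

    import ast

    # Hand-build the AST for "x = y" with no CST in between,
    # the way a tokens-to-AST parser would.
    tree = ast.Module(
        body=[
            ast.Assign(
                targets=[ast.Name(id="x", ctx=ast.Store())],
                value=ast.Name(id="y", ctx=ast.Load()),
            )
        ],
        type_ignores=[],  # required field on Python 3.8+
    )
    ast.fix_missing_locations(tree)  # fill in line/column info compile() needs
    compile(tree, "<toy>", "exec")   # succeeds only if the tree is well-formed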
We can then double-check that the AST created by the toy for a given program is identical to that created by the current parser, and iterate on areas where it isn’t.
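One plausible shape for that check (a hypothetical helper, named here only for illustration): parse the same source with the builtin parser and compare the ast.dump() renderings:

    import ast

    def matches_reference(source: str, toy_tree: ast.AST) -> bool:
        # ast.dump() renders a tree as a canonical string, so two trees
        # compare equal exactly when their structure and fields agree.
        return ast.dump(toy_tree) == ast.dump(ast.parse(source))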
I am looking at parsing technology other than pgen because I want to be able to write the grammar to match how I think about the program, e.g. (that same example again):
    start: assignment | expression
    assignment: NAME '=' expression
    expression: NAME | <other categories>
PEG + packrat looks like a reasonable candidate, so I am looking into that in detail.
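To make the combination concrete, here is a toy packrat parser for the three-rule grammar above (my sketch, not the actual toy project; tokenization and AST construction are elided, and expression is reduced to a bare NAME):

    class ToyParser:
        def __init__(self, tokens):
            self.tokens = tokens  # e.g. ["answer", "=", "x"]
            self.memo = {}        # (rule name, position) -> result

        def memoize(self, rule, pos):
            # The packrat part: each (rule, position) pair is computed at
            # most once, so unlimited PEG backtracking stays linear time.
            key = (rule.__name__, pos)
            if key not in self.memo:
                self.memo[key] = rule(pos)
            return self.memo[key]

        def start(self, pos):
            # start: assignment | expression -- PEG ordered choice: try
            # each alternative from the same position until one matches.
            return self.memoize(self.assignment, pos) or self.memoize(self.expression, pos)

        def assignment(self, pos):
            # assignment: NAME '=' expression
            name = self.name(pos)
            if name and name[1] < len(self.tokens) and self.tokens[name[1]] == "=":
                expr = self.memoize(self.expression, name[1] + 1)
                if expr:
                    return ("assign", name[0], expr[0]), expr[1]
            return None  # no match; the caller backtracks to pos

        def expression(self, pos):
            # expression: NAME | <other categories>  (others elided here)
            return self.name(pos)

        def name(self, pos):
            if pos < len(self.tokens) and self.tokens[pos].isidentifier():
                return self.tokens[pos], pos + 1
            return None

    print(ToyParser(["answer", "=", "x"]).start(0))
    # -> (('assign', 'answer', 'x'), 3)

Each rule returns either a (node, next position) pair or None; the memo table is what distinguishes packrat from plain recursive descent with backtracking.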