Preparing for new Python parsing

Great! But it’s cheating!

You’re not feeding the output of the tokenizer to the PEG parser, but making the PEG parser produce a TokenInfo stream as output, which should be fine.

Oh, that is an interesting point I hadn’t thought of.

But I am still skeptical. Python requires keeping track of indentation levels so it can generate DEDENT tokens. See this code snippet. (That’s the Python version of the tokenizer, but the C version does the same.)
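In essence, that code maintains a stack of indentation widths. A much simplified sketch of the idea (not the actual tokenize.py code; tabs, blank lines, and line continuations are ignored here):

def indent_tokens(lines):
    # Stack of indentation widths seen so far; the bottom entry is column 0.
    indents = [0]
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > indents[-1]:
            indents.append(width)
            yield ("INDENT", "")
        while width < indents[-1]:
            indents.pop()
            yield ("DEDENT", "")
        yield ("LINE", line.strip())
    # Close any indentation still open at end of input.
    while len(indents) > 1:
        indents.pop()
        yield ("DEDENT", "")

# e.g. list(indent_tokens(["if x:", "    y", "z"]))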

I must have been unclear. Maybe I shouldn’t be calling what I wrote a PEG parser? I most definitely generate a TokenInfo stream and feed it into the parser, and the parser output is a tree object. (Currently I’ve defined my own Tree class but there’s nothing special about it.)

Not an issue. A Google search turns up multiple solutions for several parsing strategies. Python is not that special in this regard compared to other programming languages.

PEG@MIT just confirmed, within minutes, that tokenizing first or afterwards should not be an issue either.

EDIT: It is likely that PEG@MIT, or at least some of its members, will be interested in the problem of flexible grammars for Python that can lead to very efficient parsers in C-libpython. How about Guido joining that list for a while? Shall I ask who’s interested?

That may seem so if you know mostly LL(k).

A PEG grammar (or an LL(*) one) is able to parse ASSIGN ASSIGN OR NOT-ASSIGN TO ASSIGN, which is valid Natural AG, as I recall (there are worse examples, that I’d rather not recall).

Are the entire three paragraphs that you quoted nonsense? Who is the one who only knows LL(k)? Can you elaborate?

TBH that sounds like nonsense to me – I literally have no idea what ASSIGN ASSIGN OR NOT-ASSIGN TO ASSIGN is supposed to mean, and I haven’t the slightest idea what Natural AG refers to either.

Let someone from that list pop up here first. I could do without more lists.

Exactly. Yet it’s an example of a valid program in a mainstream programming language for which there was a parser.

My apologies for my previous message (which I already edited, learning about a limitation of Discourse along the way).

What I meant is that, since the simplicity of the Python language is strongly protected by the committer community, there’s no risk in that regard. And that, with that settled, anything writable in Python can be easily parsed by an LL(k>=N), LR, or PEG style of parser.

I understand. Yet one of the things I love most about CPython is how much of it (the best, stable sorting algorithm, to name just one thing) comes from the best research.

There will be a balance, I think, as there’s been so far.

Got it. But it’s not Python. It seems we’re regularly misunderstanding each other because I am only talking about the problems of Python in particular, while you are looking at it from a much more general perspective.

I still believe that, unless we drastically change Python, we should never need more lookahead than a single “simple statement”, which ends at a semicolon or at a significant newline. But I haven’t proven this yet, because my toy cannot yet generate a parser that can parse all of Python.

I don’t think there’s a misunderstanding. Yes, I think it is good to take a look at the general theory of programming languages before deciding anything. The examples that are not Python are just to try to explain what falls into a language category, or what a parser for one category can parse. I’ve already mentioned that since Python is currently LL(1), it is parseable with anything, so other criteria (like the original grammar expressiveness) can play a larger role than the implementation details.

Yes, it is proven (see here), because there is an LL(1) parser for Python.

Again, the LL(1) category for Python is proven, so you have nothing to prove. You just need to finish the work on your toy parser. Does the difference in semantics make sense? Does it help?

EDIT: Let me try to find out the theoretical bounds for memoization in a PEG parser for a language known to be LL(1). I don’t remember reading a paper on the subject, but it has to be O(N), with N == size of the sought AST (IOW perfect).

I don’t believe that. I know Python is a CFG, but I don’t believe this fragment is LL(1):

start: assignment | expression
assignment: NAME '=' expression
expression: NAME | <other categories>

The problem with this is that there’s a conflict: the FIRST sets of assignment and expression intersect (NAME is in both) and you need backtracking or more than 1 token of context to handle this. See the Wikipedia page.
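To make the conflict concrete, here is a toy illustration (invented token tuples, not my actual code): a predictive parser cannot pick a branch from the first token alone, while PEG-style ordered choice simply backtracks and tries the next alternative:

def parse_start(tokens, pos=0):
    # Ordered choice: try `assignment`; on failure, retry `expression`
    # from the same position.
    result = parse_assignment(tokens, pos)
    if result is not None:
        return result
    return parse_expression(tokens, pos)

def parse_assignment(tokens, pos):
    if tokens[pos][0] == "NAME" and tokens[pos + 1][0] == "EQUALS":
        expr = parse_expression(tokens, pos + 2)
        if expr is not None:
            node, end = expr
            return ("assignment", tokens[pos][1], node), end
    return None

def parse_expression(tokens, pos):
    if tokens[pos][0] == "NAME":
        return ("name", tokens[pos][1]), pos + 1
    return None

# Both inputs start with NAME, so one token of lookahead cannot decide.
print(parse_start([("NAME", "x"), ("EQUALS", "="), ("NAME", "y"), ("END", "")]))
print(parse_start([("NAME", "x"), ("END", "")]))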

A language belongs to LL(1) if it can be parsed by an LL(1) parser, period. It makes no sense for us to discuss language theory.

The discussions over a quest for other kinds of grammars (theoretical language classes) and parsing strategies started because you asserted (and nobody has disagreed) that LL(1) grammars are restrictive in expressive power.

That’s a non-issue if you’re willing to consider PEG, which could be larger than CFG, and probably linear in memory and time for a language like Python.

My experience is that LL(small-k) is so restrictive that the only way to know if a grammar complies is to pass it through the LL(small-k) parser generator. LL and LR are outstanding theoretical and practical achievements from the epoch of shared computer time and 256K RAM for a whole computing department. Today is now.

Although I sketched a plan I can work on my own in this discussion thread, I made sure to at least try to reconcile it with what you requested, which I think I’ve summarized several times as something like:

  1. Python parsing is old, and there’s an opportunity to modernize it before 3.9 or 4.0.
  2. Among the reasons to modernize, the expressive power of LL(1) is poor for current Python.
  3. How about going directly to AST considering that CST is only used during parsing?
  4. Do we actually need the tokenizer?

There you go!

Yes to the first three.

I don’t want to do anything about the tokenizer, since it works fine. It is full of arcana, and I worry that replacing it with a different mechanism would break corner cases.

In contrast, the grammar-to-CST mechanism we currently use is straightforward, and I don’t expect nasty corner cases there if we replace it with e.g. a PEG-based parser.

The current CST-to-AST translation code is too ad-hoc and complex, because it has to make up for restrictions in pgen, and I want to replace it. If we couldn’t replace it I think the whole project would not be worth it.

Finally, the memory we might save by skipping the CST would give us some space back that we could use for the memoization cache. (The CST is quite large, due to the explicit presence of “trivial” nodes with a single child.)

In my toy project, I want to get to the point where, in pure Python, I can get from a stream of tokens to an AST. (This is feasible because the ast module lets you create and combine AST nodes.)
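A minimal sketch of what that looks like, building the AST for x = 1 by hand and compiling it (on Python 3.8+, where Module takes a type_ignores field):

import ast

module = ast.Module(
    body=[
        ast.Assign(
            targets=[ast.Name(id="x", ctx=ast.Store())],
            value=ast.Constant(value=1),
        )
    ],
    type_ignores=[],
)
ast.fix_missing_locations(module)  # fill in the required lineno/col_offset fields
exec(compile(module, "<built by hand>", "exec"))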

We can then double-check that the AST created by the toy for a given program is identical to that created by the current parser, and iterate on areas where it isn’t.
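The check itself can be as simple as comparing dumps (toy_parse here is just a placeholder name for whatever entry point the toy project ends up exposing):

import ast

def same_ast(source, toy_parse):
    # toy_parse stands in for the toy project's parser; name is invented.
    return ast.dump(ast.parse(source)) == ast.dump(toy_parse(source))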

I am looking for other parsing technology than pgen because I want to be able to write the grammar to match how I think about the program, e.g. (that same example again):

start: assignment | expression
assignment: NAME '=' expression
expression: NAME | <other categories>

PEG + packrat looks like a reasonable candidate so I am looking into that in detail.
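The packrat part is mostly a cache keyed on (rule, position); a rough sketch of the idea, with invented names rather than anything from the actual toy:

import functools

def memoize(rule):
    # Cache (rule, position) -> result; caching failures as well is what
    # keeps PEG backtracking linear in the size of the input.
    @functools.wraps(rule)
    def wrapper(self, pos):
        key = (rule.__name__, pos)
        if key not in self.cache:
            self.cache[key] = rule(self, pos)
        return self.cache[key]
    return wrapper

class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.cache = {}

    @memoize
    def expr(self, pos):
        # A rule returns (node, new_position) on success and None on failure.
        if self.tokens[pos][0] == "NAME":
            return ("name", self.tokens[pos][1]), pos + 1
        return None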

I too saw many (unwanted?) dependencies on the tokenizer. Nothing to worry about, as the experts said the tokenizer may stay regardless of the language category or parsing strategy chosen.

I forgot to mention “maintainability” among the summary points, and that was among your original main concerns as I remember now (I think currently nobody dares come near pgen).

Yes.

I think that the “toy-project” will likely become the blueprint. You’re way past the warnings by Dennis Ritchie I quote here. You can do what you want. And it’s a fun project! I’m sorry (though I warned) that I didn’t have the time to do the feasibility testing you are doing during these past weeks.

Yes. Today there doesn’t have to be a performance penalty for a grammar that humans can understand.

In case you eventually decide in favor of PEG (knowing that LL(small-k) will also work), I’ll repeat the warning: It is non-obvious to explain why choices must be ordered in PEG grammars.

And it is non-obvious to explain why an LL(k) parser generator will reject a grammar… So?

So true! I’ve now gotten far enough with my toy project to have experienced this myself. Consider this grammar. (Sorry for using the pgen notation – please read ':' as '<-' and '|' as '/'.)

start: NAME '(' arguments ')'
arguments: kwargs | posargs [',' kwargs]
posargs: posarg (',' posarg)*
kwargs: kwarg (',' kwarg)*
posarg: expr
kwarg: NAME '=' expr
expr: NAME

The intention is that it accepts simple call expressions like these:

foo(a)
foo(a, b)
foo(x=y)
foo(a, x=y)

But not this (since positional arguments must not follow keyword arguments):

foo(x=y, a)

In fact, it accepts the first three but not the fourth one, foo(a, x=y). Why not? For a hint, see this fixed version:

start: NAME '(' arguments ')'
arguments: kwargs | posarg [',' arguments]
kwargs: kwarg (',' kwarg)*
posarg: expr
kwarg: NAME '=' expr
expr: NAME

SPOILER:

The first version of the grammar parses a,x=y as posargs (i.e., a,x) followed by =y. This is because PEG is always eager. How can I make this kind of situation easier to debug?


In my experience grammar design is like program design, or like language design. Part of it is logic, and another part is intuition (or experience) and aesthetics.

As a rule of thumb, don’t use recursion in PEG rules unless the construct being described is recursive, because the extended EBNF of PEG is powerful enough to not require it.

Another hint is that lookaheads and negative lookaheads are in PEG for a reason.

This grammar solves that language fragment, because the longest choices come first, and lookaheads avoid the wrong choices (the negative lookahead is not actually needed, but better safe than sorry):

start: NAME '(' arguments ')'
arguments: kwargs | posargs ',' kwargs | posargs | ()
posargs: posarg (',' posarg)* 
kwargs: kwarg (',' kwarg)*
posarg: expr !'='
kwarg: NAME '=' expr
expr: NAME

As to debugging, I chose cut expressions for TatSu. A cut parses nothing, but the parser commits to the current choice after it sees one. While debugging, using cut makes the parser fail early if the rules are not right. Using ~ for cut, the above could be:

start: NAME '(' arguments ')'
arguments: kwargs | posargs ',' ~ kwargs | posargs | ()
posargs: posarg (',' ~ posarg)* 
kwargs: kwarg (',' ~ kwarg)*
posarg: expr !'='
kwarg: NAME '=' ~ expr
expr: NAME

Ah, very cool. I need to add lookaheads and cut to the toy project!
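In a hand-written parser they might come out roughly like this (a sketch with invented names and token tuples, not code from the toy yet):

class ParseError(Exception):
    pass

class CallParser:
    # Token stream is a list of (type, value) pairs, e.g. ("NAME", "a").
    def __init__(self, tokens):
        self.tokens = tokens

    def peek(self, pos):
        return self.tokens[pos][0] if pos < len(self.tokens) else "END"

    def expr(self, pos):
        if self.peek(pos) == "NAME":
            return ("name", self.tokens[pos][1]), pos + 1
        return None

    def posarg(self, pos):
        result = self.expr(pos)
        # Negative lookahead (!'='): only a posarg if the next token is NOT '='.
        if result is None or self.peek(result[1]) == "EQUALS":
            return None
        return result

    def kwarg(self, pos):
        if self.peek(pos) != "NAME" or self.peek(pos + 1) != "EQUALS":
            return None
        # Cut (~): after NAME '=' we are committed; failing here is a hard
        # error rather than a signal to backtrack into another alternative.
        result = self.expr(pos + 2)
        if result is None:
            raise ParseError("expected an expression after NAME '='")
        return ("kwarg", self.tokens[pos][1], result[0]), result[1]

p = CallParser([("NAME", "x"), ("EQUALS", "="), ("NAME", "y")])
print(p.kwarg(0))   # (('kwarg', 'x', ('name', 'y')), 3)
print(p.posarg(0))  # None, because of the negative lookahead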

Well, there’s already some parsing of Python using PEG/packrat with pglc and TatSu. It will likely take a couple more sessions to finish debugging the grammar.

This is output from parsing the Python in CPython (smallest files first):

(attached image: pglc_cpython output)

(@guido There may be ideas there that can be used in your parser).

I probably have to put the toy project on ice; too many other things are begging for my attention. :frowning: