Preparing for new Python parsing

I’ve started blogging on Medium about this. It’ll become a series. Here’s the first installment:


And part two is published. This one is about the structure of the generated parsers, shown using a hand-written parser.


Part three is about generating the parser from the grammar, and packrat parsing through memoization.


Part four develops a visualization of the parsing process.

Hello, I’m here.
I haven’t caught up with this topic yet, since I only discovered it a few hours ago.
Anyway, I would be very glad if you reported any issues with PackCC via GitHub or SourceForge.
I’m now working on improving PackCC again.

Hi Arihiro-san,

Thanks for popping up here! I haven’t really tried PackCC, but I skimmed the docs and it looks pretty impressive. I have one suggestion: please get off SourceForge and do all new development on GitHub.

–Guido

Thank you for your advice.
I’ll migrate to GitHub with all my baggage.


Part 5 shows what to do about left-recursive grammars (and tries to argue they are useful):

As mentioned in the medium comment thread, it’s possible to use a single pass operator-precedence parser to produce a parse tree for complex languages, and this approach meshes well with tooling like IDEs that need to glean some useful information from partial or broken inputs.

The rough draft derives such a parser for C-like languages, shows that it extends to lexically and grammatically complex languages like TypeScript, and shows how it localizes damage due to syntax errors.

Interesting. It’s a lot to take in, and you don’t mention Python, though if you add a lexer that turns indents and dedents into ‘{’ and ‘}’, and newlines into ‘;’, it’s just another C-like language :-). The good news is that Python was always defined with a simplistic lexer+parser in mind, so there are no ambiguities like the ones you mention in the article.
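That half-joking recipe can actually be sketched with the stdlib tokenize module. This is only an illustration (braceify is a made-up name, and the whitespace handling is naive), but it shows how little stands between Python’s token stream and a C-like one:

```python
import io
import tokenize

def braceify(source):
    """Turn INDENT/DEDENT into '{'/'}' and logical newlines into ';'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            if out and out[-1] == ';':
                out[-1] = '{'          # the NEWLINE before an indent opens the block
            else:
                out.append('{')
        elif tok.type == tokenize.DEDENT:
            out.append('}')
        elif tok.type == tokenize.NEWLINE:
            out.append(';')            # end of a logical line
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            continue                   # non-logical line breaks and comments
        else:
            out.append(tok.string)
    return ' '.join(out)

print(braceify("if x:\n    y = 1\n"))  # if x : { y = 1 ; }
```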

I’ve definitely thought about using operator precedence to parse Python’s expressions, and I may try this if I run into problems with stack depth in the recursive-descent parser, or perhaps to save memory; but I’ve always seen it as a way to deal with a boring corner of the grammar rather than as a way to implement error recovery – so thanks for bringing up that idea.

That said, I have no immediate plans to work on error recovery – for 29 years, Python’s parser has stopped at the first error and people have put up with it (even for tools like linters). Of course, what IDEs do is up to them; I’m sure that PyCharm and the like have their own error-tolerant parsers – in fact, I would assume that IntelliJ has its own Java machinery for error-tolerant parsing of arbitrary languages.


If you add a lexer that turns indents and dedents into ‘{’ and ‘}’, and newlines into ‘;’ it’s just another C-like language :-).

Yep. It might be easier to change the lexer to yield ‘\n’ tokens for logical line breaks which would let you define a postfix ‘\n’ operator, and define indent/dedent as a bracket pair.

The good news is that Python was always defined with a simplistic lexer+parser in mind so there are no ambiguities like you mention in the article.

Yep, in which case canNest can be simplified to a lookup table. For 30 operators, you’d need less than 1 kB for the table.
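For concreteness, here is a rough sketch of that idea in Python (the names, the operator list, and the precedence scheme are all illustrative, not the actual pyparsepy or C++ code; packed as one bit per pair, a 30×30 table is about 113 bytes, while the Python lists here trade space for clarity):

```python
# Hypothetical operators in increasing binding tightness.
OPERATORS = ["or", "and", "not", "+", "-", "*", "/"]
PREC = {op: i for i, op in enumerate(OPERATORS)}

# Precompute canNest(outer, inner): an inner operator may nest under an
# outer one when it binds at least as tightly.
CAN_NEST = [
    [PREC[inner] >= PREC[outer] for inner in OPERATORS]
    for outer in OPERATORS
]

def can_nest(outer, inner):
    """Constant-time table lookup instead of a general function."""
    return CAN_NEST[PREC[outer]][PREC[inner]]

print(can_nest("+", "*"))  # True: '*' binds tighter, may nest under '+'
print(can_nest("*", "+"))  # False
```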

Also, speaking as someone who’s written static analyzers: thanks for keeping it simple.

I have no immediate plans to work on error recovery

Fair enough. Besides simplifying toolchain integration, one of the benefits of this approach is that the parse tree → AST pass can be grammar-based without any need for left-recursion handling, but it sounds like you’ve got left recursion solved.

I’m sure that PyCharm and the like have their own error-tolerant parser

Error tolerant parsing can help outside the IDE.
If python -i or some other REPL can’t parse some chunk of input, it could highlight the problematic code and use readline tricks to re-present the last input with the cursor at a likely error site.

IntelliJ has its own Java machinery for error-tolerant parsing

Quite right. Python and Java are important enough to warrant maintaining multiple toolchains, but I hope to improve tool integration for niche languages.

I wonder if it would be useful to create an implementation in Python of your ideas? It would run slower, but it would be much easier for people to play around with the idea and adapt it to their own tools. (I have to admit that when I saw that your code is C++ I shrank back from trying to understand it.)


@pablogsal, @apalala:
A former colleague warned me about a downside of packrat parsing: for large ambiguous inputs the backtracking may cost a lot of time and memory, and may even be exponential. A worst-case scenario for Python’s syntax would be an expression statement starting with a very long list or dict literal:

>>> [a, b, c, d, e, ......, zzzz]  # Lots and lots of variable names
>>> [a, b, c, d, e, ......, zzzz] = range(N)

One of these could conceivably be the output of a code generator. The parser has to parse up to the = before it can tell what it is seeing. If there’s no =, it has to reset and re-parse using a different rule (and “list target” isn’t going to be the same syntax as “list expression”).

I’m sure there’s a solution.
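One standard answer is the memoization that makes packrat parsing “packrat” in the first place: each (rule, position) pair is computed at most once, so when the parser fails at the end of the long list and backtracks to retry it as a target, the per-element sub-parses are answered from the cache rather than re-run. A minimal sketch (all names here are hypothetical, not the actual implementation):

```python
from functools import wraps

def memoize(rule):
    """Cache a parsing rule's result (and end position) per start position."""
    @wraps(rule)
    def wrapper(self):
        key = (rule.__name__, self.pos)
        if key in self.memo:
            result, endpos = self.memo[key]
            self.pos = endpos          # jump to where the rule ended
            return result
        start = self.pos
        result = rule(self)
        if result is None:
            self.pos = start           # failure: rewind for backtracking
        self.memo[key] = (result, self.pos)
        return result
    return wrapper

class TokenParser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
        self.memo, self.calls = {}, 0

    @memoize
    def name(self):
        self.calls += 1                # counts actual rule-body executions
        if self.pos < len(self.tokens) and self.tokens[self.pos].isidentifier():
            self.pos += 1
            return self.tokens[self.pos - 1]
        return None

p = TokenParser(["spam"])
assert p.name() == "spam"
p.pos = 0                              # backtrack, as after a failed alternative
assert p.name() == "spam"              # served from the memo table
assert p.calls == 1                    # the rule body ran only once
```

This keeps the worst case linear in tokens × rules at the price of memory for the memo table, though it doesn’t by itself resolve that “list target” and “list expression” are different shapes.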

I wonder if it would be useful to create an implementation in Python of your ideas?

pyparsepy/ now contains a very rough implementation in Python with Python-style operators.

Notable differences from the one for C-like languages:

  • The operator precedence function has special-case handling for lambdas, and lambda is defined as a ternary operator, because in
    f(a, lambda x,y: x + y, c)
    
    the lambda appears in a comma-separated list of actuals, and itself can contain a comma-separated list of formals.
    Similarly, in this silly program, the first colon must associate with the lambda.
    if lambda x: x:
      pass
    
  • The lexer emits newline tokens, but only at the end of logical lines, and not after the start of input, ‘:’, INDENT, or other newline tokens.
    As you proposed, this allows treating '\n' as a statement-level postfix operator, like ‘;’ in C.
    This is necessary to disambiguate
    # Not a function call
    x = f
    ()
    
    and
    # A function call
    x = f \
    ()
    
  • A pre-parse pass greedily combines ‘not in’ and ‘is not’ into single tokens to avoid special-case handling for multi-word operators.
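The pre-parse pass from the last bullet is simple enough to sketch in a few lines (fuse_multiword is an illustrative name, not the actual pyparsepy code; like any greedy pass, it relies on Python itself treating an adjacent ‘is’ ‘not’ as the single operator):

```python
# Two-token operators to fuse into single tokens before parsing.
MULTIWORD = {("not", "in"), ("is", "not")}

def fuse_multiword(tokens):
    """Greedily combine adjacent multi-word operator tokens."""
    fused, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            fused.append(" ".join(pair))
            i += 2
        else:
            fused.append(tokens[i])
            i += 1
    return fused

print(fuse_multiword(["a", "not", "in", "b"]))   # ['a', 'not in', 'b']
print(fuse_multiword(["x", "is", "not", "None"]))  # ['x', 'is not', 'None']
```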

This sounds like a fun problem to think about. Could things be structured so that the parsed object could be reused (i.e. don’t re-parse) if it gets to the end and finds that it’s a list target rather than list expression, or vice versa? If you were able to get to the end, then it seems like what you’ve parsed is already known to satisfy the syntax for both list target and list expression. (If it didn’t, you would have discovered this earlier.) To express this another way, it’s like the two possibilities have a common base class, and when you get to whatever comes after, you can resolve it to the appropriate subclass.
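One way to sketch that “common base class” idea: parse the bracketed form once into a neutral node, then resolve it to the expression or target subclass only when the following token is known. (All names below are hypothetical; a real version would also re-validate the items as targets, since a list target cannot contain, say, a literal 1 – but that check is a cheap tree walk, not a re-parse.)

```python
from dataclasses import dataclass

@dataclass
class BracketedList:
    """Neutral parse result, valid as an expression or as a target."""
    items: list

class ListExpr(BracketedList):
    pass

class ListTarget(BracketedList):
    pass

def resolve(node, next_token):
    """Pick the subclass once the token after the ']' is visible."""
    cls = ListTarget if next_token == "=" else ListExpr
    return cls(node.items)     # reuses the already-parsed items; no re-parse

node = BracketedList(["a", "b", "c"])
assert isinstance(resolve(node, "="), ListTarget)
assert isinstance(resolve(node, "\n"), ListExpr)
```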

Part 6 is short; it just shows how I’m doing actions.


Indeed, and I am far from despondent about it – I mostly added this note so that we actually test this use case when we’re further along with the implementation, and catch any weird slowdown. There are lots of interesting corner cases that currently produce syntax errors without column numbers, like

[a, b, c, 1] = range(4)
f().a = 0

I’d like to solve that by having a grammar for target that overlaps with expression. Off the top of my head (untested):

target: atom '.' NAME | atom '[' subscripts ']' | target_atom
target_atom: NAME | '(' [targets] ')' | '[' [targets] ']'
targets: target (',' target)* [',']

whereas somewhere in expression land there’s something like

expression: .........
molecule: molecule '.' NAME | molecule '[' subscripts ']' | molecule '(' [arguments] ')' | atom
atom: NAME | NUMBER | STRING+ | '...' | 'None' | 'True' | 'False' | group | list | dict | set
group: '(' [expressions] ')'
list: '[' [expressions] ']'
dict: '{' [key_values] '}'
key_values: key_value (',' key_value)* [',']
key_value: expression ':' expression
set: '{' expressions '}'
expressions: expression (',' expression)* [',']

(Leaving aside how to get from expression to molecule using the various operators.)

Compare target to molecule, and target_atom to atom, and see how they overlap yet differ.
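For anyone who wants to play with the overlap, here is a hand-written recursive-descent sketch of just the target_atom/targets subset of that draft grammar (hypothetical names; the attribute and subscript alternatives of target, and all of expression land, are omitted):

```python
import re

def tokenize(src):
    """Crude lexer: names and the bracket/comma punctuation we need."""
    return re.findall(r"[A-Za-z_]\w*|[()\[\],]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        t = self.peek()
        if t is None or (tok is not None and t != tok):
            raise SyntaxError(f"expected {tok!r}, got {t!r}")
        self.pos += 1
        return t

    def target_atom(self):
        # target_atom: NAME | '(' [targets] ')' | '[' [targets] ']'
        t = self.peek()
        if t in ("(", "["):
            close = ")" if t == "(" else "]"
            self.eat(t)
            items = [] if self.peek() == close else self.targets()
            self.eat(close)
            return (t + close, items)
        if t is not None and re.fullmatch(r"[A-Za-z_]\w*", t):
            return ("name", self.eat())
        raise SyntaxError(f"unexpected token {t!r}")

    def targets(self):
        # targets: target (',' target)* [',']
        items = [self.target_atom()]
        while self.peek() == ",":
            self.eat(",")
            if self.peek() in (")", "]", None):  # allow a trailing comma
                break
            items.append(self.target_atom())
        return items

print(Parser(tokenize("[a, (b, c), d]")).target_atom())
```

A molecule/atom parser for expression land would look almost the same, which is exactly the overlap the rules above are designed around.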


FWIW the next part of the blog series is out:

Also note that I now have a series overview:


Part 8: Implementing PEG Features