Lossless Syntax Trees

nchammas · December 6, 2018, 6:01pm

Continuing the discussion from Switch Python's parsing tech to something more powerful than LL(1):

Just an aside that may be interesting to folks, I read about something called “Lossless Syntax Trees” on the Oil project blog last year:

[A Lossless Syntax Tree is] a representation for source code that can be converted back to the original text . In Oil, I implement it with a combination of:

An ordered ASDL tree structure, and

An array of line spans . Concatenating these spans gives you back the original source file.

It’s not an AST because it preserves details of the syntax that are irrelevant to interpretation and compilation.

I don’t know much about parsing. Is this idea relevant to those who are looking to improve Python’s syntax error messages?

Here are some related issues on bugs.python.org, by the way:

Provide more helpful error message when await is called inside non-async method
Show “expected” token on syntax error
‘SyntaxError: invalid token’ is unfriendly
Provide a supported Concrete Syntax Tree implementation in the standard library (this perhaps encompasses the above idea already?)

apalala · April 25, 2019, 1:24pm

That is what TatSu does, and it works! It requires a record (tuple) per input source postion:

    def line_info(self, pos=None):
        if pos is None:
            pos = self._pos
       
        # -2 to skip over sentinel
        pos = min(pos, len(self._line_cache) - 2)
        start, line, length = self._line_cache[pos]
        end = start + length
        col = pos - start

        text = self.text[start:end]
        n = min(len(self._line_index) - 1, line)
        # note: this can be omitted if there's no support for #include
        filename, line = self._line_index[n]

        return LineInfo(filename, line, col, start, end, text)

The AST needs only record the start and end positions in the source for the source it represents.