Just an aside that may be interesting to folks: I read about something called “Lossless Syntax Trees” on the Oil project blog last year:
[A Lossless Syntax Tree is] a representation for source code that can be converted back to the original text. In Oil, I implement it with a combination of:
That is what TatSu does, and it works! It requires one record (a tuple) per position in the input source:
    def line_info(self, pos=None):
        if pos is None:
            pos = self._pos

        # -2 to skip over sentinel
        pos = min(pos, len(self._line_cache) - 2)
        start, line, length = self._line_cache[pos]
        end = start + length
        col = pos - start

        text = self.text[start:end]
        n = min(len(self._line_index) - 1, line)
        # note: this can be omitted if there's no support for #include
        filename, line = self._line_index[n]

        return LineInfo(filename, line, col, start, end, text)
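For anyone wondering what a per-position cache like `_line_cache` could look like, here is a minimal sketch. It is not TatSu's actual code: `PosLine` and `build_line_cache` are hypothetical names, and I'm assuming a single sentinel entry at end of input and no `#include` handling.

```python
from typing import List, NamedTuple


class PosLine(NamedTuple):
    start: int   # offset of the first character of the line
    line: int    # zero-based line number
    length: int  # line length, including its line terminator if present


def build_line_cache(text: str) -> List[PosLine]:
    """Return one PosLine per character offset, plus one sentinel for EOF."""
    cache: List[PosLine] = []
    lines = text.splitlines(keepends=True)
    start = 0
    for n, line_text in enumerate(lines):
        record = PosLine(start, n, len(line_text))
        # every offset inside this line shares the same record
        cache.extend([record] * len(line_text))
        start += len(line_text)
    # sentinel (an assumption of this sketch) so pos == len(text) is valid
    cache.append(PosLine(start, len(lines), 0))
    return cache


cache = build_line_cache("ab\ncd\n")
start, line, length = cache[4]   # offset 4 is the "d" on the second line
col = 4 - start
```

The cost is one tuple reference per input character, which buys a constant-time lookup from any offset to its line.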
Each AST node then need only record the start and end positions of the source text it represents.
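To make that concrete, here is a rough sketch of a node that stores only offsets and recovers its text by slicing the original input. `Node`, `source`, and `line_col` are made-up names for illustration, not anything from TatSu or Oil, and the offsets in the example are written by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    kind: str
    start: int  # offset of the first character covered by the node
    end: int    # offset one past the last character covered
    children: List["Node"] = field(default_factory=list)

    def source(self, text: str) -> str:
        """Recover the exact original text for this node."""
        return text[self.start:self.end]


def line_col(text: str, pos: int) -> Tuple[int, int]:
    """Return (1-based line, 0-based column) for an offset, computed on demand."""
    line = text.count("\n", 0, pos) + 1
    col = pos - (text.rfind("\n", 0, pos) + 1)
    return line, col


text = "x = 1\ny = x + 2\n"
name = Node("name", 6, 7)
add = Node("add", 10, 15)
assign = Node("assign", 6, 15, [name, add])

print(assign.source(text))       # y = x + 2
print(add.source(text))          # x + 2
print(line_col(text, add.start)) # (2, 4)
```

Since the nodes hold no text of their own, whitespace and comments between them are not lost: they are still in the original input and can be sliced out from the gaps between sibling offsets.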