The cpython `Tools/peg_generator/pegen` directory has nice tools for generating either a C or Python parser from a PEG-like grammar.
However, I have to have a clone of python/cpython in order to use it, and then I have to know to run the `build.py` module as a standalone script and which command-line arguments it needs.
Let’s say I am a regular Python user and I want to write a function which will parse some text in a particular format, and I can describe the format in a PEG grammar file. I want to be able to parse the text, given the grammar file, and get whatever the grammar specifies as the result.
Suggestion.
I suggest providing something like the following:
- A module `pegparse` in the Standard Library, shown in the documentation under "Text Processing Services."
- A function `pegparse.parse(text, grammar, **options)`, a class `pegparse.Parser(grammar, **options)`, and a method `Parser.parse(text)`.
- A method `Parser.gen_python()` which returns a complete Python file. This file has a function `parse(text)` which will parse the given text and produce the same result as `Parser.parse(text)`. An optional filename argument would also write the result to that file.
- A similar method `Parser.gen_c()` which returns a complete C file. I'm not sure what should be in this file. There should be some way to incorporate it into a C extension module so that I can call this extension module to parse some given text. This might be difficult to implement and is perhaps not very useful, since it provides a parser equivalent to `gen_python()`'s, only faster.
- The documentation for `pegparse` has a complete description of the contents of a grammar file, including all the enhancements available in cpython's `Tools/peg_generator/pegen` as well as relevant meta-tags like `@subheader`. This might be in a separate document linked from the `pegparse` document. It could be similar to the 'Guide to the Parser' in the Python Developer's Guide.
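To make the intended call shape concrete, here is a toy sketch of how the proposed API might feel. The `Parser` class below is a hypothetical stand-in, not pegen: it understands only rule names, quoted literals, and ordered choice, just enough to show `Parser(grammar).parse(text)` returning a parse result.

```python
import re

class Parser:
    """Toy stand-in for the proposed pegparse.Parser (hypothetical API).

    Supports only rule names, 'quoted' literals, and ordered choice --
    nothing like pegen's real feature set.
    """

    def __init__(self, grammar):
        self.rules = {}
        for line in grammar.strip().splitlines():
            name, _, body = line.partition(":")
            # Each alternative is a sequence of rule names or 'literals'.
            self.rules[name.strip()] = [
                re.findall(r"'[^']*'|\w+", alt) for alt in body.split("|")
            ]

    def parse(self, text):
        result = self._match("start", text, 0)
        if result is None or result[1] != len(text):
            raise SyntaxError("text does not match the grammar")
        return result[0]

    def _match(self, item, text, pos):
        if item.startswith("'"):                 # quoted literal
            lit = item[1:-1]
            return (lit, pos + len(lit)) if text.startswith(lit, pos) else None
        for alt in self.rules[item]:             # PEG ordered choice
            children, p = [], pos
            for sub in alt:
                r = self._match(sub, text, p)
                if r is None:
                    break
                children.append(r[0])
                p = r[1]
            else:
                return (item, children), p
        return None

grammar = """
start: expr
expr: term '+' expr | term
term: '1' | '2'
"""
tree = Parser(grammar).parse("1+2")   # nested (rule, children) tuples
```

The real module would of course parse a full pegen grammar and honor its actions and options; the point here is only the surface: construct once from a grammar, then call `parse()` as often as needed.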
By the way, a single character which is not otherwise part of a recognized token will be parsed as an ERRORTOKEN token, and such a character is valid in the grammar file. As an example, the `metagrammar.gram` file has the string `"?"` as part of its `atom` rule.
Other suggestions.
- Custom tokenizer. The parsers in `pegen/` are based on the input text being divided into tokens as required for a Python file: basically names, strings, numbers, operators, comments, whitespace, and unrecognized single characters, plus an optional encoding marker at the beginning. The Standard Library `tokenize` module is used to do this tokenization. The user might customize this in various ways:
  - Provide a filter for the output of `tokenize.tokenize()`. This would take a `TokenInfo` argument and decide whether the user program wants that token or not. For instance, ignoring the indented line structure could filter out INDENT and DEDENT tokens, and ignoring all whitespace could also filter out NEWLINE, NL, and COMMENT tokens.
  - Provide a replacement for the `Grammar/Tokens` file, in order to redefine the operator token strings (entries such as `PLUSEQUAL '+='`). This would generate a new version of the library `token` module, and a new version of the library `tokenize` module which uses the new `token`.
  - Provide a replacement for the library `tokenize` module.
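The filtering option can be sketched today with a small predicate over the `tokenize` stream. The `filtered_tokens` function, the `keep` callback, and the `IGNORED` set below are my own names, not an existing API:

```python
import io
import token
import tokenize

# Token types a whitespace-insensitive grammar would likely want to drop.
IGNORED = {token.INDENT, token.DEDENT, token.NEWLINE, token.NL, token.COMMENT}

def filtered_tokens(source, keep=lambda tok: tok.type not in IGNORED):
    """Yield tokenize tokens, skipping those the predicate rejects."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if keep(tok):
            yield tok

source = "x = 1  # set x\nif x:\n    y = 2\n"
names = [token.tok_name[t.type] for t in filtered_tokens(source)]
print(names)  # only NAME/OP/NUMBER/ENDMARKER survive the filter
```

A `pegparse` option such as a `token_filter=` argument could accept exactly this kind of callable.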
- Dependencies. The `pegen` code is used to produce either a Python or a C file, which in turn is saved in the repo and then available in the future to perform parsing. These files have references to definitions found externally, so the generated files need to either `#include` or `import` other files. The parser generators use meta definitions in the grammar file to modify the code in the generated file, and they also generate some other information. In the `pegparse` module, there might be a different way of customizing the generated output, such as arguments to the `gen_c` and `gen_python` methods.
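For illustration, the existing pegen grammar syntax already offers one answer to the dependency question: the `@subheader` meta-directive injects text, typically imports, into the generated Python file just after the standard header. A sketch (the helper module and function names are invented):

```
@subheader """
from my_ast_helpers import make_node   # invented helper module
"""

start: e=expr NEWLINE* ENDMARKER { make_node(e) }
expr: a=NAME '+' b=NAME { (a, b) }
```

Whether `pegparse` keeps this in-grammar mechanism, moves it to method arguments, or supports both is exactly the design choice raised above.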