Make the PEG parser available to all Python users

The cpython Tools/peg_generator/pegen directory has nice tools for generating either a C or Python language parser from a PEG-like grammar.

However, I have to have a clone of python/cpython in order to use it. And then, I have to know to run the module as a standalone script, and which command line arguments are needed.

Let’s say I am a regular Python user and I want to write a function which will parse some text in a particular format, and I can describe the format in a PEG grammar file. I want to be able to parse the text, given the grammar file, and get whatever the grammar specifies as the result.


I suggest providing something like the following:

  • A module pegparse in the Standard Library, shown in the documentation under “Text Processing Services.”
  • A function pegparse.parse(text, grammar, **options), a class pegparse.Parser(grammar, **options), and a method Parser.parse(text).
  • A method Parser.gen_python() which returns a complete Python file. This file has a function parse(text) which will parse the given text and produce the same result as Parser.parse(text). An optional filename argument would also write the result to the given filename.
  • A similar method Parser.gen_c() which returns a complete ‘C’ file. I’m not sure what should be in this file. There should be some way to incorporate it into a C extension module so that I can call this extension module to parse some given text.
    This might be difficult to implement and is perhaps not very useful, since it provides a parser equivalent to gen_python()'s, only faster.
  • The documentation for pegparse has a complete description of the contents of a grammar file, including all the enhancements available in cpython/Tools/peg_generator/pegen as well as relevant meta-tags like @subheader. This might be in a separate document linked from the pegparse document. It could be similar to the ‘Guide to the Parser’ in the ‘Python Developer’s Guide’.
    By the way, a single character which is not otherwise part of a recognized token will be parsed as an ERRORTOKEN token. It is valid in the grammar file. As an example, the metagrammar.gram file has the string "?" as part of its atom rule.
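As a rough sketch of how the proposal above might feel in practice — everything here is hypothetical, since no pegparse module exists; the names simply mirror the API suggested in the bullets:

```python
# Hypothetical usage of the proposed pegparse module. None of these
# names exist today; this only illustrates the suggested API shape.
import pegparse

with open("expr.gram") as f:
    grammar = f.read()

# One-shot convenience function.
result = pegparse.parse("1 + 2 * 3", grammar)

# Reusable parser object for parsing many texts with one grammar.
parser = pegparse.Parser(grammar)
result = parser.parse("1 + 2 * 3")

# Emit a standalone Python module whose parse(text) function produces
# the same result as parser.parse(text); optionally write it to a file.
source = parser.gen_python(filename="expr_parser.py")
```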

Other suggestions:

  • Custom Tokenizer.
    The parsers in pegen/ are based on the input text being divided into tokens as required for a Python file. These are basically names, strings, numbers, operators, comments, whitespace, and unrecognized single characters. There is also an optional encoding marker at the beginning.
    The Standard Library tokenize module is used to do this tokenization.
    The user might customize this in various ways:

    • Provide a filter for the output of tokenize.tokenize(). This filter would take a TokenInfo argument and decide whether the user program wants this token or not.
      For instance, ignoring the indented line structure could filter out INDENT and DEDENT tokens, or ignoring all whitespace could also filter out NEWLINE, NL, and COMMENT tokens.
    • Provide a replacement for the Grammar/Tokens file, in order to redefine the operator token strings (entries such as PLUSEQUAL '+='). This would generate a new version of the library token module, and a new version of the library tokenize module which uses the new token.
    • Provide a replacement for the library tokenize module.
  • Dependencies.
    The pegen code is used to produce either a Python or a C file, which in turn is saved in the repo and then available in the future to perform parsing. These files will have references to definitions found externally, and so the generated files need to either #include or import other files.
    The parser generators use meta definitions in the grammar file to modify the code in the generated file. They also generate some other information.
    In the pegparse module, there might be a different way of customizing the generated output, such as arguments to the gen_c and gen_python methods.
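The token-filter idea above can be shown with the standard library alone. This is a minimal sketch using tokenize.generate_tokens, dropping layout-only token types (an illustrative choice of which types to ignore, not anything pegen actually does):

```python
import io
import token
import tokenize

# Layout-only token types to drop when a grammar ignores indentation,
# line breaks, and comments. This set is an illustrative choice.
IGNORED = {token.INDENT, token.DEDENT, token.NEWLINE, token.NL, token.COMMENT}

def filtered_tokens(source):
    """Yield tokens from source text, skipping layout-only token types."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type not in IGNORED:
            yield tok

src = "x = 1  # a comment\nif x:\n    y = 2\n"
names = [token.tok_name[t.type] for t in filtered_tokens(src)]
# The comment and the INDENT/DEDENT structure are gone; only
# NAME/OP/NUMBER tokens (plus the final ENDMARKER) remain.
```

A parser generator could accept such a filter as an option, letting the same grammar machinery handle languages with a different notion of whitespace than Python's.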



I would love for there to be an excellent parser, but I don’t think it should be in the standard library. I think you really want the flexibility of leaving it as a third party library.

In the typical case, I’d agree with Neil. In this case, isn’t the maintenance cost already being paid by the relevant core devs who maintain the existing parser? I think all @mrolle45 is suggesting is to expose the parser which CPython already has. Certainly, that would involve writing a bit of wrapper code (and maybe moving the existing parser into a location in the code where it would be installed), but anyone else providing a third-party module would have to extract the existing parser to use it. Suddenly, you have two copies of the current internal parser and risk having them diverge.

Not from the perspective of maintaining a public API. Right now we can do whatever we want to that parser and not worry about breaking anyone. Exposing it means we are not beholden to our usual stdlib compatibility promises.

In addition, other Python stdlib implementations would have to at least copy and sync with CPython’s version if they wanted to include it.

@EpicWink I’m not sure what you’re saying here. Can you expand?

Are there serious suggestions that the current implementation or internal API is lacking in some way wrt its current use? I thought PEG parsers were a pretty well understood piece of technology.

Suppose that you’re right that the implementation and API won’t change. In that case, what’s wrong with forking it?

Other Python implementations which distribute their own standard library will need to keep in sync with (or reimplement) CPython’s standard library version of the proposed PEG parser library. I suppose this isn’t different from other standard library modules, but this could be avoided by instead providing the parser in a third party package.

Forking will all but guarantee that the two versions will diverge, in implementation and API or both.

It also requires approximately double the maintenance effort to maintain two forks compared to maintaining just one.

There are costs as well as benefits to forks.

That won’t happen if his hypothesis is correct.

They are, but they still have an API, and exposing the API now means there’s something to support with a backwards-compatibility policy.

Note that pegen is available on PyPI: pegen · PyPI
This version only generates Python code, since the C helpers are too tightly tied to CPython.


Thanks @encukou. Was unaware of that. It will likely suit my needs.