The cpython `Tools/peg_generator/pegen` directory has nice tools with which to generate either a C or Python language parser based on a PEG-like grammar.
However, I have to have a clone of python/cpython in order to use it. And then, I have to know to run the `build.py` module as a standalone script, and which command-line arguments are needed.
Let’s say I am a regular Python user and I want to write a function which will parse some text in a particular format, and I can describe the format in a PEG grammar file. I want to be able to parse the text, given the grammar file, and get whatever the grammar specifies as the result.
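For concreteness, a toy grammar for summing integers might look something like this (pegen-style syntax; the action and naming details here are illustrative, not taken from any real grammar file):

```
start: e=expr NEWLINE* ENDMARKER { e }
expr:
    | e=expr '+' t=term { e + t }
    | term
term: n=NUMBER { int(n.string) }
```

Given that file, parsing `"1 + 2 + 3"` would produce `6`.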
Suggested.
I suggest providing something like the following:
- A module `pegparse` in the Standard Library, shown in the documentation under "Text Processing Services."
- A function `pegparse.parse(text, grammar, **options)`, a class `pegparse.Parser(grammar, **options)`, and a method `Parser.parse(text)`.
- A method `Parser.gen_python()` which returns a complete Python file. This file has a function `parse(text)` which will parse the given text and produce the same result as `Parser.parse(text)`. An optional filename argument would also write the result to the given file.
- A similar method `Parser.gen_c()` which returns a complete C file. I'm not sure what should be in this file. There should be some way to incorporate it into a C extension module so that I can call this extension module to parse some given text. This might be difficult to implement and is perhaps not very useful, since it provides a parser equivalent to `gen_python()`, except that it is faster.
- The documentation for `pegparse` has a complete description of the contents of a grammar file, including all the enhancements available in `cpython/Tools/peg_generator/pegen` as well as relevant meta-tags like `@subheader`. This might be in a separate document linked from the `pegparse` document. It could be similar to the 'Guide to the Parser' in the 'Python Developer's Guide'.
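Taken together, the proposed API might be used like this. This is purely a sketch: `pegparse` does not exist, and every name below is as proposed above, not a real API:

```python
import pegparse  # proposed module, not in the Standard Library

with open("sum.gram") as f:   # hypothetical grammar file
    grammar = f.read()

# One-shot convenience function.
result = pegparse.parse("1 + 2 + 3", grammar)

# Reusable parser object for parsing many inputs.
parser = pegparse.Parser(grammar)
result = parser.parse("1 + 2 + 3")

# Emit a standalone Python parser module with its own parse(text) function.
source = parser.gen_python(filename="sum_parser.py")
```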
By the way, a single character which is not otherwise part of a recognized token will be parsed as an ERRORTOKEN token. It is valid in the grammar file. As an example, the `metagrammar.gram` file has the string `"?"` as part of its `atom` rule.
Other suggestions.
- Custom tokenizer. The parsers in `pegen/` are based on the input text being divided into tokens as required for a Python file. These are basically names, strings, numbers, operators, comments, whitespace, and unrecognized single characters. There is also an optional encoding marker at the beginning. The Standard Library `tokenize` module is used to do this tokenization. The user might customize this in various ways:
  - Provide a filter for the output of `tokenize.tokenize()`. This would take a `TokenInfo` argument and decide whether the user program wants this token or not. For instance, ignoring the indented line structure could filter out INDENT and DEDENT tokens, and ignoring all whitespace could also filter out NEWLINE, NL, and COMMENT tokens.
  - Provide a replacement for the `Grammar/Tokens` file, in order to redefine the operator token strings (entries such as `PLUSEQUAL '+='`). This would generate a new version of the library `token` module, and a new version of the library `tokenize` module which uses the new `token`.
  - Provide a replacement for the library `tokenize` module.
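The filter idea can already be sketched with the real `tokenize` module today; the function name and the particular set of ignored token types below are just illustrative:

```python
import io
import tokenize

# Token types to discard when the grammar does not care about comments
# or blank lines (these names come from the real tokenize module).
IGNORED = {tokenize.COMMENT, tokenize.NL}

def filtered_tokens(source, ignored=IGNORED):
    """Yield TokenInfo objects for `source`, skipping unwanted token types."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in ignored:
            yield tok

# The comment token disappears from the stream:
toks = list(filtered_tokens("x = 1  # a comment\n"))
print([tokenize.tok_name[t.type] for t in toks])
# ['NAME', 'OP', 'NUMBER', 'NEWLINE', 'ENDMARKER']
```

A filter hook like this is the least invasive of the three options, since the `token` and `tokenize` modules are left untouched.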
- Dependencies. The `pegen` code is used to produce either a Python or a C file, which in turn is saved in the repo and is then available in the future to perform parsing. These files will have references to definitions found externally, and so the generated files need to either `#include` or `import` other files. The parser generators use meta definitions in the grammar file to modify the code in the generated file. They also generate some other information. In the `pegparse` module, there might be a different way of customizing the generated output, such as arguments to the `gen_c` and `gen_python` methods.
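As an illustration of the current mechanism, pegen grammar files can carry meta-directives whose text is copied into the generated file. Something like the following (the directive name is from pegen; the imported module is hypothetical):

```
@subheader "from my_helpers import make_node  # hypothetical support module"
```

In `pegparse`, the same effect might instead be achieved with a keyword argument, e.g. `parser.gen_python(subheader=...)`, so that the grammar file stays free of output-specific details.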