The cpython `Tools/peg_generator/pegen` directory has nice tools for generating either a C or Python parser from a PEG-like grammar.
However, I have to have a clone of python/cpython in order to use it, and then I have to know to run the `build.py` module as a standalone script and which command-line arguments it needs.
Let’s say I am a regular Python user and I want to write a function which will parse some text in a particular format, and I can describe the format in a PEG grammar file. I want to be able to parse the text, given the grammar file, and get whatever the grammar specifies as the result.
Suggestion.
I suggest providing something like the following:
- A module `pegparse` in the Standard Library, shown in the documentation under "Text Processing Services."
- A function `pegparse.parse(text, grammar, **options)`, a class `pegparse.Parser(grammar, **options)`, and a method `Parser.parse(text)`.
- A method `Parser.gen_python()` which returns a complete Python file. This file has a function `parse(text)` which will parse the given text and produce the same result as `Parser.parse(text)`. An optional filename argument would also write the result to that file.
- A similar method `Parser.gen_c()` which returns a complete C file. I'm not sure what should be in this file. There should be some way to incorporate it into a C extension module so that I can call this extension module to parse some given text. This might be difficult to implement and is perhaps not very useful, since it provides a parser equivalent to `gen_python()`'s, only faster.
- The documentation for `pegparse` has a complete description of the contents of a grammar file, including all the enhancements available in cpython's `Tools/peg_generator/pegen` as well as relevant meta-tags like `@subheader`. This might be in a separate document linked from the `pegparse` document. It could be similar to the 'Guide to the Parser' in the Python Developer's Guide.
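To make the intended call shape concrete, here is a toy sketch of how the proposed API might feel. The `Parser` class below is a hypothetical stand-in, not pegen: it understands only rule names, quoted literals, and ordered choice, just enough to show `Parser(grammar).parse(text)` returning a parse result.

```python
import re

class Parser:
    """Toy stand-in for the proposed pegparse.Parser (hypothetical API).

    Supports only rule names, 'quoted' literals, and ordered choice --
    nothing like pegen's real feature set.
    """

    def __init__(self, grammar):
        self.rules = {}
        for line in grammar.strip().splitlines():
            name, _, body = line.partition(":")
            # Each alternative is a sequence of rule names or 'literals'.
            self.rules[name.strip()] = [
                re.findall(r"'[^']*'|\w+", alt) for alt in body.split("|")
            ]

    def parse(self, text):
        result = self._match("start", text, 0)
        if result is None or result[1] != len(text):
            raise SyntaxError("text does not match the grammar")
        return result[0]

    def _match(self, item, text, pos):
        if item.startswith("'"):                 # quoted literal
            lit = item[1:-1]
            return (lit, pos + len(lit)) if text.startswith(lit, pos) else None
        for alt in self.rules[item]:             # PEG ordered choice
            children, p = [], pos
            for sub in alt:
                r = self._match(sub, text, p)
                if r is None:
                    break
                children.append(r[0])
                p = r[1]
            else:
                return (item, children), p
        return None

grammar = """
start: expr
expr: term '+' expr | term
term: '1' | '2'
"""
tree = Parser(grammar).parse("1+2")   # nested (rule, children) tuples
```

The real module would of course parse a full pegen grammar and honor its actions and options; the point here is only the surface: construct once from a grammar, then call `parse()` as often as needed.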
By the way, a single character which is not otherwise part of a recognized token will be parsed as an ERRORTOKEN token, and such a character is valid in the grammar file. As an example, the `metagrammar.gram` file has the string `"?"` as part of its `atom` rule.
Other suggestions.
- Custom tokenizer. The parsers in `pegen/` are based on the input text being divided into tokens as required for a Python file: basically names, strings, numbers, operators, comments, whitespace, and unrecognized single characters, plus an optional encoding marker at the beginning. The Standard Library `tokenize` module is used to do this tokenization. The user might customize this in various ways:
  - Provide a filter for the output of `tokenize.tokenize()`. This would take a `TokenInfo` argument and decide whether the user program wants that token or not. For instance, ignoring the indented line structure could filter out INDENT and DEDENT tokens, and ignoring all whitespace could also filter out NEWLINE, NL, and COMMENT tokens.
  - Provide a replacement for the `Grammar/Tokens` file, in order to redefine the operator token strings (entries such as `PLUSEQUAL '+='`). This would generate a new version of the library `token` module, and a new version of the library `tokenize` module which uses the new `token`.
  - Provide a replacement for the library `tokenize` module.
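The filtering option can be sketched today with a small predicate over the `tokenize` stream. The `filtered_tokens` function, the `keep` callback, and the `IGNORED` set below are my own names, not an existing API:

```python
import io
import token
import tokenize

# Token types a whitespace-insensitive grammar would likely want to drop.
IGNORED = {token.INDENT, token.DEDENT, token.NEWLINE, token.NL, token.COMMENT}

def filtered_tokens(source, keep=lambda tok: tok.type not in IGNORED):
    """Yield tokenize tokens, skipping those the predicate rejects."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if keep(tok):
            yield tok

source = "x = 1  # set x\nif x:\n    y = 2\n"
names = [token.tok_name[t.type] for t in filtered_tokens(source)]
print(names)  # only NAME/OP/NUMBER/ENDMARKER survive the filter
```

A `pegparse` option such as a `token_filter=` argument could accept exactly this kind of callable.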
- Dependencies. The `pegen` code is used to produce either a Python or a C file, which in turn is saved in the repo and then available in the future to perform parsing. These files have references to definitions found externally, so the generated files need to either `#include` or `import` other files. The parser generators use meta definitions in the grammar file to modify the code in the generated file, and they also generate some other information. In the `pegparse` module, there might be a different way of customizing the generated output, such as arguments to the `gen_c` and `gen_python` methods.
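For illustration, the existing pegen grammar syntax already offers one answer to the dependency question: the `@subheader` meta-directive injects text, typically imports, into the generated Python file just after the standard header. A sketch (the helper module and function names are invented):

```
@subheader """
from my_ast_helpers import make_node   # invented helper module
"""

start: e=expr NEWLINE* ENDMARKER { make_node(e) }
expr: a=NAME '+' b=NAME { (a, b) }
```

Whether `pegparse` keeps this in-grammar mechanism, moves it to method arguments, or supports both is exactly the design choice raised above.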