Improving Python Language Reference (PLR) documentation

I decided that I’ll dedicate a bit of time each Wednesday to push this forward.
Last week I was on vacation, but I’ll be back at it tomorrow. There’ll be another stream on YouTube (click for the time in your timezone), and I’ll also stream to the Docs Discord (if you join, your voice will be on the public recording).

The plan is to start consolidating Python’s actual grammar and the docs.

5 Likes

Generating just diagrams from the ground truth wouldn’t make sense if the text of the docs doesn’t match. So I’ll need to generate the text too.

I spent most of yesterday’s session learning Sphinx/docutils. I can now write a bare-bones extension that generates production lists (the grammar snippets).
The next step will be generating them from the data :‍)
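For the curious, a bare-bones directive of that kind can be sketched in a few lines. This is only an illustration of the Sphinx/docutils machinery involved – the directive name, module path and behaviour here are made up, not the actual extension:

```python
# docs/tools/extensions/grammar_snippet.py -- hypothetical name/location.
# A minimal Sphinx directive that renders its body as a literal block,
# standing in for a real production-list generator.
from docutils import nodes
from docutils.parsers.rst import Directive


class GrammarSnippet(Directive):
    """Render the directive body verbatim; a real version would parse
    grammar data and emit linked nonterminals instead."""
    has_content = True

    def run(self):
        text = "\n".join(self.content)
        node = nodes.literal_block(text, text)
        node["language"] = "none"   # no Pygments highlighting for now
        return [node]


def setup(app):
    app.add_directive("grammar-snippet", GrammarSnippet)
    return {"version": "0.1", "parallel_read_safe": True}
```

Once registered via setup(), it can be used from reST as `.. grammar-snippet::` with an indented body.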

It looks like rule names in the grammar file are local implementation details; technically, they could be changed to the ones the docs use (or vice versa).

1 Like

Cool. Is there a reason the docs use ::= instead of =, like in Python.gram? The rest of the notation seems very similar.

Also, beware that Python.gram sometimes uses lookaheads (e.g. `&'return'`). Usually those are just optimizations, but a few of them are required to disambiguate things. This is an unfortunate side effect of using PEG instead of a context-free (LR(k)) grammar.

The ::= vs. : is a Sphinx thing. It has a dedicated directive type for these: it linkifies nonterminals but leaves special characters alone.
The source .rst file uses colons :)

There are lookaheads and negative lookaheads, but also:

  • Cuts (~): are these optimizations, or necessary for correctness?
  • Forced tokens (&&): these look unimportant since the && is removed for the docs. But I haven’t seen a description of them (and didn’t delve into the code). Should they be mentioned in Python.gram’s comment, or is that comment mainly for the docs?

Anyway, I’m thinking I’ll hide lookaheads &c. in the initial implementation. If someone wants the precise grammar they really should look at the whole file, not piece it together from examples scattered around in prose. And the diagramming library will need tweaks to support lookaheads.
(Of course, if I do this the docs should say the snippets are approximations.)

Okay, let Sphinx be Sphinx. (Also I misremembered what python.gram uses – it uses :, not = – sorry.)

The main problem with PEG is that with a true context-free grammar, if a particular input can match two different rules, that’s an ambiguity and this is generally considered a bug in the grammar (though most pragmatic tools also have out-of-band ways to disambiguate common cases). In PEG, however, there is no ambiguity: PEG by definition says that the first rule that matches is what the PEG grammar defines. (There are some subtleties around how “first match” is interpreted too.)
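To make the “first match wins” behaviour concrete, here’s a toy illustration (it has nothing to do with CPython’s actual parser): with ordered choice, swapping two alternatives changes what a rule matches, whereas in a CFG the same pair of alternatives would simply make the grammar ambiguous.

```python
def peg_choice(text, alternatives):
    """Toy ordered choice: return the first alternative that matches a prefix
    of *text*; later alternatives are never even considered."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None

# The shorter alternative "a" shadows "ab": the longer match is never tried.
assert peg_choice("ab", ["a", "ab"]) == "a"
# Reordering the alternatives changes the language the rule accepts.
assert peg_choice("ab", ["ab", "a"]) == "ab"
```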

I just looked over all 6 occurrences of “cut” (~) in python.gram, and none of them are involved in disambiguation – all of them occur in places where there is no other grammar rule (in current Python) that could match. Four of them lock in for ... in after the in keyword, two lock in alternate assignment operators – augmented assignment (+= etc.) and the walrus (:=).
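As a rough mental model of what a cut does (a toy sketch, not the real parser): before the ~, a failed match just means “this alternative doesn’t apply, try the next one”; after the ~, the parser is committed, so a failure is reported as a syntax error instead of backtracking.

```python
class ToySyntaxError(SyntaxError):
    """Raised when a match fails after a cut: no backtracking happens."""

def parse_for(tokens):
    """Toy version of  'for' NAME 'in' ~ NAME  -> (target, iterable) or None."""
    if tokens[:1] != ["for"]:
        return None                     # not this alternative; try the others
    if len(tokens) < 3 or tokens[2] != "in":
        return None                     # still before the cut: just no match
    # --- the cut (~) sits here: we are now committed to this alternative ---
    if len(tokens) < 4:
        raise ToySyntaxError("expected an expression after 'in'")
    return tokens[1], tokens[3]

assert parse_for(["for", "x", "in", "seq"]) == ("x", "seq")
assert parse_for(["while", "x"]) is None   # other alternatives still get a chance
# parse_for(["for", "x", "in"]) raises ToySyntaxError instead of backtracking.
```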

Regarding &&, it was introduced here as an optimization and can definitely be ignored.

I think I remember there are a few positive or negative lookaheads that meaningfully affect disambiguation. Maybe Pablo or Lysandros can help out here. Agreed that this isn’t very important for a first cut.

1 Like

Thanks Guido! Will keep that in mind.


The next stream will be 2023-11-15T13:00:00Z on YouTube and Discord.

1 Like

I have a conflicting event today, so no stream, but the current stage of the project – adding directives to ReST files – isn’t that exciting anyway.

Example of current state (top is the existing hand-written grammar, bottom is taken from python.gram*):

[screenshot comparing the two grammar listings]
It then needs formatting (e.g. [statements] rather than statements?), cross-linking to rules defined elsewhere, reorganization, and simplification. But it should make a good PR on its own, leaving diagrams as the next step.

*⁾ I’ve also tried renaming the file rule to file_input, which the current docs use – mostly to see what a change like that would break. Nothing broke, tests still pass.

1 Like

Things came up; I’ll start today’s stream an hour later than usual: 2023-11-29T16:00:00Z on Discord & YouTube

3 Likes

Rather than plan around various end-of-year gatherings, I’m putting this on hold until January. See you in the new year :‍)

Meanwhile, here’s some current thinking:

If grammar docs are generated, then any grammar change will need a docs review to ensure the snippets still match the surrounding prose.
How to ensure that?
So far it looks like the best solution would be to put generated snippets directly in the .rst files, à la Argument Clinic.

I don’t like files that mix hand-written and auto-generated content. It makes things messy, confuses tools that want to ignore autogenerated content or track changed files, often needs ugly “start/end generated section” markers, etc…
But, here it would make changes in the grammar docs show up in review diffs, with surrounding prose as context. That is, IMO, worth the downsides.
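For a concrete picture of what that could look like (the marker comments, file name and helper below are all invented for illustration – the actual branch may do this differently): a small script owns only the region between two markers, so regenerating the snippet produces an ordinary diff in the .rst file, right next to the prose that reviewers need to re-check.

```python
import re
from pathlib import Path

# Hypothetical reST comment markers; the real convention may differ.
START = ".. begin generated grammar: {name}"
END = ".. end generated grammar"

def replace_generated(rst_text: str, name: str, new_snippet: str) -> str:
    """Replace only the region between the start/end markers for *name*."""
    pattern = re.compile(
        re.escape(START.format(name=name)) + r"\n.*?" + re.escape(END),
        flags=re.DOTALL,
    )
    replacement = (
        START.format(name=name) + "\n\n" + new_snippet.rstrip() + "\n\n" + END
    )
    # A callable replacement keeps backslashes in the snippet literal.
    return pattern.sub(lambda _m: replacement, rst_text, count=1)

# Regenerate in place, so a grammar change shows up as a normal docs diff:
path = Path("Doc/reference/compound_stmts.rst")        # example file
snippet = "   for_stmt: 'for' star_targets 'in' ..."   # whatever the generator emits
path.write_text(replace_generated(path.read_text(), "for_stmt", snippet))
```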

4 Likes

PyCon 2024 update

  • @encukou has been doing the livestreams since January and I have enjoyed every single one (even the ones that are painfully early when I’m in California)
  • The Diagram Generator at bottlecaps seems unavailable today
  • There is a generator with a python-specific option at DrawGrammar (YMMV)

If you’re less interested in using the Python grammar spec and instead looking to render beautiful diagrams using pure Python, I found Syntrax — Syntrax 1.0 documentation, which uses a simple CLI for describing the diagrams.

2 Likes

If you are still interested, I would appreciate it, because much of the work that Petr does in Improving Python Language Reference (PLR) documentation - #22 by encukou is just beyond my comfort zone, so it is hard for me to take over leadership if something happens to him.

I forgot to mention, the work in progress is in GitHub - encukou/cpython at grammar-in-docs and it is a lot of work. Today felt like a master class in tokenization, along with some very clever dev tricks.

3 Likes

The rabbit hole runs deep, but we’re slowly progressing.

As of now, grammar listings in the main branch docs no longer use Sphinx’s productionlist directive, but a custom one. Actually there are two: a backwards-compatible one, which took over the productionlist directive name, and a new one that allows more flexible formatting: grammar-snippet (currently demoed in two very simple cases).

The visible changes are rather small:

  • string literals are syntax-highlighted
  • rule names use the : symbol rather than Sphinx’s ::=

See Assignment statements: before, after.

The new directive could make it upstream into Sphinx; the best way to do that is to put it through its paces in CPython first.

The next step is to go through the docs and adjust the listings and surrounding prose to better match what’s actually in the grammar file. If we can get close enough to something that can be auto-generated, we can then add a generator – and generate diagrams as well.

6 Likes

Cool!

How close is the (displayed) rule syntax now to the syntax used in the PEG grammar file? (Not that I recommend copying the rules from there; they are often more complicated than users need, for various reasons.)

1 Like

The new directive itself stays very close to its input – it links text in backticks as tokens (and discards the backticks); otherwise it only acts as a syntax highlighter.

To render the grammar reference correctly it will need support for comments.
(For the actual grammar file it would need to support annotations and actions. I don’t think we ever want to show those in docs.)

We’re free to make the displayed syntax as close to the grammar as we want. I think we want the alternative form for top-level alts, where | prefixes each alternative rather than separating them; we’ll also want Gather, and lookaheads for the cases where they’re not just optimizations.


We could theoretically even introduce new grammar syntax, if it would make things clearer for the reader. For example, there’s a bunch of rules of the form x ["," y] | y, like kwargs, where the duplicated y looks unnecessarily complicated when converted to a diagram:

[railroad diagram where y appears twice]

and would be more readable like this:

[railroad diagram where y appears only once]
The text grammar works around this by always splitting the duplicated part out into a named rule, which works rather well for text. But, like Gather making s.e+ short for (e (s e)*), perhaps there’s also some syntax for this that could simplify the text enough (to be worth the high cost of new syntax).
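For readers who haven’t run into Gather before, the expansion is purely mechanical – a throwaway helper (illustration only) spells it out:

```python
def expand_gather(sep: str, elem: str) -> str:
    """Expand the PEG shorthand  sep.elem+  into its plain equivalent."""
    return f"{elem} ({sep} {elem})*"

# ','.expression+  means "one or more expressions separated by commas":
print(expand_gather("','", "expression"))   # expression (',' expression)*
```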

2 Likes

Going in order of the language reference – bottom up – means there’s a bit of a detour first, with parts that aren’t strictly Grammar (although with f- and t-strings, they’re more intertwined than ever): tokens & tokenization.
As of now, proper docs for tokens are in (compare main to 3.12); the automation that generated the docs (with empty descriptions) has been updated to check that each token is documented. Docs for the trivial symbol tokens are still autogenerated.
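The check itself doesn’t need much: the stdlib token module already knows every token name, so it’s mostly a matter of confirming each name shows up in the reST source. A rough sketch (the file path and the exact rules are assumptions; the real script in the repo does more):

```python
import sys
import token
from pathlib import Path

def undocumented_tokens(rst_path="Doc/reference/lexical_analysis.rst"):
    """Return token names that never appear in the given docs file.

    Illustration only: the real tooling also generates stub entries
    for the trivial symbol tokens.
    """
    text = Path(rst_path).read_text(encoding="utf-8")
    skip = {"N_TOKENS", "NT_OFFSET"}   # bookkeeping values, not real tokens
    return sorted(name for name in token.tok_name.values()
                  if name not in skip and name not in text)

if __name__ == "__main__":
    missing = undocumented_tokens()
    if missing:
        sys.exit(f"Undocumented tokens: {', '.join(missing)}")
```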

Now that we can link to tokens with a clean conscience, it’s time to start on the lexical analysis docs. I think I found a style that works: each section will start with the most obvious things, expand toward a complete & precise description, with examples along the way, and end with the formal grammar. (Currently the grammar is usually listed first, and the author of the prose is tempted to assume that the reader has parsed and understood it.)

A PR is up for NAME tokens: https://github.com/python/cpython/pull/131474
We’re skipping STRING for now, and working on NUMBER (see the WIP branch).