How are the parts of the grammar definition of a suite associated?

In the documentation, the definition of suite says

suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT

Should I be reading this as

suite         ::=  stmt_list NEWLINE | NEWLINE (INDENT statement)+ DEDENT

?

In the definition of statement

statement     ::=  stmt_list NEWLINE | compound_stmt

the statements come with their NEWLINE at the end, but without (inherent) indentation.


I am not sure, because I see explicit parentheses in other places. So, I don’t know if here they are missing on purpose, and I am not understanding something else.

No. INDENT and DEDENT only mark changes in indentation; they are not emitted for each indented line.

I see. I should understand the INDENT and DEDENT as if modifying a global state (not affected by a NEWLINE).

I am still a bit confused with the line

The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using a stack, as follows.

from here, which I am understanding as the tokens INDENT and DEDENT telling what to write in the stack mentioned there. Let me think more … I need to travel now …

Right. The tokenizer doesn’t simply turn sequences of spaces (or tabs) into an INDENT token; it keeps track of the current level of indentation, and generates INDENT and DEDENT tokens as needed. (After all, there isn’t any text for it to turn into the DEDENTs!)
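That quoted sentence can be turned into a few lines of code. Here is a minimal sketch of the stack algorithm the docs describe, not CPython's actual tokenizer; the function name `indent_tokens` and the `"LINE"` placeholder are made up for illustration:

```python
# A minimal sketch of the docs' stack algorithm, NOT CPython's actual
# tokenizer. "LINE" stands in for the tokens of one logical line.
def indent_tokens(widths):
    """widths: indentation (in columns) of each logical line."""
    stack = [0]                       # per the docs, the stack starts with 0
    out = []
    for w in widths:
        if w > stack[-1]:             # one INDENT per *increase* ...
            stack.append(w)
            out.append("INDENT")
        else:
            while w < stack[-1]:      # ... but possibly several DEDENTs
                stack.pop()
                out.append("DEDENT")
        out.append("LINE")
    while stack[-1] > 0:              # close open levels at end of input
        stack.pop()
        out.append("DEDENT")
    return out

# if "foo": / bar() / baz()  →  indentation widths 0, 4, 0
print(indent_tokens([0, 4, 0]))
# ['LINE', 'INDENT', 'LINE', 'DEDENT', 'LINE']
```

Note that a second line at the same level (widths `[0, 4, 4, 0]`) falls into neither branch, so it produces no extra INDENT, which is exactly the point above.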

In code like

if "foo":
    bar()
baz()

the tokenizer can’t produce a DEDENT until it sees baz (or the end of file). If there were another line inside the if block, it wouldn’t create another INDENT token, because it’s at the same level.


Well, I was trying to understand the grammar definition written in Backus–Naur form. I suppose I can ignore Python itself and what the tokenizer needs to do, can’t I?

I was understanding INDENT and DEDENT as symbols of the grammar that needed to be explicitly present in each line.

For example, following the definitions in the documentation, the grammar for an if_stmt can produce

if, "foo", :, NEWLINE, INDENT, bar(), NEWLINE, bas(), NEWLINE, DEDENT

vs what I thought it should be (and cannot produce)

if, "foo", :, NEWLINE, INDENT, bar(), NEWLINE, INDENT, bas(), NEWLINE, DEDENT

Above, the corresponding Python I am thinking about is (without the comment)

if "foo":
    bar()
    bas()  # This line intentionally part of the IF

This is where python3 -m tokenize can help. Paste in the code you’re looking at (or give it a file) and see the exact stream of tokens. Here’s what your three lines of code produce:

1,0-1,2:            NAME           'if'           
1,3-1,8:            STRING         '"foo"'        
1,8-1,9:            OP             ':'            
1,9-1,10:           NEWLINE        '\n'           
2,0-2,4:            INDENT         '    '         
2,4-2,7:            NAME           'bar'          
2,7-2,8:            OP             '('            
2,8-2,9:            OP             ')'            
2,9-2,10:           NEWLINE        '\n'           
3,4-3,7:            NAME           'bas'          
3,7-3,8:            OP             '('            
3,8-3,9:            OP             ')'            
3,11-3,51:          COMMENT        '# This line intentionally part of the IF'
3,51-3,52:          NEWLINE        '\n'           
4,0-4,0:            DEDENT         ''             
4,0-4,0:            ENDMARKER      ''             

I assume you used “bar()” as shorthand for the NAME OP OP sequence, since the exact line of code is irrelevant; I’ve kept the tokenizer’s output verbatim above for clarity.


Well, yes. You can’t produce it (at least in general) because knowing when to emit DEDENT (and how many) depends on keeping track of previous levels of indentation. Also because, when you look at the whitespace in front of a line, you need to know how many levels of indentation it represents - you cannot just divide the number of spaces by 4 or anything like that.

And if you are doing that anyway, then it’s trivial to make INDENT tokens only appear when the indentation level increases (since the language dictates that, while you can dedent multiple times in a row, you can only indent one new level on a new line of code). There’s no point in emitting the extra INDENTs for the current level of indentation because it only complicates later parsing.

In fact, the modified suite rule you described, along with the tokenization you had in mind, wouldn’t handle nested suites. But “the statements inside a suite are bracketed by INDENT and DEDENT” is trivial for a parser to handle.
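To see the nesting case concretely, here is a quick check with the standard-library tokenize module (`tokenize.generate_tokens` and `tokenize.tok_name` are stdlib; the snippet itself is just illustrative):

```python
# Tokenize a nested suite and look at the token type names.
# Leaving an inner block closes two levels at once, so the tokenizer
# must emit two DEDENT tokens back to back.
import io
import tokenize

src = 'if "a":\n    if "b":\n        x = 1\ny = 2\n'
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
# Two INDENTs on the way in, and two consecutive DEDENTs before 'y = 2'.
```

With one-INDENT-per-line tokenization, a parser could not tell where the inner suite ends and the outer one resumes; with bracketing INDENT/DEDENT pairs it falls out naturally.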


I had assumed that those BNF in the documentation would generate Python language directly.

In the answers, all of you have explained going from Python language to tokens, and I see the grammar indeed generating the language of those tokens. However, to go from tokens to Python, I need a stack to keep track of the INDENT/DEDENT pairs.

Is it fair to say that the grammar in the documentation defines the language of tokens, not Python itself?

Well, I think I have my answer in this sentence that had skipped reading

The descriptions of lexical analysis and syntax use a modified Backus–Naur form (BNF) grammar notation.

They describe the lexical analysis.

Yes, but that isn’t the modification that they mean when they talk about the notation being modified. Tokenizing language input first and then parsing the tokens is the normal process (as you’d learn from any introductory compiler theory course). It happens that “tokenizing Python” is slightly more involved than using some rules to split it up and then assigning a token type to each piece. But that’s completely independent of the parsing process, which is fundamentally conceptually the same for any language that takes the traditional approach: rules are used to group the tokens into more complex tokens, repeatedly, until they’re completely organized into an abstract syntax tree; then code is generated from that.

“Modified BNF” just means that they describe the rules a little differently from how the textbook says to do it. (And then they describe exactly how they do it.)
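If you want to see what the parsing stage produces from that token stream, the standard-library ast module shows the tree for the earlier three-line example (a quick illustration; `ast.dump`’s `indent` parameter needs Python 3.9+):

```python
# Parse the earlier example and dump the resulting abstract syntax tree.
import ast

src = 'if "foo":\n    bar()\n    bas()\n'
tree = ast.parse(src)
print(ast.dump(tree, indent=2))
# The module body holds a single If node whose body contains both calls,
# i.e. the INDENT/DEDENT bracketing put bas() inside the if, as intended.
```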

Well, I don’t know why you are focusing on the word “modified”. The part of the quote that gave me the answer (and that is why I extracted it) is “lexical analysis”. I understood that sentence as saying that they are defining the “lexically analyzed Python language”, not the “Python language”.

An extended answer would be that it cannot be otherwise, since the Python language is not context-free.

It is just unexpected to land on a random spot in the documentation of a language, find a formal definition, and have it not be a formal definition of that language, with the indication of that fact far away and not very explicit.

It is a formal definition of a slight generalization [1] of the syntax of Python. The Python language (in the sense of “all strings representing a runnable program without errors” [2]) is close to impossible to define formally (as is the case for most Turing-complete systems).

Instead, Python is best defined via multiple stages: first lexical analysis, then syntax parsing, then semantic analysis [3]. This grammar only covers the second step, as is standard in basically all real-world grammar definitions.


  1. It sometimes doesn’t forbid stuff it should, for example break outside of a loop ↩︎

  2. I assume that is what you mean at least. Otherwise you need to be more careful with your word choices ↩︎

  3. And then others not relevant right now like compilation and execution ↩︎
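Footnote 1 can be checked directly in CPython: `ast.parse` (the grammar stage) accepts `break` at module level, while `compile` (which also runs the semantic checks) rejects it. A quick sketch:

```python
# The grammar allows "break" wherever a statement is allowed; the
# "break outside loop" error comes from a later, semantic stage.
import ast

tree = ast.parse("break")            # parses fine: the grammar permits it
print(type(tree.body[0]).__name__)   # Break

try:
    compile("break", "<example>", "exec")
except SyntaxError as e:
    print("compile rejects it:", e.msg)  # e.g. 'break' outside loop
```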

No, by Python language, above, I do not mean Python that can be run, only Python that does not have syntax errors.