IndentationErrorError: inconsistent reporting of inconsistent use of tabs and spaces in indentation in exception messages

kknechtel · December 9, 2023, 7:34am

I’ve been going through Stack Overflow trying to close and re-route a bunch of old duplicate questions… I discovered that when code has tabbed indentation followed by spaced indentation, this only gets reported as a TabError when the code tries to use 8 spaces to match the tab.

All examples below are taken from the Python 3.8 REPL; I also tried 3.11 and the results are the same, except that TabErrors don’t show a ^ in the message and the IndentationError’s caret is one space further to the right. Notably, either way, the caret points to the end of the line, comment included, which doesn’t seem useful.

On to examples:

```
>>> def tabfirst_4():
...     pass # tab
...     pass # 4 spaces
  File "<stdin>", line 3
    pass # 4 spaces
                  ^
IndentationError: unindent does not match any outer indentation level
>>> 
>>> def tabfirst_8():
...     pass # tab
...         pass # 8 spaces
  File "<stdin>", line 3
    pass # 8 spaces
                  ^
TabError: inconsistent use of tabs and spaces in indentation
```

Fair enough; my understanding from the documentation is that Python 3 still considers a tab to be “equivalent to” 8 spaces (actually, space up to the next multiple-of-8 tab stop) in some sense:

Tabs are replaced (from left to right) by one to eight spaces such that the total number of characters up to and including the replacement is a multiple of eight (this is intended to be the same rule as used by Unix). The total number of spaces preceding the first non-blank character then determines the line’s indentation. Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation.

Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
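To make the quoted rule concrete, here is a minimal sketch of the tab-stop calculation. The function name `indent_width` is made up for illustration and is not anything from CPython; it only models the "replace tabs by one to eight spaces" sentence and ignores form feeds and backslash continuations.

```python
def indent_width(line, tabsize=8):
    """Width of the leading whitespace, with each tab advancing to the next tab stop."""
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            # Jump to the next multiple of tabsize (adds 1 to tabsize columns).
            col = (col // tabsize + 1) * tabsize
        else:
            break  # first non-blank character ends the indentation
    return col


print(indent_width("\tpass"))          # 8: a tab alone
print(indent_width("   \tpass"))       # 8: the tab absorbs the three spaces
print(indent_width("\t    pass"))      # 12: a tab, then four spaces
print(indent_width("        \tpass"))  # 16: eight spaces, then a tab
```

Under this calculation alone, a tab and eight spaces are indistinguishable, which is why the second paragraph of the documentation needs the separate TabError rule.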

But here I have multiple objections.

What is even the benefit of keeping around the first calculation (which is, as I recall, identical to how it was in 2.x)? The number 8 is not in any way special to the indentation system; even someone who chose to mix tabs and spaces “responsibly” would presumably use the same pattern of tabs and spaces, such that the “weight” of the tabs would be irrelevant.
Arguably, it causes harm:
```
>>> def mixed_8():
...             pass # a tab, followed by 8 spaces
...             pass # 8 spaces, followed by a tab
... 
>>> # no error!!!
```
This only works when the number of spaces is a multiple of 8, of course. It’s almost as if special treatment is being afforded to people who want an 8-space indent; they get to use tabs to represent those indents, and interchange them with 8-space blocks freely.^[1]
When spaces come first, accepting the above reasoning, one would expect a TabError for 8-space indent followed by tab-indent, but a base IndentationError: unexpected indent for 4-space indent followed by tab-indent. After all, in the latter case, a 4-space indent was followed by something equivalent to 8-space indent. Right?

But that doesn’t happen:
```
>>> def spacefirst_4():
...     pass # 4 spaces
...     pass # tab
  File "<stdin>", line 3
    pass # tab
             ^
TabError: inconsistent use of tabs and spaces in indentation
>>> def spacefirst_8():
...         pass # 8 spaces
...     pass # tab
  File "<stdin>", line 3
    pass # tab
             ^
TabError: inconsistent use of tabs and spaces in indentation
```
Strangely, now the error is consistent: the mixed indentation problem is detected first in both cases. So why the difference between the two orders? Why is mixed indentation detected first when the spaces come first, but not when the tab comes first?

Obviously, disallowing tabs will break outstanding code, and there are presumably codebases out there that have interchanged tabs with 8-space blocks that would also break, and don’t want the maintenance burden of fixing that terrible indentation style.

But surely the mixed-indentation check could at least come first consistently? That would preempt a ton of questions from beginners who wrote the equivalent of tabfirst_4, and see aligned text in their editor.^[2]

[1] Maybe it should have been PEP 4 instead?
[2] Maybe this isn’t happening, because it would become difficult to allow the mixed-tab-8-space code? Ugh…

Rosuav · December 9, 2023, 7:38am

Karl Knechtel:

Arguably, it causes harm:
>>> def mixed_8():
...             pass # a tab, followed by 8 spaces
...             pass # 8 spaces, followed by a tab
... 
>>> # no error!!!
This only works when the number of spaces is a multiple of 8, of course.

Far as I can tell, this is ONLY a quirk of the REPL. It won’t happen in actual script execution.

kknechtel · December 9, 2023, 7:44am

Perhaps your editor is converting tabs for you?

```
>>> compile(src, '', 'exec')
<code object <module> at 0x7fd074e6b030, file "", line 1>
>>> src
'def mixed_8():\n\t        pass\n        \tpass'
>>> with open('mix.py', 'w') as f: f.write(src)
... 
42
>>> import mix
>>>
```
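Since pasted REPL sessions tend to lose the tab/space distinction, the same cases can be reproduced with explicit `\t` escapes and `compile()`. This is only a sketch: the helper name `indentation_error` and the `<demo>` filename are made up, and the results depend on the interpreter version (the thread reports 3.8 and 3.11).

```python
def indentation_error(src):
    """Compile src; return the name of the indentation exception raised, or 'ok'."""
    try:
        compile(src, "<demo>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return type(exc).__name__
    return "ok"


# Each case mirrors one of the REPL examples in this thread.
cases = {
    "tabfirst_4": "def f():\n\tpass\n    pass\n",
    "tabfirst_8": "def f():\n\tpass\n        pass\n",
    "spacefirst_4": "def f():\n    pass\n\tpass\n",
    "spacefirst_8": "def f():\n        pass\n\tpass\n",
    "mixed_8": "def f():\n\t        pass\n        \tpass\n",
}

for name, src in cases.items():
    print(f"{name:>12}: {indentation_error(src)}")
```

On the versions discussed here, only `tabfirst_4` should report the plain IndentationError, and `mixed_8` should compile without complaint.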

Rosuav · December 9, 2023, 7:53am

Hmm, not sure. It definitely wasn’t converting, but I also deleted the file I tested that with, so now I can’t test for exactly what that was doing.

By the way, as a side note, eight spaces for a tab IS special. It’s a built-in definition that goes back a long way. This:

The number 8 is not in any way special to the indentation system

isn’t true. Tab stops truly are every eight columns, unless you do something differently. Of course, it’s better to see tabs and spaces as completely independent, which is the intention in Py3, but that’s why you aren’t seeing this phenomenon with four space indentations or any other.

guido · December 9, 2023, 4:31pm

IIRC when we decided to add a check for inconsistent indentation, the intention was that the parser would not be relying on the equivalence of a tab to any fixed number of spaces at all. The intended abstraction was that mixing of tabs and spaces was only valid if the indentation “levels” would be calculated unambiguously regardless of how many spaces are equivalent to a tab. If you’ve found a counter-example, that’s a bug, and should be fixed (but probably not in a bugfix release, since surely there are people who don’t know they are relying on this).

kknechtel · December 9, 2023, 11:37pm

I assume you understand this much already, but to be explicit: the issue here is, a tab wasn’t simply equivalent to a fixed number of spaces in the first place - because historically a tab could “swallow” up to 7 preceding spaces, following standard typographical conventions (i.e. the tab advances to the next tab stop, which are defined to appear every 8 columns). So it isn’t only a matter of having the same number of tabs and the same number of spaces, but the order matters.

Python 3 will mostly disallow “the same number of tabs and spaces, but in a different order”, but it allows “a tab, followed by eight spaces” to match “eight spaces, followed by a tab”. This is clearly wrong if we contemplate tabs that are “equivalent to” more than eight spaces (so, if I understood correctly, reportable as a bug); and it’s unexpected in that “eight spaces” do not match “a tab”. But nobody seems to notice this inconsistency - presumably because nobody who uses tabs for indentation wants to use multiple tabs, or a tab and spaces, for a single indent level.

I suspect that this can’t be trivially patched, but I haven’t looked at the implementation in depth (in particular, I don’t fully understand why tabfirst_4 doesn’t find the tab issue but spacefirst_4 does).

Python 3 also allows things like “one space, followed by a tab” to be used consistently for the first level of indentation - after all, the calculation is unambiguous! But IMO this can’t really be in the intended spirit of the rule - committing a source file like that to some company’s VCS sounds like a “job security” trick.

If we’re contemplating allowing pathologically indented code to break, my proposal is that within a given nested context, increasing levels of indentation may only add tabs up to whatever point, and then only add spaces after that; they must not swap back and forth, and must not mingle them, but only put the spaces after the tabs. This would, as far as I can imagine, require a completely new implementation - this is what I have in mind for the algorithm:

1. The indentation of a line should be forbidden to have a tab after the first space (if any). A TabError is raised if this is violated.
2. Internally, the indent stack should reckon each level as a (number of tabs, number of spaces) pair (as enabled by the first point).
3. Given that the top of the indent stack is (T, 0), the next indent is effectively unrestricted: it may be (T+t, s) for non-negative t and s (except t = s = 0, which of course is the same level).
4. Given that the top of the indent stack is (T, S) for S > 0, the next indent may only be (T, S+s) for positive s. An attempt to use more tabs will raise a TabError.
5. From one line to the next, if the number of tabs decreases, there must not be any spaces, or else TabError is raised (ideally, with a different message).
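A rough sketch of how those rules might look in code, operating only on the leading whitespace of each logical line. The names `split_indent` and `check_indentation` are hypothetical, the error messages are placeholders, and the dedent handling (pop until an exact match, else a plain IndentationError) is my reading of the proposal rather than something it spells out.

```python
def split_indent(prefix):
    """Return (tabs, spaces) for a whitespace prefix; tabs must all come first."""
    tabs = len(prefix) - len(prefix.lstrip("\t"))
    rest = prefix[tabs:]
    if "\t" in rest:
        raise TabError("tab after a space in indentation")
    return tabs, len(rest)


def check_indentation(prefixes):
    """Return the nesting depth of each line, or raise per the proposed rules."""
    stack = [(0, 0)]  # indent stack of (tabs, spaces) pairs
    depths = []
    for prefix in prefixes:
        level = split_indent(prefix)
        top = stack[-1]
        if level > top:  # indent (tuples compare tabs first, then spaces)
            if level[0] > top[0] and top[1] > 0:
                raise TabError("more tabs after a level that already uses spaces")
            stack.append(level)
        elif level < top:  # dedent
            if level[0] < top[0] and level[1] > 0:
                raise TabError("spaces used after dropping a tab")
            while stack[-1] > level:
                stack.pop()
            if stack[-1] != level:
                raise IndentationError(
                    "unindent does not match any outer indentation level")
        depths.append(len(stack) - 1)
    return depths


print(check_indentation(["", "\t", "\t  ", "\t", ""]))  # [0, 1, 2, 1, 0]
for bad in (["\t", "    "], ["    ", "\t"], ["\t" + " " * 8, " " * 8 + "\t"]):
    try:
        check_indentation(bad)
    except IndentationError as exc:
        print(type(exc).__name__, exc)
```

With this sketch, the tabfirst_4, spacefirst_4 and mixed_8 patterns from earlier in the thread all end in TabError, while a pure-spaces mismatch still gets the ordinary IndentationError.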

If we’re not contemplating that, my proposal is simply that any case that could be detected as a TabError, should be, and that it’s a bug to come up with the base IndentationError (like the “unindent does not match any outer indentation level” result in the tabfirst_4 example). In this case I don’t propose a specific way to fix it, because again I don’t have enough familiarity with the existing implementation.

Regardless, I propose that the stack trace for all IndentationErrors should not show the caret, because it doesn’t point anywhere useful. (It seems that Python 3.11 usually suppresses it, but not for the “unindent does not match any outer indentation level” case. Maybe that was already reported separately.)

guido · December 10, 2023, 12:57am

Yeah, I grew up when everybody knew that. I used “equivalent” because the full definition is too long.

Oh, that was very much the spirit of the rule! The rule isn’t about making people do what’s right. It’s about avoiding code that looks different in some person’s editor than how the interpreter sees it. E.g. if one line uses 6 spaces and another uses one tab, which one is indented more? It depends on tab size, and that can change the meaning of a program. But one space followed by one tab is always shown as indented more than just one space, so by itself it is allowed.
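The "6 spaces versus one tab" example can be checked mechanically with `str.expandtabs`. This is just an illustration of the point above; `shown_width` is a made-up helper standing in for "how wide an editor with this tab size would draw the indentation".

```python
def shown_width(prefix, tabsize):
    """Displayed width of a whitespace prefix in an editor with the given tab size."""
    return len(prefix.expandtabs(tabsize))


six_spaces = " " * 6
one_tab = "\t"
for tabsize in (2, 4, 8):
    # Which line looks more indented depends on the tab size.
    print(tabsize, shown_width(six_spaces, tabsize), shown_width(one_tab, tabsize))

# One space followed by a tab is wider than one space at every tab size tried,
# so the relative order of those two lines never changes.
print(all(shown_width(" \t", t) > shown_width(" ", t) for t in range(1, 17)))
```

The first loop prints widths 6 vs 2, 6 vs 4 and 6 vs 8, so the tab only looks deeper at tab size 8; the last line prints True.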

I don’t feel we need to contemplate a whole new proposal. However, I feel we should report your counterexample as a bug and fix it. The bug seems to date back to the Python 2 days (when you had to use -tt to get these errors – but it doesn’t flag this).

guido · December 10, 2023, 1:00am

That’s an orthogonal issue, although I agree that the caret should either not be shown, or should be shown at the start of the line (with exactly the original pattern of whitespace in front of it, so it will actually line up as long as you’re using a font suitable to ASCII art). Maybe @pablogsal or @lys.nikolaou can help with this.

guido · December 10, 2023, 3:28am

I looked it up in the source (I had to use a debugger to find where to look :-).

The key bit of code is here:

https://github.com/python/cpython/blob/ca1bde894305a1b88fdbb191426547f35b7e9200/Parser/lexer/lexer.c#L433-L436

```
else if (c == '\t') {
    col = (col / tok->tabsize + 1) * tok->tabsize;
    altcol = (altcol / ALTTABSIZE + 1) * ALTTABSIZE;
}
```

and here:

https://github.com/python/cpython/blob/ca1bde894305a1b88fdbb191426547f35b7e9200/Parser/lexer/lexer.c#L481-L483

```
if (altcol != tok->altindstack[tok->indent]) {
    return MAKE_TOKEN(_PyTokenizer_indenterror(tok));
}
```

(And a few similar places – just search for altcol.)

This basically keeps track of the current column using two different tab sizes: 8 (in col) and 1 (in altcol). (tok->tabsize is initialized to TABSIZE, which is 8, and AFAICT it is never changed.)

When the indentation level is needed, it checks that col and altcol are the same, and if not, raises an error. There is also an altindstack keeping track of the alternate column offset by indent level.

This explains that 8 spaces followed by a tab is considered equivalent to a tab and 8 spaces, whereas n spaces followed by a tab is not equivalent to a tab and n spaces, if n%8 != 0.
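The two-coordinate bookkeeping can be modelled in a few lines of Python. This is a simplified sketch, not a port of the lexer: `columns` and `same_level` are made-up names, and the real code compares each line against the indent stacks rather than comparing two prefixes directly.

```python
TABSIZE = 8     # tab width used for `col`, per the post above
ALTTABSIZE = 1  # tab width used for `altcol`


def columns(prefix):
    """Return (col, altcol) for a whitespace prefix, mirroring the quoted C snippet."""
    col = altcol = 0
    for ch in prefix:
        if ch == " ":
            col += 1
            altcol += 1
        elif ch == "\t":
            col = (col // TABSIZE + 1) * TABSIZE
            altcol = (altcol // ALTTABSIZE + 1) * ALTTABSIZE
    return col, altcol


def same_level(a, b):
    """Two prefixes are accepted as one level only if both coordinates agree."""
    return columns(a) == columns(b)


print(columns("\t" + " " * 8), columns(" " * 8 + "\t"))  # (16, 9) (16, 9)
print(columns("\t" + " " * 4), columns(" " * 4 + "\t"))  # (12, 5) (8, 5)
print([n for n in range(25) if same_level("\t" + " " * n, " " * n + "\t")])
```

The last line prints `[0, 8, 16, 24]`: in this model the two orders are only treated as the same level when the number of spaces is a multiple of 8.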

All in all the logic is a bit convoluted, but not too horrible. My understanding is that the tab-equals-8-spaces logic is used to assign a column offset to each token, for use in syntax errors. (And yes, this means that if you use tabs for indentation, and your editor uses 4 spaces per tab to display your code, the column offsets in error messages may appear wrong. This is why the language spec still insists on claiming a tab is equivalent to 8 spaces. It is also why almost everybody uses spaces for indentation.)

I have an idea for fixing this: keep track of the column in yet another coordinate system, where a tab is equivalent to a different number of spaces (neither 1 nor 8), and complain if all three column offsets aren’t the same. This would also require a third stack of indent levels in this system.
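A quick way to experiment with that idea is to generalise the column calculation to any tuple of tab sizes. Everything here is hypothetical: the helper name `columns` and the choice of 3 as the third tab size are mine, picked only because 3 is neither 1 nor 8.

```python
def columns(prefix, tabsizes):
    """Column reached by a whitespace prefix under each tab size in `tabsizes`."""
    cols = [0] * len(tabsizes)
    for ch in prefix:
        for i, size in enumerate(tabsizes):
            if ch == " ":
                cols[i] += 1
            elif ch == "\t":
                cols[i] = (cols[i] // size + 1) * size
    return tuple(cols)


tab_first = "\t" + " " * 8
spaces_first = " " * 8 + "\t"

# Current pair of coordinate systems: the two prefixes are indistinguishable.
print(columns(tab_first, (8, 1)), columns(spaces_first, (8, 1)))

# Adding a third system with tab size 3 tells them apart.
print(columns(tab_first, (8, 1, 3)), columns(spaces_first, (8, 1, 3)))

# With 24 spaces (a multiple of both 8 and 3) the three systems agree again.
print(columns("\t" + " " * 24, (8, 1, 3)), columns(" " * 24 + "\t", (8, 1, 3)))
```

In this model the third coordinate catches the 8-space counterexample, but an analogous pair reappears at 24 spaces, so a third tab size moves the problem to much deeper indentation rather than removing it.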

I’m not sure it’s worth it, given that the stated goal (“if it looks the same but indents differently, it’s an error”) is obtained for tab sizes that are divisors of 8, and the only common tab sizes are 4 and 8. But this is the code you should change.
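For the specific counterexample from this thread, the "divisors of 8" remark can be checked by brute force. This sketch only tests that one pair of prefixes, not the general claim, and `shown_width` is a made-up helper built on `str.expandtabs`.

```python
def shown_width(prefix, tabsize):
    """Displayed width of a whitespace prefix for a given editor tab size."""
    return len(prefix.expandtabs(tabsize))


tab_then_spaces = "\t" + " " * 8
spaces_then_tab = " " * 8 + "\t"

# Tab sizes at which the two prefixes are drawn with the same width.
agree = [
    t for t in range(1, 17)
    if shown_width(tab_then_spaces, t) == shown_width(spaces_then_tab, t)
]
print(agree)  # [1, 2, 4, 8]
```

So an editor set to 4 or 8 columns per tab shows the two lines aligned, matching how the interpreter treats them; any tab size that does not divide 8 shows them at different depths.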

lys.nikolaou · December 11, 2023, 3:41pm

Agreed. We can certainly look into making the carets of errors like this more useful.