Improved error reporting for string literal syntax errors

kknechtel · May 15, 2024, 3:43am

There are two cases I’d like to consider here. The first is the Unicode syntax error reported for bad escape sequences.

Python up to 3.8 gives (yes, I’m using a common Windows gotcha for testing on Linux):

$ python3.8 -c "data = 'C:\Users\me\Desktop\data.txt'"
  File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Python 3.9-3.11 show the corresponding code and point at the end of the string:

$ python3.11 -c "data = 'C:\Users\me\Desktop\data.txt'"
  File "<string>", line 1
    data = 'C:\Users\me\Desktop\data.txt'
                                         ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Python 3.12 highlights the entire string:

$ python3.12 -c "data = 'C:\Users\me\Desktop\data.txt'"
  File "<string>", line 1
    data = 'C:\Users\me\Desktop\data.txt'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

First off, calling these “unicode errors” is actually really strange because:

unicode is no longer the name of the type
the problem is specifically with sequences designed to avoid using “Unicode” (meaning non-ASCII) characters in the code
the error message equally applies to \x escapes

Second, referring to the 'unicodeescape' codec is also strange, and not helpful. It reveals a completely irrelevant implementation detail [1], and gives the false impression that one could somehow substitute a different codec. It also entails that the error is presented as a “decoding” problem related to Unicode, when it’s actually an issue of backslash escape processing.

Third, these highlights are not useful. Python clearly knows where the problem is (because it prints the “position” indication) - so it should be able to use caret symbols to point at it properly.
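A minimal sketch of how the reported position could be turned into a caret line, assuming the literal body is pure ASCII so that byte offsets equal column offsets. `caret_line` and its parameters are hypothetical names for illustration, not anything in CPython; the offsets 2 and 4 are the half-open form of the "position 2-3" in the message above.

```python
def caret_line(body_start: int, err_start: int, err_end: int, indent: int = 4) -> str:
    """Build a caret line for an error range inside a string literal.

    body_start: index in the source line of the first character after the
                opening quote (hypothetical input; a real parser knows this).
    err_start, err_end: half-open range inside the literal body.
    indent: spaces the traceback puts in front of the echoed source line.
    """
    column = body_start + err_start
    return " " * (indent + column) + "^" * (err_end - err_start)


source = r"data = 'C:\Users\me\Desktop\data.txt'"
body_start = source.index("'") + 1  # first character after the opening quote
marker = caret_line(body_start, 2, 4)

print("    " + source)
print(marker)
```

If the literal contained non-ASCII characters, the byte offsets would no longer line up with columns, so a real implementation would need to map them back first.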

My suggestion:

$ python3.x -c "data = 'C:\Users\me\Desktop\data.txt'"
  File "<string>", line 1
    data = 'C:\Users\me\Desktop\data.txt'
              ^^
SyntaxError: invalid Unicode escape. To represent a Unicode character, '\U' must be followed by 8 hex digits. Use '\\U' instead (or a raw string) if this should be a backslash followed by 'U'.

Similarly for \u and \x escapes, indicating how many hex digits to use for each of those.
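To make the "how many hex digits" part concrete, here is a rough sketch of a message builder. The digit counts (2 for \x, 4 for \u, 8 for \U) are the ones Python's string literals use; the function name `escape_hint` and the table `HEX_DIGITS` are made up for this example, and the wording is simply the suggestion above with the escape letter swapped in.

```python
# Number of hex digits each escape requires in a Python string literal.
HEX_DIGITS = {"x": 2, "u": 4, "U": 8}


def escape_hint(kind: str) -> str:
    """Return the proposed message for a truncated \\x, \\u or \\U escape."""
    digits = HEX_DIGITS[kind]
    return (
        f"invalid Unicode escape. To represent a Unicode character, "
        f"'\\{kind}' must be followed by {digits} hex digits. "
        f"Use '\\\\{kind}' instead (or a raw string) if this should be "
        f"a backslash followed by '{kind}'."
    )


for kind in HEX_DIGITS:
    print(escape_hint(kind))
```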

The second case is string literals that are left unterminated because the intended closing quote is escaped.

Python up to 3.9 gives:

$ python3.9 -c "'/\/\/\'"
  File "<string>", line 1
    '/\/\/\'
            ^
SyntaxError: EOL while scanning string literal

Python 3.10-3.12 put the caret at the start of the string, which is arguably worse:

$ python3.10 -c "'/\/\/\'"
  File "<string>", line 1
    '/\/\/\'
    ^
SyntaxError: unterminated string literal (detected at line 1)

What I’d hope to see instead is a highlight of the last \' sequence on the line with basic advice, and simpler wording:

$ python3.x -c "'/\/\/\'"
  File "<string>", line 1
    '/\/\/\'
          ^^
SyntaxError: couldn't find the end of string. Use "\\'" instead if the highlighted part should be a backslash and the end of string.

If a \ followed by the quote character appears anywhere past the opening quote, it would scan for the last such sequence:

$ python3.x -c "'/\'/\'/\' + 'text'"
  File "<string>", line 1
    '/\'/\'/\' + 'text'
            ^^
SyntaxError: couldn't find the end of string. Use "\\'" instead if the highlighted part should be a backslash and the end of string.

Of course, the error message doesn’t mention raw strings this time, because they’re unlikely to be helpful.
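The scan for the last backslash-quote sequence could be sketched roughly as follows. This is only an illustration of the heuristic, not how the tokenizer works: `last_escaped_quote` is a hypothetical helper, it assumes the caller already knows the index of the opening quote, and it walks the rest of the line in escape pairs so that an escaped backslash is not mistaken for the start of `\'`.

```python
def last_escaped_quote(line: str, open_index: int) -> int:
    """Return the index of the last backslash that escapes the quote
    character found at line[open_index], or -1 if there is none."""
    quote = line[open_index]
    last = -1
    i = open_index + 1
    while i < len(line) - 1:
        if line[i] == "\\":
            if line[i + 1] == quote:
                last = i
            i += 2  # consume the backslash together with the character it escapes
        else:
            i += 1
    return last


for line in (r"'/\/\/\'", r"'/\'/\'/\' + 'text'"):
    index = last_escaped_quote(line, 0)
    print("    " + line)
    print(" " * (4 + index) + "^^")
```

For the two examples above, this places the carets at the same columns as the suggested output.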

[1] Which is itself strange, actually. The implication is that Python has already read the file as bytes to get as far as a coding declaration or else decide to use UTF-8; then converted to string for tokenization; then converted the string literal token back to bytes in order to be able to apply a codec; then attempted to convert back to string via that codec…

Rosuav · May 15, 2024, 3:51am

Less strange than you might think. Yes, unicode isn’t the name of the type, but it’s still a Unicode string. Calling some characters “Unicode” to differentiate them from “ASCII” is incorrect usage and Python shouldn’t be perpetuating this; you can use a Unicode escape for any codepoint, even an ASCII one. The \x shorthand works for any two-digit codepoint value, including ones that aren’t ASCII.
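A quick check of that point: the three escape forms are interchangeable for low code points, and \x is not limited to ASCII. The variable names are only for illustration.

```python
# The same ASCII character written with each escape form.
ascii_forms = {"\x41", "\u0041", "\U00000041"}
print(ascii_forms)

# \x also reaches code points above the ASCII range (128-255).
e_acute = "\xe9"
print(ord(e_acute), e_acute == "\u00e9")
```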

You’re right that talking about the codec isn’t generally helpful, although perhaps it’s still not entirely wrong, since you can poke around with the codecs module and get the same behaviour. And I absolutely agree that it would be helpful to improve this message, since Windows users are forever going to run into this with “\Users”.
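For anyone who wants to poke at it, something like the following reproduces the error outside the parser. It goes through the public `unicode_escape` codec, on the assumption that this reports the failure the same way the parser's internal call does; the attributes read at the end are the standard ones on `UnicodeDecodeError`.

```python
import codecs

# Raw bytes literal, so the backslashes reach the codec untouched.
body = rb"C:\Users\me\Desktop\data.txt"

try:
    codecs.decode(body, "unicode_escape")
except UnicodeDecodeError as exc:
    error = exc

print(error.encoding)          # codec name as it appears in the message
print(error.start, error.end)  # half-open byte range of the bad escape
print(error.reason)            # the human-readable description
```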

kknechtel · May 15, 2024, 3:59am

I agree, hence the scare quotes. As far as I can tell, "invalid Unicode escape. To represent a Unicode character, '\U' must be followed by 8 hex digits" is still correct wording; but if you think it’s misleading I’m open to suggestions. (Maybe "... an escaped character, ..."?)

One issue that occurred to me is that caret highlighting could be inaccurate if the string contains multi-character graphemes, zero-width characters and/or characters that typically don’t print at “standard” width even in monospace fonts (I’m thinking in particular of CJK ideographs and emoji). But these problems exist regardless and would affect all uses of the technique in errors.
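To illustrate the width problem, here is a rough approximation of how many terminal columns a string occupies, using only the standard `unicodedata` module. The function `display_width` is a hypothetical helper, and the rules are a simplification (real terminals and fonts differ, and multi-code-point emoji sequences are not handled).

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of monospace columns taken up by text."""
    width = 0
    for ch in text:
        # Combining marks and format characters (e.g. zero-width space)
        # are treated as occupying no column of their own.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide and fullwidth East Asian characters usually take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for sample in ("abc", "\u65e5\u672c", "e\u0301", "a\u200bb"):
    print(repr(sample), len(sample), display_width(sample))
```

Whenever `len(sample)` and `display_width(sample)` disagree, carets positioned by character count would be misaligned.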

kknechtel · May 15, 2024, 4:03am

@eryksun The forum’s suggestion feature found your query in an old thread; rather than bump that, I thought I might draw your attention to this proposal.

Rosuav · May 15, 2024, 7:17am

Yeah, I think that’s correct; it is an invalid Unicode escape, since that’s precisely what \U is meant to introduce. Just saying, calling it a “Unicode error” isn’t wrong either.

As for caret highlighting being thrown off by wide or zero-width characters: not enough of a problem to be worth fixing IMO.

storchaka · May 15, 2024, 9:58am

This is what happens here. The parser first decodes the source with the source encoding (UTF-8 by default), then encodes it as UTF-8. When it finds a string literal, it decodes it from UTF-8 and encodes it to ASCII, representing all non-ASCII characters as \UXXXXXXXX. Then it calls the Unicode Escape codec (directly, using the private API _PyUnicode_DecodeUnicodeEscapeInternal()), which decodes all escape sequences (such as \n, \ooo, \xXX, \uXXXX and \UXXXXXXXX). It is a normal codec, so it creates a UnicodeDecodeError that includes the range and an error description like “truncated \UXXXXXXXX escape”, tries to call the error handler (there is none in this case) and raises the exception. The full error message of the UnicodeDecodeError includes the name of the codec, the range and the description. The parser catches it and converts it to a SyntaxError, adding “(unicode error)” in front of the original error message.
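The steps above can be imitated in pure Python, which may make the flow easier to follow. This is a simplified emulation under stated assumptions, not the parser's code: `decode_literal_body` is a hypothetical name, it uses the public `unicode_escape` codec instead of the private C API, and it ignores details such as bytes literals and invalid-escape warnings.

```python
import codecs


def decode_literal_body(body: str) -> str:
    """Emulate how the body of a (non-raw) str literal gets its escapes decoded."""
    # Make the body pure ASCII, spelling non-ASCII characters as \UXXXXXXXX.
    ascii_body = "".join(
        ch if ord(ch) < 128 else f"\\U{ord(ch):08x}" for ch in body
    ).encode("ascii")
    try:
        # Decode every escape sequence with the ordinary codec machinery.
        return codecs.decode(ascii_body, "unicode_escape")
    except UnicodeDecodeError as exc:
        # Rewrap the codec's error, prefixing its full message.
        raise SyntaxError(f"(unicode error) {exc}") from None


print(repr(decode_literal_body("caf\u00e9\\n")))

try:
    decode_literal_body(r"C:\Users\me\Desktop\data.txt")
except SyntaxError as exc:
    message = str(exc)
    print(message)
```

The resulting message has the same text as the one the interpreter prints, which shows where the codec name and the byte positions in it come from.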

kknechtel · May 15, 2024, 8:25pm

Yes, that makes perfect sense. But I think the parser can do something more useful when it catches the UnicodeDecodeError, so I made the suggestion.