There are two cases I’d like to consider here. The first is the “unicode error” syntax errors reported for bad escape sequences.
Python up to 3.8 gives (yes, I’m using a common Windows gotcha for testing on Linux):
$ python3.8 -c "data = 'C:\Users\me\Desktop\data.txt'"
File "<string>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Python 3.9-3.11 show the corresponding code and point at the end of the string:
$ python3.11 -c "data = 'C:\Users\me\Desktop\data.txt'"
File "<string>", line 1
data = 'C:\Users\me\Desktop\data.txt'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Python 3.12 highlights the entire string:
$ python3.12 -c "data = 'C:\Users\me\Desktop\data.txt'"
File "<string>", line 1
data = 'C:\Users\me\Desktop\data.txt'
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
First off, calling these “unicode errors” is actually really strange because:
- unicode is no longer the name of the type
- the problem is specifically with sequences designed to avoid using “Unicode” (meaning non-ASCII) characters in the code
- the error message equally applies to \x escapes
Second, referring to the 'unicodeescape' codec is also strange, and not helpful. It reveals a completely irrelevant implementation detail[1], and gives the false impression that one could somehow substitute a different codec. It also entails that the error is presented as a “decoding” problem related to Unicode, when it’s actually an issue of backslash escape processing.
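To make the codec point concrete, here is a small sketch (my own illustration, not anything taken from CPython's source) that triggers the same kind of failure twice: once through compile(), and once by calling the unicode_escape codec by hand on the literal's contents. The exact wording may vary between Python versions, so the snippet prints the messages rather than assuming them.

```python
# Compile source containing a truncated \x escape. The raw string keeps the
# backslash intact so that it reaches compile() unchanged.
source = r"s = '\x4'"
try:
    compile(source, "<string>", "exec")
except SyntaxError as exc:
    syntax_msg = exc.msg
    print("SyntaxError:", syntax_msg)

# Apply the codec directly to the same bytes; the complaint comes from here.
try:
    rb"\x4".decode("unicode_escape")
except UnicodeDecodeError as exc:
    codec_msg = str(exc)
    print("UnicodeDecodeError:", codec_msg)
```

On the versions I have tried, both messages describe a truncated escape, which suggests the SyntaxError text is simply the codec's own error passed through.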
Third, these highlights are not useful. Python clearly knows where the problem is (because it prints the “position” indication) - so it should be able to use caret symbols to point at it properly.
My suggestion:
$ python3.x -c "data = 'C:\Users\me\Desktop\data.txt'"
File "<string>", line 1
data = 'C:\Users\me\Desktop\data.txt'
          ^^
SyntaxError: invalid Unicode escape. To represent a Unicode character, '\U' must be followed by 8 hex digits. Use '\\U' instead (or a raw string) if this should be a backslash followed by 'U'.
Similarly for \u and \x escapes, indicating how many hex digits to use for each of those.
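As a rough sketch of how such a message could be produced, the following scans the contents of a string literal for a \x, \u or \U escape that lacks the required hex digits (2, 4 and 8 respectively). The names find_truncated_escape, caret_line and suggested_message are hypothetical helpers made up for this illustration, the wording for \u and \x is my guess at "similarly", and the real tokenizer handles far more cases (for example \N{...} and out-of-range code points) than this does.

```python
import string

# Hex digits required after each escape letter.
HEX_DIGITS = {"x": 2, "u": 4, "U": 8}

def find_truncated_escape(body):
    """Return (index, letter) for the first \\x, \\u or \\U escape in `body`
    that is not followed by enough hex digits, or None if there is none.
    `body` is the text between the quotes (hypothetical helper)."""
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        letter = body[i + 1 : i + 2]
        need = HEX_DIGITS.get(letter)
        if need is None:
            i += 2  # some other escape, including an escaped backslash
            continue
        digits = body[i + 2 : i + 2 + need]
        if len(digits) < need or any(c not in string.hexdigits for c in digits):
            return i, letter
        i += 2 + need
    return None

def caret_line(body_start, index):
    """Build a line with carets under the backslash and the escape letter."""
    return " " * (body_start + index) + "^^"

def suggested_message(letter):
    """Format the proposed wording for the given escape letter."""
    n = HEX_DIGITS[letter]
    return (
        "invalid Unicode escape. To represent a Unicode character, "
        f"'\\{letter}' must be followed by {n} hex digits. "
        f"Use '\\\\{letter}' instead (or a raw string) if this should be "
        f"a backslash followed by '{letter}'."
    )

line = r"data = 'C:\Users\me\Desktop\data.txt'"
body_start = line.index("'") + 1          # first character after the quote
index, letter = find_truncated_escape(line[body_start:-1])
print(line)
print(caret_line(body_start, index))
print("SyntaxError:", suggested_message(letter))
```

The reported index is relative to the literal's contents, like the "position" in the current message, so adding the column of the opening quote is enough to place the carets.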
The second case is string literals that are unterminated because the intended closing quote is escaped.
Python up to 3.9 gives:
$ python3.9 -c "'/\/\/\'"
File "<string>", line 1
'/\/\/\'
^
SyntaxError: EOL while scanning string literal
Python 3.10-3.12 put the caret at the start of the string, which is arguably worse:
$ python3.10 -c "'/\/\/\'"
File "<string>", line 1
'/\/\/\'
^
SyntaxError: unterminated string literal (detected at line 1)
What I’d hope to see instead is a highlight of the last \' sequence on the line with basic advice, and simpler wording:
$ python3.x -c "'/\/\/\'"
File "<string>", line 1
'/\/\/\'
      ^^
SyntaxError: couldn't find the end of string. Use "\\'" instead if the highlighted part should be a backslash and the end of string.
If a \ followed by the quote character appears anywhere after the opening quote, the highlight would go on the last such sequence:
$ python3.x -c "'/\'/\'/\' + 'text'"
File "<string>", line 1
'/\'/\'/\' + 'text'
        ^^
SyntaxError: couldn't find the end of string. Use "\\'" instead if the highlighted part should be a backslash and the end of string.
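The scan for the last escaped quote could look something like this sketch. The function name last_escaped_quote is hypothetical, and I'm assuming it only runs after the tokenizer has already decided that the literal is unterminated, with the column of the opening quote known.

```python
def last_escaped_quote(line, open_col):
    """Return the column of the last backslash that escapes the quote
    character, looking after the opening quote at `open_col`.
    Returns None if there is no such sequence (hypothetical helper)."""
    quote = line[open_col]
    last = None
    i = open_col + 1
    while i < len(line) - 1:
        if line[i] == "\\":
            if line[i + 1] == quote:
                last = i
            i += 2  # skip the escaped character so that \\ is not misread
        else:
            i += 1
    return last

for line in (r"'/\/\/\'", r"'/\'/\'/\' + 'text'"):
    col = last_escaped_quote(line, 0)
    print(line)
    print(" " * col + "^^")
```

Skipping two characters after each backslash matters: in a literal ending in \\' the quote really does close the string, so it must not be highlighted.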
Of course, the error message doesn’t mention raw strings this time, because they’re unlikely to be helpful.
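A quick check of why raw strings don't rescue this case: a raw literal cannot end in a single backslash either, and doubling the final backslash in a raw literal keeps both of them. This is a minimal sketch; fails_to_compile is a made-up helper name.

```python
def fails_to_compile(source):
    """Return True if `source` is rejected with a SyntaxError."""
    try:
        compile(source, "<string>", "eval")
    except SyntaxError:
        return True
    return False

# The original literal and its raw-string variant are both unterminated.
print(fails_to_compile(r"'/\/\/\'"))    # the literal from the example
print(fails_to_compile(r"r'/\/\/\'"))   # same text with an r prefix

# Doubling the last backslash makes the raw literal valid, but the value
# then ends in two backslashes rather than one.
value = eval(r"r'/\/\/\\'")
print(value)
```

So with a raw string you get either the same error or a different string from the one you wanted.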
[1] which is itself strange, actually. The implication is that Python has already read the file as bytes to get as far as a coding declaration or else decide to use UTF-8; then converted to string for tokenization; then converted the string literal token back to bytes in order to be able to apply a codec; then attempted to convert back to string via that codec…