I originally brought this up on the (apparently legacy) python-ideas mailing list, but was redirected here with mostly negative initial feedback (I was prepared for that; hence this is an ‘idea’, not a ‘feature request’).
Sorry for some formatting, but there is a two-link limit for new users; I tried my best to adhere to it without losing readability. (moderators: you can edit it as you wish)
## Problem
People coming from C, Perl, Python 2 and similar languages quite often mistake `"\x90"` for `b"\x90"`. The situation is only made worse by APIs that accept both `str` and `bytes`.
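A minimal sketch of the confusion described above (my own illustration, not from the original discussion): `"\x90"` is a one-character `str` containing the code point U+0090, not the single byte `0x90`.

```python
s = "\x90"    # a str: one code point, U+0090
b = b"\x90"   # a bytes object: one byte, 0x90

assert len(s) == 1 and ord(s) == 0x90
assert len(b) == 1 and b[0] == 0x90

# Encoded as UTF-8 (the default almost everywhere), the str becomes
# TWO bytes, not the single byte a C programmer probably intended.
assert s.encode("utf-8") == b"\xc2\x90"
assert s.encode("latin-1") == b   # only Latin-1 maps it 1:1
```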
My idea is that Python could deprecate string literals containing any non-ASCII code points spelled in any way other than as literal Unicode characters (ß) or Unicode escapes (`\u`, `\U`, `\N`).
My experience (probably shared by most low-level people) is that nearly everyone writing `"\xF0"` actually means `b"\xF0"` but has just switched context from writing some C code. If they mean `"\u00F0"`, why not hint them to (or make them) write that?
If someone stays in the ASCII range it is ‘harmless’ most of the time, since all modern charsets map the 7-bit range 1:1. This idea is about saving the time of everyone who uses an API that accepts both `str` and `bytes` objects (paths, environment variables and process arguments, for instance) and feeds it UTF-8 while meaning ISO-8859-1, because they just forgot the `b` prefix and now have to debug some strange behaviour.
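The failure mode above can be sketched directly (my own example, assuming a UTF-8 API behind the scenes): the author has the ISO-8859-1 byte `0xF0` in mind, but the forgotten `b` prefix turns it into two UTF-8 bytes.

```python
meant = "\xf0".encode("iso-8859-1")   # the single byte they had in mind
got = "\xf0".encode("utf-8")          # what a UTF-8 API actually receives

assert meant == b"\xf0"
assert got == b"\xc3\xb0"   # two bytes: hence the "strange behaviour"
assert meant != got
```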
(Barry brought up `"\x9b"` (CSI) as useful, but I think `b"\x9b"` should be preferred in 7-bit contexts, and `"\x1b["` / `"\33["` and `"\u009b"` are more widely used. No actual research from us.)
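For reference (my transcription of the spellings mentioned above): the 7-bit `ESC [` introducer has several equivalent escape spellings, while the 8-bit C1 CSI is a different, single code point.

```python
assert "\x1b[" == "\33[" == "\u001b["   # 7-bit ESC [, three spellings
assert "\x9b" == "\u009b"               # 8-bit C1 CSI, one code point
assert "\x9b" != "\x1b["                # distinct sequences on the wire
assert len("\x1b[") == 2 and len("\x9b") == 1
```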
I first had this idea back in 2021, on [StackOverflow][1]; the question there is an excellent example of what I mean.
## Solution

### Syntax
I would not go so far as to follow JSON (which disallows `\x11` and `\222` escapes completely): writing `"\x00"` or `"\0"` is useful and widely used, but `"\x99"` (and especially `"\777"`!?, fortunately addressed by GH-91668) is probably marginal and definitely less explicit than `"\u0099"` (in the Zen sense of explicit being better than implicit). Byte strings do not treat `b"\u00ff"` as `b"\xff"`, for example, so why should strings treat `"\xff"` as `"\u00ff"`?
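The asymmetry can be checked directly (my own sketch): `\u` is meaningful only in `str` literals, while `\x` works in both but with different semantics.

```python
import warnings

assert "\xff" == "\u00ff"   # in a str, both spell U+00FF
assert len(b"\xff") == 1    # in bytes, \xff is a single byte

# b"\u00ff" is NOT the byte 0xFF: \u is no escape in bytes literals,
# so the backslash survives (newer Pythons warn about the invalid
# escape sequence, hence the eval under a suppressed warning here).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    raw = eval(r'b"\u00ff"')
assert raw == b"\\u00ff" and len(raw) == 6
```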
I also saw that several GH issues on similar topics have been closed lately (https ://github.com/python/cpython/issues/98401 and https ://github.com/python/cpython/issues/81548).
In their spirit, Python could raise a `DeprecationWarning` and then a `SyntaxWarning` (or should it be a `BytesWarning`?), suggesting that `"\x99"` become either `"\u0099"` or `b"\x99"`, and eventually promote the warning to some equally helpful `SyntaxError`. The final behaviour could be hidden behind a feature gate like `from __future__ import backslashes` (one nice name I can think of) or the interpreter `-b` flag.
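As a precedent for the mechanism (my own sketch, not part of the proposal): Python already deprecates unrecognized escapes such as `"\d"` through the warnings machinery, which is what the proposal would extend to `"\x99"`-style escapes. Depending on the Python version, the category is `DeprecationWarning` (up to 3.11) or `SyntaxWarning` (3.12+).

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(r'"\d"', "<example>", "eval")   # invalid escape \d in a str

assert any(
    issubclass(w.category, (DeprecationWarning, SyntaxWarning))
    for w in caught
)
```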
The new regular expression for octal escapes would be `\\[01]?[0-7]{1,2}`, and `\\x[0-7][0-9A-Fa-f]` for hexadecimal ones; hopefully not confusing anyone, and not much more complex than the old ones.
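Trying out the proposed patterns (my transcription of the expressions above): they accept escapes in the ASCII range and reject anything above `\177` / `\x7F`.

```python
import re

octal_re = re.compile(r"\\[01]?[0-7]{1,2}\Z")
hex_re = re.compile(r"\\x[0-7][0-9A-Fa-f]\Z")

assert octal_re.match(r"\0") and octal_re.match(r"\177")
assert not octal_re.match(r"\777")   # > 0o177: rejected
assert hex_re.match(r"\x00") and hex_re.match(r"\x7F")
assert not hex_re.match(r"\x99")     # > 0x7F: rejected
```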
### `repr()`
In the meantime, the default `ascii()` representation should eventually use the `\u0099` form for all such code points, to keep the invariant `eval(ascii(x)) == x` free of syntax warnings. `repr()` is also affected, but it is fortunately limited to the `[\x80-\xa0\xad]` range; I mean, the `[\u0080-\u00a0\u00ad]` range.
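For comparison, the current behaviour (my own check): both `ascii()` and `repr()` spell unprintable C1 code points with `\x` escapes today, and the round-trip invariant relies on those escapes.

```python
assert ascii("\u0099") == r"'\x99'"
assert repr("\u0099") == r"'\x99'"        # U+0099 is unprintable
assert repr("\u00e9") == "'é'"            # printable chars pass through repr
assert ascii("\u00e9") == r"'\xe9'"       # but not through ascii
assert eval(ascii("\u0099")) == "\u0099"  # the invariant to preserve
```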
A reasonable timeline: first change the repr, initially hidden behind an interpreter flag or environment variable; then officially deprecate the old escapes in the documentation; then introduce the error guarded by `-b` / `from __future__ import backslashes` or another flag; then make the repr use `\u00NN` by default (delayable); then enable the warning by default; and finally make it always raise an error. As a precedent, breaking `repr()` was not a dealbreaker when randomized hash seeds were introduced (even `repr({"a", "b"})` is now unpredictable).
This would of course be a breaking change for a lot of unit tests, and tools like pickle should probably keep supporting the old syntax, delaying any such change until a new protocol arrives (if it even applies to the newest one; I am quite sure it does not).
## Considerations
Such a breaking change should be used wisely: other changes to octal escapes could be sneaked in, based on conclusions from the 2018 [‘Python octal escape character encoding “wats”’ thread][2] (I like writing `"\0"` and `"\4"` though, just to make my opinion clear). If going the whole hog, the 2015 [‘Make non-meaningful backslashes illegal in string literals’ thread][3] could be revived as well, w.r.t. deprecating `"\f\v"`, introducing `"\e" == "\33"`, and such.
Please let me know what you think: what else could break, whether this would be useful anywhere beyond my use case, and what similar problems you have run into.
[1]: https ://stackoverflow.com/q/64832281/3869724
[2]: https ://mail.python.org/archives/list/python-ideas@python.org/thread/ARBCIPEQB32XBS7T3JMKUDIZ7BZGFTL6/
[3]: https ://mail.python.org/archives/list/python-ideas@python.org/message/PJXKDJQT4XW6ZSMIIK7KAZ4OCDAO6DUT/