Deprecate misleading escapes in strings

I originally brought this up on the (apparently legacy) python-ideas mailing list, but was redirected here after mostly negative initial feedback (I was prepared for that, hence an ‘idea’, not a ‘feature request’).

Sorry for some of the formatting: there is a two-link limit for new users, and I tried my best to work around it without losing readability. (Moderators: feel free to edit as you wish.)

Problem

People coming from C, Perl, Python 2 and similar languages quite often mistake "\x90" for b"\x90". The situation is only made worse by APIs that accept both str and bytes.

My idea is that Python could deprecate string literals that contain non-ASCII code points spelled in any way other than literal Unicode characters (ß) or Unicode escapes (\u, \U, \N).

My experience (probably shared by most low-level people) is that nearly everyone writing "\xF0" actually means b"\xF0" and has just switched context from writing some C code. If they mean "\u00F0", why not hint them toward (or make them) write exactly that?

If someone stays within the ASCII range it is ‘harmless’ most of the time, since all modern charsets map the 7-bit range 1:1. The idea is about saving time for everyone who uses an API that accepts both str and bytes objects (paths, environment variables and process arguments, for instance) and feeds it UTF-8 while meaning ISO-8859-1, because they just forgot the b and now have to debug some strange behaviour.
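
A minimal REPL illustration of the mix-up (any byte value above 0x7f behaves the same):

>>> s = "\xf0"                   # one code point, U+00F0 ('ð')
>>> b = b"\xf0"                  # one raw byte, 0xF0
>>> s.encode("utf-8")            # two bytes in UTF-8, not b'\xf0'!
b'\xc3\xb0'
>>> s.encode("latin-1") == b     # only ISO-8859-1 maps it 1:1
True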

(Barry brought up the "\x9b" CSI as useful, but I think b"\x9b" should be preferred in 7-bit contexts, and "\x1b[" / "\33[" and "\u009b" are more widely used. Neither of us has done actual research.)

I first had the idea back in 2021 on [StackOverflow][1]. The question there is an excellent example of what I mean.

Solution

Syntax

I would not go so far as to follow JSON (which disallows \x11 and \222 escapes completely): writing "\x00" or "\0" is useful and widespread, but "\x99" (and especially "\777"!? fortunately addressed by GH-91668) is probably marginal and definitely less explicit than "\u0099" (and per the Zen, explicit is better than implicit). Byte strings do not treat b"\u00ff" as b"\xff", for example, so why should text strings treat "\xff" as "\u00ff"?
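
To see the asymmetry in a current REPL (recent versions additionally warn about the invalid \u escape in the bytes literal):

>>> "\xff" == "\u00ff"    # str: two spellings of the same code point
True
>>> b"\u00ff"             # bytes: \u is not an escape, kept as six literal bytes
b'\\u00ff'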

I also saw that there have been several GH issues closed lately concerning similar topics (https://github.com/python/cpython/issues/98401 and https://github.com/python/cpython/issues/81548).

In their spirit, Python could raise a DeprecationWarning and later a SyntaxWarning (or should it be a BytesWarning?), suggesting that "\x99" become either "\u0099" or b"\x99", and eventually promote it to an equally helpful SyntaxError. The final behaviour could be gated behind a feature such as from __future__ import backslashes (one nice name I can think of) or the interpreter's -b flag.

The new regular expression for octals would be \\[01]?[0-7]{1,2} and \\x[0-7][0-9A-Fa-f] for hexadecimals, hopefully not confusing anyone, and not much more complex than the old ones.
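
A quick self-check of those patterns (just a sketch of what they would accept; the real change would live in the tokenizer):

import re

OCTAL = re.compile(r'\\[01]?[0-7]{1,2}')    # \0 .. \177
HEX = re.compile(r'\\x[0-7][0-9A-Fa-f]')    # \x00 .. \x7f

assert OCTAL.fullmatch(r'\177')      # DEL, still ASCII
assert not OCTAL.fullmatch(r'\377')  # would need "\u00ff" or b"\377"
assert HEX.fullmatch(r'\x1b')        # ESC stays valid
assert not HEX.fullmatch(r'\x9b')    # would need "\u009b" or b"\x9b"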

repr()

In the meantime, the default ascii() representation should eventually use the \u0099 form for all such code points, to keep the invariant eval(ascii(x)) == x free of syntax warnings. repr() is also affected, but fortunately only for the [\x80-\xa0\xad] range. I mean [\u0080-\u00a0\u00ad]. :smile:
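
For reference, the current behaviour that would change (CPython 3.x):

>>> ascii("\u0099")    # today round-trips via the \x form
"'\\x99'"
>>> repr("\xe9")       # printable code points are unaffected anyway
"'é'"
>>> repr("\x99")       # only the non-printable range falls back to \x
"'\\x99'"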

A reasonable timeline:

  • change the repr first, initially hidden behind an interpreter flag or environment variable;
  • officially deprecate the old escapes in the documentation;
  • introduce the error, guarded by -b / from __future__ import backslashes or another flag;
  • make the repr use \u00NN by default (delayable);
  • enable the warning by default;
  • finally, make it always raise an error.

As a precedent, breaking repr() was not a dealbreaker when hash randomization was introduced (even repr({"a", "b"}) is now unpredictable between runs).

This would of course be a breaking change for a lot of unit tests, and things like pickle should probably keep supporting the old syntax, delaying any such change until a new protocol arrives (if it applies to the newest one at all; I am fairly sure it does not).

Considerations

Such a breaking change must be used wisely. Other changes to octal escapes could be folded in, based on conclusions from the 2018 [‘Python octal escape character encoding “wats”’ thread][2] (I do like writing "\0" and "\4", though, just to make my opinion clear). If going the whole hog, the 2015 [‘Make non-meaningful backslashes illegal in string literals’ thread][3] could be revived as well, regarding "\f\v" deprecation, introducing "\e" == "\33", and so on.

Please let me know what you think, what else could break, whether this would be useful anywhere beyond my use case, and what similar problems you have run into.

[1]: https://stackoverflow.com/q/64832281/3869724
[2]: https://mail.python.org/archives/list/python-ideas@python.org/thread/ARBCIPEQB32XBS7T3JMKUDIZ7BZGFTL6/
[3]: https://mail.python.org/archives/list/python-ideas@python.org/message/PJXKDJQT4XW6ZSMIIK7KAZ4OCDAO6DUT/


As I posted on python-ideas, this similarity is NOT a bug. It’s extremely convenient, for instance, to be able to work with ASCII byte strings and ASCII text strings using many of the same features (see e.g. PEP 461, which added percent formatting to byte strings in 3.5 specifically to make this parallel stronger).
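
For instance, since PEP 461 the same %-formatting works on both sides of the divide:

>>> "Rule %d: %s" % (1, "be nice")
'Rule 1: be nice'
>>> b"Rule %d: %s" % (1, b"be nice")
b'Rule 1: be nice'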

Your proposal is a significant amount of breakage, and for what benefit? Removing a similarity isn’t an improvement, and in any case, it ONLY removes this similarity in the narrow case where a \xNN escape is used and no \uNNNN escape is used.

The whole proposal hinges on this claim: that nearly everyone writing "\xF0" actually means b"\xF0".

Can you offer any evidence to support it?

What you describe is different. I agree with adding useful features from text strings to byte strings for convenience. I also agree with keeping the interfaces consistent. I even agree that "\x41" == "A" is consistent with b"\x41" == b"A".

What I disagree with is creating confusion by implying that "þ" == "\u00FE" is in any way similar to b"\xFE". It is b"\xc3\xbe" in UTF-8, the (almost) only charset used for storage and communication today, and the one mandated for Python 3 source code.
"\xFE" == "þ" is not consistent with b"\xFE" ==?== b"�" != b"þ" ==?== b"\xc3\xbe", and b"þ" is explicitly disallowed anyway:

>>> b'þ'
  File "<stdin>", line 1
    b'þ'
        ^
SyntaxError: bytes can only contain ASCII literal characters

That syntax error could be absent and the language would be more ‘consistent’, but it is there to guard against accidental mistakes, even where the literal would otherwise make sense.

Is it a significant amount of breakage, though? I think that whenever I write "\xNN" meaning a Unicode code point, 0xNN <= 0x7f will hold. If not, why not write "\u00NN"? Or better yet, "þ". (Several languages, e.g. JSON, already force you to do one of the latter.)

As for the evidence, see the linked StackOverflow question for one example; for another, I’ve had numerous reports on the pwntools repo (and countless private conversations with infosec friends): Issues · Gallopsled/pwntools · GitHub

I hope that makes my idea clearer.

Yes, it is. Valid code becomes invalid. I have a lot of code that uses \x1b in either text or byte strings to mean an escape character; ANSI escape sequences don’t magically become different because they’re in a text string:

print("\x1b[44mBlue\x1b[0m")             # ESC [44m = blue background, ESC [0m = reset
some_file.write(b"\x1b[44mBlue\x1b[0m")  # the very same sequence, as bytes

Had Unicode strings never supported \xnn notation, the call to add support would be somewhat weakened by the fact that it can already be spelled \u00nn, but the parallel is still useful. Since this notation DOES exist, removing it is significant breakage.

Please, can people stop being so dismissive of backward compatibility? Python is not React.js.

Fortunately 0x1b <= 0x7f holds, so your code would stay valid. What other codes do you think are frequently used? I personally use mainly NUL and ESC. Barry mentioned the Unicode CSI, but it is probably not even implemented in most terminal emulators.
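
Both of those are ASCII, so every spelling of them would remain valid under the proposal:

>>> "\0" == "\x00" == "\u0000"    # NUL
True
>>> "\33" == "\x1b" == "\u001b"   # ESC
True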

I do see backwards compatibility as a significant issue, which is why I posted this as an idea thread: to gather feedback on whether I am the only one forgetting the b more often than not. :wink: However, octal escapes have already been limited from "\000"..."\777" to "\000"..."\377", which breaks backwards compatibility in a way (GH-91668 linked above, released in py3.11). Invalid escapes such as "\m" have also been deprecated (since py3.6). DeprecationWarnings were invented precisely to phase in such changes softly, getting harder and harder before a feature is removed completely after several years. Who still relies on a feature that has been shouting annoying warnings for three years in all maintained releases of major OSes? I know, sometimes compromises must be made to reduce the maintenance burden of legacy software. But a directed warning with a ready-made hint looks tempting to me.

(The warnings are currently hidden behind the -W default interpreter switch, but they are steadily bubbling up to the surface.)
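
For illustration, this is how the existing invalid-escape deprecation surfaces (observed on CPython 3.11; since 3.12 it is a SyntaxWarning shown by default, and the exact wording varies by version):

$ python3 -W default -c 'x = "\m"'
<string>:1: DeprecationWarning: invalid escape sequence \m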

Ah, true, sorry. I’ve also used \x80 but less commonly, so it’s less of a strong parallel. My bad.

Still, the backward compatibility issue demands a strong justification. Possible justifications include:

  • Fixing known, actual issues, such as with Windows path names: "C:\Path\Filename" works while "c:\path\filename" doesn’t (the \f silently becomes a form feed), even though, to most people’s minds, those paths should be equivalent (see the snippet after this list)
  • Matching other languages’ behaviour (weak justification but can help to tip the balance one way or the other)
  • Fixing something that is fundamentally broken. In Python 2, "\377" == "\777" which is highly surprising. (The spare bit is simply dropped.)
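
The Windows-path wart from the first bullet, concretely (recent Pythons additionally emit SyntaxWarnings for the unrecognized \P, \p and \F escapes):

>>> "\f" in "c:\path\filename"    # \f silently becomes a form feed
True
>>> "\f" in "C:\Path\Filename"    # \P and \F are merely kept literally
False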

So what’s the advantage here? What major problem is being solved by this breakage?

Note that “nobody uses this feature” isn’t a justification. “People use this feature” is an objection, but absence of objection is not a justification.

I take it you are only referring to \xXX escapes with 0x80 <= XX <= 0xFF. That is, I guess, a little more justifiable than trying to deprecate and remove very common escapes like \x01 and other control characters.

\xXX escapes are especially useful for the C1 control characters 0x80 through 0x9F.
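
For example, str.splitlines() honours the C1 control NEL (0x85), a case where the short escape is genuinely handy:

>>> "one\x85two".splitlines()
['one', 'two']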

The question is not “why not” force them to write something longer, but what positive effect that would have, and whether it is worth:

  • the inconvenience
  • the breakage of existing code
  • the code churn
  • the change in documentation
  • which instantly makes many books, blog posts, examples, etc. obsolete

etc. In Python 3, the strings '\xF0' and '\u00F0' (and for that matter '\U000000F0') are identical. You have not convinced me that allowing the first is harmful. The evidence you give at this StackOverflow post is not very convincing: the fault there is not the coder mistyping '\xF0' when they want the bytes b'\xF0', but the coder receiving bytes and blindly, wrongly decoding them as UTF-8.
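
The failure mode in that question lives entirely on the bytes side:

>>> b"\x90".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte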

Prohibiting \x escapes in strings would not help the user in this case. Even changing the repr of strings will not help, because the user doesn’t have a string, they have bytes.

In other words, I think your example here is irrelevant to the issue you are raising. As for the other examples from the pwntools project, I have briefly looked at one or two and I don’t think they are relevant either, but I have not done an exhaustive deep dive into them all. (You shouldn’t expect us to do your homework for you and find good examples that justify this proposal. That’s your job!)

To justify a breaking change, you need to establish that the benefits of the change are significantly greater than the costs of keeping the status quo. I don’t think you have come even close to doing that.

Three things:

  • As much as we wish for “UTF-8 everywhere”, it is not true that UTF-8 is the only, or even almost the only, encoding in use. (By the way, UTF-8 is not a charset. Unicode is the charset, UTF-8 is one of many codecs used for encoding/decoding Unicode chars to/from bytes.)
  • Nor is UTF-8 mandatory for Python source code. It’s just the default.
  • And most importantly, I don’t think that anyone says that the Unicode character þ “is in any way similar to” the single byte b"\xFE".

If anyone is saying that, they seriously need to get out of the 1990s and stop assuming Latin-1 everywhere. What we do say is that the character þ can be written as the literal '\xFE' in addition to the other escapes. The character '\xFE' is no more similar to the byte b'\xFE' (despite the visual similarity) than the character '2' is to the int 2. Python coders are expected to learn the differences between strings and ints, and between strings and bytes.
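
The language already keeps them apart at runtime (and the -b flag even turns such comparisons into a BytesWarning):

>>> "\xfe" == b"\xfe"
False
>>> "2" == 2
False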

I do think that you have made a reasonable case for having linters warn on hex and octal escapes for code points 0xA0 and beyond. We often recommend linters enforce stricter rules than the language itself. But beyond that, I don’t think you have made your case for this breaking proposal.


The drawbacks of this change are obvious: breaking existing Python code and tests, and adding an arbitrary difference from other programming languages and formats. It would still be a viable proposition if it solved a common problem, but I have never encountered a case of someone mistaking "\x90" for b"\x90".


-1. I think you will break too much valid code.

This is valid and does not match your rules: "\x9b" is the ANSI CSI in its 8-bit form.
In 7-bit form it is "\x1b[".

The first DEC VT100 implemented 8-bit ANSI control codes, and they were often used in apps to speed up screen updates.

I must admit I have not tested this on xterm etc.; I must do that and update here.
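
Something like this should do for a quick test (a sketch; it writes to the raw byte stream because the 8-bit CSI is a single byte, not valid UTF-8 on its own):

import sys

# 7-bit CSI: ESC followed by '[' -- universally supported
sys.stdout.buffer.write(b"\x1b[44mBlue\x1b[0m\n")
# 8-bit CSI: a single 0x9b byte -- support varies by terminal emulator
sys.stdout.buffer.write(b"\x9b44mBlue\x9b0m\n")
sys.stdout.buffer.flush()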