Please don't break invalid escape sequences

MegaIng · December 16, 2024, 1:28am

Maybe a DeprecatedSyntaxWarning subclass of SyntaxWarning should be added to clearly set the expectation of users.

Then the progression would be (IIRC) DeprecationWarning → DeprecatedSyntaxWarning → SyntaxError.

As for timeline, I don’t believe there is a rush. This was a DeprecationWarning from Python 3.6 to Python 3.11, and IMO it can be a SyntaxWarning from Python 3.12 to Python 3.17 and turn into a SyntaxError in Python 3.18. At that point no version where it wasn’t a SyntaxWarning (or DeprecatedSyntaxWarning) is still actively supported.
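To make the subclass idea concrete, here is a minimal sketch. DeprecatedSyntaxWarning does not exist in Python; the class below is purely hypothetical, as is the helper name. It only illustrates that a filter written for SyntaxWarning would keep matching such a subclass, so existing filter configurations would presumably not need to change.

```python
import warnings


class DeprecatedSyntaxWarning(SyntaxWarning):
    """Hypothetical category proposed in this thread; not part of Python."""


def caught_by_syntaxwarning_filter(category):
    """Return True if a filter targeting SyntaxWarning also matches `category`."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")                # baseline: drop everything
        warnings.simplefilter("error", SyntaxWarning)  # checked first: promote syntax warnings
        try:
            warnings.warn("illustrative message", category)
        except SyntaxWarning:
            return True
    return False
```

With this sketch, both SyntaxWarning and the hypothetical subclass are caught by the same filter, while DeprecationWarning is not.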

Wombat · December 16, 2024, 1:30am

Here are 16,000 examples: Code search results · GitHub

sirosen · December 16, 2024, 2:13am

I might raise this on Jupyter’s issue tracker if I were you. I don’t know exactly what you saw, but if it’s not obvious to you as a user that you’re getting warnings from code that you have written, that seems to me like a very significant UX issue.

Maybe? I don’t want to dismiss the idea out of hand, but I also don’t know that this thread provides good evidence that it’s necessary.

Such a change would probably be best argued for in a fresh thread which focuses on how “sometimes SyntaxWarnings indicate future errors and sometimes they indicate other things”. So far, that topic hasn’t been touched on very much.
Notably: I don’t think that the OP’s concern about not having been clear on the meaning of the warning argues for a Python change. It might, per above, argue for a Jupyter change.

I feel like a -0 on the change, personally. I wouldn’t mind, it’s fine, but isn’t it adding yet-another-thing to be documented and learned?
Perhaps the text of the warning should be changed, if clarification is needed.

umarbutler · December 16, 2024, 2:30am

I think I sometimes get them using magic commands; it’s just that the errors look the same as regular Python errors, probably because the commands are dispatched to Python?

But arguably, right now, there is a lack of documentation, i.e., SyntaxWarning does not immediately tell you this is going to be deprecated. DeprecatedSyntaxWarning would definitely do that though!

A subclass is cleaner, the name tells you immediately what’s going on, but at the very least couldn’t a message also be added, like SyntaxWarning: \e is an invalid escape sequence and will not work in future Python versions.?

jamestwebber · December 16, 2024, 2:41am

If it printed that message for every escape sequence, it would mean that none could be added in the future, which was one of the reasons that it’s useful to deprecate the invalid ones.

(well, it wouldn’t be a binding promise, but it would be a confusing message if it didn’t remain true)

sirosen · December 16, 2024, 2:43am

TBH, both of those error presentations from Jupyter look fine to my eye, except maybe for the fact that the %time usage shows you that little check which makes it look like a success.
In the case of their %time command, the trace is modified, but that’s not surprising.

In both cases the error comes from code which you, as a Jupyter user, wrote, and the NameError is clear enough. I’m not clear that there’s an improvement opportunity here for Jupyter, but others might feel differently.

I think this is the part that I subtly disagree with. I’m not sure that a subclass is clearer. Isn’t it better to modify the message, if we want to clarify? Deducing things from what warning class is used seems harder for users – especially novice users – than a longer and more explicit message.

My inclination, if I were a maintainer, would be to leave the warning as-is for now, and update it when a version is selected to change this into an error.
That is, if 3.18 is going to make this an error, 3.17 and 3.16 could emit:

SyntaxWarning: invalid escape sequence '\c', this will be an error in Python 3.18
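For anyone who wants to preview the stricter behaviour today, a rough sketch follows. It assumes that promoting warnings to errors while compiling is an acceptable stand-in for the future SyntaxError; the function name is made up for illustration.

```python
import warnings


def compiles_in_strict_mode(source):
    """Compile `source` with every warning promoted to an error.

    Returns False if compilation fails, including the case where the
    invalid-escape warning is raised as an error.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<example>", "exec")
        except (SyntaxError, Warning):
            return False
    return True
```

A source line such as pattern = "\c" is rejected under this check, while the raw-string spelling pattern = r"\c" passes.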

umarbutler · December 16, 2024, 2:47am

Well it would only occur until SyntaxError is introduced. And the wording was not exact. Just something along those lines. Perhaps \e is not currently a valid escape sequence. In [Python 3.XX/a future version of Python], invalid escape sequences will raise a SyntaxError.

jamestwebber · December 16, 2024, 2:50am

But that would require a decision to actually make that change, which probably requires PEP, and none of that has happened yet. It might never become an error. The current warning seems totally fine.

bavalpey · December 16, 2024, 2:54am

I myself am very against this because I don’t see an issue with this as an error everywhere. Backslashes in strings can reduce verbosity, in regex. And there are cases where raw strings make it more verbose.

This is a scripting language. And it has allowed this way of writing strings for a long time. It should not break something so simple as long as it retains version 3. It should not deviate from this expectation.

To mirror Chris’s attitude, if you want this change so desperately, you can always write your own version of Python.

Add a new escape sequence, fine. But labeling code that has a backslash before a character that isn’t an escape as “broken” is wrong. And I want to stress this. It’s wrong. Code that has this works as intended. It functions. The code will operate. What will make it broken is causing code with absolutely no issues to error.

Does it indicate a potential issue? Yes. This is what linters are for. It is not the job of the language to hold the hands of its developers, preventing them from writing code how they see fit.

I would have been against the initial deprecation way back in 3.6, had I been privy to its discussion. And I’m against it now.

This is going to get repetitive, but I very much dislike the characterization of backslashes in non-raw strings as broken. Does it work as a literal backslash except when in front of the 8 (9 if you count escaped quotes) characters? Yes? Okay, so if you don’t need a backslash in front of one of those and so you use it verbatim, then you wrote broken code?

What exactly is the problem? It’s adjacent to error inducing behavior? At what point did Python become a language that took away agency from its users under the guise of protecting them.

I had wanted my initial post to be one and done, but admittedly I engaged in the back and forth myself.

Let’s all go back to it so we can remind ourselves of the only thing that matters:

Making invalid escapes into a syntax error will break existing code

Any time there is code breakage, there needs to be a significant demonstration of its advantage.
Unless there are serious security ramifications, it is not on the defenders of the status quo to show that the breakage is significant.
It is on those advocating for the breaking change to prove, with a compelling case, that the change provides significant advantages.

And the advantages posed, at least in this thread, are not significant enough for the large impact this will have. This already high bar is rightly set even higher due to the fact that this breakage will occur in code that people treat as comments. Let that sink in. Your scientific codebase perhaps no longer works because three dependencies deep there’s a comment that is now considered a syntax error.

So let’s visit the advantages, and then we as a community can determine whether they are significant enough to warrant making completely valid code break

Allows python to add new escape sequences

This is an outcome, not an advantage. Can the same exact sequences not already be spelled out? If so, then what’s the advantage? That it’s easier to write? I personally don’t think convenience warrants breakage
If you really need the new sequence, then just introduce it. This is fine, you have already done your due diligence by having invalid escapes marked as deprecated for 6 minor versions, and marked as a warning for 2 (or more depending on when the new escape is added).

No one here is arguing that new escape sequences should be forbidden.

The existing status quo is confusing for new users.

Quite frankly, no one here is qualified to speak definitively on the point about this being easier for novices one way or the other. We are not novices. Actual studies need to be conducted and examined for this line of argument to carry any water.

Anecdotally, I have firsthand experience teaching intro computer science students. This is not something they struggle with in my experience. Having to learn to recognize 8 escape sequences is a really low bar. Even if it were 12, or 15, it’s nothing. They already know about not being able to use certain variable names because they are keywords.
I’ve never had anyone express confusion over \n being special while \d, for example, is fine.

Code with invalid escapes is more likely to inadvertently contain real escape characters

This is true, but the key word here is likely. It is not guaranteed. And that is why it is very appropriate for this to remain a warning. It should be a warning, and it should be visible. But this is only “more likely”. The likelihood of invalid escapes left in code is already small.

jagerber · December 16, 2024, 4:40am

Anecdotally, I personally struggled for years to understand the story here. I was confused by exactly what you’re pointing out. From however I was taught I basically took away that I should be really scared about slashes in strings, but I didn’t know why or how to get around it. I knew that prepending strings with r (thus making them raw strings) should somehow make me feel safer but again I didn’t understand why. Perhaps this comes from the fact that python is the first language I very seriously learned. Maybe for users coming from C or other languages the idea of escape characters is much more natural.

To address many of your repeated points: I emphatically agree with you that having an “unrecognized” or “invalid” escape sequence in a non-raw is NOT an error as has maybe been said in this thread. It is code that works now.

While I agree "\Hello \World" is not an error or broken right now, I also emphatically agree with a comment above that "\Hello \World" is a “misfeature”. Yes, it is a feature of the language, BUT, the language community and stewards of the language have concluded that it is a BAD feature that makes the language worse to use for MOST users. So they are planning to get rid of it. This is a reasonable thing to do in my opinion, especially if care is taken, which seems to be the case. Yes, making this change will negatively impact some fraction of Python users, but it has been deemed that the relatively small cost to this fraction of users is worth the better understanding that will be realized for possibly many more users, especially newcomers.

It is definitely a cost-benefit analysis. It’s not “obviously” ok to make this change. But, for me personally, looking at the costs and benefits, it looks like a good thing to do.

umarbutler · December 16, 2024, 6:37am

I’m curious what the main issue is with just having a permanent SyntaxWarning with a semantic message or DeprecatedSyntaxWarning?

It seems like a win-win to me at least:

We can break the least amount of code possible. In fact, until \e is added, no code will break. And after then, only \e will break, but everyone would have gotten a fair warning about that.
We can still guide user behaviour against invalid escapes. You could even mention raw strings in the warning message, ie, maybe you want to use a raw string or maybe you want to escape your backslash.

The only downside is that, the user is not punished into never using invalid escapes and so later on their code might break because someone decided to add say \W into the mix (breaking C:\Windows). But is that one downside worth all the breaking and depriving users of the power to create a string like "\😊", knowing full well that is unlikely to ever become a real escape so there’s no need to do anything about it?

Nineteendo · December 16, 2024, 8:15am

Doubling the backslash doesn’t make a difference, except for '\b', '\\' (and some octal escapes):

Most of the escape sequences supported by Python string literals are also accepted by the regular expression parser
   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\
(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)

'\u', '\U', and '\N' escape sequences are only recognized in Unicode (str) patterns. In bytes patterns they are errors. Unknown escapes of ASCII letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

If you assume unrecognized escapes are correct (or if you’re not using them), the main advantage of raw strings is being able to use r'\\' instead of '\\\\' for matching a literal backslash.
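A small sketch of the points above, using only the standard re module. The variable names are illustrative.

```python
import re

# The raw and doubled spellings are the same two-character string,
# so the regex engine sees an identical pattern either way.
assert r"\d" == "\\d"

# "\b" is different: in a normal literal Python turns it into a
# backspace character (0x08) before the regex engine ever sees it.
word_boundary = re.compile(r"\bcat\b")
backspace = re.compile("\bcat\b")

# Matching one literal backslash: r"\\" is the same pattern as "\\\\".
one_backslash = re.compile(r"\\")
```

Here word_boundary finds "cat" in ordinary text, whereas backspace only matches when "cat" is surrounded by literal backspace characters.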

Nineteendo · December 16, 2024, 8:44am

There are 23 character escapes, 15 if you exclude the octal escapes (which is a lot to keep track of):

\<newline>      \\      \'      \"      \a      \b      \f
\n      \r      \t      \v      \0      \1      \2      \3
\4      \5      \6      \7      \x00    \N{NUL} \u0000  \U00000000
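A few of these can be checked directly. This is only a sketch covering a handful of the escapes listed, not all of them.

```python
# Several spellings of the same single character, U+0000.
nul_spellings = ["\0", "\000", "\x00", "\u0000", "\U00000000"]
assert all(s == "\x00" and len(s) == 1 for s in nul_spellings)

# Octal escapes stop after three digits: "\123" is "S", then a literal "4".
octal_then_digit = "\1234"
assert octal_then_digit == "S4"

# An unrecognized escape keeps its backslash, giving two characters.
# (Written as a raw string here so this block emits no warning.)
unrecognized = r"\d"
assert len(unrecognized) == 2
```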

storchaka · December 16, 2024, 9:37am

There is no plan to make this warning an error in the near future. We were aware when this warning was introduced that it would take a long time. We will make it an error only when we are confident that almost all code has been converted or abandoned.

We usually only add syntax warnings to the compiler if we are absolutely certain that the corresponding code will never work as intended. This case is an exception. A syntax warning is not emitted and cannot be emitted for code that does not work (you get either a syntax error for "C:\Users" or a wrong result for "C:\new"), and it is emitted for code that works ("C:\Windows"). The point is that a string containing an escape sequence that does not work as intended very likely also contains an escape sequence that triggers a warning, so when fixing a warning you will also fix a programming error. Or at least the code that contains such a programming error also has a string that emits a warning. Or at least the user will learn to be more careful and use raw strings or double backslashes.

In the long term, treating unrecognized escape sequences as errors is beneficial. It makes the semantics simpler (two cases instead of three) and the code more error-proof. So that is inevitable. But when that will happen has not yet been decided.
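The three cases mentioned above ("C:\Users", "C:\new", "C:\Windows") can be reproduced with a short sketch. The classify helper is hypothetical and exists only for this illustration; it compiles a string literal given as source text and reports what happened.

```python
import warnings


def classify(literal_source):
    """Evaluate source text of a string literal; return (outcome, value)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            value = eval(compile(literal_source, "<example>", "eval"))
        except SyntaxError:
            return ("syntax error", None)
    return ("warning" if caught else "silent", value)


users = classify(r'"C:\Users"')      # \U needs 8 hex digits: does not compile
new = classify(r'"C:\new"')          # \n is a newline: compiles, wrong path
windows = classify(r'"C:\Windows"')  # \W is unrecognized: works, but warns
```

Only the last case, the one that actually produces the intended path, triggers the warning.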

alphaparrot · December 16, 2024, 2:52pm

Here’s a really big example of LaTeX being used in documentation without being escaped (nor any mention of raw strings). The numpy docstring How-To: Style guide — numpydoc v1.9.0rc0.dev0 Manual.

Now, the source code behind that page does use an r-string, and users who publish to readthedocs likely are too, because Sphinx in particular does stumble on non-raw docstrings with LaTeX. But lots of code out there is frequently used, infrequently maintained, documented thoroughly, and not published to readthedocs—and likely contains LaTeX equations or algorithmic flow diagrams (I’d hate to see an FFT butterfly diagram in a docstring that has to have backslashes escaped). It’s a relatively simple matter to search GitHub for them, though that misses code passed around via email or privately-hosted server (more of it exists than you’d think), code on gitlab, and the enormous volume of code that exists in private repositories (often in the corporate sphere).
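For code that is not on GitHub, a local check is possible. The following is a rough sketch, assuming it is enough to compile each file with warnings recorded; the function names are made up, and any other syntax or deprecation warning raised during compilation would be reported too.

```python
import pathlib
import tempfile
import warnings


def files_with_invalid_escapes(root):
    """Return names of .py files under `root` that warn when compiled."""
    flagged = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            try:
                compile(source, str(path), "exec")
            except SyntaxError:
                continue  # file does not parse at all; out of scope here
        # SyntaxWarning on 3.12+, DeprecationWarning on 3.6 to 3.11.
        if any(issubclass(w.category, (SyntaxWarning, DeprecationWarning))
               for w in caught):
            flagged.append(path.name)
    return flagged


def demo():
    """Build a throwaway directory with one offending file and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "bad.py").write_text('x = "\\d"\n', encoding="utf-8")
        (root / "good.py").write_text('x = r"\\d"\n', encoding="utf-8")
        return files_with_invalid_escapes(root)
```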

A lot of great points have been made over the weekend. Here are some of my big ones that I keep coming back to as I think about this:

Code that is infrequently maintained is not necessarily infrequently used. A lot of good scientific code (good meaning it does what’s intended, and does so quickly and efficiently, not necessarily that it’s pretty or conforms to PEP) does not rely much on external libraries, and therefore doesn’t need to change much between say Python 2 and Python 3.18. If it ain’t broke, why fix it (and more to the point, why break it unnecessarily).
Code which does what it’s intended to do is not broken. Full stop. PEP-incompliant and broken are two very different things. If the docstring exists to function as documentation (nominally its only purpose), and isn’t being passed to a parser that complains about escape sequences, then it’s not broken no matter what’s in it.
Many users of such codes are not in a position to maintain them themselves, either because they lack the time or the know-how. Sure, they could learn, but expecting that is unrealistic and overly idealistic. The fact of the matter is that placing an unreasonable burden on the community will undoubtedly lead to fragmentation.
Speaking of fragmentation: yes, you could just run old code in old python. This is a bad thing to encourage, both for safety reasons and the continued usability of python as a jack of all trades. And old python will not always be an option. Can you even install Python 2.x on modern Unix systems anymore? If you have an environment where you need to use something like numpy or scipy to analyze the output of an older code, odds are you actually can’t install those libraries if your python version is too old, at least without doing it manually. Older python versions cease being supported by many packages over time. I myself had to drop support for python 2, python 3.6, etc in a package I maintain because some of the libraries I depend on no longer support older python versions. The same arguments apply to forking python. That’s a bad idea; let’s not go there.
Yes, a lot of these discussions already happened. The python core dev community leads the user community by literal years when it comes to discussing the future of the language. The vast majority of python users don’t even know this forum exists. It is a very good thing (arguably an intended consequence!) that triggering a SyntaxWarning is now bringing people like Umar and me into the discussion, especially given that raising an Error is still far off. That means there’s time to change course. Arguably this should happen more; perhaps that way we would never have had the disastrous decision to abandon default integer floor division (something that all intro programming students learn in Lesson One in virtually every programming language except now Python) in the transition from 2 to 3.

People from the python user community who are not part of the python dev community (but may maintain packages smaller and less popular than the 100 most popular packages) are realizing this change is planned, and are coming here to say this would have a deleterious impact on a lot of existing code for numerous reasons, and only seemingly marginal benefit. I think that’s a good thing and indicates it’s worth revisiting the decision, and considering whether a revised, more cautious approach is merited (such as implementing a change to how docstrings, regex strings, etc are processed so that normal strings throw errors, but others don’t).

Yes, this means messy code will continue to be messy. But it works (and is sometimes necessarily messy in order to work), and code aesthetics are not a hill worth dying on imo when it comes to code other than your own or which you work on with others. Python should encourage good code aesthetics (such as via PEP guidelines), but should not break existing code which is insufficiently pretty.

jamestwebber · December 16, 2024, 3:03pm

Sure, this is an example of how to do it correctly–with raw strings. I’m not sure what the point of this example is, except as an example of a well-maintained project that follows best practices.

It’s not a coincidence that actively maintained projects like numpy and SymPy have already made the very simple correction whenever it was an issue. The SyntaxWarning is working exactly as desired.

pf_moore · December 16, 2024, 3:06pm

This is the point, though. To fix this issue, all you need to do is change the docstring to an r-string. Yes, it’s a change, but it’s not hard. And I understand that it’s to fix a problem that doesn’t affect you - but the CPython developers have to consider the far wider impact on all Python users.

The point has been made that this enables future changes like making \e an escape character. That would be a small but positive benefit for many Python users. We could just make that change without this deprecation, but if we did so, that would also break your code, and you’d still have to make a change. And another change every time we added a further escape sequence. The decision was made that it’s better to warn people (and require a change from them) once, rather than potentially many times. Maybe that’s worse for you personally, but overall the decision was made that it would be a net benefit for the millions of Python users out there.

Sure. Would you fix it if there was a typo in one of the equations? Or a bug in the code itself? If so, why is it such a problem to fix the docstring by adding an “r” character to the start of the string? If you’re saying you’d leave an incorrect equation in the docstring or not fix a code bug, then I doubt there’s much anyone could do to make this less of a problem for you. But even then, you still have the option to just continue running the code with Python 3.11.
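For reference, the fix being described looks roughly like this. The function and its docstring are an invented example, not taken from any project mentioned in the thread.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    r"""Density of the normal distribution.

    .. math::

        f(x) = \frac{1}{\sigma\sqrt{2\pi}}
               \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
    """
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
```

The only change from a non-raw docstring is the r prefix before the opening quotes; the LaTeX backslashes then reach the documentation tooling unchanged.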

fungi · December 16, 2024, 3:51pm

To mirror Chris’s attitude, if you want this change so desperately, you can always write your own version of Python.

Those who announced this planned behavior change are in fact writing their own version of Python already, right? I, for one, welcome the eventual transition to a syntax error, and have ample trust in the SC members and core devs involved in that decision that it will be undertaken as slowly as it needs to be and with due care for overall impact to the ecosystem.

I’ve personally switched so many strings to raw in my projects since the deprecation in 3.6 that I lost count long ago. Updating code syntax to meet expectations of new and future interpreter and language versions is all part of software maintenance. Unmaintained software is going to stop working with newer platforms for a variety of reasons, and this one is pretty minor compared to other ways Python, CPython and the standard library have evolved over the years.

mikeshardmind · December 16, 2024, 4:05pm

I’m often in the camp of teaching how something works is better than breaking existing code, but not in cases like this.

Invalid escape sequences were always broken, they just so happened to in some cases not cause immediately visible problems. In cases like this, fixing the inconsistency and teaching users goes hand in hand, and it allows new escape sequences to be added without each potential new addition having to be considered as a breaking change.

I actually dislike how frequently working code stops working in python, whether due to removing working modules with no known deficiencies [1], or changing behavior people should reasonably be able to rely upon.

Still, I don’t see a world where it’s reasonable to rely both on invalid escape sequences remaining invalid (your code suddenly changes behavior if one ever becomes valid, which prevents additions to the language) and on the interpreter continuing to allow them.

[1] I don’t want to re-litigate the removal of “dead batteries”, but at least one of the removed modules was pure math, implemented in C, and without any known deficiency. It wasn’t dead because it wasn’t receiving maintenance, it was complete and did not need it. While this did mean it was easy to extract and have nearly no burden to maintain, it’s caused things like users thinking there was something wrong with the original and that it shouldn’t be used because it was removed alongside several modules that did have genuine problems.
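To illustrate the point about behavior changing if an invalid escape ever becomes valid, here is a small sketch using \e, the candidate mentioned earlier in the thread. That \e would map to ESC (0x1b) is an assumption made for illustration only, and the bold helper is hypothetical.

```python
# Written as a raw string so this block itself triggers no warning.
today = r"\e"          # what "\e" evaluates to today: a backslash plus "e"
hypothetical = "\x1b"  # what it would be IF \e were ever defined as ESC


def bold(escape_prefix, text):
    """Wrap `text` in a terminal bold sequence built from `escape_prefix`."""
    return escape_prefix + "[1m" + text + escape_prefix + "[0m"
```

The same source text would yield a two-character prefix under today's rules and a one-character prefix under the hypothetical future rules, so bold() would return different strings without any edit to the code.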

patrick-kidger · December 16, 2024, 7:41pm

IMO this is a big enough breaking change that it’d be reasonable in a ‘Python 4’, if-and-when that ever comes about.

Until then it’s just going to break half of all the grad student code ever written.