Please don't break invalid escape sequences

umarbutler · December 13, 2024, 5:48am

Looking at the ‘What’s New in Python 3.12’ article on the Python website, it seems that ‘[i]n a future Python version, SyntaxError will eventually be raised, instead of SyntaxWarning’.

If that is the case, then I would strongly advise raising a warning that is more descriptive than just SyntaxWarning. I had just been ignoring the warnings as a result, but had I seen a warning that used the word deprecation, I would’ve immediately known that I need to update all of my code now.

Additionally, is it really worth breaking so much existing code, and making it more difficult to copy and paste Windows paths, ~~or even write a simple string like “You need to sleep and/or rest!”,~~ to achieve whatever objective this change is meant to achieve?

I’d like to suggest that we don’t break all that code — it would be extremely painful to fix.

This is a cross-post of a comment I made on an adjacent GitHub issue where I was referred here. I’d also like to share the comment of alphaparrot on that thread:

I’m with @umarbutler and @fenchu on this one; this seems like an unnecessary change that creates a large amount of work across a very wide range of projects. Windows paths have already been mentioned, but a very big one that will come up very quickly and annoyingly is that much of the scientific (and mathematical) community has adopted Python for analysis tasks, and there are many libraries and packages written in Python for performing scientific tasks. These projects often include LaTeX-formatted sequences in their docstrings, as some documentation engines can automatically parse LaTeX math sequences into mathematical symbols. LaTeX macros for greek letters and other symbols typically start with a backslash; the inclusion of thoroughly-documented equations in docstrings in scientific packages will therefore break those packages once this becomes an Error rather than a Warning (and the move to SyntaxWarning means those packages will now start spitting loud warnings).

The nature of scientific code development is that a great deal of it is built by academics in their spare time, or by graduate students and postdocs who may not stay in the field. As a consequence, a lot of code that works fine is nonetheless not actively maintained, or only infrequently maintained. So it’s likely that an awful lot of scientific developers will not be sufficiently involved in Python development to see this coming, or may not have the time to update their code. Similarly, while using raw-strings for docstrings containing backslashes has been in PEP-257 since 2001, most programmers in scientific disciplines have no idea that’s the case, because docstrings have always just worked as plaintext. Why unnecessarily add an extra character that seemingly does nothing?

PEP-257 also states that the PEP contains conventions, not laws or syntax. This seems to be a move towards encoding PEP conventions as syntax rules. I would suggest that a better long-run approach is to treat docstrings as separate from all other strings (they already are) and always process them as raw string literals. They behave like plaintext documentation, are used as plaintext documentation, and should therefore be processed that way. The only thing I could see that breaking is if someone put a byte literal in their docstring for some ungodly reason, as opposed to the current plan of action, which will break a great many packages and demand valuable labor from even more package maintainers.

NB Unfortunately I had to strip all the convenient links I had added to this post barring one because new users are limited to using just two links.

Rosuav · December 13, 2024, 6:03am

Usually I’m one of the strongest advocates of “don’t break things, don’t break things”, but this is a case where any code that’s broken by this change was already broken. The “and/or” example shouldn’t ever be an issue anyway; the correct usage there is a forward slash, as you’ve used here, and writing “and\or” is simply incorrect grammar.

Windows paths are the most common one though, and the issue here is that you get data-dependent bugs. Consider how current versions of Python behave:

>>> print("C:\Documents\Text File.txt")
<unknown>:1: SyntaxWarning: invalid escape sequence '\D'
C:\Documents\Text File.txt
>>> print("c:\documents\text file.txt")
<unknown>:1: SyntaxWarning: invalid escape sequence '\d'
c:\documents	ext file.txt
>>> print("C:\Users\DefaultUserName")
  File "<python-input-5>", line 1
    print("C:\Users\DefaultUserName")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> print("C:\Documents\" + dirname + "\" + basename + ".txt")
<unknown>:1: SyntaxWarning: invalid escape sequence '\D'
  File "<python-input-7>", line 1
    print("C:\Documents\" + dirname + "\" + basename + ".txt")
                                        ^
SyntaxError: unexpected character after line continuation character

All of these different results are from the same bug: Unescaped backslashes in a string literal.
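To make the data dependence concrete, here is a minimal sketch that parses two literals from their source text. The helper name parse_literal is made up for illustration, and the sketch assumes a Python version where unknown escapes still only warn; once they become a SyntaxError, the first call would raise instead.

```python
import ast
import warnings


def parse_literal(source_text):
    """Parse a string literal from its source text, silencing escape warnings.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source_text)


# The raw strings below hold the *source text* of two ordinary string literals.
upper = parse_literal(r'"C:\Documents\Text File.txt"')
lower = parse_literal(r'"c:\documents\text file.txt"')

print(repr(upper))  # 'C:\\Documents\\Text File.txt'  (backslashes kept by accident)
print(repr(lower))  # 'c:\\documents\text file.txt'   (the \t became a tab)
```

Same mistake, two different outcomes, and which one you get depends only on the letters that happen to follow each backslash.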

Further: If Python doesn’t make this an error, code may break inexplicably in a future update. Imagine if a future Python version were to add "\e" as an escape sequence equivalent to "\x1b" (very useful when writing ANSI colour codes, and found in many other languages). Now any program that uses "c:\everything" will be broken.

Academics have to learn the bare minimum of Python syntax. You already need to understand that "c:\text file" won’t work. Why is it a problem that "C:\Text File" also won’t work?

The solutions are extremely easy, too. My preferred recommendation is simply to use forward slashes! Everyone understands those, and they have no problems on different OSes. Raw string literals work too, but they introduce the edge case that r"C:\Path\to\directory\" + filename won’t work, and IMO that’s annoying enough to not want to have to explain it to people. Or just double all the backslashes, since string literals are not raw text in a program.
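A small sketch of the three spellings side by side, using a made-up path. PureWindowsPath from the standard library is used here only so that the comparison behaves the same on any OS; it is one possible way to sidestep the trailing-backslash edge case, not the only one.

```python
from pathlib import PureWindowsPath

doubled = "C:\\Documents\\text file.txt"   # every backslash doubled
raw = r"C:\Documents\text file.txt"        # raw string literal
forward = "C:/Documents/text file.txt"     # forward slashes

# Doubled and raw spellings produce the identical string.
assert doubled == raw
# Windows path handling treats both separators the same way.
assert PureWindowsPath(forward) == PureWindowsPath(raw)

# A raw string cannot end in a backslash, so r"C:\Path\to\directory\" + name
# is a syntax error. Joining path components avoids the problem entirely.
directory = PureWindowsPath(r"C:\Path\to\directory")
full_path = directory / "notes.txt"
print(full_path)  # C:\Path\to\directory\notes.txt
```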

I’d like to suggest that we stop trying to support broken code that works only by accident as it is even more painful to fix when it suddenly stops working.

Stefan2 · December 13, 2024, 6:05am

Does it? You’re using raw strings anyway for that, right? Wouldn’t want Python to misinterpret the “\n” in “C:\foo\now” as a newline character.

Hmm? That’s not affected at all.

rrolls · December 13, 2024, 6:11am

On the contrary - we absolutely should break “all that code”, if it’s including arbitrary backslashes in regular strings that could be followed by any character.

People should not be allowed to form the habit of putting hello\world into a regular string in Python and seeing it work fine, only to find that hello\nothing suddenly does something completely different from what they’ve come to expect.

If you want to include a backslash without having to escape it, use r"hello\world" - that’s what I do when I’m on Windows and writing some hacky ad-hoc script that I end up just pasting an entire absolute path into.

It’s also really not that hard to run a Python script, see a SyntaxError, go to the source location, add one backslash and re-run it, and repeat until your script works, so I don’t really buy the argument of “this would break existing code” here.

Also, in general, don’t ignore warnings - they’re there for a reason. You should treat them like errors. There is a reason why various modern programming languages have adopted a policy of not having any compiler warnings at all and saying something is either correct or an error: experience has shown that people tend to just ignore warnings and then suffer more later, so this policy fixes the problem by forcing people not to ignore warnings.
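For anyone who wants to treat these warnings as errors right now, a rough sketch follows. The function name compiles_cleanly is invented for this example; it relies on the standard warnings filters and on the compiler reporting an escalated escape warning as a SyntaxError.

```python
import warnings


def compiles_cleanly(source):
    """Return True if the source compiles with every warning escalated to an error.

    Hypothetical helper: an invalid escape sequence then surfaces as SyntaxError.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<example>", "exec")
        except SyntaxError:
            return False
    return True


# Source text of two one-line scripts, written as raw strings to keep them readable.
bad_script = r'path = "C:\Documents\Text File.txt"'
good_script = r'path = r"C:\Documents\Text File.txt"'

print(compiles_cleanly(bad_script))
print(compiles_cleanly(good_script))
```

Running a check like this in a test suite would catch the problem long before a future Python release turns it into a hard error.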

Separately -

I would suggest that a better long-run approach is to treat docstrings as separate from all other strings (they already are) and always process them as raw string literals.

This isn’t really a good idea, since docstrings are just triple-quoted strings that happen to be in certain places, so it’d totally break the ability to use backslash for an escape sequence inside a triple-quoted string. I’ve written many a docstring where I have to either escape a backslash or avoid including it (for example, I’ve frequently written the phrase """`x` must either end with a newline or be empty""" where I might have written """`x` must either end with `\n` or be empty""" if only I didn’t have to escape the backslash - there are cases where avoiding it like this isn’t practical so I have to escape instead, though none come to mind off the top of my head) - but even with that I wouldn’t suggest we force triple-quoted strings to always be processed raw just to make my life writing docstrings a little easier.

Now if Python did what most languages do and used doc comments rather than docstrings (which personally I feel it should have done from the get-go), this’d be a different story and the backslash-in-docstring problem wouldn’t exist in the first place.

umarbutler · December 13, 2024, 6:20am

My preferred recommendation is simply to use forward slashes!

If only Windows agreed. The main problem is copying and pasting Windows paths.

Raw strings work yes. And many people in this thread have pointed out ‘\n’ won’t work unless you do ‘\\n’ or use a raw string.

Yet the fact remains that this is going to break a bunch of code AND make it more difficult to copy and paste Windows paths, particularly for beginners. I do use raw strings for paths but sometimes I don’t because there’s no \n in my path and it’s one less click to not add r to the start of the string.

umarbutler · December 13, 2024, 6:23am

If that were the case, I don’t think I’d get any work done. I’d be too busy trying to ‘debug’ my pip.

In all seriousness, the problem as I’ve stated already is that a SyntaxWarning is not particularly semantic. Yes, I did see the syntax warning and then headed to GitHub, but it would’ve been much easier to raise a more semantic DeprecationWarning or at least provide a warning message that flags, hey, this is going to break your code soon…

Stefan2 · December 13, 2024, 6:32am

Not just \n but also for example my \f. I didn’t even know that’s a valid sequence, never used it.

I’d say the one extra “click” of pressing the r key is not just better but also faster than thinking about whether you have a valid sequence, so your “one less click” reason seems just bogus.

Rosuav · December 13, 2024, 7:05am

Yes, I get that. But fundamentally, string literals are not raw text and never will be. So you have a few choices:

1. Stick the path inside r" and " and hope for the best. This will break if you have a double-quote character in the path, although I think that’s illegal in Windows anyway. It also breaks if you try to have a directory name with a trailing backslash.
2. Switch all your backslashes to forward slashes. Pretty straight-forward and works on all systems.
3. Double all your backslashes.

All of these work just fine and work by design, not by accident.

Beginners need to learn these things. Otherwise, all you’re doing is creating an environment in which the computer can’t be trusted, because seemingly-equivalent things behave very differently ("C:\Documents and settings\Text File.txt" works fine, but "C:\Users\Text file.txt" doesn’t, nor does "C:\Documents and settings\text file.txt"). I have seen the consequences of this sort of distrust, and it is far FAR worse than a simple bit of hassle around copying and pasting into string literals.

Warnings are a program’s version of pain. It’s there to keep you safe. We invent new types of warnings - linters, type checkers, and so on - to try to make our lives easier. I personally don’t follow the “all warnings are errors” mantra, but I do agree that warnings should not be ignored. Know what the warnings are telling you.

Yeah, it’s kinda odd that Python has \f for form-feed, which hardly anyone uses, but doesn’t have \e for escape, which people do frequently. But when unrecognized escapes aren’t errors, adding new escape sequences will change program behaviour, and so it never happened. I hope that a future version of Python will be able to take advantage of this and finally add in \e.
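For reference, this is roughly how the escape character has to be spelled today. The constant and function names are invented for the example; 31 is the standard ANSI code for a red foreground and 0 resets the attributes.

```python
# The escape character (ASCII 27) has no short form like "\e" in Python today,
# so it is written as a hex or octal escape.
ESC = "\x1b"
assert ESC == "\033" == chr(27)

RED = ESC + "[31m"    # ANSI sequence: switch foreground colour to red
RESET = ESC + "[0m"   # ANSI sequence: reset all attributes


def red(text):
    """Wrap text in ANSI codes so that a capable terminal shows it in red."""
    return f"{RED}{text}{RESET}"


print(red("warning: invalid escape sequence"))
```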

alphaparrot · December 13, 2024, 7:06am

Introducing “doc comments” as an alternative to raw-literal docstrings is a great idea, so long as the implementation is syntactically backwards-compatible. Just make any string (triple-quoted or otherwise) which appears as the first line of a function or object definition be treated as a doc comment, not a string.

Because that’s the thing, docstrings are not just a triple-quoted string. They get turned into the __doc__ attribute of the function or object, and they’re only a docstring if they are the first thing that appears in the function/object definition. If they already have a special role in the syntax, then it’s unlikely anyone will be surprised if they have different syntax rules from other strings (such as being processed as doc comments, or as raw literals). I’m not convinced by the newline example, btw; it seems like processing docstrings only as raw string literals would have helped you and not broken anything. Just because you suffered in the past doesn’t mean others should suffer in the future.

There are certainly folks out there using triple-quoted strings to write non-docstring treatises of comments who wouldn’t be helped by this, but that’s rare compared to including LaTeX macros in docstrings (and the proper way to do long-term comments anyway is to write out the paragraphs of comments and then use an IDE to comment out each line in one go).

What I think is perhaps the most important thing to note here is the very large body of code written in the last 10 years which was not broken before, in that it worked exactly as expected, and which will break if docstrings with invalid escape characters start throwing errors (since docstrings are primarily useful as plaintext or as inputs to markdown/ReST engines). Many of those packages are not actively maintained, but are straightforward enough that they’re unlikely to break in the foreseeable future—unless docstrings break. Many of the people using these packages are not savvy enough programmers to fork a GitHub repository, fix it themselves, and deploy a fixed version; they know how to do pip install and use jupyter notebooks and that’s it. This change, if docstrings start throwing errors, will result in important and useful scientific tools falling into disuse because they took a lot of work to build, the people who built them left the field, and nobody else has time to adopt them. Alternatively, a large body of users will simply not upgrade to newer versions of Python, because it breaks the tools they use for work. That already happened with Apple’s M1 chips; astronomers were the last to upgrade their hardware, because IRAF and PyRAF didn’t work on M1 chips for several years.

Rosuav · December 13, 2024, 7:07am

Also - a bit confused here. The problem is one of syntax. You get a SyntaxWarning that explicitly says that the escape sequence is invalid. What’s the problem here?

Rosuav · December 13, 2024, 7:13am

No, that would never be backward-compatible. They would still be strings.

A “doc comments” feature would be more like this:

def some_function(n):
    # Does stuff.
    # n - how much stuff to do
    # Returns the amount of stuff actually done

    # You can't do negative stuff, that doesn't make sense
    if n < 0: raise ValueError("I refuse")
    return 0 # I'm lazy today

Then you’d need tools to parse out the comments, stopping at a blank line or whatever rule you choose to use, and use that. It won’t be available as __doc__ but it can be used for other purposes as needed.

One feature that would make this easier would be to have a “preserve_comments” flag on ast.parse() to keep ALL comments. (There’s one that will preserve type comments, for compatibility with Py2 code, but other comments go bye-bye.) That’d simplify the development of analysis tools that use specially-formatted comments for various purposes. You could design whatever rules you like, code them up as a tree-walker, and export the information to wherever it’s needed; meanwhile, when you run the code, those comments do nothing.
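Since ast.parse() drops ordinary comments, one possible way to prototype such a tool today is the standard tokenize module, which does report them. The sketch below is only an illustration: the function name leading_doc_comments and the "stop at the first blank line" rule are assumptions for this example, not an existing convention.

```python
import io
import tokenize


def leading_doc_comments(source, func_name):
    """Collect the comment block directly under `def func_name(...):`.

    Hypothetical rule: the block ends at the first blank or non-comment line.
    """
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comments = []
    in_header = False    # between `def name` and the end of the header line
    collecting = False   # inside the comment block under the header
    last_line = None     # line number of the header end or the latest comment
    for prev, tok in zip(tokens, tokens[1:]):
        if prev.type == tokenize.NAME and prev.string == "def" and tok.string == func_name:
            in_header = True
        elif in_header and tok.type == tokenize.NEWLINE:
            in_header, collecting = False, True
            last_line = tok.start[0]
        elif collecting:
            if tok.type == tokenize.COMMENT and tok.start[0] == last_line + 1:
                comments.append(tok.string.lstrip("#").strip())
                last_line = tok.start[0]
            elif tok.type != tokenize.NL or tok.start[0] > last_line:
                break    # real code, or a blank line, ends the block
    return comments


example = '''
def some_function(n):
    # Does stuff.
    # n - how much stuff to do
    # Returns the amount of stuff actually done

    # You can't do negative stuff, that doesn't make sense
    if n < 0: raise ValueError("I refuse")
    return 0 # I'm lazy today
'''

for line in leading_doc_comments(example, "some_function"):
    print(line)
```

The comments never become part of the running program, so backslashes inside them need no escaping at all.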

alphaparrot · December 13, 2024, 7:22am

Alright, scratch the doc comments idea then. Instead the interpreter should treat docstrings differently than all other strings, silently in the background. Docstrings are documentation; they’re not meant to be executed, and a very large body of code exists which assumes that a docstring is just text and won’t throw errors (at least not Python errors).

The question is not what I could do in the future to avoid my docstrings throwing errors. I could just slap an ‘r’ on the front of my docstring to force it to be treated as a raw string literal, as I think it should be and as PEP-257 suggests. The question is what legions of other programmers did (perhaps inadvisedly) over the last 10 years. We are not talking about hypothetical future code; we are talking about existing code. We shouldn’t unnecessarily break that code when it currently works fine. Python already made this mistake once in the 2->3 transition when we changed how integer division works (actually an example of when Python did opt for beginner-friendly rather than sensible and proper); we shouldn’t make this mistake again.
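For concreteness, a minimal sketch of the "slap an r on the front" fix. The function cube_root is a made-up example, and the :math: role is just one common way that documentation tools mark up LaTeX.

```python
def cube_root(alpha):
    r"""Return the real cube root of a non-negative number.

    Computes :math:`\sqrt[3]{\alpha}`.
    """
    return alpha ** (1.0 / 3.0)


# With the r prefix, the LaTeX reaches the documentation tool untouched.
assert r"\sqrt[3]{\alpha}" in cube_root.__doc__

# Without the prefix, some macros are altered silently and with no warning at
# all, because \a (bell) and \t (tab) are valid escape sequences.
assert "\alpha" == chr(7) + "lpha"
assert "\textbf" == chr(9) + "extbf"
```

The last two lines are worth noting: a docstring with \alpha or \textbf in a non-raw string produces no warning today, yet the text handed to the documentation tool is already wrong.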

umarbutler · December 13, 2024, 7:36am

Yes, I get that. But fundamentally, string literals are not raw text and never will be. So you have a few choices:

1. Stick the path inside r" and " and hope for the best. This will break if you have a double-quote character in the path, although I think that’s illegal in Windows anyway. It also breaks if you try to have a directory name with a trailing backslash.

2. Switch all your backslashes to forward slashes. Pretty straight-forward and works on all systems.

3. Double all your backslashes.

Options 2 and 3 are very much not fun. Imagine having to rewrite every slash when you paste in C:\Windows\WinSxS\amd64_microsoft-windows-netfx3-ondemand-package_31bf3856ad364e35_10.0.22000.1_none_1234567890abcdef\Microsoft.NET\Framework64\v3.5\Temporary ASP.NET Files\root.

Raw strings are the answer.

This isn’t a hill I’m willing to die on, I can manage with raw strings, but I just question whether the breakage is justified. You mentioned the possibility of adding \e but then one could simply add a SyntaxWarning for a couple of versions, but then of course you’d end up silently breaking ‘C:\everything’ in later versions, so yeah, it’s not an easy one.

@alphaparrot makes a good point about LaTeX though. There’s so much LaTeX out there that it’s worth thinking about.

Has anyone done a survey or obtained some data on how much existing code on GitHub would end up breaking? Knowing the size of any potential problem could ensure breaking changes are not made hastily.
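A survey of a local checkout could be sketched along these lines. The function name files_with_invalid_escapes is hypothetical, and matching on the words "escape sequence" in the warning text is an assumption about the message wording, which has changed between Python versions.

```python
import pathlib
import warnings


def files_with_invalid_escapes(root):
    """List the .py files under root whose compilation warns about an escape.

    Hypothetical survey helper. Files that fail to compile are skipped.
    """
    affected = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            try:
                compile(source, str(path), "exec")
            except (SyntaxError, ValueError):
                continue  # not valid for this interpreter; out of scope here
        # Assumption: the warning text mentions "escape sequence".
        if any("escape sequence" in str(w.message) for w in caught):
            affected.append(path)
    return affected
```

Calling files_with_invalid_escapes("path/to/checkout") on a cloned repository gives a count for that one project; nothing is executed, the files are only compiled.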

Rosuav · December 13, 2024, 7:47am

Open up Python’s REPL. Type input() and hit enter. Paste in the path. Hit enter again. You now have the repr of that path, perfectly ready to copy and paste.
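The same trick in script form, as a small sketch with a made-up path: repr() produces a literal with every backslash doubled, and str.replace() gives the forward-slash spelling.

```python
import ast

# Stand-in for a path pasted at the input() prompt; the raw string is used here
# only so that this example itself is valid source code.
pasted = r"C:\Windows\Temp\new folder\file.txt"

as_literal = repr(pasted)                        # ready to paste into source code
with_forward_slashes = pasted.replace("\\", "/")

# The generated literal evaluates back to exactly the original path.
assert ast.literal_eval(as_literal) == pasted

print(as_literal)            # 'C:\\Windows\\Temp\\new folder\\file.txt'
print(with_forward_slashes)  # C:/Windows/Temp/new folder/file.txt
```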

Like I said, that’s another perfectly valid answer. It has a nasty edge case, but if you aren’t bothered by that, there’s nothing wrong with it.

Either we make \e a breaking change all on its own and then get right back to the same problem of data-dependent bugs causing issues, or we fix ALL of them at once and then the problem is solved. I know which one I prefer.

You’re welcome to do this exact search yourself. But ultimately, that code is already broken.

Stefan2 · December 13, 2024, 7:49am

Can you show some? (Links to actual cases.)

petercordia · December 13, 2024, 7:50am

LaTeX in docstrings possibly breaking packages sounds like a very significant problem. If I understand warnings correctly (they warn the creators of libraries but not the users), I wouldn’t even know if a library I use is going to be broken by this change.

And LaTeX is LaTeX. You can’t “simply use forward slashes”. In LaTeX “\sqrt[3]{\alpha}” is meaningful, being the third root of α, but “/sqrt[3]{/alpha}” is just nonsense.

Ideally everyone who needs to would put little r’s in front of their docstrings, and that would probably solve the LaTeX problem. But as @alphaparrot wrote, this is code that has been written in the past, and which despite still being useful, is no longer being maintained.

Edit: sorry, I didn’t mean to reply to Umar, and I can’t see how to fix it.

rrolls · December 13, 2024, 8:02am

Using VS Code:

1. Select the first \
2. Press Ctrl+D a bunch of times, til all instances of \ are highlighted. (If like me you also map the keyboard shortcut Ctrl+Shift+D to “Cursor Undo”, then if you press it one too many times you can use Ctrl+Shift+D to quickly undo the extra highlight(s).)
3. Type / or \\ as appropriate. Suddenly, all highlighted backslashes are replaced with whatever you typed! Magic

This is something I pretty frequently do when I’m copy-pasting a Windows path into a Python file and I don’t want to use a raw string in that location for whatever reason.

umarbutler · December 13, 2024, 8:08am

This is a GitHub search for language:python "\textbf: Code search results · GitHub. There are 2k results. Some start with r. A lot don’t.

This is for language:python "C:\Windows: Code search results · GitHub. There are 6.4k results. I can’t find any on the first page beginning with r.

umarbutler · December 13, 2024, 8:08am

Here is another 5.3k scripts with language:python 'C:\Windows: Code search results · GitHub

NB I’m forced to split the links to get around the two links only for new users rule.

umarbutler · December 13, 2024, 8:10am

Here is another 1.3k results for "\sqrt: Code search results · GitHub