Please don't break invalid escape sequences

bavalpey · December 15, 2024, 4:27pm

Hi Marc,

This excitement is not about the warning. It is about the pending SyntaxError that is promised, and a plea against it.

sirosen · December 15, 2024, 5:00pm

It looks like not everyone who is complaining about this change is making an effort to understand the motivation behind it.

I, for one, am looking forward to re.compile("a\.b") being an error. It will provide a much better experience for novice developers. This change isn’t user-hostile: it’s actually a big user-facing improvement.
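As a rough illustration of what is at stake, the sketch below compiles three spellings of that pattern with warnings escalated to errors, which approximates how a future SyntaxError would behave. The helper name `compiles_cleanly` is made up for this example, and the snippets are held in raw strings so the example itself triggers no warning.

```python
import warnings


def compiles_cleanly(source: str) -> bool:
    """Return True if `source` compiles with every warning treated as an error.

    Hypothetical helper for illustration; not part of any library.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<example>", "exec")
        except (SyntaxError, Warning):
            # CPython reports an escalated invalid-escape warning as a SyntaxError.
            return False
    return True


print(compiles_cleanly(r'pattern = "a\.b"'))    # non-raw literal containing \.
print(compiles_cleanly(r'pattern = r"a\.b"'))   # raw string
print(compiles_cleanly(r'pattern = "a\\.b"'))   # doubled backslash
```

On Python 3.6 and later, the first call should report False and the other two True; the raw string and the doubled backslash are the two spellings that stay valid either way.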

@malemburg’s take is the right one. If the effort being poured into this thread were instead poured into community contributions, that would be more positive and productive.

Kurt · December 15, 2024, 5:11pm

If I understand you correctly, the syntax warning is fine for you, but you are worried about the pending and possible future syntax error. And I think that’s a perfectly reasonable position.

The question remains whether something should be changed or done right now.
And here I am with Marc-André: I am not sure what the purpose is of this idea discussion at this point.

Give a guarantee that the syntax warning will never be changed to a syntax error in the future?
Or just ensure the awareness that such a change must then be carefully evaluated?
Something else?

Stefan2 · December 15, 2024, 9:17pm

Should the documentation be adjusted? It doesn’t say “invalid” escape sequences but merely “unrecognized” escape sequences, which to me doesn’t sound like something bad, and it says they’re left alone and even calls that useful:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)

Maybe change it to “invalid”? That’s already what the syntax warning calls it, so currently there’s a mismatch.
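For reference, the documented "left in the string unchanged" behavior can be observed with a small sketch like the one below. It assumes an interpreter where the unrecognized escape is still only a warning; on a future version that makes it a SyntaxError, the `ast.literal_eval` call would raise instead. The variable names are illustrative.

```python
import ast
import warnings

# Source text of a non-raw literal containing the unrecognized escape \.
# It is held in a raw string so that this example itself stays warning-free.
literal_source = r'"a\.b"'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = ast.literal_eval(literal_source)

print(repr(value))                             # the backslash is still in the result
print(value == r"a\.b")                        # True: the same four characters
print([w.category.__name__ for w in caught])   # which warning class this Python used
```

Which warning class shows up depends on the interpreter version, in line with the "Changed in version" notes in the docs.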

Wombat · December 15, 2024, 10:09pm

Wow, that was a good find. The docs have promised to leave unrecognized sequences unchanged. That completely kills the proposition that users were doing something wrong.

That means that regex users were always allowed to write \w and have it work without making it a raw string. Any regexes written that way would have passed tests but will break in the next version of Python.

Also the documentation specifically calls out why this long-standing behavior is useful.

This new change should be reconsidered by the new SC. The documentation shows that this was never an invalid practice as the proponents have asserted. It was a promised behavior with known benefits.

To the extent there was a cost/benefit analysis, it seems to have completely missed the actual costs that are now making themselves known (regexes, LaTeX, ASCII art, etc). Also the purported benefit seems to be specious because no actual new escape code is currently under consideration.

jagerber · December 15, 2024, 10:13pm

I think this is a good change. It will eliminate a footgun and teach users about raw strings and escapes. I’ve used python for 10 years and have always been confused about these things and still didn’t understand them until researching for this thread. This is desirable.

However, it is pretty sad that multi-line comments or docstrings will be able to break code now (I guess they can already raise warnings). This attitude comes from me not knowing any good cases for having multi-line comments or docstrings not being treated as raw strings, or in fact for being treated as string literals at all. But a knock-on effect of this change is that, in addition to devs being forced to learn about raw strings and escape characters (good thing), users will also have to learn that e.g. docstrings are string literals (annoying and not useful thing in my opinion).

bavalpey · December 15, 2024, 10:25pm

It’s odd to call docstrings multi-line comments.

Are they used like this in certain settings?

oscarbenjamin · December 15, 2024, 10:29pm

I would not say that this “eliminates” any footgun. It just makes it more likely that if you paste some text into your code with \ characters you might realise that they need to be escaped. For any new code a warning does that almost as well as an error. Valid escapes would still not be an error/warning, so any actual code that is incorrect because of not escaping backslashes would likely be just as silently incorrect regardless.
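A minimal sketch of that last point, using a made-up Windows-style path as the pasted text: every backslash in it happens to start a recognized escape, so nothing is flagged even though the resulting string is almost certainly not what was intended.

```python
import ast
import warnings

# A Windows-style path pasted into a non-raw literal. \t and \n are both
# recognized escapes, so nothing here counts as "invalid".
literal_source = r'"C:\temp\new"'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = ast.literal_eval(literal_source)

print(len(caught))   # 0: no warning is raised for recognized escapes
print(repr(value))   # the path now contains a tab and a newline
```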

Rosuav · December 15, 2024, 10:33pm

If you read a bit more of the docs, no it doesn’t promise this forever. It states that there is a change coming.

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.) It is also important to note that the escape sequences only recognized in string literals fall into the category of unrecognized escapes for bytes literals.

Changed in version 3.6: Unrecognized escape sequences produce a DeprecationWarning.

Changed in version 3.12: Unrecognized escape sequences produce a SyntaxWarning. In a future Python version they will be eventually a SyntaxError.

Why do you continue to demand that future Python versions be warped around your convenience? You will continue to have the option to use old Python versions, if you choose not to change your code.

oscarbenjamin · December 15, 2024, 10:59pm

I think that not using raw-strings for docstrings with LaTeX was a somewhat common problem for SymPy a long time ago. When the warnings were added to CPython, SymPy added a CI job with:

python -We:invalid -We::SyntaxWarning -m compileall -f -q sympy/

That is the magic incantation apparently to check all files for invalid escapes. Presumably these days SymPy could get the equivalent with ruff somehow (maybe W605?).
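For anyone who prefers to run the same kind of check from Python, here is a rough sketch. It is not what SymPy's CI does; the function names `invalid_escape_warnings` and `scan_tree` are invented for this example, and matching on the phrase "invalid escape sequence" in the message is an assumption about the wording the interpreter uses.

```python
import warnings
from pathlib import Path


def invalid_escape_warnings(source, filename="<string>"):
    """Compile `source` and return (lineno, message) pairs for invalid escapes."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, filename, "exec")
    return [
        (w.lineno, str(w.message))
        for w in caught
        if "invalid escape sequence" in str(w.message)
    ]


def scan_tree(root):
    """Yield (path, lineno, message) for every .py file under `root`."""
    for path in sorted(Path(root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            found = invalid_escape_warnings(source, str(path))
        except (SyntaxError, UnicodeDecodeError, ValueError):
            continue  # skip files this interpreter cannot read or compile
        for lineno, message in found:
            yield path, lineno, message


# Small self-check on a line like the ones in the output further down.
sample = "import re\n" + r'matches = re.findall("line \d+", name)' + "\n"
print(invalid_escape_warnings(sample))
```

Something like `for hit in scan_tree("sympy"): print(*hit)` would then list the offending lines without failing the run, which can be handy when bisecting old commits.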

Going back to a commit from 10 years ago I see 10 invalid escape warnings but I don’t think any are to do with LaTeX (~~some are regex bugs~~):

$ git checkout sympy-1.0
...
(sympy) oscar@nuc:~/current/active/sympy$ bin/isympy 
/stuff/current/active/sympy/sympy/core/facts.py:3: SyntaxWarning: invalid escape sequence '\_'
  """This is rule-based deduction system for SymPy
/stuff/current/active/sympy/sympy/core/expr.py:2214: SyntaxWarning: invalid escape sequence '\*'
  """
/stuff/current/active/sympy/sympy/core/evalf.py:1463: SyntaxWarning: invalid escape sequence '\*'
  """
/stuff/current/active/sympy/sympy/utilities/misc.py:25: SyntaxWarning: invalid escape sequence '\ '
  """Return a cut-and-pastable string that, when printed, is equivalent
/stuff/current/active/sympy/sympy/utilities/runtests.py:529: SyntaxWarning: invalid escape sequence '\*'
  """
/stuff/current/active/sympy/sympy/utilities/runtests.py:1315: SyntaxWarning: invalid escape sequence '\*'
  """
/stuff/current/active/sympy/sympy/utilities/runtests.py:1562: SyntaxWarning: invalid escape sequence '\d'
  matches = re.findall("line \d+", name)
/stuff/current/active/sympy/sympy/utilities/runtests.py:1770: SyntaxWarning: invalid escape sequence '\s'
  want = re.sub('(?m)^%s\s*?$' % re.escape(pdoctest.BLANKLINE_MARKER),
/stuff/current/active/sympy/sympy/utilities/runtests.py:1774: SyntaxWarning: invalid escape sequence '\s'
  got = re.sub('(?m)^\s*?$', '', got)
/stuff/current/active/sympy/sympy/core/function.py:1696: SyntaxWarning: invalid escape sequence '\s'
  """
Traceback (most recent call last):
...
File "/stuff/current/active/sympy/sympy/core/function.py", line 107, in __init__
    evalargspec = inspect.getargspec(cls.eval)
                  ^^^^^^^^^^^^^^^^^^
AttributeError: module 'inspect' has no attribute 'getargspec'. Did you mean: 'getargs'?

If you go that far back then other incompatibilities, like the inspect module change, break everything. That sort of thing is actually a regular problem for me because it makes it difficult to git bisect really old changes if you can’t run old versions (regardless of warnings) under newer versions of Python. Adding a SyntaxError here might set a new floor for how far I can go back when wanting to bisect.

bavalpey · December 15, 2024, 11:01pm

This is the action I think would be appropriate for this thread. Perhaps not the guarantee that it is never to be changed, but at least walk back the promise that it will be an outright error.
Parts of the community have raised issue with this promise breaking code. I’m not sure who gets to make the decisions for changes, but things that break code that works perfectly fine should have a really high bar, in my opinion.

So, I’m not really sure how this provides a better experience for novice developers. Why should this be an error? It matches “a.b”.
It keeps backslashes usable as escapes in regex. Yes, users should usually prefer to use raw strings, but there are cases where not using one is just fine, e.g. when you want to match a literal tab or literal newline.
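To make the tab case concrete, here is a small sketch (the variable names are illustrative). Both spellings match a literal tab, just for different reasons, and neither produces a warning because `\t` is a recognized escape.

```python
import re

# "\t" is a recognized escape, so Python itself puts a tab character into the
# pattern. In the raw string, re receives backslash + "t" and resolves it.
from_python = re.compile("a\tb")
from_regex = re.compile(r"a\tb")

text = "a\tb"
print(bool(from_python.fullmatch(text)), bool(from_regex.fullmatch(text)))  # True True
print(from_python.pattern == from_regex.pattern)  # False: different pattern text

# An escaped dot matches only a literal dot, however the backslash is spelled.
dot = re.compile(r"a\.b")
print(bool(dot.fullmatch("a.b")), bool(dot.fullmatch("axb")))  # True False
```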

oscarbenjamin · December 15, 2024, 11:14pm

Actually they are not bugs. I don’t know regex well myself but these warnings are coming from precisely where the \ did not need to be escaped.

jagerber · December 15, 2024, 11:27pm

Yeah… fair that footguns can still get through. However, in many cases they will be stopped, since unintentional valid escape sequences (footguns) will often appear together in a string with invalid escape sequences, so that the SyntaxError will be triggered and the user will hopefully learn about escape sequences and address both the invalid and the valid escape sequence. Now there is only a very unlucky path for users to attempt to use backslashes in non-raw string literals for purposes other than escaping sequences while not hitting any SyntaxErrors and also not noticing incorrect results. So after reflection, I guess I would still say that, practically speaking, the footgun is removed.

I’m pretty sure docstrings are nothing other than multi-line comments that appear as the first statement in a function or class definition. I work on a package that happens to do crazy stuff treating docstrings as string literals and passing them around and modifying them at runtime but I really dislike all of this. Using f strings in docstrings etc. But I think these are all bad practices. I haven’t seen a good use case for treating docstrings as anything other than raw strings. But, I’d like to hear cases from others in this thread to help justify forcing new users to learn about this (in my opinion) useless fact. I’ve seen some handwaving that some docs tools might use the fact that docstrings are python literals, but I’m curious to know more details. How badly would these tools be impacted if docstrings were just always raw strings?

Wombat · December 15, 2024, 11:40pm

Why do you continue to demand that future Python versions be warped around your convenience?

Can we hear from someone other than Chris Angelico? He is dominating the conversation and automatically dismisses every point that is raised and his tone is borderline abusive. I’ve read that line several times and even with the most charitable reading, it sounds like he just insulted me and told me to shut up (it also insults the OP and other thread participants as well).

There appears to be zero real engagement or respectful consideration of the issues raised or recognition of the burden being unnecessarily shifted on to users.

Is there anyone else on this forum who has an open mind and is willing to legitimately consider the pros and cons before this becomes set in stone?

If you read a bit more of the docs, no it doesn’t promise this forever. It states that there is a change coming.

That is disingenuous. The text giving notice of an impending change was only added recently, at the end of 2023. In contrast, the docs promising the current behavior go back quite far. Guido van Rossum affirmed this promise and its rationale in a 1998 edit, 60f2f0cf8e1. And most of the wording predates that edit. It has been intentionally and explicitly promised and documented for most of Python’s history.

That at least warrants a genuine discussion that doesn’t callously dismiss actual impacts on users.

oscarbenjamin · December 15, 2024, 11:48pm

I would call it hyperbole rather than abuse. My reading is that Chris was just being a bit dramatic but it is easy for these things to be misread in a text-only context.

bavalpey · December 15, 2024, 11:56pm

I think any further talk about docstrings is quite off topic. If you open up another forum post for it, I’ll share my thoughts there. But I do have a use for docstrings, just as multi-line strings. I think the difference is that docstrings are actually interpreted. Comments, (other than the shebang), never are. So calling them comments was curious to me.
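A quick sketch of what "interpreted" means here, with two invented function names and a LaTeX-flavored docstring. The non-raw version contains `\f`, which is a recognized escape (form feed), so its text is altered without any warning.

```python
def non_raw_doc():
    """Returns \frac{a}{b}."""


def raw_doc():
    r"""Returns \frac{a}{b}."""


# \f is a recognized escape (form feed), so the first docstring is altered
# without any warning; the raw docstring keeps the LaTeX text intact.
print(repr(non_raw_doc.__doc__))
print(repr(raw_doc.__doc__))
```

A comment containing the same text would be ignored by the interpreter entirely, which is the difference being pointed out.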

MRAB · December 16, 2024, 12:26am

1998 was a long time ago. Since then, escape sequences changed when Python 2 arrived (PEP 223 – Change the Meaning of \x Escapes).

And then there was change elsewhere in the language itself with Python 3.

MegaIng · December 16, 2024, 12:31am

This has been discussed a lot. The impact on users has been considered. You need to argue why this needs to be relitigated right now. I don’t believe any new arguments have been made on either side:

Issue 27364: Deprecate invalid escape sequences in str/bytes - Python tracker
[Python-ideas] Make non-meaningful backslashes illegal in string literals
Mailman 3 What to do about invalid escape sequences - Python-Dev - python.org
There is also a different, earlier thread in python-dev according to the bpo discussion, where Guido made a ruling. If someone can find it, I will put a link here.

Is there anyone else on this forum who has an open mind and is willing to legitimately consider the pros and cons before this becomes set in stone?

Many people have done so in the past. Why does this need to be relitigated? Because you didn’t like the result? Did anything change since those earlier discussions? If so, what?

And since then, Guido has changed his mind and made a ruling in his role as the BDFL. Why should the SC now overrule it?

These have happened, you just didn’t take part in them. Or have you already read all 4 of the above-mentioned discussions and come to the conclusion that they weren’t done in good faith?

There is always the option to report a post if you feel like it crosses a line. Use it!

sirosen · December 16, 2024, 1:01am

It not being an error makes the rules inconsistent.

If "a\.b" works, is it the same as "a\\.b"?
Okay, so then these two are also the same, right? "a\nb" and "a\\nb"
What’s the difference between single and double quotes? What does f"a\b" mean?

Consistent rules make the language easier to learn.
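The answers to those questions under the current rules can be checked with a short sketch like this. The helper `literal_value` is made up for the example, and it assumes an interpreter where unrecognized escapes are still only warnings (which it silences).

```python
import ast
import warnings


def literal_value(source):
    """Evaluate the text of a string literal, silencing escape warnings.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# \. is unrecognized, so the backslash survives and both spellings agree.
print(literal_value(r'"a\.b"') == literal_value(r'"a\\.b"'))   # True
# \n is recognized, so here the two spellings differ.
print(literal_value(r'"a\nb"') == literal_value(r'"a\\nb"'))   # False
# Quote style makes no difference to escape handling.
print(literal_value(r"'a\.b'") == literal_value(r'"a\.b"'))    # True
# In an f-string, \b is still the recognized backspace escape.
print(f"a\b" == "a\x08")                                       # True
```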

With respect to the inflammatory tone in some of this thread, I think that some folks are making exaggerated statements, and others are rising to that bait.

For example, in @wombat’s post calling out Chris’ tone [1],

I don’t see recognition of the “real engagement” from community members. Or consideration of the fact that calling the change “unnecessary” assumes that only one perspective – yours – is right.

You’re holding a double-edged sword here: speaking in absolutes and showing a limited willingness to take other perspectives seriously basically kills a conversation.

We have heard complaints about this change at this point, but are there maintained tools which will find this change challenging? That’s just about the only thing which I can imagine emerging as new information that would justify a new discussion on this topic.

On a related note, showing more examples of the sheer number of incorrect escapes on GitHub is unconvincing. It can be read as evidence of how many people have been relying on a problematic misfeature, or simply not relevant – if all that code is abandoned, who cares?
Find other ways to talk about why this topic should be reopened. “Number of lines of code on GitHub” is increasingly often treated as an impact assessment, but it’s an extremely noisy signal – and in cases like this, it’s not even clear what conclusions should be drawn from it.

[1] To my eye, Chris’ posts are sometimes quite harsh, and even dismissive, but I’ve never known him in this forum to espouse views which didn’t come from a place of reason. I wish he would be gentler, but I’d rather have his expertise and willingness to engage than pure radio silence on many topics.

umarbutler · December 16, 2024, 1:22am

Without speaking on behalf of anyone else, I am personally ok with this provided that the SyntaxWarning persists for a while (and, ideally, a more descriptive message is added to flag to users: hey, you need to change this code because it will break later on).

Users should be given more time to change their behaviour and code. If a few years pass and there has already been sufficient warning, then breakage is far more reasonable.

Truthfully, when I first saw the SyntaxWarnings I assumed they were coming from my Jupyter engine and not Python itself (as I work entirely in Jupyter) and sometimes I get IPython-specific warnings… There was no message flagging that this is going to create a breaking change.

If there was a message along with the SyntaxWarning stating as such, then I would’ve taken it more seriously. Only after getting a bunch of the warnings did I decide to investigate further.

Some users might not think to investigate if they find that the code still works fine.