Please don't break invalid escape sequences

jamestwebber · December 16, 2024, 7:56pm

But if no one ever runs the code again, is it truly “broken”?

I don’t doubt that a large amount of code would be incompatible with a future Python version that makes this an error. I suspect that nearly all of that code would need major updates to be compatible with future Python anyway. If something isn’t being maintained it doesn’t matter if it would theoretically break.

I doubt anyone has looked at most of the code I wrote in grad school, let alone tried to run it. If they did, I’m sure they’d find it was broken in some way, and not because I invented new escape sequences in my strings.

ajoino · December 16, 2024, 8:05pm

Yeah, this mirrors my experience, where most student/research code is used in one project over a limited timespan and is never run again in its entirety, meaning that it’s likely only run on a single Python version. Furthermore, in my experience most of that code is not turned into libraries and is instead copy-pasted when reused, which provides a nice opportunity to upgrade the code to a newer Python version.

jamestwebber · December 16, 2024, 8:19pm

Certainly none of that is a good situation (from the perspective of how research gets done), but I definitely don’t think it’s a problem that should stop Python from improving.

patrick-kidger · December 16, 2024, 9:02pm

Reproducibility is a cornerstone of the scientific endeavor. We should absolutely expect to be able to run old code again, often only occasionally as part of building a better attempt at the same problem some years later.

Python 2-to-3 was bad enough. I don’t want us to repeat that transition.

jamestwebber · December 16, 2024, 9:07pm

I totally agree that it should be that way, but I also acknowledge reality: a large majority of the code used in research publications is never used again, if it’s made public in the first place. Particularly the type of code written by a grad student, who moves on to a career somewhere else and has no incentive to keep it running^[1].

We can’t invoke 2-to-3 every single time a change is made to Python. This is trivially easy to fix and so far we haven’t seen any examples of maintained code that couldn’t handle this very small change.

[1] or responsibility! It’s really their PI’s job!

ajoino · December 16, 2024, 9:43pm

I think the only way to make it truly repeatable is to run it with the Python version it was made for, since even minor/patch (the last number) releases can introduce subtle incompatibility issues. Reproducibility in CS (well, ML at least) has much bigger issues than needing to update a couple of strings here and there.

PeterL · December 17, 2024, 12:40am

And don’t forget \a (alert - bell), \b (backspace), \r (carriage return), \u (unicode 16), \U (unicode 32), \x (hex), \N (named unicode escape), \v (vertical tab), \0 (octal), \t (tab) and probably some more.
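For anyone who wants to check a particular sequence rather than rely on a list, here is a minimal sketch that asks the compiler itself. The helper name `flags_invalid_escape` is made up for illustration, and it assumes the diagnostic text keeps containing the phrase "invalid escape sequence", which holds for the CPython versions I know of but is not a documented guarantee.

```python
import warnings


def flags_invalid_escape(literal_source):
    """Return True if compiling this string literal reports an invalid escape."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, regardless of the interpreter's default filters.
        warnings.simplefilter("always")
        try:
            compile(literal_source, "<escape-check>", "eval")
        except SyntaxError as exc:
            # Malformed literal, or a version where the warning became an error.
            return "invalid escape sequence" in str(exc)
    return any("invalid escape sequence" in str(w.message) for w in caught)


# Raw strings are used here so this script itself contains no escapes at all.
for literal in (r'"\a"', r'"\x41"', r'"\N{BULLET}"', r'"\0"', r'"\d"'):
    print(literal, flags_invalid_escape(literal))
```

Sequences such as \x, \u, \U and \N need their full argument (as in "\x41") to compile at all, which is why the sketch tests complete literals rather than bare letters.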

alphaparrot · December 17, 2024, 2:59am

Actually quite a bit of the code I used in grad school was written in the 70s, 80s, or 90s (and distributed in tarballs). It was mostly Fortran, but with some very low-level python implementations of convenient functions (which could be re-implemented to be faster with numpy, at significant expense to the student, but which in current form was unaffected by 20 years of Python upgrades). Yes, arguably almost all code written in grad school gets abandoned. That’s a separate question from what fraction of code still in use was written by a grad student or postdoc years ago, who has long since moved on.

Again, I personally can update my code to not break. I actively maintain my code. My point is that most Python users are not like anyone on this thread; if you ask them to modify a pip-installable package (or even source code you emailed to them), you have lost them. We can argue until the cows come home whether that’s an acceptable state of affairs, but it is the state of affairs. Like it or not, Python is the lingua franca of data. And that means an awful lot of people know how to use a jupyter notebook and pip and that’s it.

And those people (and grad student programmers) are actually why I linked the numpy docstring How-To, not because it’s an example of code that will break. That document is a great example of LaTeX being used in a docstring. And if you are an amateur trying to learn how to document your code, and you read the numpy guide (because numpy is basically the gold standard for well-documented code), you will not learn that you should use an r-string in order to use LaTeX. You will just see that the reST docstring syntax supports it. Arguably that’s a criticism of the numpy docstring guide, but the point is that many people have seen that document and others, which indicate that LaTeX is an option but don’t necessarily make any mention of r-strings, and those people have written code, which would break.

You may feel quite strongly that people should learn how to do things properly, and that breaking the code they use would be a useful pedagogical exercise, but this is condescending and patronizing. Many people are just using this for work, and are trying to produce work products with as little fussing with debugging someone else’s code as possible.
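To make the r-string point concrete, here is a small sketch with two hypothetical functions (`plain_doc` and `raw_doc` are illustration names, not taken from the numpy guide). It uses \frac on purpose: \f is already a recognized escape (form feed), so the non-raw version is silently altered with no warning at all, whereas something like \sigma would only be flagged as an invalid escape.

```python
def plain_doc():
    """Return :math:`\frac{a}{b}`."""


def raw_doc():
    r"""Return :math:`\frac{a}{b}`."""


# The plain docstring holds a form feed followed by "rac";
# the raw docstring keeps the backslash and the full "\frac".
print(repr(plain_doc.__doc__))
print(repr(raw_doc.__doc__))
```

Either way the function body is unaffected; only the text that documentation tools receive differs.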

And this, and my earlier example of 50-year-old code that still works perfectly well, brings me to actually my most philosophical point: the Python developer community has taken a brash attitude towards backwards compatibility that is often hostile to both users and programmers. Some breaking changes are necessary; many (many) have not been. Supporting new escape characters without breaking code each time one is added is a great idea. So let’s implement that in a way that’s backwards compatible. There are plenty of brilliant programmers working on Python. Other languages like Fortran have managed to stay relevant for decades with continuous development and upgrades, and only minimal breaking changes (I can think of one (1) breaking Fortran change in the last 40 years that I’ve encountered). That Python (and the major libraries like numpy, scipy, etc) regularly break perfectly fine old code is not a good thing, and it shouldn’t be considered normal. The comparison with the 2 to 3 transition is apt, because so much code broke in that transition. It’s the exact same situation I and others are trying to prevent.

blhsing · December 17, 2024, 8:11am

Alphaparrot:

A lot of great points have been made over the weekend. Here are some of my big ones that I keep coming back to as I think about this:

Code that is infrequently maintained is not necessarily infrequently used. A lot of good scientific code (good meaning it does what’s intended, and does so quickly and efficiently, not necessarily that it’s pretty or conforms to PEP) does not rely much on external libraries, and therefore doesn’t need to change much between say Python 2 and Python 3.18. If it ain’t broke, why fix it (and more to the point, why break it unnecessarily).

Code which does what it’s intended to do is not broken. Full stop. PEP-incompliant and broken are two very different things. If the docstring exists to function as documentation (nominally its only purpose), and isn’t being passed to a parser that complains about escape sequences, then it’s not broken no matter what’s in it.

Many users of such codes are not in a position to maintain them themselves, either because they lack the time or the know-how. Sure, they could learn; that is unrealistic and overly idealistic. The fact of the matter is that placing an unreasonable burden on the community will undoubtedly lead to fragmentation.

If you prefer not to make any changes to your existing code, it is actually possible to transparently transform the old code at runtime to comply with the new syntax by installing a site-wide import hook with importlib.machinery.SourceFileLoader.get_source overloaded to tokenize the source code, replace invalid escape sequences in non-raw string and f-string tokens with escaped backslashes, and untokenize the new tokens back into source code.

Here’s a sample source transformer for you to build a custom source file loader:

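import re
import token as tok
from functools import partial
from io import StringIO
from tokenize import TokenInfo, generate_tokens, untokenize

# The FSTRING_* token types used below require Python 3.12+.
def _new_escaper():
    def _escaper(
        token,
        _fstrings=[],  # stack of "enclosing f-string is not raw" flags, fresh per escaper
        _not_raw=re.compile(r'[^Rr]*[\'"]').match,
        # Leave already-escaped backslashes (\\) alone; double a backslash only
        # when it is followed by a character that starts no valid escape.
        _escape=partial(
            re.compile(r'\\\\|\\(?=[^\n\\\'"0-7NUabfnrtuvx])').sub, '\\\\\\\\')
    ):
        match token:
            case TokenInfo(type=tok.STRING, string=s) if _not_raw(s):
                return token._replace(string=_escape(s))
            case TokenInfo(type=tok.FSTRING_START, string=s):
                _fstrings.append(_not_raw(s))
            case TokenInfo(type=tok.FSTRING_MIDDLE, string=s) if _fstrings[-1]:
                return token._replace(string=_escape(s))
            case TokenInfo(type=tok.FSTRING_END):
                _fstrings.pop()
        return token
    return _escaper

def escape_invalid_sequences(source):
    # Tokenize, rewrite string tokens, and rebuild the source text.
    return untokenize(
        map(_new_escaper(), generate_tokens(StringIO(source).readline))
    )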

so that for example:

print(escape_invalid_sequences(r'''"\t\d\n";re.compile(f"\d{rf'\d{1}'}")'''))

would output:

"\t\\d\n";re.compile(f"\\d{rf'\d{1}'}")

Demo here
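For the import-hook half of the idea, a rough sketch of how a transformer could be wired in follows. Everything named here (`install_transforming_hook`, `TransformingLoader`, the `transform` parameter) is hypothetical, and the sketch overrides `source_to_code` rather than `get_source`, on the assumption that this is the method the import machinery calls when it compiles a source file. Treat it as a starting point, not a tested site-wide solution.

```python
import importlib.machinery
import importlib.util
import sys


def install_transforming_hook(transform):
    """Install a path hook whose source loader passes text through `transform`."""

    class TransformingLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, **kwargs):
            # `data` is normally the raw bytes of the .py file.
            if isinstance(data, (bytes, bytearray)):
                data = transform(importlib.util.decode_source(data))
            return super().source_to_code(data, path, **kwargs)

    # Mirror the usual loader line-up so extension modules and
    # bytecode-only modules in the same directories stay importable.
    loader_details = [
        (importlib.machinery.ExtensionFileLoader,
         importlib.machinery.EXTENSION_SUFFIXES),
        (TransformingLoader, importlib.machinery.SOURCE_SUFFIXES),
        (importlib.machinery.SourcelessFileLoader,
         importlib.machinery.BYTECODE_SUFFIXES),
    ]
    hook = importlib.machinery.FileFinder.path_hook(*loader_details)
    sys.path_hooks.insert(0, hook)
    # Drop cached finders so directories are looked up again with the new hook.
    sys.path_importer_cache.clear()
    return hook
```

Calling `install_transforming_hook(escape_invalid_sequences)` early (for example from a sitecustomize module) would be the intended use. Two caveats: modules that were already imported are not affected, and a still-valid cached .pyc file is loaded without the source being compiled again, so the transform only runs when a module is actually recompiled.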

umarbutler · December 17, 2024, 8:15am

I think this is a bit extreme and very hacky. Also, once the change is made, I’m sure your vscode will look very unhappy at any invalid escapes.

blhsing · December 17, 2024, 8:17am

I’m specifically addressing @alphaparrot, who knows people who are reluctant to make any changes at all to the old code that runs. VSCode doesn’t apply because those people aren’t going to edit the old code anyway.

Furthermore, this transformer code can also be used to automate the entire refactoring of the old code base permanently, should the user choose to do so. In that case though, some additional logic to convert a string to a raw string when there are no valid escape sequences inside may be preferred over escaping backslashes with backslashes.
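The raw-string preference could look roughly like the sketch below. `prefer_raw` is a hypothetical helper that works on the source text of a single string literal; it only handles unprefixed literals and deliberately leaves anything containing a recognized escape untouched, since those would still need backslash doubling instead.

```python
import re

# Characters that may follow a backslash in a recognized str escape
# (a newline is a line continuation; digits start octal escapes).
_RECOGNIZED = set("\n\\'\"abfnrtvxNuU01234567")
# Consume each backslash together with the character it applies to.
_BACKSLASH_PAIR = re.compile(r"\\(.)", re.DOTALL)


def prefer_raw(literal):
    """Prefix an unprefixed string literal with r when that keeps its value."""
    if literal[:1] not in ("'", '"'):
        return literal  # already prefixed (r, b, f, ...): leave it alone
    followers = _BACKSLASH_PAIR.findall(literal)
    if followers and not any(ch in _RECOGNIZED for ch in followers):
        return "r" + literal
    return literal


print(prefer_raw(r'"\d+\.\w"'))   # only invalid escapes: gets the r prefix
print(prefer_raw(r'"\t\d"'))      # contains \t: returned unchanged
```

Because any literal containing an escaped backslash or an escaped quote is left alone, the literals that do get the prefix cannot end up as malformed raw strings.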

sirosen · December 17, 2024, 3:31pm

I understand your comparison with Fortran but it is not an appropriate one for Python. If you have 50 year old Python code, I’m more interested in your time machine than I am in programming language evolution.

I dislike phrasing this vaguely in terms of “the Python developer community” because the responsibility there is diffuse. There are issues with some popular libraries’ handling of compatibility, but that’s not what we’re talking about here. We’re talking about the language itself, and in that context there’s much more concentrated power and responsibility:

the SC, and previously BDFL, has final say
the core developers have commit rights
the core developers have special advocacy rights (the ability to sponsor PEPs)
anyone in the world can petition them for changes

So what you’re really saying here is that you find the compatibility policy applied by the core devs and the SC to be insufficiently stringent.

There’s nothing wrong with that opinion. I disagree. I like using a language which evolves at a deliberate, predictable, moderate pace. Not a crawl, but not a sprint.

Phrasing it as “user hostile” or “harmful to developers” is narrow and incorrect. It is harmful to some people. It is helpful to others.

If you want 50-year stability, frankly this is the wrong language. Fortran probably is a much better choice, and that’s not me being snarky. It really is. It’s a much more tightly constrained language with different security concerns. There’s absolutely nothing wrong with different languages being good at different things.

There are two potential positions that you can advocate here, as far as I can tell. One is that the SC and core devs should change the overall pace of change in the language (I advise a new thread if so, this one is getting dragged all over the place as it is). You would need to present a strong argument, and probably clear criteria for how compatibility breaking changes should be identified and handled. I would argue against you in such a discussion, since I do not think that the pace of change needs to be adjusted. And yes, I have had something break in almost every minor release since 3.5. Handling that is part of my job as a package maintainer.

The other way you can advocate for your position is to say that changing the handling of invalid escapes is so harmful, so impactful, that the past discussions in this topic were not sufficient. They’ve missed the scale of the impact. You have new and better information. Key to this: don’t just scan GitHub for “number of broken files”. Per my point earlier, that’s simply not as good of a tool for impact assessment as people think it is. You can’t just say “I don’t like this change and I’m worried about X” – in this case, there’s already been discussion and a decision made. In this case, if you want to change the decision which has already been made, you need to bring receipts.

blhsing · December 18, 2024, 2:18am

"\😊" won’t actually get deprecated (notice how it doesn’t generate a SyntaxWarning) because it is automatically translated into "\u005c\U0001f60a" for you before it reaches the parser.

Similarly, there’s a test case in CPython that relies on this behavior to work:

github.com/python/cpython/blob/ba2d2fda93a03a91ac6cdff319fd23ef51848d51/Lib/test/test_utf8source.py#L11

import unittest

class PEP3120Test(unittest.TestCase):

    def test_pep3120(self):
        self.assertEqual(
            "Питон".encode("utf-8"),
            b'\xd0\x9f\xd0\xb8\xd1\x82\xd0\xbe\xd0\xbd'
        )
        self.assertEqual(
            "\П".encode("utf-8"),
            b'\\\xd0\x9f'
        )

    def test_badsyntax(self):
        try:
            import test.tokenizedata.badsyntax_pep3120  # noqa: F401
        except SyntaxError as msg:
            msg = str(msg).lower()
            self.assertTrue('utf-8' in msg)
        else:

bavalpey · December 18, 2024, 3:58am

With all due respect, in the case of docstrings, no, this would not break the code. Please stop claiming it would.

umarbutler · December 18, 2024, 6:44am

To be fair, I think it would.

def blah(escape_code: str) -> str:
    '''The only escape code `blah()` currently supports is '\e'.'''

    return escape_code

Currently, if you hover over blah() in vscode, this will render as The only escape code blah() currently supports is '\e'., however, if \e was added as an escape code, it could end up rendering as The only escape code blah() currently supports is ' '.. Why? Because if you change the docstring to The only escape code blah() currently supports is '\n'. vscode will render it as The only escape code blah() currently supports is ' '..

Personally, >90% of my time spent reading docstrings is spent reading docstrings via vscode.

I think you’d probably have the same problem with other applications that render docstrings.

bavalpey · December 18, 2024, 2:18pm

But this wouldn’t break the code. It would break comment. The core library functionality would continue to work.

Nineteendo · December 18, 2024, 3:16pm

It would break this code for example:

def escape_esc(s):
    return s.replace("\x1b", "\e")

Now:

>>> print(escape_esc("\x1b[31m<text>\x1b[39m"))
\e[31m<text>\e[39m

Future:

>>> print(escape_esc("\x1b[31m<text>\x1b[39m"))
<text> # in red

barry-scott · December 18, 2024, 4:32pm

So the change is a good thing because you can see where to fix the bug?
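For reference, the fix in question is small. A sketch, assuming the intent of `escape_esc` was always a literal backslash followed by `e`:

```python
def escape_esc(s):
    # r"\e" is a backslash followed by "e" on every Python version,
    # whether or not \e ever becomes a recognized escape.
    return s.replace("\x1b", r"\e")


print(escape_esc("\x1b[31m<text>\x1b[39m"))  # prints \e[31m<text>\e[39m
```

Writing `"\\e"` instead of the raw string would behave the same way.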

jagerber · December 19, 2024, 12:02am

@bavalpey

…

I find this line of conversation confusing. Not sure what you mean by “in the case of docstrings, no, this would not break the code”. By this do you mean this change would not break the docstring? In your next post it sounds like you agree that this change WOULD break the docstring (“It would break comment”). This is exactly what Umar was addressing.

We are considering the case where:

No SyntaxError is ever raised in the case of invalid escape characters
a new escape character \e is introduced into python between versions 3.A and 3.B

Now suppose a user has the string print("H\ello World") in their source code. The output of this code would differ between versions 3.A and 3.B. I think everyone would agree that the introduction of the new escape character broke this user’s code.

Now suppose the user has a docstring of the form """...d\e...""". Umar has pointed out that IDE tools that render docstrings will render this docstring differently between versions 3.A and 3.B. I think it is fair to say that this change has broken the docstring. To understand this we have to remember that docstrings are Python objects, as explored more thoroughly in this thread.

So I would say that (in this hypothetical), the introduction of new escape characters has the potential to break both code and docstrings.

jamestwebber · December 19, 2024, 1:25am

The pedantic point that was being made was simply: the docstring is not the code. You can break your documentation and the code still functions, it’s just documented worse.

How this led to so many words being spilled, I don’t know.