Please don't break invalid escape sequences

seberg · December 13, 2024, 8:29am

No opinion from me on whether or not it should be done, but there is a nice easy win here in either case, I think:
Both the warning and the message could suggest adding the r"" string if there are no other escapes and suggest using \\ if there are.
Since these are literals, you could probably even suggest the whole fixed string for the user to copy-paste (if it is short).

storchaka · December 13, 2024, 8:51am

Good example. These that are not raw strings are broken (hint: \t is a tabulation character).

umarbutler · December 13, 2024, 9:01am

I didn’t even realise that I guess that’s why you see r much more than the other examples I provided.

umarbutler · December 13, 2024, 9:03am

Here are some more:
"\frac | 2.9k | Code search results · GitHub (this is already broken)
"\sum | 1.3k | Code search results · GitHub
"\int | 1.3k | github . com/search?q=language%3Apython+%22%5Csum&type=code
"\pi | 2k | github . com/search?q=language%3Apython+%22%5Cpi&type=code

Keep in mind, this is only grabbing strings that start with LaTeX. Who knows how many contain LaTeX, probably way more.

pekkaklarck · December 13, 2024, 9:24am

Do you know is there some reason r"trailing\" couldn’t be parsed as "trailing\\"? It looks like a special case that the parser could handle unambiguously.

Nineteendo · December 13, 2024, 9:34am

This would change docstrings containing escaped backslashes:

def foo():
    """foo\\bar"""

This allows to escape double quotes: r"\"hi\"".

Rosuav · December 13, 2024, 12:02pm

Fundamentally, string prefixes do not change how you find the end of the string. Tokenization isn’t affected by the prefix. In other words, if "embedded \" quote" is a single string, so must r"embedded \" quote" be.

Stefan2 · December 13, 2024, 12:26pm

As Serhiy said, good example since \t a valid sequence. I only checked the first page of results, and every single case that didn’t use a raw string looked like a bug.

Btw I don’t think I saw any example of what @alphaparrot talked about (“LaTeX-formatted sequences in their docstrings, as some documentation engines can automatically parse LaTeX math sequences into mathematical symbols.”). @alphaparrot, do you have a good example of that?

petercordia · December 13, 2024, 1:03pm

This is an example from @umarbutler 's list of "\sqrt instances:

github.com

ensemble-learning/simpy/blob/415594423660a08d413b6db706f11c1e143918de/arc/linux_tools/block.py#L51


      
              else:
                  res = len(data)/seq + 1
              for i in range(res):
                  if (i+1)*seq < len(data):
                      x.append(data[i*seq: (i+1)*seq])
                  else:
                      x.append(data[i*seq:])
              return x
          
          def block_average(fname, outfile):
              """\sqrt{\frac{c_{0}(n/m)}{n/m-1}}
              n : the size of sample
              m : the block size
              """
          
              x, y = parse_xvg(fname)
              ave = get_mean(y)
              print "Ave : ",ave,
              msd = math.sqrt(get_s2(y, ave)/(len(y)-1))
              print "Std Dev: ", msd
              o = open(outfile, 'w')

I would argue it’s not a bug if it works as intended, and all of the code I saw (looking through @umarbutler 's github searches) I expect to work as intended.
They’re not neat. The "\sqrt search in particular was a doozy. A lot of this should be written better. But they’re not bugs. Merely code that defies documented intent and “best practice”.

TIGirardi · December 13, 2024, 1:23pm

Isn’t \f a valid sequence? I know this is just one example, my point is that string literals with LaTeX in it are a mess, and is really hard to see if they’re broken. One should already be using raw string literals for this use case, the code is already likely broken otherwise.

mikeshardmind · December 13, 2024, 1:26pm

There’s really never going to be a good solution for embedding another arbitrary language’s syntax without ever colliding with escapes. I’d argue that LaTeX in docstrings is “doing it wrong” and this should be a place where you link to a formula or teach tooling to read comments. (The former is what I do at work for documenting math, we have separate files for rendering math formulas and generating links in rendered docs)

Stefan2 · December 13, 2024, 1:28pm

Yes, \f is valid, and help(block_average) shows me this:

Help on function block_average in module __main__:

block_average(fname, outfile)
    \sqrt{rac{c_{0}(n/m)}{n/m-1}}
    n : the size of sample
    m : the block size

Note how \frac became rac. Bad. Do the mentioned “documentation engines” preserve it?

pf_moore · December 13, 2024, 1:28pm

Is the problem here that the tools which read docstrings like this and format the content as LaTeX simply aren’t parsing the strings correctly, and are failing to replace \f with a formfeed character?

If people are saying that using LaTeX like this “works”, then they clearly aren’t using a tool that correctly parses Python source code.

Rosuav · December 13, 2024, 1:44pm

I disagree. It can still be a bug even if it works by accident. But also, the code you’re looking at is ancient, as it is written for Python 2 - and the file it’s in hasn’t been touched in six years. If people aren’t going to update their code for Python 3, why would they bother fixing broken docstrings?

petercordia · December 13, 2024, 2:09pm

The point was precisely that we don’t expect these people to bother fixing their docstring?

The particular library I linked is still in active use/development, with the most recent commit 5 days ago at time of writing, to add support for macbookpro. All of this is more of an accident than anything else.
The code clearly worked for them when they made the library, and I don’t think \f is young enough that the introduction of \f broke their code.

jamestwebber · December 13, 2024, 3:10pm

Regardless of how recently they made commits to the repository, if they’re still on Python 2 they are blissfully unaware of the SyntaxWarning and the later error. The decision has no relevance to their code.

petercordia · December 13, 2024, 3:54pm

yes, yes. Look, all of this was just because @Stefan2 wanted to know what it looks like when people put Latex in docstrings. And I’m not a github wizard, but I can recognise docstrings and latex at a glance. (But not Python2, otherwise I would have picked a different example.)

Nothing specific to these people, who just happened to be the top code on a search for "\sqrt that actually put it in a docstring, is relevant to the larger point.

I just thought @Rosuav was wrong on both points he made in the post I replied to.

Actually their code base does reassure wrt SyntaxWarning breakage. For significant portions of this code of which it’d be a shame if it got broken, it’s probably ok to run it on old python versions. If you just need code to validate an old research paper, that would usually be an adequate solution, I imagine.

Sympy is the only library I’m aware off that uses Latex in docstrings and that I need to run on modern python versions, and they do properly use r-strings. (But I can’t be certain nothing I rely on would be broken by the suggested change, due to the nature of warnings.)

MRAB · December 13, 2024, 4:43pm

\f predates Python 2 (it was even in Python 1.5), so that code dates broken when written.

chepner · December 14, 2024, 9:27pm

The code worked for the inputs they tested it on. Just because a bug hasn’t been triggered yet doesn’t mean it doesn’t exist.

Wombat · December 14, 2024, 9:59pm

I agree!!!

This places a burden on users to know and remember exactly what is valid and isn’t. The much easier rule was “when in doubt, escape it.” The new restriction is a double burden when using regular expressions which have their own escaping rules (and which generally allow to escape when in doubt as Python used to do).

The breakage is also a pain if you put ascii art diagrams in docstrings with rows of hyphens, columns of pipes, backslashes and pluses as joiners, and greater-than or less-than symbols for arrow. I am now having to take all of those diagrams and turn them into raw strings. That then creates a new problem where all my unicode escapes have to be replaced by the actual character. That in turn creates a new problem where my source files are no longer all ascii and various tools (like file viewers and grep no longer work reliably).

I don’t want to be negative, but this potential code invalidation really sucks. As far as I can tell, no existing code will get any offsetting benefit. It is a pure loss.