Proposal: add \z as a synonym for \Z in Python REs for standardization

Hello - I’m with the Austin Common Standards Revision Group - the joint technical working group established to develop and maintain the core open systems interfaces that are the POSIX™ 1003.1 (and former 1003.2) standards, ISO/IEC 9945, and the core of the Single UNIX Specification.

We have had a request to unify/rationalize the regex behaviors for “anchor at string beginning” (^ is the closest in POSIX) and “anchor at string end” ($ is the closest in POSIX). A description of this problem in depth can be found here and a table that scopes the varied solutions across varying languages can be found here.

Our working group has come to the conclusion that \A and \z are widely implemented across many ecosystems and are the most “standard” solution to the issue. We are asking if the Python community would consider adding “\z” as a synonym for “\Z” in their regex lexicon.

I support standardization. It is sad that Python’s \Z is not compatible with other implementations.

The early history of regular expression support in Python is complex, but it seems that the modern syntax was based on PCRE 0.95. At that time, \Z meant the end of the buffer. Its meaning was changed, and \z was added, only in PCRE 2.0.

https://www.pcre.org/original/changelog.txt

As far as I can see, Perl 5.0 had \Z and \z when it was released in 1994, and PCRE was released in 1997.

Yes, it was a mistake on PCRE’s side, which Python inherited.

I do not think that there are any downsides to introducing \z (except that it could already be used in our tests as an example of an “illegal sequence”). Several years after this (not earlier than in 3.16 or 3.17) we can start to deprecate \Z.

BTW, I planned to migrate most uses of $ to \Z, but this is a lot of work and needs many tests, so I have not finished it yet.
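For readers wondering why $ and \Z are not interchangeable in Python today, here is a minimal sketch using only the standard `re` module. The variable names are just for illustration.

```python
import re

text = "abc\n"

# Without re.MULTILINE, `$` matches at the very end of the string
# and also just before a single trailing newline.
dollar_match = re.search(r"abc$", text)

# Python's `\Z` matches only at the very end of the string.
big_z_match = re.search(r"abc\Z", text)

print(dollar_match is not None)  # True
print(big_z_match is not None)   # False
```

So a pattern that relies on $ quietly accepts input with a trailing newline, which is the usual reason to prefer an absolute end anchor for validation.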

@msbrown, can you open an issue for adding \z?

Deprecate \Z, or change \Z to match Perl and PCRE?

It is safer to remove it than to change its behavior. It is equivalent to \n?\z. Users who need this can just use \n?\z. But most users will use \Z incorrectly, from old memory or from copying old books and internet forums.

We can do this only when \z exists in all maintained Python versions.
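Until then, code that has to run on several Python versions could probe for \z at import time. This is only a sketch: `end_anchor` is a hypothetical helper, and it assumes an interpreter in which an unknown letter escape in a pattern raises `re.error` and in which \z, where accepted, behaves like today's \Z.

```python
import re


def end_anchor() -> str:
    # Hypothetical helper: prefer \z when the running re module accepts it,
    # otherwise fall back to Python's historical \Z.
    try:
        re.compile(r"\z")
    except re.error:
        return r"\Z"
    return r"\z"


# Build a pattern that anchors at the absolute end of the string.
pattern = re.compile(r"\d+" + end_anchor())

print(pattern.search("build 42") is not None)    # True
print(pattern.search("build 42\n") is not None)  # False
```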

At a minimum, allow “\z” to have the same behavior that “\Z” has today.

Done: add \z as a synonym for \Z in Python REs for standardization · Issue #133306 · python/cpython · GitHub

“it” in the quote does not refer to current behavior, but rather the behavior of a theoretical standards-conformant version. Hence the suggestion to just remove it rather than confuse people.

I sent a message on the bug report, but I think it is worth mentioning here too. There is a discussion on the glibc mailing list about what syntax should be standardized, taking into account existing implementations [1]. I think that discussion should be resolved before changing anything.

[1] [PATCH] regex: Add \A and \z synonyms to \` and \'

Even if \` and \' are standardized, \z is a better option for Python, and it is already a de facto standard.

BTW, \Z in Perl is rather equivalent to (?=\n?\z) than to \n?\z. Never mind, it’s rarely needed.
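The difference between the two spellings is whether the optional newline becomes part of the match. A small sketch in Python, where \Z already means "absolute end of string", so `(?=\n?\Z)` is used here as an assumed emulation of the Perl-style anchor:

```python
import re

text = "abc\n"

# Consuming form: the trailing newline becomes part of the match.
consuming = re.search(r"abc\n?\Z", text)

# Lookahead form: the newline is checked but not consumed,
# which is how a zero-width anchor behaves.
lookahead = re.search(r"abc(?=\n?\Z)", text)

print(repr(consuming.group()))  # 'abc\n'
print(repr(lookahead.group()))  # 'abc'
```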

I don’t like \` and \' for the same reason I don’t like \< and \>, for which I use \m and \M in the regex module.

This is a reply I wrote in an issue about the latter (syntax for beginning and end of the word? · Issue #16 · mrabarnett/mrab-regex · GitHub):

I borrowed \m and \M from TCL.

The reasoning behind using them instead of \< and \> is as follows.

All metacharacters, eg ^ and $, are punctuation, and all escape sequences consist of \ followed by A-Z, a-z or 0-9. If you see \ followed by any character other than A-Z, a-z or 0-9, you know that it's a literal, eg \$ is a literal.

If you saw \< you might think that it's also a literal, perhaps because < itself is a metacharacter. The escaping would be inconsistent.

I think that \< and \> may have originated with the BRE syntax in which ( and ) are literals and \( and \) are for capturing.

I think that \' and the backtick equivalent (which I can’t even work out how to type in Markdown :slightly_frowning_face:) are ugly and awkward to type. So regardless of what the standards end up as, I’d want \A and \z in Python[1].


  1. If the glibc version got standardised, I’d reluctantly accept having it as well, as a discouraged version available for compatibility only, but that’s as far as I’d go.

@msbrown, Tcl also uses \Z. Contact them.

For backward compatibility, we can only use \ followed by a Latin letter, because \ followed by a character that is neither a letter nor a digit is often used in regular expressions to mean the literal character. For many characters (., ?, (, [, etc.) this is necessary; for others (', ", /) it may be necessary due to the syntax of the programming language in which the RE is embedded; and the rest are often escaped just because the author was not sure whether it was necessary. So \ followed by a Latin letter is the only option.
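This is easy to check against the current `re` module: a backslash followed by an ASCII punctuation character is accepted as that literal character, so giving any of these a new meaning would change existing patterns. A quick sketch (the particular set of characters tested is just a sample):

```python
import re

# Each pattern is a backslash followed by one punctuation character,
# e.g. \' or \` or \<, matched against that bare character.
sample = "'\"/`<>"
literal_escapes = {
    ch: re.fullmatch("\\" + ch, ch) is not None
    for ch in sample
}

print(literal_escapes)
```

Every entry comes out as True, including the glibc-style \` and \' and the word-boundary-style \< and \>.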

Another option is to use the (? prefix (you also need the closing ) for readability), but it is too verbose for anchors.

I agree that \' and \` are best avoided, regardless of whether we adopt \z. The trick to typing it is to add an extra ` and surround the inner part with spaces (i.e. `` \` ``), but this isn’t obvious!

That discussion was started by one of my peers on the Austin Group, part of the same rationalization attempt…

Will do, I’ll let our committee know.

Thanks! I just wanted to make sure the disagreements were known before moving forward. I don’t have super strong opinions on it, but I was sympathetic to \` and \' since glibc, FreeBSD, and Boost already support them. But likewise, \z is supported by many languages.