Hello - I’m with the Austin Common Standards Revision Group - the joint technical working group established to develop and maintain the core open systems interfaces that are the POSIX™ 1003.1 (and former 1003.2) standards, ISO/IEC 9945, and the core of the Single UNIX Specification.
Our working group has come to the conclusion that \A and \z are widely implemented across many ecosystems and are the most “standard” solution to the issue. We are asking if the Python community would consider adding “\z” as a synonym for “\Z” in their regex lexicon.
I support standardization. It is sad that Python’s \Z is not compatible with other implementations.
The early history of regular expression support in Python is complex, but it seems that the modern syntax was based on PCRE 0.95. At that time, \Z meant the end of the buffer. Its meaning was changed only in PCRE 2.0, when \z was added.
Yes, it was a mistake on the PCRE side, which Python inherited.
I do not think there are any downsides to introducing \z (except that it may already be used in our tests as an example of an “illegal sequence”). Several years after that (not earlier than 3.16 or 3.17) we can start to deprecate \Z.
BTW, I planned to migrate most uses of $ to \Z, but this is a lot of work that needs many tests, so I have not finished it yet.
It is safer to remove it than to change its behavior. It is equivalent to \n?\z. Users who need this can just use \n?\z. But most users would use \Z incorrectly, from old memory or by copying from old books and internet forums.
We can do this only when \z exists in all maintained Python versions.
“it” in the quote does not refer to current behavior, but rather the behavior of a theoretical standards-conformant version. Hence the suggestion to just remove it rather than confuse people.
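To make the difference concrete, here is a minimal sketch using the standard library `re` module. It assumes Python's current semantics, where \Z matches only at the very end of the string, and it emulates the newline-tolerant (Perl/PCRE-style) \Z with a lookahead. It deliberately avoids \z itself, which may not be available in the Python version at hand; the variable names are only illustrative.

```python
import re

text = "abc\n"

# Python's current \Z: matches only at the very end of the string,
# so a trailing newline prevents the match.
strict_end = re.search(r"abc\Z", text)

# $ without re.MULTILINE: matches at the end of the string or just
# before a single trailing newline.
dollar_end = re.search(r"abc$", text)

# Emulation of the newline-tolerant \Z described above as \n?\z,
# written as a lookahead so the newline is not consumed.
lenient_end = re.search(r"abc(?=\n?\Z)", text)

print(strict_end is None)       # True
print(dollar_end is not None)   # True
print(lenient_end is not None)  # True
```

Code that was ported from another regex flavour and expects \Z to tolerate a final newline would silently stop matching under Python's strict interpretation, which is the kind of confusion the posts above are concerned about.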
I sent a message on the bug report, but think that it is worth mentioning here too. There is a discussion on the glibc mailing list about what syntax should be standardized taking into account existing implementations [1]. I think that discussion should be resolved before changing anything.
I borrowed \m and \M from TCL.
The reasoning behind using them instead of \< and \> is as follows.
All metacharacters, eg ^ and $, are punctuation, and all escape sequences consist of \ followed by A-Z, a-z or 0-9. If you see \ followed by any character other than A-Z, a-z or 0-9, you know that it's a literal, eg \$ is a literal.
If you saw \< you might think that it's also a literal, perhaps because < itself is a metacharacter. The escaping would be inconsistent.
I think that \< and \> may have originated with the BRE syntax in which ( and ) are literals and \( and \) are for capturing.
I think that \' and the backtick equivalent (which I can’t even work out how to type in Markdown) are ugly and awkward to type. So regardless of what the standards end up as, I’d want \A and \z in Python[1].
[1] If the glibc version got standardised, I’d reluctantly accept having it as well, as a discouraged version available for compatibility only, but that’s as far as I’d go.
For backward compatibility, we can only use \ followed by a Latin letter, because \ followed by a non-letter, non-digit character is often used in regular expressions to mean the literal character. For many characters (., ?, (, [, etc.) this is necessary; for others (', ", /) it may be necessary due to the syntax of the programming language in which the RE is embedded; and the rest are often escaped just because the author was not sure whether it was necessary. So \ followed by a Latin letter is the only option.
The other option is to use the (? prefix (you also need the closing ) for readability), but that is too verbose for anchors.
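As an illustration of the backward-compatibility point, the following minimal sketch checks how the standard library `re` module treats a backslash followed by punctuation versus a backslash followed by a letter. The helper `is_valid` is hypothetical, and `\q` is just an arbitrarily chosen letter escape assumed to be unassigned in current Python versions.

```python
import re

# A backslash followed by punctuation is treated as the literal character,
# whether or not the escape was strictly necessary.
escaped_literals = [(r"\$", "$"), (r"\<", "<"), (r"\'", "'"), (r"\`", "`")]
matches_literal = {
    pattern: re.fullmatch(pattern, literal) is not None
    for pattern, literal in escaped_literals
}


def is_valid(pattern: str) -> bool:
    """Hypothetical helper: report whether `re` accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(all(matches_literal.values()))  # True
print(is_valid(r"\<"))                # True
# \q is assumed to be an unassigned letter escape; `re` rejects it.
print(is_valid(r"\q"))                # False
```

Because existing patterns may already contain sequences such as \' or \<, giving those sequences an anchor meaning would change what such patterns match, whereas an unassigned letter escape currently raises an error and so cannot appear in working code.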
I agree that \' and \` are best avoided, regardless of whether we adopt \z. The trick to typing it is to add an extra ` and surround the inner part with spaces (i.e. `` \` ``), but this isn’t obvious!
That discussion was started by one of my peers on the Austin Group, part of the same rationalization attempt…
Thanks! I just wanted to make sure the disagreements were known before moving forward. I don’t have super strong opinions on it, but I was sympathetic to \` and \' since glibc, FreeBSD, and boost already support them. But likewise \z is supported by many languages.