Add "re.prefixmatch()", deprecate "re.match()"

+1. Maybe it will have better luck now? I’ll note that the Steering Council recently approved the PEP introducing math.integer, which will end up duplicating a pile of math functions in the new math.integer submodule. No code needs to change - they’ll continue to live in math too.

It’s much the same “look to the future” motivation: math has been becoming an increasingly incoherent mish-mash of functions covering very different kinds of applications. Floating-point approximations and exact integer results live in different conceptual worlds.

Too late to rewrite history, but we can supply clearer ways to spell intent for the future.

Don’t want to leave people in the dark with that tease. Offline, @Stefan2 shared an extremely clever idea: use this regexp instead:

r"(?<!\d)\d+\s+"

The negative lookbehind assertion makes the regexp blazing fast in all cases, linear time at worst for search() even in fails-to-match cases. A possessive \d++ is probably faster still, but even without it, it’s so fast you wouldn’t care.

In failing cases with very long strings of digits, search still tries every possible starting position, but each attempt after the first fails at once. The search at the previous index failed then, so can’t possibly succeed simply starting from a later adjacent digit, and the assertion weeds out such cases nearly instantly.

The logic behind it may not be obvious at first glance, but it’s spot-on.
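For anyone who wants to try this at home, here is a minimal sketch. The helper name `elapsed`, the sample strings, and the run length of 2000 digits are my own choices for illustration. The timings are printed rather than claimed, since they vary by machine and Python version.

```python
import re
import time

plain = re.compile(r"\d+\s+")
# The lookbehind only lets a match start at the first digit of a run.
guarded = re.compile(r"(?<!\d)\d+\s+")


def elapsed(pattern, text):
    """Return (result, seconds) for a single pattern.search(text) call."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


# On ordinary input both patterns find the same matches.
sample = "abc 123   def 4567 "
plain_hits = [m.group() for m in plain.finditer(sample)]
guarded_hits = [m.group() for m in guarded.finditer(sample)]

# Failing case: a long run of digits with no whitespace after it.
digits = "7" * 2000
plain_result, plain_secs = elapsed(plain, digits)
guarded_result, guarded_secs = elapsed(guarded, digits)
print(f"plain:   {plain_secs:.6f}s")
print(f"guarded: {guarded_secs:.6f}s")
```

Both searches fail on the all-digits string, but the plain pattern rescans the rest of the run from every starting position, while the guarded one gives up at once everywhere except the first digit.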

So one could optimize the regex engine by adding a preprocess step to the compilation of patterns, and it would apply transformations like:

  • r"\d+\s+" → r"\d++\s++"

  • r"\d+\s+" → r"(?<!\d)\d+\s+"

(Which would be a technique similar to one employed in Python’s Powersort, which consists of preprocess and merge steps.)

However, this may require a considerable effort if one wants to (correctly) optimize beyond trivial cases.
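A small sketch of why "correctly" is the hard part, assuming Python 3.11 or later for possessive quantifiers (the code falls back to `None` on older versions). The sample strings and variable names are made up for illustration. The possessive rewrite is safe for `\d+\s+` because `\s` can never match what `\d+` consumed, but the same rewrite changes the result for a pattern like `\d+5`.

```python
import re
import sys

# Possessive quantifiers were added to the re module in Python 3.11.
HAS_POSSESSIVE = sys.version_info >= (3, 11)

greedy = re.compile(r"\d+\s+")
possessive = re.compile(r"\d++\s++") if HAS_POSSESSIVE else None

# Safe case: digits and whitespace are disjoint, so giving up
# backtracking cannot lose a match.
sample = "id 42  next 7 end"
greedy_hits = [m.group() for m in greedy.finditer(sample)]
possessive_hits = (
    [m.group() for m in possessive.finditer(sample)] if possessive else None
)

# Unsafe case: \d+ must give back the final "5" for the pattern to match.
unsafe_greedy = re.search(r"\d+5", "125")
unsafe_possessive = re.search(r"\d++5", "125") if HAS_POSSESSIVE else None
```

On 3.11+ the last search returns `None`, because the possessive `\d++` swallows the whole run and never hands the trailing `5` back.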

If because of some superstition you don’t want to use re.match(), simply add \A at the beginning of the pattern. re.search() is smart enough to stop matching beyond the start in that case.
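A quick sketch of the equivalence, with sample strings of my own choosing. It also shows why `\A` rather than `^` is the spelling to reach for: under `re.MULTILINE`, `^` can match after any newline when used with `search()`.

```python
import re

text = "  42 apples"

# \A pins the pattern to the start of the string, so search() behaves
# like match() here: both fail because the text starts with spaces.
via_match = re.match(r"\d+", text)
via_search = re.search(r"\A\d+", text)

# Without the anchor, search() finds the digits later in the string.
unanchored = re.search(r"\d+", text)

# With re.MULTILINE, ^ also matches right after a newline, \A does not.
multiline_text = "x\n42"
caret_hit = re.search(r"^\d+", multiline_text, re.MULTILINE)
start_hit = re.search(r"\A\d+", multiline_text, re.MULTILINE)
```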

People who like backtracking engines are well advised to check out the very popular and very capable “regex” extension package:

It works much harder at optimizing than Python’s “re”, and offers a world of additional features. For example, it’s immune to many cases of “catastrophic backtracking” that most other backtracking engines (including CPython’s) fall prey to.

Not really - and I wrote “timsort”, so you can trust me on that :wink: “Powersort” is a tiny part of it, just concerned with picking a good order in which to merge runs of possibly wildly different lengths. The sort started life with a different homegrown merge strategy, which was much harder to reason about. Powersort was a huge improvement on that count. But in real life, it’s not often actually faster. It’s still a linear-time approximation to a problem whose optimal solution requires quadratic time (in the number of runs).

Any backtracking regexp engine is a major effort to write at all. The regexp “language” has become very elaborate. regex has kept up with that, long offering features like \K, (*SKIP),(*PRUNE), and “recursive regexps”, which are becoming widespread.

Like there will be no requirement for all existing code to stop using threading.Thread().setDaemon(True) or threading.currentThread() when someone decided those sheds were the wrong color.

Given that those APIs still exist, there are tons of “better” examples where old names were actually removed. Like the entirety of Python 3. I can also come up with arbitrary examples of things that’ve been given better names whose old names still exist, and of ones whose old names do not. So what? We’re never going to be historically consistent as a project made by hundreds of people over decades.

My PR does not deprecate anything. It makes the docs clear about being more explicit about intent, in a readable manner, when possible.

Ignore the title of this thread, we’d never plan to remove the old re.match name. We don’t even want a deprecation warning on it.

I did the research on other languages a few years ago. The original is in this feature request comment from 4 years ago, reproduced here for reference as it illustrates how Python is the unusual API:

What do other APIs in widely used languages do with regex terminology? We appear to be the only popular language that anchors to the start of a string with an API named “match”.

libpcre C: uses “match” to mean what we call “search” - pcre2_match specification

Go: Uses “Match” to mean what we call “search” - regexp package - regexp - Go Packages

JavaScript: Uses “match” to mean what we call “search” - String.prototype.match() - JavaScript | MDN

Java: Uses “matches” (I think meaning what we call fullmatch?) - Pattern (Java Platform SE 7 )

C++ RE2: explicit “FullMatch” and “PartialMatch” APIs - GitHub - google/re2: RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

Java re2j: uses “matches” like Java regex.Pattern - GitHub - google/re2j: linear time regular expression matching in Java

Ruby: Uses “match” as we do “search” - Class: Regexp (Ruby 2.4.0)

Rust: Uses match as we do “search” - regex - Rust
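For comparison with the list above, here is how Python’s own three spellings behave on the same inputs. This is just a small illustration with sample strings I picked.

```python
import re

pattern = re.compile(r"\d+")

# "abc 123": the digits are not at the start of the string.
at_start = pattern.match("abc 123")        # anchored at the start -> None
anywhere = pattern.search("abc 123")       # unanchored -> finds "123"
whole = pattern.fullmatch("abc 123")       # whole string must match -> None

# "123abc": the digits are a prefix, but not the whole string.
prefix = pattern.match("123abc")           # matches "123"
not_whole = pattern.fullmatch("123abc")    # -> None
exact = pattern.fullmatch("123")           # matches "123"
```

So Python’s match() is the start-anchored operation, and search() is what most of the APIs listed above call “match”.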

So if something can be deprecated on what… code style? Really?… then why would re.match() fare any better when it’ll actually have a functional reason for needing removed –“If prefixmatch matches the prefix then surely regular match matches something else, right?”

Maybe you specifically have no intention for deprecating it but I don’t trust that someone else won’t. That trust has been broken every year by each Python version’s needless breaking changes like that one. There’s no joy in new Python releases anymore – they’re too overshadowed by the forced mass refactors of every piece of Python I still care about that they bring, so many of which are just dubiously beneficial renames. It burns me out more than any of the other stereotyped ways to make an OSS maintainer give up and quit.

This adds a bunch of lines to the documentation a new user trying to figure out how the re module works shouldn’t have to read:

re — Regular expression operations — Python 3.15.0a5 documentation

prefixmatch() vs. match()

Why is the match() function and method name being discouraged in favor of the longer prefixmatch() spelling in very recent Python?

Many other languages have gained regex support libraries since regular expressions were added to Python. However in the most popular of those, they use the term match in their APIs to mean the unanchored behavior provided in Python by search(). Thus use of the plain term match can be unclear to those used to other languages when reading or writing code and not familiar with the Python API’s divergence from what has otherwise become the industry norm.

Quoting from the Zen Of Python (python3 -m this): “Explicit is better than implicit”. Anyone reading the name prefixmatch() is likely to understand the intended semantics. When reading match() there remains a seed of doubt about the intended behavior to anyone not already familiar with this old Python gotcha.

We will never remove the original match() name, as it has been used in code for over 30 years.

The new Python 3.15 user just wants to use Python. They don’t want a history lesson.

If we must be renaming this (or additional functions), let’s make it explicit. I’ve started a new topic: https://discuss.python.org/t/function-alias-proposal

I had this thought as well. I don’t think adding prefixmatch is going to reduce confusion, it’s probably going to increase it.

There’s a lot of decisions in python that haven’t stood the test of time. match isn’t a good name when pretty much every other regex implementation uses this to mean something different. But unless we’re looking for a python 4, I’d rather leave this untouched. These “small changes” over time are in some ways worse than any breakage I experienced in the 2->3 transition, because they are an ongoing thing with more and more of them every so often.

The parallel to other APIs changing doesn’t sit well with me. math.integer is an example of doing it better, a new namespace offers new opportunity: why not a re.modern namespace that has a set of modern APIs consistent with expectations?

Fair point! Likely better:

  • Rename match to prefixmatch in the current docs.
  • Add an entry for match, saying just that it’s an alias for prefixmatch, which latter name is preferred now for clarity.
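If the new name does land as a plain alias, code that also has to run on older Pythons could pick it up with a small shim. This is only a sketch: it assumes `re.prefixmatch` is the same operation as `re.match`, as described in this thread, and falls back to `re.match` where the attribute is missing.

```python
import re

# Use re.prefixmatch where the running Python provides it; otherwise fall
# back to re.match, which this thread describes as the same operation.
prefixmatch = getattr(re, "prefixmatch", re.match)

hit = prefixmatch(r"\d+", "2024-01-01")   # digits at the start -> match
miss = prefixmatch(r"\d+", "x2024")       # digits not at the start -> None
```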

Please note that “(longest) prefix matching” is a defined term in (text) pattern matching and does not correspond to what a renamed re.match() method/function would do.

Instead, it refers to “the problem of finding which string from a given set is the longest prefix of another, given string”.

See e.g. Fast Prefix Matching of Bounded Strings or Fast Longest Prefix Matching: Algorithms, Analysis, and Applications

A better name would be re.startswith(), since that would simply be a regexp version of the string method of the same name.

I like the startswith bikeshed color.

A bit late to the conversation here, but would changing the name of the function be more beneficial than adding an anchor flag or something similar? e.g. re.match(…, anchor=False) or re.match(…, anchor=True)

Or possibly just using the string flag style that open uses, re.match(…, anchor='front')?

This avoids the issue of having an alias that isn’t backwards compatible, but I guess changing the signature has its own downsides, since that argument isn’t expected in older versions.

Could be more explicit for someone reading what your intention is with the function call, especially since people will likely continue using match either way.
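To make the idea concrete, here is a rough sketch of it as a standalone helper rather than a change to re.match itself. Everything here is hypothetical: the function name `match_with_anchor`, the `anchor` keyword, and the `"full"` value are my own inventions for illustration and are not part of the re module.

```python
import re


def match_with_anchor(pattern, string, flags=0, anchor="front"):
    """Hypothetical helper: pick the matching mode from a keyword.

    anchor=None    -> unanchored, like re.search()
    anchor="front" -> anchored at the start, like re.match()
    anchor="full"  -> anchored at both ends, like re.fullmatch()
    """
    modes = {None: re.search, "front": re.match, "full": re.fullmatch}
    try:
        func = modes[anchor]
    except KeyError:
        raise ValueError(f"unknown anchor: {anchor!r}") from None
    return func(pattern, string, flags)


found = match_with_anchor(r"\d+", "ab 12", anchor=None)      # like search()
front = match_with_anchor(r"\d+", "ab 12", anchor="front")   # like match()
```

The call site then states the intent, at the cost of a keyword that older versions would not know about.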

I don’t really think the alias is worth adding. But if it is added, I suggest that the documentation be kept short and sweet. e.g.,

re.match(pattern, string, flags=0)
re.startswith(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding Match. ... [rest of current docstring]

match is the original name for this function. Users are advised to prefer startswith where possible.

> Added in 3.15: startswith is a new alias for match.

I think it could be confused with str.startswith(). Perhaps regex matching of a prefix should be named differently, clearly signalling different semantics (fixed string comparison vs. pattern matching).

Also, re.startswith() instead of re.prefixmatch() will then be inconsistent in naming with respect to re.fullmatch().

I don’t think this matters. They’re methods on different objects/modules. I don’t think there’s much risk of confusion here.

This concerns me a little, but I think it’s a minor point. For me, I use match/search much more often than fullmatch. I always have to look up what match/search do (although the up-thread suggestion of “search within a string” helps). I’ve never been confused about search vs. fullmatch or match vs. fullmatch.

By the way, if we look at examples like

# 1
re.startswith(pattern, string)

# 2
compiled_pattern = re.compile(pattern)
compiled_pattern.startswith(string)

# 3
string.startswith(substring)

There seems to be inconsistency, linguistically. For the same verb (if we view startswith as a verb in a sentence), we have inconsistent order of subject and object (natural order in example #3, reversed order in example #2).

And another consideration is, are we compelled to add re.endswith() then? And do we need to invent a new name for re.fullmatch()?