Add "re.prefixmatch()", deprecate "re.match()"

I’ve spent, and continue to spend, time debugging bugs that are fixed by replacing re.match() with re.search(), over and over again.

It’s one of those old confusing quirks for which there’s a (more or less) obvious solution.

Surprisingly, I couldn’t find any existing proposal to deprecate re.match(). So I suggest we discuss it.

Also, perhaps we may want to add re.test(), which returns a bool, for the cases where we don’t need a match object.
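
To make the confusion concrete, here is a minimal sketch of the kind of bug I keep fixing, plus one possible reading of what a re.test() helper could amount to. The strings, pattern, and helper are invented for illustration, not taken from any real codebase:

    import re

    line = "WARNING: disk error on /dev/sda"

    # Buggy: re.match() only tries the pattern at the *start* of the string,
    # so this never finds "error" unless the line happens to begin with it.
    if re.match(r"error", line):
        print("buggy check fired")

    # Intended: re.search() scans the whole string.
    if re.search(r"error", line):
        print("correct check fired")

    # One possible reading of the re.test() idea: just the boolean question.
    def test(pattern, string, flags=0):
        return re.search(pattern, string, flags) is not None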

10 Likes

And another possible source of bugs is when re.match() was used while re.fullmatch() was intended. However, I haven’t encountered this in practice.

3 Likes

To convince people that re.match() should be deprecated, you have to show that the cost of deprecation and replacement is lower than the cost of maintaining the status quo. So my first question to you is: Do you know how extensive the costs are of a change like this? Before even discussing which cost is greater, be sure you actually understand what you’re asking for here. Is there a planned removal for re.match()? If so, how soon? If not, why not?

Have you observed other deprecations and how they have been accepted?

Then, make your case. What are the costs of the status quo? How frequently do you run into this problem yourself? How much code out there is likely to be buggy? Is that code part of maintained software that is likely to be fixed if the deprecation goes through, but isn’t otherwise going to be changed? What are all the costs associated with NOT deprecating this?

You’re going to need to do a lot of work here, a lot of research… but on the plus side, this is a proposal relating to regular expressions, and most of your research is going to be doing regular expression searches through large corpora of code (e.g. searching GitHub) :slight_smile:

3 Likes

Rather than deprecating match and search, why not propose easier-to-remember names as aliases?

For “match”, alias as “startswith”.
For “search”, alias as “contains” (not sure this is the best choice).

5 Likes

A soft deprecation is the obvious choice.

I almost never see a re.match() where the author of the code really intended to match the prefix. What you can often see is re.match(r'^ ...') with an anchor when one needs to match the prefix, because almost nobody who uses it knows the secret that it already anchors the match at the start. And if there was no anchor, it eventually gets debugged and replaced with re.search().

And a clear indication that re.match() wasn’t really intended is regex patterns like r'\b ...' or r'(?<! ... ) ...'. Such code can probably be found on GitHub, including in the commit history. Sure, if the idea gets traction to the point of drafting a PEP, somebody may be inclined to gather the statistics to prove the need for deprecation.
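
As a hypothetical example of that tell-tale pattern (the text and pattern are invented):

    import re

    text = "see the error log for details"

    # A pattern starting with \b only makes sense if the author expects the
    # engine to scan forward, but re.match() pins it to index 0, so this
    # returns None even though "error" is clearly in the text.
    print(re.match(r"\berror\b", text))    # None

    # What was almost certainly meant:
    print(re.search(r"\berror\b", text))   # <re.Match object; span=(8, 13), match='error'>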

I assume you just mean by this that the docs would suggest not to use it. I don’t think that is necessary, but perhaps the docs could be a bit clearer about when to use or not use it. The docs already explain the difference, though, and there is even a search-vs-match section. I don’t generally use regex, but I’m pretty sure that in any situation where I might, I would be much more likely to want fullmatch or search rather than match, and that seems pretty clear from reading the docs.

What exactly would you propose to change about the docs? A PEP is not needed just to add some clarifying text to the docs.

3 Likes

My point is that it isn’t obvious. You may think it’s obvious based on your personal experience, but that’s an argument that has to be made. You cannot assume that the rest of us already agree with you.

But let’s suppose that soft deprecation is all that happens. In other words, there is no date at which the existing API is to be removed. All you’re doing is putting a note in the docs saying “use this alias instead”. Okay. So, suppose you’re developing some software. You have a choice: use re.match(), which will work on all existing versions of Python and all planned future versions as well, or use re.search() with an anchored regex, which will also work on all existing and all planned versions, but will be less efficient. Which do you choose? Does the deprecation make any difference here?

You are, of course, free to replace all uses of re.match in your own code with re.search. That’s fine. Nothing wrong with it. But the deprecation won’t actually add anything to that argument, unless you can show that there is real benefit to be gained here.

Like I said, you’re going to need to do some research here. “I almost never see” isn’t enough of an argument. How many cases of this do you find on GitHub (or some other large corpus of code)? How many major projects have this happening?

No, the time to get those statistics is now. You won’t get traction for any further steps otherwise. And “somebody may be inclined to”? Are you asking someone else to do the work for you? If so, go change this in your own codebase only, and don’t ask for deprecation. If you want to push for a language change, you have to be prepared to do your own research.

3 Likes

There is quite a disappointing number of re.match("^...")s out there: 418k out of 926k uses of re.match() directly on a literal.

But these renames never pay off the cost of changing everything. Even if re.match is only soft-deprecated, there will still be linters and IDEs and drive-by PRs pushing people to change code that has nothing wrong with it.

4 Likes

Here’s an interesting regexp that came up recently:

r"\d+\s+"

What’s the big deal? Run it with .match() and it returns “almost instantly” even if the target string doesn’t match.

But run it with .search() on a string like "5" * N (which can’t succeed), and it takes time quadratic in N to fail. But N has to be in the thousands before this becomes very noticeable.

I don’t believe I’ve ever seen a discussion of this kind of failure mode. I’ll leave it to you to figure out why it happens :wink:

This is not a case of “catastrophic backtracking” (which consumes time exponential in N to fail to match); it’s just a consequence of how .search() works. There appears to be nothing you can do to the regexp to make it fast in all cases. Using a possessive \d++ instead does speed it up quite a bit, but it’s still quadratic time.

Also true under the very capable regex extension module, which is immune to many ways to try to provoke exponential time behavior. It’s no faster in this case than the core’s re module.
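
A rough timing sketch of the effect (the sizes are arbitrary and the exact numbers will vary by machine):

    import re
    import time

    pat = re.compile(r"\d+\s+")

    for n in (10_000, 20_000, 40_000):
        s = "5" * n
        t0 = time.perf_counter()
        pat.search(s)   # fails, and takes roughly 4x longer each time n doubles
        print(n, time.perf_counter() - t0)

    # pat.match(s) on the same strings fails quickly: it only ever tries
    # position 0, so the cost is linear in n instead of quadratic.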

Personally, I almost always use “match” instead of “search”. But then I don’t use regexps to try to do “too much” at a time. I use it more like a flexible lexer, to pick off “the next” token in an input string, typically passing a “start index” argument too to a compiled pattern.
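
For concreteness, a small sketch of that lexer-ish style, using a compiled pattern’s pos argument; the token pattern here is just an invented example:

    import re

    TOKEN = re.compile(r"\s*(\d+|[a-zA-Z_]\w*|[+\-*/()])")

    def tokens(text):
        pos = 0
        while pos < len(text):
            m = TOKEN.match(text, pos)   # match *at* pos; never scan ahead
            if m is None:
                raise SyntaxError(f"bad input at index {pos}")
            yield m.group(1)
            pos = m.end()

    print(list(tokens("12 + spam*3")))   # ['12', '+', 'spam', '*', '3']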

Deprecating match would just annoy people like me a lot :wink:

6 Likes

Could be confusing since str.startswith returns a bool.

2 Likes

Yes, there are at least 2 sources of quadratic slowdown.

This is a matter of optimization of the RegEx engines.

Then again, when one needs to do something that is “too much” for RegEx, they resort to other tools. So nobody is inclined to invest heavily in optimization of the general-purpose RegEx engines in scripting languages.

re.prefixmatch() vs re.match() is a matter of clarity and bug avoidance, and naturally such cases may involve the performance vs correctness tradeoff.

Soft deprecation can be done without a warning, both in the interpreter itself and in linters. It could be understood as “avoid using in new code”.

But WHY avoid using it? You’re proposing creating a new API that won’t work on any older version of Python, which has to compete with an old API that works on all older versions and all new versions. What is the point of avoiding the old API that works just fine, and will continue to work?

Soft deprecation is utterly meaningless unless there is some real benefit to using the new API, and a simple rename seldom achieves that.

2 Likes

While I don’t expect this will make progress, I think you’d have a much better chance of adding a wordier alias for match than deprecating anything (“soft” or not). “match” has been there for 3 decades, and a great many have never had any notable problem with the name. Some people do, especially newbies. But it’s generally a shallow learning curve they quickly climb. Your:

is something I hadn’t heard of before. The lack of “me too!” responses in this topic suggests it’s not part of many others’ experience either.

BTW, the “newbie confusions” fell after a suggestion of mine made many years ago: instead of listing the re module’s functions in alphabetical order, put “search” before “match”. While it’s not how I happen to use the module, I did (& still do) believe most newcomers are looking for “search”.

3 Likes

There was limited support for an alias some years ago in this issue:

2 Likes

See Proposal: re.prefixmatch method (alias for re.match) · Issue #86519 · python/cpython · GitHub.

I do not think this is the way to go, because it would not solve any real problem, but would cause a worldwide code churn on par with 2→3.

A more common error is when re.search() is used instead of `re.match()`. I have encountered this many times. It can go unnoticed for a long time because it “works” if you only use it for expected input and a limited variety of invalid input (even if there may be a small performance impact).

This is a pretty common error too.
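
A made-up sketch of how the search-instead-of-match error can hide (the validation function and inputs are invented for illustration):

    import re

    def looks_like_version(s):
        # Buggy: re.search() accepts the pattern *anywhere* in the string.
        return re.search(r"\d+\.\d+\.\d+", s) is not None

    print(looks_like_version("1.2.3"))             # True, as expected
    print(looks_like_version("rm -rf / # 1.2.3"))  # also True: junk slips through
    # With re.match() (or better, re.fullmatch()) the second call is rejected.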

1 Like

Okay. Me too!

I always have to read the re docs when I use these methods, as I cannot keep the search or match semantics in my head.

Aliases like match_at_start and search_within would work for me.

7 Likes

Replying to multiple comments:

re.match is horrible. Yuk. Confusing. Why do y’all do that?

But lacking a time machine, the “fix” is far worse than the status quo, for all the reasons listed above.

As far as the “disappointing” instances of match('^ ...') go, I’m probably using re.search('^whatever') instead.

  • Not all my regexes are Python – I’m so old I still use sed for quick edits. So consistently using the front anchor ^ is just easier.
  • A typical use case is going to be dominated by the I/O of reading whatever I’m regexing, anyway.
  • My re mental model is: search good, match bad.
  • If I’m processing sufficient volume that I care about the performance of matching prefixes, I’m just using str.startswith.

Here are some benchmarks:

Test                  | 3.8  | 3.9  | 3.10 | 3.11 | 3.12 | 3.13 | 3.14
----------------------+------+------+------+------+------+------+-----
re.match('dog')       | 0.94 | 0.90 | 0.96 | 0.84 | 0.96 | 0.90 | 0.85
re.match('^dog')      | 0.96 | 0.92 | 0.98 | 0.87 | 0.98 | 0.93 | 0.88
re.search('^dog')     | 1.18 | 1.15 | 1.24 | 0.92 | 1.00 | 1.02 | 0.97
str.startswith('dog') | 0.34 | 0.31 | 0.33 | 0.31 | 0.33 | 0.16 | 0.14
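
For reference, a sketch of how numbers of this shape could be gathered with timeit; the test string and iteration count are my own assumptions, not necessarily what produced the table above:

    import re
    import timeit

    TEXT = "dog eat dog world"

    cases = {
        "re.match('dog')":       lambda: re.match("dog", TEXT),
        "re.match('^dog')":      lambda: re.match("^dog", TEXT),
        "re.search('^dog')":     lambda: re.search("^dog", TEXT),
        "str.startswith('dog')": lambda: TEXT.startswith("dog"),
    }

    for name, fn in cases.items():
        print(name, timeit.timeit(fn, number=1_000_000))
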
4 Likes

Me too!

I’m inclined to agree with their opinion; I get suspicious whenever re.match() is used.

When I was a newbie, it took a long time for me to ingrain the difference between re.match(), re.fullmatch(), and re.search(). These days I tend to just ignore re.match() and re.fullmatch() and always use the respective re.search() equivalent, because it seems cleaner to me to have the pattern-matching intent baked into the pattern rather than the method name.

Using the \A and \z anchors, re.search() can do everything re.match() and re.fullmatch() can:

  • re.match(p) → re.search(rf"\A{p}")
  • re.fullmatch(p) → re.search(rf"\A{p}\z")
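
A quick sanity check of those equivalences (written with \Z, the long-standing spelling of the end-of-string anchor in re, in case \z is not available):

    import re

    p = r"\d+"
    for s in ("123", "123abc", "abc123"):
        assert bool(re.match(p, s)) == bool(re.search(rf"\A{p}", s))
        assert bool(re.fullmatch(p, s)) == bool(re.search(rf"\A{p}\Z", s))
        print(s, bool(re.match(p, s)), bool(re.fullmatch(p, s)))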

So my perfect world would have re.search() renamed to the better-sounding ‘re.match()’, would remove re.match() and re.fullmatch() completely, and would not worry so much about the minor performance penalty.

However, I understand that these name changes aren’t really possible at this stage. The current names are not ideal but they’re not terrible, so my vote is to do nothing.

4 Likes

Yeah, name-swapping virtually guarantees that there’s no way to write properly-compatible code, so, that’s basically never gonna happen.

2 Likes

We, as a language, need to stop looking towards the past and look towards what makes it possible to write more clearly understandable code without the need for special domain knowledge and reference manuals. That’s why I put up that issue and PR implementing this in the first place.

We need re.prefixmatch.

re.match’s meaning is a real footgun problem that people continually trip over in Python.

Fixing it does not require getting rid of re.match. All we need is the trivial feature using the proper self-explanatory name (see the PR) to provide a well-lit path via the actually understandable name.

prefixmatch provides a clear way past the footgun. There is no requirement for all existing code to be updated, and there never will be. But complaining that it makes things worse, with some theoretical problem of projects receiving PRs to “fix” things that aren’t broken, is focusing on yesterday instead of the future. Those are non-problems compared to enabling code to be more understandable.
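
For concreteness, the kind of thing the alias amounts to, sketched as a user-level shim (this is not the actual PR, which adds the name to the re module itself):

    import re

    def prefixmatch(pattern, string, flags=0):
        """Match pattern only at the start of string, exactly like re.match()."""
        return re.match(pattern, string, flags)

    print(prefixmatch(r"\d+", "123abc"))   # matches the '123' prefix
    print(prefixmatch(r"\d+", "abc123"))   # None: the string does not start with digits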

28 Likes