Add "re.prefixmatch()", deprecate "re.match()"

I’ve run into this a number of times too, and my preference is to always use search and be explicit with ^ and/or $. I don’t think the match and fullmatch functions provide enough convenience to compensate for their potential for inadvertent misuse. Yeah, it’s not that complicated, but it does seem to trip a lot of people up, at least in the beginning.

Regarding the search versus match section in the docs that another commenter mentioned, I think that’s an indication that there’s an issue here (although not a huge issue). Since match doesn’t provide anything other than the tiniest bit of convenience, it seems like it would be easier for most people if the search docs just called out the need to use ^ when you need to match at the beginning.
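As a rough illustration of the inadvertent misuse in question (the pattern and input below are made up for the example), match accepts trailing junk that an explicitly anchored search rejects:

```python
import re

# Hypothetical validation pattern: "one or more digits".
DIGITS = r"\d+"

# match() only anchors at the start, so trailing junk is silently accepted.
accepted_by_match = re.match(DIGITS, "123abc") is not None

# search() with explicit anchors states the intent in the pattern itself.
accepted_by_search = re.search(r"^\d+$", "123abc") is not None

# Anchoring only the start reproduces what match() does.
prefix_only = re.search(r"^\d+", "123abc")
```

Here accepted_by_match is True while accepted_by_search is False, which is the surprise people tend to hit when they use match for validation.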

I also like the idea of re.test function rather than comparing the match against None, but that’s more of a nice-to-have than a major issue IMO.
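For what it’s worth, such a helper is only a couple of lines in user code. A minimal sketch, assuming search semantics; the name re_test is hypothetical and nothing like it exists in the re module today:

```python
import re

def re_test(pattern, string, flags=0):
    """Hypothetical helper: True if pattern is found anywhere in string."""
    return re.search(pattern, string, flags) is not None

has_digit = re_test(r"\d", "abc1")
no_digit = re_test(r"\d", "abc")
```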

Overall, I agree with the sentiment, although I would go with just search in an ideal world. I also agree that a soft (docs) deprecation is probably the only feasible approach, but I don’t think that’s likely to ever happen.

3 Likes

The lack of “me too!” comments at this point (5 days after the OP) might just be due to the fact that some people only get weekly digests from this site and haven’t gotten around to replying yet.

Maybe:

  • match_start (new name for match)
  • match_end
  • match_full (new name for fullmatch)
  • match_in (new name for search)
  • match_all (new name for findall)

Or add a second parameter: pattern.match(text, how="start").

This gives a consistent interface for all methods that return a Match-object. Probably not worth it, though.
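A rough sketch of what the second option might look like as a wrapper in user code. The function name match_with_mode and the how values are hypothetical, and it assumes str patterns without leading inline flags such as (?i), since the "end" case wraps the pattern in a group. The "all" case is left out because findall returns a list rather than a Match object:

```python
import re

def match_with_mode(pattern, text, how="start", flags=0):
    """Hypothetical unified entry point; not part of the re module."""
    if how == "start":
        return re.match(pattern, text, flags)
    if how == "full":
        return re.fullmatch(pattern, text, flags)
    if how == "in":
        return re.search(pattern, text, flags)
    if how == "end":
        # No stdlib function for this; anchor a non-capturing group at the end.
        return re.search(rf"(?:{pattern})\Z", text, flags)
    raise ValueError(f"unknown how={how!r}")

start = match_with_mode(r"\d+", "123abc", how="start")
full = match_with_mode(r"\d+", "123abc", how="full")
inside = match_with_mode(r"\d+", "ab12cd", how="in")
end = match_with_mode(r"\d+", "abc123", how="end")
```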

4 Likes

I am kinda partial to having just one match function with different flags. We don’t have open_text, open_binary and open_append. Why do we need separate function names for what are essentially just different ways of applying an expression?

There’s really only start, end, all, full, and in unless I’m missing one. Since most people reach for match immediately, making them be explicit about how they want to match in the function call makes sense.

Edit: An additional argument for this style is centralization of options. Several libraries store kwargs in a separate location to calls so changes can be made to call behavior depending on runtime conditions.

e.g. pydantic stores dataclass kwargs in a separate module

Keeping the function match and allowing flags would mean codebases that use a lot of patterns would just use **match_kwargs after the positionals and be able to change behavior from one place instead of hunting for all the match, startswith or fullmatch calls and changing them.

1 Like

Per @bwoodsend’s earlier point, just saying “preferred” now is going to result in someone, somewhere going on a “fix match” crusade. I’d just say it’s an alias for prefixmatch and leave it at that.

1 Like

Having two equal solutions leads towards fragmentation. There’s value in giving direction. Preference does not necessarily imply everybody has to change, e.g. you could phrase this as „is recommended for new code“.

3 Likes

The pathology of standard library re is like datetime:

  1. Perhaps good enough 30 years ago, awful by modern standards :melting_face:
  2. Python people have come to rely on them over three decades (meanwhile other programming languages have formed their own different consensuses)
  3. Yet given most use cases don’t require modern features, they haven’t been seen to be bad enough to be redone entirely
  4. Points 2 and 3 lead to the development of more performant and feature-rich replacements (regex, pcre2, etc.), which due to Point 2 aim to be mostly drop-in API-compatible with the stdlib
  5. Patchwork improvements make the stdlib stay barely afloat, reinforcing Point 3.

The proliferation of third-party replacements has a consequence for the evolution of the standard library API. As third-party replacements aim to be drop-in API-compatible, their APIs have unfortunately been infected with the same match / search / fullmatch idiosyncrasy. People who use these alternatives for performance may expect this kind of code to trivially work:

try:
    import regex as re
except ImportError:
    import re

re.prefixmatch(...)    # BOOM

If we add re.prefixmatch or otherwise change the API, the onus would be on third-party maintainers to update their corresponding APIs. But this all means more work for the users, who, in addition to having to make sense of the original match / search mess, now also have to worry about versions in order to stay on the “correct” side of naming things. Python is much more widely used nowadays than 15 years ago when re.fullmatch was added, and most importantly Python is so often used in teaching programming these days that we have a responsibility to set good examples. @gpshead noted that the match / search naming is very different than in other languages; I believe that it can represent a point of confusion as people move from Python to other languages.
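To make the “worry about versions” part concrete, this is roughly the shim users would end up writing. It is only a sketch and assumes nothing about which module or version actually provides prefixmatch; it falls back to match when the attribute is missing:

```python
try:
    import regex as re  # third-party drop-in, if installed
except ImportError:
    import re

# Use prefixmatch where the imported module has it, otherwise fall back to match.
prefixmatch = getattr(re, "prefixmatch", re.match)

m = prefixmatch(r"\d+", "123abc")
```

Every codebase that wants the new name across modules and versions would need some variant of that getattr line.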

At this point, people’d perhaps just use anchors and that’s probably what should be taught in the docs as the preferred way. In the docs we can add a table showing what the equivalent anchors to match and fullmatch are, and let people treat the functions as shorthands for the anchors (modulo the slight performance differences, which if you’re using the stdlib implementation is a moot point anyway).
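The equivalences such a table would list can be checked with a few lines. This is a sketch, assuming str patterns with no leading inline flags (wrapping in a group would move them away from the start); the helper name as_search and the sample cases are made up:

```python
import re

def as_search(pattern, kind):
    """Rewrite pattern so that search() behaves like match() or fullmatch()."""
    if kind == "match":
        return rf"\A(?:{pattern})"
    if kind == "fullmatch":
        return rf"\A(?:{pattern})\Z"
    raise ValueError(kind)

def span(m):
    return m.span() if m else None

cases = [(r"a|ab", "ab"), (r"\d+", "123abc"), (r"\d+", "x123"), (r"\w+", "word\n")]
results = []
for pattern, text in cases:
    for kind in ("match", "fullmatch"):
        direct = getattr(re, kind)(pattern, text)
        anchored = re.search(as_search(pattern, kind), text)
        results.append(span(direct) == span(anchored))
```

The non-capturing group matters: without it, an alternation like a|ab would only have its first branch anchored.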

Secret? Can the dev team help it if nobody Rs TFM? If anything needs changing it’s the documentation.

Many of us RTFM all the time to make sure we do not use the wrong one of search or match.

It is not a docs issue; it’s a problem that for many of us the names match and search mean the same thing.

1 Like

If nobody Rs TFM, then changing the documentation is, by definition, going to help nobody.

3 Likes

An option that wasn’t discussed [EDIT: my bad, Matthew did raise it]

Zero changes to API! Just document under match and/or in “search vs match” section:

The historic “match” name is unfortunate, readers may guess it means search (as it does in many other languages!) or fullmatch. Consider using search instead, with \A (and/or \Z) to anchor the pattern.

BTW, current wording of search vs match section kinda assumes reader understands the English word “match” allows trailing content.

Benefits of putting the “how” in the RE:

  • Less python-specific knowledge to look up / remember.
    Concept of anchoring is universal! Unfortunately syntax varies: Regular Expressions Reference: Anchors
    Tools like grep support only ^/$. Alas, we can’t make a blanket recommendation for the widely known ^ due to MULTILINE mode. And I see many tools take uppercase \Z to mean “end of string, or before trailing newline(s)” but imho it’s too early to recommend \z for cross-language transfer at cost of python <3.14 compat. :thinking:
  • We have other APIs: findall, finditer, split, sub, subn. These all correspond to search in their behavior, and the concept of “use anchors if you want” applies there too, though you rarely do (maybe with sub? And ^/$ with MULTILINE is totally useful with all these).
  • flags are another aspect of “how” that is arguably better embedded in RE e.g. (?m).
    IMHO I just think the concept “put everything in RE” is a better default for most code…
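
The MULTILINE and trailing-newline caveats from the first bullet, as a small sketch using Python’s current semantics (the sample strings are made up):

```python
import re

text = "first\nsecond"

# Without MULTILINE, ^ anchors only at the start of the string.
plain_caret = re.search(r"^second", text)
# With MULTILINE (here embedded as (?m)), ^ also matches right after a newline...
multiline_caret = re.search(r"(?m)^second", text)
# ...while \A still means "start of string" only.
multiline_A = re.search(r"(?m)\Asecond", text)

# $ also matches just before a trailing newline; Python's \Z does not.
dollar = re.search(r"end$", "end\n")
big_Z = re.search(r"end\Z", "end\n")
```

Only multiline_caret and dollar are matches here; the other three are None.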
1 Like

That does not help fix the problem that for many of us match and search mean the same thing.

True, but equally for many of us, they don’t.

A quick dictionary lookup shows

match: correspond or cause to correspond in some essential respect
search: try to find something by looking or otherwise seeking carefully and thoroughly

I’m not suggesting that we don’t need to do something to help people who find the distinction between “match” and “search” unclear, but I’d argue that the dictionary definitions suggest that the two words do have different meanings, and those meanings match[1] the way the methods work. So I’m struggling to understand why the current terminology is so problematic for those people.

One (IMO, rather clumsy) solution is to add explicit, but made up aliases like prefixmatch. But it would be good if we could do better than that - either by clarifying the documentation, or by finding terms that work for everyone, so we don’t end up with two factions using different method names (and having proxy fights over the matter via linter rules :slightly_smiling_face:).

Let’s concentrate on helping the people who find the current names confusing, whether we do so by documentation or renaming, and not have the discussion turn into a fruitless quest to decide who is “right”.


  1. The irony of “match” clearly being better than “search” here is noted :slightly_smiling_face: ↩︎

5 Likes

I‘m always stumbling over the fact that match matches the beginning. My mental model of „matching“ expects either a full match (like fullmatch()) or a match anywhere in the string (like search()). The one-sided match feels surprising. Is there a mnemonic or logical explanation that makes the behavior more intuitive?

5 Likes

It is a little less common in Python, but there are plenty of other match/parse tools that work at the beginning of the string. I’m personally a big fan of sscanf-style parsing; it’s simpler than a regular expression and can never go quadratic (it doesn’t do backtracking).

There’s a difference between matching and looking (searching) for a match.

Another idiosyncrasy in English-language software is the use of the word “find” instead of the word “search”. Really what you mean is “search, and report any matches that you find, i.e. discover”.

2 Likes

Agreed! :wink:

1 Like

I covered some of these performance topics at my PyConAu talk in 2023

1 Like

Excellent content! Everyone should know this stuff.

In context here, the problem is subtler:

\d+\s+

applied to long strings of digits without whitespace. There’s nothing problematic about the regexp itself. It fails quickly when using match(), but takes quadratic time to fail when using search(). Very far from “catastrophic” (it’s O(n**2), not O(2**n)).
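A quick way to see the difference for yourself. This is only a sketch: the input length is arbitrary and absolute timings will vary by machine, so no numbers are claimed here:

```python
import re
import time

pattern = re.compile(r"\d+\s+")
digits = "1" * 3000  # long run of digits, no whitespace anywhere

def timed(func, text):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(text)
    return result, time.perf_counter() - start

match_result, match_seconds = timed(pattern.match, digits)
search_result, search_seconds = timed(pattern.search, digits)

print(f"match:  {match_result} in {match_seconds:.6f}s")
print(f"search: {search_result} in {search_seconds:.6f}s")
```

Both calls fail to match; the point is how the search time grows as you increase the length of digits.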

Of what your talk covered, possessive quantifiers can cut time-to-failure in half, but it’s still quadratic time. But it can be made linear time, with a suitable negative lookbehind assertion at the start. The trick is in making search() give up almost at once each time it moves to start over at the next index.
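One possible form of that lookbehind, as a sketch. The exact assertion isn’t spelled out above, so (?<!\d) is my assumption: it lets a digit run start only where no digit precedes it, so every restart inside the run fails immediately. The sample strings are made up:

```python
import re

slow = re.compile(r"\d+\s+")
fast = re.compile(r"(?<!\d)\d+\s+")  # assumed lookbehind: no digit just before

def span(m):
    return m.span() if m else None

samples = ["1" * 2000, "abc 123 \t def", "x" + "9" * 50 + "  y", ""]

# Both patterns should report the same match (or lack of one) on each sample.
agree = all(span(slow.search(s)) == span(fast.search(s)) for s in samples)
found = fast.search("abc 123 \t def")
```

The results agree because the leftmost match of the original pattern always begins at the start of a digit run anyway.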

2 Likes

The C++ std library has regex_match and regex_search. Although regex_match is closer to Python’s fullmatch