Top-level functions of the re module should support the pos/endpos arguments

Summary

Python should add the optional pos/endpos parameters to the re module's top-level convenience functions (re.search, re.match, re.fullmatch, re.findall, and re.finditer). This would allow searching with pos/endpos without first compiling the regex into a Pattern object. Here’s a sample diff that would match up with the underlying C functionality.

If there’s appetite for this sort of idea, I’d be happy to create an issue on the issue tracker and write the code and tests for it.

Rationale

Several methods of the re.Pattern class accept the optional position arguments pos and endpos:

  • Pattern.search(string[, pos[, endpos]])
  • Pattern.match(string[, pos[, endpos]])
  • Pattern.fullmatch(string[, pos[, endpos]])
  • Pattern.findall(string[, pos[, endpos]])
  • Pattern.finditer(string[, pos[, endpos]])
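
For anyone less familiar with these arguments, here is a small illustration of how pos and endpos behave on a compiled pattern. The sample text and variable names are made up for this example; the behaviour follows the documented semantics of the pattern methods, including the caveat that pos is not quite the same as slicing.

```python
import re

pattern = re.compile(r"\d+")
text = "abc 123 def 456"

# Start scanning at index 8, which skips the first number.
second = pattern.search(text, 8)

# Stop scanning at index 6, so the pattern only sees "abc 12".
truncated = pattern.search(text, 0, 6)

# pos is not equivalent to slicing: "^" still anchors to the real
# start of the string, not to the index where the search begins.
anchored = re.compile(r"^def")
no_match = anchored.search(text, 8)        # None
slice_match = anchored.search(text[8:])    # matches "def"

print(second.group(), truncated.group(), no_match, slice_match.group())
```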

Additionally, Python provides access to these pattern methods as top-level convenience functions in the module itself:

  • re.search()
  • re.match()
  • re.fullmatch()
  • re.findall()
  • re.finditer()

However, these top-level convenience functions do not accept pos or endpos. Anyone who wants to use them must first compile a pattern with re.compile() and then call the corresponding method with the extra arguments.
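
As a rough sketch of what that looks like in practice today (the sample text and names are invented for illustration), resuming a search from the end of a previous match requires an explicit compile step. Slicing the string is the usual alternative, but it copies the data and shifts the reported offsets.

```python
import re

text = "key=1; key=2; key=3"

# Today: compile explicitly just to reach the pos argument.
compiled = re.compile(r"key=(\d)")
first = compiled.search(text)

# Resume scanning where the previous match ended.
second = compiled.search(text, first.end())

# Slicing works with the top-level function, but the offsets are now
# relative to the slice rather than to the original string.
sliced = re.search(r"key=(\d)", text[first.end():])

print(second.group(1), second.start())
print(sliced.group(1), sliced.start())
```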

But all the convenience functions do is 1) compile the pattern and then 2) call the method. Here’s an example directly from the re.py source:

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

Looking at the underlying C code for these methods, each one initializes pos and endpos to 0 and PY_SSIZE_T_MAX respectively, and only changes those values if the argument parser detects that pos or endpos was passed.

Example C code from match (indentation adjusted for readability):

static PyObject *
_sre_SRE_Pattern_match(PatternObject *self, PyTypeObject *cls, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames)
{
(...)
    Py_ssize_t pos = 0;
    Py_ssize_t endpos = PY_SSIZE_T_MAX;
(...)
    pos = ival;
(...)
    endpos = ival;
(...)
    return_value = _sre_SRE_Pattern_match_impl(self, cls, string, pos, endpos);
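
The same defaults can be observed from Python. This is only a sketch: it assumes CPython, where sys.maxsize is the largest value a Py_ssize_t can hold and so corresponds to PY_SSIZE_T_MAX, and the sample text is arbitrary.

```python
import re
import sys

pattern = re.compile(r"\w+")
text = "hello world"

# Omitting pos/endpos should behave like passing the C-level defaults.
implicit = pattern.match(text)
explicit = pattern.match(text, 0, sys.maxsize)
same_span = implicit.span() == explicit.span()

# The match object reports the bounds actually used; an oversized
# endpos is clamped to the length of the string.
bounds = (explicit.pos, explicit.endpos)

print(same_span, bounds)
```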

We could add equivalent functionality to the top level module functions by simply adding two new optional arguments to each of the related functions.

Here’s a sample of what it would look like for match():

import sys

def match(pattern, string, flags=0, pos=0, endpos=sys.maxsize):
    """Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found."""
    return _compile(pattern, flags).match(string, pos=pos, endpos=endpos)
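
The snippet above relies on the private _compile helper inside re.py, so it can't be run on its own. For experimenting outside the standard library, here is a standalone sketch of the same idea. The names match_with_pos and findall_with_pos are hypothetical and not part of re; they go through the public re.compile(), which caches compiled patterns.

```python
import re
import sys


def match_with_pos(pattern, string, flags=0, pos=0, endpos=sys.maxsize):
    """Hypothetical stand-in for the proposed re.match() signature."""
    return re.compile(pattern, flags).match(string, pos=pos, endpos=endpos)


def findall_with_pos(pattern, string, flags=0, pos=0, endpos=sys.maxsize):
    """Hypothetical stand-in for the proposed re.findall() signature."""
    return re.compile(pattern, flags).findall(string, pos=pos, endpos=endpos)


# With the defaults, behaviour is the same as today's re.match().
unanchored = match_with_pos(r"o", "dog")

# With pos=1, matching begins at the "o".
shifted = match_with_pos(r"o", "dog", pos=1)

# Only the characters in "1a2b3c"[1:5] are scanned.
digits = findall_with_pos(r"\d", "1a2b3c", pos=1, endpos=5)

print(unanchored, shifted.group(), digits)
```

Keeping pos and endpos after flags, as in the sample above, means existing calls that pass flags positionally keep working unchanged.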

And here’s a gist with a full implementation. It’s a very simple change, overall: Adding pos/endpos to re · GitHub

As stated above, if there’s appetite for this sort of idea, I’d be happy to create an issue on the issue tracker and write the code and tests for it.

While this thread hasn’t garnered any discussion, it did receive a few likes, which I take as a small measure of support. I believe this is a relatively simple change that would provide an immediate benefit to users of the re module.

As such, I’ve gone ahead and submitted a PR for this: PR 113306

Hope that’s okay to do! Happy to take any constructive criticism on the PR or anything else. Cheers!

When I wrote the regex module, which is intended to be backwards-compatible with re, I added them very early on. I’ve never regretted that decision.

I did a quick survey of public GitHub code and found usage of pos/endpos, as well as the pattern of compile → search just to use pos/endpos. Here are the results of that survey.