Module re: add possibility to return matches during substitution

Problem

I would like to have access to the matches from a substitution. This would make it possible to reconstruct the original string (or parts of it) from the substituted string. An example use case is to (efficiently) parse a long string with a substitution and then parse the result again, e.g.,

s1 = re.sub(' and ', '', 'a and b and c and d')  # 3 matches, s1 = 'abcd'
re.finditer('ab|cd', s1)  # two matches, spanning 'ab' and 'cd'

Now I would like to partially reconstruct the original string to obtain 'a and b' and 'c and d'. Of course, this example is oversimplified: here a single regular expression (pattern='a( and )?b|c( and )d') applied to the original string would work as well. In my use case, however, I have much longer patterns that are impossible to combine into a single one.
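
For the toy example, the single-pattern alternative can be checked directly against the original string. This is only a minimal sketch using the pattern quoted above; the variable names are made up for illustration.

```python
import re

original = 'a and b and c and d'

# The combined pattern from the post, applied to the original string
# instead of the substituted one.
combined = re.compile(r'a( and )?b|c( and )d')

found = [m.group(0) for m in combined.finditer(original)]
print(found)  # ['a and b', 'c and d']
```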

Suggestion

Currently, there are

  • re.sub(...), which returns only the substituted text, and
  • re.subn(...), which additionally returns the number of substitutions.
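
For reference, this is how the two existing functions behave on the string from the example (the variable names are only for illustration):

```python
import re

text = 'a and b and c and d'

# re.sub returns only the substituted string.
sub_result = re.sub(' and ', '', text)
print(sub_result)   # abcd

# re.subn returns a (new_string, number_of_substitutions) tuple.
subn_result = re.subn(' and ', '', text)
print(subn_result)  # ('abcd', 3)
```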

I suggest adding either a new function (my preferred option):

  • re.subm(...) which returns both the substituted text and an iterable over the replaced matches (or a list, depending on the implementation)

or an additional argument to the existing re.sub(...), e.g.,

  • re.sub(..., matches=False) which, when called with matches=True, would additionally return an iterable over the replaced matches.

Possible implementation in Python

The following code implements this function in pure Python, but it could likely be done much more efficiently in CPython's _sre module.

import re

def subm(pattern, repl, string, count=0, flags=0):
    # ensure that pattern is compiled
    pattern = re.compile(pattern, flags=flags)
    # wrap repl in a callable if it is a template string
    if not callable(repl):
        # re._compile_repl is a private helper and may change between versions
        group_for_pos, replacements = re._compile_repl(repl, pattern)
        group_for_pos = dict(group_for_pos)
        def repl(match):
            # index 0 holds the whole match so that \g<0> expands correctly
            groups = (match.group(0), *match.groups())
            return ''.join(rep
                           if rep is not None
                           else ''
                           if (group := groups[group_for_pos[pos]]) is None
                           else group
                           for pos, rep in enumerate(replacements))
    replaced = string
    matches = re.finditer(pattern, string)
    # count == 0 means no limit; a non-int count raises TypeError in range() below
    if type(count) is int and count == 0:
        matches = list(matches)
    else:
        # quicker than list(matches)[:count]
        matches = [match for match, _ in zip(matches, range(count))]
    # replace starting from the last match such that positions are correct
    for match in reversed(matches):
        start, end = match.span()
        replaced = replaced[:start] + repl(match) + replaced[end:]
    return replaced, matches

Nevertheless, in my tests the timing was already acceptable: subm was slower than re.sub by less than a factor of two.

Edit: fixed example above.

Did you know that you can pass a callable as the replacement?

import re
matches = []
def repl(m):
    matches.append(m)
    return ''

print(re.sub(' and ', repl, 'a and b and c and d'))
print(matches)
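
Building on the callable approach, the recorded matches can be used to map spans found in the substituted string back to the original one. The following is only a sketch under the assumption that every match is deleted (empty replacement); the helper to_original and all variable names are hypothetical.

```python
import re

original = 'a and b and c and d'
removed = []  # match objects for every deleted ' and '

def record(m):
    removed.append(m)
    return ''  # assumption: matches are deleted, not replaced by other text

s1 = re.sub(' and ', record, original)  # 'abcd'

def to_original(pos, at_end=False):
    """Map an index in s1 to an index in original (hypothetical helper)."""
    shift = 0
    for m in removed:
        cut = m.start() - shift  # where this deletion sits in s1
        # A span start at the cut stays right of the deleted text,
        # a span end at the cut stays left of it.
        if cut > pos or (at_end and cut == pos):
            break
        shift += m.end() - m.start()
    return pos + shift

parts = []
for m in re.finditer('ab|cd', s1):
    start = to_original(m.start())
    end = to_original(m.end(), at_end=True)
    parts.append(original[start:end])

print(parts)  # ['a and b', 'c and d']
```

With non-empty replacements the mapping would also need the length of each replacement, and positions inside replaced text would have no unique counterpart in the original.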

Moving this to the help category as the capability already exists.

What the… That’s not even remotely true.

I have now fixed the example above.

Sure, I also use that feature in the proposed Python implementation of subm. But that does not solve the case (which is my case) where the pattern also includes groups. Your suggestion certainly helps to simplify it further, though. Here is a simplified (and also slightly quicker) version of a possible Python implementation of subm:

def subm(pattern, repl, string, count=0, flags=0):
    matches = []
    if callable(repl):
        def _repl(match):
            matches.append(match)
            return repl(match)
        return (re.sub(pattern, _repl, string, count=count, flags=flags),
                matches)
    pattern = re.compile(pattern, flags=flags)
    group_for_pos, replacements = re._compile_repl(repl, pattern)
    group_for_pos = dict(group_for_pos)
    def _repl(match):
        matches.append(match)
        # index 0 holds the whole match so that \g<0> expands correctly
        groups = (match.group(0), *match.groups())
        return ''.join(rep
                       if rep is not None
                       else ''
                       if (group := groups[group_for_pos[pos]]) is None
                       else group
                       for pos, rep in enumerate(replacements))
    return re.sub(pattern, _repl, string, count=count), matches

Still, a large part of the group-handling code could be done more efficiently within _sre. Or are there other functions that can be used with the output of re._compile_repl?
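
On that last question, one option that may avoid the private helper altogether is the documented Match.expand(template) method, which performs the same backslash substitution that re.sub applies to a template string. A minimal sketch, assuming that expanding the template on every match is fast enough (not measured here); the name subm is taken from the proposal:

```python
import re

def subm(pattern, repl, string, count=0, flags=0):
    """Return (substituted string, list of replaced match objects)."""
    matches = []

    def _repl(match):
        matches.append(match)
        # Callables are invoked as in re.sub; template strings such as
        # r'\1' are expanded with the documented Match.expand method.
        return repl(match) if callable(repl) else match.expand(repl)

    return re.sub(pattern, _repl, string, count=count, flags=flags), matches


new, found = subm(r'(\w) and (\w)', r'\2 or \1', 'a and b, c and d')
print(new)                        # b or a, d or c
print([m.span() for m in found])  # [(0, 7), (9, 16)]
```

This relies only on public API, at the possible cost of some speed compared with a version implemented inside _sre.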

(edit: improved execution time by removing the unnecessary re.compile call when repl is a callable)