Problem
I would like to have access to the matches produced by a substitution. This would make it possible to reconstruct the original string (or parts of it) from the substituted string. An example use case is to (efficiently) parse a long string with one pattern and then parse the result again with another, e.g.,
s1 = re.sub(' and ', '', 'a and b and c and d') # 3 matches, s1 = 'abcd'
re.finditer('ab|cd', s1) # two matches: 'ab', 'cd'
Now, I would like to partially reconstruct the original string to obtain 'a and b' and 'c and d'. Of course, this example is oversimplified, since a single regular expression (pattern='a( and )?b|c( and )d') would work here as well. In my use case, however, the patterns are much longer and become impossible to combine into a single one.
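To make the use case concrete, here is a minimal sketch of the reconstruction for the example above. It assumes that every match was replaced by the empty string, and it obtains the matches through a separate re.finditer call, since re.sub does not expose them. The helper name original_index is hypothetical and only serves the illustration.

```python
import re

def original_index(i, removed):
    """Map index i of the substituted string to an index of the original.

    Assumes every match in `removed` was replaced by the empty string.
    """
    shift = 0
    for match in removed:
        start, end = match.span()
        # start - shift is where this deletion sits in the substituted string
        if start - shift > i:
            break
        shift += end - start
    return i + shift

text = 'a and b and c and d'
removed = list(re.finditer(' and ', text))  # the matches re.sub does not return
s1 = re.sub(' and ', '', text)

parts = []
for m in re.finditer('ab|cd', s1):
    first = original_index(m.start(), removed)
    last = original_index(m.end() - 1, removed)
    parts.append(text[first:last + 1])

print(parts)  # ['a and b', 'c and d']
```

The second scan over the original string is exactly the duplicated work that returning the matches from the substitution would avoid.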
Suggestion
Currently, there is re.sub(...), which returns only the substituted text, and re.subn(...), which additionally returns the number of substitutions.
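For reference, a short illustration of what the two existing functions return for the example string; neither gives access to the match objects.

```python
import re

text = 'a and b and c and d'

# re.sub returns only the new string
substituted = re.sub(' and ', '', text)

# re.subn returns a tuple (new string, number of substitutions made)
substituted_n, n = re.subn(' and ', '', text)

print(substituted)       # abcd
print(substituted_n, n)  # abcd 3
```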
I suggest adding either
a new function (probably my preferred option), re.subm(...), which returns both the substituted text and an iterable over the replaced matches (or a list, depending on the implementation),
or an additional argument to the existing re.sub(...), e.g., re.sub(..., matches=False), which would additionally return the iterable of replaced matches.
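The return shape proposed here can be approximated today by passing re.sub a callable that records each match before producing its replacement. The following is only a sketch of that workaround, not the proposed API; the name sub_recording is hypothetical, and string templates are expanded with the public Match.expand method.

```python
import re

def sub_recording(pattern, repl, string, count=0, flags=0):
    """Return (substituted string, list of the matches that were replaced)."""
    matches = []

    def recording_repl(match):
        # remember the match, then build its replacement text
        matches.append(match)
        return repl(match) if callable(repl) else match.expand(repl)

    replaced = re.sub(pattern, recording_repl, string, count=count, flags=flags)
    return replaced, matches

replaced, matches = sub_recording(' and ', '', 'a and b and c and d')
print(replaced)                      # abcd
print([m.span() for m in matches])   # [(1, 6), (7, 12), (13, 18)]
```

The cost of this workaround is one Python-level call per match, which is part of why support inside the re module itself would be attractive.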
Possible implementation in Python
The following code implements this function in pure Python; it could likely be made much more efficient inside CPython's _sre module.
import re

def subm(pattern, repl, string, count=0, flags=0):
    # make sure that pattern is compiled
    pattern = re.compile(pattern, flags=flags)
    # wrap repl in a callable if it is a template string
    if not callable(repl):
        # note: re._compile_repl is a private helper of the re module;
        # its availability and return value depend on the Python version
        group_for_pos, replacements = re._compile_repl(repl, pattern)
        group_for_pos = dict(group_for_pos)
        def repl(match):
            # index 0 holds the whole match so that \g<0> works
            groups = (match.group(0), *match.groups())
            return ''.join(rep
                           if rep is not None
                           else ''
                           if (group := groups[group_for_pos[pos]]) is None
                           else group
                           for pos, rep in enumerate(replacements))
    replaced = string
    matches = re.finditer(pattern, string)
    # a count that is not an int raises TypeError in range() below
    if type(count) is int and count == 0:
        matches = list(matches)
    else:
        # quicker than list(matches)[:count]
        matches = [match for match, _ in zip(matches, range(count))]
    # replace starting from the last match such that positions are correct
    for match in reversed(matches):
        start, end = match.span()
        replaced = replaced[:start] + repl(match) + replaced[end:]
    return replaced, matches
Nevertheless, in my tests the timing was already not too bad, with less than a factor of two between re.sub and subm.
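As a possible variation, the same result can be produced using only public re APIs: Match.expand handles string templates, and collecting the pieces in a list avoids re-slicing the whole string for every match. This is a sketch under those assumptions rather than a drop-in replacement; the name subm_expand is hypothetical, and I have not timed it against the version above.

```python
import re

def subm_expand(pattern, repl, string, count=0, flags=0):
    """Return (substituted string, list of replaced matches)."""
    compiled = re.compile(pattern, flags=flags)
    matches = []
    pieces = []
    last_end = 0
    for match in compiled.finditer(string):
        # count == 0 means "replace all", as in re.sub
        if count and len(matches) >= count:
            break
        matches.append(match)
        # text between the previous match and this one is kept unchanged
        pieces.append(string[last_end:match.start()])
        pieces.append(repl(match) if callable(repl) else match.expand(repl))
        last_end = match.end()
    # keep the text after the last replaced match
    pieces.append(string[last_end:])
    return ''.join(pieces), matches

replaced, matches = subm_expand(' and ', '', 'a and b and c and d')
print(replaced)       # abcd
print(len(matches))   # 3
```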
Edit: fixed example above.