Making str.replace() accept lists

marcospb19 · May 9, 2020, 8:12pm

Syntax of the method: str.replace(old, new, count=-1)

What if replace could accept two lists instead of two strings.

Often I find in code:

text.replace(a, b).replace(c, d)

The concatenation of replace calls can cause a unnecessary performance hit if the string is large or the call is the calls are made too many times.

My suggestion:

text.replace([a, c], [b, d])

That would behave like (text.replace(a, b).replace(c, d)) but, b could not capture text from c, that is, old[i] would not capture new[i - 1] replaced text, or from any new[i - k] with K being a positive integer.

This way we can avoid having to reallocating giant strings multiple times.

Also, this makes room for huge replace commands that could be a handful tool for some cases.

steven.daprano · May 9, 2020, 11:51pm

Having str.replace support multiple targets (but with a single
replacement) has been suggested many times before, and it has always
foundered over the problem of what to do when the targets overlap.

text.replace(('pepper', 'red pepper', 'green pepper'), 'tomato')

The conclusion was always to recommend that if your replacement needs
were more complex than just changing a single substring at a time, you
should move to using regular expressions.

marcospb19 · May 10, 2020, 1:11am

The example you gave can be trivially moved to regular expressions:

text = 'pepper, red pepper, green pepper'
regex = re.compile(r'(pepper|red pepper|green pepper)')

print(regex.sub('tomato', text))

But, if instead, we consider my example.

text.replace(a, b).replace(c, d)
text.replace([a, c], [b, d])

How can this be replaced with regular expressions?

steven.daprano · May 10, 2020, 3:54am

[João Marcos Bezerra]
“How can this be replaced with regular expressions?”

See the docs for re.sub.

The replacement can be a callback function that takes the match object
as argument, so you can decide what to replace it with.

By the way, if you’re going to still argue in favour of changing
str.replace, the other string methods take tuples, not lists, for
multi-valued arguments. This is a common pattern in Python:

string.startwith(('a', 'b'))
isinstance(obj, (int, float))

I think there may be a good case for a helper function in re that
takes an iterable of (original, replacement) pairs and calls re.sub for
you:

re.multireplace([(a, b), (c, d)])
# replace a with b, c with d

but I don’t think the str.replace method should be changed.

storchaka · May 10, 2020, 8:41am

Other problem with the original proposition is that when a function accepts lists, it is rarely accepts only lists. It usually accepts arbitrary iterables. But string is an iterable, so str.replace('ab', 'cd') should be equivalent to str.replace(['a', 'b'], ['c', 'd']).

If we solve the ambiguity by choosing other syntax, we will still need to add a lot of complex code. We will need to implement a part of the regular expression engine in the interpreter core. But we already have a dedicated module for regular expressions.

luciano · May 12, 2020, 2:28am

Interesting! A few days ago I wrote an example for Fluent Python 2ed (unpublished) that looks like this:

from typing import Iterable, Tuple

FromTo = Tuple[str, str]

def zip_replace(text: str, changes: Iterable[FromTo], count:int = -1) -> str:
    for from_, to in changes:
        text = text.replace(from_, to, count)
    return text

It’s an example for one of the chapters about type hints in the book. I do my best to avoid foo/bar examples, and this is a nice little function to discuss a type alias and an Iterable argument.

methane · May 12, 2020, 2:56am

If same set of replaces happen many times, something like Go’s Replacer is better for performance point of view.

It solves the proposed restriction too; " but, b could not capture text from c , that is, old[i] would not capture new[i - 1] replaced text, or from any new[i - k] with K being a positive integer."

I think there are some TRIE implementations in PyPI.

marcospb19 · May 16, 2020, 12:41am

Oops, thanks for notifying!

Here it is:

text.replace((a, c), (b, d))

I don’t think that solving this restriction is a need because wanting c to capture b looks like a very rare usage case, where a replacement is taken on top of another, for this rare occasion you could just concatenate two uses of replace, also because I created this restriction to make the replacement algorithm faster.

I see how a Trie can solve this problem, but the memory usage would be a problem, and linked lists can not make use of the sequential memory caching (complexity is lower, but runtime can be larger).
Maybe a linear sweep similar to the current implementation of replace is a better solution.

marcospb19 · May 16, 2020, 1:16am

steven.daprano:

Having str.replace support multiple targets (but with a single
replacement) has been suggested many times before, and it has always
foundered over the problem of what to do when the targets overlap.
text.replace(('pepper', 'red pepper', 'green pepper'), 'tomato')
The conclusion was always to recommend that if your replacement needs
were more complex than just changing a single substring at a time, you
should move to using regular expressions.

The Idea you cited is very different from mine, so let’s not hastily close this, thinking that is the same thing.

If someone is replacing text of a string, it is very likely that it’s being done more than just once.

Let’s compare the available solutions.

Consider we want to do 3 changes to the string text:

text: str = get_text()

changes: List[Tuple[str, str]] = [
	(a, b),
	(c, d),
	(e, f)
] # Imagine those variables are strings

Common solution (same as what @luciano teaches):

for from_, to in changes:
    text = text.replace(from_, to)

RegEx solution (suggested by @steven.daprano and based on the re.sub() documentation):

import re

changes: dict[str, str] = dict(changes)

def callback_repl(matchobj) -> str:
	replacement: Optional[str] = changes.get(matchobj.group(0), None)
	if replacement is not None:
		return replacement
	raise Exception('The match object don\'t match!')

re.sub(rf'({a}|{c}|{e})', callback_repl, text)

New suggested solution:

text.replace((a, b), (c, d), (e, f))

I can’t complaint about the first solution, it works, the only reason I’m posting this is because I think that the operation of writing multiple replaces is very common and can be optimized.

The second solution is complicated for the simple job it solves, I can see people copying it from StackOverflow haha, jokes aside, the function call adds an unnecessary overhead to the algorithm.

methane · May 16, 2020, 1:55am

TRIE is the most performant way to multiple keywords in long string.

We use “Boyer-Moore String Search Algorithm” to search single word. Using the algorithm multiple times (e.g. s.replace(k1, r1).replace(k2, r2)) doesn’t solve the proposed restriction. So we can not use the BM algorithm for replacing multiple keywords.

Bulding TRIE consumes some CPU and memory. But if same replacement happens multiple times, it is fast. That’s why Go has Replacer in its stdlib.

marcospb19 · September 8, 2020, 5:49pm

I guess this proposal is now dead.

I don’t know exactly what you mean by this.

But on the performance side, we can tell for sure that for lower amounts of tuples (a, b) to replace a for b, using a trie is worst than iterating on all lists that we are trying to match.

Benchmarks would be necessary to tell from what point the advantage shifts from one approach to another.

I mean, a trie would just work better for the worst cases so… If you don’t care about the rest, that’s the way to go.

andrometocs · December 30, 2020, 9:40am

that’s great syntax idea

idanw206 · July 5, 2021, 9:05pm

I like the multireplace approach suggested by @marcospb19 , but I think it should be a method of the str type, for instance:

`my_str.multireplace(['a', 'c'], ['b', 'd])

It can also be taken one step further, and support dicts as an input:

my_str.multireplace({'a': 'b', 'c': 'd'})

rangeles89 · August 27, 2021, 6:49am

Hi, another possible solution could be implement the replace algorithm using find method…

For my solution I downloaded a txt version of “El Quijote de la mancha”, in order to have a string long enough to measure time.

with urlopen("https://gist.githubusercontent.com/jsdario/6d6c69398cb0c73111e49f1218960f79/raw/8d4fc4548d437e2a7203a5aeeace5477f598827d/el_quijote.txt") as f:
    text = f.read()
text = str(text, 'utf-8')
to_replace = list(set([t for t in choices(text.split(), k=4000) if len(t)>3 ]))
replace_map =  list(map(lambda x: (x, f'new_string_to_replace_with_{x}'), to_replace))
print(replace_map)
print(len(replace_map))

Then I created a function using the nested calls to replace method

def multireplace_v1(s, changes):
    for old, new in changes:
        s = s.replace(old, new)
    return s

And another function using find method, and creating a list of all possible replacements using the changes

def multireplace_v2(s, changes):
    right = len(s)-1
    replacements = []
    for old, new in changes:
        i = 0
        l = len(old)
        while True:
            n = text_test.find(old, i, right)
            if n == -1:
                break
            i = n + l
            replacements.append((n, i, l, new))
    replacements = sorted(replacements, key= lambda x: x[0])
    
    i = 0
    prev_s = -1
    prev_e = -1
    new_s = ""
    for b, e, l, t in replacements:
        if b >= prev_s and  b+l <= prev_e:
            continue
        prev_s = b
        prev_e = b+l
        new_s += s[i:b] + t
        i = e
    new_s += s[i:]
    return new_s

The call

result1 = multireplace_v1(text, replace_map)

took 3.06 seconds to finish

And

result2 = multireplace_v2(text, replace_map)

took 914ms

The proposed solution is faster, and also prevents from replace the already replaced string, the priority is the occurrence of one of the string in changes

In the v1 function you’ll see things like this

Miguel de Cervnew_string_to_replace_with_new_string_to_replace_with_antes Saavedra

because is replacing the words ante and antes

In the v2 version you’ll get this:

Miguel de Cervnew_string_to_replace_with_antes Saavedra

Only one replacement per “word”.