Str.replace of a set of characters

Paddy3118 · December 1, 2019, 12:09am

I would like str.replace, when given a set of characters, to replace occurrences of any of the characters in the set. I.e. 'ASDFGH'.replace(set('SFH'), '') == 'ADG' should then hold.
It would mean I would not have to go to the trouble of using the re module for a common string operation.

By extension, a set of multi-character strings should have any occurrences of those strings replaced (in an indeterminate order, so watch out for replacement sets where one string is a sub-string of another).
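
For comparison, here is a minimal sketch of the multi-string case using the re module as it stands today. It assumes that longest-first matching is the desired way to resolve overlapping targets; the helper name replace_any is hypothetical and not an existing str or re API.

```python
import re

def replace_any(text, targets, replacement=""):
    """Replace every occurrence of any string in targets (hypothetical helper)."""
    # Sort longest-first so 'ab' wins over 'a' when one target is a prefix of another.
    ordered = sorted(targets, key=len, reverse=True)
    # An empty target would match at every position, so skip it.
    pattern = "|".join(re.escape(t) for t in ordered if t)
    if not pattern:
        return text
    # Use a function so the replacement is inserted literally (no backslash handling).
    return re.sub(pattern, lambda m: replacement, text)

print(replace_any("ASDFGH", set("SFH")))       # ADG
print(replace_any("abcab", {"a", "ab"}, "-"))  # -c-
```

Sorting the targets makes the result independent of set iteration order, which sidesteps the "indeterminate order" problem at the cost of fixing one particular policy.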

Whodjafink?

mjpieters · December 1, 2019, 12:52am

That’s what str.translate() already can do:

value = 'ASDFGH'
dropset = 'SFH'
dropmap = dict.fromkeys(map(ord, dropset))
print(value.translate(dropmap))

The dict.fromkeys() / ord() dance is used because the str.translate() method takes a map from integer codepoint to replacement value (and None means “remove”). str.translate() is fast and efficient, and it can do much more than remove characters, hence the specific input requirement. Its existence does mean there is no need to complicate the str.replace() API, however.

To make the tool easier to use there is a helper function: str.maketrans(), which can either transform a dictionary with single characters as keys or two or three strings into a suitable map for str.translate(). So this works too:

dropmap = str.maketrans(dict.fromkeys(dropset))

or

dropmap = str.maketrans('', '', dropset)

This last version requires dropset to be a string, while dict.fromkeys() will take any iterable.
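
To illustrate that last point, a short sketch, assuming the characters arrive as a set rather than a string (the variable names raised and joined_map are only for illustration):

```python
dropset = set('SFH')  # any iterable of single characters, here a set

# dict.fromkeys() accepts the set directly; maketrans() then converts the
# single-character keys to integer code points.
dropmap = str.maketrans(dict.fromkeys(dropset))
print('ASDFGH'.translate(dropmap))  # ADG

# The three-argument form insists on a str for its third argument.
raised = False
try:
    str.maketrans('', '', dropset)
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# Joining the set into a string first produces the same map.
joined_map = str.maketrans('', '', ''.join(dropset))
print(joined_map == dropmap)  # True
```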

Paddy3118 · December 1, 2019, 1:40am

Thanks Martijn for your reply, it’s appreciated

Unfortunately I am “cursed” by knowing regular expressions and am much more likely to use re.sub(r'[SFH]', '', 'ASDFGH'). I do prefer to use str methods over re, but your use of translate seems overkill for my usual use cases, where re is fast enough but I am trying to limit its use (regular expressions are the way to go in scripting languages like Perl and Awk).

On my suggestion complicating str.replace: yes, it would. Sets of single characters would, I think, be straightforward to learn; the complications of multiple sub-strings, less so.

steven.daprano · December 1, 2019, 3:36am

You don’t need the dict.fromkeys/ord dance, that’s what the maketrans method is for:

>>> "python".translate(str.maketrans('tp', 'τπ'))
'πyτhon'

mjpieters · December 1, 2019, 7:19am

I did introduce str.maketrans() too, in the next paragraph

Paddy3118 · December 1, 2019, 8:56am

Thanks Steven,

So in summary:

I asked for: 'ASDFGH'.replace(set('SFH'), '')
After using: re.sub(r'[SFH]', '', 'ASDFGH')
There exists: 'ASDFGH'.translate(str.maketrans('', '', 'SFH'))
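
One difference between the re.sub and str.translate spellings shows up when the characters to drop are special inside a regex character class. A small sketch, with sample strings made up for illustration:

```python
import re

value = 'A-S]D^F'
drop = '-]^'  # characters that are special inside a [...] class

# translate() treats every character literally, so no escaping is needed.
via_translate = value.translate(str.maketrans('', '', drop))

# For re.sub, escape the characters before building the [...] class.
via_re = re.sub('[' + re.escape(drop) + ']', '', value)

print(via_translate)  # ASDF
print(via_re)         # ASDF
```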

I know which way this is going, but I thought to bring it up as I often parse text and am used to using the other str methods, then having the dissonance of needing re. I’ll have to give this new (to me) use of str.translate a go and see if I can get it to “flow”.

Ahah, just read the docs on str.translate: I would have read of its seemingly intended use case of character replacement and its use in the codecs, run a few examples, then quietly forgotten about it as the “replacement” thoughts drowned out any connection between my task and “translation”.

Thanks again Martijn, Steve.

derek_v · December 17, 2019, 6:09pm

Using str.translate for this is a bit like using urllib2 to make HTTP requests. For readability, I would want to either wrap it in a more expressive function or just use re.sub. It would be nice to have a simple string method.

uranusjr · December 17, 2019, 6:34pm

I kind of feel str.translate itself isn’t the problem per se; the name is straightforward enough to me, but maketrans would throw me off if I didn’t know it already. I think I would have had no problem understanding any of the following at first sight:

'ASDFGH'.translate({'S': '', 'F': '', 'H': ''})
'ASDFGH'.translate('SFH', 'JKL')
'ASDFGH'.translate(remove='SFH')

So maybe a possible solution would be to extend translate to create a one-off maketrans under the hood?

(For context, the second and third lines would throw TypeError right now; the first fails silently because translate does accept a dict, but looks up keys by code point, not by single-char str.)
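
A quick check of that behaviour, as a sketch on current CPython. The three calls are the hypothetical spellings from above, not a supported API; the names unchanged, removed and errors are only for illustration.

```python
# 1. str keys are never matched, because translate() looks up integer code points.
unchanged = 'ASDFGH'.translate({'S': '', 'F': '', 'H': ''})
print(unchanged)  # ASDFGH

# The same mapping keyed by code point does remove the characters.
removed = 'ASDFGH'.translate({ord(c): '' for c in 'SFH'})
print(removed)  # ADG

# 2. and 3. translate() accepts exactly one positional argument and no keywords.
errors = []
for call in (lambda: 'ASDFGH'.translate('SFH', 'JKL'),
             lambda: 'ASDFGH'.translate(remove='SFH')):
    try:
        call()
    except TypeError as exc:
        errors.append(exc)
        print('TypeError:', exc)
```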

Paddy3118 · December 21, 2019, 4:29am

Ah, your post has revealed something further to me:

The common string functionality I am after could be better described as str.remove.
I am asking for an extension to the existing str.replace.
str.translate has some of the functionality, but must be used with maketrans.

Mind you, a str.remove might prompt others to extend its functionality to removing strings, and lists of strings

mjpieters · December 22, 2019, 6:17pm

No, must is the wrong term here. It is easier to use when using str.maketrans(). You can use dict.fromkeys(map(ord, toremove)) instead, for example, or {ord(c): None for c in toremove} or the (ugly and definitely not recommended) dict(zip(map(ord, toremove), [None] * len(toremove))).

Or you could hard-code the numeric Unicode codepoint values you wanted to remove, up front. Passing str.translate the dictionary {83: None, 70: None, 72: None} lets you remove the characters S, F and H, too.
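
As a sanity check, the hand-built mappings above can be compared against what str.maketrans() returns. This is only a sketch; toremove is the same illustrative name used above, and candidates and expected are made up for the comparison.

```python
toremove = 'SFH'

candidates = [
    dict.fromkeys(map(ord, toremove)),
    {ord(c): None for c in toremove},
    dict(zip(map(ord, toremove), [None] * len(toremove))),
    {83: None, 70: None, 72: None},  # hard-coded code points for S, F, H
]

expected = str.maketrans('', '', toremove)
print(all(c == expected for c in candidates))  # True
print('ASDFGH'.translate(candidates[-1]))      # ADG
```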

str.maketrans() is a utility to transform common patterns into input that str.translate() accepts; you can replace the utility with one of your own.

At any rate, if your code really needs to handle lots of character removals based on sets, then just add a function to your utilities module:

def remove_set(s: str, to_remove: str) -> str:
    """Removes the individual characters in to_remove from s"""
    return s.translate(str.maketrans('', '', to_remove))

and then from there on out use remove_set('ASDFGH', 'SFH').

Paddy3118 · December 22, 2019, 8:48pm

Yea. It’s just that it’s something I’ve done multiple times over the years, hence my inquiry about adding functionality to the language itself.

Thanks.