The existing functionality vs. the Zen
For such a simple, elegant, and useful idea, str.translate is hard to explain:
translate(self, table, /)
    Replace each character in the string using the given translation table.

      table
        Translation table, which must be a mapping of Unicode ordinals to
        Unicode ordinals, strings, or None.

    The table must implement lookup/indexing via __getitem__, for instance a
    dictionary or list. If this operation raises LookupError, the character is
    left untouched. Characters mapped to None are deleted.
To understand this, I have to understand “Unicode ordinal”, which here means “Unicode code point, as an integer”. But this isn’t a complaint about the phrasing; it’s a complaint that I need to provide keys in that format in the first place. In practical terms, I am expected to fill in this gap with str.maketrans (not appearing in the above documentation):
maketrans(...)
    Return a translation table usable for str.translate().

    If there is only one argument, it must be a dictionary mapping Unicode
    ordinals (integers) or characters to Unicode ordinals, strings or None.
    Character keys will be then converted to ordinals.
    If there are two arguments, they must be strings of equal length, and
    in the resulting dictionary, each character in x will be mapped to the
    character at the same position in y. If there is a third argument, it
    must be a string, whose characters will be mapped to None in the result.
That’s also hard to explain for all the same reasons. Worse yet, the argument semantics vary depending on how many there are, like range but worse. It’s also not possible to, say, combine a dictionary of defaults with a pair of special-case strings in one call, and specifying only characters to remove requires giving two empty strings first. Special cases aren’t special enough to break the rules.
Speaking of which, there’s no particular reason why these dictionaries have to treat None values specially; an empty string could be used instead to get the same result.
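As a quick sketch of the call forms described above (the variable names are mine, purely for illustration), this is how the three shapes of str.maketrans behave, along with the workaround I would need to combine them and the None versus empty-string equivalence:

```python
# One argument: a dict; character keys are converted to ordinals.
one_arg = str.maketrans({'a': 'x', 'b': None})
# Two arguments: equal-length strings, mapped position by position.
two_arg = str.maketrans('ab', 'xy')
# Three arguments: removing characters alone still needs two empty strings.
only_remove = str.maketrans('', '', 'ab')

print('abc'.translate(one_arg))      # xc
print('abc'.translate(two_arg))      # xyc
print('abc'.translate(only_remove))  # c

# A dict of defaults plus special-case strings cannot go in a single call,
# so the tables have to be built separately and merged by hand.
combined = {**str.maketrans({'a': 'x'}), **str.maketrans('bc', 'yz', 'd')}
print('abcde'.translate(combined))   # xyze

# None and '' both delete the character.
with_none = 'abc'.translate({ord('b'): None})
with_empty = 'abc'.translate({ord('b'): ''})
print(with_none == with_empty == 'ac')  # True
```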
On top of all of that, str.translate allows errors to pass silently without being explicitly silenced:
>>> 'a'.translate({'a': 'b'}) # Wait, what?
'a'
>>> 'a'.translate({ord('a'): 'b'}) # Ah, right.
'b'
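To see why the first call silently does nothing, here is a small sketch using a stand-in table that records what str.translate asks for. LoggingTable is a made-up class for illustration, not anything from the standard library. The lookups arrive as integers, so a dict keyed by 'a' raises KeyError (a LookupError), and the character is left untouched as documented:

```python
class LoggingTable:
    """Hypothetical table that records every key it is asked for."""

    def __init__(self):
        self.seen = []

    def __getitem__(self, key):
        self.seen.append(key)
        # Per the documentation, LookupError means "leave the character as is".
        raise LookupError(key)


table = LoggingTable()
result = 'ab'.translate(table)

print(result)      # the input comes back unchanged
print(table.seen)  # integer ordinals, never the strings 'a' or 'b'

# A dict with string keys fails the same way: the integer key is missing.
print(issubclass(KeyError, LookupError))  # True
```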
Current performance considerations
I don’t want to go through this kind of hassle unless it’s for NumPy-at-its-best sorts of performance gains. And str.translate just isn’t that impressive. My quick test framework:
import random, timeit

def test(d, crange, size):
    count = 10_000_000 // size # sensible on my system
    e = str.maketrans(d)
    # same table as e, but with empty-string values switched to None
    f = {k:(v if v else None) for k, v in e.items()}
    def opt():
        return txt.translate(f)
    def auto():
        return txt.translate(e)
    def manual():
        return ''.join(d.get(c, c) for c in txt)
    def assess(func):
        return timeit.timeit(func, number=count)
    # random text drawn from the first `crange` code points
    txt = ''.join(chr(random.randrange(crange)) for _ in range(size))
    assert opt() == auto() == manual()
    to, ta, tm = assess(opt), assess(auto), assess(manual)
    # absolute time for the manual version, then ratios using str.translate
    return tm, (to/tm, ta/tm)
I tried adding the opt version just in case switching empty strings to None is somehow really relevant to the performance. The effect is negligible in all cases.
Overall, str.translate is of course faster, but in my testing it is only between 1.4x and 3.7x as fast, depending on the mapping and the Python version. And it tends strongly towards the low end of that in most practical cases (the best case for auto is when it could be replaced by return ''). For example, a simple ROT-13 implementation:
$ python3.11
Python 3.11.2 (main, Apr 5 2023, 03:08:14) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import translate
>>> import string
>>> u, l = string.ascii_uppercase, string.ascii_lowercase
>>> d = dict(zip(u+l, u[13:]+u[:13]+l[13:]+l[:13]))
>>> translate.test(d, 0x100, 1000)
(0.746712852967903, (0.7000551883315342, 0.6960985174501878))
And actually, that difference gets even smaller if I allow str.join to work on a list comprehension instead of a generator expression; then str.translate is only about 10% faster.
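For anyone who wants to check that on their own machine, here is a minimal standalone sketch of the comparison. It is separate from my test framework above; the function names, the sample text, and the repeat count are arbitrary choices for illustration, and the timings will differ by system and Python version:

```python
import string
import timeit

u, l = string.ascii_uppercase, string.ascii_lowercase
rot13 = dict(zip(u + l, u[13:] + u[:13] + l[13:] + l[:13]))
table = str.maketrans(rot13)
txt = (string.ascii_letters + ' ') * 20  # arbitrary sample text


def with_translate():
    return txt.translate(table)


def with_generator():
    # str.join consuming a generator expression
    return ''.join(rot13.get(c, c) for c in txt)


def with_list():
    # str.join consuming a list comprehension
    return ''.join([rot13.get(c, c) for c in txt])


# All three approaches must agree before timing means anything.
assert with_translate() == with_generator() == with_list()

for func in (with_translate, with_generator, with_list):
    seconds = timeit.timeit(func, number=2_000)
    print(f'{func.__name__}: {seconds:.3f}s')
```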
Short version: as it stands, realistically the main benefit is the clarity of using a named, special-purpose method. If that’s what we’re going for, we really should have a named, special-purpose method that people would want to use - that doesn’t lock half its interface behind a helper, doesn’t have weird input expectations that are hard to document, and doesn’t silently do something useless when the input is subtly wrong.
Proposal
Of course, changing any of this directly in translate or maketrans would almost certainly break tons of code. I propose a new method, str.mapped, following the implementation below. It would:
- directly map from characters to output strings, simply expecting only strings for both keys and values
- allow for describing the mapping in multiple, convenient ways, with clear and expected semantics in case of conflict
- not practically require or benefit from any helper methods
- be named like the transformation it is, rather than the in-place mutation it can’t be (since str is immutable)
- check for errors by default
- use keyword-only arguments to avoid ambiguity
- be easy to explain
Here is a pure-Python implementation of the functionality I imagine, in terms of the existing methods, written as a function.
def mapped(s, mapping, *, replace='', replace_with='', remove='', check=True, **extra):
    """Create a new string based on mapping each character of the input.
    mapping -> a dict where keys are characters to replace
    and values are string replacements.
    replace, replace_with -> two strings of equal length; each character
    in `replace` is mapped to the corresponding character in `replace_with`
    remove -> any character in this string is removed
    check -> if set (the default), an exception will be raised if the mapping
    has any invalid keys or values
    **extra -> single-letter keyword arguments specify individual replacements
    Replacements specified by keyword arguments take precedence, then
    `remove`, `replace`/`replace_with` specifications, and the base `mapping`,
    in order."""
    # later entries override earlier ones, giving the precedence described above
    m = {
        **mapping,
        **dict(zip(replace, replace_with)),
        **dict.fromkeys(remove, ''),
        **extra
    }
    if check:
        if any(not isinstance(v, str) for v in m.values()):
            raise TypeError('values must all be strings')
        ks = m.keys()
        if any(not isinstance(k, str) for k in ks):
            raise TypeError('keys must all be strings')
        if any(len(k) != 1 for k in ks):
            raise ValueError('keys must all be length 1')
    return s.translate(str.maketrans(m))
I would like to see the equivalent implemented as a native, built-in method, with appropriate optimizations.
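For what it’s worth, here is a short sketch of how I would expect calls to look. The block repeats a condensed copy of the function (docstring dropped) only so that it runs on its own, and the sample strings are arbitrary:

```python
def mapped(s, mapping, *, replace='', replace_with='', remove='', check=True, **extra):
    # Condensed copy of the proposed function; later entries win on conflict.
    m = {
        **mapping,
        **dict(zip(replace, replace_with)),
        **dict.fromkeys(remove, ''),
        **extra
    }
    if check:
        if any(not isinstance(v, str) for v in m.values()):
            raise TypeError('values must all be strings')
        if any(not isinstance(k, str) for k in m):
            raise TypeError('keys must all be strings')
        if any(len(k) != 1 for k in m):
            raise ValueError('keys must all be length 1')
    return s.translate(str.maketrans(m))


# A base dict, a replace/replace_with pair, and removals in one call.
combined = mapped('hello world', {'o': '0'}, replace='hw', replace_with='HW', remove=' ')
print(combined)  # Hell0W0rld

# Precedence: `remove` beats the base mapping, and a keyword argument beats both.
print(repr(mapped('aaa', {'a': 'b'}, remove='a')))  # ''
print(mapped('aaa', {'a': 'b'}, remove='a', a='c'))  # ccc

# An ordinal key, which str.translate would accept, is rejected by default.
try:
    mapped('a', {ord('a'): 'b'})
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # keys must all be strings
```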