Nicer interface for str.translate

The existing functionality vs. the Zen

For such a simple and elegant and useful idea, str.translate is hard to explain:

translate(self, table, /)
    Replace each character in the string using the given translation table.

      table
        Translation table, which must be a mapping of Unicode ordinals to
        Unicode ordinals, strings, or None.

    The table must implement lookup/indexing via __getitem__, for instance a
    dictionary or list.  If this operation raises LookupError, the character is
    left untouched.  Characters mapped to None are deleted.
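For instance (a minimal illustration of the documented behaviour):

```python
# Keys must be code points (integers); values may be strings of any
# length, or None to delete the character.
table = {ord('a'): 'A', ord('b'): None, ord('c'): 'cc'}
print('abcabc'.translate(table))  # -> 'AccAcc'
```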

To understand this, I have to understand “Unicode ordinal”, which here is used to mean “Unicode code point, as an integer”. But this isn’t a complaint about the phrasing; it’s a complaint that I need to provide keys in that format in the first place. In practical terms, I am expected to fill in this gap with str.maketrans (not appearing in this film, i.e. the above documentation):

    Return a translation table usable for str.translate().
    If there is only one argument, it must be a dictionary mapping Unicode
    ordinals (integers) or characters to Unicode ordinals, strings or None.
    Character keys will be then converted to ordinals.
    If there are two arguments, they must be strings of equal length, and
    in the resulting dictionary, each character in x will be mapped to the
    character at the same position in y. If there is a third argument, it
    must be a string, whose characters will be mapped to None in the result.

That’s also hard to explain, for all the same reasons. Worse yet, the argument semantics vary depending on how many arguments there are, like range but worse. It’s also not possible to, say, combine a dictionary of defaults with a pair of special-case strings, and specifying only characters to remove requires giving two empty strings first. Special cases aren’t special enough to break the rules.
Speaking of which, there’s no particular reason why these dictionaries have to treat None values specially; an empty string could be used instead to get the same result.
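A quick illustration of both points:

```python
# Two- and three-argument forms: parallel strings, plus deletions.
t = str.maketrans('ab', 'xy', 'c')
print('abc'.translate(t))  # -> 'xy'

# Removal-only requires two empty strings first:
t = str.maketrans('', '', 'aeiou')
print('banana'.translate(t))  # -> 'bnn'

# Mapping to None and mapping to '' give the same result:
print('banana'.translate({ord('a'): None}))  # -> 'bnn'
print('banana'.translate({ord('a'): ''}))    # -> 'bnn'
```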

On top of all of that, str.translate allows errors to pass silently without being explicitly silenced:

>>> 'a'.translate({'a': 'b'}) # Wait, what?
'a'
>>> 'a'.translate({ord('a'): 'b'}) # Ah, right.
'b'

Current performance considerations

I don’t want to go through this kind of hassle unless it’s for Numpy-at-its-best sorts of performance gains. And str.translate… just isn’t that impressive. My quick test framework:

import random, timeit

def test(d, crange, size):
    count = 10_000_000 // size # sensible on my system
    e = str.maketrans(d)
    f = {k: (v if v else None) for k, v in e.items()}
    def opt():
        return txt.translate(f)  # table with None in place of empty strings
    def auto():
        return txt.translate(e)
    def manual():
        return ''.join(d.get(c, c) for c in txt)
    def assess(func):
        return timeit.timeit(func, number=count)
    txt = ''.join(chr(random.randrange(crange)) for _ in range(size))
    assert opt() == auto() == manual()
    to, ta, tm = assess(opt), assess(auto), assess(manual)
    # absolute time for the manual version, then ratios using str.translate
    return tm, (to/tm, ta/tm) 

I tried adding the opt version just in case switching empty strings to None is somehow really relevant to the performance. The effect is negligible in all cases.
Overall, str.translate is of course faster, but in my testing the speedup ranges from 1.4x to 3.7x depending on the mapping and the Python version. And it tends strongly towards the low end of that range in most practical cases (the best case for auto is when it could be replaced by return ''). For example, a simple ROT-13 implementation:

$ python3.11
Python 3.11.2 (main, Apr  5 2023, 03:08:14) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import translate
>>> import string
>>> u, l = string.ascii_uppercase, string.ascii_lowercase
>>> d = dict(zip(u+l, u[13:]+u[:13]+l[13:]+l[:13]))
>>> translate.test(d, 0x100, 1000)
(0.746712852967903, (0.7000551883315342, 0.6960985174501878))

And actually, that difference gets even smaller if I allow str.join to work on a list comprehension instead of a generator expression - then str.translate is only about 10% faster.

Short version: as it stands, realistically the main benefit is the clarity of using a named, special-purpose method. If that’s what we’re going for, we really should have a named, special-purpose method that people would want to use - that doesn’t lock half its interface behind a helper, doesn’t have weird input expectations that are hard to document, and doesn’t silently do something useless when the input is subtly wrong.


Of course, changing any of this directly in translate or maketrans would almost certainly break tons of code. I propose a new method, str.mapped, following the implementation below. It would:

  • directly map from characters to output strings, expecting only strings for both keys and values
  • allow for describing the mapping in multiple, convenient ways, with clear and expected semantics in case of conflict
  • not practically require or benefit from any helper methods
  • be named like the transformation it is, rather than the in-place mutation it can’t be (since str is immutable)
  • check for errors by default
  • use keyword-only arguments to avoid ambiguity
  • be easy to explain

Here is a pure-Python implementation of the functionality I imagine, in terms of the existing methods, written as a function.

def mapped(s, mapping, *, replace='', replace_with='', remove='', check=True, **extra):
    """Create a new string based on mapping each character of the input.
    mapping -> a dict where keys are characters to replace 
    and values are string replacements.
    replace, replace_with -> two strings of equal length; each character
    in `replace` is mapped to the corresponding character in `replace_with`
    remove -> any character in this string is removed
    check -> if set (the default), an exception will be raised if the mapping
    has any invalid keys or values
    **extra -> single-letter keyword arguments specify individual replacements
    Replacements specified by keyword arguments take precedence, then
    `remove`, `replace`/`replace_with` specifications, and the base `mapping`,
    in order."""
    m = {
        **mapping,
        **dict(zip(replace, replace_with)),
        **dict.fromkeys(remove, ''),
        **extra,
    }
    if check:
        if any(not isinstance(v, str) for v in m.values()):
            raise TypeError('values must all be strings')
        ks = m.keys()
        if any(not isinstance(k, str) for k in ks):
            raise TypeError('keys must all be strings')
        if any(len(k) != 1 for k in ks):
            raise ValueError('keys must all be length 1')
    return s.translate(str.maketrans(m))
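As a worked example of the intended semantics, this builds the merged mapping by hand, in the same precedence order the function uses:

```python
# Lowest to highest precedence: base mapping, replace/replace_with
# pairs, removals, then (hypothetically) keyword arguments like b='p'.
m = {'f': 's', **dict(zip('r', 'm')), **dict.fromkeys('o', ''), 'b': 'p'}
print('foobar'.translate(str.maketrans(m)))  # -> 'spam'
```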

I would like to see the equivalent implemented as a native, built-in method, with appropriate optimizations.


I’ve very rarely, if ever, used str.translate. I suspect that if it didn’t already exist, it wouldn’t be accepted as a new feature. With that in mind, I’ll say what I would have said if this proposal had been made in the absence of str.translate - why don’t you publish this implementation as a 3rd party library on PyPI, collect feedback about how useful it is and whether the design feels right to people that way, and if it’s popular and successful, then propose it for the core?

Even with the existence of str.translate, this is a good approach, as you can get comparative feedback (is this new form useful enough for people to seek it out in preference to str.translate?).

Remember that if this were added to the core, str.translate would have to stay around, quite possibly for ever, because we don’t break working code without a very good reason (and “we thought of a better UI” typically isn’t sufficiently good).


Searching for the .translate method on GitHub turns up plenty of hits. Certainly some of those are false positives, but the first results I saw were dominated by what appear to be real uses of str.translate. It’s certainly not as popular as .title, but it’s way more common than, say, .removeprefix or .swapcase, and possibly more popular than e.g. .isalpha. I’d call that pretty significant.

In my own experience, I commonly get halfway through suggesting str.translate to beginners for their toy cipher projects and such before remembering how much I have to explain after that point. It also comes up organically when discussing the general topic of sequence-processing algorithms: I might discuss one-to-one mapping and filtering with a list comprehension, then mention one-to-many mapping and how to get a flat result, then how comprehensions don’t care about the input type but dictate an output type… and then str has special machinery, right there, but it’s not so pleasant to explain.

Interestingly enough, there are also bytes.translate and bytes.maketrans, which have equally annoying interfaces that are also obnoxiously inconsistent with the str ones. The mapping for bytes specifically needs to be a bytes object of length 256 (or bytearray, memoryview etc., but the built-in help doesn’t say that! Or None to specify only removals, but it doesn’t say that either!), and there’s a separate argument for values to remove. A dict with integer keys (it makes sense this time!) isn’t acceptable; even a list or tuple isn’t. Input bytes can be mapped one-to-one or removed, but not mapped one-to-many. bytes.maketrans can only take exactly two arguments, giving the one-to-one mapping.
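A minimal sketch of the bytes interface, for contrast:

```python
# bytes.maketrans takes exactly two equal-length bytes objects and
# returns a 256-byte lookup table; deletions go in a separate argument
# to bytes.translate rather than in the table itself.
table = bytes.maketrans(b'ab', b'xy')
print(b'abcabc'.translate(table, b'c'))  # -> b'xyxy'

# A dict is not accepted:
# b'abc'.translate({97: 120})  raises TypeError
```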

I don’t really know how to “advertise” my work, but I’d certainly be interested in publishing such a utility. (I might want to support corresponding algorithms for other sequences, too.)

I absolutely agree that the existing implementations can’t be removed.

Why would someone do that when str.translate() documentation clearly states:

  • You can use str.maketrans() to create a translation map from character-to-character mappings in different formats.

Because of the lack of expectation of a need to worry about it. Also, that’s only in the online documentation, not the built-in help.


Using help(str.translate) I got:

Translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, strings, or None.

The built-in help explicitly states that the mapping key must be a Unicode ordinal:

'a'.translate({ord('a'): 'b'})

Yes, I understand exactly how it works. The point of the proposal is that the way that it currently works is unpleasant and non-obvious, and the requirement seems to be based more in historical reasons than what actually makes the most sense. The fact that an interface is fully documented (at least, for those of us who understand terms like “Unicode ordinal” - which I suggest is a higher bar to clear than the functionality really demands) isn’t supposed to excuse these kinds of shortcomings.

Keep in mind that experienced developers still forget things, and may assume they know how to use something only to have to go through that cycle of debugging. The fact that they’re capable of such debugging, and solving the problem themselves, doesn’t make it any less obnoxious.


I agree with Paul Moore. It would be better to publish this implementation as a 3rd-party library on PyPI. If it becomes popular, it will likely end up in the Python core.

The problem with combining str.translate and str.maketrans is that translate() is designed so it doesn’t need to do any processing through the mapping it’s given at all - it immediately loops through the string, doing lookups for each character. Ideally you’d store the result of maketrans(), and reuse it for multiple translate() calls.
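For example, the intended pattern is roughly (a ROT-13 table, built once and reused):

```python
import string

# Build the table once, up front...
lower = string.ascii_lowercase
rot13 = str.maketrans(lower, lower[13:] + lower[:13])

# ...then reuse it across many translate() calls, each of which
# just loops over the string doing lookups.
print('hello'.translate(rot13))  # -> 'uryyb'
print('uryyb'.translate(rot13))  # -> 'hello'
```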

If you wanted to improve this, perhaps a better API would be to use a bespoke type, which can store the mappings more directly, and check them all.

I’m not entirely sure what you’re trying to say here. The mapping is a mapping either way. str.translate expects the keys to be Unicode ordinals rather than characters. My hypothesis is that this doesn’t meaningfully help the C implementation. Yes, presumably at the C level it can directly determine the code point value as it iterates through the string, and not need to create one-character string objects for the lookup routine. But it presumably does need to create integer objects to use the dict lookup routines.

The code shown is a Python mockup to explain the desired functionality by adapting it to the old interface (which is why it uses the one-argument form of maketrans). If it were implemented in C instead, it could directly do the lookup work with a mapping that uses single-character strings as keys, rather than ordinals - no maketrans call involved. Alternately, I could have written the mockup code to use the ''.join idiom instead.

To showcase the idea fully as a separate package it would need to include a C extension as well, I think. The point is to demonstrate having access to the small performance boost available that way, while not demanding the ordinal-key input format. I should attempt this, because it will help verify the hypothesis as well :wink:

Of course, this still doesn’t allow for adding a method to the str type - that can only be done “from inside”, as far as I know. If the idea were accepted later, I would want it to work that way.

Update: no, it is possible and I just saw it in the other thread. Sorely tempted to show it off that way. A hack, yes, but it means the idea as presented in the package works as much as possible like what I envision.

Sure, there are cases where constructing the mapping is expensive compared to the translation, and I can separate the steps. The point is more about also having an interface that is simple to use for a one-shot. The existing design could easily have a fast path for when only a mapping is provided (no copy); then it would just need to provide another function for only creating the mapping without performing translation.

I don’t really like the idea of making a separate type, because it would essentially just wrap a dict and hide most of its functionality. It seems like the kind of thing Jack Diederich warned against.

First attempt is looking good and has some basic debugging done (traceback paths censored):

$ python3.11
Python 3.11.2 (main, Apr  5 2023, 03:08:14) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import betterstr
>>> 'foobar'.mapped({'f': 's'}, replace='r', replace_with='m', remove='o', b='p')
'spam'
>>> m = str.mapping({'f': 's'}, replace='r', replace_with='m', remove='o', b='p')
>>> 'foobar'.mapped(m)
'spam'
>>> m[42] = ()
>>> 'foobar'.mapped(m)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...", line 67, in mapped
  File "...", line 12, in check_mapping
    raise TypeError('values must all be strings')
TypeError: values must all be strings
>>> 'foobar'.mapped(m, check=False)
>>> str.mapping({'f': 's'}, replace='r', replace_with='m', remove='o', b='p')
{'f': 's', 'r': 'm', 'o': '', 'b': 'p'}

Next step is proper unit tests, and then maybe I can look at the C API a little…




… You know, it’s strange.

I know about str.translate because I had one of those ciphering or filtering tasks at some point, did some research and figured out how to use it.

But now, when I started looking into the C code for str.translate, as well as how it was documented for 2.x (when str meant the bytes type), I got the distinct impression that it was always intended for tasks that involve dealing with code pages. The implementation is heavily reliant upon the same machinery that encode and decode use, and the 2.x documentation even talks about the codecs module and gives encodings.cp1251 as an example codec.

The problem is, none of the feasible tasks I can imagine make much sense this way, nor did they ever. Not least because explicit .decode and .encode methods exist, of course.

(Skip to the bottom paragraph if you don’t like rambling technical analysis I guess.)

You can’t use bytes.translate, and couldn’t effectively use str.translate, to decode bytes to Unicode, even for these single-byte encodings. It expects a mapping table that is bytes:

>>> import encodings.cp1251
>>> b'foo'.translate(encodings.cp1251.decoding_table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'

In 2.x, of course you can supply a Unicode object, but somehow there is an implicit coercion of the bytes to Unicode before the translation is attempted:

>>> import encodings.cp1251
>>> 'foo'.translate(encodings.cp1251.decoding_table)
u'foo'
>>> '\xff'.translate(encodings.cp1251.decoding_table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

Weirdly, it’s self that gets coerced, not the argument!

You also can’t readily use it to convert from bytes-masquerading-as-string in one code page to another code page. First off because that task is a non-starter in general (by design, only the ASCII range is reliable; most or all of the “extended ascii characters” represented by the current code page aren’t represented by any byte in the target code page; that’s why there are code pages). But also because the only error handling options you have are to delete the character or translate it to a replacement character like ? (not an actual Unicode replacement character, of course).

I mean, the approach does give better performance:

def cp850_to_cp1252(b, replace):
    return b.translate(matches, b'' if replace else delete)

def xcp850_to_cp1252(b, replace):
    return b.decode('cp850').encode('cp1252', errors=('replace' if replace else 'ignore'))

# test results:
>>> from timeit import timeit
>>> timeit("cp850_to_cp1252(bytes(range(256)), False)", globals=globals())
>>> timeit("xcp850_to_cp1252(bytes(range(256)), False)", globals=globals())

But look how much setup (and knowledge) is required:

import encodings.cp850, encodings.cp1252

# n.b. neither of these is actually ISO-8859-1!
windows = encodings.cp1252.decoding_table
latin1 = encodings.cp850.decoding_table

matches = [windows.find(latin1[i]) for i in range(256)]
delete = bytes(i for i in range(256) if matches[i] == -1)
# convert -1 values to 63 (ASCII question mark)
matches = bytes(63 if x == -1 else x for x in matches)

And, of course, we lose the option of strict or (for those who actually have a use for it) xmlcharrefreplace error handling.

Before you ask, no, the convoluted .find logic is necessary. You can’t go through the corresponding .encoding_tables - nor can you use them directly with translate. They’re instances of an EncodingMap class (defined by builtins despite not exposing a builtin name) that doesn’t do anything useful at the Python level:

>>> dir(encodings.cp1252.encoding_table) # in particular, notice there's no __getitem__
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'size']
>>> encodings.cp1252.encoding_table.size()
>>> # and there are no magic attributes either:
>>> encodings.cp1252.encoding_table.__class__.__getattribute__ is object.__getattribute__
True

(Well, that was enlightening.)

Short version: I don’t really understand why the method was exposed in the first place, because its apparently intended use case makes little sense. It’s presented as an encoding and decoding tool, but it maps bytes to bytes and Unicode to Unicode. Nowadays, however, str.translate provides a powerful tool for filtering and “ciphering” strings: it replaces something like ''.join(mapping[x] for x in original), but faster; and it would be simpler too, were it not for the interface weirdness. That weirdness, in turn, seems to be a historical artifact that’s hard to justify.


translate predates Unicode strings. In Python 2, where str was bytestrings, the translation table was a 256-character string. You can still do that in Python 3:

>>> swapcase = '________________________________ !"#$%&\'()*+,-./0123456789:;<=>?@abcdefghijklmnopqrstuvwxyz[\\]^_`ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> 'Hello'.translate(swapcase)
'hELLO'

That’s why the lookup is by ordinal, rather than by the character itself: swapcase[72] == 'h'.

For Unicode strings, the translation table needs to be a mapping instead… and, yeah, the API doesn’t make that much sense nowadays, if you don’t know the history.

IMO, if you want to improve str.translate, it’s best to leave it alone, for code ported from py2, and design a new API, starting with the actual use cases – not “a better API for translate” :‍)


Well, that led me down an interesting path. Since 2.0, the str type has had both translate and encode methods (where, of course, encode has to decode implicitly first), along with the unicode type being added. Before that, in 1.x, str objects don’t appear to have had any methods besides __mod__ and the usual sequence methods. However, translate and maketrans existed as functions provided by the string standard library module. The legacy download pages don’t offer documentation before 1.4; but I was able to find the source tarball releases, which apparently included LaTeX documentation files. Through these I could confirm that translate was added in 1.3, and maketrans in 1.4.

I’m aware of that part, and my historical examination required getting up close and personal with it. (I’m not re-uninstalling 2.7 this time, just in case I ever have to do something like this again :wink: )

A bunch of historical notes

Doing even further reading, investigating the fact that .encode was provided since 2.0, but .decode was only added in 2.2… I checked out the What’s new in 2.2 document, along with whatever documentation I could find from that era. It comes across that 2.0/2.1 byte-based str objects could .encode just like Unicode strings, purely as an artifact of the attempt to handle the two types the same way (“These are the string methods which both 8-bit strings and Unicode objects support”). Of course, that would entail an implicit decoding first, apparently expecting plain ASCII. So there wasn’t actually a direct way to convert from bytes to Unicode with a specified encoding: .encode couldn’t do it, and .translate couldn’t do it (because it was only bytes-to-bytes).

2.2, of course, added the ability to .decode both “byte-strings” and Unicode, which in the latter case entails implicitly encoding first. Decoding functionally starts with bytes and ends with whatever the codec says it should, so this was also great for representing bytes-to-bytes transformation like ROT13, base64 encoding, compression… all sharing the same conceptual space as actual text codecs, and all being newly added. (I even remember using some of these somewhere around 2.4.) Although of course, only the actual text codecs would implicitly encode before decoding; things like .decode('zlib') can only work if every byte is just a byte and all byte values are valid.

It gets even weirder with 2.3. This is when the basestring ABC was added, despite all the functionality already being shared to a distressing degree. But this is also the version where translate starts being documented as having different parameters for Unicode strings! And then, you couldn’t use string.maketrans to build a mapping suitable for unicode.translate - even giving it two unicode inputs results in a str. That can be keyed with Unicode ordinals, but the resulting one-character strs don’t work with unicode.translate.

Then in 3.0 with the change to Unicode-by-default, that ABC was removed, citing the reasoning that the two types don’t have enough functionality in common. Yet I’m not sure I could name a 3.0 bytes method that wasn’t also present on str. Both also got a .translate method. But 3.0 only gave maketrans to str (remedied in 3.1). To build an old-school translation table for a bytes object in 3.0, you had to go back to… the string module. Yep.

I know we all know that the history of Unicode support in Python is not all that pleasant.

But what I’m trying to get at is, seen as a tool for handling text encoding, the translate method was never particularly useful. For all the time that separate bytes (whether treated as “byte-strings” or not) and Unicode types existed, it didn’t convert from one to the other, despite apparently leveraging the mechanisms for doing so.

Version-by-version breakdown
  • In 1.3 through 1.6, with considerable effort (keep in mind there were no list comprehensions, and joining a list l of strings would have looked like string.joinfields(l, '')), you could possibly have put together a translation table to map from one code page to another, at least the subset of characters represented by both. But there wasn’t Unicode at all.
  • In 2.0 and 2.1, you didn’t have any convenient bytes->Unicode route, and translate couldn’t do anything encode couldn’t - at least, in terms of legitimate, recognized text encodings.
  • In 2.2, translate still couldn’t do anything decode couldn’t. It didn’t even have noticeable upsides for bytes-to-bytes transformations; the new codecs were designed with encode and decode in mind, and using those names allowed for being explicit about which direction you were going. Plus they could handle transformations that aren’t 1-to-1 byte mappings.
  • In 2.3 through 2.7, translate had a separate interface for Unicode objects, but whoever designed this apparently thought that while it makes sense to use arbitrary mapping types and let the values be Unicode text (even more than one character, now!), the keys still had to be ordinals only. The UI was barely connected to the original, but it still kept a weird and awkward restriction. And this was still not useful for encoding, since it mapped (as it still does) Unicode to Unicode. Meanwhile, the str equivalent mapped bytes to bytes, so that wasn’t useful for decoding. And you still had to lean on the string module for byte-based maketrans, and still didn’t have a unicode-based maketrans at all.
  • In 3.x, the types are properly separated, and translate and maketrans are fully integrated (except that bytes.maketrans was missing in 3.0). But it still doesn’t encode or decode text encodings.

It’s hard to tell whether the translate and maketrans methods were really ever intended for the purpose. If they were, it raises interesting questions, like:

  • Why weren’t they ever deprecated?
  • When the new interface for translate for Unicode strings was added in 2.3, why didn’t it get any maketrans support?
  • Why did 3.0 add that missing maketrans support to the new str type, while not adding the equivalent to bytes? Did devs just assume it was something Unicode strings were supposed to be able to do, because a thing with the same name was in the old string module?
  • If it was such a big deal in 3.x that bytes and Unicode were becoming distinct types, and the supposed purpose of translate was anything to do with text encoding, and the whole point of a major version bump was to be able to introduce breaking changes… why didn’t it get changed at that point to make bytes-to-Unicode or Unicode-to-bytes transformations?

But on the other hand: if not that, what was it intended for? (And why does the C implementation do the things it does?)

I can’t make sense of this distinction. The API I have in mind seems better suited to the use cases I’m seeing (and have thought of in my own code), though. For example, Django uses it for escaping some special characters for embedding in HTML and JavaScript contexts, mapping through some simple hard-coded dictionaries. A change like this would allow dropping a lot of ord calls and thus make the code more elegant. I also see it used for things like removing punctuation - routing through maketrans explicitly makes it that much clunkier. Torch uses it to replace several different characters with underscores, rather than chaining multiple .replace calls. (There’s a nested loop, so I’m guessing that explicitly using maketrans there is an intentional performance optimization; but I’m not about to make this use case worse, either.)
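The pattern being described looks roughly like this (an illustrative sketch, not Django’s actual code):

```python
# Under the current interface, every key in a hard-coded escaping
# table needs an ord() call.
_html_escapes = {
    ord('&'): '&amp;',
    ord('<'): '&lt;',
    ord('>'): '&gt;',
}
print('a < b & c'.translate(_html_escapes))  # -> 'a &lt; b &amp; c'
```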

It’s also in pathlib, where either of two pre-set mappings is computed using maketrans to avoid the ord calls. (Curiously, your fork came up first on my last GitHub search :wink: )

The use cases you mention (with Django/Torch), and possibly some more. If you want to do some more archaeology, you might want to look at the Unix tr command; it looks to me like translate was inspired by that.

For things like replacing unwanted characters with underscores, in C back in 1999, a lookup table with (up to) 256 bytes would be the obvious choice. It wouldn’t need an explanation. The story of it becoming esoteric is quite interesting; thanks for digging up the Python leg of the journey :‍)

Nowadays, maybe we want something like teaching replace to do multiple replacements instead. It seems that limiting the matches to single characters isn’t particularly important.
Of course, multireplace has its own issues.


Thanks for the information.

For the common case (I might have a little bit of availability bias here at the moment :wink: ) of replacing single characters (which implies that matches can’t overlap), I already have a working interface right here (maybe it can be improved a bit).

For multiple-character matches, the semantics that one gets from the standard regex-based approach - i.e., scan the string until something replaceable is found, and don’t consider either the replaced characters or their replacements at all in future matches, by skipping past them - are the only semantics that make any sense to me. An iterated-replacement approach, aside from having unclear semantics for overlapping matches, has other problems caused by the sequencing: e.g., it can’t effectively “swap” two substrings, and it’s potentially slow due to needing multiple passes.

Using a trie seems to be essentially a special-purpose regex implementation - traversing the trie rather than a regex engine’s more general-purpose DFA in order to find matches. An approach that actually used regex should look for the longest matches first: matches of the same length can’t conflict; and if a shorter candidate were preferred over a longer one, the shorter one would have to be a prefix in order to conflict; and in this case the longer candidate never matches. However, actually building a regex or a formal trie seems like a lot of work for what will often be a throwaway task. What I think I’ll try first is just precomputing the longest key length in the mapping, then trying to look for matches of that length and backtracking down to 1. (Actually, I can recall a previous project where I did something similar rather than trying to implement a trie properly… I think it was part of a shortest-common-superstring algorithm…)
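A sketch of those semantics with re (a hypothetical multireplace helper: longest keys first, single left-to-right pass):

```python
import re

def multireplace(s, mapping):
    # Sort keys longest-first so a longer candidate always beats a
    # shorter prefix; re.sub makes one left-to-right pass and never
    # re-scans replaced text.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)))
    return pattern.sub(lambda m: mapping[m.group()], s)

# Swapping works precisely because matches are skipped past:
print(multireplace('spam and eggs', {'spam': 'eggs', 'eggs': 'spam'}))
# -> 'eggs and spam'
```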


It is an easy function to implement with map or even a list comprehension, and after using Python for years I only recently heard of it. Honestly, I’ve never had a use case for it; usually it is replacing single or multiple chars for some sort of comprehension or encoding. Really an odd function for me due to its limitations.

It may be an unusual function, but it is actually one of my favourites. I’ve answered many a Caesar Cipher question on Stack Overflow using it, but also one using it to perform superscripting of integers.

I could point to plenty of functions in Python I haven’t used, like math.lgamma(x). Just because someone doesn’t ever use a function doesn’t make it odd.