Yes, I understand exactly how it works. The point of the proposal is that the way it currently works is unpleasant and non-obvious, and the requirement seems grounded more in historical reasons than in what actually makes sense. The fact that an interface is fully documented (at least, for those of us who understand terms like “Unicode ordinal” - which I suggest is a higher bar to clear than the functionality really demands) isn’t supposed to excuse these kinds of shortcomings.
Keep in mind that experienced developers still forget things, and may assume they know how to use something only to have to go through that cycle of debugging. The fact that they’re capable of such debugging, and solving the problem themselves, doesn’t make it any less obnoxious.
I agree with Paul Moore. It would be better to publish this implementation as a 3rd-party library on PyPI. If it becomes popular, it will likely end up in the Python core.
The problem with combining str.translate and str.maketrans is that translate() is designed not to do any preprocessing of the mapping it’s given - it immediately loops through the string, doing a lookup for each character. Ideally you’d store the result of maketrans() and reuse it for multiple translate() calls.
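For illustration, the reuse pattern looks something like this (a minimal sketch; the table contents and data here are made up):

lines = ['sample text', 'more text']
leet = str.maketrans({'a': '4', 'e': '3', 'o': '0'})  # build the mapping once
results = [line.translate(leet) for line in lines]    # reuse it for every call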
If you wanted to improve this, perhaps a better API would be to use a bespoke type, which can store the mappings more directly, and check them all.
I’m not entirely sure what you’re trying to say here. The mapping is a mapping either way. str.translate expects the keys to be Unicode ordinals rather than characters. My hypothesis is that this doesn’t meaningfully help the C implementation. Yes, presumably at the C level it can directly determine the code point value as it iterates through the string, and not need to create one-character string objects for the lookup routine. But it presumably does need to create integer objects to use the dict lookup routines.
The code shown is a Python mockup to explain the desired functionality by adapting it to the old interface (which is why it uses the one-argument form of maketrans). If it were implemented in C instead, it could directly do the lookup work with a mapping that uses single-character strings as keys, rather than ordinals - no maketrans call involved. Alternately, I could have written the mockup code to use the ''.join idiom instead.
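Since the mockup itself isn’t reproduced here, a rough sketch of the ''.join variant (the function and parameter names are mine):

def translated(s, char_map):
    # char_map uses single-character strings as keys - no maketrans involved
    return ''.join(char_map.get(c, c) for c in s)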
To showcase the idea fully as a separate package, it would need to include a C extension as well, I think. The point is to demonstrate having access to the small performance boost available that way, while not demanding the ordinal-key input format. I should attempt this, because it will help verify the hypothesis as well.
Of course, this still doesn’t allow for adding a method to the str type - that can only be done “from inside”, as far as I know. If the idea were accepted later, I would want it to work that way.
Update: no, it is possible and I just saw it in the other thread. Sorely tempted to show it off that way. A hack, yes, but it means the idea as presented in the package works as much as possible like what I envision.
Sure, there are cases where constructing the mapping is expensive compared to the translation, and I can separate the steps. The point is more about also having an interface that is simple to use for a one-shot. The existing design could easily have a fast path for when only a mapping is provided (no copy); then it would just need to provide another function for only creating the mapping without performing translation.
I don’t really like the idea of making a separate type, because it would essentially just wrap a dict and hide most of its functionality. It seems like the kind of thing Jack Diederich warned against.
I know about str.translate because I had one of those ciphering or filtering tasks at some point, did some research and figured out how to use it.
But now, when I started looking into the C code for str.translate, as well as how it was documented for 2.x (when str meant the bytes type), I got the distinct impression that it was always intended for tasks that involve dealing with code pages. The implementation is heavily reliant upon the same machinery that encode and decode use, and the 2.x documentation even talks about the codecs module and gives encodings.cp1251 as an example codec.
The problem is, none of the feasible tasks I can imagine make much sense this way, nor did they ever. Not least because explicit .decode and .encode methods exist, of course.
(Skip to the bottom paragraph if you don’t like rambling technical analysis I guess.)
You can’t use bytes.translate, and couldn’t effectively use str.translate, to decode bytes to Unicode, even for these single-byte encodings. It expects a mapping table that is bytes:
>>> import encodings.cp1251
>>> b'foo'.translate(encodings.cp1251.decoding_table)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
In 2.x, of course you can supply a Unicode object, but somehow there is an implicit coercion of the bytes to Unicode before the translation is attempted:
>>> import encodings.cp1251
>>> 'foo'.translate(encodings.cp1251.decoding_table)
u'foo'
>>> '\xff'.translate(encodings.cp1251.decoding_table)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Weirdly, it’s self that gets coerced, not the argument!
You also can’t readily use it to convert from bytes-masquerading-as-string in one code page to another code page. First off because that task is a non-starter in general (by design, only the ASCII range is reliable; most or all of the “extended ascii characters” represented by the current code page aren’t represented by any byte in the target code page; that’s why there are code pages). But also because the only error handling options you have are to delete the character or translate it to a replacement character like ? (not an actual Unicode replacement character, of course).
I mean, the approach does give better performance:
def cp850_to_cp1252(b, replace):
    return b.translate(matches, b'' if replace else delete)

def xcp850_to_cp1252(b, replace):
    return b.decode('cp850').encode('cp1252', errors=('replace' if replace else 'ignore'))
# test results:
>>> from timeit import timeit
>>> timeit("cp850_to_cp1252(bytes(range(256)), False)", globals=globals())
3.338275376241654
>>> timeit("xcp850_to_cp1252(bytes(range(256)), False)", globals=globals())
7.253022187855095
But look how much setup (and knowledge) is required:
import encodings.cp850, encodings.cp1252
# n.b. neither of these is actually ISO-8859-1!
windows = encodings.cp1252.decoding_table
latin1 = encodings.cp850.decoding_table
matches = [windows.find(latin1[i]) for i in range(256)]
delete = bytes(i for i in range(256) if matches[i] == -1)
# convert -1 values to 63 (ASCII question mark)
matches = bytes(63 if x == -1 else x for x in matches)
And, of course, we lose the option of strict or (for those who actually have a use for it) xmlcharrefreplace error handling.
Before you ask, no, the convoluted .find logic is necessary. You can’t go through the corresponding .encoding_tables - nor can you use them directly with translate. They’re instances of an EncodingMap class (defined by builtins despite not exposing a builtin name) that doesn’t do anything useful at the Python level:
>>> dir(encodings.cp1252.encoding_table) # in particular, notice there's no __getitem__
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'size']
>>> encodings.cp1252.encoding_table.size()
1119
>>> # and there are no magic attributes either:
>>> encodings.cp1252.encoding_table.__class__.__getattribute__ is object.__getattribute__
True
(Well, that was enlightening.)
Short version: I don’t really understand why the method was exposed in the first place, because its apparently intended use case makes little sense. It’s presented as an encoding and decoding tool, but it maps bytes to bytes and Unicode to Unicode. However, nowadays str.translate provides a powerful tool for filtering and “ciphering” strings - it replaces something like ''.join(mapping[x] for x in original), but faster; it would also be simpler, were it not for the interface weirdness. That weirdness, in turn, seems to be a historical artifact that’s hard to justify.
translate predates Unicode strings. In Python 2, where str was bytestrings, the translation table was a 256-character string. You can still do that in Python 3:
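A minimal reconstruction of what such a table looks like (my sketch; swapcase is restricted to ASCII letters here so that every entry stays exactly one character):

>>> import string
>>> swapcase = ''.join(chr(i).swapcase() if chr(i) in string.ascii_letters else chr(i) for i in range(256))
>>> 'Hello'.translate(swapcase)
'hELLO'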
That’s why the lookup is by ordinal rather than by the character itself: swapcase[72] == 'h'.
For Unicode strings, the translation table needs to be a mapping instead… and, yeah, the API doesn’t make that much sense nowadays, if you don’t know the history.
IMO, if you want to improve str.translate, it’s best to leave it alone, for code ported from py2, and design a new API, starting with the actual use cases – not “a better API for translate” :)
Well, that led me down an interesting path. Since 2.0, the str type has had both translate and encode methods (where, of course, encode has to decode implicitly first), alongside the newly added unicode type. Before that, in 1.x, str objects don’t appear to have had any methods besides __mod__ and the usual sequence methods; however, translate and maketrans existed as functions provided by the string standard library module. I couldn’t find documentation pages for versions before 1.4 through the legacy downloads pages, but I was able to find the source tarball releases, which apparently included LaTeX documentation files. Through these I could confirm that translate was added in 1.3, and maketrans in 1.4.
I’m aware of that part, and my historical examination required getting up close and personal with it. (I’m not re-uninstalling 2.7 this time, just in case I ever have to do something like this again.)
A bunch of historical notes
Doing even further reading, investigating the fact that .encode was provided since 2.0, but .decode was only added in 2.2… I checked out the What’s new in 2.2 document, along with whatever documentation I could find from that era. It comes across that 2.0/2.1 byte-based str objects could .encode just like Unicode strings, purely as an artifact of the attempt to handle the two types the same way (“These are the string methods which both 8-bit strings and Unicode objects support”). Of course, that would entail an implicit decoding first, apparently expecting plain ASCII. So there wasn’t actually a direct way to convert from bytes to Unicode with a specified encoding: .encode couldn’t do it, and .translate couldn’t do it (because it was only bytes-to-bytes).
2.2, of course, added the ability to .decode both “byte-strings” and Unicode, which in the latter case entails implicitly encoding first. Decoding functionally starts with bytes and ends with whatever the codec says it should, so this was also great for representing bytes-to-bytes transformation like ROT13, base64 encoding, compression… all sharing the same conceptual space as actual text codecs, and all being newly added. (I even remember using some of these somewhere around 2.4.) Although of course, only the actual text codecs would implicitly encode before decoding; things like .decode('zlib') can only work if every byte is just a byte and all byte values are valid.
It gets even weirder with 2.3. This is when the basestring ABC was added, despite all the functionality already being shared to a distressing degree. But this is also the version where translate starts being documented as having different parameters for Unicode strings! And yet you couldn’t use string.maketrans to build a mapping suitable for unicode.translate - even giving it two unicode inputs results in a str. That can be keyed with Unicode ordinals, but the resulting one-character strs don’t work with unicode.translate.
Then in 3.0, with the change to Unicode-by-default, that ABC was removed, citing the reasoning that the two types don’t have enough functionality in common. Yet I’m not sure I could name a 3.0 bytes method that wasn’t also present on str. Both also got a .translate, with the same interface split as before. But 3.0 only gave maketrans to str (remedied in 3.1). To build an old-school translation table for a bytes object in 3.0, you had to go back to… the string module. Yep.
I know we all know that the history of Unicode support in Python is not all that pleasant.
But what I’m trying to get at is, seen as a tool for handling text encoding, the translate method was never particularly useful. For all the time that separate bytes (whether treated as “byte-strings” or not) and Unicode types existed, it didn’t convert from one to the other, despite apparently leveraging the mechanisms for doing so.
Version-by-version breakdown
In 1.3 through 1.6, with considerable effort (keep in mind there were no list comprehensions, and joining a list l of strings would have looked like string.joinfields(l, '')), you could possibly have put together a translation table to map from one code page to another - at least for the subset of characters represented by both. But there wasn’t Unicode at all.
In 2.0 and 2.1, you didn’t have any convenient bytes->Unicode route, and translate couldn’t do anything encode couldn’t - at least, in terms of legitimate, recognized text encodings.
In 2.2, translate still couldn’t do anything decode couldn’t. It didn’t even have noticeable upsides for bytes-to-bytes transformations; the new codecs were designed with encode and decode in mind, and using those names allowed for being explicit about which direction you were going. Plus they could handle transformations that aren’t 1-to-1 byte mappings.
In 2.3 through 2.7, translate had a separate interface for Unicode objects, but whoever designed it apparently thought that while it makes sense to use arbitrary mapping types and let the values be Unicode text (even more than one character, now!), the keys still had to be ordinals only. The interface was barely connected to the original, but it still kept a weird and awkward restriction. And it was still not useful for encoding, since it mapped (as it still does) Unicode to Unicode. Meanwhile, the str equivalent mapped bytes to bytes, so that wasn’t useful for decoding. And you still had to lean on the string module for byte-based maketrans, and still didn’t have a unicode-based maketrans at all.
In 3.x, the types are properly separated, and translate and maketrans are fully integrated (except for bytes.maketrans not existing in 3.0). But it still doesn’t encode or decode text encodings.
It’s hard to tell whether the translate and maketrans methods were really ever intended for the purpose. If they were, it raises interesting questions, like:
Why weren’t they ever deprecated?
When the new interface for translate for Unicode strings was added in 2.3, why didn’t it get any maketrans support?
Why did 3.0 add that missing maketrans support to the new str type, while not adding the equivalent to bytes? Did devs just assume it was something Unicode strings were supposed to be able to do, because a thing with the same name was in the old string module?
If it was such a big deal in 3.x that bytes and Unicode were becoming distinct types, and the supposed purpose of translate was anything to do with text encoding, and the whole point of a major version bump was to be able to introduce breaking changes… why didn’t it get changed at that point to make bytes-to-Unicode or Unicode-to-bytes transformations?
But on the other hand: if not that, what was it intended for? (And why does the C implementation do the things it does?)
I can’t make sense of this distinction. The API I have in mind seems better suited to the use cases I’m seeing (and have thought of in my own code), though. For example, Django uses it for escaping some special characters for embedding in HTML and JavaScript contexts, mapping through some simple hard-coded dictionaries. A change like this would allow dropping a lot of ord calls and thus make the code more elegant. I also see it used for things like removing punctuation - routing through maketrans explicitly makes it that much clunkier. Torch uses it to replace several different characters with underscores, rather than chaining multiple .replace calls. (There’s a nested loop, so I’m guessing that explicitly using maketrans there is an intentional performance optimization; but I’m not about to make this use case worse, either.)
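To illustrate (a simplified sketch, not Django’s actual code):

# current API: keys must be ordinals
escapes = {ord('&'): '&amp;', ord('<'): '&lt;', ord('>'): '&gt;'}
print('a < b & c'.translate(escapes))   # a &lt; b &amp; c
# under the proposal (hypothetical), the ord calls would go away:
# 'a < b & c'.translate({'&': '&amp;', '<': '&lt;', '>': '&gt;'})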
It’s also in pathlib, where either of two pre-set mappings is computed using maketrans to avoid the ord calls. (Curiously, your fork came up first on my last GitHub search.)
The use cases you mention (with Django/Torch), and possibly some more. If you want to do some more archaeology, you might want to look at the Unix tr command; it looks to me like translate was inspired by that.
For things like replacing unwanted characters with underscores, in C back in 1999, a lookup table with (up to) 256 bytes would be the obvious choice. It wouldn’t need an explanation. The story of it becoming esoteric is quite interesting; thanks for digging up the Python leg of the journey :)
Nowadays, maybe we want something like teaching replace to do multiple replacements instead. It seems that limiting the matches to single characters isn’t particularly important.
Of course, multireplace has its own issues.
For the common case (I might have a little bit of availability bias here at the moment) of replacing single characters (which implies that matches can’t overlap), I already have a working interface right here (maybe it can be improved a bit).
For multiple-character matches, the semantics that one gets from the standard regex-based approach - i.e., scan the string until something replaceable is found, and don’t consider either the replaced characters or their replacements at all in future matches, by skipping past them - are the only semantics that make any sense to me. An iterated-replacement approach, aside from having unclear semantics for overlapping matches, has other problems caused by the sequencing: e.g., it can’t effectively “swap” two substrings, and it’s potentially slow due to needing multiple passes.
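For reference, a throwaway sketch of that regex-based approach (my own code; keys are sorted longest-first so longer candidates always win):

import re

def multireplace(s, mapping):
    # alternation tries keys longest-first; sub() scans left to right and
    # never reconsiders replaced text, so "swaps" work correctly
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)))
    return pattern.sub(lambda m: mapping[m.group()], s)

print(multireplace('ab', {'a': 'b', 'b': 'a'}))  # 'ba', unlike chained .replace calls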
Using a trie seems to be essentially a special-purpose regex implementation - traversing the trie rather than a regex engine’s more general-purpose DFA in order to find matches. An approach that actually used regex should look for the longest matches first: matches of the same length can’t conflict; and if a shorter candidate were preferred over a longer one, the shorter one would have to be a prefix in order to conflict; and in this case the longer candidate never matches. However, actually building a regex or a formal trie seems like a lot of work for what will often be a throwaway task. What I think I’ll try first is just precomputing the longest key length in the mapping, then trying to look for matches of that length and backtracking down to 1. (Actually, I can recall a previous project where I did something similar rather than trying to implement a trie properly… I think it was part of a shortest-common-superstring algorithm…)
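Here’s what that first attempt might look like (hypothetical code; scan each position, trying the longest possible match first and backtracking down to single characters):

def multireplace_scan(s, mapping):
    longest = max(map(len, mapping))
    out, i = [], 0
    while i < len(s):
        for size in range(min(longest, len(s) - i), 0, -1):
            if s[i:i + size] in mapping:
                out.append(mapping[s[i:i + size]])
                i += size       # skip past the match; never rescan it
                break
        else:
            out.append(s[i])    # no match at this position; keep the character
            i += 1
    return ''.join(out)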
It is an easy function to implement with map or even a list comprehension, and after using Python for years I only recently heard of it. Honestly, I’ve never had a use case for it; usually it is replacing single or multiple chars for some sort of comprehension or encoding. Really an odd function to me, due to its limitations.
It may be an unusual function, but it is actually one of my favourites. I’ve answered many a Caesar Cipher question on Stack Overflow using it, and also one that used it to perform superscripting of integers.
I could point to plenty of functions in Python I haven’t used, like math.lgamma(x). Just because someone doesn’t ever use a function doesn’t make it odd.
str.translate() exists to provide a very fast quick-and-dirty way of defining a character mapping (charmap) codec and calling its .encode() method.
The reason why the mapping is from int (source Unicode ordinal) to int (target Unicode ordinal), str or None was performance in the original implementation. Using this approach, the mapping can be defined as a sequence (using the index position as the ordinal) or as a dictionary.
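For example (my illustration; both forms still work in Python 3, and an out-of-range index leaves the character unchanged):

>>> 'banana'.translate({ord('a'): 'A'})                # dict keyed by ordinal
'bAnAnA'
>>> table = [chr(i) for i in range(ord('a'))] + ['A']  # sequence; index = ordinal
>>> 'banana'.translate(table)
'bAnAnA'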
Python’s stdlib charmap codecs today use a more efficient way of defining charmap codecs based on a decoding table defined as a 256 char str (mapping bytes ordinals via their index position in the sequence to Unicode code points) and a fast lookup table called EncodingMap (a 3-level trie) which is generated from these decoding tables for encoding.
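You can see this from Python (a quick sketch; charmap_build() is undocumented, so treat the call as an implementation detail):

>>> import codecs, encodings.cp1252
>>> enc_map = codecs.charmap_build(encodings.cp1252.decoding_table)
>>> codecs.charmap_encode('abc', 'strict', enc_map)
(b'abc', 3)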
For more complete definitions of charmap codecs, have a look at the modules in the stdlib encodings package (e.g. cp1252.py). Those also allow decoding and are typically defined in terms of a decoding table, rather than an encoding table.
The codec subsystem in Python 2.x (see PEP 100) did not mandate input or output types for codecs. The system was designed to have the codecs define the supported types, in order to have a rich codecs eco-system and allow for composable codecs as well. As such it was easily possible to write codecs going from bytes (str in Py2) to text (unicode in Py2), bytes to bytes, text to text. To provide an easier way to access this functionality, .encode() and .decode() were made available on both str and unicode in Python 2. The term “encode” merely means: take the data and call the .encode() method on the referenced codec, nothing more (or less). Similar for “decode”. However, this generic approach via methods did not catch on and caused too much confusion, so it was dropped in Python 3 on the str (Unicode text in Py3) and bytes (binary data in Py3) types, leaving only the paths str.encode() → bytes and bytes.decode() → str accessible via methods. The more generic interface is still available via codecs.encode() and codecs.decode(), though.
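For instance (a small sketch; these codecs ship with CPython):

>>> import codecs
>>> codecs.encode('hello', 'rot13')                        # text-to-text
'uryyb'
>>> codecs.decode(codecs.encode(b'data', 'zlib'), 'zlib')  # bytes-to-bytes
b'data'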
Given how easy it is to use the fast builtin charmap codec directly (and without registering a complete codec), I’d recommend using this directly via codecs.charmap_encode() in a helper function and in a similar way as is done in the full code page mapping codecs, rather than relying on str.translate().
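If I understand the suggestion correctly, the helper might look something like this (my sketch, mirroring the stdlib code page modules; the function name is made up):

import codecs, encodings.cp1252

def to_cp1252(text, errors='strict'):
    # call the fast builtin charmap codec directly - no codec registration needed
    data, _length = codecs.charmap_encode(text, errors, encodings.cp1252.encoding_table)
    return data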
PS: We should really document the codecs.charmap_build() function used by those codecs.
Just to clarify, if it isn’t already obvious: str.translate() was never meant as a way to go from bytes to Unicode or the other way around (encode/decode). It was always meant to work on the type itself and only change data or remove data during the translation process, following the example of the Unix tr tool.
E.g. to remove certain characters from a string, upper-case a few chars, replace diacritics, etc. - and possibly even all in one operation.
As such, the method still proves useful for quickly normalizing strings or bytes. The main advantage is that you only have to prepare the mapping once and can then reuse it over and over again, with the looping over the data happening in C.
I have a general understanding of how these codecs work, but I don’t see how .translate is intended to relate to them. We established (I thought) that it isn’t a competing option (i.e. it doesn’t go from bytes->unicode or unicode->bytes), and it doesn’t appear to be used to implement the codecs either.
Unsurprising. It was prematurely generalized, in that handling text encodings is a fundamentally different task from compression or working with, say, Base64. Base64 converts between bytes (some file that needs to be encoded in a text medium) and text (ASCII-compatible, but still viewed as text), while text encoding goes the other way: text is the “decoded” form and bytes are the “encoded” form. And of course that layered on top of the confusion caused by all the implicit conversions (thus UnicodeDecodeError from encoding attempts and vice-versa). Honestly, I was always puzzled by 2.x holdouts who used the new text handling as a reason not to switch…
From what I can tell, typical modern uses of str.translate don’t look anything like what would make sense for what is still seen as a text encoding facility. But more importantly, what you say is “easy” is something I wouldn’t know how to do offhand, and I thought I had learned quite a bit in this corner. For that matter, codecs.charmap_encode doesn’t have a docstring and isn’t documented either, never mind charmap_build.
Not really: The implementation of str.translate() uses the same mapping logic as the charmap codec. However, it is geared towards same type conversions (also see below).
When I designed the codec subsystem, I did not want to create something that is limited to just text encodings. Codecs can be lots of things, with the general theme that you convert data into some other format and back again. Accordingly, I focused on a general purpose codec API specification which would allow implementing codecs for e.g. text conversion, compression, encryption, etc.
The main use in Python was handling text encodings to support the Unicode integration, but we also have codecs for e.g. hex representations, base64 representation, punycode, zlib compression, etc. Those are not compatible with the str and bytes methods in Python 3, but they use the same codec API and are available via codecs.encode() and codecs.decode().
Please scratch this part. The exposed charmap encode/decode functions do not allow for bytes → bytes or str → str conversions, so they would not be able to replace str.translate() directly.
That said, it is still very easy to create custom codecs based on the charmap codec, if you need to convert from bytes to str or vice-versa.
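A hedged sketch of what such a custom codec might look like, loosely following the stdlib encodings/cp1252.py pattern (the identity table and the codec name here are placeholders of mine):

import codecs

decoding_table = ''.join(map(chr, range(256)))         # placeholder identity table
encoding_table = codecs.charmap_build(decoding_table)  # undocumented; see above

def _encode(input, errors='strict'):
    return codecs.charmap_encode(input, errors, encoding_table)

def _decode(input, errors='strict'):
    return codecs.charmap_decode(input, errors, decoding_table)

def _search(name):
    if name == 'my_charmap':
        return codecs.CodecInfo(_encode, _decode, name='my_charmap')
    return None

codecs.register(_search)
print(b'caf\xe9'.decode('my_charmap'))  # 'café' with the identity table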