Nicer interface for str.translate

A couple of notes:

  • str.translate() exists to provide a very fast quick-and-dirty way of defining a character mapping (charmap) codec and calling its .encode() method.

  • The reason why the mapping is from int (source Unicode ordinal) to int (target Unicode ordinal), bytes or None was performance in the original implementation. With this approach, the mapping can be defined as a sequence (using the index position as the ordinal) or as a dictionary.

  • Python’s stdlib charmap codecs today use a more efficient way of defining charmap codecs based on a decoding table defined as a 256 char str (mapping bytes ordinals via their index position in the sequence to Unicode code points) and a fast lookup table called EncodingMap (a 3-level trie) which is generated from these decoding tables for encoding.

  • For more complete definitions of charmap codecs, have a look at the modules in the stdlib encodings package (e.g. cp1252.py). Those also allow decoding and are typically defined in terms of a decoding table, rather than an encoding table.

  • The codec subsystem in Python 2.x (see PEP 100) did not mandate input or output types for codecs. The system was designed to have the codecs define the supported types, in order to allow a rich codec ecosystem as well as composable codecs. As such, it was easily possible to write codecs going from bytes (str in Py2) to text (unicode in Py2), bytes to bytes, or text to text. To provide an easier way to access this functionality, .encode() and .decode() were made available on both str and unicode in Python 2. The term “encode” merely means: take the data and call the .encode() method on the referenced codec, nothing more (or less). Similarly for “decode”. However, this generic approach via methods did not catch on and caused too much confusion, so in Python 3 it was dropped from the str (Unicode text in Py3) and bytes (binary data in Py3) types, leaving only the paths str.encode() → bytes and bytes.decode() → str accessible via methods. The more generic interface is still available via codecs.encode() and codecs.decode(), though.
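To make the decoding-table idea above concrete, here is a minimal sketch (no codec registration involved) that calls `codecs.charmap_decode` directly with a 256-character table; the identity table used here is invented for illustration, but it has the same shape as the tables in stdlib modules like `encodings/cp1252.py`:

```python
import codecs

# A 256-char str: the index is the byte value, the character at that
# index is the target code point. This identity table maps each byte
# straight to the code point with the same value.
decoding_table = ''.join(chr(i) for i in range(256))

# charmap_decode returns (decoded_str, number_of_bytes_consumed)
codecs.charmap_decode(b'abc', 'strict', decoding_table)  # → ('abc', 3)
```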

Given how easy it is to use the fast builtin charmap codec directly (and without registering a complete codec), I’d recommend using this directly via codecs.charmap_encode() in a helper function and in a similar way as is done in the full code page mapping codecs, rather than relying on str.translate().
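As a hedged sketch of that helper-function approach: `translate_to_bytes` and its mapping are names and values invented here for illustration, not taken from any stdlib codec. The mapping follows the charmap convention described above (source ordinal to target byte value):

```python
import codecs

# Illustrative mapping: source code point -> target byte value (0-255).
_MAPPING = {ord('a'): 0x41, ord('b'): 0x42}

def translate_to_bytes(text: str) -> bytes:
    # charmap_encode returns (encoded_bytes, number_of_chars_consumed)
    encoded, _consumed = codecs.charmap_encode(text, 'strict', _MAPPING)
    return encoded

translate_to_bytes('ab')  # → b'AB'
```

Note that this goes str → bytes, as a charmap codec always does; it is not a drop-in substitute for same-type translation.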

PS: We should really document the codecs.charmap_build() function used by those codecs.


Just to clarify, if it isn’t already obvious: str.translate() was never meant as a way to go from bytes to Unicode or the other way around (encode/decode). It was always meant to work on the type itself and only change data or remove data during the translation process, following the example of the Unix tr tool.

Eg. to remove certain characters from a string, upper case a few chars, replace diacritics, etc. and even possibly all in one operation.

As such, the method still proves useful for quickly normalizing strings or bytes. The main advantage is that you only have to prepare the mapping once and can then reuse it over and over again, with the loop over the data running in C.
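A small sketch of that pattern (the specific mapping here is made up for illustration): the table is prepared once with `str.maketrans()` and then reused across calls, so each call is a single pass in C:

```python
# Build the table once: fold two accented characters and strip two
# digits, all in one translation pass.
_TABLE = str.maketrans({'é': 'e', 'ö': 'o', '0': None, '1': None})

def normalize(text: str) -> str:
    return text.translate(_TABLE)

normalize('café 01')  # → 'cafe '
normalize('öl')       # → 'ol'
```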


Right… that’s the conclusion I came to…

… but this seems to be saying the opposite?!

I have a general understanding of how these codecs work, but I don’t see how .translate is intended to relate to them. We established (I thought) that it isn’t a competing option (i.e. it doesn’t go from bytes->unicode or unicode->bytes), and it doesn’t appear to be used to implement the codecs either.

Unsurprising. It was prematurely generalized, in that handling text encodings is a fundamentally different task from compression or working with, say, Base64. Fundamentally, Base64 converts between bytes (some file that needs to be encoded in a text medium) and text (ASCII-compatible, but still fundamentally viewed as text). But text encoding is fundamentally a process where text is the “decoded” form and bytes are the “encoded” form, whereas base64 is the opposite. And of course that layered on top of the confusion caused by all the implicit conversion (thus UnicodeDecodeError from encoding attempts and vice-versa). Honestly I was always puzzled by 2.x holdouts who used the new text handling as a reason not to switch…

From what I can tell, typical modern uses of str.translate don’t look anything like what would make sense for what is still seen as a text encoding facility. But more importantly, what you say is “easy” is something I wouldn’t know how to do offhand, and I thought I had learned quite a bit in this corner. For that matter, codecs.charmap_encode doesn’t have a docstring and isn’t documented either, never mind charmap_build.

Not really: The implementation of str.translate() uses the same mapping logic as the charmap codec. However, it is geared towards same type conversions (also see below).

When I designed the codec subsystem, I did not want to create something that is limited to just text encodings. Codecs can be lots of things, with the general theme that you convert data into some other format and back again. Accordingly, I focused on a general purpose codec API specification which would allow implementing codecs for e.g. text conversion, compression, encryption, etc.

The main use in Python was handling text encodings to support the Unicode integration, but we also have codecs for e.g. hex representations, base64 representation, punycode, zlib compression, etc. Those are not compatible with the str and bytes methods in Python 3, but they use the same codec API and are available via codec.encode() and codecs.decode().
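For instance, the generic interface mentioned here can be exercised with the stdlib's bytes-to-bytes codecs (real codec names, behavior as documented):

```python
import codecs

# bytes -> bytes codecs via the generic interface
codecs.encode(b'hello', 'hex')        # → b'68656c6c6f'
codecs.decode(b'68656c6c6f', 'hex')   # → b'hello'

# zlib is likewise bytes -> bytes; neither direction would be
# reachable via the str/bytes methods in Python 3.
round_trip = codecs.decode(codecs.encode(b'data' * 100, 'zlib'), 'zlib')
```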

Please scratch this part. The exposed charmap encode/decode functions do not allow for bytes → bytes or str → str conversions, so they would not be able to replace str.translate() directly.

That said, it is still very easy to create custom codecs based on the charmap codec, if you need to convert from bytes to str or vice-versa.


I think there has been a miscommunication. I am not actually trying, myself, to use str.translate (nor bytes.translate) for the purpose of converting between bytes and str. Instead, the point was to explain, in detail, how I know that it isn’t for that purpose. I took a look into a rather complex history, because I was trying to figure out what people did use it for historically.

But the fact remains that nowadays, actual recorded uses of str.translate in published codebases look more like, e.g., the way it’s used in pathlib:

# Ahead of time, setting up a global:
_SWAP_SEP_AND_NEWLINE = {
    '/': str.maketrans({'/': '\n', '\n': '/'}),
    '\\': str.maketrans({'\\': '\n', '\n': '\\'}),
}

# Later, nested in some other algorithm:
                trans = _SWAP_SEP_AND_NEWLINE[self.pathmod.sep]
                self._lines_cached = path_str.translate(trans)

Or in Django:

_js_escapes = {
    ord("\\"): "\\u005C",
    ord("'"): "\\u0027",
    ord('"'): "\\u0022",
    ord(">"): "\\u003E",
    ord("<"): "\\u003C",
    ord("&"): "\\u0026",
    ord("="): "\\u003D",
    ord("-"): "\\u002D",
    ord(";"): "\\u003B",
    ord("`"): "\\u0060",
    ord("\u2028"): "\\u2028",
    ord("\u2029"): "\\u2029",
}

# Escape every ASCII character with a value less than 32.
_js_escapes.update((ord("%c" % z), "\\u%04X" % z) for z in range(32))

@keep_lazy(SafeString)
def escapejs(value):
    """Hex encode characters for use in JavaScript strings."""
    return mark_safe(str(value).translate(_js_escapes))

Or in localstack:

_rule_replacements = {"-": "_0_"}
# String translation table for #_rule_replacements for str#translate
_rule_replacement_table = str.maketrans(_rule_replacements)

# Later:
    # replace forbidden chars (not allowed in Werkzeug rule variable names) with their placeholder
    escaped_request_uri_variable = request_uri_variable.translate(_rule_replacement_table)

Or in pytorch:

        translate_table = str.maketrans(" ;\t\n", "____")
        with open(path, "w") as f:
            for evt in self:
                if evt.stack and len(evt.stack) > 0:
                    metric_value = getattr(evt, metric)
                    if int(metric_value) > 0:
                        stack_str = ""
                        for entry in reversed(evt.stack):
                            stack_str += entry.translate(translate_table)
                            stack_str += ";"
                        stack_str = stack_str[:-1] + " " + str(int(metric_value))
                        f.write(stack_str + "\n")

Or in rich:

STRIP_CONTROL_CODES: Final = [
    7,  # Bell
    8,  # Backspace
    11,  # Vertical tab
    12,  # Form feed
    13,  # Carriage return
]
_CONTROL_STRIP_TRANSLATE: Final = {
    _codepoint: None for _codepoint in STRIP_CONTROL_CODES
}

# Later:

def strip_control_codes(
    text: str, _translate_table: Dict[int, None] = _CONTROL_STRIP_TRANSLATE
) -> str:
    # ...
    return text.translate(_translate_table)

And these are all interesting and useful things to be doing, and they could all benefit from a nicer interface.

All of these examples look pretty easy to understand and they all use the concept of reusing a mapping table for multiple operations (which is what makes str.translate() fast and useful).

How would your proposed API make these easier to write or more performant?

Rather than just theorizing, as I’ve said previously, I agree with @pf_moore that I should get an implementation working and up on PyPI first - showing works better than telling. So I’ve started on that. But just to explain why I don’t think I’m wasting my time with that :wink: :

Ease of use:

  • The pathlib example could omit the str.maketrans calls completely.
  • The Django example could omit all the ord calls. (Granted, this is a case where maketrans would actually be a better approach; but my interface supports that transparently for the purpose of making the mapping ahead of time, and also supports doing it immediately in the mapped call without needing to prepare the mapping.)
  • The Localstack example could define the _rule_replacement_table directly with the dict, and avoid the extra step.
  • Assuming the translation table is necessarily created ahead of time for performance reasons, the PyTorch example gains no benefit, but also loses nothing.
  • The Rich example could work with characters instead of commented ordinals. Assuming performance is not a concern, it could just pass a string, which would build a mapping on the fly within the mapped call. (This example especially caught my eye because of the bizarre type annotation Dict[int, None] - a lookup table that’s always expected to look up the same value.)
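To show what these ergonomics might look like, here is a rough pure-Python sketch. `mapped` is the name used in this thread, but the signature and semantics here are my assumptions, and the real version would be implemented in C rather than delegating to `str.maketrans()`:

```python
def mapped(text, mapping):
    # Hypothetical semantics: accept plain str keys/values directly,
    # building the ordinal table on the fly (a C implementation would
    # skip this intermediate step).
    return text.translate(str.maketrans(mapping))

# The localstack example, without a separate maketrans step:
mapped('a-b-c', {'-': '_0_'})  # → 'a_0_b_0_c'
```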

I’m sure if I kept looking, I could find examples of str.translate(str.maketrans(...)) inlined like that. I basically just took the first few things I could find, that weren’t bytes.translate or some other translate method, that came from projects that I think have some name recognition.

Performance: every case can gain performance from an algorithm customized to the purpose rather than adapting the charmap implementation routines. Specifically:

  • It doesn’t have to special-case None or ordinal values in the mapping, since they aren’t part of the new spec. It just concatenates whatever string it finds.
  • It can use the fast-path approach for every character with ordinal < 256, not just the ASCII range. It can map them just as well to any empty or single-character output (not just ASCII) while still maintaining the “just index an array and append to a buffer directly” advantages. And it doesn’t have to abandon the fast path as soon as it discovers something non-ASCII.
  • Alternately, it could accept an EncodingMap trie like you mentioned, and delegate to the fast logic for that; and the prepare-ahead-of-time functionality could create the EncodingMap. (Assuming that these can currently map to an arbitrary value?)
  • It can work directly with its own temporary buffer instead of going through Python objects; and it can use its own resizing policy (since it will implicitly trim when the actual string is created). I haven’t actually tried it yet, but my first draft of the C code suggests that quite a bit of inlining and simplification occurs this way.

(My guess is that, on average, it will not be helpful to count the required total length and character width, so as to allocate only once; but I’m willing to at least try it.)

Fair enough.

You may want to test drive some of these ideas using Cython before diving directly into C code.

Some additional notes:

  • sequence lookups are much faster than dictionary lookups
  • tries are great for small static lookup tables, bloom filters great for larger ones
  • Python’s str objects can be resized to avoid copying content (for bytes as well, but those APIs are currently marked private)
  • There are “writer” APIs for both str and bytes, which are great for creating such objects in chunks, but again, these are still marked private.
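The first note applies to str.translate() as it stands today, since the method accepts any object indexable by ordinal: a plain list works as a table, and an out-of-range IndexError (a LookupError) simply leaves the character unchanged. The mapping below is an invented example:

```python
# Identity table for ASCII, as a list indexed by code point; the entry
# for '-' is the only real mapping. Note that None entries would mean
# "delete this character", so the identity entries are required.
table = [chr(i) for i in range(128)]
table[ord('-')] = '_'

'a-b-c é'.translate(table)  # → 'a_b_c é' ('é' is out of range, left as-is)
```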

What exactly is your idea?

  • Make str.translate() accept a dict instead of the value created by str.maketrans()? It already accepts a dict, or any arbitrary mapping.
  • Require keys to be strings? That would be a breaking change.
  • Allow keys to be both integers and strings? That has a performance penalty, because for every character you would need to create both an integer and a string, and do two lookups in the worst case.
  • Make str.translate() accept two or three arguments like str.maketrans() and create the translation mapping on the fly? That is not faster than calling str.maketrans() and then str.translate(), and in cases where performance is not important you can simply combine the two functions. Not every combination of two functions deserves a builtin.
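On the first point, the fact that str.translate() already accepts an arbitrary mapping can be illustrated with a dict subclass whose __missing__ computes entries on demand (the ROT13 mapping here is just an invented demo):

```python
class Rot13Table(dict):
    # str.translate() indexes the table by code point; dict.__missing__
    # lets us compute unmapped entries lazily instead of pre-building
    # the whole table.
    def __missing__(self, cp):
        c = chr(cp)
        if c.isascii() and c.isalpha():
            base = ord('a') if c.islower() else ord('A')
            self[cp] = chr(base + (cp - base + 13) % 26)
            return self[cp]
        return c  # leave everything else unchanged

'Hello'.translate(Rot13Table())  # → 'Uryyb'
```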

I was already aware of this, as I’ve already looked through the C code in some detail. Although I do need to research about bloom filters generally (I can only think of post-processing for 3d rendering, which I’m pretty sure is not relevant).

Add a new method (actually two), as described upthread.

This way (and values as well). It is not a breaking change because it is a new method.

Also the equivalent of this, in the new method. Since a lot of use cases were seen for preparing a lookup table ahead of time, I’m developing separate mapped (for translate) and make_mapping (for maketrans). mapped allows passing extra arguments to build the mapping in-place, or uses a fast path when only a mapping is passed.