Currently in the I/O stack, when TextIOWrapper needs to convert from str to bytes it calls an encoder which encodes the data and returns a bytes object[1]. With UTF-8 mode it is common that the output stream (e.g. sys.stdout) uses UTF-8 encoding and the str (unicodeobject) already contains UTF-8 encoded data, so the only work during encoding is allocating a bytes object and copying the data across[2][3].
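A rough sketch of that path (not the actual _pyio code; the helper name is illustrative): the text layer encodes and then hands a freshly allocated bytes object down to the binary layer.

```python
import io

def text_write(binary: io.BufferedIOBase, s: str) -> int:
    # str.encode() always allocates a new bytes object and copies the
    # character data into it, even when the str already holds UTF-8
    # data internally.
    b = s.encode("utf-8")   # allocation + copy happens here
    binary.write(b)         # the fresh bytes object goes to the buffered layer
    return len(s)

text_write(io.BytesIO(), "héllo 🐍\n")
```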
Proposal
I’d like to add a new method for encoding a str that can return any object supporting the buffer protocol, such as a memoryview of the underlying bytes data, avoiding the bytes allocation and copy. While it is possible to special-case this in TextIOWrapper, I think the need is common enough to be worth having a more optimized option generally available.
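As a sketch of the API shape only (the method name `encode_to_buffer` and its semantics are my assumption, not an existing API), a caller such as TextIOWrapper could then do something like:

```python
import io

class _Str(str):
    # Hypothetical method: a real CPython implementation would return a
    # memoryview over the str's cached UTF-8 data with no copy. This pure
    # Python stand-in still copies, because the internal buffer is not
    # reachable from Python code.
    def encode_to_buffer(self, encoding: str = "utf-8") -> memoryview:
        return memoryview(self.encode(encoding))

def text_write(binary, s) -> int:
    buf = s.encode_to_buffer("utf-8")   # any buffer-protocol object
    binary.write(buf)                   # BufferedWriter/BytesIO accept bytes-like
    return len(s)

text_write(io.BytesIO(), _Str("héllo 🐍\n"))
```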
Prior work
Previous C API proposal: Better API for encoding unicode objects with UTF-8
a. Why this avoids the issue that proposal encountered: this proposal is focused on one specific use case where the copy is a measurable percentage of runtime. Writing Unicode, especially emoji, to sys.stdout is an increasingly common operation.
Without Python UTF-8 mode (PEP 540) having become the default (PEP 686), this would be much more difficult. Thanks for all the work to enable it!
Explored alternatives to adding a new method
Change the .encode() signature to return a bytes-like object. memoryview is sufficiently distinct from bytes that I think this would create a lot of subtly broken code. I have previously attempted a similar change between bytes and bytearray and found a lot of compatibility issues. Type checkers could aid with this migration, but staying compatible would mean adding a bytes() call (or a conditional one), which to me just moves the complexity to every function that calls str.encode; it does not remove it.
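For illustration, assuming .encode() started returning a memoryview, these are the kinds of small breakages existing callers would hit:

```python
data = memoryview("héllo".encode("utf-8"))   # stand-in for a bytes-like return

assert not isinstance(data, bytes)       # isinstance checks against bytes fail
assert not hasattr(data, "decode")       # memoryview has no .decode()
assert not hasattr(data, "startswith")   # ...or the other bytes-only methods
try:
    data + b"\n"                         # concatenation is unsupported
except TypeError:
    pass
copied = bytes(data)                     # works, but re-introduces the copy
```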
Adding a new flag keyword argument. To me this would result in a cleaner API, but I think it would break too many existing .encode functions. The codecs and .encode APIs have been around a long time and have many custom implementations, which are unlikely to handle arbitrary kwargs gracefully.
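For example, a thin codec written against the current signature (the flag name `as_buffer` is purely for illustration) would reject the new keyword outright:

```python
import codecs

class ShoutCodec(codecs.Codec):
    # Typical existing custom codec: accepts only the documented arguments.
    def encode(self, input, errors="strict"):
        return input.upper().encode("utf-8", errors), len(input)

# Passing a new flag through APIs like this raises immediately:
#   ShoutCodec().encode("hi", as_buffer=True)
#   TypeError: encode() got an unexpected keyword argument 'as_buffer'
```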
Include a default implementation of the new method which falls back to the copying version. Low cost, but it makes it harder to see whether a specific encoder has been updated for the new API. getattr with a default provides a simple way to do the fallback instead.
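A minimal sketch of that getattr approach (again using the hypothetical `encode_to_buffer` name): callers probe for the new method and fall back to the existing copying path.

```python
def encode_for_write(encoder, s: str):
    # Prefer the zero-copy method when the encoder provides it; otherwise use
    # the existing API, which allocates and copies a bytes object.
    fast = getattr(encoder, "encode_to_buffer", None)
    if fast is not None:
        return fast(s)          # bytes-like object, ideally no copy
    return encoder.encode(s)    # current path: bytes allocation + copy
```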
It’s not easy to judge proposed options without a concrete implementation. Can you show that using a Py_buffer in C or memoryview in Python is faster than creating a bytes object at each call? The benchmark likely depends on the string length. For example, is it worth it for strings shorter than 100 characters?
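One way to start answering that (a rough sketch only; real numbers should come from pyperf runs against an actual prototype) is to time the existing encode path across string lengths, which bounds how much a buffer-returning API could save:

```python
import timeit

for n in (10, 100, 1_000, 100_000):
    s = "a" * n
    per_call = timeit.timeit(lambda: s.encode("utf-8"), number=100_000) / 100_000
    print(f"len={n:>7}: {per_call * 1e9:7.1f} ns per encode")
```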
The current Unicode object caches UTF-8 data when PyUnicode_AsUTF8AndSize() is called, but this cache is not created when decoding from UTF-8 to generate Unicode.
The best solution in the UTF-8 era is for Unicode objects to use UTF-8 as their internal representation and only cache PEP 393 data when requested for compatibility.
If UTF-8 is the default internal representation, high-quality and fast string processing algorithms written in C, Rust, and other languages can be easily utilized.
However, it would mean PyUnicode_nBYTE_DATA() and PyUnicode_DATA() could return NULL on allocation failure, which can negatively impact code that expects these APIs to never fail.
Planning to gather a CPU profile + runtime traces / benchmarks soon; will update this thread when I have them.
This is the only common copy in the Text I/O stack I don’t currently have a way to resolve (caveats apply / there are some non-copy fast paths today). With Reworking "Buffered I/O" in CPython, the copy of the .write argument into the shared buffer in BufferedWriter will go away unless it’s required (the buffer is mutable/non-frozen and we need to return from write without flushing immediately). Inside TextIOWrapper there’s the encoding copy (this proposal) and a second copy to merge the encoded strings into one final buffer, which is then passed to BufferedIO.write. I’m hoping with the rework I can eliminate most of that / just defer to BufferedIO, which I plan to have keep a list of parts internally, like TextIOWrapper does today for efficiency.
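For context, a rough sketch (not the real CPython code) of that second copy: the text layer accumulates encoded chunks and joins them into one bytes object before handing it to the binary layer.

```python
pending: list[bytes] = []

def queue_encoded(chunk: bytes) -> None:
    # The first copy already happened when the str was encoded into `chunk`.
    pending.append(chunk)

def flush_pending(binary) -> None:
    merged = b"".join(pending)   # second copy: every chunk is copied again here
    pending.clear()
    binary.write(merged)         # the buffered layer may copy once more
```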
All of that hopefully will also mean eliminating locking in many cases, although that depends in part on whether I can use Rust for an experimental I/O stack (_rustio). My hope is to have a fully working _pyio prototype which demonstrates the ideas + efficiency before the end of February. I'm starting this discussion now so that hopefully we can land a zero-copy I/O stack in the 3.15 ship window :).
Really appreciate these references; I'm working on reading through them and digesting what they say. I wasn’t aware that if we start with UTF-8 (e.g. parsed Python source code or .decode('utf-8')), CPython currently always copies the data out of its UTF-8 form and then needs to copy it back later. Resolving that plus this proposal should make a CPython file → text → file round trip dramatically lower overhead.
Keeping the original bytes when parsing UTF-8, if we know they are 1. immutable / already owned by Python internals and 2. correctly normalized/encoded, is something I’d potentially like to incorporate into shipping this change set. I think it would get Python to a really nice place for UTF-8 processing.