Encode `str` (UTF-8 unicode object) without a copy

It’s not easy to judge the proposed options without a concrete implementation. Can you show that using a Py_buffer in C or a memoryview in Python is faster than creating a bytes object on each call? The result likely depends on the string length. For example, is it worth it for strings shorter than 100 characters?
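To make the question concrete, here is a minimal benchmark sketch, not a definitive measurement. It times `str.encode("utf-8")` (which creates a new bytes object on each call) against wrapping an already-encoded bytes object in a `memoryview`. The second case is only a stand-in for a hypothetical zero-copy export, since `str` does not support the buffer protocol today. The helper names `time_encode`, `time_view` and `run_benchmark` are made up for this example.

```python
import timeit


def time_encode(length, number=10_000):
    """Average seconds per call of str.encode('utf-8') on an ASCII string."""
    s = "a" * length
    return timeit.timeit(lambda: s.encode("utf-8"), number=number) / number


def time_view(length, number=10_000):
    """Average seconds per call of memoryview() over pre-encoded bytes.

    Assumption: this approximates the best case of a zero-copy API where
    the UTF-8 data is already available; no such API exists for str.
    """
    cached = ("a" * length).encode("utf-8")
    return timeit.timeit(lambda: memoryview(cached), number=number) / number


def run_benchmark(lengths=(10, 100, 1_000, 10_000), number=10_000):
    """Return {length: (encode_seconds, view_seconds)} for each length."""
    return {n: (time_encode(n, number), time_view(n, number)) for n in lengths}


if __name__ == "__main__":
    for n, (enc, view) in run_benchmark().items():
        print(f"{n:>6} chars: encode {enc * 1e9:8.1f} ns, "
              f"memoryview {view * 1e9:8.1f} ns")
```

Numbers from a script like this vary by machine and Python build, so they are only useful for comparing string lengths on the same setup.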

Explored alternatives to adding a new method

The thread "Better API for encoding unicode objects with UTF-8" proposed:

Proposal 2: Optimize PyUnicode_AsUTF8AndSize.
(…) makes the unicode object cache the bytes object instead of a plain memory block


You may want to have a look at the withdrawn PEP 756 – Add PyUnicode_Export() and PyUnicode_Import() C functions, which proposed adding a PyUnicode_Export() function. The PEP discusses tricky issues around surrogate characters and embedded null characters.
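The two tricky cases can be reproduced from pure Python. This is a small illustration of the behaviour of `str.encode()`, not of anything in the PEP's proposed C API; the variable names are only for this example.

```python
# A lone surrogate is a valid code point in a Python str,
# but strict UTF-8 encoding rejects it.
lone_surrogate = "\udc80"
try:
    lone_surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' error handler encodes it anyway, producing
# bytes that are not well-formed UTF-8.
relaxed = lone_surrogate.encode("utf-8", "surrogatepass")

# An embedded null character encodes fine, but a consumer that treats
# the buffer as a NUL-terminated C string only sees the first part.
with_null = "a\0b".encode("utf-8")
seen_by_c_string = with_null.split(b"\0")[0]

print(strict_ok)         # False
print(relaxed)           # b'\xed\xb2\x80'
print(len(with_null))    # 3
print(seen_by_c_string)  # b'a'
```

Any zero-copy export has to decide what to do in both cases: whether a caller may receive data that is not strictly valid UTF-8, and whether the length is always reported alongside the pointer.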
