Better API for encoding unicode objects with UTF-8

I want to discuss improving the C API for encoding unicode objects to UTF-8.

Related b.p.o. issue: https://bugs.python.org/issue39087

Background

When we want to get a UTF-8-encoded C string from a unicode object, there are two categories of APIs:

a. Returns const char *: PyUnicode_AsUTF8AndSize, PyUnicode_AsUTF8

b. Returns a bytes object: PyUnicode_EncodeUTF8, PyUnicode_AsUTF8String, etc…
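
For reference, the two call shapes look like this (a minimal sketch; error handling omitted):

/* (a) Borrowed buffer: no new object is created, but the call may
   create a UTF-8 cache inside the unicode object. */
Py_ssize_t size;
const char *utf8 = PyUnicode_AsUTF8AndSize(obj, &size);

/* (b) New bytes object: the caller owns the returned reference. */
PyObject *bytes = PyUnicode_AsUTF8String(obj);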

But both categories have drawbacks.

a. PyUnicode_AsUTF8AndSize:

When the unicode object is an ASCII string, or it already has a UTF-8 cache, this API is the most efficient.

But when it has to create the UTF-8 cache, an extra allocation and memcpy are needed. So this API is slower than the (b) APIs. (See here)

Additionally, if the unicode object lives long but is never encoded to UTF-8 again, the cache wastes memory.

b. PyUnicode_EncodeUTF8 and others:

These APIs always create a new bytes object. So when the unicode object is an ASCII string, or already has the UTF-8 cache, they are much slower than the (a) APIs.

When I write an extension module such as a serializer, I use a hack like this (see ujson for a real-world example):

if (PyUnicode_IS_COMPACT_ASCII(obj)) {
    // ASCII: the UTF-8 data is embedded in the object, so
    // PyUnicode_AsUTF8AndSize() is zero-copy and creates no cache.
} else {
    // Non-ASCII: use PyUnicode_AsUTF8String() so we don't create
    // a UTF-8 cache on a possibly long-lived unicode object.
}

But this hack is ugly and doesn't make sense for other Python implementations. So I want a better API that works well on CPython and on other implementations.

Proposal 1: Add PyUnicode_GetUTF8Buffer().

Pull request: https://github.com/python/cpython/pull/17659

This proposal adds a new API: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char *errors, Py_buffer *view).
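
For illustration, an extension might use the proposed API like this (a sketch based on the signature above; write_data() is a hypothetical placeholder for the consumer of the data):

Py_buffer view;
if (PyUnicode_GetUTF8Buffer(obj, "strict", &view) < 0) {
    return NULL;  /* encoding failed */
}
/* view.buf / view.len hold the UTF-8 data; presumably no new bytes
   object is needed when the string is ASCII or already cached. */
write_data(view.buf, view.len);  /* write_data() is a placeholder */
PyBuffer_Release(&view);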

I like this API. But adding a new API always introduces some maintenance cost. And if we change the unicode implementation to a UTF-8-based one like PyPy's in the future, this API may become overkill: PyUnicode_AsUTF8AndSize would then be faster and more efficient than this API on both PyPy and CPython.

Proposal 2: Optimize PyUnicode_AsUTF8AndSize.

Pull request: https://github.com/python/cpython/pull/17683

This proposal optimizes PyUnicode_AsUTF8AndSize to remove the extra allocation and memcpy.

The pull request makes the unicode object cache a bytes object instead of a plain memory block, which removes the extra allocation and memcpy. Alternatively, we may be able to implement a UTF-8 encoder that encodes the unicode object directly into a plain memory block instead of a bytes object, to reduce the bytes object overhead.

This proposal doesn't introduce any new APIs. But if we recommend PyUnicode_AsUTF8AndSize to extension modules, we have to accept the memory used by the UTF-8 cache, at least until we change the unicode implementation to a UTF-8-based one.

For example, orjson always uses PyUnicode_AsUTF8AndSize.
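
With this proposal, the straightforward usage would be just (a minimal sketch; write_data() is again a placeholder):

Py_ssize_t size;
const char *data = PyUnicode_AsUTF8AndSize(obj, &size);
if (data == NULL) {
    return NULL;  /* encoding failed */
}
write_data(data, size);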


Which idea do you like?

Happy holidays!


Filling a Py_buffer is fast, so I don't really believe in the efficiency problem. I'd say do both: proposal 1 and proposal 2 :wink: