PyUnicode_DATA and the stable API

Is there a way to access the underlying data of a PyUnicode object in the stable API? PyUnicode_DATA and the rest aren’t available.

Are they absent because the layout of the data is really an implementation detail, unlike for PyBytes objects where there’s only one “reasonable” layout, namely, as a sequence of bytes?

I ask because it would be nice to be able to read strings while the GIL is released, like I do in the regex module, but without being tied to a specific Python version (and assuming that we’re going to stick with the Flexible String Representation for the foreseeable future).
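
For context, this is roughly the pattern involved, as a sketch using the non-limited API (error handling omitted; on Pythons before 3.12 a PyUnicode_READY() call would also be needed first). None of these macros are available under Py_LIMITED_API:

#include <Python.h>

static void
scan_string(PyObject *unicode)
{
    /* These accessors are not part of the limited API. */
    int kind = PyUnicode_KIND(unicode);           /* 1, 2 or 4 bytes per char */
    const void *data = PyUnicode_DATA(unicode);   /* pointer into the object  */
    Py_ssize_t length = PyUnicode_GET_LENGTH(unicode);

    Py_BEGIN_ALLOW_THREADS
    /* The data pointer stays valid as long as we hold a reference,
       so the characters can be read here with the GIL released. */
    for (Py_ssize_t i = 0; i < length; i++) {
        Py_UCS4 ch = PyUnicode_READ(kind, data, i);
        (void)ch;  /* ... match against the pattern ... */
    }
    Py_END_ALLOW_THREADS
}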

I also have code that accesses the UTF-32 data inside PyUnicode objects when interfacing with C++ code.

That certainly sounds like a good idea. A possible API could look like this:

enum PyUnicodeRepr {
  PyUnicodeRepr_UCS1 = 1,
  PyUnicodeRepr_UCS2 = 2,
  PyUnicodeRepr_UCS4 = 3,
  PyUnicodeRepr_UTF8 = 4,
  // Other values may be added later
};

/// \brief Get the native representation of the unicode object.
///
/// The returned value can fall outside of known PyUnicodeRepr values,
/// for example if called on a more recent Python version than the one
/// that user code was compiled against.
PyUnicodeRepr PyUnicode_GetNativeRepr(PyObject* unicode);

/// \brief Export unicode data with the given representation
///
/// On success, 0 is returned and the data is exported in the `Py_buffer`.
/// On failure, -1 is returned and an exception is set.
///
/// If the requested `repr` is the native representation of `unicode`,
/// this operation is zero-copy.
/// In any case, the buffer must be released once the caller has
/// finished working with it.
int PyUnicode_ExportData(PyObject* unicode, PyUnicodeRepr repr, Py_buffer* out);

(caveat: I’m using an enum for clarity above, but for the stable ABI a plain int would be better)
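
To make the intent concrete, a caller might combine the two functions roughly like this (purely hypothetical, since neither function exists today; it assumes the declarations above and an arbitrary byte-level scan of the exported data):

static Py_ssize_t
count_high_bytes(PyObject *unicode)
{
    Py_buffer view;
    Py_ssize_t count = 0;

    /* Exporting in the native representation makes the export zero-copy. */
    PyUnicodeRepr repr = PyUnicode_GetNativeRepr(unicode);
    if (PyUnicode_ExportData(unicode, repr, &view) < 0) {
        return -1;
    }

    /* view.buf and view.len can be scanned without further C API calls,
       so this loop could also run with the GIL released. */
    const unsigned char *p = (const unsigned char *)view.buf;
    for (Py_ssize_t i = 0; i < view.len; i++) {
        if (p[i] >= 0x80) {
            count++;
        }
    }

    /* The buffer must be released whether or not a copy was made. */
    PyBuffer_Release(&view);
    return count;
}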

Wouldn’t it be better to re-add the tp_as_buffer buffer interface to Unicode objects and then use the standard PyObject_GetBuffer() with the representation set as a flag?

The PyUnicode_GetNativeRepr() API would still be useful in this case, of course, to avoid copying the data.
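
For illustration, that could end up looking something like this (the PyBUF_UCS4 flag is made up for this sketch; it only shows the idea of selecting the representation through the flags argument):

static int
process_as_ucs4(PyObject *unicode)
{
    Py_buffer view;

    /* Hypothetical flag: the flags argument selects the representation,
       and the buffer is zero-copy when it matches the string's native one. */
    if (PyObject_GetBuffer(unicode, &view, PyBUF_SIMPLE | PyBUF_UCS4) < 0) {
        return -1;
    }

    const Py_UCS4 *chars = (const Py_UCS4 *)view.buf;
    Py_ssize_t n = view.len / (Py_ssize_t)sizeof(Py_UCS4);
    for (Py_ssize_t i = 0; i < n; i++) {
        (void)chars[i];  /* ... work on each code point ... */
    }

    PyBuffer_Release(&view);
    return 0;
}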

I think it would be extremely error-prone (what semantics should the user expect when a unicode string is passed to hashlib.sha256?). It would also reintroduce the notion of implicit conversion between bytes-like and unicode that was deliberately removed when transitioning to Python 3.

Not really:

The Unicode type did have a buffer interface in Python 2.x, which always returned the UTF-8 version of the string, so the buffer interface could again default to that encoding.

Since the conversion only goes from Unicode to bytes and uses a fixed encoding by default, there is no confusion. The flags parameter could be used to ask for the different internal representations (UCS1, UCS2, or UCS4) to avoid copying.

BTW: The confusion in Python 2.x only came from converting bytes to Unicode implicitly. Then again, this was needed as a pragmatic approach to get the adoption of Unicode going in Python 2.

Let’s say we disagree on that, then :slight_smile:

Fair enough :slight_smile: