PyUnicode_FromKindAndData memory ownership semantics clarification

kknechtel · July 30, 2023, 6:47am

From the documentation:

PyObject *PyUnicode_FromKindAndData(int kind, const void *buffer, Py_ssize_t size)
Return value: New reference.

Create a new Unicode object with the given kind (possible values are PyUnicode_1BYTE_KIND etc., as returned by PyUnicode_KIND()). The buffer must point to an array of size units of 1, 2 or 4 bytes per character, as given by the kind.

If necessary, the input buffer is copied and transformed into the canonical representation. For example, if the buffer is a UCS4 string (PyUnicode_4BYTE_KIND) and it consists only of codepoints in the UCS1 range, it will be transformed into UCS1 (PyUnicode_1BYTE_KIND).

To be clear, the buffer will always be copied (to memory that is fully owned by the resultant PyUnicode object) even if it isn’t transformed, right?
If not, how am I intended to know whether to free a dynamically allocated buffer after the call? (For that matter, would it mean I can’t use an automatic-storage buffer that doesn’t outlive the PyUnicode object?)

storchaka · July 30, 2023, 7:01am

More correctly, the input buffer is copied and, if necessary, transformed into the canonical representation.
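
A quick way to see the copy in practice is to call the function through ctypes and then overwrite the source buffer. This is only a minimal sketch, assuming a CPython build where `ctypes.pythonapi` exposes `PyUnicode_FromKindAndData`. The names `from_kind_and_data` and `copy_demo` are made up for illustration, and the kind constant is written out by hand from `unicodeobject.h`. It shows the observed behaviour and is not a substitute for the documented guarantee.

```python
import ctypes

# Value of PyUnicode_4BYTE_KIND in CPython's unicodeobject.h (hand-copied assumption).
PyUnicode_4BYTE_KIND = 4

# Bind the C API function; py_object as restype lets ctypes take over the new reference.
from_kind_and_data = ctypes.pythonapi.PyUnicode_FromKindAndData
from_kind_and_data.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_ssize_t]
from_kind_and_data.restype = ctypes.py_object


def copy_demo():
    """Build a str from a UCS4 buffer, then clobber the buffer."""
    # Three UCS4 code units, all within the UCS1 range.
    buf = (ctypes.c_uint32 * 3)(ord("a"), ord("b"), ord("c"))
    s = from_kind_and_data(PyUnicode_4BYTE_KIND, buf, 3)

    # Overwrite the caller-owned buffer after the call.
    for i in range(3):
        buf[i] = ord("x")

    # If the str still aliased buf, it would now read "xxx".
    return s
```

Calling `copy_demo()` returns `'abc'`, so the string does not alias the caller's buffer. The same reasoning applies in C: the buffer can be freed, or allowed to go out of scope, as soon as the call returns.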

MRAB · July 30, 2023, 4:57pm

I hadn’t read the description that closely, so I didn’t notice that its phrasing is misleading. I just assumed, correctly, that it would always copy.

kknechtel · July 30, 2023, 10:38pm

Good to know.

(Maybe this is a Documentation issue, then.)

vstinner · August 25, 2023, 12:13pm

I agree that the doc is misleading; “is copied” should be removed from the doc. Does someone want to propose a PR?

storchaka · August 25, 2023, 12:56pm

Vice versa, “is copied” should stay, but without any “if”.