Inefficiencies in "writer" APIs

We have or are getting “writer” APIs for incrementally creating immutable objects like string, bytes and tuples.

I am concerned that these APIs are inefficient and will thus be ignored by C extension authors, so we will make no progress in moving away from the old, unsafe APIs; we will just accumulate new APIs to maintain. If we expect users to move to the new APIs, they must be close to the old APIs in efficiency and, ideally, easy to port to.

Taking tuples as an example:

Currently to create a tuple of length 2 from a source of objects, the code looks something like this:

PyObject *t = PyTuple_New(2);
// Handle error case
PyObject *o0 = get_object(0);
// Handle error case
PyTuple_SET_ITEM(t, 0, o0);
PyObject *o1 = get_object(1);
// Handle error case
PyTuple_SET_ITEM(t, 1, o1);

with PyTupleWriter this becomes:

PyTupleWriter *w = PyTupleWriter_Create(2);
// Handle error case
PyObject *o0 = get_object(0);
// Handle error case
int err = PyTupleWriter_Add(w, o0);
Py_DECREF(o0);
// Handle error case
PyObject *o1 = get_object(1);
// Handle error case
err = PyTupleWriter_Add(w, o1);
Py_DECREF(o1);
// Handle error case
PyObject *t = PyTupleWriter_Finish(w);
// Handle error case

Not only are there many more calls; there is also more error handling, and an additional heap allocation for the writer.

We could make it more efficient by not requiring a heap allocation for the writer, by stealing the reference to the object (only relevant for PyTupleWriter; the bytes and str versions do not have this issue), and by guaranteeing success if the writer is initialized to the exact size needed.

PyTupleWriter w;
PyTupleWriter_Init(&w, 2);
// Handle error case
PyObject *o0 = get_object(0);
// Handle error case
PyTupleWriter_AddSteal(&w, o0);
PyObject *o1 = get_object(1);
// Handle error case
PyTupleWriter_AddSteal(&w, o1);
PyObject *t = PyTupleWriter_Finish(&w);

It is still more code than the original, but it is close enough that we can reasonably expect authors to move to the new API, and that tooling to do it automatically can be easily made.


How about taking this further by adding a vararg API, which takes an arbitrary number of objects to “write” to the tuple:

PyTupleWriter_FillSteal(&w, o0, o1, o2, o3);

We’ve benchmarked the writer APIs pretty thoroughly, and for the most part, they are equivalent to the existing ones. We just fudge some types around so that it’s not obvious you could cast a half-initialised writer to PyObject* and use it (because, you know, that’ll make you crash).

Varargs APIs are something we want to avoid, as they’re very tied to the ABI and are not easy to use from languages other than C.

The CAPI WG decisions tracker is the usual chokepoint for these design proposals (they also tend to come through Discourse, but apparently not obviously enough that you saw them). You’re welcome to follow that repo and chime in when you see them - it’s just a focal point, it’s not exclusive to the WG. We’ve got a tuple proposal there right now, and the past str and int ones spent a lot of time being discussed both there and here.

(Ah, and I see Victor closed those proposals and opened the one you’re referring to here. FWIW, I don’t think this particular one is worthwhile, so I agree with you, I just thought it was a hypothetical example and not a real one.)


That looks OK, but I’d:

  • Make PyTupleWriter_Init allocate a tuple-sized chunk of heap, which PyTupleWriter_Finish would turn into a PyObject in place; this way we don’t have a struct in the ABI (see the sketch after this list).
  • Require the initial size to be exact. For a growable staging area you’d use a list. (Or is that also less efficient?)
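A minimal sketch of that first bullet, assuming the writer handle is simply the partially built tuple hidden behind an opaque type (all names here are hypothetical, and PyTuple_New stands in for the raw heap chunk, since the real “initialise the header in place” trick needs CPython internals):

typedef struct _tuple_writer TupleWriter;  // opaque: no struct in the ABI

static TupleWriter *
tw_init(Py_ssize_t size)
{
    // The "writer" is really a fresh tuple; the opaque type stops
    // callers from using it as a PyObject* before it is filled in.
    return (TupleWriter *)PyTuple_New(size);
}

static void
tw_add_steal(TupleWriter *w, Py_ssize_t i, PyObject *item)
{
    PyTuple_SET_ITEM((PyObject *)w, i, item);  // steals item; cannot fail
}

static PyObject *
tw_finish(TupleWriter *w)
{
    return (PyObject *)w;  // "turn into a PyObject in place"
}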

But that shouldn’t stop us from having them as well for the benefit of C extension writers, no?


Mark is referring to the issue “C API: Add PyTupleWriter API” that I just proposed. This API is mostly a replacement for _PyTuple_Resize(), for when the input size is not known in advance. For example, see my PR for how PySequence_Tuple() becomes simpler with PyTupleWriter.

When the input size is known, there are other existing safe functions: PyTuple_FromArray() (new! I just added it), PyTuple_Pack(), Py_BuildValue(), etc.
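For illustration, here are those known-size alternatives side by side (error checks omitted; the exact PyTuple_FromArray() signature below is my assumption since the function is brand new, while the other two are long-standing API):

PyObject *a = PyLong_FromLong(1);
PyObject *b = PyLong_FromLong(2);
PyObject *t1 = PyTuple_Pack(2, a, b);        // new references; does not steal a or b
PyObject *t2 = Py_BuildValue("(OO)", a, b);  // likewise
PyObject *items[] = {a, b};
PyObject *t3 = PyTuple_FromArray(items, 2);  // assumed: copies the array, increfs each item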

I ran a micro-benchmark comparing [tuple] to [writer]:

  • [tuple]: PyTuple_New() and PyTuple_SetItem().
  • [writer]: PyTupleWriter_Create(), PyTupleWriter_AddSteal() and PyTupleWriter_Finish().

Results:

  • Tuple of 1 item (worst-case scenario, measuring the overhead of the abstraction): [tuple] 32.4 ns +- 0.1 ns -> [writer] 36.1 ns +- 0.6 ns: 1.11x slower. It’s only 3.7 nanoseconds slower.
  • Tuple of 1024 items: [tuple] 7.09 us +- 0.08 us -> [writer] 7.62 us +- 0.06 us: 1.07x slower. I’m disappointed by these numbers, but I trust benchmarks 🙂

My implementation calls PyTuple_New() and _PyTuple_Resize() internally, so it’s hard to be faster than these functions.

IMO, being between 3.7 ns slower and 1.07x slower on a micro-benchmark is an acceptable trade-off for an abstraction and a safer API.

A PyTupleWriter instance is allocated on the heap, but there is a free list which reduces the cost of the memory allocation and deallocation. I designed the API to be compatible with the stable ABI in the long term; hiding the structure members is required for that. I would also prefer not to have a structure of a fixed size, since it would make the implementation more complicated and less flexible (it would be harder to try other optimizations later).

PyTupleWriter uses a small internal array of 16 items to avoid having to resize small tuples multiple times. It switches to an internal concrete tuple object for 17 items or more. I would prefer not to leak such implementation details into the ABI.
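For illustration only, such a hybrid could be laid out like this (a hypothetical sketch of the strategy, not the actual implementation):

typedef struct {
    PyObject *small[16];  // staging area used while size <= 16
    PyObject *tuple;      // internal over-allocated tuple once size > 16
    Py_ssize_t size;      // number of items written so far
} writer_impl;            // members stay hidden, so nothing leaks into the ABI

// Add: while tuple is NULL, store into small[]; on the 17th item, create an
// over-allocated tuple, move the 16 staged items into it, and append there.
// Finish: either build an exact-size tuple from small[], or resize tuple
// down to the final size.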

I didn’t measure the PyTupleWriter_AddArray() performance; it should be more efficient since it works on an array.


As I wrote in the issue, the problem is that converting a list to a tuple requires a memory copy 🙁 The PyList_AsTuple() and _PyList_AsTupleAndClear() functions copy the items array, but at least the private _PyList_AsTupleAndClear() avoids having to call Py_INCREF/Py_DECREF on each item.
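The reference-moving trick is roughly this (a sketch reconstructed from the description above, not the helper’s real source):

static PyObject *
list_to_tuple_and_clear(PyObject *list)
{
    Py_ssize_t n = PyList_GET_SIZE(list);
    PyObject *t = PyTuple_New(n);
    if (t == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        // Move each reference: the tuple steals what the list owned,
        // so there is no per-item Py_INCREF/Py_DECREF. The pointer
        // array itself is still copied once.
        PyTuple_SET_ITEM(t, i, PyList_GET_ITEM(list, i));
        PyList_SET_ITEM(list, i, NULL);
    }
    Py_SET_SIZE(list, 0);  // the list is left empty
    return t;
}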

@erlendaasland wrote a micro-benchmark to measure that: using a temporary list makes the code slower (1.8x slower on a tuple of 10 items).


Sure, but where possible we want them to be a wrapper around a single API that implements the behaviour. For example, I’d hope that PyTuple_Pack can be implemented as a wrapper around PyTuple_FromArray so that C developers can write PyTuple_Pack(3, a, b, c) but we don’t have to maintain two completely separate functions that do essentially the same thing.
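Something like this sketch, say (the PyTuple_FromArray() signature is assumed, as above):

#include <stdarg.h>

PyObject *
tuple_pack(Py_ssize_t n, ...)
{
    PyObject *small[8];
    PyObject **args = small;
    if (n > 8) {
        args = PyMem_Malloc(n * sizeof(PyObject *));
        if (args == NULL) {
            return PyErr_NoMemory();
        }
    }
    va_list vargs;
    va_start(vargs, n);
    for (Py_ssize_t i = 0; i < n; i++) {
        args[i] = va_arg(vargs, PyObject *);  // collect the varargs into an array
    }
    va_end(vargs);
    PyObject *t = PyTuple_FromArray(args, n);  // single shared implementation
    if (args != small) {
        PyMem_Free(args);
    }
    return t;
}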

The tradeoff between convenience for C-only users, maintenance cost for ourselves, compatibility with other languages, and realistic performance goals leads here, I believe. Unfortunately, we’re constantly arguing about micro-optimizations, which leads to API bloat and extra maintenance for us, while providing some small benefit to C-only users and nothing for anyone else.

To be fair: tuple generation is a very common task when writing C extensions, so I think micro-optimizations are well worth it.

Then again: you would probably only use the writer API for creating larger tuples in loops, where a vararg API would not provide much of a win. Having a way to avoid frequent DECREFs sounds like a more useful approach.

If your bottleneck is getting values in and out of Python-land, you need a more interesting problem to work on 😉

Every time I’ve had to do things along these lines, Py_BuildValue has been just fine (or PyList_Append). Because that’s not where the work or speedup is going to come from - it’s from being able to generate a full dataset in native code (potentially without the GIL, but I’ve gotten >40x speedups before even with the GIL held) and then do the conversion at the end.
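In other words, the pattern is roughly this (compute_everything is a placeholder for the native hot loop, not a real function):

extern void compute_everything(double *out, Py_ssize_t n);  // placeholder

static PyObject *
dataset_to_list(Py_ssize_t n)
{
    double *results = PyMem_Malloc(n * sizeof(double));
    if (results == NULL) {
        return PyErr_NoMemory();
    }
    compute_everything(results, n);  // all the real work, no Python objects yet
    PyObject *out = PyList_New(n);   // one conversion pass at the very end
    if (out == NULL) {
        PyMem_Free(results);
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *v = PyFloat_FromDouble(results[i]);
        if (v == NULL) {
            Py_DECREF(out);
            PyMem_Free(results);
            return NULL;
        }
        PyList_SET_ITEM(out, i, v);  // steals v; valid because out is freshly created
    }
    PyMem_Free(results);
    return out;
}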

It’s easy enough for a user to avoid fancy APIs if they’ve hit the point of diminishing returns, but that doesn’t save us from having to write, test, document and maintain them. Let’s at least try and prove that there’s a need for a more performant API here before we start bloating our maintenance burden.

So does a growable/overallocated staging buffer (unless you guess correctly for preallocation).

In that benchmark, the size is known in advance. I agree that a temporary list isn’t great in that case.

Frankly, no. There are many cases where the performance of converting PyObjects to/from native values is important, and those are not “uninteresting” cases. I’ve worked on two of them (in Numba and PyArrow), but I’m sure NumPy and Pandas developers will feel concerned too.

Is it acceptable to lose a bit of performance on some workloads in exchange for ABI stability and more optimization prospects for CPython? Perhaps. Does that make the problem entirely “uninteresting”? Definitely not.


As a consumer of collection generation downstream in PyO3, I would be pleased to have this PyTupleWriter API (and similar for PyListWriter).

At the moment _PyTuple_Resize is a private API, so we don’t have a good way to directly construct a tuple where the size is not known. The situation is not much better for lists; as far as I know there is PyList_Append, but that requires starting from an empty list if you’re not sure the iterator will fill your preallocation.

Rust iterators have a .size_hint(), but they are not in general required to yield exactly that many items. There is a subset, known as ExactSizeIterator, which promises (non-bindingly) to adhere to the hint; PyO3 currently restricts its APIs for building lists and tuples from Rust iterators to this subset, and carefully checks that they supplied the number of items they hinted so we can be sure the collections are fully initialized.

(So we currently use PyList_New + PyList_SET_ITEM and similar for tuples.)

If the proposed writer APIs supported inexact size in the preallocation hint, this would be perfect for PyO3 to use to build lists and tuples from arbitrary Rust iterators straightforwardly.
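Concretely, with an inexact hint the binding code could reduce to something like this sketch (PyTupleWriter_Discard() is my invention for the error path; only Create/AddSteal/Finish come from the proposal, and AddSteal is assumed to consume its reference even on failure):

static PyObject *
tuple_from_any_iterator(PyObject *iterable, Py_ssize_t size_hint)
{
    PyObject *it = PyObject_GetIter(iterable);
    if (it == NULL) {
        return NULL;
    }
    PyTupleWriter *w = PyTupleWriter_Create(size_hint);  // a hint, not a promise
    if (w == NULL) {
        Py_DECREF(it);
        return NULL;
    }
    PyObject *item;
    while ((item = PyIter_Next(it)) != NULL) {
        if (PyTupleWriter_AddSteal(w, item) < 0) {
            Py_DECREF(it);
            PyTupleWriter_Discard(w);  // hypothetical cleanup call
            return NULL;
        }
    }
    Py_DECREF(it);
    if (PyErr_Occurred()) {            // PyIter_Next failed mid-iteration
        PyTupleWriter_Discard(w);      // hypothetical cleanup call
        return NULL;
    }
    return PyTupleWriter_Finish(w);    // sized to however many items arrived
}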

Regarding microbenchmarks - my view for PyO3 (which might be similar in Cython, pybind11 etc) is that it’s worth optimizing at this level in these binding frameworks, because the impact of such optimizations can be multiplied across the ecosystem.

So my ideal would be to have a combination of what Mark originally proposed (including stealing) with Petr’s suggestion to avoid the struct in the ABI. (If you hide the details of the allocation created by the writer _Init function, it might be possible in the future to use the same implementation for both lists and tuples.)


Varargs can be surprisingly inefficient, and as Steve points out, messy from an ABI point of view.

A bulk adder taking an array and a size is much simpler from an ABI perspective and probably as fast or faster.
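For example, using the PyTupleWriter_AddArray() mentioned earlier in the thread (exact signature assumed), the vararg one-liner becomes:

PyObject *items[] = {o0, o1, o2, o3};
if (PyTupleWriter_AddArray(w, items, 4) < 0) {
    // Handle error case
}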

I don’t see how an API that performs three heap operations and N extra increfs and decrefs can be as efficient. The writer APIs just do a lot more work than is necessary. How did you benchmark these APIs?

I’m talking about the writers for str and bytes objects as well. The same arguments apply there, possibly more so for simple operations (if formatting is involved, it will likely dominate).

~10% slower isn’t bad, but why accept it when we can do better with an API that is IMO just as usable, if not more so?

I agree with your goals of better abstractions and a safer API, but we only achieve those goals if the API is actually used, and it won’t be used if it is perceived to be slow.

What is wrong with having a struct in the ABI? It just means we can’t change its size or alignment. The ABI already has ints, longs and pointers in it. What is the difference between requiring a 4- or 8-byte chunk of memory and a 32-byte chunk of memory?

Something like:

struct _writer {
    // Four pointers' worth of opaque storage, aligned to two pointers.
    // (alignas requires C11's <stdalign.h>, or C23/C++11 natively.)
    alignas(sizeof(void*)*2) void *dont_use[4];
};

is flexible enough without needing changes to its size or alignment.


(post deleted by author)

In short, yes, it works as you expect.

In PyTupleWriter_Create(n), n is just a hint used to preallocate a buffer. The writer then tracks the real size as PyTupleWriter_Add() is called, and PyTupleWriter_Finish() resizes the internal buffer to the exact size at the end. The n passed to PyTupleWriter_Create() can be smaller or larger than the final size; it doesn’t matter.
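So, for instance, a hint of 2 with five items added still yields a 5-tuple (a sketch using the functions named in this thread):

PyTupleWriter *w = PyTupleWriter_Create(2);  // hint only; the real size will be 5
// Handle error case
for (long i = 0; i < 5; i++) {
    PyObject *o = PyLong_FromLong(i);
    // Handle error case
    if (PyTupleWriter_AddSteal(w, o) < 0) {
        // Handle error case
    }
}
PyObject *t = PyTupleWriter_Finish(w);  // t is (0, 1, 2, 3, 4), length 5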
