Inefficiencies in "writer" APIs

vstinner · October 10, 2025, 2:47pm

UPDATE: There was a bug in my benchmark. I fixed it and reran the benchmark. Now the writer is faster instead of slower for tuple-1000.

Mark is referring to the issue "C API: Add PyTupleWriter API" that I just proposed. This API is mostly a replacement for _PyTuple_Resize(), for the case where the input size is not known in advance. For example, look at my PR to see how PySequence_Tuple() becomes simpler with PyTupleWriter.

When the input size is known, there are other existing safe functions: PyTuple_FromArray() (new! I just added it), PyTuple_Pack(), Py_BuildValue(), etc.

I ran a micro-benchmark comparing [tuple] to [writer]:

[tuple]: PyTuple_New() and PyTuple_SetItem().
[writer]: PyTupleWriter_Create(), PyTupleWriter_AddSteal() and PyTupleWriter_Finish().

Benchmark	tuple	writer
tuple-1	37.4 ns	41.3 ns: 1.10x slower
tuple-5	65.7 ns	68.8 ns: 1.05x slower
tuple-10	99.9 ns	102 ns: 1.02x slower
tuple-100	800 ns	762 ns: 1.05x faster
tuple-1000	7.68 us	7.28 us: 1.05x faster
Geometric mean	(ref)	1.01x slower

tuple-1 is the worst-case scenario (it measures the overhead of the abstraction): it’s only 3.9 nanoseconds slower.
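For anyone who wants to double-check the summary row, here is a small sketch that recomputes it from the table. It assumes the geometric mean is taken over the per-row writer/tuple ratios; the function names are made up for this example, and the n-th root uses Newton's method only to avoid a dependency on libm.

```c
#include <assert.h>
#include <stddef.h>

/* Timings copied from the table; both columns of a row share the same unit
 * (ns for the first four rows, us for tuple-1000), so ratios are unit-free. */
#define N_BENCH 5
static const double tuple_time[N_BENCH]  = {37.4, 65.7, 99.9, 800.0, 7.68};
static const double writer_time[N_BENCH] = {41.3, 68.8, 102.0, 762.0, 7.28};

/* n-th root of a positive value using Newton's method. */
static double nth_root(double a, int n)
{
    assert(a > 0.0 && n >= 1);
    double x = 1.0;
    for (int iter = 0; iter < 100; iter++) {
        double p = 1.0;
        for (int i = 0; i < n - 1; i++) {
            p *= x;                      /* p = x^(n-1) */
        }
        x = ((n - 1) * x + a / p) / n;
    }
    return x;
}

/* Geometric mean of writer/tuple ratios; > 1.0 means the writer is slower. */
double writer_slowdown_geomean(void)
{
    double product = 1.0;
    for (size_t i = 0; i < N_BENCH; i++) {
        product *= writer_time[i] / tuple_time[i];
    }
    return nth_root(product, N_BENCH);
}

/* Absolute overhead in nanoseconds for the tuple-1 row. */
double tuple1_overhead_ns(void)
{
    return writer_time[0] - tuple_time[0];
}
```

With these inputs the geometric mean comes out slightly above 1.0, which is consistent with the "1.01x slower" row.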

My implementation calls PyTuple_New() and _PyTuple_Resize() internally, so it’s hard to be faster than these functions.

IMO, a range from 1.10x slower (3.9 ns) to 1.05x faster on a micro-benchmark is an acceptable trade-off for an abstraction and a safer API.

A PyTupleWriter instance is allocated on the heap, but there is a free list which reduces the cost of the memory allocation and deallocation. I designed the API to be compatible with the stable ABI in the long term. Hiding the structure members is required for that. I would also prefer not to have a structure of a fixed size, since it would make the implementation more complicated and less flexible (it would be harder to try other optimizations later).
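To illustrate the free list idea in isolation, here is a minimal sketch. ToyWriter, FREELIST_MAX and the toy_* functions are hypothetical names for this example, not the PyTupleWriter implementation, and the sketch ignores thread safety. Callers only ever hold a pointer, which is what lets the structure layout change later.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an opaque writer. In a real API the struct
 * definition would live in the .c file and callers would see only the
 * pointer type. */
typedef struct ToyWriter {
    struct ToyWriter *next_free;  /* link used only while on the free list */
    size_t size;
} ToyWriter;

#define FREELIST_MAX 4            /* arbitrary cap chosen for this sketch */
static ToyWriter *freelist_head = NULL;
static size_t freelist_len = 0;

/* Reuse a cached writer when one is available, otherwise call malloc(). */
ToyWriter *toy_writer_create(void)
{
    ToyWriter *w;
    if (freelist_head != NULL) {
        w = freelist_head;
        freelist_head = w->next_free;
        freelist_len--;
    }
    else {
        w = malloc(sizeof(ToyWriter));
        if (w == NULL) {
            return NULL;
        }
    }
    w->next_free = NULL;
    w->size = 0;
    return w;
}

/* Cache the writer for reuse, or free it once the list is full. */
void toy_writer_discard(ToyWriter *w)
{
    assert(w != NULL);
    if (freelist_len < FREELIST_MAX) {
        w->next_free = freelist_head;
        freelist_head = w;
        freelist_len++;
    }
    else {
        free(w);
    }
}

size_t toy_freelist_len(void)
{
    return freelist_len;
}
```

In a create/discard/create sequence, the second create returns the same memory block without touching the allocator, which is where the saving comes from.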

PyTupleWriter uses a small array of 16 items to avoid having to resize small tuples multiple times. It switches to an internal concrete tuple object for 17 items or more. I would prefer not to leak such implementation details in the ABI.
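The "small array first, then switch" strategy can be sketched with a toy writer of ints. This is only a model: IntWriter and the int_writer_* functions are hypothetical, the heap buffer stands in for the internal tuple object, and the doubling growth policy is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_CAPACITY 16   /* same threshold as the 16-item small array */

/* Hypothetical toy writer for ints; not the PyTupleWriter layout.
 * It must not be copied by value because items may point into small[]. */
typedef struct {
    int small[SMALL_CAPACITY];  /* inline storage for the first 16 items */
    int *items;                 /* points at small[] or at a heap buffer */
    size_t size;
    size_t capacity;
} IntWriter;

void int_writer_init(IntWriter *w)
{
    w->items = w->small;
    w->size = 0;
    w->capacity = SMALL_CAPACITY;
}

int int_writer_uses_heap(const IntWriter *w)
{
    return w->items != w->small;
}

/* Append one item; returns 0 on success, -1 on allocation failure. */
int int_writer_add(IntWriter *w, int value)
{
    assert(w != NULL);
    if (w->size == w->capacity) {
        size_t new_capacity = w->capacity * 2;
        int *buf;
        if (int_writer_uses_heap(w)) {
            /* Already on the heap: grow the existing buffer. */
            buf = realloc(w->items, new_capacity * sizeof(int));
            if (buf == NULL) {
                return -1;
            }
        }
        else {
            /* 17th item: move from the inline array to the heap. */
            buf = malloc(new_capacity * sizeof(int));
            if (buf == NULL) {
                return -1;
            }
            memcpy(buf, w->small, w->size * sizeof(int));
        }
        w->items = buf;
        w->capacity = new_capacity;
    }
    w->items[w->size++] = value;
    return 0;
}

/* Release heap storage, if any, and reset to the inline array. */
void int_writer_clear(IntWriter *w)
{
    if (int_writer_uses_heap(w)) {
        free(w->items);
    }
    int_writer_init(w);
}
```

Adding up to 16 items never allocates. The first allocation happens on the 17th item, so short sequences pay no resize cost at all.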

I didn’t measure the performance of PyTupleWriter_AddArray(), but it should be more efficient since it works on an array.