Inefficiencies in "writer" APIs

Why would they be? I understand that non-tiny tuples are non-existent in most applications, but they could still be relevant for some of them.

IMO, this topic is full of strawmen and attacks on them – starting with the non-existent PyTupleWriter in the original post. Apples-to-oranges benchmarks don’t help either.
Can we get back on track somehow?
What’s the purpose of this discussion – designing new API for use cases we’re missing, or criticizing what’s already there?

Looks like performance-wise there are wildly different use cases: 2, 16, or 100 items; with the size known/approximated/unknown in advance; with things like error handling for get_object in the OP.
Strings are even ickier, with their different storage representations, and tradeoffs involved in being fast with CPython’s current internals vs. allowing future changes.

Well, it depends on your use case. @emmatyping wrote a nice optimization to replace a list of bytes objects with PyByteWriter in zlib and it makes the code 10-30% faster: PR gh-139976.


Ok, here is a benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal():

Benchmark        tuple      writer
tuple-1          33.0 ns    42.4 ns: 1.28x slower
tuple-5          51.4 ns    70.3 ns: 1.37x slower
tuple-10         74.1 ns    105 ns: 1.42x slower
tuple-100        567 ns     802 ns: 1.41x slower
tuple-1000       5.43 us    7.71 us: 1.42x slower
Geometric mean   (ref)      1.38x slower

Well, PyTuple_SET_ITEM() is a macro that modifies the tuple directly, whereas PyTupleWriter_AddSteal() is a function call with additional checks, so of course it’s slower :slight_smile:

Notes: PyTuple_SetItem() takes ownership of its argument, so there is no additional Py_INCREF/DECREF. PyTuple_SET_ITEM() is not available in the limited C API.
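For readers less familiar with the distinction: the ownership note above can be sketched with a toy refcounted object. This is a self-contained illustration of "stealing" vs. "borrowing" references, not CPython’s actual implementation; all names here (Obj, set_steal, set_borrow) are made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted object -- an illustration only, not CPython's Py_REFCNT. */
typedef struct { int refcnt; } Obj;

Obj *obj_new(void) { Obj *o = malloc(sizeof *o); o->refcnt = 1; return o; }
void obj_incref(Obj *o) { o->refcnt++; }
void obj_decref(Obj *o) { if (--o->refcnt == 0) free(o); }

/* Stealing setter (like PyTuple_SetItem): the container takes over the
 * caller's reference, so no extra incref/decref pair is needed. */
void set_steal(Obj **slot, Obj *o) { *slot = o; }

/* Non-stealing setter (like PyList_Append): the container takes its own
 * reference, and the caller still owns -- and must release -- its own. */
void set_borrow(Obj **slot, Obj *o) { obj_incref(o); *slot = o; }
```

With the stealing flavor, the refcount never moves: the single reference just changes hands, which is why a stealing writer API can avoid a refcount round-trip per item.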

My PR modifies PySequence_Tuple() to use PyTupleWriter. I ran a benchmark to measure the impact on performance.

I don’t think that it’s misleading: it compares one strategy (using an array-8 + a list) to the writer strategy (using an array-16 + a tuple). It gives an idea of whether the writer strategy is relevant or not.

You’re right about PyTuple_SetItem() taking ownership. My mistake.

I used PyTupleWriter as an example, because it has the simplest API, but the writers for strings and bytes share the same flaw: unnecessary heap allocation.

I’m in favour of adding the writer APIs, because they allow us to change the internal representation, but if we are to expect people to use them, we should make them as efficient as possible.

I’ll redo my initial example using the existing PyBytesWriter API.
Old style:

PyObject *b = PyBytes_FromStringAndSize(NULL, 20);
// Handle error case
char *ptr = PyBytes_AsString(b);
memcpy(ptr, text, 20);

With PyBytesWriter:

PyBytesWriter *w = PyBytesWriter_Create(20);
// Handle error case
void *ptr = PyBytesWriter_GetData(w);
memcpy(ptr, text, 20);
b = PyBytesWriter_Finish(w);
// Handle error case

There is no need to heap allocate the writer.
We should also guarantee success for writers with the correctly allocated size, although this is less important here than in the tuple case.
Like this:

PyBytesWriter w;
PyBytesWriter_Init(&w, 20);
// Handle error case
PyBytesWriter_WriteBytes(&w, text, 20);
b = PyBytesWriter_Finish(&w);
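The stack-based pattern proposed above can be sketched with a self-contained toy writer. Everything here is hypothetical (ByteWriter, bw_init, bw_write, bw_finish, and the 64-byte inline buffer are my inventions, not CPython’s API); the point is just that the writer object itself lives in the caller’s frame, and writes within the preallocated capacity cannot fail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stack-allocatable byte writer -- a sketch of the proposed pattern,
 * not CPython's PyBytesWriter. */
typedef struct {
    char   small[64];  /* inline buffer for the common small case */
    char  *data;       /* points into small[] or a heap block     */
    size_t len, cap;
} ByteWriter;

int bw_init(ByteWriter *w, size_t size_hint) {
    w->len = 0;
    if (size_hint <= sizeof w->small) {
        w->data = w->small;            /* no heap allocation at all */
        w->cap = sizeof w->small;
        return 0;
    }
    w->data = malloc(size_hint);
    if (w->data == NULL) return -1;
    w->cap = size_hint;
    return 0;
}

/* Cannot fail while the write fits in the preallocated capacity,
 * mirroring the "guarantee success for correctly sized writers" idea. */
int bw_write(ByteWriter *w, const void *p, size_t n) {
    if (w->len + n > w->cap) return -1;  /* toy: no growth */
    memcpy(w->data + w->len, p, n);
    w->len += n;
    return 0;
}

/* Finish: hand back an owned, NUL-terminated copy; release any heap buffer. */
char *bw_finish(ByteWriter *w, size_t *len) {
    char *out = malloc(w->len + 1);
    if (out != NULL) {
        memcpy(out, w->data, w->len);
        out[w->len] = '\0';
        *len = w->len;
    }
    if (w->data != w->small) free(w->data);
    return out;
}
```

For a 20-byte result, this version touches the heap only once (in bw_finish), versus twice for a heap-allocated writer.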

IMO, this is fine at the API level. It does want an optimization where PyBytesWriter_Finish doesn’t allocate, but reuses the writer’s memory.
But it turns out it was fast enough without that optimization.
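The finish-without-allocating optimization mentioned above can be sketched in the same toy style (names are hypothetical, not CPython’s): when the data already lives in a heap buffer, finishing can simply transfer ownership of that buffer instead of allocating a new one and copying.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy writer whose finish step reuses the existing heap buffer instead of
 * allocating and copying -- a sketch of the optimization discussed. */
typedef struct { char *data; size_t len, cap; } HeapWriter;

int hw_init(HeapWriter *w, size_t cap) {
    w->data = malloc(cap);
    w->len = 0;
    w->cap = cap;
    return w->data ? 0 : -1;
}

int hw_write(HeapWriter *w, const void *p, size_t n) {
    if (w->len + n > w->cap) return -1;
    memcpy(w->data + w->len, p, n);
    w->len += n;
    return 0;
}

/* No allocation, no copy: ownership of the buffer moves to the caller. */
char *hw_finish_reuse(HeapWriter *w, size_t *len) {
    char *out = w->data;
    *len = w->len;
    w->data = NULL;   /* the writer no longer owns the memory */
    return out;
}
```

In CPython the real finish step would also have to produce a proper object header, so this is only an analogy for where the allocation can be saved.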


What does “fast enough” mean? Who determines what is fast enough?
Why not make it as fast as we can (within the constraints of keeping a portable, usable API)?


I wrote PR gh-140129 which allocates PyTupleWriter on the stack instead of the heap.

Benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal() where heap is PR gh-139891 and stack is this PR gh-140129.

Benchmark        tuple      heap                    stack
tuple-1          31.3 ns    42.3 ns: 1.35x slower   39.1 ns: 1.25x slower
tuple-5          51.0 ns    70.4 ns: 1.38x slower   65.2 ns: 1.28x slower
tuple-10         74.6 ns    106 ns: 1.41x slower    96.7 ns: 1.30x slower
tuple-100        567 ns     803 ns: 1.42x slower    754 ns: 1.33x slower
tuple-1000       5.52 us    7.73 us: 1.40x slower   7.33 us: 1.33x slower
Geometric mean   (ref)      1.39x slower            1.30x slower

Comparison of stack to heap:

Benchmark        heap       stack
tuple-1          42.3 ns    39.1 ns: 1.08x faster
tuple-5          70.4 ns    65.2 ns: 1.08x faster
tuple-10         106 ns     96.7 ns: 1.09x faster
tuple-100        803 ns     754 ns: 1.07x faster
tuple-1000       7.73 us    7.33 us: 1.05x faster
Geometric mean   (ref)      1.07x faster

So yeah, the stack memory flavor is faster. I still prefer the heap flavor since it gives more freedom to change the PyTupleWriter structure in the future :slight_smile: (no need to hardcode the structure size in the ABI)
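The ABI trade-off being weighed here can be sketched in plain C (names are hypothetical, not CPython’s): an opaque type can only be created through a factory function (heap), but the library remains free to change its layout; a stack-allocatable type must expose its full definition, so its size becomes frozen into the ABI.

```c
#include <assert.h>
#include <stdlib.h>

/* --- Heap flavor: opaque handle --------------------------------------
 * Callers never see the struct layout, so the library can add or reorder
 * fields in a future version without breaking compiled extensions. */
typedef struct OpaqueWriter OpaqueWriter;   /* size unknown to callers */

OpaqueWriter *ow_create(void);
void ow_destroy(OpaqueWriter *w);

/* Library internals -- free to change between versions: */
struct OpaqueWriter { int version; char scratch[32]; };
OpaqueWriter *ow_create(void) { return calloc(1, sizeof(OpaqueWriter)); }
void ow_destroy(OpaqueWriter *w) { free(w); }

/* --- Stack flavor: exposed layout ------------------------------------
 * Callers can declare `StackWriter w;` locally (no heap allocation), but
 * sizeof(StackWriter) is now part of the ABI contract: adding a field
 * later would break every extension compiled against the old header. */
typedef struct { int version; char scratch[32]; } StackWriter;
```

This is exactly the freedom-vs-speed trade the post above describes: the opaque flavor pays one heap allocation per writer to keep the layout private.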

As a data point from practical use of tuple creation:

Most uses in my code are for small tuples (2-5 items) stored in lists, used e.g. to represent parsing output, or for tuples representing database query rows (mostly in the range of 2-50 items; only a few going up to 100 or 200), also stored in lists.

The tuples are usually all the same length in those lists.

I use PyTuple_SET_ITEM() for most of those use cases. I would not want to switch to an API which is 30-40% slower.

As I wrote before, PyTupleWriter is for code which currently uses _PyTuple_Resize(): when the input length is not known in advance.
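That "length unknown in advance" pattern can be sketched with a self-contained toy writer over ints (IntWriter and its functions are my inventions, not CPython’s API): overallocate geometrically while consuming input of unknown length, then shrink to fit once, which is roughly what PySequence_Tuple does today with _PyTuple_Resize.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy writer for the "length unknown in advance" use case --
 * an illustration of the growth strategy, not CPython's PyTupleWriter. */
typedef struct { int *items; size_t len, cap; } IntWriter;

int iw_add(IntWriter *w, int v) {
    if (w->len == w->cap) {
        /* Grow geometrically so N appends cost O(N) amortized. */
        size_t ncap = w->cap ? w->cap * 2 : 8;
        int *p = realloc(w->items, ncap * sizeof *p);
        if (p == NULL) return -1;
        w->items = p;
        w->cap = ncap;
    }
    w->items[w->len++] = v;
    return 0;
}

int *iw_finish(IntWriter *w, size_t *len) {
    /* Shrink to fit once at the end, like a final _PyTuple_Resize. */
    int *p = realloc(w->items, (w->len ? w->len : 1) * sizeof *p);
    *len = w->len;
    return p ? p : w->items;
}
```

The writer API bundles exactly this bookkeeping (capacity tracking, growth, final shrink), which is why it is aimed at the resize-based code paths rather than at fixed-size PyTuple_SET_ITEM loops.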