Inefficiencies in "writer" APIs

Why would they be? I understand that non-tiny tuples are non-existent in most applications, but they could still be relevant for some of them.

IMO, this topic is full of strawmen and attacks on them – starting with the non-existent PyTupleWriter in the original post. Apples-to-oranges benchmarks don’t help either.
Can we get back on track somehow?
What’s the purpose of this discussion – designing new API for use cases we’re missing, or criticizing what’s already there?

Looks like performance-wise there are wildly different use cases: 2, 16, or 100 items; with the size known/approximated/unknown in advance; with things like error handling for get_object in the OP.
Strings are even ickier, with their different storage representations, and tradeoffs involved in being fast with CPython’s current internals vs. allowing future changes.

Well, it depends on your use case. @emmatyping wrote a nice optimization to replace a list of bytes objects with PyByteWriter in zlib and it makes the code 10-30% faster: PR gh-139976.


Ok, here is a benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal():

Benchmark        tuple      writer
tuple-1          33.0 ns    42.4 ns: 1.28x slower
tuple-5          51.4 ns    70.3 ns: 1.37x slower
tuple-10         74.1 ns    105 ns: 1.42x slower
tuple-100        567 ns     802 ns: 1.41x slower
tuple-1000       5.43 us    7.71 us: 1.42x slower
Geometric mean   (ref)      1.38x slower

Well, PyTuple_SET_ITEM() is a macro that modifies the tuple directly, whereas PyTupleWriter_AddSteal() is a function call with additional checks, so of course it’s slower :slight_smile:

Notes: PyTuple_SetItem() takes ownership of its argument, so there is no additional Py_INCREF/DECREF. PyTuple_SET_ITEM() is not available in the limited C API.
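For readers less familiar with the distinction: the ownership note above can be sketched with a toy refcounted object. This is a self-contained illustration of "stealing" vs. "borrowing" references, not CPython’s actual implementation; all names here (Obj, set_steal, set_borrow) are made up.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted object -- an illustration only, not CPython's Py_REFCNT. */
typedef struct { int refcnt; } Obj;

Obj *obj_new(void) { Obj *o = malloc(sizeof *o); o->refcnt = 1; return o; }
void obj_incref(Obj *o) { o->refcnt++; }
void obj_decref(Obj *o) { if (--o->refcnt == 0) free(o); }

/* Stealing setter (like PyTuple_SetItem): the container takes over the
 * caller's reference, so no extra incref/decref pair is needed. */
void set_steal(Obj **slot, Obj *o) { *slot = o; }

/* Non-stealing setter (like PyList_Append): the container takes its own
 * reference, and the caller still owns -- and must release -- its own. */
void set_borrow(Obj **slot, Obj *o) { obj_incref(o); *slot = o; }
```

With the stealing flavor, the refcount never moves: the single reference just changes hands, which is why a stealing writer API can avoid a refcount round-trip per item.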

My PR modifies PySequence_Tuple() to use PyTupleWriter. I ran a benchmark to measure the impact on performance.

I don’t think that it’s misleading: it compares one strategy (using an array-8 + a list) to the writer strategy (using an array-16 + a tuple). It gives an idea of whether the writer strategy is relevant or not.

You’re right about PyTuple_SetItem() taking ownership. My mistake.

I used PyTupleWriter as an example, because it has the simplest API, but the writers for strings and bytes share the same flaw: unnecessary heap allocation.

I’m in favour of adding the writer APIs, because they allow us to change the internal representation, but if we are to expect people to use them, we should make them as efficient as possible.

I’ll redo my initial example using the existing PyBytesWriter API.
Old style:

PyObject *b = PyBytes_FromStringAndSize(NULL, 20);
// Handle error case
char *ptr = PyBytes_AsString(b);
memcpy(ptr, text, 20);

With PyBytesWriter:

PyBytesWriter *w = PyBytesWriter_Create(20);
// Handle error case
void *ptr = PyBytesWriter_GetData(w);
memcpy(ptr, text, 20);
b = PyBytesWriter_Finish(w);
// Handle error case

There is no need to heap allocate the writer.
We should also guarantee success for writers with the correctly allocated size, although this is less important here than in the tuple case.
Like this:

PyBytesWriter w;
PyBytesWriter_Init(&w, 20);
// Handle error case
PyBytesWriter_WriteBytes(&w, text, 20);
b = PyBytesWriter_Finish(&w);
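The stack-based pattern proposed above can be sketched with a self-contained toy writer. Everything here is hypothetical (ByteWriter, bw_init, bw_write, bw_finish, and the 64-byte inline buffer are my inventions, not CPython’s API); the point is just that the writer object itself lives in the caller’s frame, and writes within the preallocated capacity cannot fail.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stack-allocatable byte writer -- a sketch of the proposed pattern,
 * not CPython's PyBytesWriter. */
typedef struct {
    char   small[64];  /* inline buffer for the common small case */
    char  *data;       /* points into small[] or a heap block     */
    size_t len, cap;
} ByteWriter;

int bw_init(ByteWriter *w, size_t size_hint) {
    w->len = 0;
    if (size_hint <= sizeof w->small) {
        w->data = w->small;            /* no heap allocation at all */
        w->cap = sizeof w->small;
        return 0;
    }
    w->data = malloc(size_hint);
    if (w->data == NULL) return -1;
    w->cap = size_hint;
    return 0;
}

/* Cannot fail while the write fits in the preallocated capacity,
 * mirroring the "guarantee success for correctly sized writers" idea. */
int bw_write(ByteWriter *w, const void *p, size_t n) {
    if (w->len + n > w->cap) return -1;  /* toy: no growth */
    memcpy(w->data + w->len, p, n);
    w->len += n;
    return 0;
}

/* Finish: hand back an owned, NUL-terminated copy; release any heap buffer. */
char *bw_finish(ByteWriter *w, size_t *len) {
    char *out = malloc(w->len + 1);
    if (out != NULL) {
        memcpy(out, w->data, w->len);
        out[w->len] = '\0';
        *len = w->len;
    }
    if (w->data != w->small) free(w->data);
    return out;
}
```

For a 20-byte result, this version touches the heap only once (in bw_finish), versus twice for a heap-allocated writer.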

IMO, this is fine at the API level. It does want an optimization where PyBytesWriter_Finish doesn’t allocate, but reuses the writer’s memory.
But it turns out it was fast enough without that optimization.
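The finish-without-allocating optimization mentioned above can be sketched in the same toy style (names are hypothetical, not CPython’s): when the data already lives in a heap buffer, finishing can simply transfer ownership of that buffer instead of allocating a new one and copying.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy writer whose finish step reuses the existing heap buffer instead of
 * allocating and copying -- a sketch of the optimization discussed. */
typedef struct { char *data; size_t len, cap; } HeapWriter;

int hw_init(HeapWriter *w, size_t cap) {
    w->data = malloc(cap);
    w->len = 0;
    w->cap = cap;
    return w->data ? 0 : -1;
}

int hw_write(HeapWriter *w, const void *p, size_t n) {
    if (w->len + n > w->cap) return -1;
    memcpy(w->data + w->len, p, n);
    w->len += n;
    return 0;
}

/* No allocation, no copy: ownership of the buffer moves to the caller. */
char *hw_finish_reuse(HeapWriter *w, size_t *len) {
    char *out = w->data;
    *len = w->len;
    w->data = NULL;   /* the writer no longer owns the memory */
    return out;
}
```

In CPython the real finish step would also have to produce a proper object header, so this is only an analogy for where the allocation can be saved.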


What does “fast enough” mean? Who determines what is fast enough?
Why not make it as fast as we can (within the constraints of keeping a portable, usable API)?


I wrote PR gh-140129 which allocates PyTupleWriter on the stack instead of the heap.

Benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal() where heap is PR gh-139891 and stack is this PR gh-140129.

Benchmark        tuple      heap                    stack
tuple-1          31.3 ns    42.3 ns: 1.35x slower   39.1 ns: 1.25x slower
tuple-5          51.0 ns    70.4 ns: 1.38x slower   65.2 ns: 1.28x slower
tuple-10         74.6 ns    106 ns: 1.41x slower    96.7 ns: 1.30x slower
tuple-100        567 ns     803 ns: 1.42x slower    754 ns: 1.33x slower
tuple-1000       5.52 us    7.73 us: 1.40x slower   7.33 us: 1.33x slower
Geometric mean   (ref)      1.39x slower            1.30x slower

Comparison of stack to heap:

Benchmark        heap       stack
tuple-1          42.3 ns    39.1 ns: 1.08x faster
tuple-5          70.4 ns    65.2 ns: 1.08x faster
tuple-10         106 ns     96.7 ns: 1.09x faster
tuple-100        803 ns     754 ns: 1.07x faster
tuple-1000       7.73 us    7.33 us: 1.05x faster
Geometric mean   (ref)      1.07x faster

So yeah, the stack memory flavor is faster. I still prefer the heap flavor since it gives more freedom to change the PyTupleWriter structure in the future :slight_smile: (no need to hardcode the structure size in the ABI)
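The ABI trade-off being weighed here can be sketched in plain C (names are hypothetical, not CPython’s): an opaque type can only be created through a factory function (heap), but the library remains free to change its layout; a stack-allocatable type must expose its full definition, so its size becomes frozen into the ABI.

```c
#include <assert.h>
#include <stdlib.h>

/* --- Heap flavor: opaque handle --------------------------------------
 * Callers never see the struct layout, so the library can add or reorder
 * fields in a future version without breaking compiled extensions. */
typedef struct OpaqueWriter OpaqueWriter;   /* size unknown to callers */

OpaqueWriter *ow_create(void);
void ow_destroy(OpaqueWriter *w);

/* Library internals -- free to change between versions: */
struct OpaqueWriter { int version; char scratch[32]; };
OpaqueWriter *ow_create(void) { return calloc(1, sizeof(OpaqueWriter)); }
void ow_destroy(OpaqueWriter *w) { free(w); }

/* --- Stack flavor: exposed layout ------------------------------------
 * Callers can declare `StackWriter w;` locally (no heap allocation), but
 * sizeof(StackWriter) is now part of the ABI contract: adding a field
 * later would break every extension compiled against the old header. */
typedef struct { int version; char scratch[32]; } StackWriter;
```

This is exactly the freedom-vs-speed trade the post above describes: the opaque flavor pays one heap allocation per writer to keep the layout private.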

As a data point from practical use of tuple creation:

Most uses in my code are for small tuples (2-5 items) stored in lists, used e.g. to represent parsing output, or for tuples representing database query rows (mostly in the range of 2-50 items; only a few going up to 100 or 200), also stored in lists.

The tuples are usually all the same length in those lists.

I use PyTuple_SET_ITEM() for most of those use cases. I would not want to switch to an API which is 30-40% slower.

As I wrote before, PyTupleWriter is for code which currently uses _PyTuple_Resize(): when the input length is not known in advance.
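That "length unknown in advance" pattern can be sketched with a self-contained toy writer over ints (IntWriter and its functions are my inventions, not CPython’s API): overallocate geometrically while consuming input of unknown length, then shrink to fit once, which is roughly what PySequence_Tuple does today with _PyTuple_Resize.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy writer for the "length unknown in advance" use case --
 * an illustration of the growth strategy, not CPython's PyTupleWriter. */
typedef struct { int *items; size_t len, cap; } IntWriter;

int iw_add(IntWriter *w, int v) {
    if (w->len == w->cap) {
        /* Grow geometrically so N appends cost O(N) amortized. */
        size_t ncap = w->cap ? w->cap * 2 : 8;
        int *p = realloc(w->items, ncap * sizeof *p);
        if (p == NULL) return -1;
        w->items = p;
        w->cap = ncap;
    }
    w->items[w->len++] = v;
    return 0;
}

int *iw_finish(IntWriter *w, size_t *len) {
    /* Shrink to fit once at the end, like a final _PyTuple_Resize. */
    int *p = realloc(w->items, (w->len ? w->len : 1) * sizeof *p);
    *len = w->len;
    return p ? p : w->items;
}
```

The writer API bundles exactly this bookkeeping (capacity tracking, growth, final shrink), which is why it is aimed at the resize-based code paths rather than at fixed-size PyTuple_SET_ITEM loops.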