Why would they be? I understand that non-tiny tuples are non-existent in most applications, but they could still be relevant for some of them.
IMO, this topic is full of strawmen and attacks on them – starting with the non-existent PyTupleWriter
in the original post. Apples-to-oranges benchmarks don’t help either.
Can we get back on track somehow?
What’s the purpose of this discussion – designing new API for use cases we’re missing, or criticizing what’s already there?
Looks like performance-wise there are wildly different use cases: 2, 16, or 100 items; with the size known/approximated/unknown in advance; with things like error handling for get_object in the OP.
Strings are even ickier, with their different storage representations, and tradeoffs involved in being fast with CPython’s current internals vs. allowing future changes.
Well, it depends on your use case. @emmatyping wrote a nice optimization to replace a list of bytes objects with PyBytesWriter in zlib, and it makes the code 10-30% faster: PR gh-139976.
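The shape of that optimization can be sketched in plain C (illustrative names only, not the CPython API): instead of collecting many small chunks in a list and joining them at the end, append every chunk into one growing buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy single-buffer writer: the names ByteWriter, bw_init, bw_write
 * and bw_finish are hypothetical, used only to model the strategy. */
typedef struct {
    char *buf;
    size_t len;   /* bytes written so far */
    size_t cap;   /* allocated size */
} ByteWriter;

static int bw_init(ByteWriter *w, size_t hint) {
    w->cap = hint ? hint : 16;
    w->len = 0;
    w->buf = malloc(w->cap);
    return w->buf ? 0 : -1;
}

static int bw_write(ByteWriter *w, const char *data, size_t n) {
    if (w->len + n > w->cap) {
        size_t cap = w->cap;
        while (w->len + n > cap)
            cap *= 2;                 /* doubling: amortized O(1) appends */
        char *p = realloc(w->buf, cap);
        if (!p)
            return -1;
        w->buf = p;
        w->cap = cap;
    }
    memcpy(w->buf + w->len, data, n);
    w->len += n;
    return 0;
}

/* Hand the buffer to the caller: no final join or copy is needed. */
static char *bw_finish(ByteWriter *w, size_t *len) {
    char *buf = w->buf;
    *len = w->len;
    w->buf = NULL;
    return buf;
}
```

The win over the list-of-chunks approach is that there is no second pass concatenating the pieces at the end.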
Ok, here is a benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal():
Benchmark | tuple | writer |
---|---|---|
tuple-1 | 33.0 ns | 42.4 ns: 1.28x slower |
tuple-5 | 51.4 ns | 70.3 ns: 1.37x slower |
tuple-10 | 74.1 ns | 105 ns: 1.42x slower |
tuple-100 | 567 ns | 802 ns: 1.41x slower |
tuple-1000 | 5.43 us | 7.71 us: 1.42x slower |
Geometric mean | (ref) | 1.38x slower |
Well, PyTuple_SET_ITEM() is a macro that modifies the tuple directly, whereas PyTupleWriter_AddSteal() is a function call with additional checks, so of course it’s slower.
Notes: PyTuple_SetItem() takes ownership of its argument, so there is no additional Py_INCREF/DECREF. PyTuple_SET_ITEM() is not available in the limited C API.
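The ownership rule can be modeled with a toy refcount in plain C (these are not the real CPython types): a “stealing” setter takes over the caller’s reference, so no incref/decref pair is needed, while a non-stealing setter must take its own reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature model of reference stealing. */
typedef struct { int refcnt; } Obj;

static Obj *obj_new(void) {
    Obj *o = malloc(sizeof *o);
    o->refcnt = 1;               /* caller owns one reference */
    return o;
}

typedef struct { Obj *items[8]; } Tup;

/* Like PyTuple_SET_ITEM()/PyTuple_SetItem(): steals the reference,
 * so the refcount is unchanged and the caller must not decref. */
static void tup_set_steal(Tup *t, int i, Obj *o) {
    t->items[i] = o;             /* no incref: the tuple now owns it */
}

/* A non-stealing setter takes its own reference instead: */
static void tup_set_incref(Tup *t, int i, Obj *o) {
    o->refcnt++;                 /* tuple takes a new reference */
    t->items[i] = o;             /* caller still owns its original one */
}
```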
My PR modifies PySequence_Tuple() to use PyTupleWriter. I ran a benchmark to measure the impact on performance.
I don’t think that it’s misleading: it compares one strategy (an array of 8 plus a list) to the writer strategy (an array of 16 plus a tuple). It gives an idea of whether the writer strategy is relevant or not.
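The “small inline array, then spill” strategy being compared can be sketched in plain C (names and the size 16 are illustrative, not CPython’s internals): the first items live in a fixed array inside the writer, so small results need no allocation at all, and only larger inputs move to a growing heap buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_CAP 16   /* modeled after the "array-16" strategy */

/* Hypothetical writer for ints, standing in for PyObject* items. */
typedef struct {
    int inline_items[INLINE_CAP];
    int *heap_items;    /* NULL while still using the inline array */
    size_t len, cap;
} IntWriter;

static void iw_init(IntWriter *w) {
    w->heap_items = NULL;
    w->len = 0;
    w->cap = INLINE_CAP;
}

static int *iw_items(IntWriter *w) {
    return w->heap_items ? w->heap_items : w->inline_items;
}

static int iw_add(IntWriter *w, int v) {
    if (w->len == w->cap) {                    /* spill or grow */
        size_t cap = w->cap * 2;
        int *p = malloc(cap * sizeof *p);
        if (!p)
            return -1;
        memcpy(p, iw_items(w), w->len * sizeof *p);
        free(w->heap_items);                   /* free(NULL) is fine */
        w->heap_items = p;
        w->cap = cap;
    }
    iw_items(w)[w->len++] = v;
    return 0;
}
```

With this layout, the common small-tuple case never touches the heap, which is exactly what the benchmark sizes above (2, 16, 100, …) probe.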
You’re right about PyTuple_SetItem() taking ownership. My mistake.
I used PyTupleWriter as an example because it has the simplest API, but the writers for strings and bytes share the same flaw: unnecessary heap allocation.
I’m in favour of adding the writer APIs, because they allow us to change the internal representation, but if we are to expect people to use them, we should make them as efficient as possible.
I’ll redo my initial example using the existing PyBytesWriter API.
Old style:

```c
PyObject *b = PyBytes_FromStringAndSize(NULL, 20);
// Handle error case
char *ptr = PyBytes_AsString(b);
memcpy(ptr, text, 20);
```
With PyBytesWriter:

```c
PyBytesWriter *w = PyBytesWriter_Create(20);
// Handle error case
void *ptr = PyBytesWriter_GetData(w);
memcpy(ptr, text, 20);
b = PyBytesWriter_Finish(w);
// Handle error case
```
There is no need to heap-allocate the writer. We should also guarantee success for writers created with the correct size up front, although this is less important here than in the tuple case.
Like this:

```c
PyBytesWriter w;
PyBytesWriter_Init(&w, 20);
// Handle error case
PyBytesWriter_WriteBytes(&w, text, 20);
b = PyBytesWriter_Finish(&w);
```
IMO, this is fine at the API level. It does want an optimization where PyBytesWriter_Finish doesn’t allocate, but instead reuses the writer’s memory.
But it turns out it was fast enough without that optimization.
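The two Finish strategies under discussion can be contrasted in plain C (hypothetical names, not the CPython API): a copying finish allocates a fresh result buffer and memcpys into it, while a reusing finish shrinks the writer’s own buffer in place and transfers ownership, avoiding the second allocation and the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy writer whose buffer holds the bytes written so far. */
typedef struct { char *buf; size_t len, cap; } Writer;

/* Copying finish: one extra allocation plus a memcpy. */
static char *finish_copy(Writer *w) {
    char *out = malloc(w->len ? w->len : 1);
    if (out)
        memcpy(out, w->buf, w->len);
    free(w->buf);
    w->buf = NULL;
    return out;
}

/* Reusing finish: shrink in place and hand the buffer off. */
static char *finish_reuse(Writer *w) {
    char *out = realloc(w->buf, w->len ? w->len : 1);
    if (!out)
        out = w->buf;        /* shrink failed: original is still valid */
    w->buf = NULL;           /* ownership transferred to the caller */
    return out;
}
```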
What does “fast enough” mean? Who determines what is fast enough?
Why not make it as fast as we can (within the constraints of keeping a portable, usable API)?
I wrote PR gh-140129 which allocates PyTupleWriter on the stack instead of on the heap.
Benchmark comparing PyTuple_SET_ITEM() to PyTupleWriter_AddSteal(), where heap is PR gh-139891 and stack is PR gh-140129.
Benchmark | tuple | heap | stack |
---|---|---|---|
tuple-1 | 31.3 ns | 42.3 ns: 1.35x slower | 39.1 ns: 1.25x slower |
tuple-5 | 51.0 ns | 70.4 ns: 1.38x slower | 65.2 ns: 1.28x slower |
tuple-10 | 74.6 ns | 106 ns: 1.41x slower | 96.7 ns: 1.30x slower |
tuple-100 | 567 ns | 803 ns: 1.42x slower | 754 ns: 1.33x slower |
tuple-1000 | 5.52 us | 7.73 us: 1.40x slower | 7.33 us: 1.33x slower |
Geometric mean | (ref) | 1.39x slower | 1.30x slower |
Comparison of stack to heap:
Benchmark | heap | stack |
---|---|---|
tuple-1 | 42.3 ns | 39.1 ns: 1.08x faster |
tuple-5 | 70.4 ns | 65.2 ns: 1.08x faster |
tuple-10 | 106 ns | 96.7 ns: 1.09x faster |
tuple-100 | 803 ns | 754 ns: 1.07x faster |
tuple-1000 | 7.73 us | 7.33 us: 1.05x faster |
Geometric mean | (ref) | 1.07x faster |
So yeah, the stack memory flavor is faster. I still prefer the heap flavor since it gives more freedom to change the PyTupleWriter structure in the future (no need to hardcode the structure size in the ABI).
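The ABI-freedom argument is the standard opaque-type pattern, sketched here in plain C with illustrative names: the public header only declares a pointer, so the struct’s size never becomes part of the ABI and fields can be added in a later release without recompiling extensions.

```c
#include <assert.h>
#include <stdlib.h>

/* --- public header: the struct size is NOT part of the ABI --- */
typedef struct TupleWriter TupleWriter;   /* opaque forward declaration */
TupleWriter *TupleWriter_Create(size_t hint);
size_t TupleWriter_Capacity(const TupleWriter *w);
void TupleWriter_Free(TupleWriter *w);

/* --- implementation: free to change fields between versions --- */
struct TupleWriter {
    size_t len, cap;
    /* a v2 could add fields here without breaking compiled callers */
};

TupleWriter *TupleWriter_Create(size_t hint) {
    /* sizeof is evaluated only inside the library, never by callers */
    TupleWriter *w = malloc(sizeof *w);
    if (w) {
        w->len = 0;
        w->cap = hint;
    }
    return w;
}

size_t TupleWriter_Capacity(const TupleWriter *w) {
    return w->cap;
}

void TupleWriter_Free(TupleWriter *w) {
    free(w);
}
```

A stack-allocated writer, by contrast, forces callers to know sizeof(struct TupleWriter) at compile time, freezing the layout into the ABI.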
As a data point from practical use of tuple creation:
Most uses in my code are for small tuples (2-5 items) stored in lists, used e.g. to represent parsing output, or tuples representing database query rows (mostly in the range of 2-50; only a few going up to 100 or 200), also stored in lists.
The tuples are usually all the same length in those lists.
I use PyTuple_SET_ITEM() for most of those use cases. I would not want to switch to an API which is 30-40% slower.
As I wrote before, PyTupleWriter is for code which currently uses _PyTuple_Resize(): when the input length is not known in advance.