Inefficiencies in "writer" APIs

UPDATE: There was a bug in my benchmark. I fixed it and reran the benchmark. Now it’s faster instead of slower for tuple-1000 :grin:

Mark is referring to the issue "C API: Add PyTupleWriter API" that I just proposed. This API is mostly a replacement for _PyTuple_Resize(), for cases where the input size is not known in advance. For example, look at my PR to see how PySequence_Tuple() becomes simpler with PyTupleWriter.

When the input size is known, there are other existing safe functions: PyTuple_FromArray() (new! I just added it), PyTuple_Pack(), Py_BuildValue(), etc.

I ran a micro-benchmark comparing [tuple] to [writer]:

  • [tuple]: PyTuple_New() and PyTuple_SetItem().
  • [writer]: PyTupleWriter_Create(), PyTupleWriter_AddSteal() and PyTupleWriter_Finish().

| Benchmark      | tuple   | writer                |
|----------------|---------|-----------------------|
| tuple-1        | 37.4 ns | 41.3 ns: 1.10x slower |
| tuple-5        | 65.7 ns | 68.8 ns: 1.05x slower |
| tuple-10       | 99.9 ns | 102 ns: 1.02x slower  |
| tuple-100      | 800 ns  | 762 ns: 1.05x faster  |
| tuple-1000     | 7.68 us | 7.28 us: 1.05x faster |
| Geometric mean | (ref)   | 1.01x slower          |
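
As a sanity check on the last row, here is a minimal sketch that recomputes the geometric mean of the writer/tuple timing ratios from the numbers in the table. The function and array names are made up for this illustration, and it assumes the usual definition (the n-th root of the product of the per-benchmark ratios); it is not the benchmark tool's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Timings from the table above, all converted to nanoseconds. */
static const double tuple_ns[]  = {37.4, 65.7, 99.9, 800.0, 7680.0};
static const double writer_ns[] = {41.3, 68.8, 102.0, 762.0, 7280.0};

/* n-th root of a positive value using Newton's method on x^n - value = 0.
   Written by hand only to keep the sketch free of libm. */
static double nth_root(double value, int n)
{
    assert(n > 0 && value > 0.0);
    double x = 1.0;  /* the ratios are close to 1, so start there */
    for (int iter = 0; iter < 100; iter++) {
        double x_pow = 1.0;  /* x^(n-1) */
        for (int i = 0; i < n - 1; i++) {
            x_pow *= x;
        }
        x = x - (x * x_pow - value) / (n * x_pow);
    }
    return x;
}

/* Geometric mean of changed[i] / ref[i]; a result above 1.0 means
   "changed" is slower than "ref" on average. */
double geometric_mean_ratio(const double *ref, const double *changed, size_t n)
{
    assert(n > 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        product *= changed[i] / ref[i];
    }
    return nth_root(product, (int)n);
}
```

Calling geometric_mean_ratio(tuple_ns, writer_ns, 5) gives a ratio of about 1.01, which agrees with the "1.01x slower" row.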

tuple-1 is the worst-case scenario (it measures the overhead of the abstraction): it’s only 3.9 nanoseconds slower.

My implementation calls PyTuple_New() and _PyTuple_Resize() internally, so it’s hard to be faster than these functions.

IMO, being 3.9 ns (1.10x) slower in the worst case on a micro-benchmark is an acceptable trade-off for an abstraction and a safer API.

A PyTupleWriter instance is allocated on the heap, but there is a free list which reduces the cost of the memory allocation and deallocation. I designed the API to be compatible with the stable ABI in the long term. Hiding the structure members is required for that. I would also prefer not to have a structure of a fixed size, since it would make the implementation more complicated and less flexible (it would be harder to try other optimizations later).
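
To illustrate the free list idea, here is a minimal, self-contained sketch. It is not the PyTupleWriter implementation: the Writer structure, the writer_alloc() and writer_free() functions and the FREELIST_MAX limit are all hypothetical names chosen for this example, and it is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

#define FREELIST_MAX 4   /* hypothetical cap on cached instances */

/* Hypothetical stand-in for the writer structure. In a real API the
   members would be hidden from callers behind an opaque pointer. */
typedef struct Writer {
    struct Writer *next_free;  /* link used only while on the free list */
    size_t size;               /* number of items written so far */
} Writer;

static Writer *freelist = NULL;
static int freelist_len = 0;

Writer *writer_alloc(void)
{
    Writer *w;
    if (freelist != NULL) {
        /* Fast path: reuse a cached instance, no malloc() call. */
        assert(freelist_len > 0);
        w = freelist;
        freelist = w->next_free;
        freelist_len--;
    }
    else {
        w = malloc(sizeof(Writer));
        if (w == NULL) {
            return NULL;
        }
    }
    w->next_free = NULL;
    w->size = 0;
    return w;
}

void writer_free(Writer *w)
{
    if (w == NULL) {
        return;
    }
    if (freelist_len < FREELIST_MAX) {
        /* Keep the memory around for the next writer_alloc() call. */
        w->next_free = freelist;
        freelist = w;
        freelist_len++;
    }
    else {
        free(w);
    }
}
```

With this pattern, a create/finish cycle repeated in a loop only pays for malloc() on the first iteration; later iterations pop the cached instance.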

PyTupleWriter uses a small array of 16 items to avoid having to resize small tuples multiple times. It switches to an internal concrete tuple object for 17 items or more. I would prefer not to leak such implementation details in the ABI.
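
The small array strategy can be sketched as follows. This is a simplified model, not the actual code: it stores plain void* items, and once the 16 inline slots are full it spills to a malloc()'ed buffer instead of an internal tuple object. The ItemWriter type and the item_writer_*() functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_SIZE 16  /* number of inline slots, as described above */

typedef struct {
    void *small[SMALL_SIZE];  /* inline storage, no allocation needed */
    void **items;             /* points to small[] or to a heap buffer */
    size_t size;              /* number of items stored */
    size_t capacity;          /* number of slots available in items */
} ItemWriter;

void item_writer_init(ItemWriter *w)
{
    w->items = w->small;
    w->size = 0;
    w->capacity = SMALL_SIZE;
}

/* Append one item. Return 0 on success, -1 on memory allocation failure. */
int item_writer_add(ItemWriter *w, void *item)
{
    assert(w != NULL);
    if (w->size == w->capacity) {
        size_t new_capacity = w->capacity * 2;
        void **buf;
        if (w->items == w->small) {
            /* First spill: copy the inline items to a heap buffer. */
            buf = malloc(new_capacity * sizeof(void *));
            if (buf == NULL) {
                return -1;
            }
            memcpy(buf, w->small, w->size * sizeof(void *));
        }
        else {
            buf = realloc(w->items, new_capacity * sizeof(void *));
            if (buf == NULL) {
                return -1;
            }
        }
        w->items = buf;
        w->capacity = new_capacity;
    }
    w->items[w->size++] = item;
    return 0;
}

/* Release the heap buffer, if any, and reset to the inline storage. */
void item_writer_clear(ItemWriter *w)
{
    if (w->items != w->small) {
        free(w->items);
    }
    item_writer_init(w);
}
```

Up to 16 items, no extra memory allocation happens at all; the 17th item triggers the first one. Since the structure layout and the threshold are internal, they could change without affecting callers if the type were exposed only as an opaque pointer.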

I didn’t measure the performance of PyTupleWriter_AddArray(), but it should be more efficient since it works on an array.
