Ideas for forward-compatible and fast extension libraries in Python 3.12

wjakob · May 25, 2022, 11:55am

Dear CPython team,

I am currently in the process of making the nanobind C++ ↔ Python binding library ready for an eventual switch to Py_LIMITED_API. nanobind is essentially a faster minified rewrite of pybind11, hence this can also be interpreted as a proof-of-concept for eventually making similar changes in pybind11.

This is to …

be resilient to changes in internal data structures that are likely going to be pursued by the faster CPython team.
dramatically simplify the wheel distribution situation. The effect of this won’t be felt for some years due to the latency between merging something into CPython main and this feature actually being present on the user’s side. Still, the eventual possibility of a shipping a single wheel per platform would be magnificent and makes this a worthy pursuit.

Of course, switching to a limited API would be a shame if this means that the extension library runs slowly. I would like to start a discussion about changes in CPython that would make it possible to have a forward-compatible ABI and retain performance.

Hence, here is a “pie in the sky” wish list for Python 3.12, in order of perceived importance, with some commentary.

A way to create types with a custom metaclass using the PyType_FromSpec - style interface. A PR for a proposed new function PyType_FromMetaclass is here: gh-60074: add new stable API function PyType_FromMetaclass by wjakob · Pull Request #93012 · python/cpython · GitHub

Why needed: this allows the binding tool to dynamically create a type without messing with opaque PyTypeObject internals. The ability to specify a metaclass is important because the binding tool may, e.g., need to intercept type-related operations like subclassing to update its internal data structures.
A way for this function to extend the __basicsize__ of types (including PyTypeObject itself). Petr Viktorin had a proposal that would legitimize this and provide a safe API for obtaining a pointer to the added storage region: Mailman 3 C-API for extending opaque types - capi-sig - python.org

Why needed: this will allow the binding tool to store its type data structures directly in the Python type object, which significantly reduces pointer chasing and improves performance. Without this, any type-related binding code will require indirections through a hash-table (to map PyTypeObject * to an internal data structure), which is just not a good idea (imagine if every call to Py_TYPE(..) performed a dictionary lookup…)
A way for custom callables to receive PEP 590 vector calls. PyType_FromSpec allows creating types with a __vectorcalloffset__ – nice! Is this usage allowed as part of Py_LIMITED_API? The documentation is not super-clear on this. There are certainly some obvious omissions… For example, a type providing __vectorcalloffset__ should also specify a Py_tp_call slot with the compatibility dispatcher PyVectorcall_Call, but this function is not part of the limited API. Similarly, decoding a vector call requires some constants/inline functions that aren’t included: PyVectorcall_NARGS, potentially also PY_VECTORCALL_ARGUMENTS_OFFSET.

Why needed: The most performance-critical function in a binding tool like nanobind or pybind11 is the function that receives a Python function call and then figures out what to do with it (it has to decode positional/keyword arguments, identify the right overload, perform implicit type conversions, handle exceptions, etc.). Binding frameworks normally implement custom callable objects that co-locate information needed to dispatch the function calls, which is why the traditional PyMethodDef-style interface is insufficient.

This code is very “tuple/dictionary-heavy” when implemented using the old tp_call-style interface, and that mixes badly with Py_LIMITED_API especially because the tuple memory layout is opaque. The vector call API is a huge improvement here because it get rid of most tuples and also the dictionaries and lots of CPython API calls related to them. Being able to efficiently receive vector calls is (IMO) much more important to issuing vector calls, but I have a wish list entry about the latter as well – see point number 6.
Add a limited API function that provides access to a PyObject** representation of a sequence type, but in a future-proof way. Concrete example code:
```
PyObject *const *PySequence_Items(PyObject *seq, PyObject **owner_out, Py_ssize_t *size_out) {
    if (PyTuple_Check(seq)) {
        *size_out = PyTuple_GET_SIZE(seq);
        *owner_out = Py_NewRef(seq);
        return ((PyTupleObject *) seq)->ob_item;
    } else if (PyList_Check(seq)) {
        *size_out = PyList_GET_SIZE(seq);
        *owner_out = Py_NewRef(seq);
        return ((PyListObject *) seq)->ob_item;
    } else {
        PyObject *temp = PySequence_List(seq);
        if (!temp)
            return NULL;

        *owner_out = temp;
        *size_out = PyList_GET_SIZE(temp);
        return ((PyListObject *) temp)->ob_item;
    }
}
```
Why needed: type conversion become more expensive when lists/tuples are opaque, especially when working with very large lists/vectors. The idea of this change is to provide a read-only interface akin to PySequence_Fast_ITEMS which provides direct access to a PyObject **-style representation that the binding code can quickly iterate over.

PySequence_Fast_ITEMS is not part of the limited API because it relies on implementation details (for example, the fact that lists/tuples actually have a contiguous PyObject **-based representation). The proposed function also uses that in this specific implementation to be fast, but it could also be implemented in numerous different ways differently. The important thing is that it returns an owner object of an unspecified type (via the owner_out parameter). Its only purpose is to hold a strong reference that the callee must continue to hold until it is done accessing the return value. For example, the function could malloc an array of PyObject * pointers and then set owner to a PyCapsule, whose destructor will free the memory region. Note that it is illegal to write to the returned array of pointers, since this may not update the original sequence.

This feature is related to a discussion by Victor Stinner here: Mailman 3 (PEP 620) C API for efficient loop iterating on a sequence of PyObject** or other C types - Python-Dev - python.org
In addition to the functions in point 3, I propose adding PyObject_Vectorcall and PyObject_VectorcallMethod to the limited API.

Why needed: For binding code that implements callbacks (e.g. a C++ GUI library like Qt that wants to dispatch a button push to a Python handler), it is also useful to be able to issue vector calls. This will significantly cut down tuple/dictionary construction which, again, is costly when operations like PyTuple_SET_ITEM cannot be inlined and dictionaries potentially have to be created to pass keywords. This is (IMO) less important than the receiving end mentioned in point 3, but making both of those changes would be nicely symmetric.
Clarify the status of keyword arguments in PEP 590 vector calls. It would be helpful if the spec would explicitly say that keyword arguments names must always be interned strings.

Why needed: to resolve keyword arguments, the binding library would need to perform many calls to the slow PyUnicode_Compare function. Python internally often uses a pattern where it first tries a pointer equality check and then falls back to _PyUnicode_EQ (not part of the limited API). It would IMO be much easier to simply establish a clear rule that all extension libraries must abide by.

Anyways, this was a long thread — thank you if you read all the way to the end. It would be wonderful if some of these features could be included in Python 3.12 so that it could provide a foundation for extension modules that are both forward-compatible and fast.

Some of these ideas might be controversial, and the purpose of this post was to stimulate some discussion. I don’t think that all of these features are critical, but especially the first 2-3 offer a big bang for the buck.

Thanks,
Wenzel

encukou · May 25, 2022, 2:14pm

Let’s do that!
I’m in favor, obviously
Yes, it seems vectorcall is indeed ready for getting a promise of decades of support.
I Perhaps we can even make PyType-FromSpec set tp_call to PyVectorcall_Call by default.

Like PySequence_Fast_ITEMS, there are issues with a list being mutated while its underlying array is in use. The behavior is predictable in CPython, but might be different in other implementations.
- Is this actually needed for lists?
Is this still needed with vectorcall?

Sounds reasonable.
I don’t think we can tighten the spec now, since vectorcall is used in the wild.
Can we expose something closer to _PyUnicode_EQ instead?

cc @ctismer — would this help PySide?

wjakob · May 25, 2022, 2:40pm

That would make sense.

This is relevant even when the vector call feature is implemented. For example, consider taking a Python List[float] and using it to call a C++ function taking a std::vector<float> – the suggested interface would remove one layer of indirection to access the list elements during this conversion. I don’t see how lists/tuples would make a fundamental difference. If anything, it’s more important for lists because they tend to be larger.

This kind of function would only be used temporarily for read-only access. Perhaps the contract of the function could be set up so that it’s really only valid in a purely read-only context: no guarantees are made if the original list is mutated, and the user of this function can also not update the returned pointer array in any way.

It should be possible to ensure in CPython itself, and there probably aren’t many extensions that are using vectorcalls yet. Elevating vector calls to a stable ABI without addressing this point would IMO be an important missed optimization opportunity.

encukou · May 25, 2022, 3:18pm

To call a user-defined function, you already need to copy the list and to incref each element (since the function can mutate the list and thus drop the references you’re borrowing). Right?
Would it work to have a function to copy the pointers to a pre-allocated PyObject ** buffer, increfing each? And a function for mass decref?

But if there are any, they’ll start failing in mysterious ways. I really don’t think we can do that change.
We can encourage interned strings, but the checks/fallbacks needs to stay.

ctismer · May 25, 2022, 3:26pm

Hi Petr,

I have been struggling quite much until I could tweak the CPython implementation
to play well with PySide and the Limited API, of course. In CPython, it was possible to do
some nasty tricks like changing the meta type after type creation.

For porting PySide to PyPy, a heavy undertaking in fact, I had to do much more drastic things:

write a special typefactory.cpp which creates types in a PyPy compatible way: The base function
for type creation is

LIBSHIBOKEN_API PyTypeObject *SbkType_FromSpec_BMDWB(PyType_Spec *spec,
PyObject *bases,
PyTypeObject *meta,
int dictoffset,
int weaklistoffset,
PyBufferProcs *bufferprocs);
and of course a few convenience versions of that with more speaking name
instead of using opaque type extensions, I had to replace that by lookup dicts.
A bit slower, but works nicely.

The SbkType_FromSpec_BMDWB function had a lot of CPython code replication, because
I needed a version that does not call PyType_Ready too early. Instead, all preparations
must be done, and then PyType_Ready must be called.

Summary:

Vectorcall would be nice to have. It will probably give some speed.
A function like the above SbkType_FromSpec_BMDWB function would be very helpful,
because we could get rid of those ugly patches.
A general type extension framework would be quite helpful, but not crucial.

And vectorcall is not interesting for PyPy at all. The acceleration of PySide code is great.

Cheers – Chris

guido · May 25, 2022, 3:33pm

The problem is that we cannot guarantee that the list isn’t mutated during the time that you are accessing the items this way (even an innocent DECREF call could run Python code that could manipulate the list, causing you to read garbage).

Since tuples are immutable they don’t have this problem, and the temporary copy you make for other types is also fine (since you are the only code with a reference to it). I think this would be okay if you took out the special case for lists. Would that work? Or do you expect to routinely work with lists that are so long that copying them would be prohibitive?

If that’s the case I’m not sure what to suggest. Making the caller promise on the Zen of Python that they won’t make any Python API calls, not even one little DECREF, while they access that array of object pointers is asking for trouble in my opinion – because usually, you can get away with it just fine (it doesn’t crash every time you violate the rule), so people get lax.

I could imagine that list objects would grow a new flag bit indicating that they are temporarily in read-only mode. All list-mutating operations should raise an error if this bit is set. Your API would then have to have an explicit “end access” operation, rather than just a DECREF of the owner object.

encukou · May 25, 2022, 3:38pm

These can be set using PyMemberDef.

Those are coming in 3.11!

wjakob · May 25, 2022, 3:39pm

It sounds like a copy is unavoidable in some cases. I do like the idea about tuples – that would at least give a copy-free option for one important case.

If the input type is not a tuple, it could be converted into one temporarily using PySequence_Tuple that is returned as the owner of the PyObject** pointer. Performing that copy + mass incref within CPython should be more efficient than getting out PyObject * pointers individually using the limited API.

encukou · May 25, 2022, 3:53pm

This is still making assumptions about memory layout: consider an implementation with tagged pointers and specialized small-int-only tuples which don’t store a PyObject* array at all.

My question might have been buried:

guido · May 25, 2022, 3:55pm

Yes, that sounds like a good idea. So now perhaps all we need is an API that takes a tuple and returns a pointer to its array of items, as part of the stable ABI. The user can call PySequence_Tuple prior to making that call if desired.

wjakob · May 25, 2022, 6:45pm

As @encukou’s mentioned, a hypothetical Python implementation might not represent tuples internally as PyObject** (tagged pointers, etc.), and so a hypothetical function

PyObject ** PyTuple_Data(PyObject *)

specific to tuples might not be sufficiently future-proof solution.

Yes, that is not a bad solution, but I think we can do better. If the underlying implementation has tuples stored as contiguous PyObject* pointers, then this copy is superfluous, and it would be nice to avoid it. In a more exotic implementation, the function could still malloc a dedicated output buffer, mass-Py_INCREF pointers, etc.

So my suggestion would be a slightly reduced version of my previous proposal that would be compatible with both of these requirements:


/**
   PySequence_Items()

   Take a sequence 'seq' and return a pointer to an
   `PyObject *` array representing its contents. The returned
   object array is immutable: it may neither be changed by the
   caller, nor will it reflect future changes performed to 'seq'.

   The 'owner_out' parameter is used to return a Python object 
   owning returned memory region. It may or may not equal 'seq',
   and no assumptions should be made about its type. Its only
   purpose is lifetime management -- in particular, the caller
   should `Py_DECREF(..)` this object once it no longer needs
   access to the object array returned by this function.
   The 'size_out' parameter returns the size of the returned
   sequence.
 */


PyObject *const *PySequence_Items(PyObject *seq,
                                  PyObject **owner_out,
                                  Py_ssize_t *size_out) {
    if (PyTuple_Check(seq)) {
        *size_out = PyTuple_GET_SIZE(seq);
        *owner_out = Py_NewRef(seq);
        return ((PyTupleObject *) seq)->ob_item;
    } else {
        PyObject *temp = PySequence_Tuple(seq);
        if (!temp) {
            *owner_out = NULL;
            *size_out = -1;
            return NULL;
        }

        *owner_out = temp;
        *size_out = PyTuple_GET_SIZE(temp);
        return ((PyTupleObject *) temp)->ob_item;
    }
}

A more exotic python implementation simply would not have the special case for tuples. It would always allocate new memory for a pointer array, mass-incref, and then return a Python capsule or similar via owner_out, which performs a mass-decref and releases the memory upon expiration of the capsule.

The caller would use this function as follows

PyObject *seq = ...;

PyObject *owner;
Py_ssize_t size;

PyObject **items = Py_Sequence_Items(seq, &owner, &size);
if (!items) { .. error handling .. }

for (Py_ssize_t i = 0; i < size; ++i) {
    .. something involving items[i] ..
}

Py_DECREF(owner);

What do you think?

encukou · May 26, 2022, 10:16am

One possible downside I see is that copying a Python list to a std::vector<PyObject*> would involve two copies – one for the temporary tuple, and another to the vector. (For those not familiar with std::vector: it’s a resizable array much like a CPython list).
I don’t know if that’s a use case worth optimizing for.
OTOH, I also don’t know if zero-copy for tuples is worth it, since vectorcall might replace most uses of large tuples.

markshannon · June 16, 2022, 5:26pm

In reply to the numbered points:

1,2 as Petr said.

Just use the vectorcall protocol, already
All the macros and function pointer typedefs should be part of the limited API; they are specified in PEP 590 as final.
Feel free to copy and paste them if necessary, they aren’t going to change.
Not going to happen, sorry. Apart from not being safe, it is likely to be an obstacle to future improvements.
See 3. The protocol works both ways. If we add the functions to the limited API (which I have no problem with), then you’ll only be able to use them in 3.12. The underlying protocol is good for 3.8 onward.
I don’t think that we should force the strings to be interned, as that is a breaking change.
I’d be happy documenting that performance is likely to be better if they are interned, though.

encukou · June 17, 2022, 11:37am

PEP 590 says they’re not part of the stable ABI (and so, the limited API).
If we add them we need to make sure calls stay consistent when __call__ on a class from Python code. The discussion is in gh-93274.

steve-s · July 15, 2022, 12:58pm

Could you expand on this? I think that the point that doing an actual API call to get each element one by one is going to cause performance issues may be valid. Is this comment about that idea in general, or about the details?

I would like to better understand your concerns, because HPy is considering providing similar APIs.

Note that the API is not meant to force the implementation to use some specific storage strategy, the array of PyObject* is just a way to exchange the data between the caller and the Python engine.

I think that providing a buffered access would be better to avoid materialization of some optimized compact representation of otherwise large list/tuple – the idea is to provide something like PySequence_Items(PyObject *seq, PyObject **buffer, Py_SSize_t start, Py_SSize_t size) and the user would iterate the sequence in two loops, outer loop fetching a buffer of some smallish size like 256 elements, inner loop would iterate over that. In such way, the inner loop would be free of the API calls to fetch elements, i.e., can be better optimized by the C compiler. This is API that R provides to iterate over its vectors. The issue is, however, that what useful thing can one do with opaque PyObject* other than do some other API calls. @wjakob do you have any use-case in mind? In R the vectors are typed, so you’re getting out, e.g., a buffer of plain C integers.

encukou · August 8, 2022, 1:33pm

PyType_FromMetaclass is in 3.12
Extending opaque types is next up on my TODO list
Receiving vectorcall is now in 3.12
sequences as arrays is currently low on my TODO list, but I’m still leaning toward adding API where you pass in a pre-allocated buffer, like
- PySequence_IntoArray(PyObject* seq, PyObject** array, size)
- Py_DecrefArray(PyObject** array, size)
Calling vectorcall – I’ll get to it eventually, but if anyone wants to push it further go ahead.
interned strings are not required, but using them is likely to speed things up in- and outside CPython core. Adding docs & helpers would be nice (maybe expose an eq function optimized for interned strings?)

Python sequences are un-specialized PyObject* arrays. I think R’s API would be more suitable for NumPy arrays, or Python array if we wanted it to be more heavy-duty (which we don’t, IMO), or other specialized types.
IOW, specialized list-of-int sounds like a reasonable optimization, but exposing it in API doesn’t sound worthwhile.

If avoiding materialization is a concern, you ideally want an iterator. Or, for N-D arrays, NumPy again.

ronaldoussoren · August 9, 2022, 9:42am

Objective-C/Cocoa has an NSFastEnumeration protocol that provides a middle ground between the current protocol and your proposal: the caller provides a buffer and an iteration state and calls that API in a loop to avoid having to copy the entire contents into a temporary buffer. Basically PyDict_Next, but with more than one item at a time.

encukou · August 9, 2022, 9:49am

Out of curiosity, what happens there if the sequence changes while you’re iterating?

But the bigger question is why. What’s the use case for external API that avoids full materialization of a generic sequence?
For the Limited API specifically, it doesn’t sound like a good fit.

ronaldoussoren · August 9, 2022, 11:38am

To be honest I don’t know what happens when the sequence changes during iteration. The Cocoa protocol does not have a way to signal errors, and in Cocoa exceptions are generally not used other than to signal programming errors.

A use case for avoiding full materialisation is that this might use a lot of memory. E.g. when iterating over all elements in a sequence of a couple of million entries. In the grand scheme of things that’s still not a lot of memory, but using a smaller buffer might allow using a stack-based buffer and hence avoid dynamic allocations.

steve.dower · August 9, 2022, 12:57pm

Windows also uses this pattern fairly frequently (example), mostly because many implementations might have long roundtrip times (e.g. there’s a “find files” one somewhere that might involve network calls), so it lets the caller decide how much latency they care about.

A “PyIter_GetNextFew” API would probably fit the bill, and should certainly be useful. If we wanted to specialise by taking a (single) ParseArgs-style format character to extract raw values as well, that could be nicely efficient for a lot of common cases.