Dear CPython team,
I am currently in the process of making the nanobind C++ ↔ Python binding library ready for an eventual switch to `Py_LIMITED_API`. nanobind is essentially a faster, minified rewrite of pybind11, hence this can also be interpreted as a proof of concept for eventually making similar changes in pybind11.
This is to …

- be resilient to changes in internal data structures that are likely going to be pursued by the Faster CPython team.
- dramatically simplify the wheel distribution situation. The effect of this won’t be felt for some years due to the latency between merging something into CPython `main` and this feature actually being present on the user’s side. Still, the eventual possibility of shipping a single wheel per platform would be magnificent and makes this a worthy pursuit.
Of course, switching to a limited API would be a shame if it meant that extension libraries run slowly. I would like to start a discussion about changes in CPython that would make it possible to have a forward-compatible ABI and retain performance.
Hence, here is a “pie in the sky” wish list for Python 3.12, in order of perceived importance, with some commentary.
1. A way to create types with a custom metaclass using the `PyType_FromSpec`-style interface. A PR for a proposed new function `PyType_FromMetaclass` is here: python/cpython#93012 (“gh-60074: add new stable API function PyType_FromMetaclass”).

   Why needed: this allows the binding tool to dynamically create a type without messing with opaque `PyTypeObject` internals. The ability to specify a metaclass is important because the binding tool may, e.g., need to intercept type-related operations like subclassing to update its internal data structures.
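   For concreteness, here is a minimal sketch of how a binding library might call the proposed function, assuming the signature from the PR; the `nb_meta` metaclass and everything named `my_ext.*` are hypothetical placeholders:

   ```c
   #include <Python.h>

   /* hypothetical usage of the proposed PyType_FromMetaclass() */
   static PyType_Slot my_type_slots[] = {
       {Py_tp_doc, (void *) "an example bound type"},
       {0, NULL}
   };

   static PyType_Spec my_type_spec = {
       .name = "my_ext.MyType",
       .basicsize = 0,  /* inherit from the base type */
       .flags = Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,
       .slots = my_type_slots
   };

   /* nb_meta: a custom metaclass that, e.g., intercepts subclassing */
   static PyObject *make_bound_type(PyTypeObject *nb_meta) {
       return PyType_FromMetaclass(nb_meta, /* module */ NULL,
                                   &my_type_spec, /* bases */ NULL);
   }
   ```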
2. A way for this function to extend the `__basicsize__` of types (including `PyTypeObject` itself). Petr Viktorin had a proposal that would legitimize this and provide a safe API for obtaining a pointer to the added storage region: see the capi-sig thread “C-API for extending opaque types”.

   Why needed: this will allow the binding tool to store its type data structures directly in the Python type object, which significantly reduces pointer chasing and improves performance. Without this, any type-related binding code will require indirections through a hash table (to map a `PyTypeObject *` to an internal data structure), which is just not a good idea (imagine if every call to `Py_TYPE(..)` performed a dictionary lookup…).
3. A way for custom callables to receive PEP 590 vector calls. `PyType_FromSpec` allows creating types with a `__vectorcalloffset__` – nice! Is this usage allowed as part of `Py_LIMITED_API`? The documentation is not super-clear on this. There are certainly some obvious omissions… For example, a type providing `__vectorcalloffset__` should also specify a `Py_tp_call` slot with the compatibility dispatcher `PyVectorcall_Call`, but this function is not part of the limited API. Similarly, decoding a vector call requires some constants/inline functions that aren’t included: `PyVectorcall_NARGS`, potentially also `PY_VECTORCALL_ARGUMENTS_OFFSET`.

   Why needed: the most performance-critical function in a binding tool like nanobind or pybind11 is the one that receives a Python function call and then figures out what to do with it (it has to decode positional/keyword arguments, identify the right overload, perform implicit type conversions, handle exceptions, etc.). Binding frameworks normally implement custom callable objects that co-locate the information needed to dispatch these function calls, which is why the traditional `PyMethodDef`-style interface is insufficient.

   This code is very “tuple/dictionary-heavy” when implemented using the old `tp_call`-style interface, and that mixes badly with `Py_LIMITED_API`, especially because the tuple memory layout is opaque. The vector call API is a huge improvement here because it gets rid of most tuples, and also of the dictionaries and the many CPython API calls related to them. Being able to efficiently receive vector calls is (IMO) much more important than issuing them, but I have a wish list entry about the latter as well – see point 5.
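   For illustration, here is roughly what such a custom callable looks like when built via `PyType_FromSpec` (the `nb_func` naming is made up); whether each ingredient below is permitted under `Py_LIMITED_API` is exactly the open question:

   ```c
   #include <Python.h>
   #include <structmember.h>  /* T_PYSSIZET, READONLY */

   typedef struct {
       PyObject_HEAD
       vectorcallfunc vectorcall;  /* set when the instance is created */
   } nb_func;

   static PyObject *nb_func_vectorcall(PyObject *self, PyObject *const *args,
                                       size_t nargsf, PyObject *kwnames) {
       /* PyVectorcall_NARGS is one of the missing limited-API pieces */
       Py_ssize_t nargs = PyVectorcall_NARGS(nargsf);
       /* ... decode args/kwnames, select overload, convert types ... */
       (void) self; (void) args; (void) kwnames;
       return PyLong_FromSsize_t(nargs);
   }

   /* exposing this member makes PyType_FromSpec set tp_vectorcall_offset */
   static PyMemberDef nb_func_members[] = {
       {"__vectorcalloffset__", T_PYSSIZET,
        offsetof(nb_func, vectorcall), READONLY, NULL},
       {NULL, 0, 0, 0, NULL}
   };

   static PyType_Slot nb_func_slots[] = {
       {Py_tp_members, nb_func_members},
       /* compatibility path for callers that go through tp_call;
          PyVectorcall_Call is not in the limited API either */
       {Py_tp_call, (void *) PyVectorcall_Call},
       {0, NULL}
   };

   static PyType_Spec nb_func_spec = {
       .name = "my_ext.nb_func",
       .basicsize = (int) sizeof(nb_func),
       .flags = Py_TPFLAGS_DEFAULT,  /* plus the vectorcall flag, where settable */
       .slots = nb_func_slots
   };
   ```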
4. Add a limited API function that provides access to a `PyObject **` representation of a sequence type, but in a future-proof way. Concrete example code:

   ```c
   PyObject *const *PySequence_Items(PyObject *seq, PyObject **owner_out,
                                     Py_ssize_t *size_out) {
       if (PyTuple_Check(seq)) {
           *size_out = PyTuple_GET_SIZE(seq);
           *owner_out = Py_NewRef(seq);
           return ((PyTupleObject *) seq)->ob_item;
       } else if (PyList_Check(seq)) {
           *size_out = PyList_GET_SIZE(seq);
           *owner_out = Py_NewRef(seq);
           return ((PyListObject *) seq)->ob_item;
       } else {
           PyObject *temp = PySequence_List(seq);
           if (!temp)
               return NULL;
           *owner_out = temp;
           *size_out = PyList_GET_SIZE(temp);
           return ((PyListObject *) temp)->ob_item;
       }
   }
   ```
   Why needed: type conversion becomes more expensive when lists/tuples are opaque, especially when working with very large lists/vectors. The idea of this change is to provide a read-only interface akin to `PySequence_Fast_ITEMS` which provides direct access to a `PyObject **`-style representation that the binding code can quickly iterate over. `PySequence_Fast_ITEMS` is not part of the limited API because it relies on implementation details (for example, the fact that lists/tuples actually have a contiguous `PyObject **`-based representation). The proposed function also exploits those details in this specific implementation to be fast, but it could be implemented in numerous other ways. The important thing is that it returns an owner object of an unspecified type (via the `owner_out` parameter), whose only purpose is to hold a strong reference that the caller must keep until it is done accessing the return value. For example, the function could `malloc` an array of `PyObject *` pointers and then set the owner to a `PyCapsule` whose destructor will `free` the memory region. Note that it is illegal to write to the returned array of pointers, since this may not update the original sequence.

   This feature is related to a discussion started by Victor Stinner on Python-Dev: “(PEP 620) C API for efficient loop iterating on a sequence of PyObject** or other C types”.
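   A hypothetical caller would consume it along these lines (the `count_none` helper is purely illustrative):

   ```c
   /* sketch: consuming the proposed PySequence_Items() interface */
   Py_ssize_t count_none(PyObject *seq) {
       PyObject *owner;
       Py_ssize_t size, n = 0;
       PyObject *const *items = PySequence_Items(seq, &owner, &size);
       if (!items)
           return -1;  /* e.g., 'seq' could not be converted to a list */
       for (Py_ssize_t i = 0; i < size; ++i)
           n += (items[i] == Py_None);  /* strictly read-only access */
       Py_DECREF(owner);  /* done accessing 'items' */
       return n;
   }
   ```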
5. In addition to the functions in point 3, I propose adding `PyObject_Vectorcall` and `PyObject_VectorcallMethod` to the limited API.

   Why needed: for binding code that implements callbacks (e.g. a C++ GUI library like Qt that wants to dispatch a button push to a Python handler), it is also useful to be able to issue vector calls. This will significantly cut down on tuple/dictionary construction, which, again, is costly when operations like `PyTuple_SET_ITEM` cannot be inlined and dictionaries potentially have to be created to pass keywords. This is (IMO) less important than the receiving end mentioned in point 3, but making both of those changes would be nicely symmetric.
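   For example, a callback dispatcher could then issue a call without materializing an argument tuple; the `dispatch_clicked` helper below is a made-up sketch:

   ```c
   #include <Python.h>

   /* sketch: dispatching a GUI callback via the vectorcall API */
   static PyObject *dispatch_clicked(PyObject *handler, PyObject *button) {
       PyObject *args[2];
       args[1] = button;  /* args[0] stays writable; see the flag below */
       /* PY_VECTORCALL_ARGUMENTS_OFFSET tells the callee it may
          temporarily reuse args[-1], enabling zero-copy method calls */
       return PyObject_Vectorcall(handler, args + 1,
                                  1 | PY_VECTORCALL_ARGUMENTS_OFFSET,
                                  NULL /* no keyword arguments */);
   }
   ```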
6. Clarify the status of keyword arguments in PEP 590 vector calls. It would be helpful if the spec explicitly said that keyword argument names must always be interned strings.

   Why needed: to resolve keyword arguments, the binding library would otherwise need to perform many calls to the slow `PyUnicode_Compare` function. Python internally often uses a pattern where it first tries a pointer equality check and then falls back to `_PyUnicode_EQ` (not part of the limited API). It would IMO be much easier to simply establish a clear rule that all extension libraries must abide by.
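   Such a rule would make the following fast path reliable; the sketch assumes the binding library interns its known keyword names once at startup (e.g., via `PyUnicode_InternInPlace`):

   ```c
   #include <Python.h>

   /* 'name' comes from the kwnames tuple of an incoming vector call;
      'interned_key' is a keyword name the binding interned at startup */
   static int kwarg_matches(PyObject *name, PyObject *interned_key) {
       if (name == interned_key)  /* pointer check: hits if both are interned */
           return 1;
       /* slow fallback, only needed while interning is not guaranteed */
       return PyUnicode_Compare(name, interned_key) == 0;
   }
   ```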
Anyways, this was a long thread — thank you if you read all the way to the end. It would be wonderful if some of these features could be included in Python 3.12 so that it could provide a foundation for extension modules that are both forward-compatible and fast.
Some of these ideas might be controversial, and the purpose of this post was to stimulate some discussion. I don’t think that all of these features are critical, but especially the first 2-3 offer a big bang for the buck.
Thanks,
Wenzel