Concerns about PEP 620 ("Hide implementation details from the C API")

Dear Python team,

I would like to raise a discussion about PEP 620 entitled “Hide implementation details from the C API”. I am consciously not posting this to the PEP discussion sub-forum since that is AFAIK where PEP writers eventually submit their finished work for discussion; the context here is different.

This message is written from the perspective of a (co-) maintainer of pybind11 and nanobind, which are binding libraries bridging C++ and Python. pybind11 is widely used in numerical/ML frameworks including SciPy, Tensorflow, PyTorch, JAX, and others. Google is currently in the process of transitioning to it as the default binding tool for C++ projects.

For context: PEP 620 sets out to hide many CPython implementation details that extension libraries rely upon – this includes layout of core data types like PyObject, PyTupleObject, PyTypeObject, etc. The main motivation is the complexity of implementing alternative interpreters like PyPy that need to expose a conforming interface. That motivation makes sense.

However, there is also a flipside. The purpose of this message is to communicate the team’s significant unease about PEP 620. We’re worried about the fallout that this set of changes will have on pybind11, nanobind, and on the larger scientific python ecosystem. We fear that these changes, if realized as proposed, would come at a significant performance and implementation cost.

Just a few data points:

  • An opaque PyObject or PyTupleObject would mean a dramatic increase in the number of API calls for very basic steps like reference counting and unboxing tuples for function call dispatch. Every C/C++ <-> Python call will be affected by this. With the current API, very common constructions like Py_INCREF/PyTuple_GET_ITEM/PyTuple_SET_ITEM can be inlined by the compiler, and it is important that this continues to be possible. There is a more recent question of whether such functions should be implemented as macros or inline functions, and we don’t have strong opinions on that. It’s the prospect of them eventually becoming non-inlineable that seems concerning.

  • Related: function calls using the classic CPython tp_call API are very tuple/dictionary-heavy, which adds even more overhead to every function call if PEP 620 is realized. There is a newer API (vector calls, PEP 590) that has the potential to address this, but it appears to be considered an implementation detail (not part of the limited API, and no mention in PEP 620).

  • pybind11/nanobind access PyTypeObject internals all over the place. Alternative construction methods like PyType_FromSpec lack critical functionality. Even if it were possible to adapt to a fully opaque PyTypeObject (I am doubtful), somebody would have to sit down for months to figure out how to rearchitect pybind11/nanobind. And that’s just two libraries within a vast ecosystem of CPython extensions.
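To make the first point above concrete, here is a minimal standalone sketch of the difference between an inlineable reference-count increment and an opaque one. All names (`mock_object`, `mock_incref_*`) are illustrative stand-ins, not the real CPython layout:

```c
#include <stddef.h>

/* Mock of a PyObject reference-count header -- illustrative only,
 * field name mirrors CPython but this is NOT the real layout. */
typedef struct {
    ptrdiff_t ob_refcnt;
} mock_object;

/* Today: the layout is known, so Py_INCREF-style code compiles down
 * to a single in-place increment that the compiler inlines at every
 * use site. */
static inline void mock_incref_inline(mock_object *op) {
    op->ob_refcnt++;
}

/* Under a fully opaque API, the same operation becomes a genuine
 * function call across a shared-library boundary.  (Defined locally
 * here so this sketch links; in reality it would live inside the
 * interpreter binary and could not be inlined.) */
void mock_incref_opaque(mock_object *op);
void mock_incref_opaque(mock_object *op) { op->ob_refcnt++; }
```

The semantics are identical; the concern is purely that the second form adds call overhead to an operation that occurs on essentially every C/C++ <-> Python boundary crossing.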

The introduction of PyPy states:

While PyPy is way more efficient than CPython to run pure Python code, it is as efficient or slower than CPython to run C extensions.

One could argue that PEP 620 creates a level playing field between interpreters by removing an advantage that native CPython extensions have previously enjoyed (direct access to data structures). In other words, everything will run slower, but it will be consistent. This seems unfortunate given the huge ecosystem of scientific libraries that have been developed for CPython over the last decades.

Generally PEP 620 appears highly aligned with the “limited API”, and our suggestion and request would be that these drastic changes are made under the umbrella of the limited API without shutting the door to CPython internals.

Thanks!

2 Likes

Hi, thanks for your feedback. First of all, PEP 620 is a draft: it has been neither accepted nor rejected yet. For my own usage, it’s more a design document than a concrete change. PEP 620 seems to be too large to be accepted; moreover, as you noticed, it contains multiple controversial changes.

I started to split this large PEP into smaller PEPs: PEP 670 (convert macros to functions) and PEP 674 (Disallow using macros as l-values).

The performance impact of replacing macros and static inline functions with function calls should be measured. I’m not convinced that it’s significant, but it should be measured :slight_smile: The HPy API is fully based on function calls, and it has good performance on CPython.

Related: function calls using the classic CPython tp_call API are very tuple/dictionary-heavy

I don’t understand what you are referring to. The C API has many functions to call a function or a method. PEP 620 doesn’t change that.

pybind11/nanobind accesses PyTypeObject internals all over the place. Alternatives construction methods like PyType_FromSpec lack critical functionality.

Would you mind listing which functionalities are missing? Over the last 2 or 3 Python releases, the API has been extended multiple times.

Type Objects — Python 3.12.0a0 documentation lists which fields cannot be set at all using PyType_Spec and PyType_Slot. Several gaps have already been closed: for example, it’s now possible to set tp_weaklistoffset, tp_dictoffset and tp_vectorcall_offset using PyMemberDef.

Generally PEP 620 appears highly aligned with the “limited API”, and our suggestion and request would be that these drastic changes are made under the umbrella of the limited API without shutting the door to CPython internals.

If the limited API is incomplete, it should be completed.

3 Likes

@vstinner, if you are not going to submit PEP 620 for acceptance soon, could you mark it as Withdrawn or Deferred?

(I’m writing as an individual core developer only, not on behalf of any higher power)

1 Like

On the matter of inlineability[1], can you comment on what kind of stability guarantees would work for projects such as pybind11, @wjakob?

The question maybe is whether it would be acceptable, for performance-oriented modules, to renounce the Stable ABI. An API similar in principle to PEP 620 could exist without enforcing one ABI. Each Python implementation could expose the C API not just as headers linking to a binary interface, but as a small[2] source library that can be compiled into the extension module. This would be similar to the limited C API, but implementing an abstraction that is shared across different interpreters (and ideally throughout multiple releases).

In fact I believe that PEP 620 does not forbid this in its current form, although I don’t see it explicitly endorsed either. And in part this is similar to what HPy explored with the concept of multiple ABIs, with hpy.universal acting as the abstraction layer, but built into the interpreter implementation. Different implementations can have different trade-offs, e.g. CPython could decide to delegate everything to binary linking with the exception of refcounting and anything else that is deemed to be hot.

This would be a regression for the stability of extension modules binaries that opt into this interface because they would be bound to a specific ABI, but again, maybe this is acceptable for performance-critical code? And would this be a valid alternative to exposing interpreters internals directly? Mainly a question for pybind11 to clarify the expectations, rather than a concrete proposal.


  1. assuming that the performance impact of inlining vs calls is measured to be significant ↩︎

  2. “small” is relative, but since this could duplicate interpreter code into each extension module, it’d better be contained in size ↩︎

3 Likes

Just as an additional note on that - HPy even offers a “CPython ABI” layer in addition to the universal layer. The idea is that with the same source, you’ll compile against this on CPython if you want maximum performance. This compilation target then (as before) can use any number of tricks to be fast on CPython, including not limiting itself to the Stable ABI or even the public API.

From porting the Kiwi solver and matplotlib to HPy, we found that the universal mode (which does prevent inlining; e.g. all reference counting operations go through an indirection via a function pointer on the HPyContext argument) is indeed a few percent slower than the “CPython ABI” mode. The latter, however, is as expected just as fast as before.

So for a project like pybind (or Cython or others), I would think it’s just a matter of whether or not binary compatibility across many versions is worth the tradeoff of losing a bit of performance. If it is not, then these projects should maybe simply accept that they use internal APIs and build against an ABI that will likely break with each release?

5 Likes

One additional thought here - one idea behind hiding more implementation details is to make it easier to evolve CPython itself also, and make it faster. So the hope is also that not everything will run slower, but indeed that CPython also can implement some of the more interesting optimizations that are currently impossible, because they would leak through what is exposed.

7 Likes

Hi Victor,

thanks for the quick response. A few responses to different parts of your message:

I started to split this large PEP into smaller PEPs: PEP 670 (convert macros to functions) and PEP 674 (Disallow using macros as l-values).

This all sounds good to me. If the newly created functions in PEP 670 are provided as part of the CPython header files and can be inlined, this should have zero performance impact. Using macros as l-values seems like a code smell in any case; I don’t think we are doing that anywhere.

Next:

Related: function calls using the classic CPython tp_call API are very tuple/dictionary-heavy

I don’t understand what you are referring to. The C API has many functions to call a function or a method. PEP 620 doesn’t change that.

To clarify: my statement was about the situation of the callee, not the caller. The most performance-critical piece of code in any C++ <-> Python binding library is the dispatch function that receives a Python call and figures out what to do with it on the C++ side. This entails parsing ordinary, keyword, and default arguments and converting them into C++ counterparts, doing implicit type conversions, translating exceptions, etc., etc. Any simplifications in this central function will have a direct performance impact on the binding tool as a whole.

There are two ways to expose such a callable in Python: by implementing a tp_call-style interface that takes a tuple and an optional dict, or by using the new vector call API. The vector call API is really nice; I am a big fan of it. But it is not exposed in the limited API and requires writing to CPython data structures directly. The nanobind library relies on the availability of the vector call interface and will no longer work if these features are rendered inaccessible to extension modules.

pybind11 has existed for longer and uses the older tp_call interface, which is more costly since it involves traversing tuples and dicts. The issue that I see here is if PyTuple_GET_ITEM becomes an exported function, or a dummy wrapper that just forwards to an exported function like PyTuple_GetItem. In more complex projects, C++ functions will often in turn invoke Python code, which may call into C++ once more. So we are constantly building and traversing tuples; this is a really hot part of the API. Hiding such a core API behind an opaque shared library symbol will come with tangible costs. The way I understood it, this isn’t planned in PEP 670 or 674, but it is something that at least seemed to be considered in the larger-scoped PEP 620.

Here is another point on something that seems difficult with the limited API, and with the default API if the deprecations/removals listed in PEP 620 are realized:

  • pybind11 and nanobind map C++ types to Python. For this to work, they need to stash some extra type-related information beyond what is available in PyTypeObject. Unfortunately, PyTypeObject doesn’t have an extra pointer field that a binding library could use to stash this information. So, what can be done? pybind11 solves this issue by creating a hash table that maps PyTypeObject * to its internal data structures, which comes with significant overheads (usually a whole bunch of extra hash table lookups for every function call, to handle the types of the class and function arguments).

    Nanobind does this more efficiently by making a larger allocation and storing the PyTypeObject in the first part and its own data structures in the second part. The resulting memory region is passed to PyType_Ready(). There isn’t a way to do something similar with PyType_FromSpec and related API.
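The co-location trick can be sketched with standalone mock types (illustrative only; in the real scheme the first field is the interpreter’s PyHeapTypeObject and the combined allocation is handed to PyType_Ready()):

```c
#include <stdlib.h>

/* Mock stand-in for the interpreter's heap-type structure. */
typedef struct { int dummy_type_fields; } mock_heap_type;

/* Binding-specific data the library wants to attach to each type. */
typedef struct {
    void *cpp_type_info;   /* e.g. a std::type_info pointer */
    void *implicit_casts;  /* binding-specific conversion tables */
} binding_data;

/* One allocation: the type object followed by the binding data. */
typedef struct {
    mock_heap_type ht;     /* must come first */
    binding_data extra;
} bound_type;

/* Recover the binding data from a type pointer with plain pointer
 * arithmetic -- no hash-table lookup required. */
static binding_data *get_binding_data(mock_heap_type *t) {
    return &((bound_type *) t)->extra;
}
```

The lookup is O(1) pointer arithmetic instead of a hash-table probe, which is exactly what is lost when the type object must be allocated by an API the extension cannot size.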

Thanks,
Wenzel

1 Like

There is an open issue for accepting a metaclass in PyType_FromSpec (PyType_FromSpec should take metaclass as an argument · Issue #60074 · python/cpython · GitHub) - the patch that was attached to bpo simply uses the tp_alloc from the metaclass to create the memory. Wouldn’t this be a clean solution to that particular problem - Nanobind simply provides a larger structure for the types?

3 Likes

There is an open issue for accepting a metaclass in PyType_FromSpec (PyType_FromSpec should take metaclass as an argument · Issue #60074 · python/cpython · GitHub)

That’s a relatively old issue (10 years). If eventually realized, I agree that it sounds like a promising way to address the particular issue mentioned above.

1 Like

I agree with everything @timfelgentreff said about HPy, stable ABI and the related tradeoffs.
Moreover, I also wanted to clarify one more thing:

The non-opacity of PyTypeObject is just a tiny piece of what makes C extensions slow on PyPy. It plays a role, but it’s definitely not the biggest issue, see this blog post for more details.

So, PEP 620 surely does not level the field.

3 Likes

I recall PySide has a similar issue.
I guess there could be a slot for PyType_FromSpec to request an extra piece of memory. But that would only be useful for the library that creates that particular class, and would not be inherited.
Is that enough for pybind11’s use cases? And perhaps more importantly, would it be useful to other projects as well?
(Both managing several “scratch spaces” per class and making inheritance work look like tough issues to solve, but might make this relevant for other use cases.)

IMO, the ideal would be to try sticking to the stable ABI when possible, and reach for CPython-specific API as an optimization for specific CPython versions. Perhaps even as a compile-time option, so other versions & implementations can still be supported (with worse performance).

2 Likes

I recall PySide has a similar issue.
I guess there could be a slot for PyType_FromSpec to request an extra piece of memory. But that would only be useful for the library that creates that particular class, and would not be inherited.
Is that enough for pybind11’s use cases?

It’s relatively common that one would extend bound C++ classes from Python and then pass instances of those derived types again to C++ code. To support this efficiently, nanobind needs to be able to stash data structures following the PyTypeObject even for types inherited within Python. We currently handle this using a metaclass which intercepts tp_init so that it can, again, instantiate a type object that is sufficiently large and copy/initialize binding-specific fields. Getting both of these use cases to work with a modified/extended version of PyType_FromSpec would be very intriguing.

1 Like

Because the PyTypeObject structure is exposed as part of the public C API, and because C extensions actually use it, it is really complicated to add new members, remove members, or change the meaning of a member. When allocating a heap type, Python uses a different PyHeapTypeObject structure to get more members, and it uses a hack to add secret members after the PyHeapTypeObject structure. I really hate all of this. I would prefer having a single structure for all types and putting “secret members” there.

Also, as you say, the PyType_FromSpec() caller should have a way to request extra space for custom members without having to use an external storage like a hash table.

IMO the most reliable migration path to solve this issue is to move away from static types that set PyTypeObject members directly, move to the PyType_FromSpec() API (and variants), and get members with the opaque PyType_GetSlot() function call. Maybe some more specialized getter functions should be added to complement PyType_GetSlot(). For example, it would be convenient to have something like super() in C to call a parent method like tp_new or tp_init.

2 Likes

First, I wrote METH_FASTCALL to optimize function calls in Python: avoid the cost of creating a temporary tuple to pass positional arguments (and creating a temporary dict to pass keyword arguments). Then PEP 590 added a clean public API on top of it, and it also supports calling methods. If you want this API to be part of the limited C API, you can start by opening an issue to request it.

3 Likes

PEP 674 leaves PyTuple_GET_ITEM() unchanged since PyObject **items = &PyTuple_GET_ITEM(tuple, 0) remains a common code pattern to quickly access tuple items. IMO we should provide an API to expose a “view” of tuple items as a PyObject * array. My latest attempt to design an API for that went nowhere.

1 Like

IMO we should not underestimate the advantages of being able to provide a single binary compatible with multiple Python versions. Today, a new Python version is released each year, while Debian, Ubuntu LTS and RHEL releases are supported for multiple years (up to 10 years for Ubuntu LTS, and even longer for RHEL). Users want to get a new Python, and not only the pythonX.Y program: a full working ecosystem around it, with pip, numpy, PyTorch, etc. Currently, it’s just too expensive for a Linux vendor to maintain one binary package of numpy per Python version, so basically all Linux distributions target one main Python version and then stick to it. Obviously, as soon as pip and a C compiler are available, pip install can pull dependencies, but that’s outside the Linux DEB/RPM packages and not supported by the distribution.

So yeah, maybe there is a little performance overhead, but that’s not significant compared to the advantages.

Shipping wheel packages on PyPI is also complicated, you must build one binary per OS, per architecture and per Python version. Currently, numpy provides 6 packages per Python version:

  • Windows/x86-64 (64 bit)
  • Windows/x86 (32 bit)
  • Linux/x86-64
  • Linux/AArch64
  • macOS/x86-64
  • macOS/AArch64

numpy provides 20 files:

  • source distribution
  • 6 binary packages x 3 Python versions (18)
  • PyPy: Linux/x86-64 (only)

Shipping a single package for all Python versions doesn’t remove the need to build one package per OS and per architecture, but it makes the package available to old (where possible) and new Python versions alike. For example, if you use the stable ABI, you get Python 3.11 support without any effort.

The trade-off between performance and the stable ABI remains an open discussion :slight_smile: We can maybe fix a few performance issues.

4 Likes

First, I wrote METH_FASTCALL to optimize function calls in Python: avoid the cost of creating a temporary tuple to pass positional arguments (and creating a temporary dict to pass keyword arguments). Then PEP 590 added a clean public API on top of it, and it also supports calling methods. If you want this API to be part of the limited C API, you can start by opening an issue to request it.

As far as I know, METH_FASTCALL is specific to Python’s built-in function/method objects. In my case (the nanobind library), the binding library implements its own method object that implements the vector call protocol. This avoids an extra roundtrip through cfunction_vectorcall_FASTCALL_KEYWORDS and has the benefit that we can stash all metadata that is needed to dispatch the function on the C++ end. For example, a function can have multiple overloads accepting different types. To avoid needless pointer chasing, this information is all co-located with the callable Python object.

1 Like

Dear all,

thank you for this very enlightening discussion.

I agree with Victor that restricting to the limited API/ABI has tremendous benefits from a distribution point of view. I remain concerned about the performance impact but am happy to be proven wrong – in any case, this should be benchmarked, and the slowdown may not matter for many users given the benefits.

I was just going through all of nanobind to see what we would really need to be able to pull this off, and the following things would be needed that are currently not possible.

  • Ability to allocate types via PyType_FromModuleAndSpec() while specifying a custom metaclass. This would be needed to allocate co-located space for the binding tool and intercepting the construction of derived classes within Python.

  • Ability to specify tp_vectorcall_offset via PyType_FromModuleAndSpec when constructing the nanobind method object. (EDIT: I realized that this is actually possible via Py_tp_members)

  • Ability to perform vector calls from the extension module: PyObject_Vectorcall, PyObject_VectorcallMethod, PY_VECTORCALL_ARGUMENTS_OFFSET, PyVectorcall_NARGS

  • Ability to query type fields like tp_basicsize, tp_itemsize, tp_name, tp_doc, tp_setattro, tp_descr_set, tp_alloc, tp_new, tp_free. I understand that these should be accessible through PyType_GetSlot, but then noticed quite a bit of discussion about tp_name not being compatible with that interface and requiring special workarounds. So it’s not clear to me whether all of them are available.

  • In a few rare places, we need the ability to construct method objects and query them via PyMethod_New, PyMethod_Check, PyMethod_Function.

  • A public alternative to _PyType_Lookup.

Do you think that it would be possible? Objections? What would the process be to introduce such changes into a future version of Python?

Thanks,
Wenzel

2 Likes

PEP 590 (vectorcall) was written specifically for tools like pybind11. Please use it.
PEP 590 says that the API was provisional for 3.9; the API is fixed now.

If you are interested in performance, fast access to internal function pointers like tp_setattro, tp_descr_set, tp_alloc is likely to be a mistake.
As we improve performance, we are likely to avoid these function pointers in favour of using flags.
Ultimately, I expect the best route to high performance is to let the VM construct types and provide it with the custom C++ methods that you need, and leave allocation, de-allocation, etc. to the VM.

If you want to create a class using a custom metaclass, why not call the metaclass?

PyObject *args[3] = { name, bases, locals };
PyObject *my_class = PyObject_Vectorcall(metaclass, args, 3, NULL);
5 Likes

PEP 590 (vectorcall) was written specifically for tools like pybind11. Please use it.

It is already used in nanobind and may end up being added to pybind11 as well. However, in contrast to other call-related functions, it is not part of the limited API. I found it curious that the documentation even includes explicit remarks of the form “This function is not part of the limited API”, which I have not seen anywhere else. This makes me wonder whether it is explicitly considered out of scope for the limited API, which would be very unfortunate given that we use it for the most important operation of all: dispatching function calls between languages.

If you want to create a class using a custom metaclass, why not call the metaclass?

That’s an interesting idea, but is it low-level enough to accomplish all goals? For example, how would I set tp_basicsize of the newly created type object? What if instances of the type hold references requiring tp_traverse? It’s often necessary for bindings to override this kind of low-level functionality, which has no counterpart at the Python level.

1 Like