Ideas for forward-compatible and fast extension libraries in Python 3.12

These can be set using PyMemberDef.

Those are coming in 3.11!

1 Like

It sounds like a copy is unavoidable in some cases. I do like the idea about tuples – that would at least give a copy-free option for one important case.

If the input type is not a tuple, it could be converted into one temporarily using PySequence_Tuple, and that temporary tuple would be returned as the owner of the PyObject** pointer. Performing that copy + mass incref within CPython should be more efficient than getting the PyObject * pointers out individually through the limited API.

1 Like

This is still making assumptions about memory layout: consider an implementation with tagged pointers and specialized small-int-only tuples which don’t store a PyObject* array at all.

My question might have been buried:

Yes, that sounds like a good idea. So now perhaps all we need is an API that takes a tuple and returns a pointer to its array of items, as part of the stable ABI. The user can call PySequence_Tuple prior to making that call if desired.
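For comparison, the non-limited API already has this shape: PySequence_Fast converts an arbitrary sequence into a list or tuple, and PySequence_Fast_ITEMS exposes the underlying PyObject* array:

PyObject *fast = PySequence_Fast(seq, "expected a sequence");
if (!fast) { .. error handling .. }

Py_ssize_t n = PySequence_Fast_GET_SIZE(fast);
PyObject **items = PySequence_Fast_ITEMS(fast);
for (Py_ssize_t i = 0; i < n; ++i) {
    .. something involving items[i] ..
}

Py_DECREF(fast);

The item-access macros are not available under the limited API, though, which is exactly the gap a new stable-ABI function would have to fill.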

As @encukou mentioned, a hypothetical Python implementation might not represent tuples internally as PyObject** (tagged pointers, etc.), and so a hypothetical function

PyObject **PyTuple_Data(PyObject *)

specific to tuples might not be a sufficiently future-proof solution.

Yes, that is not a bad solution, but I think we can do better. If the underlying implementation has tuples stored as contiguous PyObject* pointers, then this copy is superfluous, and it would be nice to avoid it. In a more exotic implementation, the function could still malloc a dedicated output buffer, mass-Py_INCREF pointers, etc.

So my suggestion would be a slightly reduced version of my previous proposal that would be compatible with both of these requirements:


/**
   PySequence_Items()

   Take a sequence 'seq' and return a pointer to a
   `PyObject *` array representing its contents. The returned
   object array is immutable: it may neither be changed by the
   caller, nor will it reflect future changes performed to 'seq'.

   The 'owner_out' parameter is used to return a Python object
   owning the returned memory region. It may or may not equal 'seq',
   and no assumptions should be made about its type. Its only
   purpose is lifetime management -- in particular, the caller
   should `Py_DECREF(..)` this object once it no longer needs
   access to the object array returned by this function.

   The 'size_out' parameter returns the length of the returned
   array.
 */


PyObject *const *PySequence_Items(PyObject *seq,
                                  PyObject **owner_out,
                                  Py_ssize_t *size_out) {
    if (PyTuple_Check(seq)) {
        *size_out = PyTuple_GET_SIZE(seq);
        *owner_out = Py_NewRef(seq);
        return ((PyTupleObject *) seq)->ob_item;
    } else {
        PyObject *temp = PySequence_Tuple(seq);
        if (!temp) {
            *owner_out = NULL;
            *size_out = -1;
            return NULL;
        }

        *owner_out = temp;
        *size_out = PyTuple_GET_SIZE(temp);
        return ((PyTupleObject *) temp)->ob_item;
    }
}

A more exotic Python implementation simply would not have the special case for tuples. It would always allocate new memory for a pointer array, mass-incref, and then return a Python capsule or similar via owner_out, which performs a mass-decref and releases the memory upon expiration of the capsule.
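A minimal sketch of that fallback path, assuming a hypothetical owner struct and capsule name:

/* All names here are illustrative, not an actual CPython API. */
typedef struct {
    PyObject **items;
    Py_ssize_t size;
} ItemsOwner;

static void
items_owner_free(PyObject *capsule)
{
    ItemsOwner *o = (ItemsOwner *) PyCapsule_GetPointer(capsule, "seq_items");
    for (Py_ssize_t i = 0; i < o->size; ++i)
        Py_DECREF(o->items[i]);   /* the mass-decref */
    PyMem_Free(o->items);
    PyMem_Free(o);
}

/* ... inside PySequence_Items: allocate an ItemsOwner, copy and
   Py_INCREF each element into o->items, then hand it off: */
/* *owner_out = PyCapsule_New(o, "seq_items", items_owner_free); */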

The caller would use this function as follows:

PyObject *seq = ...;

PyObject *owner;
Py_ssize_t size;

PyObject *const *items = PySequence_Items(seq, &owner, &size);
if (!items) { .. error handling .. }

for (Py_ssize_t i = 0; i < size; ++i) {
    .. something involving items[i] ..
}

Py_DECREF(owner);

What do you think?

One possible downside I see is that copying a Python list to a std::vector<PyObject*> would involve two copies – one for the temporary tuple, and another to the vector. (For those not familiar with std::vector: it’s a resizable array much like a CPython list).
I don’t know if that’s a use case worth optimizing for.
OTOH, I also don’t know if zero-copy for tuples is worth it, since vectorcall might replace most uses of large tuples.

In reply to the numbered points:

1,2 as Petr said.

  3. Just use the vectorcall protocol, already :)
    All the macros and function pointer typedefs should be part of the limited API; they are specified in PEP 590 as final. (A sketch of the calling side follows after this list.)
    Feel free to copy and paste them if necessary, they aren't going to change.

  4. Not going to happen, sorry. Apart from not being safe, it is likely to be an obstacle to future improvements.

  5. See 3. The protocol works both ways. If we add the functions to the limited API (which I have no problem with), then you'll only be able to use them in 3.12. The underlying protocol is good for 3.8 onward.

  6. I don't think that we should force the strings to be interned, as that is a breaking change.
    I’d be happy documenting that performance is likely to be better if they are interned, though.
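For reference, the calling side of the protocol looks like this (these are the names available since 3.9; PEP 590 originally specified PyObject_Vectorcall with a leading underscore):

/* The function pointer type specified by PEP 590: */
typedef PyObject *(*vectorcallfunc)(PyObject *callable, PyObject *const *args,
                                    size_t nargsf, PyObject *kwnames);

/* Calling with two positional arguments. Slot 0 is left free so the
   callee may temporarily use it (PY_VECTORCALL_ARGUMENTS_OFFSET): */
PyObject *args[3] = { NULL, x, y };
PyObject *result = PyObject_Vectorcall(
    callable, args + 1, 2 | PY_VECTORCALL_ARGUMENTS_OFFSET, NULL);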

PEP 590 says they're not part of the stable ABI (and so, the limited API).
If we add them, we need to make sure calls stay consistent when __call__ is overridden on a class from Python code. The discussion is in gh-93274.

Could you expand on this? I think the point that fetching each element through a separate API call is going to cause performance issues may be valid. Is this comment about that idea in general, or about the details?

I would like to better understand your concerns, because HPy is considering providing similar APIs.

Note that the API is not meant to force the implementation to use some specific storage strategy, the array of PyObject* is just a way to exchange the data between the caller and the Python engine.

I think that providing buffered access would be better, to avoid materialization of some optimized compact representation of an otherwise large list/tuple. The idea is to provide something like PySequence_Items(PyObject *seq, PyObject **buffer, Py_ssize_t start, Py_ssize_t size), and the user would iterate the sequence in two loops: the outer loop fetching a buffer of some smallish size like 256 elements, the inner loop iterating over that buffer. That way, the inner loop would be free of API calls to fetch elements, i.e., it can be better optimized by the C compiler.

This is the API that R provides to iterate over its vectors. The issue, however, is what useful thing one can do with an opaque PyObject* other than make further API calls. @wjakob do you have any use-case in mind? In R the vectors are typed, so you're getting out, e.g., a buffer of plain C integers.
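To illustrate, the two-loop pattern with that hypothetical chunked signature might look as follows (assuming the call fills 'buffer' with new references and returns a negative value on error):

#define CHUNK 256

PyObject *buffer[CHUNK];   /* stack-based, no dynamic allocation */
Py_ssize_t n = PySequence_Size(seq);
if (n < 0) { .. error handling .. }

for (Py_ssize_t start = 0; start < n; start += CHUNK) {
    Py_ssize_t count = n - start < CHUNK ? n - start : CHUNK;
    if (PySequence_Items(seq, buffer, start, count) < 0)
        { .. error handling .. }

    /* inner loop: no API calls needed to fetch elements */
    for (Py_ssize_t i = 0; i < count; ++i) {
        .. something involving buffer[i] ..
    }

    for (Py_ssize_t i = 0; i < count; ++i)
        Py_DECREF(buffer[i]);
}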

  1. PyType_FromMetaclass is in 3.12
  2. Extending opaque types is next up on my TODO list
  3. Receiving vectorcall is now in 3.12
  4. sequences as arrays is currently low on my TODO list, but I’m still leaning toward adding API where you pass in a pre-allocated buffer, like
    • PySequence_IntoArray(PyObject* seq, PyObject** array, Py_ssize_t size)
    • Py_DecrefArray(PyObject** array, Py_ssize_t size)
  5. Calling vectorcall – I’ll get to it eventually, but if anyone wants to push it further go ahead.
  6. interned strings are not required, but using them is likely to speed things up in- and outside CPython core. Adding docs & helpers would be nice (maybe expose an eq function optimized for interned strings? see the sketch after this list)
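On point 6, the kind of eq helper hinted at could be as simple as this sketch (the name is hypothetical; the pointer check is the interning win, the fallback keeps it correct for non-interned strings):

/* Hypothetical equality helper with an interned fast path. */
static int
unicode_eq_maybe_interned(PyObject *a, PyObject *b)
{
    if (a == b)
        return 1;   /* interned strings are unique; pointer equality suffices */
    return PyObject_RichCompareBool(a, b, Py_EQ);   /* 1, 0, or -1 on error */
}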

Python sequences are un-specialized PyObject* arrays. I think R’s API would be more suitable for NumPy arrays, or Python array if we wanted it to be more heavy-duty (which we don’t, IMO), or other specialized types.
IOW, specialized list-of-int sounds like a reasonable optimization, but exposing it in API doesn’t sound worthwhile.

If avoiding materialization is a concern, you ideally want an iterator. Or, for N-D arrays, NumPy again.

2 Likes

Objective-C/Cocoa has an NSFastEnumeration protocol that provides a middle ground between the current protocol and your proposal: the caller provides a buffer and an iteration state and calls that API in a loop to avoid having to copy the entire contents into a temporary buffer. Basically PyDict_Next, but with more than one item at a time.
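For reference, the existing PyDict_Next pattern being generalized here:

PyObject *key, *value;
Py_ssize_t pos = 0;   /* caller-provided iteration state */
while (PyDict_Next(dict, &pos, &key, &value)) {
    .. use key and value (borrowed references) ..
}

An NSFastEnumeration-style variant would fill a caller-provided buffer with a batch of items per call instead of a single pair.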

Out of curiosity, what happens there if the sequence changes while you’re iterating?

But the bigger question is why. What's the use case for an external API that avoids full materialization of a generic sequence?
For the Limited API specifically, it doesn’t sound like a good fit.

To be honest I don’t know what happens when the sequence changes during iteration. The Cocoa protocol does not have a way to signal errors, and in Cocoa exceptions are generally not used other than to signal programming errors.

A use case for avoiding full materialisation is that materialising might use a lot of memory, e.g. when iterating over all elements of a sequence with a couple of million entries. In the grand scheme of things that's still not a lot of memory, but a smaller fixed-size buffer can live on the stack and hence avoids dynamic allocations.

Windows also uses this pattern fairly frequently (example), mostly because many implementations might have long roundtrip times (e.g. there’s a “find files” one somewhere that might involve network calls), so it lets the caller decide how much latency they care about.

A “PyIter_GetNextFew” API would probably fit the bill, and should certainly be useful. If we wanted to specialise by taking a (single) ParseArgs-style format character to extract raw values as well, that could be nicely efficient for a lot of common cases.
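A minimal sketch of such an API, written here on top of the existing PyIter_Next (the name, signature, and semantics are all assumptions):

/* Fill 'buffer' with up to 'n' new references taken from iterator 'it'.
   Returns the number of items written, or -1 with an exception set. */
Py_ssize_t
PyIter_GetNextFew(PyObject *it, PyObject **buffer, Py_ssize_t n)
{
    Py_ssize_t i = 0;
    while (i < n) {
        PyObject *item = PyIter_Next(it);
        if (!item) {
            if (PyErr_Occurred()) {   /* a real error, not exhaustion */
                while (i > 0)
                    Py_DECREF(buffer[--i]);
                return -1;
            }
            break;                    /* iterator exhausted */
        }
        buffer[i++] = item;
    }
    return i;
}

The format-character specialisation would then decode each item into a raw C value instead of storing the PyObject* itself.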

1 Like

No repeated dynamic allocations. If size is not known in advance, it’s not a sequence. Maybe iterators need a PyIter_GetNextFew, but that’s quite different from handling lists and tuples.

The usage I have in mind here is:

  • use PySequence_Size()
  • allocate the buffer with that size
  • pass the allocation size to PySequence_IntoArray, as a check against modification or a Python subclass that overrides __len__ to not match the C data. (In both of those cases you get an exception; see the sketch below.)
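Concretely, that pattern might look like this (PySequence_IntoArray and Py_DecrefArray are the proposed, not yet existing, functions; the exact signatures are assumed):

Py_ssize_t size = PySequence_Size(seq);
if (size < 0) { .. error handling .. }

PyObject **array = PyMem_Malloc(size * sizeof(PyObject *));
if (!array) { .. memory error .. }

/* raises if 'seq' was modified or a __len__ override lied about the C data */
if (PySequence_IntoArray(seq, array, size) < 0) {
    PyMem_Free(array);
    .. error handling ..
}

for (Py_ssize_t i = 0; i < size; ++i) {
    .. something involving array[i] ..
}

Py_DecrefArray(array, size);
PyMem_Free(array);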

Haven’t we defined that as undefined behaviour? So basically “don’t do it” but also “you’ll find out when it happens”.

The “get many” style APIs usually return the number of actual items retrieved, so if the end of the sequence is reached early then you just don’t get a full result (like a stream.read()), and if the sequence changes then you get weirdness dependent on how we’re internally iterating.

But we have to leave it undefined, because there’s no way to batch iterate without changing the behaviour of this situation. If we clarify the exact thing that’s undefined though, users can work with that.

Hm, it’s becoming clear that we’re talking about different things. IMO a PyIter_GetNextFew-style API isn’t a good fit for the Limited API or lists/arguments tuples, but might be helpful for streams/iterators.
Could you start a new topic for it? (It’s possible to ask admins to split this one but that would be confusing in this case.)

I don't see how it isn't, but maybe you're assuming that it would look different from your IntoArray suggestion, when both Ronald and I are saying that patterns for this already exist and are basically the same as that proposal.

The only difference is that they contain a bit of state about the start point (which could equally become a parameter, with some loss of generality), and don’t require you to convert the entire sequence in one go.

2 Likes

Why isn't a PyIter_GetNextFew-style API suited for the limited API? It can have well-defined semantics that are suitable for other implementations as well.

The only clear advantage I see right now for PySequence_IntoArray is that it can be implemented as an atomic operation for builtin sequence types, while a PyIter_GetNextFew-style API could see updates to the sequence during iteration.

There are other differences as well of course, but for most of them there IMHO is no clear winner either way (“it depends rulez!”).

3 Likes

Out of curiosity, could you elaborate on that? I don't immediately see why such an API isn't a good fit.

1 Like