Equivalent of _PyObject_GC_Malloc in Python 3.11

Hello,

In the process of updating our simulation software suite (1M+ lines of Python, 100k+ lines of C and C++ code in Python extensions) to Python 3.11, we ran into an issue caused by _PyObject_GC_Malloc being removed in Python 3.11.

We have a custom metaclass and base class implemented in C++ that add functionality to our Python objects. One of the most important features this brings is fixed “slots” for attributes on the object, similar to what __slots__ does in pure Python, except that our slots have additional functionality. For example, writing to them emits events on the object that other objects can observe to get notified when an attribute changes. Slots on an instance can also have their values “delegated” to some other object in the data model. Another crucial difference is that slots in our classes can be defined even after the class is created, as long as no instances of it, and no classes inheriting from it, have ever been created.
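For readers unfamiliar with the pattern, a rough pure-Python analogue might look like the sketch below (hypothetical names; the real implementation is a C++ descriptor storing values inline in the object, not in a dict):

```python
class Slot:
    """Illustrative stand-in for the C++ slot descriptor (hypothetical)."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._slot_values.get(self.name)

    def __set__(self, obj, value):
        # Writing to a slot stores the value and notifies observers.
        obj._slot_values[self.name] = value
        for observer in obj._observers:
            observer(obj, self.name, value)


class Base:
    def __init__(self):
        self._slot_values = {}
        self._observers = []


class Point(Base):
    x = Slot()
    y = Slot()
```

In the real implementation the values live in fixed offsets at the end of the C object rather than in a per-instance dict; this sketch only shows the observable behavior (attribute access plus change events).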

To ensure the best performance, the data for these “slots” is stored at the end of the Python object, after tp_basicsize bytes. The size of this extra data corresponds to the total number of slots defined on that class and its base classes. Previously we used _PyObject_GC_Malloc to allocate the necessary memory for our objects: tp_basicsize plus the additional memory for our data. As of Python 3.11 this function has been removed, and there is no non-hackish way for us to allocate the required memory while still having GC support.

Unfortunately we cannot use VAR objects, because our objects are ordinary Python objects: for example, they can have __dict__ and __weakref__, which VAR objects do not allow. Calculating tp_basicsize at class-creation time also didn’t work, because the tp_basicsize of a base class already takes into account the slots defined on that base, which means that each subclass uses more memory than its base class even if no additional slots are defined on the subclass.
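For comparison, pure-Python __slots__ already folds slot storage into __basicsize__ without growing subclasses that add nothing; a small demonstration (pointer size taken from struct):

```python
import struct

P = struct.calcsize("P")  # size of a pointer on this platform

class A:
    __slots__ = ("x", "y")

class B(A):
    __slots__ = ()  # subclass defines no new slots

# Each slot reserves one pointer-sized field inside the instance.
assert A.__basicsize__ == object.__basicsize__ + 2 * P
# A subclass that adds no slots keeps the same instance size.
assert B.__basicsize__ == A.__basicsize__
```

This is the behavior the C-level scheme described above cannot replicate when slot storage is counted into tp_basicsize directly.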

Would it be acceptable to add a public function with the same functionality that _PyObject_GC_Malloc previously had? Or at least a version of PyObject_GC_New that also accepts additional memory to be allocated on top of tp_basicsize? If so, I can submit a pull request with the new function and all the necessary changes.

Not to Python 3.11, and time for getting features into 3.12 is very tight.

Since _PyObject_GC_Malloc wasn’t documented, could you explain what it did?

By VAR you mean PyVarObject, right? Those should support __dict__ and __weakref__ just fine.
Note that nowadays you can use Py_TPFLAGS_MANAGED_DICT and Py_TPFLAGS_MANAGED_WEAKREF to get __dict__ and __weakref__, letting the interpreter allocate space for them.

Could you take that into account when calculating __basicsize__, so you only add space for new slots?

I assumed as much for 3.11, this is more of a request for an enhancement in the future. If it can be added to 3.12, great, but if not, we’ll just have to live with our hackish solution until then.

Unfortunately we can’t just copy the previous implementation into our code: it called the static function gc_alloc in gcmodule.c.

Yes, I meant PyVarObject, sorry if that wasn’t clear. They do support __dict__ and __weakref__, but it’s not possible to add other attributes via __slots__, or to specify __slots__ = ('__dict__',) to get __dict__ on the class. More importantly, these are not standard PyVarObjects: every instance has exactly the same size (nitems).
I guess in 3.12 we can use Py_TPFLAGS_MANAGED_DICT and Py_TPFLAGS_MANAGED_WEAKREF and remove __dict__ and __weakref__ from __slots__ in the class dict before creating the type, with the limitation that nothing else can be present in __slots__, otherwise we get a “nonempty __slots__ not supported for subtype …” error. But that shouldn’t be very problematic; in that case we could turn any __slots__ defined there into our own slot definitions.
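That limitation is easy to demonstrate from pure Python, since int is a var-sized type (tp_itemsize != 0):

```python
# A subclass of a var-sized type may not declare real __slots__;
# CPython raises TypeError ("nonempty __slots__ not supported for
# subtype of 'int'").
try:
    class BadInt(int):
        __slots__ = ("a",)
    raised = None
except TypeError as e:
    raised = e

assert isinstance(raised, TypeError)
```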
I see that Py_TPFLAGS_MANAGED_DICT is also available in 3.11, but it’s not documented, so I don’t know its status: can it safely be used? I need it to make sure that subclasses only grow a __dict__ when it is explicitly requested in __slots__.

In theory I should be able to do that, but I would have to check exactly what typeobject.c does when calculating basicsize for subtypes. I already tried this, but it’s very tricky: I either end up allocating too much memory for each subclass instance, or I get memory corruption. I guess I would need to track the tp_basicsize value both before and after slots are taken into account, and then calculate the minimal requirements to make sure everything fits before the slots.

I tried the PyVarObject approach, and it almost looked like it would work, but I ran into issues when trying to add __dict__ to an inherited class created from Python code. It cannot be added using __slots__ = ('__dict__',); I need to set tp_dictoffset manually after calling PyType_Type.tp_new. Unfortunately I cannot use Py_TPFLAGS_MANAGED_DICT, because at the time of type creation in tp_new I cannot set any flags. And after PyType_Type.tp_new creates the type, the flag can be set, but there is no way for me to also set tp_dictoffset to a correct value, because I don’t know where the VM would have placed the dict when calculating offsets during type creation.

After looking into the implementation details, it doesn’t really seem like this approach is supported. There are many places that check whether tp_itemsize is set and either disallow something (e.g. having __slots__ set to anything other than an empty tuple) or behave completely differently for PyVarObject types.

Our previous approach with _PyObject_GC_Malloc was much simpler, we didn’t have to mess around with any of these low-level VM implementation details. We simply got our extra piece of memory after the object and everything was cleanly separated. It would be great if something like this could be added to the C API:

PyObject *
PyObject_GC_NewWithData(PyTypeObject *tp, size_t data_size)
{
    size_t presize = _PyType_PreHeaderSize(tp);
    PyObject *op = gc_alloc(_PyObject_SIZE(tp) + data_size, presize);
    if (op == NULL) {
        return NULL;
    }
    _PyObject_Init(op, tp);
    return op;
}

Maybe it can. But the cost is not in the lines of code.

Making this public API would mean we need to support objects with data that’s not tracked in tp_basicsize + n * tp_itemsize. I’m not sure about the implications. It would affect future plans, and it would need wider discussion (which you’ve started, but we’d need to discuss this concrete proposal rather than the options for your use case).
It’s very different from an internal function which – as you found out – we can remove when its simplicity depends on something that needs to change.

As a maintainer of the API, I’d like to get it to the point where we can rely on tp_basicsize + n * tp_itemsize. Currently, the n is not known reliably, but there are (vague) plans to fix that. I worry that with PyObject_GC_NewWithData(PyTypeObject *tp, size_t data_size), we’d eventually need to store data_size on the instance, which would be rather unfortunate.
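The tp_basicsize + n * tp_itemsize invariant can be checked from Python for a built-in var-sized type such as tuple. The default __sizeof__ is computed from exactly these fields (sys.getsizeof would add the GC header on top):

```python
# For a var-sized type, instance size = tp_basicsize + n * tp_itemsize.
t = (1, 2, 3)
expected = tuple.__basicsize__ + len(t) * tuple.__itemsize__
assert t.__sizeof__() == expected
```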


At this point I think I know too little about your specific use case to be helpful. Is the following close to the memory layout you want?

A base class:

                  │
                  ▼
            ┌─────┬────────────────┬─────────────────┬────────────────────┐
            │ GC* │ PyObject_HEAD  │ Base class data │ Base class "slots" │
            └─────┴────────────────┴─────────────────┴────────────────────┘
             * managed internally by the interpreter

and a subclass derived from it, with __dict__:


                  │
                  ▼
┌───────────┬─────┬───────────────┬──────────────────┬───────────────┬────────────────────┬──────────────────┐
│ __dict__* │ GC* │ PyObject_HEAD │  Base class data │ Subclass data │ Base class "slots" │ Subclass "slots" │
└───────────┴─────┴───────────────┴──────────────────┴───────────────┴────────────────────┴──────────────────┘

Are these Python __slots__, or your own ones? (Do you support Python __slots__?)

Well, if the nitems can change after the class is created, then it sounds like PyVarObject is the right tool. Even if you disallow resizes once an instance/subclass is made.

Right, that sounds like a dead end. Setting a flag after a class is created is asking for trouble :‍(

But you don’t need to set tp_dictoffset, which is meaningless when the dict is managed. A dict might not even exist on each instance, as it’s created when needed.

I believe that is something we could fix in 3.13, for “VAR” types where the interpreter knows the memory layout. (I’m adding Py_TPFLAGS_ITEMS_AT_END now, more “known layouts” might come.)

I understand that completely, the function proposal was more of an attempt to clarify what kind of a memory layout we need.

The unfortunate thing is that we actually do know the value of n, it’s just that PyVarObject comes with other limitations, but more on that below.

Yes, something like that, with the difference that there is only one piece of memory for slot data storage: subclasses inherit the base class slots and add their own. So we always have something like the picture below; there is never any data between base class “slots” and subclass “slots”.

            ┌─────┬────────────────┬─────────────────┬─────────────────────────┬──────────────────┐
            │ GC* │ PyObject_HEAD  │ Base class data │ (any data added by VM)* │ subclass "slots" │
            └─────┴────────────────┴─────────────────┴─────────────────────────┴──────────────────┘

These are Python __slots__. We do support the Python ones, everything in standard Python is supported, we just append our piece of memory at the end. We can live with the limitation of only allowing __dict__ and __weakref__ in __slots__ when defining subclasses. But at the moment that cannot be used in combination with PyVarObject.

That actually sounds like it could work for our use case, at least the part about extending PyHeapTypeObject sounds similar to what we need. If PyVarObject allows __slots__ = ('__dict__', '__weakref__') (or a subset of that, of course) and we can specify that variable data is at the end of the instance, I guess it could work.

How would this work in multiple inheritance cases? For example, in the code below using our base class and slot() object to define “our slots” on the class:

class A(Base):
    x = slot()
    y = slot()

class B(A):
    z = slot()

class C(B):
    w = slot()

In an instance of class C, would we be able to set Py_TPFLAGS_ITEMS_AT_END and always get nitems = num_slots(A) + num_slots(B) + num_slots(C) items at the end of the instance? I assume it’s enough to set the flag on the base class and all subclasses would inherit it? With that and being able to control whether our PyVarObject has __dict__ and __weakref__ or not, I guess everything should be covered.


Great! That’s compatible with the Py_TPFLAGS_ITEMS_AT_END layout, and API-compatible with Mark’s more comprehensive idea.
Currently, and in current proposals, all data added by the VM goes before the object (like the GC*), but it could also be added at the end or, as in this diagram, in the middle. You use PyObject_GetItemData(obj) to get the “slots” memory. (The function is being added now as part of PEP 697; if the initial version is too slow, we can speed it up by storing the offset in the class object.)

OK! Could you open an issue and ping me (@encukou) on it? If you know of other PyVarObject limitations, add them too.

Py_TPFLAGS_ITEMS_AT_END and tp_itemsize would be set on Base, and inherited by all the other classes.


And since there’s a plan for a long-term solution:

For Python 3.12, we could add an unstable function (named PyUnstable_Object_GC_New, subject to removal without deprecation warnings in 3.x.0 releases). The PR would need tests and docs, and docs would need to explicitly say the concept of reserving extra data after an instance is unstable, rather than a particular API for it (people will need to switch to PyVarObject when its issues are resolved).
Not sure if it’d be worth your time, since you said you already have a version-specific hack, but it’s a possibility.

That sounds like exactly what we need. As long as we get our chunk of memory we don’t really need to know where VM stores its internal data. I added the middle part just to make it clear that our data currently gets added to the end.

As for the speed, I’ll have to run some benchmarks, but we can always cache it ourselves if it isn’t cached by the VM in the class object.

No problem. Should I create a minimal example with some PyVarObject base type or is it enough to show that inheriting from e.g. int doesn’t allow adding __dict__ and/or __weakref__ in __slots__?

That sounds great. It’s worth the time, I would like to get rid of the hack as soon as possible :slight_smile: Would the function have the same prototype as my proposal at the beginning, with an additional size_t parameter for the extra data size? Or did you have something else in mind?

I’ll probably need a few days to get it finished, is there a deadline after which the PR wouldn’t get accepted for 3.12?

I’m on the fence on this particular memory/speed trade-off, so if you need to cache it it’s probably best to cache it for everyone.

It’s enough to explain in prose, but including something that can be reused as a test would save me work.

I guess something like

PyObject *
PyUnstable_Object_GC_NewWithExtraData(PyTypeObject *tp, size_t extra_size)

Time is pretty tight. The deadline is 2023-05-08, including review. I imagine @markshannon will have things to say.

So you want to create a class whose objects have size core_size + slots * size_of_slots, and all objects of that class have the same number of slots, correct?

Why must you set tp_basicsize to core_size?
Why not set tp_basicsize to core_size + slots * size_of_slots?

You need to know core_size in both cases.

Correct.

The problem is that we don’t have a single class, but a base class that allows using multiple inheritance of classes with different slots.

I tried updating tp_basicsize every time a slot is defined on the class, which in effect gives us tp_basicsize = core_size + n_slots * slot_size. But that quickly fails, with a “multiple bases have instance lay-out conflict” error, in the following case when trying to create class C.

class A(Base):
    x = slot()

class B(Base):
    a = slot()
    b = slot()

class C(A, B):
    pass

Using Python __slots__, this is of course not allowed. But in our slot implementation, this actually works fine and is well defined. Creating the class C in effect takes the slot definitions from base classes and defines the same ones on C as well. So C ends up with 3 slots, 1 from A and 2 from B.
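For reference, the same error can be reproduced with ordinary Python __slots__, using plain object in place of our Base and slot():

```python
class A:
    __slots__ = ("x",)

class B:
    __slots__ = ("a", "b")

try:
    # Both bases extend object's instance layout differently,
    # so the interpreter refuses to combine them.
    class C(A, B):
        pass
    err = None
except TypeError as e:
    err = e

assert "lay-out conflict" in str(err)
```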

Having tp_basicsize also account for slot storage could perhaps be made to work by reducing it on the base classes before calling PyType_Type.tp_new, to get around the base-class compatibility checks, and restoring it on the base classes after the type is created. But that seems like a dangerous thing to do: I would basically have to replicate the algorithm type uses when collecting candidate base classes, and only reduce the sizes on the same base classes it takes into account when checking for compatibility. Even then, I’m not sure whether simply reducing tp_basicsize by the slot storage size would work in all cases, especially with multiple inheritance of different base classes that may share a base class somewhere in the MRO.