I am the author of tinyarray, an extension module that provides small arrays of numbers that behave like built-in tuples of numbers (immutable & hashable), but offer numerical operations à la NumPy. In addition, tinyarrays are significantly faster and more memory efficient than both tuples and NumPy arrays.
Tinyarrays, like tuples, are variable-length objects and contain a PyVarObject structure. This avoids a second memory allocation per object.
In current Python, PyVarObject is just PyObject with one additional field: ob_size. This field seems to be meant to hold the number of elements of the variable-size object, and that is how tuples use it, for example. However, other types, like PyLongObject, use it to store, among other things, the sign of the number. From the point of view of the interpreter it is just arbitrary data.
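For the curious, the layout can be observed from Python with ctypes. The field offsets below are an assumption about a default (non-free-threaded) CPython build, where `id()` returns the object's address; this is a peek for illustration, not a stable API:

```python
# Peeking at the PyVarObject header of a tuple with ctypes.
# Assumes a default (non-free-threaded) CPython build, where id() is
# the object's address and the header is {ob_refcnt, ob_type, ob_size}.
import ctypes

class PyVarObjectView(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),  # PyObject.ob_refcnt
        ("ob_type", ctypes.c_void_p),     # PyObject.ob_type
        ("ob_size", ctypes.c_ssize_t),    # the extra PyVarObject field
    ]

t = ("a", "b", "c")
view = PyVarObjectView.from_address(id(t))
print(view.ob_size)  # tuples store their element count here: 3
```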
Inspired by PyLongObject, tinyarray uses ob_size in similarly creative ways, and this has been working well for many years. Indeed, as far as I can tell, the CPython interpreter does not use ob_size by itself in any way; it seems to be just a field that a type may use however it likes. This seems to be confirmed by the (somewhat scarce) documentation.
Now for my question: I noticed a (laudable) ongoing effort to Make structures opaque in the Python C API. As far as ob_size is concerned the aim is for all access to this field to go through the functions/macros Py_SIZE and Py_SET_SIZE. Is my understanding correct that it is still OK to store arbitrary data there as long as it fits into a Py_ssize_t?
In other words, in the process of making objects opaque, why not eventually get rid of ob_size completely and make storing the number of elements of a variable-size type an implementation detail of that type, as it (informally?) seems to be the case anyway?
I would like to change the internal layout of tinyarray’s data and before I do that I would like to verify I understand how ob_size is supposed to be used.
We can’t prevent you from putting anything there, but if you’re getting creative, it would be best if you use your own field for this – there’s no advantage to letting others see the value if they can’t interpret it :)
AFAIK, Python assumes the field contains the size in some non-critical situations, like the default implementation of __sizeof__. (You do override that, right?)
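The default `__sizeof__` for variable-size objects is roughly tp_basicsize + Py_SIZE(self) * tp_itemsize, which only makes sense when ob_size really is the element count. A quick way to see this with tuples (which use ob_size that way):

```python
# The default __sizeof__ of a variable-size object is roughly
# tp_basicsize + Py_SIZE(self) * tp_itemsize, so each tuple element
# adds exactly one pointer-sized slot.
import ctypes
import sys

per_item = sys.getsizeof((0, 1)) - sys.getsizeof((0,))
print(per_item == ctypes.sizeof(ctypes.c_void_p))  # True
```

A type that repurposes ob_size would get a nonsensical (or negative) size out of this default, which is why overriding `__sizeof__` matters.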
That documentation is quite misleading, sorry. It’s for type objects, rather than instances. Docs for instances are missing entirely :(
I’m trying to become an expert on these matters, but I don’t think I can document ob_size well – though I doubt anyone can. This post is just my opinion, but if anyone thinks it’s incorrect I’d love to hear it. Eventually I will find the time and courage to document ob_size :)
This is fine. We’re considering doing just that for a future refactoring of the int type (internally PyLong). The use of ob_size to store the size (what __len__ returns) is just a convention. You may have to override a few behaviors whose default implementation uses ob_size; you can use the current int type as an example (since it already abuses that field).
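One observable consequence of int packing the sign into its header (historically the sign of ob_size) rather than a separate field, together with its overridden `__sizeof__`, is that a negative number costs no more memory than its positive counterpart:

```python
# int stores the sign in its header (historically the sign of ob_size)
# rather than in a separate field, so negating a number costs no memory...
import sys

assert sys.getsizeof(-5) == sys.getsizeof(5)
# ...and the size grows only with the number of digits:
print(sys.getsizeof(2**60) > sys.getsizeof(5))  # True
```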
Well, yes, but then the space taken by ob_size would be wasted, at the very least the sign bit. (And it’s not about a single bit: due to alignment, storing one additional bit takes up to 8 bytes.) Or are you suggesting implementing something equivalent to PyVarObject without incorporating that structure, and not calling PyObject_NewVar in the constructor?
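The alignment point can be demonstrated with a pair of hypothetical structs: appending a one-byte flag after a pointer-aligned field pads the struct out to the next pointer-size multiple, so the single bit ends up costing a whole word:

```python
# Alignment padding: adding a one-byte flag after a pointer-sized field
# grows the struct by a full pointer-sized slot, not by one byte.
import ctypes

class JustPtr(ctypes.Structure):
    _fields_ = [("ptr", ctypes.c_void_p)]

class PtrPlusFlag(ctypes.Structure):
    _fields_ = [("ptr", ctypes.c_void_p), ("flag", ctypes.c_bool)]

print(ctypes.sizeof(PtrPlusFlag) - ctypes.sizeof(JustPtr))  # 8 on 64-bit
```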
Sure, it has been working like this since 2012 or so. It’s just that now I’m looking for a way to cheaply store one more bit somewhere, and, seeing the effort to make structures opaque, I started asking myself whether I’m relying on too many tricks.
It sounds like that would be best, but the API is not quite there to make it easy. Let’s treat it as a CPython API design idea rather than a recommendation to you :)
I mentioned it in a recent related discussion.