Deprecating the direct use of str internals, e.g. PyASCIIObject, PyCompactUnicodeObject, PyUnicodeObject structs

encukou · February 28, 2025, 5:31pm

Hello,
In the C API, I’d like to deprecate direct use of str internals – specifically:

PyASCIIObject, PyCompactUnicodeObject, PyUnicodeObject structs
the PyUnicode_IS_COMPACT macro

Allowing direct access to the structs is starting to block development of new features/optimizations:

The free-threaded build needs to change the memory layout to allow atomic access to the interned field (which can be done in an API-compatible way), and it needs to ensure that the access is atomic (which a struct field can’t do)
There’s been some talk of string objects where UTF-8 would be the primary representation (and PEP 393 KIND/DATA filled on demand); such changes can’t really be done with the current structs.
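For readers unfamiliar with PEP 393: the "kind" is the number of bytes per code point (1, 2 or 4), chosen from the largest code point in the string. The sketch below is a toy helper with a hypothetical name, not a CPython function. It only shows the sort of computation a UTF-8-first string would have to defer until KIND/DATA are actually requested.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy helper (hypothetical name): return the PEP 393 "kind", i.e. the
   number of bytes per code point needed to store `len` UCS4 code points. */
static int
toy_kind_for(const uint32_t *ucs4, size_t len)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < len; i++) {
        if (ucs4[i] > maxchar) {
            maxchar = ucs4[i];
        }
    }
    if (maxchar < 0x100) {
        return 1;   /* Latin-1 range: one byte per code point */
    }
    if (maxchar < 0x10000) {
        return 2;   /* Basic Multilingual Plane: two bytes per code point */
    }
    return 4;       /* anything above the BMP: four bytes per code point */
}
```

A struct that exposes the kind and data pointer as plain fields forces this result to exist at all times, which is what makes an on-demand representation hard to retrofit.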

Does that sound reasonable?
Is there any use case that would need a new function?

kumaraditya303 · February 28, 2025, 6:16pm

I think you meant bitfield, as struct fields can be accessed atomically and those accesses can be wrapped in function calls.
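To illustrate the distinction, here is a minimal C11 sketch. It is not CPython's actual layout, and the struct and function names are hypothetical. A bitfield member has no address, so it cannot be passed to the atomic load/store functions, whereas a flag stored in its own byte can be read atomically behind an accessor.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a string header; not CPython's real struct. */
typedef struct {
    /* A bitfield: &s->kind is not valid C, so atomic_load() cannot be
       applied to a flag stored this way. */
    unsigned int kind : 3;
    /* The same sort of flag stored in a whole byte can be made atomic. */
    _Atomic uint8_t interned;
} toy_str_header;

/* Accessor that hides the field and guarantees an atomic read. */
static inline unsigned int
toy_str_check_interned(toy_str_header *s)
{
    return atomic_load_explicit(&s->interned, memory_order_relaxed);
}

/* Matching atomic write. */
static inline void
toy_str_set_interned(toy_str_header *s, uint8_t state)
{
    atomic_store_explicit(&s->interned, state, memory_order_relaxed);
}
```

Code that reads the field directly would bypass whatever memory ordering the accessor chooses, which is the argument for routing access through a function.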

ngoldbaum · February 28, 2025, 6:22pm

NumPy uses PyUnicodeObject to define the scalar type for the np.str_ dtype, in C:

>>> isinstance(np.str_(1), str)
True

See numpy/numpy/_core/include/numpy/arrayscalars.h at da268d45aab791023c8d953db6f4597019f770cb · numpy/numpy · GitHub

It’s already excluded from the public C API for limited API builds, but NumPy uses PyUnicodeScalarObject internally. I don’t know offhand if we can avoid using PyUnicodeObject here, suggestions are very welcome.

It looks like we still have an old vendored version of PyUnicode_FromUCS4 in a file named ucsnarrow.c (that tells you how old this is) that explicitly returns PyUnicodeObject, but I think we can replace that with PyUnicode_FromKindAndData these days.

ngoldbaum · February 28, 2025, 6:46pm

See: MAINT: remove legacy ucsnarrow module by ngoldbaum · Pull Request #28404 · numpy/numpy · GitHub

ronaldoussoren · February 28, 2025, 7:12pm

PyObjC uses PyUnicodeObject to create a subclass in C with an additional C field in the subclass. That’s not really possible without using the PyUnicodeObject and setting fields in it to initialize a value.

BTW. I don’t mind adjusting to changes to the internals of strings.

da-woods · March 1, 2025, 8:47am

Cython uses a few of these, but has fallback #ifdefs that avoid them in all cases (so it wouldn’t break anything in a way that’s hard to fix):

PyCompactUnicodeObject for fast access to a length field (but only in versions <3.12 anyway)
PyASCIIObject to look up the ->hash attribute either for interned names (where we know it’ll be set) or as an optimization.

The latter is something that might be nice to keep in some form - just fast access to the cached hash without forcing it to be calculated if not yet cached.

encukou · March 3, 2025, 2:58pm

The NumPy and ObjC cases are the same: adding an extra field in a subclass.
Looks like the first breaking change we should do is to add a supported way of doing that.

ronaldoussoren wrote: “I don’t mind adjusting to changes to the internals of strings.”

Looks like I should add aliases like PyUnstable_ASCIIObject, so you can avoid build warnings. That would be a formal promise that we won’t change the API in a bugfix/security release, but we can change it in 3.x.0 and y’all should be prepared to adapt.

(My current draft implementation uses fully-private aliases like _PyASCIIObject, to allow the macros to reach into the structs; I’ll just rename those and document them.)

da-woods wrote: “fast access to a length field”

PyUnicode_GET_LENGTH() should have you covered. (It’ll keep working. There’s no promise that it’ll remain the fastest way to get the length, but perf issues should be easier to adapt to.)

da-woods wrote: “PyASCIIObject to look up the ->hash attribute either for interned names (where we know it’ll be set) or as an optimization.”

So, do you want one or both of these?

PyUnicode_GET_HASH with an inlined fast path, only doing a function call when it sees -1?
PyUnstable_Unicode_GET_CACHED_HASH? (The unstable behaviour being not the function itself, but relying on which operations set the hash.)
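The difference between the two options can be sketched with a toy model. Everything below is hypothetical: the struct is not CPython's, and the hash function (FNV-1a) is only a stand-in. The first accessor has an inlined fast path and calls out only when it sees -1. The second returns whatever is cached and never computes anything.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy string object; -1 in `hash` means "not computed yet". */
typedef struct {
    intptr_t hash;
    size_t length;
    const char *data;
} toy_str;

/* Slow path: compute the hash and cache it. */
static intptr_t
toy_str_compute_hash(toy_str *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < s->length; i++) {
        h ^= (unsigned char)s->data[i];
        h *= 1099511628211ULL;
    }
    intptr_t result = (intptr_t)h;
    if (result == -1) {
        result = -2;    /* -1 is reserved as the "not cached" marker */
    }
    s->hash = result;
    return result;
}

/* Option 1: inlined fast path, function call only when nothing is cached. */
static inline intptr_t
toy_str_get_hash(toy_str *s)
{
    intptr_t h = s->hash;
    return (h != -1) ? h : toy_str_compute_hash(s);
}

/* Option 2: return the cached value (or -1) without ever computing it. */
static inline intptr_t
toy_str_get_cached_hash(const toy_str *s)
{
    return s->hash;
}
```

With option 2 the caller must handle -1 itself, and the result depends on which earlier operations happened to fill the cache.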

da-woods · March 5, 2025, 8:32am

This one matches what we currently do.

malemburg · March 20, 2025, 6:14pm

How would people create str subtypes in C without access to PyUnicodeObject ?

encukou · March 21, 2025, 8:36am

(I assume you’re asking about subtypes with extra C data; creating a subclass with just __dict__ is straightforward.)

They can’t.

Today, they need PyUnicodeObject and they also need to rewrite most of PyUnicode_New.
Users of PyUnicodeObject also don’t get any warning when CPython adds a new complication (for example, in the free-threading build, access to PyASCIIObject->interned needs to be atomic).

That’s why I’m proposing PyUnstable_UnicodeObject for this use case. I hope to get people that currently need PyUnicodeObject to remove it if they can, and share more of their use cases if they can’t.

malemburg · March 21, 2025, 9:04am

The current approach requires access to the type object you want to subclass.

And this is a documented feature of CPython: 2. Defining Extension Types: Tutorial — Python 3.13.2 documentation

We can’t just go about removing this feature for arbitrary types without first offering a stable working new method of doing the same.

What do you propose as an alternative method ?

encukou · March 21, 2025, 9:35am

Subclassing as documented in the tutorial currently does not work with strings, since PyUnicode_New assumes that there is no extra state, and will allocate string data at the same location as your extension data. (It will also pick one of the three instance structs for str.)
To make this work, you need to write a modified version of str allocation, making sure you don’t miss anything (including new fields/invariants that CPython can add in new versions).

But, if you want a no-underscored-API version, keep your PyUnicode_New replacement, and add something like this (untested):

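typedef struct {
    int state;
    bool flag;
} MyExtraData;


static PyTypeObject SubUnicodeType = {
    ...
    /* tp_basicsize is filled in at runtime, before PyType_Ready(), because
       PyUnicode_Type.tp_basicsize is not a compile-time constant:
       SubUnicodeType.tp_basicsize =
           PyUnicode_Type.tp_basicsize + sizeof(MyExtraData); */
    ...
};

static MyExtraData *get_extra_data(PyObject *mystring) {
    /* The extra data sits right after the base str struct. */
    return (MyExtraData *)(((char *)mystring) + PyUnicode_Type.tp_basicsize);
}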

I do recommend to use PyUnstable_UnicodeObject instead.

ronaldoussoren · March 21, 2025, 9:41am

I’ve shared my use case, and cannot move away from accessing PyUnicodeObject unless there’d be another way to create a subclass with additional C fields, or there’d be a different way to have string-like types (in a way that’s supported by both Python code and native extensions). The latter would be preferable for me due to a slight mismatch between the semantics of Objective-C and Python strings.

For completeness’ sake: There is a way for me to drop using PyUnicodeObject, but that would break code for users of PyObjC.

malemburg · March 21, 2025, 10:25am

Yes, I know. But that’s not the point.

The point is that if we want to go ahead hiding type structs from C extension writers, we need to provide an alternative way of adding more data to such objects at a C level, both for object types which do extend the size of the object to store variable sized data and for the more common ones which don’t.

If I understand correctly, you want to put the new data between the end of the static entries in PyUnicodeObject and the variable sized part, right ?

I don’t think that’ll work, since the standard Unicode APIs will happily overwrite your added data, since they believe the variable sized part starts right where you just put your new data.
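The collision described here can be shown with a toy layout. This is a sketch with hypothetical names, not CPython's real structs. It assumes a "compact" design in which character data is stored inline, immediately after the header.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy "compact" string header: character data is stored inline,
   immediately after the header. */
typedef struct {
    size_t length;
    intptr_t hash;
} toy_header;

/* Where the base allocator believes the inline character data starts. */
static char *
toy_inline_data(toy_header *h)
{
    return (char *)(h + 1);
}

/* A naive subclass layout: the extra field directly follows the header. */
typedef struct {
    toy_header base;
    int extra;
} toy_sub;

/* Returns 1 if writing inline character data would clobber `extra`. */
static int
toy_extra_overlaps_data(toy_sub *s)
{
    return (char *)&s->extra == toy_inline_data(&s->base);
}
```

On common platforms `toy_extra_overlaps_data()` returns 1: the subclass field and the first character of the string share an address, so any base-type function that writes the string overwrites the subclass data.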

A better way is to not touch the initialization logic and add your data at the end, after PyUnicode_New() has done its work and the object has been finalized by adding data to it. This will require replacing the object type with the subtype (aka subclass) and then possibly reallocating the object to make room for the extra data after the variable sized part (this can be avoided by asking for some extra room in the object when allocating it).
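A toy version of the "extra data at the end" layout might look like the following. All names are hypothetical and the header is not CPython's. The extra struct is placed after the inline characters, rounded up to its alignment, with the room requested at allocation time so that no reallocation is needed.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { size_t length; } toy_vheader;      /* inline chars follow */
typedef struct { int number; char flag; } toy_extra;

/* Round n up to a multiple of align (align must be a power of two). */
static size_t
toy_round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Offset of the extra data: after header, characters and NUL, aligned. */
static size_t
toy_extra_offset(size_t length)
{
    return toy_round_up(sizeof(toy_vheader) + length + 1, _Alignof(toy_extra));
}

/* Allocate header + characters + NUL + trailing extra data in one block. */
static toy_vheader *
toy_new(const char *text)
{
    size_t length = strlen(text);
    toy_vheader *h = calloc(1, toy_extra_offset(length) + sizeof(toy_extra));
    if (h == NULL) {
        return NULL;
    }
    h->length = length;
    memcpy((char *)(h + 1), text, length + 1);
    return h;
}

/* Locate the trailing extra data; the offset depends on the length. */
static toy_extra *
toy_get_extra(toy_vheader *h)
{
    return (toy_extra *)((char *)h + toy_extra_offset(h->length));
}
```

The cost of this layout is that the extra data has no fixed offset: every access has to consult the length, and anything that resizes the string has to move the extra data as well.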

But regardless, we need a generic non-hacky solution for these things.

encukou · March 21, 2025, 11:31am

I agree! We do! I’d love to do this for strings.
I believe there is a generic way to do this for types whose instance structs are not public. If we can’t hide the structs, we can add a solution specific for strings. I started on that; hopefully it’ll work.

An extra constraint is that the alternative way can’t break the existing hacky ways. That requires knowing what the existing ways are. This thread has been very helpful for that.
So far, I haven’t seen a use case that doesn’t rely on other undocumented details.

And after reallocation, don’t forget to update all the internal pointers. But, we don’t need to redesign that: PyObjC/NumPy have it covered (but they don’t show all subtleties they needed to avoid).
AFAICS, we either need to keep their proven approach working (by not changing PyUnicodeObject), or design a generic non-hacky solution (which, I currently believe, would be far easier if we can change the layout).

BTW, it looks like the free-threading build will change the semantics of PyASCIIObject.interned. As far as I understand, before free-threading is made default, at least that field should be hidden behind the accessor function – or it should be documented as one of the things to check before setting Py_MOD_GIL_NOT_USED. (That’s largely unrelated to my goal here, I’m pointing it out to anyone interested in removing roadblocks for free-threading.)

malemburg · March 21, 2025, 1:48pm

Why make a special case for strings rather than come up with a clean design for all Python objects ?

If this needs changes to the basic PyObject structs, then now is a good time for this, since AFAIK the free threading code does need such changes as well (but could be wrong - I only remember reading such comments occasionally).

There’s no need to rush any of this, so let’s take time.

I think you should call this implementation strategy rather than use case. It is pretty clear that there are lots of use cases for subclassing str, e.g. in order to hold references to objects in other applications (think interfaces to other programming languages, libraries or systems), to extend strings for particular purposes (think path and URL objects), to reference alternative content (think translated strings), etc. etc.

encukou · March 21, 2025, 5:52pm

Here’s a Python 3.12+ extension module mymod that defines a type MyStr with settable attributes number (int) and flag (bool), and a get_data method, without using PyUnicodeObject et al.:

#define Py_LIMITED_API 0x030c0000 /* 3.12 */
#include <Python.h>
#include <stddef.h>

typedef struct {
    int number;
    char flag;
} MyStr_Data;

static PyTypeObject *MyStr_Type = NULL;

static MyStr_Data *
mystr_get_data(PyObject *self) {
    return PyObject_GetTypeData(self, MyStr_Type);
}

static PyObject *
mystr_get_data_meth(PyObject *self, PyObject *dummy)
{
    MyStr_Data *data = mystr_get_data(self);
    if (!data) {
        return NULL;
    }
    /* "i" matches the C int arguments; "n"/"l" would expect Py_ssize_t/long */
    return Py_BuildValue("ii", data->number, (int)data->flag);
}

static PyType_Spec mystr_spec = {
    .name = "mymod.MyStr",
    .basicsize = -(int)sizeof(MyStr_Data),
    .slots = (PyType_Slot[]) {
        {Py_tp_members, (PyMemberDef[]) {
            {"number", Py_T_INT, offsetof(MyStr_Data, number),
                Py_RELATIVE_OFFSET},
            {"flag", Py_T_BOOL, offsetof(MyStr_Data, flag),
                Py_RELATIVE_OFFSET},
            {0}  /* sentinel */
        }},
        {Py_tp_methods, (PyMethodDef[]){
            {"get_data", mystr_get_data_meth, METH_NOARGS,
              "get the extra data as a tuple"},
            {0}  /* sentinel */
        }},
        {0}  /* sentinel */
    }
};

int
mymod_exec(PyObject *mod)
{
    if (MyStr_Type) {
        PyErr_SetString(PyExc_ImportError,
                        "cannot load module more than once per process");
        return -1;
    }

    MyStr_Type = (PyTypeObject *)PyType_FromSpecWithBases(
        &mystr_spec, (PyObject *)&PyUnicode_Type);
    if (!MyStr_Type) {
        return -1;
    }
    if (PyModule_AddType(mod, MyStr_Type) < 0) {
        Py_DECREF(MyStr_Type);
        return -1;
    }
    Py_DECREF(MyStr_Type);

    return 0;
}

static PyModuleDef mymod_def = {
    .m_name = "mymod",
    .m_slots = (PyModuleDef_Slot[]) {
        {Py_mod_exec, mymod_exec},
        {0}  /* sentinel */
    },
};

PyObject *
PyInit_mymod(void) {
    return PyModuleDef_Init(&mymod_def);
}

This of course ignores a lot of complications, but, most are orthogonal.
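The negative `.basicsize` in the spec above follows PEP 697: zero means "inherit the base size", a negative value means "extend the base by this many bytes", and the subclass data starts at the base's basicsize rounded up for alignment. The sketch below models that arithmetic with hypothetical helper names. It is a simplified reading of the PEP, not CPython's implementation.

```c
#include <stddef.h>

/* Round size up to a multiple of the strictest fundamental alignment. */
static size_t
toy_align_up(size_t size)
{
    size_t align = _Alignof(max_align_t);
    return (size + align - 1) & ~(align - 1);
}

/* Total instance size for a subclass, given the base's basicsize and the
   basicsize written in the spec. */
static size_t
toy_total_basicsize(size_t base_basicsize, ptrdiff_t spec_basicsize)
{
    if (spec_basicsize == 0) {
        return base_basicsize;              /* inherit */
    }
    if (spec_basicsize > 0) {
        return (size_t)spec_basicsize;      /* absolute size, classic style */
    }
    /* negative: extend the base by -spec_basicsize bytes */
    return toy_align_up(base_basicsize) + (size_t)(-spec_basicsize);
}

/* Where the subclass-specific data starts within an instance. */
static void *
toy_get_type_data(void *obj, size_t base_basicsize)
{
    return (char *)obj + toy_align_up(base_basicsize);
}
```

Because the offset is derived from the base type at runtime, the extension never needs to know `sizeof(PyUnicodeObject)` at compile time.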

AFAICS, PyObjC would want a bit of new API: PyUnicode_NewSubtype, a function like PyUnicode_New that additionally takes a “type” argument, and returns an uninitialized string that’s “fillable” with the same caveats as PyUnicode_New.

I have a draft (!) branch adding that, with a test type that exercises some of the complications.

I didn’t get PyObjC tests to run yet, I didn’t try porting NumPy, I didn’t audit the edge cases, but so far this seems viable.

(Subtype support in PyUnicode_Writer would also be nice, later.)

Do you have an example of one that doesn’t?

malemburg · March 21, 2025, 7:11pm

I guess I wasn’t clear enough. I meant standard PyObject object types, which are not variable sized (i.e. not PyVarObjects); not cases where you want to subclass a type in C, but don’t add extra data fields, which seems to be what you’re asking for.

That said, I can imagine such a use case as well, e.g. if you just want to add additional methods to an existing type without changing the original type.

malemburg · March 21, 2025, 7:45pm

Looks like you’re using PEP 697 – Limited C API for Extending Opaque Types | peps.python.org

To make this more generic, we’d need a new type slot for this purpose, e.g. tp_newsubtype.

encukou · March 24, 2025, 9:36am

I still don’t think I know what you’re asking for, unfortunately.

As the title suggests, that’s API for the case where the struct is opaque (or not available at all). And it turns out that it works for str, better than I thought! :)
The issue for PyObjC’s use case is that you can’t construct such strings from C arrays (UCS2 in this case): the existing API (PyUnicode_New or the new PyUnicodeWriter) gives you an exact str.

What would be the signature of a tp_newsubtype function?
Obviously it can’t be the str-specific PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar).
It could require boxing, and take Python args tuple & **kwargs dict – that’s tp_new, we have it already.
It could take void* and require you to pass in a correct structure depending on the type – but that seems rather fragile.