C API: How private is the private _Py_IDENTIFIER() API?

Hi,

I removed 181 private functions from the Python 3.13 C API. A code search on the PyPI top 5,000 projects shows that the private _Py_Identifier API is widely used. Number of lines using each removed function:

  • _PyObject_CallMethodIdObjArgs (35)
  • _PyUnicode_FromId (34)
  • _PyObject_CallMethodId (14)
  • _PyUnicode_EqualToASCIIId (2)

I didn’t look for _Py_IDENTIFIER() usage. Python 3.12 made strings “immortal”, and in the CPython code base _Py_IDENTIFIER() was replaced with statically initialized objects: GH-90699. What is the status of the _Py_Identifier API in Python 3.13?

Should we help C extensions move away from this API? Or do we want to make some of these APIs public?

Currently, PyPy doesn’t support the private _Py_Identifier C API.

Python 3.12 provides the following private API:

  • _Py_static_string()
  • _Py_static_string_init()
  • _Py_IDENTIFIER()
  • _PyDict_ContainsId()
  • _PyDict_DelItemId()
  • _PyDict_SetItemId()
  • _PyEval_GetBuiltinId()
  • _PyImport_GetModuleId()
  • _PyObject_CallMethodId()
  • _PyObject_GetAttrId()
  • _PyObject_LookupAttrId()
  • _PyObject_LookupSpecialId()
  • _PyObject_SetAttrId()
  • _PyType_LookupId()
  • _PyUnicode_EqualToASCIIId()
  • _PyUnicode_FromId()

Victor

This seems a question for @eric.snow (who is still on vacation for another week IIRC).

We should definitely work toward getting rid of _Py_IDENTIFIER, etc. The fact that the community uses it, which I had not anticipated, is the only reason I didn’t remove it at the time.

That said, clearly there is a demand for an efficient API along those lines. If there isn’t already an issue open to pursue that then we should open one. Especially, we still need to reach out to users of the private API to find the best way forward. (My availability is still limited so I won’t be able to follow up right away.)


Ok, but I did remove them from the Python 3.13 C API since they were marked as private, and I removed as many private functions as possible.

So what should we do for Python 3.13? Force people to use the slower “allocate string, use the string, delete the string” pattern? Suggest a replacement: which one? Expose the API to make it public (and so endorse and support it)? Revert my change and go back to the status quo (not public, you should not use it, but oops, people actually use it)? I really want to avoid that last situation.

Yeah, we definitely kicked this can down the road when we dropped internal usage of _Py_IDENTIFIER() last year. At the time, I mentioned how we could open a new issue to deal with public usage of the API but I don’t think that ever happened.

When we started working on dropping _Py_IDENTIFIER(), you pointed out the projects you found that are using the API, and I took a closer look. My assessment at the time was: “All of them should be trivial to drop _Py_IDENTIFIER() without any real performance impact or mess.” Hence, we might just need to work with the handful of projects to get them off that API.

That said, a public API may still be worth adding. For example, it sounds like numpy would use it. Either way, it would probably be worth opening a new issue.

FWIW, the python-dev thread from Feb 2022 has a lot of great additional ideas and commentary.

This works fine for performance-insensitive cases. That might cover all the public usage, but we’d need to make sure. However, there might be broader needs that would make it worth doing more.

We’d need to figure that out, based on feedback from the existing _Py_IDENTIFIER() users and other projects like numpy. Again, there’s some good discussion about it in the python-dev thread, including your suggestion to perhaps make argument clinic public.

I’m pretty sure the existing API isn’t quite what we’d want users using long-term.

Yeah, I’d like to avoid this too. We may want to revert until we have a better plan though.

Update in this area: Argument Clinic now supports the limited C API, see:

The C code generated by Argument Clinic is currently less efficient than the code targeting the internal C API (faster APIs), but there is room for improvement (since we know how to make it faster with the internal C API).

I wrote gh-106320: Remove private _Py_Identifier API (python/cpython pull request #108593) to remove the private _Py_Identifier API from the public C API (moving it to the internal C API). This change is part of my larger plan to remove private APIs: C API: My plan to clarify private vs public functions in Python 3.13.

If the number of affected projects is low, the solution is to guide them towards static/global variables to cache strings (call PyUnicode_FromString() only once), and then use the cached string. Or just use the byte-string functions, the variants with the String suffix like PyDict_GetItemStringRef().

If there are many affected projects, another option is to expose the bare minimum to the public C API:

  • Py_IDENTIFIER(name) macro
  • PyUnicode_FromId(&PyId_name) function
  • Py_Identifier structure with its members (needed by Py_IDENTIFIER() macro)
  • and maybe also the Py_static_string(name, "...") macro for strings which are not valid C identifiers.

I dislike exposing the Py_Identifier structure :frowning: Its members already changed when I added support for sub-interpreters, so it’s not implementation-agnostic :frowning:

As I wrote before, if possible, I would prefer to not add this API to the public C API.

Maybe a brand new API should be designed? Or PyUnicode_FromString() should be optimized? Would it be possible to design an LRU cache on PyUnicode_FromString() which would be more efficient… than not using a cache? Computing a cache key requires hashing the byte string, which is not free in terms of performance.

That would be incompatible with multiple interpreters. Could you guide them to use module state instead of C statics/globals?

Maybe we do need an API to atomically create/get an immortal string that’s interned across all interpreters.
