I removed 181 private functions from the Python 3.13 C API. A code search on the PyPI top 5,000 projects shows that the private _Py_Identifier API is widely used. Examples of lines using removed functions:
I didn’t look for _Py_IDENTIFIER() usage. Python 3.12 made strings “immortal”, and in the CPython code base _Py_IDENTIFIER() was replaced with statically initialized objects: GH-90699. What is the status of the _Py_Identifier API in Python 3.13?
Should we help C extensions move away from this API? Or do we want to make some of these APIs public?
Currently, PyPy doesn’t support the private _Py_Identifier C API.
We should definitely work toward getting rid of _Py_IDENTIFIER, etc. The fact that the community uses it, which I had not anticipated, is the only reason I didn’t remove it at the time.
That said, clearly there is a demand for an efficient API along those lines. If there isn’t already an issue open to pursue that then we should open one. Especially, we still need to reach out to users of the private API to find the best way forward. (My availability is still limited so I won’t be able to follow up right away.)
Ok, but I did remove them from the Python 3.13 C API since they were marked as private, and I removed as many private functions as possible.
So what should we do for Python 3.13? Force people to use the slower “allocate string, use the string, delete the string” pattern? Suggest a replacement: which one? Expose the API to make it public (and so endorse and support it)? Revert my change and go back to the status quo (not public, you should not use it, but oops, people actually use it)? I really want to avoid that last situation.
Yeah, we definitely kicked this can down the road when we dropped internal usage of _Py_IDENTIFIER() last year. At the time, I mentioned how we could open a new issue to deal with public usage of the API but I don’t think that ever happened.
When we started working on dropping _Py_IDENTIFIER(), you pointed out the projects you found that are using the API, and I took a closer look. My assessment at the time was: “All of them should be trivial to drop _Py_IDENTIFIER() without any real performance impact or mess.” Hence, we might just need to work with the handful of projects to get them off that API.
That said, a public API may still be worth adding. For example, it sounds like numpy would use it. Either way, it would probably be worth opening a new issue.
The C code generated by Argument Clinic is currently less efficient than the code targeting the internal C API (faster APIs), but there is room for improvement (since we know how to make it faster with the internal C API).
If the number of affected projects is low, the solution is to guide them towards static/global variables to cache strings (call PyUnicode_FromString() only once), and then use this string. Or just use the byte string functions, the variants with the String suffix like PyDict_GetItemStringRef().
If there are many affected projects, another option is to expose the bare minimum to the public C API:
Py_Identifier structure with its members (needed by Py_IDENTIFIER() macro)
and maybe also the Py_static_string(name, "...") macro for strings which are not valid C identifiers.
I dislike exposing the Py_Identifier structure. Its members already changed when I added support for sub-interpreters, so it’s not implementation-agnostic.
As I wrote before, if possible, I would prefer to not add this API to the public C API.
Maybe a brand new API should be designed? Or PyUnicode_FromString() should be optimized? Would it be possible to design an LRU cache on PyUnicode_FromString() which would be more efficient… than not using a cache? Computing a cache key requires hashing the byte string, which is not free in terms of performance.
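To illustrate the cost concern, here is a rough sketch of a cache keyed on the byte string. It is deliberately not an LRU: it is a direct-mapped cache (one entry per slot, overwritten on collision), used as a simpler stand-in. Everything in it is hypothetical (str_cache, cache_lookup, cache_store, the slot count, the choice of FNV-1a), and the void * value only stands in for a cached string object. The point is that even a hit has to read every byte of the key to hash it, and then compare it again to rule out a collision.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64  /* hypothetical size; must be a power of two */

typedef struct {
    char *key;    /* owned copy of the byte string, NULL if the slot is empty */
    void *value;  /* stands in for the cached string object (not owned here) */
} cache_slot;

typedef struct {
    cache_slot slots[CACHE_SLOTS];
    size_t hits;
    size_t misses;
} str_cache;

/* FNV-1a hash: reads every byte of the key, so the cost grows with length. */
static uint64_t
fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Return the cached value for key, or NULL on a miss. */
static void *
cache_lookup(str_cache *c, const char *key)
{
    cache_slot *slot = &c->slots[fnv1a(key) & (CACHE_SLOTS - 1)];
    /* The hash only selects a slot; strcmp() is still needed to confirm. */
    if (slot->key != NULL && strcmp(slot->key, key) == 0) {
        c->hits++;
        return slot->value;
    }
    c->misses++;
    return NULL;
}

/* Store value under key, evicting whatever occupied the slot before.
   Returns 0 on success, -1 if memory for the key copy cannot be allocated. */
static int
cache_store(str_cache *c, const char *key, void *value)
{
    size_t len = strlen(key) + 1;
    char *copy = malloc(len);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, key, len);
    cache_slot *slot = &c->slots[fnv1a(key) & (CACHE_SLOTS - 1)];
    free(slot->key);  /* free(NULL) is a no-op for an empty slot */
    slot->key = copy;
    slot->value = value;
    return 0;
}
```

A real version would also have to manage reference counts on eviction and be safe across threads and sub-interpreters. Whether the hash plus compare on every call beats simply allocating a short string would need to be measured.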