For CPython internal usage, we have _PyModule_GetState, a fast variant of PyModule_GetState: the module check in the latter is reduced to a simple assert in the former.
For PyType_GetModuleState, there are three checks (two of them implicit in PyType_GetModule):
check that the given type is a heap type
check that the given type has an associated module
check that the result of PyType_GetModule is not NULL
For stdlib core extension modules, all of these conditions are always true (AFAIK). We could create a fast static inline variant, _PyType_GetModuleState, with a fast variant of PyType_GetModule inlined, where all three conditions are assert()ed. It would speed up a lot of heap type methods that need to access module state.
Unless there are objections, I’ll create an issue and a PR.
In the end it depends on the context in which the API is used. If it’s likely to be used in a tight loop, as PyTuple_GET_ITEM is, it makes sense to have an extremely trimmed-down macro or inline function. But I’m not sure in which circumstances PyModule_GetState performance would really need to be trimmed down to the last nanosecond.
So, unless PyModule_GetState does something heavyweight such as a dict lookup (does it?), this doesn’t seem a very attractive proposition.
The asyncio C implementation was recently converted to heap types, and this function is called for almost everything. I would expect the best possible performance out of it; the entire point of that module is performance, otherwise we could just use the pure Python implementation we already have on hand.
I would go as far as to say that such boolean checks for API misuse should not have been added in the first place; everyone has to pay the price for these effectively useless checks.
FYI, PyModule_GetState already has a fast variant (_PyModule_GetState). My proposal is to add a fast variant for the slightly more complex PyType_GetModuleState API.
Did you try to benchmark the module before/after the conversion to check if something changed? Or are you just making intuitive assumptions about performance?
Last I benchmarked, there was a 3–4% slowdown. Microbenchmarks don’t tell the whole story here because of branch prediction and cache behaviour. Also, if there were no slowdown, then _PyModule_GetState (which does less work and already exists) would not have been added in the first place. The slowdown might be more apparent in other modules, but I don’t have the bandwidth to investigate further.
I believe Kumar linked to that issue/discussion not specifically because of _PyModule_GetState, but because it is a general example of why optimising how the module state is fetched matters.
Regarding benchmarking, I can try to find time for that in a couple of weeks; I’m currently on the road with limited time for CPython stuff.
For the sake of avoiding any confusion: I fully understand that optimizing module state access matters. I was just expressing skepticism about the specific change proposed here.