Stable ABI for PyCriticalSection

With PEP 793 (PyModExport), it’s possible to build Stable ABI extensions compatible with both free-threaded and “GIL-ful” builds.

We’re still not there:

  • We need a way to select, name, and describe the ABI and the associated API limitations. Two draft PEPs do that: PEP 803 and PEP 809. Without those, you need to patch Python.h (or use internal API that won’t stay until the 3.15.0 final release), and without tool support you need to build/install manually. Even given all that, there’s enough demand for a free-threaded stable ABI that people are experimenting with the opaque PyObject ABI.

  • To take advantage of free-threading – as opposed to extensions just being loadable by free-threaded CPython – we need to expose some locking primitives. Both 803 and 809 say:

    Limited API to allow thread-safety without a GIL – presumably PyMutex, PyCriticalSection, and similar – will be added via the C API working group, or in a follow-up PEP.

So, what should a stable ABI for PyCriticalSection look like, if/when we have a stable ABI for free-threaded Python?

I’m starting with PyCriticalSection for a few reasons:

  • Mutexes are provided by the PyThread_*_lock API, which is already part of the Stable ABI. (They’re documented as obsolete – possibly for reasons that actually make them good Stable ABI, but that’s getting off-topic here.)
  • Rust and C++ have mutexes in their standard library. PyO3 and Cython both expose “safe” wrappers for platform mutexes that can’t deadlock with the interpreter. It’s really only in C, on platforms that lack threads.h, that PyMutex is especially convenient; elsewhere, people have developed workarounds.
  • It’s been identified as a pain point by early adopter testing in PyO3. PyCriticalSection allows extension authors to control the per-object PyMutex locks in the free-threaded build, and it fundamentally cannot be emulated without some sort of hook into the interpreter runtime that achieves the same deadlock-protection characteristics and locking semantics.
  • The most natural way to use some C APIs is by holding a critical section. For example, dict iteration with PyDict_Next.

The current PyCriticalSection API uses a stack-allocated structure. Allocating it dynamically on each use makes it slower (7.283 vs. 3.93 ns in microbenchmarks), which isn’t all that terrible, but it’s still best to keep heap allocation as a contingency option only (for a possible future where the mechanism changes but the ABI needs to stay).

Stack allocation by the caller means that the size of the struct needs to be part of the ABI; it can’t change in future versions.

There’s another constraint that I’d like to keep: the size of the struct should be the same in the Stable ABI as in the “full”, version-specific ABI. That is, the Stable ABI should continue being a subset of any “full” ABI. This doesn’t matter in practice if the struct is only ever stack-allocated, but starts mattering as soon as someone puts it in their object. That limitation is hard to explain, and essentially impossible to enforce.


The current API is:

  • structs with private fields: PyCriticalSection (2 pointers), PyCriticalSection2 (3 pointers)
  • functions: PyCriticalSection_Begin, PyCriticalSection_End, PyCriticalSection2_Begin, PyCriticalSection2_End
  • C convenience macros, each of which contains unpaired { or }: Py_BEGIN_CRITICAL_SECTION, Py_BEGIN_CRITICAL_SECTION2, Py_END_CRITICAL_SECTION, Py_END_CRITICAL_SECTION2

The macros are the preferred API. Non-C wrappers like PyO3 need to reimplement the macros, ideally as their language’s flavour of “context manager”.

I think we can put this in the Stable ABI as-is (and add it to non-free-threaded builds as no-ops).
If we later need to make an incompatible change, we:

  • add new structs and functions with new names (possibly just a _v2 at the end);
  • make the convenience macros call the new functions;
  • keep the old functions working, even if they need a malloc/free to get at a larger size.

We will not add build-time aliases like #define PyCriticalSection_Begin PyCriticalSection_Begin_v2. Since the old PyCriticalSection_Begin function needs to stay (for old Stable ABI extensions), the C PyCriticalSection_Begin would refer to a different function than ctypes.pythonapi.PyCriticalSection_Begin. This is confusing and error-prone.


That leaves one thing: fallibility. I don’t think we can guarantee that PyCriticalSection_Begin can never fail (which includes never emitting a warning, since warnings can be turned into errors).

What comes to mind is changing Py_BEGIN_CRITICAL_SECTION(op) to:

{
    PyCriticalSection _py_cs;
    int PyCriticalSection_result = PyCriticalSection_Begin(&_py_cs, (PyObject*)(op));

to be used as:

Py_BEGIN_CRITICAL_SECTION(o);
    if (PyCriticalSection_result < 0) {
        return -1;
    }
    ...
Py_END_CRITICAL_SECTION();

In the non-limited API, we can add if (PyCriticalSection_result < 0) Py_UNREACHABLE(); – tell compilers to elide any checks when not targeting the Stable ABI.

How does that sound?


Thanks to @ngoldbaum for help drafting this!

I’m not sure it’s a good idea to spell this out as a hard requirement, although I guess it’s fine to mention it as a possibility. In the theoretical case where we add functionality to critical sections, there’s a substantial chance we would do so without changing the API signatures, so it would be pretty desirable to avoid source changes, and I think in that theoretical case we shouldn’t rule out the build-time aliases. It’s a decision we should make at that point, basically comparing the convenience of no source changes with the inconvenience of a small discrepancy between the compile-time define and the runtime symbol – which, I will point out, we already have in quite a few cases.

However, changes to the critical section struct are so hypothetical, I’m not particularly worried. We just need to make sure we have safeguards set up in the form of ABI checks, so we don’t accidentally/unintentionally change the struct.

Why can’t we guarantee it can never fail? It can’t fail right now.

Overall looks good to me, except the fallibility point:

That leaves one thing: fallibility. I don’t think we can guarantee that PyCriticalSection_Begin can never fail

I feel like I’m making the same arguments over and over again against this style of API design. APIs that can fail are not a good fit for mutexes (or critical sections). We’ve seen this in practice with PyThread_type_lock and pthread_mutex_t:

  • Lots of callers don’t handle error cases either because it’s inconvenient or impractical in the context of lock acquisition. We want to be able to add mutexes (or critical sections) to existing code without transforming functions or code paths that can’t fail into ones that can.
  • This was true even for CPython internals. A quick grep through 3.12 shows that most calls to PyThread_acquire_lock don’t handle the error return.
  • The consequence of this is that you can’t actually add new error returns in the future without breaking lots of code. It doesn’t buy us much flexibility because breaking lots of callers is unacceptable, even if the documentation says the API may return an error code.

What’s neat about PyCriticalSection is that the primary API is macros, which are build-time-only already. There’s much less need for aliases for the underlying functions.

But yeah, it’s a decision for later. We may decide aliases are worth it; the important thing is that the decider knows the downsides.

Same here from the other side :‍)

I’m arguing for the stable ABI only. Note that Stable ABI extensions don’t (necessarily) get rebuilt, so they don’t get build-time deprecation warnings required by PEP 387. To be able to remove (a version of) a function, we need the possibility of runtime deprecations, and that needs fallibility.

You do that by calling Py_FatalError.
But in quite a lot of cases – including Clinic’s @critical_section – you can and should let the exception bubble out.

Yep. But that’s the opposite of stable ABI: it’s the one codebase where a PR that changes PyThread_acquire_lock’s internals can also update all callers :‍)

Yes, that’s a problem. Adding to the Stable ABI is the point where we can do something about it.
Let’s add __attribute__((warn_unused_result)) (GCC) or [[nodiscard]] (C23), in stable ABI only.

Agreed, and I would insist that whatever design it takes has the option to switch to an API-compatible implementation that uses dynamic allocation (which implies some kind of initialization function today that may be a no-op, but could do an allocation in the future and store the pointer inside the struct). That protects us from having to invalidate old modules if the size increases.

Also agree.

I think the key point here is that “fallible” doesn’t necessarily mean recoverable. There may not be anything a caller can do in case of failure other than call os._exit() or equivalent, but the important thing is that the caller decides what they’ll call. If we don’t even have a way to indicate failure, then we have to decide for them, and there’s nothing worse than trying to embed Python and having to somehow intercept TerminateProcess()/kill() or equivalent. By making it fallible, we make failure the responsibility of the implementer, so they can log/signal/message/etc. whatever they need.


I’m not convinced complicating the API – and all users in perpetuity – is an acceptable price to pay for the option of a more gentle deprecation of something this fundamental and stable. I would much rather say “this API can’t be deprecated, so it can only be removed when the Stable ABI breaks backward compatibility with the last version that exposed it at build time”. That’s definitely not the case for most APIs! But for this one, yeah.

The problem with this is that we can’t know if the caller is doing anything sensible in the error return case, and problematic cases will not show up until we do start deprecating through loud warnings. Only then will we know if the mere act of raising the warning is disruptive (and potentially destructively disruptive) to user code. Given how people tend to use C APIs and warnings from compilers, I think there’s a significant risk of that happening.

If we don’t have a way of deprecating it through warnings, we will never be in that position. It will make it much harder to do a deprecation if we need it, but it will be a much cleaner break when we do.


This is only useful if the callers actually handle the error returns. They frequently won’t!

Let’s say you add PyErr_WarnEx to the critical section implementation in the future. If many (or most) callers do not handle it properly, then you end up in a really bad situation:

  • Code proceeds as if it holds the lock, but it doesn’t. Race conditions! Crashes!
  • The exception gets mishandled or swallowed by future code.

This is much worse than something like a PyErr_WarnEx+PyErr_WriteUnraisable.

Clinic is one of the few places where it’d be easy to bubble exceptions out, but it’s internal-only so it’s not particularly relevant to the stable ABI.

Most C API extensions aren’t using code generation.

No, you’ve misunderstood my point: we were not relying on known semantics of CPython’s PyThread_acquire_lock – that function passes up error returns from the platform mutex. We were ignoring the error return because it’s often difficult or annoying to handle.

CPython core development tends to be much better at handling error return paths than extensions. My point is that many – probably most – callers won’t handle the error returns properly. If you don’t want to look at CPython, look at the use of PyThread_acquire_lock in extensions [1] [2] [3] [4] [5].

If we want to add a failure mode in the future (which I don’t think we will), then we should add a new function at that time. We shouldn’t ever take an API that in practice cannot fail and give it new failure modes, because that will break callers.


If this is the concern, then let’s add both:

void PyCriticalSection_Begin(...)
int PyCriticalSection_BeginEx(...)

So if we decide to add error return codes then at least we know that the callers are probably actually handling them.


Thanks for raising this Petr!

From PyO3’s perspective I’d personally prefer to simply expose the API that’s already available in 3.13 and 3.14. There hasn’t been a need to change the critical section ABI for 2.5 years – 5 years if you count Sam’s Python 3.9 fork. That said, PyO3 can handle all this inside its implementation of the critical section wrappers.

From a NumPy maintainer perspective, where I’m working with the macros, I’d prefer not to need to add limited-api-only checks inside every critical section for an error case that can only happen in a theoretical future where CPython wants to generate a warning.

Also I want to share that I have a branch of PyO3 that works with an experimental branch Petr shared with me yesterday that passes all PyO3 Cargo tests on the abi3t abi as more-or-less proposed here. I’m pretty confident that PyO3 will be able to support abi3t without too much more effort, assuming this comes in.

Maturin and other packaging tools will need to wait on the resolution of PEP 803 and 809, to teach them about the new abi tag. I’ll probably work on branches to enable end-to-end tests.

I still think we can combine them.
For the non-Stable ABI, as a CPython implementation detail, promise that the function will never fail. And use Py_UNREACHABLE to allow the compiler to optimize away any error handling that the user might add.
And for the Stable ABI, use warn_unused_result/nodiscard to make compilers warn if the result is unused.
Then, only Stable ABI users pay the price for possible deprecation or switch to malloc.

Are you counting Cython, PyO3, nanobind, etc? IMO, those would work great with a @critical_section flag for method definition.
(So would PyMethodDef, by the way.)


This is not an excuse for deliberately making our own API unmaintainable. We can easily define it as “if you don’t handle the error return, you get undefined behaviour”, or even “you get per-version behaviour that may change between releases”. This doesn’t have to be super complicated.

Obviously if we have helper macros/inline functions then those need to handle it, but can simply do Py_FatalError as that’s the best response we have at that level. Internally we don’t have to check since we can update our own calls when a failure mode exists. But refusing to let users do their own handling is just… well… why? Why would you actively try to block them being able to handle cases? Or why would you so confidently assume that it will never change when literally everything changes? Do you really want your name to be attached to the reason we can’t change behaviours in the future? All it needs is a return value and documentation that if it ever fails, you should bail out asap and assume that all your state is broken.


Anyway, yeah, this would work too. (As PyCriticalSection_BeginWithError rather than Ex.)


You can also add enough padding in the struct to make room for realistic size increases. If the struct is typically stack-allocated it doesn’t matter if it has 4 words instead of 2.

Unfortunately, I definitely think we need critical sections in the Stable ABI if it’s to be usable in free-threading.

Also, unfortunately, they do feel a lot like an implementation detail that in an ideal world shouldn’t be part of the Stable ABI (i.e. there are a lot of weird implicit rules about when they can be broken that kind of make sense but don’t feel completely universal).

From that point of view the failure paths kind of make sense to me just because it isn’t completely obvious that they’ll want to survive unchanged forever. As a code-generator Cython’s in a relatively good place to handle them (via a mixture of exceptions or aborting, depending on the context). I’d rather not deal with failure, but given we have an attached thread-state any time we have a critical section then it’s fine.

I’d argue that PyMutex[1] is probably a simpler and more fundamental building block - we pretty much know that we can implement a mutex without breaking the ABI - so maybe we can make those infallible. The only real complication is their interaction with critical sections (given that critical sections are maybe more of a moving target).


  1. which I appreciate this topic isn’t about… ↩︎

So, we’re looking at stable ABI additions:

struct PyCriticalSection;
struct PyCriticalSection2;
int PyCriticalSection_BeginWithError(PyCriticalSection *cs, PyObject *obj);
void PyCriticalSection_End(PyCriticalSection *cs);
int PyCriticalSection2_BeginWithError(PyCriticalSection2 *cs, PyObject *o1, PyObject *o2);
void PyCriticalSection2_End(PyCriticalSection2 *cs);

Does that sound like a reasonable compromise?

The functions/macros should be enough for generators/language wrappers to get started.

I’m not sure about macro versions – this seems like the way to go:


I don’t think we want to increase it for all users (not just stable ABI).

Is that a problem really? Unless people are nesting critical sections (which is certainly not recommended), adding a total N <= 2 words to the CPython stack footprint sounds like a non-problem.

Okay, I know I pushed fairly hard for the error result, but I think I’ve persuaded myself out of it. Here’s my logic:

  • anyone who needs a fatal error other than Py_FatalError is going to have to rebuild CPython anyway. This is totally reasonable - plenty of embedders already do it.
  • Someone who’s patching in an alternate “tear it all down” behaviour can’t do it easily if every 3rd party extension is doing their own handling. Most of these extensions won’t care - they assume they’re only ever going to be used from “python.exe”[1].
  • If the macro contains the error handling, you can at least modify the macro and rebuild the downstream dependencies.
  • However, if the error handling is always calling Py_FatalError, and we make sure that it always calls back into libpython, then you can just change the “tear it all down” behaviour by patching one function. And my assumption is that anyone who cares about this is probably already patching that function, or at least is patching enough in the core runtime that patching this function is little/no/less work.

So I’ve come around to the idea that, since failure to acquire a critical section is unavoidably an “unknown corrupted state” error and is unrecoverable up to the top of the Python stack (which in an embedded context doesn’t mean the entire process has to terminate[2]), we should probably just call a non-inlined Py_FatalError from inside the runtime, and document that that’s what we call.

Perhaps one day, if people start having to deal with this more frequently, we’ll find a need to add some way to hook Py_FatalError for “recovery” purposes. No need to rush that in, and I don’t think it’s a stable ABI concern anyway[3], but everyone will be happier in the future if we’re already sending all these signals through a single path.


Also, I want to apologise to Sam for my earlier reply, which was definitely too sarcastic and disrespectful. I do still think the arguments I was making were stronger than the others up to that point (and I think I’m the first to bring up the arguments in this post), but the way I phrased it wasn’t so good.


  1. For want of a better way of naming “running Python’s main.c directly”, as opposed to being embedded into another process, or even hosted but still run in its own process (e.g. like Davinci Resolve does). ↩︎

  2. Or at least might want to first trigger auto-save or crash reporting, etc. ↩︎

  3. The embedder would hook it, and an embedder who doesn’t know at compile time what Python runtime they’ll be using is in way more trouble than a hook is going to help. So I assume embedders use the full API, while extensions use the limited API. ↩︎


But, is it really unavoidable, in all possible futures/implementations?
I don’t understand why this is special. How is this so different from a failed malloc – which we take care to convert to MemoryError whenever possible?

Going back to the original reasoning:

This is fair – sometimes Py_FatalError is the way to go, or at least good enough to go to main today.
But in lots of cases we’re ignoring the error not because it’s difficult or annoying, but because we’re just lazy, or because the code we’re inspired by didn’t do it – which is a problem, but a linter or compiler warning can help here.
A lot of cases can have normal error handling – and I’d expect that stable ABI uses will be skewed in this direction, compared to CPython internals.


For PyObject, it’s been decades – but still, the time came.

It’s not a problem. It’s also not really necessary, since we can switch to dynamic allocation (and add new ABI, so if the performance hit is too much, users can migrate/recompile).

I don’t think that we should introduce an error case in the critical section API, it’s already very complicated to get it right without error handling, and as Thomas wrote, the API cannot fail currently.

I don’t think that PyThread_acquire_lock() is a good example, since this API can fail and so it has different constraints. PyThread_acquire_lock() can fail because it uses pthread and Windows API in Python 3.14 and these APIs can fail. But PyMutex API is different: PyMutex_Lock() cannot fail. PyThread_acquire_lock() was modified in Python 3.15 to use PyMutex internally.

The critical section API uses PyObject.ob_mutex which is already allocated and so (allocation) cannot fail.

Would you mind elaborating? Why would it start emitting a warning or failing?

How do you get the PyThreadState? Is it implemented in PyCriticalSection_BeginWithError() and PyCriticalSection_End() by calling _PyThreadState_GET()? I’m worried that a developer can mess up the thread state between “begin” and “end” and so PyCriticalSection_End() will use the wrong thread state.

The current macros only get the thread state once and so ensure that we use a consistent thread state in “begin” and “end” whatever happens in the middle:

# define Py_BEGIN_CRITICAL_SECTION(op)                                  \
    {                                                                   \
        PyCriticalSection _py_cs;                                       \
        PyThreadState *_cs_tstate = _PyThreadState_GET();               \
        _PyCriticalSection_Begin(_cs_tstate, &_py_cs, _PyObject_CAST(op))

# define Py_END_CRITICAL_SECTION()                                      \
        _PyCriticalSection_End(_cs_tstate, &_py_cs);                    \
    }

If possible, I would prefer the macro approach to get the thread state only once: so macros with implicit thread state, but (private) functions with explicit thread state.