How to get an extension module working with free threading?

Indeed, PyList_GET_ITEM and PyList_GET_SIZE are probably not OK in free-threading code, because they directly access the internals of PyList without any locking whatsoever.

They are okay if you have a lock yourself. Accessing internals is not the issue, because you would have similar problems using PyList_Size and PyList_GetItem or the like. The issue is time of check to time of use: you can’t check the length in a separate call if no lock is in place to prevent the length from changing.
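
For example, a minimal C sketch of the hazard (the function is made up for illustration; it is only safe if the caller holds a lock on the list across both calls):

/* A minimal sketch of the time-of-check/time-of-use hazard described
 * above.  Without a lock held across both calls, another thread can
 * shrink the list between the size check and the item access. */
#include <Python.h>

static PyObject *
second_item(PyObject *list)
{
    if (PyList_GET_SIZE(list) < 2) {              /* time of check */
        PyErr_SetString(PyExc_IndexError, "need at least two items");
        return NULL;
    }
    /* The borrowed pointer below may already be dangling if the list
     * shrank in the meantime; this is only safe if the caller holds a
     * lock spanning both calls. */
    return Py_NewRef(PyList_GET_ITEM(list, 1));   /* time of use */
}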

There’s quite a bit more than that. (But there’s no documentation right now, sorry.)

I believe that cereggii.AtomicRef can be very useful for C extensions.
Right now, the code only handles correct concurrent reference counting for nogil-3.9, and soon it will target only Python 3.13 (both the free-threading and default builds).
Whenever the internals of CPython’s reference counting system change, I can change the implementation of AtomicRef without changing its external API (which boils down to simple gets and sets), while maintaining backwards compatibility.

If you want to manually implement correct concurrent reference counting outside the domain of the interpreter, I believe you’ll find yourself rewriting AtomicRef.

Fair enough, but why would I need that (in my particular case or in general)?

I’m assuming that CPython and Cython are going to take care of making the reference counts correct. The main things I need are:

  1. Lock a mutable object when reading/writing.
  2. Lock two objects in a binary operation.

I imagine that in Cython this ends up looking like:

from cython cimport lock_self, critical_section
from libmatrix cimport c_matrix_type, c_func1, ...

cdef class matrix:
    # Internal C level data structure:
    cdef c_matrix_type data

    @lock_self
    def __getitem__(self, ...):
        return c_func1(self.data, ...)

    @lock_self
    def __setitem__(self, ...):
        c_func2(self.data, ...)

    def __add__(self, other):
        if type(self) != type(other):
            return NotImplemented
        with critical_section(self, other):
            return c_func3(self.data, other.data, ...)

So basically I need each object to have a lock, and I need to lock either one or two objects at a time when calling C functions, to protect the internal C data structure from mutation during concurrent access. There may be some situations where I need to lock more than two objects, though…

For locking a single object I could use stdatomic.h, or I could use CPython’s critical-section macros, but I would basically expect Cython to provide a higher-level way to spell it, like @lock_self. The downside of a critical section is no concurrent reads (imagine the C functions taking a long time). The upsides are that the mutex is already there for free, it already solves deadlocks, and I assume that CPython is going to make it work across Python versions, etc.
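
For reference, a rough C-level sketch of the two-object case, assuming the critical-section macros that CPython 3.13 makes public (c_matrix_type and c_func3 stand in for the hypothetical libmatrix declarations from the Cython snippet above):

/* A sketch of the two-object case from the Cython snippet above,
 * using CPython 3.13's public critical-section macros. */
#include <Python.h>

typedef struct c_matrix c_matrix_type;                       /* hypothetical */
extern PyObject *c_func3(c_matrix_type *, c_matrix_type *);  /* hypothetical */

typedef struct {
    PyObject_HEAD
    c_matrix_type *data;    /* internal C-level data structure */
} matrixobject;

static PyObject *
matrix_add(PyObject *self, PyObject *other)
{
    if (!Py_IS_TYPE(other, Py_TYPE(self))) {
        Py_RETURN_NOTIMPLEMENTED;
    }
    PyObject *result;
    /* Locks the per-object mutexes of both operands (in a consistent
     * order); on GIL-enabled builds the macros are no-ops. */
    Py_BEGIN_CRITICAL_SECTION2(self, other);
    result = c_func3(((matrixobject *)self)->data,
                     ((matrixobject *)other)->data);
    Py_END_CRITICAL_SECTION2();
    return result;
}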

It is not clear to me where AtomicRef would fit into this.

The problem that AtomicRef solves is when you have a shared mutable reference to a PyObject.

When the reference is mutated, say from object A to B, it might be that A’s reference count is decremented (while B is incremented because it now sits in the shared reference).
Normally, if A’s reference count reaches 0, the interpreter immediately frees it.

In a concurrent context you may have one thread swapping A for B, and possibly freeing A, while another thread reads A from the shared ref and attempts to incref it.

This is a possible use-after-free bug within the call to Py_INCREF, and two things might happen:

  1. a segmentation fault
  2. the memory associated with A gets reused, and you incref (and possibly return) an object that has nothing to do with the one you stored.
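
For concreteness, a C sketch of the racy pattern just described (shared_ref is a made-up global; this shows the broken pattern, not how AtomicRef is implemented):

/* A sketch of the race described above.  `shared_ref` is a made-up
 * global, and this is the broken pattern, not how AtomicRef works;
 * even the plain pointer load/store is itself a data race. */
#include <Python.h>

static PyObject *shared_ref;     /* shared mutable reference, holds A */

/* Thread 1: swap A for B and drop the old reference. */
static void
swap_in(PyObject *b)
{
    PyObject *old = shared_ref;
    shared_ref = Py_NewRef(b);
    Py_DECREF(old);              /* may free A immediately */
}

/* Thread 2: read the shared reference and take a new reference. */
static PyObject *
read_ref(void)
{
    PyObject *obj = shared_ref;  /* may still observe A... */
    Py_INCREF(obj);              /* ...after A was freed: use-after-free */
    return obj;
}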

AtomicRef solves this use-after-free by hooking into the internal QSBR mechanism.
You don’t know whether this mechanism is going to change in a future version of CPython, but the API of AtomicRef isn’t going to change, giving you backwards and forwards compatibility.

Also note that when an object is freed, its lock is freed as well.

Not really. If the list was passed in by third-party code, your own lock will not prevent the list from being mutated in another thread.

Your own lock actually won’t serve a purpose if you’re not mutating the list yourself.

The fundamental difference is that PyList_GetItem will raise an exception if the index has become too large for the current list length, while PyList_GET_ITEM will produce undefined behavior (and probably crash). The former is much better than the latter.

That said, yes, better sequence APIs for safe and fast access to lists are warranted, as mentioned earlier.

I don’t mean a separate lock but rather the per-object lock that is described in the PEP:

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.

The internal PyList code mostly uses Py_BEGIN_CRITICAL_SECTION(self), so I assume that is what you need to do as well if you want to prevent the list from being mutated in between operations.
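
A rough sketch of what that could look like in C, holding the list’s own per-object lock across the length check and the item access:

/* A sketch holding the list's per-object lock across the length check
 * and the item access, mirroring what the internal list code does;
 * the macros are no-ops on GIL-enabled builds. */
#include <Python.h>

static PyObject *
last_item(PyObject *list)
{
    PyObject *result = NULL;
    Py_BEGIN_CRITICAL_SECTION(list);
    Py_ssize_t n = PyList_GET_SIZE(list);
    if (n > 0) {
        result = Py_NewRef(PyList_GET_ITEM(list, n - 1));
    }
    Py_END_CRITICAL_SECTION();
    if (result == NULL) {
        PyErr_SetString(PyExc_IndexError, "list is empty");
    }
    return result;
}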

If you are OK with using internal low-level APIs, then probably.

Edit: this was a false statement, see below

I’m confused: aren’t the critical-section macros in the public API?

Oops, my bad! I stand corrected.

One thing I do think might be nice to make public is _PyOnceFlag (or some other kind of one-time initialization API based on PyMutex and atomic operations). By far the most common use of C globals I’ve run into in real-world C extensions like NumPy is one-time initialization of runtime caches. Exposing _PyOnceFlag would make it easier to simultaneously support GIL-enabled and free-threaded builds, since PyMutex has the nice property that if it blocks while the thread owns the GIL, it releases the GIL.
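
For illustration, a simplified sketch of that kind of one-time cache initialization, built directly on the public PyMutex from 3.13 (unlike a real once API, it does no error or retry handling, and it uses a C global purely for illustration):

/* A simplified sketch of one-time cache initialization built on the
 * public PyMutex from CPython 3.13.  Unlike a real "once" API it does
 * not handle failed or re-entrant initialization. */
#include <Python.h>

static PyMutex cache_mutex;      /* zero-initialized */
static PyObject *cache;          /* built on first use */

static PyObject *
get_cache(void)
{
    PyMutex_Lock(&cache_mutex);  /* releases the GIL if it has to block */
    if (cache == NULL) {
        cache = PyDict_New();    /* stand-in for an expensive setup step */
    }
    PyMutex_Unlock(&cache_mutex);
    return cache;                /* NULL only if PyDict_New() failed */
}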

Right now in NumPy we’re using our own hand-rolled one-time initialization code based on C atomics, but I don’t think that’s scalable, because writing code using C atomics is unfortunately not yet as easy as #include <stdatomic.h>, at least on the platforms NumPy would like to support.

I’m going to propose this eventually but am waiting for 3.13 to be released before starting any C API design discussions.

This would have to be based on module state, most likely, since that’s the “once” level that Python supports/requires.

Anything more global than that is going to hit issues with subinterpreters and/or reinitialization, so I’d expect it won’t get a “convenient” CPython API. If you do have process-wide state, as opposed to interpreter-wide state, that’ll remain your full responsibility.

But you can assume that your module initialization will only be called once per scope/interpreter and will complete before any of your other functions are called.
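
For example, a minimal sketch of doing one-time setup in per-module state via multi-phase initialization (all names are made up; m_traverse/m_clear/m_free are omitted for brevity):

/* A minimal sketch of one-time setup kept in per-module state and run
 * by a Py_mod_exec slot, once per interpreter. */
#include <Python.h>

typedef struct {
    PyObject *cache;
} module_state;

static int
example_exec(PyObject *module)
{
    module_state *st = PyModule_GetState(module);
    st->cache = PyDict_New();                 /* one-time setup */
    return (st->cache == NULL) ? -1 : 0;
}

static PyModuleDef_Slot example_slots[] = {
    {Py_mod_exec, example_exec},
#ifdef Py_GIL_DISABLED
    {Py_mod_gil, Py_MOD_GIL_NOT_USED},        /* opt in to free threading */
#endif
    {0, NULL},
};

static struct PyModuleDef example_def = {
    PyModuleDef_HEAD_INIT,
    .m_name = "example",
    .m_size = sizeof(module_state),
    .m_slots = example_slots,
};

PyMODINIT_FUNC
PyInit_example(void)
{
    return PyModuleDef_Init(&example_def);
}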

Fair. It doesn’t have to be in the C API either; it could be a header-only project like pythoncapi-compat that provides a nice cross-platform API that C extensions can use.

I’ve suggested that a few times, but the prevailing opinion is that CPython should provide all the APIs you need to develop for CPython. So it wouldn’t surprise me if we do end up with generic helpers like that in our headers (though I’ll continue to oppose them unless there’s a clear interaction with the interpreter runtime).

You can perhaps take a look at nemequ/portable-snippets on GitHub, a collection of miscellaneous portable C snippets.

I wonder if it would be feasible to come up with an ensurepip-style solution to that (which could then also be used to provide the cross-version compatibility headers).