How to get an extension module working w/ free threading?

I’m sorry you feel like that, but please don’t be unfair.
When you say that people here think package maintainers will magically make everything thread safe, I believe you are being unfair to those who are actively helping in that regard.

Please, believe me when I say that there are a lot of people here who know the problems faced by many extension maintainers.

1 Like

This implies that we have to decide what the solution is to all of those issues before allowing anyone to actually try free-threading out. Quoting again from the document linked above:

Eventually we will need to add locking around data structures to avoid races caused by issues like this, but in this early stage of porting we are not planning to add locking on every operation exposed to users that mutates data. Locking will likely need to be added in the future, but that should be done carefully and with experience informed by real-world multithreaded scaling.

Note that this refers to locking “every operation exposed to users that mutates data”, but that is incomplete. If there exists any exposed method for mutating the data (regardless of whether it is ever used), then you need to lock every operation that reads the data as well. You could perhaps use a locking mechanism that allows concurrent reads, but you still need to check for the write lock on every operation that does anything.
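The read side of that requirement can be sketched with a pure-Python readers-writer lock. The names and structure here are mine, not from any existing API; the point is that even concurrent readers must check for a pending writer on every access:

```python
import threading

class RWLock:
    """Minimal readers-writer lock sketch: many concurrent readers,
    writers get exclusive access. Illustrative only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            # Readers must still check the write lock before proceeding.
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # Writers wait for all readers to drain.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Even in the concurrent-read case, every read still goes through the condition variable, which is exactly the per-operation overhead being discussed.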

There is the potential to bring significant slowdowns for people who are already using the library in a thread-safe way, or who are not using threads at all. We need to allow users to try out free threading to evaluate the benefits and costs of different approaches. When I say that I want to collect feedback from users, it is not just that I want them to discover the expected segfaults. Rather, I want them to be able to see whether free-threading is actually useful, which is not possible if the uploaded wheels just re-enable the GIL.

If there is an expectation that packages will be uploaded that re-enable the GIL and that this is an expected part of normal use of Python then the big fat warning about enabling the GIL on import needs to be removed. I consider it fine now because free-threading is “experimental” as is very clearly stated in the release notes:

2 Likes

Also, the “solution” here likely involves using things that don’t exist yet. Ideally there would be a decorator in Cython that could make this all work automatically:

@cython.locked_class
cdef class Matrix:
    @cython.write_lock
    def __setitem__(...): ...
    @cython.read_lock
    def __getitem__(...): ...

As far as I know, no such decorator exists. Note that currently you need Cython’s master branch just to be able to build cp313t extension modules at all.
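For illustration, here is a rough pure-Python analog of what such hypothetical decorators might do, using a single re-entrant mutex per instance rather than separate read and write locks (all names here are made up for the sketch):

```python
import threading
from functools import wraps

def locked_class(cls):
    """Hypothetical analog of @cython.locked_class: attach a lock
    to every instance before the original __init__ runs."""
    orig_init = cls.__init__

    @wraps(orig_init)
    def __init__(self, *args, **kwargs):
        self._lock = threading.RLock()
        orig_init(self, *args, **kwargs)

    cls.__init__ = __init__
    return cls

def with_lock(method):
    """Hypothetical analog of @cython.read_lock/@cython.write_lock,
    collapsed into a single mutex for simplicity."""
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._lock:
            return method(self, *args, **kwargs)
    return wrapper

@locked_class
class Matrix:
    def __init__(self, data):
        self._data = dict(data)

    @with_lock
    def __setitem__(self, key, value):
        self._data[key] = value

    @with_lock
    def __getitem__(self, key):
        return self._data[key]
```

A real implementation would want reader-writer semantics and would need to live at the C level to be cheap enough, which is exactly the missing language support.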

Support may also be needed in CPython itself to make things like this work in a performant way. I don’t know, but I expect that the language support is not yet there for what will end up being the best way to implement these things. If it is there, then someone should write a guide that explains how to do it, along with timing measurements, benchmarks, multi-threaded scaling, etc.

1 Like

By the way, it’s not only accessing state without synchronization, it’s also using useful but thread-unsafe C APIs such as PySequence_Fast_ITEMS (this one was discussed here).

3 Likes

This is the key: for Python 3.13t, uploading wheels that turn the GIL back on is the right thing to do when a project’s thread safety isn’t guaranteed yet (at least to the level of “will not segfault even if you do unsupported things in parallel threads”).

End users will then, by default, receive a package that still works, but doesn’t give any speedups (yet). They may decide free threading is pointless, and go back to the regular build, but that’s OK (the build is experimental for a reason).

Those users that want to actively help find thread safety issues can forcibly disable the GIL and see if anything breaks. If users don’t want to set an environment variable or CLI flag to do that, they’re expecting a production-ready experience rather than an experimental one, and are likely better off avoiding the free-threaded builds for now.
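For reference, on CPython 3.13 those mechanisms are the `PYTHON_GIL=0` environment variable and the `python -X gil=0` option, and you can check the result at runtime:

```python
import sys

# sys._is_gil_enabled() was added in CPython 3.13; guard so this
# snippet also runs on older interpreters.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:", sys._is_gil_enabled())
else:
    print("No free-threading support in this interpreter")
```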

The GIL has been around for so long that it may be years before ecosystem support for removing it has progressed sufficiently to move beyond stage 1 in the rollout plan described in the PEP 703 (Making the Global Interpreter Lock Optional in CPython) acceptance announcement.

2 Likes

Thread safety is something that’s difficult to guarantee in most programming languages used for writing CPython extension modules, even with a lot of testing in place [1], unless there has been very good development hygiene from the start. I don’t expect CPython to guarantee anything at this point either.

The best you can usually say is “we’ve tried to be careful, and we don’t know of glaring issues at the moment”.


  1. that most packages will not realistically put in place ↩︎

1 Like

Yeah, I almost put more qualifiers on that phrasing, since even CPython itself doesn’t meet that standard as written (not due to any known threading problems, but due to known segfault bugs that can be provoked via API misuse even in single threaded programs).

From what I understand of the current state of NumPy (read/write array access from multiple threads is fine, mutating the shape of the arrays is not), that mutation is already not thread-safe, even with the GIL. The same is true for builtin lists: if you change their size from another thread, you won’t segfault CPython, but you can absolutely make the reading threads start throwing IndexError as previously valid indices become invalid (or return different objects from the expected ones).
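The list behaviour is easy to demonstrate. Here the resize happens on another thread, but it is joined before the read purely to make the outcome deterministic for the example:

```python
import threading

data = list(range(10))

def shrink():
    # Another thread resizing the list in place...
    del data[5:]

t = threading.Thread(target=shrink)
t.start()
t.join()  # joined only to make the interleaving deterministic here

# ...means a previously valid index now fails in the reading thread:
value = None
try:
    value = data[7]
except IndexError:
    pass

print("stale index raised IndexError; no crash, value =", value)
```

With the GIL you get this IndexError; the free-threading question is whether an extension type gives you the same graceful failure or a segfault.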

Those kinds of “You may get a segfault instead of a Python exception” situations don’t feel like a compelling reason for an extension module to turn the GIL back on by default (the code was going to fail either way, it just fails a bit harder in free-threaded mode). We’d probably want to see the affected libraries get back to raising Python exceptions as part of the progress towards “stage 2” in the free threading rollout plan, though.

1 Like

I am not sure that this is even the best end state to reach in a GIL-free world. It may be that it is better to have some APIs that are not thread-safe and are simply documented as such, especially if normal usage of the library is thread-safe anyway.

I have looked at downstream usage of python-flint and it would be thread-safe, because it either does not use the potentially unsafe features or only uses them in a way that would be thread-safe (e.g. only mutating an object at construction, before it can be shared).

For now I will document how to use the library in a thread-safe way. It is not clear to me yet whether documenting this and then guaranteeing “thread-safe when used as documented” is a better end state than guaranteeing “no segfaults ever, regardless of invalid usage”. Of course it would be nice to prevent the segfaults, but we don’t yet know what the cost of that would be, e.g. the performance impact for people who are already using the library in a thread-safe way.

I have confirmed now that I can crash the free-threaded interpreter with python-flint, but only by doing things that would not be thread-safe anyway. The difference is just that you get crashes inside malloc, whereas if we put locks everywhere you would get logically corrupted data or non-deterministic output instead. Personally I prefer segfaults over non-deterministic output. If the user really wants something sensible here, they are going to need to use their own locks anyway.

What this means is that it is not clear to me what the criteria are supposed to be for using Py_MOD_GIL_NOT_USED. So far the only example I know of for wheels on PyPI is NumPy which has already uploaded cp313t wheels and has set the flag so you can use it without the GIL by default. It is still possible to resize an array:

>>> a = np.array([1, 2])
>>> a.resize((10,))
>>> a
array([1, 2, 0, 0, 0, 0, 0, 0, 0, 0])

I assume that there are no locks to prevent other threads from accessing the data during a resize.

The only other example I know of is Cython, which has not published wheels yet. If you install the master branch under cp313t and then use it to build python-flint, it prints out 50 annoying warnings about enabling the GIL. This is because of Cython’s own extension modules, even though Cython is not being used in any multithreaded way. We use meson for multiprocess parallelisation (which is cleaner in a build tool), but it runs the cython CLI 50 times, hence 50 warnings. I don’t know if that will have changed by the time Cython puts out an actual 3.1 release.

Is there any way for Cython itself to disable the warning besides setting Py_MOD_GIL_NOT_USED? Note that the cython CLI can guarantee that multiple threads will never be used, so the warning is pointless in that case.

1 Like

Segfaulting is significantly more serious than raising an exception: it can turn “application handling web requests sometimes fails, with error logging, when a specific code path is triggered” into “a specific code path can be exploited to crash an interpreter serving multiple users (a denial of service at minimum)”.

It’s well outside the expectations of Python developers that extension code will segfault on “misuse” when misuse is defined in a way that’s possible just by adding threading and only using public APIs from Python code.

2 Likes

Regardless of segfaults, many operations in python-flint can be extremely expensive, so you would have to know what you are doing to use it in a context where DoS from user input was a concern. Think of e.g. the decimal string to integer conversion problems, but imagine a library whose purpose is to do that sort of expensive operation as fast as possible. Many operations have complexity scaling that goes way beyond quadratic.

There are different kinds of users. In python-flint’s case there are broadly two groups:

  1. Those who use it directly.
  2. Those who use it indirectly via other libraries.

Those who use it indirectly via other libraries are not exposed to the thread safety issues. For example, SymPy can use python-flint under the hood transparently. Users don’t need to know what is happening, beyond the fact that installing python-flint “makes SymPy faster”. In this context, when you use SymPy’s public API, everything is automatically thread-safe without caveats.

The users who use python-flint directly are often the sort of people who would otherwise be writing C code and using the underlying library directly. They like the fact that it provides a fairly thin wrapper over the underlying library so that they can see the relationship between the methods of each type and the corresponding operations in C. They also like the fact that in many situations it provides the same performance that you would get from working in C. I think it is reasonable to let them share some of the responsibility for thread safety in exchange for not having artificial limits on performance.

We need to collect experience of real world usage to see what is possible and what is useful. The locks provided in PEP 703 are mutex locks which don’t allow concurrent reads and I think that might rule out some useful things like sharing large data structures on a read-only basis between multiple threads.

1 Like

For the experimental stage of the roll-out, that sounds like a reasonable level to justify setting “GIL not used”.

The roll-out feedback request would be “What new tools, if any, do you need to allow the library to raise Python exceptions in those cases instead of segfaulting?” (Think things like the “dict changed size during iteration” exception raised by CPython when a dict iterator’s internal state gets invalidated)
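That existing dict behaviour is easy to reproduce:

```python
d = {"a": 1, "b": 2}
err = None
try:
    for key in d:
        # Mutating the dict invalidates the live iterator; CPython
        # raises a Python exception instead of crashing or silently
        # corrupting the iteration.
        d["c"] = 3
except RuntimeError as exc:
    err = exc

print(err)  # dictionary changed size during iteration
```

The feedback question is what tooling extension authors would need to detect invalidation like this cheaply and raise instead of segfaulting.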

Edit: I’m aware this isn’t consistent with my suggestion earlier in the thread. I’ve been persuaded that this is a better approach for stage 1 of the roll-out when it comes to gathering useful feedback.

4 Likes

For libraries like python-flint, the first thing needed is Cython language support for locking, so that implementing memory-safe behaviour is at least easy to do. Then at least we can try it out and measure the performance impact. At that point, whether or not to enable the locking could possibly be made a build-time option. Without that, even testing this out would require rewriting a lot of code, which we wouldn’t want to do if there is any expectation that better language support will arrive in the future.

There is some discussion in a Cython issue where a maintainer says:

I definitely think Cython should expose some higher-level locking stuff nicely. Although some of that needs to become public first.

I’m not sure what “some of that needs to become public” means but I assume that it refers to CPython having internal features that are needed for reasonable memory-safe performance in a free-threading build but that have not been exposed for use by third party extension module authors (yet?).

I don’t know how accurate the PEP still is but it describes needing to modify the memory allocator just to make dict and list access reasonable:

There are two motivations for avoiding lock acquisitions in these functions. The primary reason is that it is necessary for scalable multi-threaded performance even for simple applications. Dictionaries hold top-level functions in modules and methods for classes. These dictionaries are inherently highly shared by many threads in multi-threaded programs. Contention on these locks in multi-threaded programs for loading methods and functions would inhibit efficient scaling in many basic programs.

The secondary motivation for avoiding locking is to reduce overhead and improve single-threaded performance. Although lock acquisition has low overhead compared to most operations, accessing individual elements of lists and dictionaries are fast operations (so the locking overhead is comparatively larger) and frequent (so the overhead has more impact).
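The second point can be illustrated very roughly at the Python level: wrapping a cheap list access in a mutex makes the lock the dominant cost. The absolute numbers are machine dependent; this is only a sketch of the effect the PEP describes, not a proper benchmark:

```python
import threading
import timeit

data = list(range(100))
lock = threading.Lock()

def read_plain():
    # A fast operation: single list index.
    return data[50]

def read_locked():
    # The same operation behind a mutex.
    with lock:
        return data[50]

n = 100_000
t_plain = timeit.timeit(read_plain, number=n)
t_locked = timeit.timeit(read_locked, number=n)

# For an operation this cheap, the lock acquisition is a large
# fraction of the total cost.
print(f"plain:  {t_plain:.4f}s")
print(f"locked: {t_locked:.4f}s")
```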

It seems likely to me that similar considerations apply in many cases to third party extension types. It is not necessarily unreasonable that CPython does not expose e.g. the _Py_TRY_INCREF macro during a transitional phase while the implementation is still being worked out. Ultimately, though, if CPython can only achieve memory safety without unacceptable performance loss by using internal APIs, then it is unreasonable to expect third party extension modules to achieve the same while those capabilities remain inaccessible.

I don’t know if _Py_TRY_INCREF in particular is what is needed but there is a CPython issue where someone is asking for some public version of it.

1 Like

As a general rule this is not true, though of course we evaluate each proposed new public API on a case-by-case basis.

The point of CPython is to achieve these things without delegating them to users, whether they’re using Python code or the C API. If CPython can’t guarantee safety without making users use the low-level primitives, then we’ve failed. Code that relies on internal-only features is code that ought to be part of CPython itself.

Simply saying “CPython needs it, therefore it should be part of the public programming model” is the worst kind of cop-out, and a failure of API design that we treat as a last resort. It certainly should not be the starting point for any design discussions.

5 Likes

Since I opened that issue and presented the same matter at this year’s language summit, I think I can add to that.

_Py_TRY_INCREF is likely never going to become public API, along with other similar routines. (Actually, I think it is now even a static inline function.)

After opening the issue, I later came to completely agree with Sam.

Those kinds of functions are tightly integrated with the way CPython handles concurrent reference counting, and I think we can all agree that the core devs should be free to change the implementation (and its interface) without deprecation periods.

At the same time, stable APIs for correctly handling references to PyObjects are needed.
In my library, cereggii, there’s an AtomicRef class for this exact purpose.
It is intended to be used both from Python and from C, but I have some packaging issues at the moment and am not actually shipping the C headers.

The library has not yet been ported to 3.13, but I may get around to doing that relatively soon.

If you’re interested, let me know and we can see whether it can help you.

1 Like

I’m not saying that these particular APIs should be public (especially because there is a reasonable chance they could evolve by CPython 3.14).

The Cython maintainers seem to think though that they need something more from the public C-API and that also seems to be acknowledged on CPython’s side in the linked issue:

I think there are more important and better candidates for free-threading related public C-APIs, like PyMutex and critical sections that should be addressed first.

In answer to the question above then:

It looks like there are a few steps:

  • CPython needs to provide some stuff in the C API.
  • Cython needs to be able to use that C API implicitly and also to expose it for explicit use in Cython code.
  • Cython should probably add some language support for locking somewhat like its current support for controlling the GIL.

I don’t know that any of these has a simple answer: exactly what the API should be, how Cython should use it, or how Cython’s language support could look. I don’t imagine that this is going to be a quick process. It is possible that most pieces could be in place by CPython 3.14, but it definitely won’t happen for 3.13.

So from my perspective it looks like the thing to do right now is really just to wait for upstream changes and see what approaches other libraries take. In the meantime we will put out extension modules that are “memory-safe if used in a thread-safe way” so that people can try out free-threading.

2 Likes

Let’s step back a bit. CPython needs dedicated high-performance internal APIs because it can potentially make a lot of fine-grained concurrent accesses to critical types such as dict or list.

Third-party packages do not necessarily share that constraint. In particular, for many data/science-oriented packages, locking will be more coarse-grained around relatively costly native operations (e.g. lock a NumPy array against concurrent mutation before running a matrix multiplication). These packages should often be content with standard synchronization primitives in their implementation language, such as C++ or Rust.

1 Like

This is true, although you don’t just need the lock for expensive operations like matrix multiplication. You would need to lock in any method or function that accesses the data for reading or writing, even just retrieving a single number from an array like a[0]. You also basically have to choose a locking scheme per type of object, so that you use the same kind of locks regardless of whether the arrays are big or small.

Yes, maybe. The cereggii library mentioned above uses stdatomic.h. It looks like a very nice library for Python code but in C we could just use the header directly.

You still need to use the same locking mechanism as CPython for any objects that CPython is going to mess with, though. I see that PyMutex and critical sections are now in the public C API, but it does not look like the Cython-generated C code I have here uses them at all.

I’m assuming it’s ok to call most concrete APIs for e.g. PyList without taking a lock explicitly - except those APIs that return borrowed references. Am I wrong @colesbury ?

It looks like the PyList_* functions will take a lock internally:

The problem is that you often need to call more than one function. I’m thinking that this sort of thing might be expected to lock the list (although it is hard to imagine it going wrong in a real situation):
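As a generic illustration of that pattern (not the specific snippet referenced above): each individual list operation may be internally consistent, but a compound check-then-act sequence still needs an outer lock to be correct under threads:

```python
import threading

items = list(range(100))
lock = threading.Lock()

def pop_if_nonempty():
    # The emptiness check and the pop must happen under one lock;
    # individually locked operations are not enough, because another
    # thread could empty the list between the two calls.
    with lock:
        if items:
            return items.pop()
        return None

def worker():
    for _ in range(25):
        pop_if_nonempty()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4 threads x 25 pops drain exactly 100 items, with no IndexError.
print("remaining:", len(items))
```

Remove the outer `with lock:` and a thread can pop from a list that another thread just emptied, which is the class of bug that internal per-call locking alone cannot prevent.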

I’m not sure where the code that handles np.array([1, 2, 3, 4]) actually lives, but I was looking to see whether it was immune to a resize of the list. Presumably iterators don’t need any external locking, so it depends how you access the list.

Yes, that’s right. There’s a more detailed description at: C API Extension Support for Free Threading — Python 3.13.0rc2 documentation