PEP 445 (3.4, 2013) added APIs for customizing the runtime’s allocators, defining 3 domains: raw, mem, and object. The “raw” domain allocator must be thread-safe and may be called without the GIL held. However, the “object” allocator does require the GIL to be held, since objects are involved and especially because the default implementation (AKA obmalloc, pymalloc, small block allocator) is not thread-safe.  The PEP specified the same restriction for the “mem” allocator specifically because it defaults to the “object” allocator.
Here’s the problem: 3.12 has a per-interpreter GIL. This means that, as soon as an isolated subinterpreter is created, simply holding the GIL no longer protects custom allocators. (The default “object” allocator implementation internally respects the current interpreter, so holding the GIL preserves safety.) That said, I expect that most custom allocators out in the community aren’t affected.  (It isn’t clear how widely the API is used. I’d expect there are only a handful of custom allocators out there.)
4. require that custom allocators respect the current interpreter
5. update/replace the allocator API to take a PyInterpreterState * argument
6. deprecate/remove the allocators API (or restrict it to wrappers)
The simplest solution is (1), but it isn’t clear how many projects would have to make changes. If we want to play it safe then (2) or (3) may be the right solution for the short-term. I think (4) is inferior to (1) and (5). (5) might be a good idea, but only as much as we’d move the rest of the C-API to consistently take a runtime context argument. (6) might be a good idea regardless.
 A hypothetical future mimalloc-based obmalloc would probably be thread-safe.
 The GIL restrictions are only relevant when a custom allocator has its own state. Thus an allocator that only wraps the default one (e.g. the “debug” allocators) is itself thread-safe and not affected by per-interpreter GIL.
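To make the distinction concrete, here is a minimal sketch (not CPython code) of a wrapper that does keep state of its own: a count of live blocks. Because the counter is shared by every thread that calls the hook, the sketch uses C11 atomics to stay thread-safe without relying on any GIL. `AllocatorEx` is a stand-in that mirrors the shape of `PyMemAllocatorEx` so the example compiles without `Python.h`; `CountingHook`, `counting_wrap`, `system_allocator` and `demo_counting_hook` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in with the same shape as PyMemAllocatorEx (PEP 445). */
typedef struct {
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
    void *(*calloc)(void *ctx, size_t nelem, size_t elsize);
    void *(*realloc)(void *ctx, void *ptr, size_t new_size);
    void (*free)(void *ctx, void *ptr);
} AllocatorEx;

/* Stateless inner allocator: forwards straight to the C library. */
static void *sys_malloc(void *ctx, size_t n) { (void)ctx; return malloc(n ? n : 1); }
static void *sys_calloc(void *ctx, size_t n, size_t sz) { (void)ctx; return calloc(n ? n : 1, sz ? sz : 1); }
static void *sys_realloc(void *ctx, void *p, size_t n) { (void)ctx; return realloc(p, n ? n : 1); }
static void sys_free(void *ctx, void *p) { (void)ctx; free(p); }

static AllocatorEx system_allocator(void) {
    AllocatorEx a = {NULL, sys_malloc, sys_calloc, sys_realloc, sys_free};
    return a;
}

/* Hypothetical wrapper state: the wrapped allocator plus a shared counter. */
typedef struct {
    AllocatorEx inner;
    atomic_size_t live_blocks;
} CountingHook;

static void *counting_malloc(void *ctx, size_t size) {
    CountingHook *h = ctx;
    void *p = h->inner.malloc(h->inner.ctx, size);
    if (p != NULL) atomic_fetch_add(&h->live_blocks, 1);
    return p;
}

static void *counting_calloc(void *ctx, size_t nelem, size_t elsize) {
    CountingHook *h = ctx;
    void *p = h->inner.calloc(h->inner.ctx, nelem, elsize);
    if (p != NULL) atomic_fetch_add(&h->live_blocks, 1);
    return p;
}

static void *counting_realloc(void *ctx, void *ptr, size_t size) {
    CountingHook *h = ctx;
    void *p = h->inner.realloc(h->inner.ctx, ptr, size);
    /* realloc(NULL, n) acts like malloc, so it creates a new block. */
    if (ptr == NULL && p != NULL) atomic_fetch_add(&h->live_blocks, 1);
    return p;
}

static void counting_free(void *ctx, void *ptr) {
    CountingHook *h = ctx;
    if (ptr != NULL) atomic_fetch_sub(&h->live_blocks, 1);
    h->inner.free(h->inner.ctx, ptr);
}

static AllocatorEx counting_wrap(CountingHook *h, AllocatorEx inner) {
    assert(h != NULL);
    h->inner = inner;
    atomic_init(&h->live_blocks, 0);
    AllocatorEx a = {h, counting_malloc, counting_calloc, counting_realloc, counting_free};
    return a;
}

/* Usage: returns 1 if the counter tracked one allocation and its release. */
static int demo_counting_hook(void) {
    CountingHook hook;
    AllocatorEx a = counting_wrap(&hook, system_allocator());
    void *p = a.malloc(a.ctx, 32);
    int ok = (p != NULL && atomic_load(&hook.live_blocks) == 1);
    a.free(a.ctx, p);
    return ok && atomic_load(&hook.live_blocks) == 0;
}
```

If `live_blocks` were a plain `size_t`, the wrapper would be exactly the kind of stateful allocator that a per-interpreter GIL no longer protects.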
When I implemented PyConfig (PEP 587), I tried to make sure that the PyMem and PyObject memory allocators are not called before Py_Initialize(): only PyMem_Raw should be safe to use before Py_Initialize(). In short, the idea is to call them only once Python “has an interpreter”. Maybe that’s not 100% true today, but if not, it shouldn’t be too hard to fix. PEP 587 also adds “pre-configuration”, which ensures that Python is “pre-configured” as early as possible and that memory allocators are set only once and not changed later (I’m not talking about “hooks” on memory allocators).
If we can move pymalloc state into PyInterpreterState (or indirectly make it “per interpreter”), it would be safe to be used under new constraints, no?
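As a rough illustration of what “per interpreter” allocator state could look like, here is a small self-contained sketch. It is not how pymalloc is implemented. `InterpAllocState`, `interp_states`, `current_interp_id` and the `interp_*` functions are all hypothetical; `current_interp_id` is a stand-in for asking the runtime which interpreter the calling thread is running (in CPython, `PyInterpreterState_Get()` plays that role). The sketch assumes a block is always freed in the interpreter that allocated it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_INTERPRETERS 8  /* arbitrary bound for this sketch */

/* Hypothetical per-interpreter allocator state. */
typedef struct {
    size_t live_blocks;
} InterpAllocState;

static InterpAllocState interp_states[MAX_INTERPRETERS];

/* Stand-in for "which interpreter is this thread running?". */
static _Thread_local int current_interp_id = 0;

static InterpAllocState *current_state(void) {
    assert(current_interp_id >= 0 && current_interp_id < MAX_INTERPRETERS);
    return &interp_states[current_interp_id];
}

/* Same signatures as the PEP 445 malloc/free hooks; ctx is unused here.
   Assumption: each interpreter's state is touched only by a thread that
   holds that interpreter's GIL, so the counters need no lock. */
static void *interp_malloc(void *ctx, size_t size) {
    (void)ctx;
    void *p = malloc(size ? size : 1);
    if (p != NULL) current_state()->live_blocks++;
    return p;
}

static void interp_free(void *ctx, void *ptr) {
    (void)ctx;
    if (ptr != NULL) current_state()->live_blocks--;
    free(ptr);
}

/* Usage: returns 1 if the two interpreters kept separate counts. */
static int demo_per_interpreter(void) {
    current_interp_id = 0;              /* "running in" interpreter 0 */
    void *a = interp_malloc(NULL, 16);
    void *b = interp_malloc(NULL, 16);
    current_interp_id = 1;              /* "running in" interpreter 1 */
    void *c = interp_malloc(NULL, 16);
    int ok = a != NULL && b != NULL && c != NULL
        && interp_states[0].live_blocks == 2
        && interp_states[1].live_blocks == 1;
    interp_free(NULL, c);
    current_interp_id = 0;
    interp_free(NULL, a);
    interp_free(NULL, b);
    return ok
        && interp_states[0].live_blocks == 0
        && interp_states[1].live_blocks == 0;
}
```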
About the GIL: tracemalloc assumes that PyMem and PyObject are called with the GIL held, whereas it acquires the GIL itself for PyMem_Raw hooks. Also, I modified PyMem to now require the GIL. That was mostly so that PyMem could use pymalloc, since it made Python faster: “Change PyMem_Malloc to use pymalloc allocator” (python/cpython issue #70437). I added debug checks in the Python debug build to validate that these APIs are used correctly (that the GIL is held).
Drama: Then nogil (PEP 703) enters the room with the “mimalloc-or-nothing” elephant.
Not quite. Currently the allocators are set globally and are expected to be shared by all interpreters. The object and mem allocators haven’t been required to be thread-safe, since they are only used with the GIL held. With a per-interpreter GIL, there is no single GIL to protect them. Thus a custom allocator must either be thread-safe (or target the current interpreter, like pymalloc) internally, or we have to introduce a global lock to wrap around the custom allocator (like in my PR).
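The global-lock option can be sketched roughly as follows. This is an illustration of the idea only, not the code from the PR mentioned above. `AllocatorEx` is a stand-in that mirrors the shape of `PyMemAllocatorEx` so the example compiles without `Python.h`; `Tally` is a hypothetical custom allocator whose unsynchronized counter makes it unsafe to call from two interpreters at once; `LockedAllocator`, `lock_wrap` and `demo_locked_allocator` are likewise hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in with the same shape as PyMemAllocatorEx (PEP 445). */
typedef struct {
    void *ctx;
    void *(*malloc)(void *ctx, size_t size);
    void *(*calloc)(void *ctx, size_t nelem, size_t elsize);
    void *(*realloc)(void *ctx, void *ptr, size_t new_size);
    void (*free)(void *ctx, void *ptr);
} AllocatorEx;

/* Hypothetical custom allocator with unsynchronized state. */
typedef struct { size_t live_blocks; } Tally;

static void *tally_malloc(void *ctx, size_t size) {
    void *p = malloc(size ? size : 1);
    if (p != NULL) ((Tally *)ctx)->live_blocks++;
    return p;
}
static void *tally_calloc(void *ctx, size_t n, size_t sz) {
    void *p = calloc(n ? n : 1, sz ? sz : 1);
    if (p != NULL) ((Tally *)ctx)->live_blocks++;
    return p;
}
static void *tally_realloc(void *ctx, void *ptr, size_t size) {
    void *p = realloc(ptr, size ? size : 1);
    if (ptr == NULL && p != NULL) ((Tally *)ctx)->live_blocks++;
    return p;
}
static void tally_free(void *ctx, void *ptr) {
    if (ptr != NULL) ((Tally *)ctx)->live_blocks--;
    free(ptr);
}

/* Wrapper state: the wrapped allocator plus one process-wide lock. */
typedef struct {
    AllocatorEx inner;
    pthread_mutex_t lock;
} LockedAllocator;

static void *locked_malloc(void *ctx, size_t size) {
    LockedAllocator *la = ctx;
    pthread_mutex_lock(&la->lock);
    void *p = la->inner.malloc(la->inner.ctx, size);
    pthread_mutex_unlock(&la->lock);
    return p;
}
static void *locked_calloc(void *ctx, size_t nelem, size_t elsize) {
    LockedAllocator *la = ctx;
    pthread_mutex_lock(&la->lock);
    void *p = la->inner.calloc(la->inner.ctx, nelem, elsize);
    pthread_mutex_unlock(&la->lock);
    return p;
}
static void *locked_realloc(void *ctx, void *ptr, size_t size) {
    LockedAllocator *la = ctx;
    pthread_mutex_lock(&la->lock);
    void *p = la->inner.realloc(la->inner.ctx, ptr, size);
    pthread_mutex_unlock(&la->lock);
    return p;
}
static void locked_free(void *ctx, void *ptr) {
    LockedAllocator *la = ctx;
    pthread_mutex_lock(&la->lock);
    la->inner.free(la->inner.ctx, ptr);
    pthread_mutex_unlock(&la->lock);
}

/* Serialize every call into `inner` behind a single mutex. */
static AllocatorEx lock_wrap(LockedAllocator *la, AllocatorEx inner) {
    assert(la != NULL);
    la->inner = inner;
    pthread_mutex_init(&la->lock, NULL);
    AllocatorEx out = {la, locked_malloc, locked_calloc, locked_realloc, locked_free};
    return out;
}

/* Usage: returns 1 if the wrapped allocator behaved as expected. */
static int demo_locked_allocator(void) {
    Tally tally = {0};
    AllocatorEx inner = {&tally, tally_malloc, tally_calloc, tally_realloc, tally_free};
    LockedAllocator la;
    AllocatorEx a = lock_wrap(&la, inner);
    void *p = a.malloc(a.ctx, 16);
    void *q = a.calloc(a.ctx, 4, 8);
    int ok = (p != NULL && q != NULL && tally.live_blocks == 2);
    a.free(a.ctx, p);
    a.free(a.ctx, q);
    ok = ok && tally.live_blocks == 0;
    pthread_mutex_destroy(&la.lock);
    return ok;
}
```

The trade-off is that every allocation in every interpreter contends on the same mutex, which is the performance cost this option carries compared with requiring the custom allocator to be thread-safe itself.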
I was actually looking just yesterday whether to expose the possibility for Rust (PyO3) extension modules to set the #[global_allocator] to be the Python allocator.
(I was curious if sharing arenas might lead to lower memory footprint or better cache locality. I hold no evidence to back up this was even a remotely good idea.)
My conclusion was it’s not worth it, because of the same thread-safety issue discussed here. My understanding is the Rust global allocator needs to be thread safe and also can’t guarantee it’ll be called on a native thread which is holding the GIL (or even has a Python thread state). So only the “raw” domain is suitable, which (if I understand correctly) is just a thin wrapper around malloc. There seemed no benefit from directing calls at this layer through the Python API.
From the options (1-6) listed in the OP, I think only (6) goes far enough to make it viable to direct Rust extension allocations through the Python allocators. Though as I mention above, I have no evidence to suggest doing so is a good idea, so please don’t expend effort trying to make this possible unless you think it is a good idea.
Although I am not sure this is what we should do, I think it is still on the table to simply state that custom allocators that want to be compatible with multiple interpreters need to be thread safe, given that multiple interpreters are a new feature that already requires some changes in C extensions to support. This is not a trivial advantage, because we would be reducing complexity without affecting performance. As @vstinner mentions, if PEP 703 is accepted then the whole discussion is not needed, because it rules out custom non-wrapping allocators (although the locking happens UNDER the wrapper, so the wrapper still needs to be thread safe if it keeps state).
Also, note that custom allocators for memory profilers tend to be thread safe already, because they also override plain malloc, so at least that area would be more or less unaffected.
PyMem_RawMalloc() can be called without holding the GIL, whereas PyMem_Malloc() requires the caller to hold the GIL.
The PyMem_RawMalloc() implementation should already be thread safe in Python 3.11. Otherwise, you will likely run into trouble. In practice, it’s just a thin wrapper around the C standard library’s malloc(). PyMem_RawMalloc() exists so that memory allocated by Python can be tracked in tracemalloc (to get the Python traceback where the memory was allocated).