Some notable missing pieces ahead of freethreading

Distributing wheels means that you cannot produce binaries that make assumptions about the hardware in the way that you could for people building locally (with e.g. -march=native). The flipside though is that you can make very strong assumptions about the compilers that generate the binaries because you are the one choosing and running them.

2 Likes

Having read through some of the linked details and some of what those link to in turn, I didn’t need to read it all to reach a conclusion. I think you are asking too much here, though there is no fault in wanting to improve the situation. Publishing optimized shared libraries that have to care about architecture levels or instruction sets is simply incompatible with the current wheel spec, but wrapping this somehow in the interpreter doesn’t seem better than just taking fine-grained locks.

2 Likes

That doesn’t sound correct. It took me quite a while to understand this thread, and I think you’re mixing potential performance and correctness issues. By default, wheels built with cibuildwheel and manylinux are safe and portable. It is the job of those projects to guarantee that; not doing so is a bug. So building numpy wheels and putting those up on PyPI will be fine, unless you explicitly take action to break that guarantee, like adding export CXXFLAGS=-march=native.

The consequence of that is that performance may be worse than you’d like, sometimes by a lot. It’s the price to pay for portability. I.e. by default you get case (2) from your list.
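To make the portability point concrete, here is a minimal sketch (the function name is hypothetical) of how a C translation unit can report which x86 SIMD level the compiler was told to assume. It relies on the feature macros that GCC and Clang predefine; a build with default flags and a build with -march=native can give different answers from the same source, which is exactly the assumption that gets baked into a wheel.

```c
/* Hypothetical helper: reports the x86 SIMD level this file was compiled
 * to assume. The macros are predefined by the compiler from its -m/-march
 * flags; nothing here is checked at runtime. */
static const char *compiled_simd_baseline(void) {
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__SSE4_2__)
    return "sse4.2";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "generic"; /* non-x86 target or no SIMD macro defined */
#endif
}
```

If this returns anything above the platform baseline, the compiler was free to emit those instructions anywhere in the file, and the binary may crash on older CPUs.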

The rest of what you describe is essentially the same problem covered in Distributing a package containing SIMD code - pypackaging-native. There are two potential solutions to that:

  • Your case (4): building runtime dispatch so you can choose CPU-specific code paths. This is quite complex; numpy has that internally but it’s not hooked up to everything, and we’d only do that where the gains are worth the engineering effort.
  • (in development) wheel variant support so projects can start shipping CPU micro-architecture specific wheels. See Wheel Variant Support - WheelNext. This isn’t available today but is in heavy development and a PEP should materialize sometime in the next 3 months.
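As a rough illustration of the first option, runtime dispatch can be sketched in C as below. This is not how numpy does it; it is a minimal sketch assuming GCC or Clang on x86, using the `target` function attribute and the `__builtin_cpu_supports` builtin. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable path: compiled for the default, wheel-safe baseline. */
static int64_t sum_baseline(const int32_t *data, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += data[i];
    return total;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
/* Same loop, but the compiler may use AVX2 inside this one function only. */
__attribute__((target("avx2")))
static int64_t sum_avx2(const int32_t *data, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += data[i];
    return total;
}
#endif

typedef int64_t (*sum_fn)(const int32_t *, size_t);

/* Picks an implementation based on what the running CPU reports. */
static sum_fn select_sum(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}
```

A module would typically call `select_sum()` once during initialization and store the result, so that the selection itself does not become a shared mutable variable raced on by multiple threads.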

The manylinux_2_34 issue you link to is more specific: it’s manylinux not being able to provide the portability guarantee for that specific choice of image (base OS + compilers), as a consequence of the distro it’s built on moving to x86-64-v2 as a baseline. As a result, manylinux_2_34 is blocked for release with the most likely solution being the wheel variant design I linked to above. This issue isn’t free-threading specific and will be solved one way or another. For the currently active manylinux versions (we’re in the middle of moving from manylinux2014 to manylinux_2_28) there is no problem.

It’s unclear to me whether CPython itself does runtime dispatch for the functionality you’re talking about here. If so, it can at least be used as a guide for how much performance is on the table if one were to build a standalone library today that does do runtime dispatch.

7 Likes

Hi, cereggii maintainer here 👋

Can you please clarify what you mean by this?

If you mean that cereggii needs a runtime check to make sure that the hardware does provide cmpxchg then I can agree with you in principle, but in practice it’s really a negligible check.

IIRC, that instruction was introduced with the Intel Pentium in 1993. I was implementing the runtime check in cereggii a couple of years ago, and decided it was not worth the time to write the check after realizing how widely available cmpxchg is.

Honestly, I’m not even sure what the CPython policy about this is. Is i386 still supported? (Genuine question.)

Can you please open a bug report with a reproducer? 🙏
The last time it ran, it did complete successfully.

I don’t really see the relevance of this. No sane compiler would do that. See Intel’s manual, Vol. 2A, p. 3-214 (page 804 in the combined volumes pdf). Without the prefix, it’s just not atomic, which is why I think that the compiler itself would be buggy if it did that.
(Do note that the assembly lock prefix has no implications for the lock-freedom of cmpxchg, it’s just an unfortunate historic labelling.)
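For readers less familiar with the instruction being discussed, this is roughly what a lock-free update built on compare-exchange looks like when written against C11 `<stdatomic.h>` rather than assembly. It is an illustrative sketch with a hypothetical function name, not code from cereggii; on x86-64, GCC and Clang typically lower the compare-exchange call to a `lock cmpxchg`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically raise *slot to at least `candidate`.
 * Returns the value that was observed before any update. */
static uint64_t atomic_store_max(_Atomic uint64_t *slot, uint64_t candidate) {
    uint64_t seen = atomic_load_explicit(slot, memory_order_relaxed);
    while (seen < candidate &&
           !atomic_compare_exchange_weak_explicit(slot, &seen, candidate,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* A failed exchange refreshes `seen` with the current value,
         * so the loop condition re-checks against fresh data. */
    }
    return seen;
}
```

No thread ever blocks here: a failed exchange only means another thread made progress, and the loop retries with the newer value.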

Also, double-word compare-exchange has more limitations even in architectures that support it. And many architectures don’t support it, which is why AtomicDict no longer relies on it. (For instance, aarch64 does not support it, except on Apple Silicon.)
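One way a library might decide at build time whether it can lean on double-word atomics is to ask the compiler what it guarantees. The sketch below uses the standard `ATOMIC_POINTER_LOCK_FREE` macro and, as an assumption, the GCC/Clang builtin `__atomic_always_lock_free`; the function names are hypothetical, and the answer depends on the target and flags the file is compiled with.

```c
#include <stdatomic.h>

/* 1 if pointer-sized atomics are always lock-free on this target
 * (the C11 macro is 2 for "always", 1 for "sometimes", 0 for "never"). */
static int pointer_atomics_always_lock_free(void) {
    return ATOMIC_POINTER_LOCK_FREE == 2;
}

/* 1 if the compiler promises lock-free atomics for an object the size of
 * two pointers, given the flags this file was built with; 0 otherwise. */
static int double_word_atomics_always_lock_free(void) {
#if defined(__GNUC__)
    return __atomic_always_lock_free(2 * sizeof(void *), 0) ? 1 : 0;
#else
    return 0; /* unknown compiler: conservatively assume no guarantee */
#endif
}
```

A design that only needs the first guarantee sidesteps the portability question entirely, which matches the choice described above of not relying on double-word compare-exchange.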

I think in general the level of detail of this discussion is really for compiler issues.

My exact thoughts 👌


I do agree on the general topic title: there are some notable missing pieces ahead of free-threading.

I think cereggii is a project that could fill that gap and provide the real-world use information we collectively need before these features can be integrated into the standard library.

If you’re interested in working on this topic, cereggii is looking for contributors!
And even if you didn’t want to contribute to cereggii, I’d be very glad to chat about this stuff!

2 Likes

This has to do with locking at the hardware level for mutual exclusion of access to cache and memory; it’s not an unfortunate label. This is also not safe to assume. Compiler developers often receive information from processor companies and their employees directly that may not match 1:1 what is in the manuals, yet they do rely on it. gcc received information from AMD relevant to this discussion years ago.

all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

I do think it is unlikely a compiler would ever emit this, as there are better options than compare exchange in a loop with access to AVX instructions, but I also agree with @mikeshardmind that it is not currently safe to assume that the instructions emitted for one platform are safe on another compatible platform with wheels.

It would be interesting if these projects explained why they “can’t or don’t want to use C++” (other than “I dislike C++, Rust and every other language that may better serve me than plain C”).

2 Likes

The python-flint package provides bindings to a C library and bundles several C libraries into the wheels. The bindings are Cython-generated C code. I’m not sure that python-flint can’t or doesn’t want to use C++ but if it is necessary to use C++ for atomics then that would be the only reason for using C++. Currently the Python package and all of its non-Python dependencies (including CPython) are built with only C compilers.

Building the extension modules with MSVC requires /experimental:c11atomics because of some things in flint.h so C atomics are already required. Is there a reason to prefer using C++ atomics in that case?
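For context on what plain C atomics look like in an extension, here is a minimal sketch using only C11 `<stdatomic.h>`, which is the header that the MSVC option mentioned above enables. The struct and function names are hypothetical and are not taken from python-flint or flint.h.

```c
#if defined(__STDC_NO_ATOMICS__)
#error "This sketch assumes a compiler that provides <stdatomic.h>"
#endif
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared object carrying an atomic reference count. */
typedef struct {
    atomic_size_t refcount;
} shared_obj;

static void shared_init(shared_obj *o) {
    atomic_init(&o->refcount, 1);
}

static void shared_incref(shared_obj *o) {
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

/* Returns 1 when this call dropped the last reference, 0 otherwise. */
static int shared_decref(shared_obj *o) {
    return atomic_fetch_sub_explicit(&o->refcount, 1,
                                     memory_order_acq_rel) == 1;
}
```

Nothing in this sketch needs C++; the trade-off is only whether every compiler you target ships the header without extra flags.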

If atomic support is something that most Python packages will need, and CPython already needs to implement and maintain a cross-platform C implementation of it, then it seems reasonable that Python packages could make use of that as well.

1 Like

Only if you don’t want to pass an experimental compiler option I suppose?

Besides atomics, the OP mentioned shared-exclusive (or read-write) locks which are also provided by C++.

Just for the record, I have no issue with using other languages than C, so I can’t provide any direct perspective to the people that would need it to be in C. I do think there’s potential value in being able to ask the interpreter for the best lock available on the platform, and having the effort and design of that be centralized[1], but I’m not ready to argue that line further without more exploration.

For now, in the case I can speak to, I’m not going to provide prebuilt wheels and will have a slower pure-Python fallback available. The target audience at this stage should have access to build tools for their system.

Later on, if there’s demand for prebuilt wheels, runtime dispatch should be fine in this case; I was just hoping to avoid that.

Having now seen the proposed way they will work, I can say that I don’t find them to be a reasonable solution to the problem. They appear to be intentionally designed to trade the convenience of library authors against the interests of their users. While finding the right balance is always going to be difficult, I think this leans too far into trading flexibility against the existing ecosystem expectations of wheels and the properties of wheels that people do rely on.


  1. Not just for maintenance-related reasons. There are options in this direction that are close analogs to the Linux kernel’s futexes, or, if people are more familiar with it, WebKit’s parking lot. ↩︎

1 Like

I should have been more precise; I realize the word “possible” could be read as several kinds of possibility here. It’s possible to build a wheel that auditwheel and other current tools believe is fine to call abi3 x86-64 manylinux1 (or any other combination of python abi, cpu arch, and os base), but that fails on another machine those labels apply to, because of differences the tools do not currently account for.

With the word “possible”, I meant in theory, based on current, reliable guarantees across tools and hardware, not “I have a specific case where I know it can happen”. I’m trying to avoid ever being in a situation where it happens by finding a way to stay within what is guaranteed by the tools I’m building upon, and therefore also making this something I can confidently support publicly[1].

I’d be interested in discussing various related topics given the right timing, but I am hesitant to commit to much currently. I’m pretty sure I will not be contributing code directly to cereggii in the near future, but not because I disagree with it or don’t find it valuable. I have been cutting back on the amount of my personal time that I spend looking at code for other reasons. I may be more able to contribute directly once I regain a better baseline of balance there for some time.


Thanks to everyone who commented for the various perspectives. I’m content with the level of agreement that there is here and don’t feel the need to push for anything more to be done, in cpython or externally. I have design questions I’d like to go back to thinking about. I still think there are some missing pieces as in the title, but maybe there are more narrowly targeted answers than I was originally considering.


  1. At this point, the stuff I want for work already exists. I just want to feel better about how I would be able to open-source this for others. As-is, I’m not ready to open source certain things even though I have permission to do so, as I view releasing software intended for a purpose as something people are morally responsible for to an extent beyond what is true legally, based on licenses; I don’t currently feel like I would be up to my own standards here. ↩︎

2 Likes