What about adding a #define/-D that can be used to remove the unsafe APIs from the header files?
I could then opt into removing the unsafe APIs and get compiler errors when migrating to free-threading.
That implies your project doesn't compile until you fix all potential issues, which seems a bit too severe. Also, some APIs like `PyDict_Next` are apparently safe/unsafe depending on the context. A lint-like utility that would flag potentially dangerous constructs would be more flexible.
It would be a reliable way to find unsafe usage.
I can always remove the -D from the compile once I have the list of code to fix.
Then I can fix incrementally and retest with the -D to confirm I've fixed all unsafe calls.
I'd like to give an example of why I (as a "pure Python" programmer) find it so hard to understand what free-threaded Python means to me - and hence, what needs to be part of the documentation I'm referring to above. I apologise if this is a bit long, but I think context and a concrete example are important here.
I have a casual interest in simulating various dice and card games using Python. This typically takes the form of simulating a game turn a large number of times, and summarising the results. To give a relatively simple example, working out how many spaces you move on average in the game Monopoly. The simplest way is just to define a function that calculates the result, and run that a million or so times. That's often sufficient, but for more complex problems, it can be pretty slow. To speed it up, I thought of using threads, as that would share the work across my multiple cores. Unfortunately, as my calculations are CPU-bound (a number of calls to `random.randint()` plus arithmetic and conditional tests), the GIL prevents them being run in parallel, and the threaded version is significantly slower than the serial version (because of the overhead involved in setting up worker pools, etc.).

Given that the GIL is my problem here, I expected free-threaded Python would give me the speedup I was hoping for. However, that doesn't seem to be the case - in a simple test, my serial run took about 1 second, and the threaded (with GIL) run took 12 seconds. The free-threaded interpreter took 25 seconds - twice as long as with the GIL!
I'm not doing anything clever here; my thread loop is simply:
```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait

def main():
    results = Counter()
    with ThreadPoolExecutor() as exc:
        futures = [
            exc.submit(one_result, results)  # one_result is defined elsewhere
            for _ in range(1_000_000)
        ]
        wait(futures)
    print(results)
```
There are two fundamental problems I have here.
If we are switching to a free-threaded model, I think part of that switch has to include re-educating Python users who have internalised the "threads are for IO-bound workloads, the GIL makes threading useless for CPU-bound tasks" message of the past. And part of that re-education has to be for us to understand, and be able to articulate, what the new message actually is.
By the way - I don't want advice on how to improve my code above. It's a throwaway example, knocked up in a couple of minutes. It's intended as an example of sloppy code written with a mindset of "I have 8 cores, if I use them all, I'll get 8x the speed". Maybe that's not the target use case for the free-threading build - but if so, how do we make that clear to the average user?
First, thanks all for the pings in this thread. It's an honor that my opinion is valued.
I'm writing a longer reply to all the points brought up, but wanted to quickly reply to this:
Is there any chance you can share your code anyway? I'm curious to see where the performance bottleneck is.
IMO your simple example probably should have led to the speedup you were expecting, unless there's something else going on. The fact that it doesn't deserves investigation.
That'll likely have a global seed state that's shared between threads and needs locking. So if that dominates I wouldn't necessarily expect a speedup.
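To see that shared state for yourself: the module-level functions are bound methods of a single hidden `Random` instance, which a one-liner can confirm.

```python
import random

# Each module-level function is a bound method of the same shared
# Random() instance created at import time, so every thread using
# random.randint() etc. contends on the same internal state.
print(random.randint.__self__ is random.random.__self__)  # True
```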
You might be interested in this note in the `random` module documentation. This program, or one very similar to it, was discussed previously, and you received a number of helpful suggestions on how to write it in a way that it scales well.
Is it possible to detect improper concurrent access to resources in Python code programmatically? Having something like a `-X warnmultiaccess` flag that warned could be helpful for finding these bugs (in Python code itself).
Even looking around company code, I once in a while see things that work only because of the GIL - things like adding to lists in threads without an explicit lock.
Then we have a plethora of historical answers, external documentation, etc. that rely on it as well. People saying that we don't need a lock here because of …
Without some sort of detection or linting that sees it, I'm not really sure how we "stop the bleeding."
Some kind of performance counter to highlight contention (count of parking-lot uses? I trust Sam & co. to choose better than me) would be very useful. Even if it's just a printed statistic at the end of execution, it's at least a not-entirely-theoretical way to measure whether code is parallelisable or not.
For the stable ABI: I see a way forward, but with how things are interconnected I'm having trouble pinning it down and breaking it into pieces that are PEP-sized or smaller. See Pre-PEP: PyModuleExport -- a new export hook for modules for where I am now.
Given what I emphasized there, I think Phase II should be started in an early alpha release. That is, announce it now, but for 3.15.
Sorry for pushing back! I fully realize the sorry state of the stable ABI for free-threading is partly my fault; looking back, I should have made it a bigger priority.
Sure - it's at montecarlo.py · GitHub
I didn't share it because there's very little in it beyond the main function I posted.
I'm glad to hear that your intuition was the same as mine. That suggests that there's not as much education needed as I feared.
Yes! We've discussed this briefly a while ago in a Faster CPython Sync, but I hadn't written it up as an issue. I've filed Use pystats for free threading performance statistics · Issue #131253 · python/cpython · GitHub to outline some ideas.
Yes. I hadn't recalled that discussion, so thanks for the link, but the details there aren't really my point here. I routinely write this same bit of code from scratch, and it's never something I'm thinking of in terms of scalability. So I don't try very hard to optimise it, and that's really what I'm trying to get at here - how do we make sure that people don't do the wrong thing when they aren't thinking beyond "split the load across my cores"?
Some immediate takeaway thoughts:
1. Using the module-level `random` functions (which share a single `Random` instance) from multiple threads is OK, but scales badly under free threading. If it performed badly on threads under the GIL as well, I'd be OK with this, but having previously-working code slow down this much is far from ideal. Particularly when it's far from obvious how to use separate instances of `Random` per thread, for example when using a thread pool like I am. At a minimum, a recipe in the docs would be helpful, and maybe even a runtime warning[1].
2. As I said, I'm not demanding that this exists now; I'm simply saying it should be part of the criteria for when free threading is ready to be marked as "supported".
FWIW, I changed my code to put an RNG instance in thread-local storage on worker startup and use that. It didn't improve performance at all.
[1] For what it's worth, there's a whole bunch of other issues with sharing an RNG across threads, around potential degradation in randomness - but I don't care about that for this application, and I don't think we need to worry about educating anyone who is writing code that cares about those issues.
Also, I'll note that my main point then was that there's no information on how to write such code correctly under free threading, and that wasn't addressed - which is why I'm reiterating the same point now.
This is why I don't like offering code examples. People naturally advise on how to fix the code, and the point that I had no means to find out how to fix the code without asking people gets missed…
This is my last day at work before I go on vacation for a week, so apologies in advance if I take a while to get back to replies to this, but my goal on this trip is to not be on my phone or computer very much.
I've been working on community support for free-threaded Python for about a year now. I started with relatively little multithreaded programming experience, beyond a teeny bit of OpenMP coding via Cython. I'd never used the Python `threading` module. It was daunting to have to learn an entirely new programming paradigm, but I've come out the other end convinced that it's worth it and that I can help others by communicating what I've learned and building community knowledge.
IMO, the raw computational power unlocked by the free-threaded build on modern processors more than justifies the work that will be necessary to get everything working and make multithreaded parallelism a first-class tool in Python.
As of now, we have the "base" of the scientific Python stack working: NumPy, SciPy, matplotlib, pandas, and scikit-learn. Code generation and language interop tools like Cython, PyO3, pybind11, and f2py are all working too. We're still waiting on a Cython release, but the others all ship releases supporting the free-threaded build. For the past few months I've been focusing on projects that ship Rust native extensions and depend on PyO3, to help unblock that corner of the ecosystem.
I'm really excited about the possibilities available when people start migrating from process-pool-based parallelism to thread-pool parallelism. Take a look at this NumPy PR, which fixed a multithreaded scaling issue for ufunc evaluation that was reported against the first release of NumPy to support the free-threaded build. The details of the code are less important than the graph in that PR, which clearly demonstrates that doing more or less the exact same workflow using NumPy is substantially faster using a free-threaded thread pool. Process-pool parallelism is leaving a ton of CPU performance on the table. Of course that was the primary motivation for starting the experiment, but it was really exciting to actually generate that graph on my laptop and see first-hand how much faster free-threaded Python can be.
I'm also excited about the morning of the first day of talks at PyCon, where there will be a series of talks in Hall A on free-threaded Python, including one from myself and @lys.nikolaou which aims to cover the content at https://py-free-threading.github.io in talk form, to the extent that's possible. I'm hopeful that our talk will be a lasting community resource for those who learn about programming via videos and lectures rather than reading documentation.
This is a great point, one I fully agree with.
While up until yesterday the content at py-free-threading.github.io was very native-extension focused, I was prompted by your reply in this thread to spend most of yesterday updating it, adding content to make it clearer what users and project maintainers need to do to get their code working. See here: py-free-threading
I also split our original single porting guide page into three pages: one on thread-safety issues in pure Python code, one on multithreaded testing, and a third on native extensions. That way, projects that don't have native extensions don't see this content and get scared away.
We've been working on py-free-threading.github.io since last summer, and our hope is that it will be the go-to place for questions about free-threaded Python, at least for content that doesn't make sense in the CPython documentation. Please please please open issues telling us what needs to be improved or letting us know about mistakes. Contributions are also very welcome.
It depends a little on what you mean by "work". If the module already has extensive multithreaded tests under the GIL-enabled build, then it will be easy to see whether or not free-threading introduces new kinds of bugs.
If the code is for a CLI app or really any kind of user-controlled application where the user decides whether or not to create a thread pool and use code in a multithreaded context, the user or CLI app author can do testing to make sure their internal use of threads is safe. Tools like `pip` that do not have a public Python API also don't need to worry about making internal Python code thread-safe, unless they would like to use threading internally.
Library authors will need to do a little more work. This is particularly acute for libraries that do not have good multithreaded tests. However, as @thomas points out later in the thread:
I ran into exactly a situation like this last week working on the `cryptography` library. It turns out the `_ANSIX923PaddingContext` is implemented in Python and uses an internal `bytes` buffer to store state. If two threads simultaneously update the context, there is a possibility that the threads can race to update the bytestring, leading to silently incorrect results.
It's easy to trigger this on the GIL-enabled build by doing e.g. `sys.setswitchinterval(.000001)` before running a multithreaded test: `_ANSIX923PaddingContext` isn't thread-safe · Issue #12553 · pyca/cryptography · GitHub. That doesn't mean it can't happen on the default configuration, just that it requires an "unlucky" thread switch that happens to trigger the race. I wouldn't be surprised to learn about rare crashes in production systems on the GIL-enabled build using thread pools due to issues like this.
This is also a good point. I've opened an issue to track adding examples demonstrating how the free-threaded build makes things that were previously impossible possible: Add a page to the guide showing why disabling the GIL is awesome · Issue #149 · Quansight-Labs/free-threaded-compatibility · GitHub
There's subtlety here too. One example is that (at least according to @cfbolz, I think this is what he's getting at in his reply in this thread) a change in Python 3.10 to how bytecodes release the GIL means that pure Python code releases the GIL far less often than in Python 3.9 and earlier, as well as PyPy: Races on appending bytes are much easier to trigger in PyPy than CPython · Issue #5246 · pypy/pypy · GitHub
There are likely more cases like this. We need better tests to find them all.
That's true! IMO, adding multithreaded tests of objects with mutable state is important before you can call your module thread-safe.
I also want to point out that it's completely valid (at least IMO) for a project to not guarantee any thread safety and leave it up to the user to evaluate whether or not their code is thread-safe.
For example, NumPy does not guarantee (and never has) the thread safety of ndarray: Thread Safety – NumPy v2.3.dev0 Manual. We're also aware of a couple of free-threading-specific thread-safety issues in NumPy that we're not yet prioritizing because we haven't gotten any reports about problematic uses yet.
IMO making ndarray thread-safe is not the correct way forward. We could add locking, but doing it in a way that allows scalable multithreaded performance (e.g. without causing any regressions to existing multithreaded workflows in the GIL-enabled build) would be a large engineering effort. NumPy also exposes a number of C API functions that allow direct access to all kinds of low-level memory buffers allowing unsafe mutation in a multithreaded context.
Even with all those issues though, people have been using multithreaded parallelism with NumPy for years. Dask, for example, supports thread pools as a parallelization strategy, and NumPy has fixed a number of issues seen by dask users over the years.
Even though users are allowed to do unsafe things, they are still productively using NumPy in a multithreaded context by avoiding multithreaded mutation. Many, many production workflows end up as an embarrassingly parallel operation that uses read-only access on shared ndarrays.
IMO, long term, the way forward is for NumPy to offer better support for immutable ndarrays. This may eventually involve adding a borrow checker to NumPy's ndarray buffer to enforce, at runtime, a Rust-like "writes XOR reads" pattern on data owned by an ndarray.
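That borrow checker doesn't exist yet, but the read-only half of the idea is already expressible today with the `writeable` flag; a small sketch:

```python
import numpy as np

arr = np.arange(1_000_000)
arr.flags.writeable = False  # freeze the buffer before sharing it

# Any number of threads may now read arr safely; attempted mutation
# raises instead of silently racing:
try:
    arr[0] = 42
except ValueError as exc:
    print(exc)  # assignment destination is read-only
```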
Why would it be unpopular to have it in the stable ABI? In the GIL-enabled build, the begin and end critical section macros are just an open bracket and a close bracket. Of course they'd be more complicated in a free-threaded stable ABI, but they'd also be necessary for thread safety in many C extensions. I don't see what the problem is.
Mutating a shared ndarray isn't thread-safe, and right now we don't plan to make it safe, via Python or the C API.
We're doing our best to report and fix crashes or memory unsafety due to mutation of shared arrays, but I don't think it's possible to fix all the data races one could trigger by mutating a shared array. The same is true on the GIL-enabled build, and it is relatively easy to trigger because mutating ndarrays releases the GIL (except for object arrays or other limited exceptions).
What I would like to see is better support for immutable ndarrays or maybe an alternate array type that has better guarantees about that immutability and less legacy baggage.
If PyArrow has bigger needs for a thread-safe ndarray, or for making the NumPy C API thread-safe, then I'd love to start a discussion about that on the NumPy mailing list or issue tracker. I'm very interested in having some design discussions about this to inform long-term plans in NumPy.
This is true. I've been trying to add documentation and examples of how to write multithreaded stress tests.
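As a rough illustration of the shape such tests usually take (a sketch only, not taken from any particular project), the key trick is a `threading.Barrier` so all threads start hammering the shared object at once:

```python
import threading

def test_concurrent_appends():
    num_threads = 8
    iterations = 10_000
    shared = []  # swap in the object under test
    barrier = threading.Barrier(num_threads)

    def worker():
        barrier.wait()  # line the threads up so they overlap maximally
        for i in range(iterations):
            shared.append(i)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(shared) == num_threads * iterations
```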
Here are some recent examples I've come up with on projects I've added free-threaded support to:
I think we should make it easy to run a native profiler against CPython so you can see for yourself where the issue is. We have an open issue to better document this, as well as to add tooling to discover multithreaded scaling bottlenecks: Documentation and tooling for detecting multithreaded scaling issues and regressions · Issue #117 · Quansight-Labs/free-threaded-compatibility · GitHub
I also agree with @steve.dower that a built-in way to identify contention on resources that are protected by PyMutex locks would be neat.
Finally, we also opened an issue to track the need for documentation on debugging multithreaded performance in the free-threading guide: Add a multithreaded performance section · Issue #151 · Quansight-Labs/free-threaded-compatibility · GitHub
It might make sense to change the random module to use a thread-local Random instance by default, rather than a global one with thread-safe access, as a result of free threading, specifically to avoid this kind of contention - but I don't know if such a change would have unexpected or unacceptable side effects.
Generally speaking, any shared mutable state will continue to have contention, and code that has it should try to avoid sharing it across threads when possible to see the most benefits. This seems obvious to me, so I'm not the right person to know where explicitly stating this would be useful to other people, but this kind of guidance for those who haven't had to consider it before should definitely exist somewhere.
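As a purely hypothetical sketch of what a thread-local default could look like (this is not how the `random` module is actually implemented, and all names here are made up):

```python
import threading
from random import Random

_locals = threading.local()

def _thread_rng():
    # Lazily create one Random per thread instead of sharing one globally.
    rng = getattr(_locals, "rng", None)
    if rng is None:
        rng = _locals.rng = Random()
    return rng

def randint(a, b):
    # Module-level convenience function with no cross-thread contention.
    return _thread_rng().randint(a, b)
```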
In their current form, critical sections are stack-allocated, so they can't be opaque. That probably means you have to nail down at least some of the details of the current implementation. I'd be surprised if that didn't raise a few objections. It may be that the implementation is sufficiently obvious that people are happy to go with it, though.
That's certainly one approach. Like you, I don't know if this sort of change would be considered backward compatible, though.
An alternative that I'd be perfectly OK with is a twofold approach:

1. Reframe the `random` documentation to make using an explicit `Random` instance, and calling methods on it, the preferred approach, with the module-level functions being presented as convenience functions, explicitly not recommended in a multi-threaded context.
2. Make it easier to set up per-thread state (such as a per-thread `Random` instance) when using `concurrent.futures`.

Regarding the latter, the current pattern appears to be:
```python
import threading
from concurrent.futures import ThreadPoolExecutor
from random import Random

data = threading.local()

def init_state():
    data.rng = Random()

with ThreadPoolExecutor(initializer=init_state):
    ...
```
That's not exactly complex, but you need to look in a few places (the `threading` docs and the details of the `ThreadPoolExecutor` constructor) to find all the pieces. Plus, having to use a global variable to hold the thread-local values feels less than ideal. If moving shared global state to thread-local is likely to be a common pattern in free-threaded Python, it would be useful to make it easier. (It's a good practice even with the GIL, but having the GIL makes it less risky, so you can get away with not doing it - this is one of those "your code is already broken" cases that people with limited multi-threading experience won't necessarily think of.)
Some of these details don't have to be dealt with on day one, but IMO we should be striving to make people's initial impressions of free-threaded Python (once it's declared "ready to use") as good as we can manage. And a few quality-of-life changes could make a lot of difference.
For 2, I think contextvars – Context Variables – Python 3.13.2 documentation is preferred? (Although it's definitely a module I only became aware of recently.) It definitely feels like something that needs broader teaching around.
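For instance, a sketch of the per-thread RNG pattern using `contextvars` instead of `threading.local` (the `one_result` name just mirrors the earlier example; the worker body is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from contextvars import ContextVar
from random import Random

rng_var: ContextVar[Random] = ContextVar("rng")

def init_state():
    # Runs once per worker thread; each thread's context holds its own RNG.
    rng_var.set(Random())

def one_result():
    rng = rng_var.get()
    return rng.randint(1, 6) + rng.randint(1, 6)

with ThreadPoolExecutor(initializer=init_state) as exc:
    futures = [exc.submit(one_result) for _ in range(10)]
    print([f.result() for f in futures])
```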
I've been looking at reducing critical sections, system calls, and locking in BufferedIO (Ideas thread / proposal soon™, 3.15 timeline). That would mean a lot less overhead, and potentially no userspace / Python locking in read-only and write-only I/O (`stdin`, `stdout` and `stderr`). I know techniques and have tools like atomics to do what I need in the CPython C internals, but have no idea how I could implement the same thing in native Python (ex. `_pyio`). I know `_pyio` is contentious, but the delta in what can be expressed / capabilities is to me a problem. The ecosystem often grows faster once something is part of batteries-included, but it doesn't feel like the basics are there yet, and definitely not general directions / guidance with patterns for adopting.
Watching the PRs for free threading going through CPython modules, it's a lot of often intricate work relying on a lot of systems tools and general systems programming knowledge (ex. ThreadSanitizer, understanding lock contention, etc.). All of that is teachable if people really want to learn it, but for most Python code I write I really don't want to think about it. Recently things like the `cgi` vs. `http.server` modules have come up, and I don't know what the right thing for free threading is, nor how I'd measure why things are getting slower, or what tools would help me write better code and understand what is happening and why. It's great to have a range of options, but in terms of "I want to write simple Python code to handle HTTP requests" I don't have a clear picture of where I need my code to be headed, let alone where I would go from a simple "flask" app. CGI → WSGI I did quite a while ago. Do I do `asyncio` now? Free threading? Both at the same time? How do I make sure the performance I had before doesn't regress, especially in production code bases that often don't have comprehensive test and performance suites?
TL;DR: I think there need to be clear paths for why and how maintainers of systems get to free-threading (it might not be for a couple more versions), how ongoing app development can adopt it, and how new projects can write simpler/faster code from day one with it. All very solvable, but today both the tools and the education are at the very least not visible to me.
I know you don't want advice on how to improve your code, but that's not the pattern I would use. There are lots of valid ways to write most Python programs. Keeping data private to a thread doesn't require `threading.local`. In many cases it's simpler just to use local variables: montecarlo.py
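For illustration, a sketch of that approach (my numbers and helper names are invented, and this is not the linked montecarlo.py): give each worker a coarse chunk of iterations, and let it build its own `Random` and its own partial result as plain locals, merging at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from random import Random

def simulate(n):
    rng = Random()        # local to this call: nothing shared, no contention
    results = Counter()   # private partial result
    for _ in range(n):
        results[rng.randint(1, 6) + rng.randint(1, 6)] += 1
    return results

def main():
    total = Counter()
    with ThreadPoolExecutor(max_workers=4) as exc:
        # A few coarse-grained tasks rather than a million tiny ones.
        for partial in exc.map(simulate, [250_000] * 4):
            total += partial
    print(total)

if __name__ == "__main__":
    main()
```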
Regarding your other concerns about `Random`:

1. It's safe to call `random.randint` from multiple threads. (`random.gauss` is not thread-safe, but that's not new nor specific to free threading.)
2. I don't think we need to guarantee that `random.randint()` scales well across multiple threads before going from "experimental" to "supported". This same issue exists in Java. It also existed in Go for many years.

In general, if you're not thinking about scalability and don't try to optimize your code, I don't think you should expect your code to run efficiently.
> Reframe the `random` documentation to make using an explicit `Random` instance, and calling methods on it, the preferred approach, …
If we are considering reframing the official Python documentation so that the preferred approaches align with efficient free-threading practices, maybe we've gone beyond "Phase 1: experimental, not for production use"…