While I agree with this statement in general, I don’t think having code run 100% slower is OK either. You probably need some sort of tool that finds these degenerate cases and warns people about them, or to make sure these cases are fixed. It might be worth adding this example to the test suite just to understand how free-threading affects normal yet unoptimized code.
Well, looking at montecarlo.py · GitHub, I think the main explanation is quite simple: your task function runs in an extremely short time. Most of the time it’s breaking out in the first iteration. It probably doesn’t take more than a couple µs (microseconds) to run. If you run it under ThreadPoolExecutor, then probably most of the time is taken by the ThreadPoolExecutor logic, not by your task function.
(you can validate this hypothesis by using a ThreadPoolExecutor with only one worker, GIL or not)
My intuitive rule of thumb is that you probably want a single task invocation to be on the order of a millisecond or more, for parallelization to not suffer from such effects.
In this case, this is of course easy: instead of doing one simulation round for each task invocation, do N of them (with N being in the thousands, perhaps).
And, of course, you’ll want to take care of the shared random state. And perhaps the Counter too.
Edit: performance apart, your code is probably not thread-safe. results[total] += 1 from several threads (with results being a shared Counter object) may miss some increments. Instead you probably want your task function to return its own Counter, and then you merge all results at the end of your main function.
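In sketch form, that pattern looks like this (the function names and the placeholder simulation round below are illustrative, not taken from the linked script):

import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def task(n_rounds):
    # Private RNG and private Counter per task: no shared mutable state.
    rng = random.Random()
    counts = Counter()
    for _ in range(n_rounds):
        total = rng.randint(1, 6) + rng.randint(1, 6)  # placeholder round
        counts[total] += 1
    return counts

def main():
    results = Counter()
    with ThreadPoolExecutor() as executor:
        # Each task invocation does thousands of rounds, and the merge
        # happens in a single thread, so no increments can be lost.
        for task_counts in executor.map(task, [10_000] * 100):
            results.update(task_counts)
    print(results)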
To give a simple example, with 8 threads, each thread only sees every 8th value from the RNG stream. For some RNGs, that can be significantly less random than looking at every value. It’s not for Mersenne Twister, which the random module uses, but that’s technically an implementation detail. And this isn’t important for a toy problem like this, of course, but if you’re doing serious simulation work you should consider it.
All I was trying to do by mentioning this was to note that I do understand the problems involved with running simulations across multiple threads, and to reiterate that this was a toy solution to a toy problem, intended to demonstrate what (in my opinion) a naive user enthusiastic about the new free threading model might come up with. I clearly failed with that, so I should probably never have mentioned this.
I’m sorry. I tried to make it clear that this was a toy example, and optimising it wasn’t the point, but I failed. Yes, multithreading is a loss here (performance with the GIL proves that). And yes, that’s not surprising. But free threading being slower than the GIL is both surprising and disappointing.
In real examples, I’d typically either just run serially and not worry about runtime, or I’d have a far more complex calculation to simulate. Or (more likely than any of the above) I’d reframe the problem so that I can use numpy. But reframing a simulation problem in array-based terms isn’t always easy, and these are only ever casual problems, so “split the problem into 8 pieces and run them in parallel” is the sort of thing that feels like it should be a quick win.
Ultimately, my goal here is to have a recipe that makes sense to apply to problems like this. That idea does make sense to factor into such a recipe, and I’ll try to remember to use it in future.
Yeah, I was very conscious of that. I don’t really know enough about the rules to accurately judge how to write thread safe code in Python[1]. And judging by the debates I’ve seen over the years, I’m far from alone in this. Even with the GIL, people had very vague ideas of what was thread safe and what was not, and how to write thread safe code without locking - adding locks is rightly perceived as very hard to get right, and most people’s (incorrect) intuition is that the GIL saves them from needing to do so.
The problem here is that if we’ve done our job right, people will be enthusiastic to try free threading when it becomes “supported” - and if we don’t want that experience to be a bad one, we need to help give them better mental models to work with. The alternative, IMO, is to present free threading as “a specialised tool that will help people skilled with threading to use it more efficiently”, so that people don’t have unrealistic expectations.
Anyway, I apologise to everyone for hijacking the discussion with code reviews of a script that was never intended to do more than illustrate a point. I suggest we drop this digression and go back to the original discussion about PEP 779.
I have a number of patterns I use if I’m writing production-quality code, but for a throwaway example like this I didn’t want to obscure the main point, or spend too long crafting the code ↩︎
Anything so highly correlated has no business advertising itself as an RNG. For reference, see this deep dive on how numpy upgraded its PRNG so that it would not be vulnerable to scenarios where, out of millions of streams drawing billions of samples each, the probability of two streams becoming correlated would become non-negligible.
I agree that it’s good to protect users against such pitfalls (and by extension, if CPython had an RNG implementation susceptible to this, it would need to be upgraded), but I find the argument “users might run into auto-correlated RNGs” to be purely hypothetical, given the orders of magnitude that are necessary to trigger issues in modern RNG implementations.[1]
as an aside, using numpy’s RNG doesn’t need an array-based rewrite, you can just continue to draw singleton samples ↩︎
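For illustration, numpy’s SeedSequence mechanism is the usual way to get statistically independent per-thread streams (the seed and stream count here are arbitrary):

import numpy as np

# Spawn one statistically independent child seed per worker thread,
# then give each worker its own Generator.
root = np.random.SeedSequence(12345)
streams = [np.random.default_rng(child) for child in root.spawn(8)]

sample = streams[0].integers(1, 7)  # a single die roll from stream 0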
I did some additional experimentation with that particular example, and frankly found the results a bit baffling (testing on an 8 core/16 processor system): montecarlo.py · GitHub
One thing was clear: instantiating random.Random() in each thread is expensive (tripled the threaded runtime on the GIL-enabled build). Beyond that, things I expected to help a lot (like switching to a map-reduce execution style) only helped a little (and were still an order of magnitude slower than the naive serial processing implementation).
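One way to avoid paying that instantiation cost on every task is to create the Random instance once per worker thread and reuse it, e.g. via threading.local. A sketch of the idea (not what the gist currently does):

import random
import threading

_tls = threading.local()

def thread_rng():
    # One Random instance per worker thread, created lazily on first
    # use and reused for every subsequent task that thread runs.
    if not hasattr(_tls, "rng"):
        _tls.rng = random.Random()
    return _tls.rng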
So it looks like the “work per thread” needed to justify the threading dispatch overhead is genuinely quite high, and this specific example isn’t complex enough to reach it. Edit: which is the point @pitrou already made earlier in the thread.
Edit 2: applying the suggestions from @pitrou and @colesbury above to partition the map/reduce problem appropriately based on the number of cores available finally gives the kind of free threading performance gains we would expect to see (wall clock time reduced by a factor of 5):
$ time ./python3.14 ./montecarlo.py --map-reduce
Counter(... snip ...)
real 0m0.967s
user 0m0.967s
sys 0m0.010s
$ time ./python3.14t ./montecarlo.py --map-reduce
Counter(... snip ...)
real 0m0.188s
user 0m1.696s
sys 0m0.031s
Edit 3: My takeaway from this is that we may want a brief parallel task partitioning tutorial somewhere on py-free-threading.github.io. The following is the approach I used to get the results above, but it’s “the first solution I came up with” code rather than necessarily being the best way to do it:
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = os.process_cpu_count()  # new in Python 3.13
NUM_RESULTS = 1_000_000

def map_reduce_main():
    # Split the total workload as evenly as possible across the cores
    results_per_core, extra_results = divmod(NUM_RESULTS, NUM_CORES)
    partitions = [results_per_core+1]*extra_results + [results_per_core]*(NUM_CORES-extra_results)
    assert len(partitions) == NUM_CORES
    with ThreadPoolExecutor() as exc:
        results = Counter()
        for partition_result in exc.map(n_results, partitions):
            results.update(partition_result)
    print(results)
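For reference, n_results is the per-partition worker; in sketch form (with a placeholder loop standing in for the actual simulation round) it has this shape:

import random
from collections import Counter

def n_results(num_rounds):
    # Per-partition worker: a private RNG and a private Counter per
    # call, so no mutable state is shared between the worker threads.
    rng = random.Random()
    counts = Counter()
    for _ in range(num_rounds):
        total = 0
        while total < 4:  # placeholder simulation round
            total += rng.randint(1, 6)
        counts[total] += 1
    return counts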
I’m personally fairly comfortable with this as a policy - by necessity, concurrency has a lot of pitfalls. I definitely don’t think that the absence of pitfalls is necessary for success (although if it failed to offer an improvement on every example, that would be different).
The main argument for moving to step 2 seems to be encouraging third-party packages to test it. Is there a huge number of people putting it off because it’s experimental? Certainly, from what I’ve seen, most of the core numeric packages have jumped in fairly enthusiastically.
The other thing that I don’t think is there yet is the Windows pyconfig.h header situation, where extensions built on Windows will typically crash immediately and mysteriously because there’s one set of headers shared between both builds. We certainly failed to get Cython set up to test free-threading on Windows in CI because of it.
I don’t know how many people that affects, but it seems like a significant issue that’s kind of been overlooked.
I think that this is the only argument for it. There are two sets of people downstream from CPython: library maintainers and end users.
For end users it does not make sense at all to say that free-threading is “supported”. It is very clearly new, experimental, and untested. Many packages won’t have binaries, or will have binaries that are themselves untested and experimental. If you are talking to a class of new Python users you are definitely going to tell them not to install the free-threading build, because it clearly is experimental regardless of what any PEP says.
For library authors it makes sense to say that free-threading APIs are considered to be stable now. I’m not sure that describing the free-threading build as “supported” is necessarily the way to do that. Anyone who has been following Python development at all will be aware that the intended direction of travel is for no-GIL to be standard.
I don’t think there are library authors putting it off because it is experimental. Rather it takes a lot of work to make all the packages in the ecosystem compatible with the free-threading build and that takes time. The core numeric packages have jumped at this but have also had significant resources and help to make this happen. The rest of the ecosystem will take a lot longer.
As I am sure you are aware, the latest release of Cython does not work with the free-threading build at all. It gives an immediate build failure that cannot be worked around except by using either a prerelease or the master branch of Cython. I have tried to follow the Cython master branch to test free-threading, but it is difficult because other breaking changes keep cropping up.
I also don’t think that there is a widespread agreement or understanding about what it means for a package to be thread-safe under the free-threading build. A basic question is this:
Should extension module authors guarantee that it is impossible from pure Python code using only public interfaces to do anything that results in memory corruption or undefined behaviour?
Or should it be documented that certain things should just not be done in the free-threaded build like mutating particular extension types if they are shared between multiple threads?
For python-flint I don’t know the answer to this question yet. Cython has only just added the critical section context manager, which is what we would want to use if attempting to guarantee no memory corruption ever. I have not tested it yet to see what overheads it introduces, but I assume that it is the sort of thing that makes the free-threaded build 15% slower, so I am not sure that we would want to use that lock in all cases. I think it essentially amounts to slowing everything down just to give nicer error messages to users who are probably using threading wrong anyway.
This basic question is what @ngoldbaum alluded to above:
I am not sure that there is widespread understanding or agreement about what this means. For now I think it is fine to put out binaries for the free-threading build that are not fully “threadsafe” in the sense that some people would expect, because the free-threading build is “experimental” - and I will still consider it to be experimental regardless of this PEP.
Conversely, I do think that many library authors (specifically authors of pure Python libraries) are doing nothing, either because they think free threading doesn’t affect them, or because they have no idea how to establish whether free threading affects them or not. Maybe this is fine (such libraries probably don’t worry much about threading under the GIL, either), but whether ignoring such libraries is fine seems to me to be precisely the sort of thing this PEP should be explicit about.
I’m not sure there necessarily has to be agreement. I think both options are acceptable and you’re completely allowed to pick the one that works for you.
Taking python-flint as an example - it’s probably unacceptable if it crashes when two threads call fmpq.bernoulli(64) (copied from your readme…) at the same time. They really should be independent (or at least you should be locking to hide the dependence if not). But crashing if someone’s trying to modify the same fmpz instance from two threads at once might well be OK.
So yeah - I don’t think this requires everyone to converge on the same decision.
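As an aside, a crude way to check the “independent calls must not crash” expectation is to hammer the call from a thread pool. This assumes python-flint is installed and exposes fmpq.bernoulli as in the readme:

from concurrent.futures import ThreadPoolExecutor

from flint import fmpq

# Many threads making supposedly independent calls at the same time.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda _: fmpq.bernoulli(64), range(10_000)))

# Every call computes the same constant, so all results should agree.
assert len({str(r) for r in results}) == 1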
Maybe worse, because you’re not doing the fancy atomic trickery that the Python interpreter is doing internally.
Yes, probably true. For now Cython is choosing not to do this type of “auto-guard everything”. And I expect a lot of people will come to that conclusion themselves too.
Can’t it be both? Certainly, that’s been my experience.
I put it off at first because it looked like things were all over the place. Then I gave it a go and discovered I was blocked by numpy support. Then 3.13.0t and cp313t numpy wheels came out and I discovered I was blocked by Cython support. I’m past that now, and since then I’ve been blocked by not knowing what the existing thread safety guarantees are. I only found a concrete answer to my biggest question yesterday:
It’s recommended to use the threading.Lock or other synchronization primitives instead of relying on the internal locks of built-in types, when possible.
The existing advice on free threading is too distracted by concurrency in general to explain the specifics of free threading or even threading in Python. I still have no idea what the rules are for ctypes…
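Read literally, that recommendation amounts to something like the following - guard shared state with an explicit threading.Lock rather than leaning on the internal locks of built-in types (the names here are illustrative):

import threading

shared_totals: dict[str, int] = {}
totals_lock = threading.Lock()

def record(key: str, amount: int) -> None:
    # The read-modify-write happens as one atomic unit under the lock,
    # instead of relying on dict's internal locking.
    with totals_lock:
        shared_totals[key] = shared_totals.get(key, 0) + amount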
Elevating the support status increases the pressure on those who are trying to accommodate free threading but does nothing to unblock them.
Thanks - that would mostly work and I think we did try something like that. The difficulty is that almost every Cython test involves compiling something whereas I suspect you just needed to cover 1 compilation at the start. About 90% of them just use a standard way of doing it (and so that would work) but the remainder would need changing individually. I’m sure it isn’t unsolvable but it outsmarted us at the time
I think you’re right but there does at least need to be understanding of what is happening. Python has a reputation for being a memory safe language but in the free-threading build memory safety guarantees are weakened significantly. It is likely that making things memory safe in the way that Python users typically expect requires new interfaces (e.g. immutable arrays). It will take years to make things thread-safe or to add thread-safe interfaces or to clearly document what is not thread-safe and ensure that at least different libraries are not using each other in unsafe ways.
In the meantime, the messaging around whether this is “experimental” or “supported” risks confusing actual users if even the package authors who are providing packages are in reality treating the whole thing as experimental.
That kind of thing would be fine because the elementary types are immutable at the Python level (much like int, float, etc). The problems are with matrices and polynomials since those are mutable containers and in most cases hold references to resizable heap objects. You can think of these as something like a PyListObject which in CPython internals is guarded by the critical section macro. If you set the same element of a matrix from two different threads (M[i,j] = 10) at the same time then it can lead to memory corruption if the buffer for the integer is resized by either thread.
I’m not sure that there is a good use case for mutating the same elements of a matrix in multiple threads, though, and the ways to make it “safe” impose costs on every operation that accesses the matrix. In fact, using a critical section would prevent parallel read-only access, meaning that it precludes some obvious good uses of thread parallelism like computing many different matrix products M*a, M*b, etc. in different threads.
From the perspective of end users, this is converging on the same decision, namely Oscar’s option 2 (“certain things just should not be done”). What it would mean to converge on option 1 (“it is impossible from pure Python code using only public interfaces to do anything that results in memory corruption or undefined behaviour”) is that that would become an explicit expectation. In other words, if you don’t guarantee everything, then by default you’re in the position of not offering an overall guarantee.
I haven’t really done anything meaningful with the free-threaded build, but I agree with @oscarbenjamin that we should consider what kinds of expectations and guarantees are intended by terms like “experimental” or “supported”, and I also agree with @pf_moore that some level of documentation should be considered as part of “supported”.
I think it’s important to be explicit about these things, like by saying “This build is experimental, which means [whatever] is not guaranteed” or “This build is supported, which means you can rely on [whatever]”. The PEP has some of these in the form of hard performance targets, but for my tastes too much is swept under the rug with “proven, stable APIs”. In particular everything I see in the PEP seems to be talking about internals, but I’d be a bit less nervous if some of this were surfaced to end-user documentation (e.g., “in pure Python code you can do this with lists in threads and it is guaranteed to work but you can’t do this other thing”).
Perhaps it is worth coming up with some more deliberate terminology for three levels of status: 1) wild-west experimental; 2) stable internals, ready for work by people who know C and are writing libraries, but likely not useful for those who don’t know or care about Python internals; 3) supported for general use.
This also makes me wonder if it would be worth making a more explicit section of the docs specifically for things that are guaranteed by CPython but not Python as a language. There has always been some stuff like time complexity of various operations that is more or less assumed but technically is not part of the language definition. My hunch is that once free-threading is officially labeled as “supported” there will be a significant surge in people trying to speed up their code in “obvious” ways like @pf_moore’s example. It would be good to have something to point them at that is an official guide indicating what will and won’t work, what pitfalls exist, and so on, while making it clear that these statements are separate from the actual language definition.
I was just looking at free-threading support again and ran into an issue from before: the actions/setup-python GitHub action still does not provide the free-threaded build. Apparently a PR from @colesbury (thanks!) fixed this two weeks ago, but I guess a new release of the action is needed before it can be used. In the meantime, as mentioned in the PR, Quansight-Labs has a fork that provides it. Last time I looked at this I ended up using deadsnakes to set it up on Ubuntu, but that doesn’t work for other operating systems.
Python 3.13 was released 5 months ago but the standard way of setting up Python in the most widely used CI provider still does not (quite) provide 3.13t. Fixing that sort of thing will go a long way towards convincing package authors that the build is worth supporting.
A shared-exclusive lock (also called read-write lock) would typically work fine for this. It’s readily available from the stdlib in C++ (and certainly in Rust too).
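Python’s stdlib doesn’t ship one, but a minimal sketch built on threading.Condition (deliberately unfair - writers can starve - but enough to show the shape of the primitive) could look like this:

import threading

class SharedExclusiveLock:
    # Readers run in parallel; a writer waits until no readers remain
    # and then holds the underlying lock, blocking new readers.

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_shared(self):
        with self._cond:
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        self._cond.acquire()
        while self._readers:
            self._cond.wait()
        # The writer keeps holding the condition's lock until release,
        # which keeps new readers out for the duration.

    def release_exclusive(self):
        self._cond.release()

Reads (like the parallel M*a, M*b products above) take the shared side; only mutation takes the exclusive side.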
Yeah, but this cuts both ways. Once it is considered “supported” by CPython, users will loudly demand support in every third-party library under the sun. In a lot of cases the maintainers will have to say: “sorry, our dependencies are not available/bugfree yet”, and it just leads to lots of spilled ink and no-one being happy.
The ecosystem (by which I mean first and foremost: library authors, of which the vast majority is unpaid) unequivocally needs more time to digest this huge change. The fundamental pieces needed to even attempt nogil support are barely starting to be ready (and some are still just on the horizon, like being able to specify nogil-only dependency constraints). But what we actually need is for libraries to build on top of that, publish a first version with nogil support, get feedback what’s broken, etc. before we can determine the real impact.
I have to take something back. I said that races in objects created by an extension are just as important as races in an extension’s global data.
But if a library says “don’t share my objects across threads or else you may see data races or even crashes”, the extension doesn’t have to worry about data races within its objects – users have to play by its rules. This is probably the state that numpy and ctypes will be in forever (and code generated by Cython, IIUC).
However, if it also has some global state (could be as simple as a counter of how many of its objects are alive), that state should not get corrupted by free-threading use.
So my conclusion is that the minimum thing an extension should do is ensure its global data structures are not corruptible by free-threaded access.
Separately, I think this discussion has run its course and we may well trust the SC to decide.