But in other ways, using rayon is a heck of a lot like using ThreadPoolExecutor. A big difference is that the Rust compiler won’t let you share a mutable data structure across your threads, so you can’t make the mistake you outlined above. Of course that was your point–disallow the pattern to make the error impossible.
I think there’s some disagreement about what would qualify as “experts-only” level of sophistication. To me, expert-level is when you need to reason about lock ordering or something, whereas “mapping jobs need to be independent” is pretty simple.
I’m not sure if it’s possible, but I wonder if static analysis tools could identify common pitfalls, even if Python itself will happily execute them. A lint rule that warns “you’re modifying a dictionary from multiple threads, don’t do that” would be useful for teaching people stuff like this[1].
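For illustration, a made-up snippet of the sort such a rule could flag: one thread mutates a dict while another iterates over it, which can blow up with “RuntimeError: dictionary changed size during iteration”, depending on the build and scheduling.

import threading

stats = {"seed": 0}

def writer() -> None:
    for i in range(1000):
        stats[f"key{i}"] = i  # lint: dict modified from multiple threads

def reader() -> None:
    for _ in range(1000):
        sum(stats.values())  # iterates while the writer mutates

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()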
An uncomfortable aspect of this discussion is that this is partially a social issue–there’s a lot of enthusiasm to work on free threading and much less for subinterpreters, and that’s not something that can be changed by fiat.
There’s also a lot of corporate resources for free-threading but I don’t know if that’s all that separate from the social side–I don’t think those resources would be allocated if there wasn’t broader community support.
similar to how mutable defaults are possible, but warned against by linters ↩︎
I’d probably disable such a rule. There are cases where you can rely on the interpreter’s guarantees without a data race and without additional locking. An example would be a dict with no iteration over keys/values/items from any thread, and a division of work that ensures no two threads ever need to access the same key simultaneously.
This might seem like something you can just safely split into before/after the threaded section, but there are cases where you shouldn’t, such as caching decorators backed by a dict; a sketch follows.
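A minimal sketch of that caching case, with a made-up memoize helper: the dict can’t be populated before the threads start, because it fills in lazily as workers call the function. Assuming the function is pure, the worst interleaving is two threads computing the same key and one write winning, which is benign.

import functools

def memoize(func):
    cache = {}  # shared across threads, deliberately unlocked

    @functools.wraps(func)
    def wrapper(arg):
        # Two threads may both miss and both compute, but the last
        # write simply wins; harmless for a pure function.
        if arg not in cache:
            cache[arg] = func(arg)
        return cache[arg]

    return wrapper

@memoize
def expensive(n: int) -> int:
    return n * n  # stand-in for real work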
My answer was explicitly and concretely about subinterpreters, not some general argument about concurrency. Yes, Rust can do it, but so what? Rust is so different from Python that you can’t really derive anything concrete from that observation.[1]
I gave an example above: data passed to a subinterpreter has to be efficiently picklable, unless it’s part of an extremely restricted set of “shareable” types (and those “shareable” types can come with their own warts).
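A minimal sketch of how that boundary behaves, assuming Python 3.14’s InterpreterPoolExecutor: arguments cross by pickling, so an unpicklable object like an open file handle is rejected rather than shared.

from concurrent.futures import InterpreterPoolExecutor  # Python 3.14+

def summarize(records) -> int:
    return sum(records.values())

with InterpreterPoolExecutor() as pool:
    # Fine: a small dict of ints pickles cheaply.
    print(pool.submit(summarize, {"a": 1, "b": 2}).result())

    # Not fine: an open file handle can't be pickled, so the task
    # fails instead of the object being shared between interpreters.
    with open(__file__) as handle:
        try:
            pool.submit(summarize, handle).result()
        except Exception as exc:
            print(type(exc).__name__)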
That’s really a false dilemma. Nothing prevents you or any other proponent of subinterpreters from making improvements that make subinterpreters easier to use, more efficient, and more compatible with the ecosystem.
But conversely, the distant possibility of efficient and easy-to-use subinterpreters should not be used as an argument to prevent people from making free-threading safer, more performant and more compatible with the ecosystem.
The Rust type system is a large part of what allows Rust to be safe and efficient at the same time ↩︎
I think that’s a good example of why a lint rule is a good fit–if you know why it’s okay to disable, that’s fine. If you aren’t sure what might go wrong, it’s there to warn you.
He said that if it did use shared variables (e.g. putting the downloaded data into a dict rather than using return) it wouldn’t be safe. ThreadPoolExecutor allows you to use either of those approaches, while InterpreterPoolExecutor (and ProcessPoolExecutor) require you to use the safe one.
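A minimal sketch of the two styles (download is a stand-in for the real work):

from concurrent.futures import ThreadPoolExecutor

def download(url: str) -> str:
    return f"contents of {url}"  # stand-in for real network I/O

urls = ["https://example.com/a", "https://example.com/b"]

# Safe style: workers communicate only via return values. This is the
# only style InterpreterPoolExecutor and ProcessPoolExecutor permit.
with ThreadPoolExecutor() as pool:
    results = dict(zip(urls, pool.map(download, urls)))

# Shared-state style: only possible with threads, because the workers
# mutate a dict that lives in the parent's memory.
shared = {}

def fetch_into_shared(url: str) -> None:
    shared[url] = download(url)

with ThreadPoolExecutor() as pool:
    list(pool.map(fetch_into_shared, urls))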
At $work I’d have use cases for both approaches, so I’d very much appreciate having fully isolated subinterpreters. I’d like to use them where appropriate for the problem at hand, since “share nothing but immutable & immortal objects” is so much easier to reason about than “share everything (and appropriately protect the things that are really shared)”.
Both approaches have their pros and cons, so it would be fine to have both of them and let the user pick the best tool for their problem and their level of expertise.
Even with free threading enabled, it is still possible to use subinterpreters to avoid unintended race conditions, so it is unclear what the topic of this thread is.
Are there any drawbacks to removing the GIL from Python in the future when free threading becomes stable?
Or is the problem that free threading is so widely advertised that people overlook the per interpreter GIL that already exists?
The topic of the thread (to put this quote in context) is “should developing Python code be an exercise in avoiding unintended race conditions, or is there a better way?”.
Not to derail but this kept bugging me so I went and checked the source for dict.__setitem__. It looks to me like this is thread-safe, and in fact pushing all the results into a dict would work perfectly fine? I see Mike commenting something to the same effect later.
Yes, but assuming that each worker pushes to different keys and does not need to read the dict and see it in any consistent state. It is also fine to append the results to a list as long as it doesn’t matter what order the items end up in.
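For instance, a small sketch of the list case: every result arrives, but the order reflects scheduling, not submission.

from concurrent.futures import ThreadPoolExecutor

results = []

def worker(n: int) -> None:
    results.append(n * n)  # the append itself is fine; the order is not

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(worker, range(8)))

print(sorted(results))  # sort if you need a deterministic order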
Every time I have a discussion like this, someone tells me that it’s not safe (often without a clear explanation of what “it” is). I’ve learned to simply assume my intuition is wrong (and reasoning about all the possible thread switch points is beyond me - again, I keep getting jumped on by people telling me things like the GIL actually offers no additional protection[1] to normal Python code).
Net result - I’ve been conditioned to assume that I’m too dumb to understand the risks involved in threaded code in Python. This is essentially why I characterise “raw” threaded code as “expert only”.
Let’s get back on topic at this point. I don’t think this is a productive line of discussion.
This is where most people who start doing threading in C end up.
The reason it’s not “safe” is because Tin is talking about literal thread safety, while what is usually intended is consistency.
shared_dict["a"] += 1 is thread safe if it doesn’t crash,[1] but it’s only consistent if the entire read-update-write operation is uninterrupted. Saying “dict is thread-safe” doesn’t actually imply that every operation involving a dict is going to have consistent results, but it’s very tempting to think it does mean that.
So in practice, “nothing crashes when I do that” isn’t enough to write correct programs. You still need to be ultra-cautious[2] to avoid inconsistency; the sketch below shows the lost update in action.
Or leave a partially overwritten value, which in Python would be an invalid pointer and hence a crash, but in C might be the top 32-bits of one 64-bit int combined with the bottom 32-bits of another. ↩︎
Or, in certain other languages, use extra annotations and deal with an ultra-pedantic compiler. ↩︎
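A small sketch of that lost-update case (names made up; whether updates are actually lost depends on the build and on where the interpreter allows thread switches, but free-threaded builds lose them readily):

import threading

counter = {"a": 0}
lock = threading.Lock()

def unsafe(n: int) -> None:
    for _ in range(n):
        counter["a"] += 1  # read, add, write: three separable steps

def safe(n: int) -> None:
    for _ in range(n):
        with lock:  # the whole read-update-write is now uninterrupted
            counter["a"] += 1

# Swap `unsafe` for `safe` to make the total come out exact.
threads = [threading.Thread(target=unsafe, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["a"])  # nothing crashed, but this can be well below 400000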
I’m confused about what you’re saying here. The GIL makes most things run sequentially, so there is very little speedup from having multiple threads[1]. There’s a very large impact from removing the GIL in that setting (a rough benchmark sketch is below).
“free-threading” means multithreading without the GIL and without using subinterpreters.
unless you’re doing a ton of IO, or other stuff that releases the GIL ↩︎
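A rough benchmark sketch of the effect described above (numbers vary by machine; a free-threaded build changes the picture):

import time
from concurrent.futures import ThreadPoolExecutor

def spin(n: int) -> int:
    total = 0
    for i in range(n):
        total += i
    return total

N = 2_000_000

start = time.perf_counter()
for _ in range(4):
    spin(N)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(spin, [N] * 4))
threaded = time.perf_counter() - start

# Under the GIL, `threaded` is roughly equal to (or worse than) `serial`
# because the threads take turns; without the GIL it can approach 4x.
print(f"serial:   {serial:.2f}s")
print(f"threaded: {threaded:.2f}s")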
Multithreading improves performance not only for I/O, but also for CPU/GPU-heavy workloads that release the GIL.
Multithreading is widely used already. If the problem is “should developing Python code be an exercise in avoiding unintended race conditions”, then that problem already exists with the GIL. That is what I am saying.
We can remove the (per-interpreter) GIL AND improve subinterpreters too. There is no need to keep the GIL even if free threading is a problem.
In the “use algorithm / code structure to make it so more code is fast enough” bucket, one of my longer-term projects is working towards exposing kernel/system multi-threading without requiring Python/userspace multi-threading, for cases like this:
def gather_configs(folder: Path):
    with io.context(collect_errors=True):
        return {
            path: path.read_text()
            for path in folder.glob("**/*.yaml")
        }
There, io.context would serve as an explicit “switch” in the I/O stack, visible to the reader, to an “awaitable” model; it gives somewhere to stash Exceptions that happen and raise them on exit of the context, rather than the current behavior of raising immediately. Ideally that would use tools like io_uring / kqueue / WaitForMultipleObjects under the hood to dispatch and wait for events asynchronously from the kernel, without needing a threadpool implementation in Python.
In effect that should mean the kernel is able to work on more requests simultaneously, which should mean higher throughput for a lot of cases (e.g. reading lots of small files from a high-latency network filesystem). Even for local SSD / in-memory cached files, I suspect it will help some on kernels/systems with good multi-core I/O scalability, because the hardware/kernel gets more work sooner.
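For comparison, a minimal sketch of today’s closest approximation (the function name is made up): a Python-level thread pool, where every in-flight read occupies a userspace thread.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def gather_configs_today(folder: Path) -> dict[Path, str]:
    paths = list(folder.glob("**/*.yaml"))
    # Every in-flight read occupies a userspace thread, and errors are
    # not collected: the first OSError propagates out of map() and the
    # remaining results are lost, unlike the proposed collect_errors=True.
    with ThreadPoolExecutor(max_workers=16) as pool:
        return dict(zip(paths, pool.map(Path.read_text, paths)))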
Oops, I accidentally hit delete on my post while wearing the “weak glasses” early in the morning, and the post was deleted without confirmation. I can’t find a way to bring it back.
Update: found the restore button on the PC. I really should get stronger glasses and stop doing things early in the morning on the phone without a second coffee.