Hi, I wanted to have a more public discussion of python/cpython issue #84559, “multiprocessing’s default start method of fork()-without-exec() is broken”, since it’s been open for a while without a final decision.
The problem
Right now, on Windows and macOS the default multiprocessing context is “spawn”. On Linux and other POSIX platforms, it’s “fork”.
The problem:
- fork() without execve() is fundamentally broken when threads are in use (see below).
- This is an implementation detail many users aren’t aware of.
- Many libraries use threads under the hood. Any time you import NumPy, for example, there’s a thread pool running in the background. So even in the unlikely event people notice this warning in the documentation, they might still hit deadlocks and not know why. The most recent examples are PyTorch and grpc: see the comments in the issue for links.

The result is users hitting unexplained, mysterious deadlocks, possibly caused by third-party code they didn’t even write themselves. Things that break at a distance are no fun.
Why is fork() without execve() broken?
When fork() happens, all threads from the parent no longer exist in the child. This means that:
- Any locks those threads held at the moment of the fork remain locked in the child, but the threads that would have released them are gone, so the next attempt to acquire them deadlocks.
- The data protected by such a lock may be only partially updated, so it may no longer be semantically valid even if the lock is manually released post-fork(). Conceivably it could just be complete garbage.

So if you have C libraries that start thread pools, you might, for example, end up with a locked, corrupted static work queue in the child. When the subprocess tries to start things up again, it won’t go well.
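To make this concrete, here’s a minimal sketch of the failure (the names are mine, and it’s POSIX-only since it explicitly requests the “fork” context): a background thread holds a lock at the moment of the fork, and the child then deadlocks trying to acquire it.

```python
import multiprocessing
import threading
import time

lock = threading.Lock()

def hold_lock():
    # Stands in for a library's background worker thread that happens
    # to hold a lock when the parent forks.
    with lock:
        time.sleep(60)

def child():
    print("child: acquiring lock...")
    # The owning thread doesn't exist in the child, so nothing will
    # ever release this lock: the acquire blocks forever.
    lock.acquire()
    print("child: done")  # never reached

if __name__ == "__main__":
    threading.Thread(target=hold_lock, daemon=True).start()
    time.sleep(0.1)  # give the thread time to grab the lock before forking
    p = multiprocessing.get_context("fork").Process(target=child)
    p.start()
    p.join(timeout=3)
    print("child deadlocked:", p.is_alive())  # True
    p.terminate()
```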
A solution
Switching to “spawn” as the default method would fix this.
The cost, of course, is that this will result in some backwards incompatibility:
- There is a performance impact in some situations: “spawn” starts a fresh interpreter and re-imports the parent’s modules, so child startup is slower.
- “spawn” requires the if __name__ == '__main__' idiom if you’re using a single script to run everything; “fork” doesn’t have this requirement.
However, the error in this case is fairly straightforward:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
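For reference, the idiom the error message asks for just means putting process startup under the guard, roughly like this (a minimal sketch; work() is a stand-in for whatever you’re parallelizing):

```python
import multiprocessing

def work(n):
    return n * n

if __name__ == "__main__":
    # Under "spawn", each child starts a fresh interpreter and re-imports
    # this module; the guard keeps that re-import from spawning recursively.
    with multiprocessing.get_context("spawn").Pool(4) as pool:
        print(pool.map(work, range(10)))
```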
So the experience would be “upgrade to Python 3.14, get an error message, grumble, follow the instructions and fix the code, move on”.
In contrast, the current failure mode is your Python process deadlocking at random, with no explanation. This can be impossible to debug for some people.
So some people will have problems either way, but with the proposed change the problem will be much less significant and much easier to fix. And for those people who really want “fork”, it will still be available.
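Opting back in would stay a one-liner, either globally via set_start_method("fork") or per-context as sketched here (task() is just for illustration):

```python
import multiprocessing

def task():
    print("running in a forked child")

if __name__ == "__main__":
    # Explicitly request the old behavior (POSIX-only); the rest of the
    # program keeps using whatever the default context is.
    ctx = multiprocessing.get_context("fork")
    p = ctx.Process(target=task)
    p.start()
    p.join()
```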
(As a side benefit, in the current situation code written on Linux might fail with that RuntimeError on macOS/Windows; that will no longer be the case.)
What do you all think?