Multiprocessing spawn (default on macOS since Python 3.8) is slower than fork method

xistence · November 26, 2020, 10:20am

I am being fairly vague here, but it is mainly because I don’t know the best way to reduce the test cases down to pinpoint the problem.

I maintain Waitress, a pure Python HTTP server, and ever since Python 3.8 there’s been a marked slowdown of the functional tests.

Python 3.6 on my MacBook Pro M1 Rosetta: 15.91s to run the entire test suite, similar slow tests as Py37. So not included.

Python 3.7 on my MacBook Pro M1 Rosetta: 15.57s to run the entire test suite

Slowest tests on Py37 is:

3.10s call     tests/test_functional.py::SleepyThreadTests::test_it
1.01s call     tests/test_channel.py::TestHTTPChannelLookahead::test_client_disconnect
0.41s call     tests/test_wasyncore.py::DispatcherWithSendTests::test_send
0.17s call     tests/test_wasyncore.py::TestAPI_UseIPv6Poll::test_handle_close_after_conn_broken
0.11s call     tests/test_wasyncore.py::TestAPI_UseIPv4Select::test_handle_read
0.11s call     tests/test_wasyncore.py::TestAPI_UseIPv4Select::test_handle_close_after_conn_broken
0.11s call     tests/test_wasyncore.py::TestAPI_UseIPv4Poll::test_handle_close_after_conn_broken
0.11s call     tests/test_wasyncore.py::TestAPI_UseUnixSocketsPoll::test_handle_close_after_conn_broken
0.11s call     tests/test_wasyncore.py::TestAPI_UseIPv6Select::test_handle_close_after_conn_broken
0.11s call     tests/test_wasyncore.py::TestAPI_UseIPv4Select::test_handle_close

Python 3.8 on my MacBook Pro M1 Rosetta: 44.33 seconds to run the entire test suite

3.34s call     tests/test_functional.py::SleepyThreadTests::test_it
1.01s call     tests/test_channel.py::TestHTTPChannelLookahead::test_client_disconnect
0.41s call     tests/test_wasyncore.py::DispatcherWithSendTests::test_send
0.40s call     tests/test_functional.py::TcpInternalServerErrorTests::test_before_start_response_http_11_close
0.36s call     tests/test_functional.py::TcpInternalServerErrorTests::test_before_start_response_http_11
0.35s call     tests/test_functional.py::UnixWriteCallbackTests::test_no_content_length
0.34s call     tests/test_functional.py::TcpInternalServerErrorTests::test_after_write_cb
0.33s call     tests/test_functional.py::TcpInternalServerErrorTests::test_after_start_response_http11_close
0.32s call     tests/test_functional.py::TcpFileWrapperTests::test_filelike_nocl_http10
0.32s call     tests/test_functional.py::TcpFileWrapperTests::test_filelike_http10

Python 3.9 on my MacBook Pro M1 Rosetta: 41.93 seconds

3.28s call     tests/test_functional.py::SleepyThreadTests::test_it
1.01s call     tests/test_channel.py::TestHTTPChannelLookahead::test_client_disconnect
0.41s call     tests/test_wasyncore.py::DispatcherWithSendTests::test_send
0.36s call     tests/test_functional.py::TcpWriteCallbackTests::test_short_body
0.33s call     tests/test_functional.py::TcpFileWrapperTests::test_notfilelike_nocl_http10
0.32s call     tests/test_functional.py::TcpEchoTests::test_chunking_request_without_content
0.32s call     tests/test_functional.py::TcpEchoTests::test_date_and_server
0.31s call     tests/test_functional.py::TcpBadContentLengthTests::test_long_body
0.30s call     tests/test_functional.py::TcpEchoTests::test_broken_chunked_encoding_missing_chunk_end
0.30s call     tests/test_functional.py::TcpTooLargeTests::test_request_body_too_large_with_wrong_cl_http10

This is very visible too in the test times on Github actions:

https://github.com/Pylons/waitress/runs/1335343659?check_suite_focus=true (Python 3.7, running tox takes 40 seconds)

https://github.com/Pylons/waitress/runs/1335343690?check_suite_focus=true (Python 3.8, running tox takes 3m and 42seconds)

Although GitHub actions are not always reliable since it is shared infrastructure and you never know who your neighbors are, but the same issue holds across all of the Mac devices I own (2017 MacBook Pro i7, 2012 Mac Mini i7). Just as an extra data point, my 2017 MacBook Pro runs the py39 tests in 67 seconds, and the py37 tests in 26 seconds.

The problem is that I don’t know where to begin attempting figure out what is going on. So I am shooting my shot here in the bug tracker. How can I help debug this/track this down so that we can get rid of the regression and speed things back up?

Open source project: https://github.com/Pylons/waitress
Uses pytest to run tests, tox is the easiest way to accomplish this.

I’ve also created a bug report, but since it isn’t an actual straight forward report (https://bugs.python.org/issue42471), I figured I’d try posting here too.

I want to help, but need advice on how to try and figure out what caused the slowdown in the first place.

xistence · November 26, 2020, 7:27pm

@nad correctly deduced it is due to the change in multiprocessing to go from using the fork model to using the spawn model.

pitrou · November 26, 2020, 9:29pm

The solution is probably to avoid spawning new processes and instead use a process pool.

xistence · November 27, 2020, 12:00am

Since it is tests and we absolutely want a new instance for each test we run, for now I’ve just gone for using the fork model, which gets us the speed back. I’d rather not try to figure out how build out a process pool and change the tests around that.

We don’t do anything fancy with PyObjC or any macOS frameworks so its not unsafe, nor are we spawning these things from threads.

pitrou · November 27, 2020, 12:14am

Then I’d suggest to at least use forkserver, which is much less fragile than fork (on any Unix-like OS).

vstinner · November 27, 2020, 7:36pm

I took the liberty of changing the topic title.

andyryan · February 26, 2021, 1:07pm

I’ve also seen this slowdown using multiprocessing on MacOS 10.15.6 and Python 3.8.

We are craeting a pool of processes using Pool() and starmap to obtain our results.

Using spawn our processing takes ~130 seconds
Using fork : ~40 seconds
Using forkserver: ~130 seconds

I actually thought this was a CPU throttling issue at first. I would see my Mac’s CPU reach 100% on each process, the temperature spike to 95 degrees C and then the CPU usage for each process would drop to around 5%. However manually running the fans at full blast so the temperature never rises above 65 degrees, I see the same level of CPU usage across my processes. Reducing the number of processes does not help the situation. Limiting to 1 process does not alter the processing time compared to 4 or 8 when using spawn, but does with fork.

Amendum: We have also observed this on a Ubuntu Virtual Machine running on WSL 2.

Time with spawn: ~200 seconds
Time with deafult (‘fork’ since as its linux): 80 seconds

Amendum 2: A hunch from a colleague is that it is due to large imports such as NLTK and TextBlob. That these are bottlenecking the new processes whereas fork would have copied them from the main process? Not sure if that helps.

pitrou · February 26, 2021, 2:31pm

Wait… you’re saying creating the Pool has a 90 seconds overhead. That sounds insane. Are you sure you create the Pool only once?

andyryan · February 26, 2021, 2:45pm

Sorry, to be clearer, this is how long the processing takes within those pools.
I.E given a list of items, how long it takes to process all items using

with Pool() as p:
            results = p.starmap(function_name, [(list_item, jobid) for list_item in list])

pitrou · February 26, 2021, 2:58pm

Ok, so it’s not clear how to attribute the costs. There are two possible explanations for the 90 seconds overhead:

creating the Pool really takes 90 seconds
the function is extremely heavy to serialize and it takes 90 seconds to send it to the pool workers

I would suggest replacing the function with a trivial one and see how it goes.

pitrou · February 26, 2021, 2:59pm

Well, have you timed how long they take to import in isolation? (e.g. in a Python interpreter prompt). This is not something we can measure for you.