Doubts about multiprocessing.Pool() class arguments

Good afternoon to everyone.

My name is Joan. This is my first message in the Python community, and I would like to ask about the meaning of two of the arguments to the Pool class: “processes” and “maxtasksperchild”.

  1. “processes”. As far as I understand, this argument lets the user specify the number of processors (cores of a CPU) among which the different processes will be divided. In my case, I have 32 processors available. Out of curiosity, I set “processes” to a number higher than 32, and the Python script still worked fine. So, one of my questions is: why does the program still work even though I have set a “processes” value that makes no sense? In addition, I tried another example where I set “processes” to a very high number (of the order of 10⁴). In this case, the Python script did not work, and an OSError was raised, specifically “Too many open files”. So, my second question is: why does the program not work in this situation?

  2. “maxtasksperchild”. The Python documentation says that maxtasksperchild is the number of tasks a worker process can complete before it will exit and be replaced with a fresh worker process, to enable unused resources to be freed. The default maxtasksperchild is None, which means worker processes will live as long as the pool. I would be glad if you could show me a practical example of the usage of this argument and the benefits of using it.

Thank you very much!

Hi Joan,

The “processes” argument is not the number of CPU cores used. It is the number of worker processes, not processors.

Process: Process (computing) - Wikipedia

Not processor: Central processing unit - Wikipedia

Right now, my computer has approximately 290 processes running, and my CPU has only four cores. Thirteen of those processes are from Firefox alone.

Every operating system has limits on the number of running processes. For my small PC, 290 is not a lot. Even for your computer, a more powerful 32-core machine, 10000 processes is too many, and when you try to create 10000 processes all at once, the operating system says NO and you get an OSError.

What OS are you running? If you are running 64-bit Linux, the maximum number of processes is 4194303. Even 32-bit Linux has a maximum of 32768. So why does your code fail with only 10000?

Guessing from the error message you received, I think this is an internal limitation. It looks like Pool opens one or more file descriptors (pipes) for each worker process. Every operating system also has a limit on the number of open files per process, and that limit is much smaller. So when the running Python process reaches (by default) 1024 open files, it cannot open any more, and you get the OSError you saw.
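
If you want to check that limit from Python itself, the resource module can report it. A quick sketch (the exact numbers depend on how your system is configured):

import resource

# Per-process limit on open file descriptors. Each Pool worker needs a
# few descriptors (pipes for sending tasks and receiving results), so
# this soft limit is hit long before any OS-wide process limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft limit = {soft}, hard limit = {hard}")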

Note that having more processes is not always better. It takes OS resources (time and memory) to run each process, and the more processes you have the more work the OS has to do to manage them. At some point, the OS is doing more work managing the processes than you are getting benefit from multiprocessing, and instead of being faster your code is slower.

This happens in real life too. Have you ever seen the kitchen of a busy restaurant? With one cook, there is a limit to how many meals the cook can make. Adding a second cook means they can prepare more food more quickly. Adding a third cook, maybe a fourth, but eventually (say, ten or twenty or fifty cooks, depending on how big the kitchen is) they get in each other’s way and they are less productive. Even the biggest kitchen won’t work well with 10000 cooks crammed in :-)

I can’t answer your question about maxtasksperchild. I don’t know of any practical examples to show you.

But I imagine the maxtasksperchild argument exists in case your worker processes are running code that leaks resources (memory, open file descriptors, etc). You can tell each worker process to exit after a certain number of tasks, so the OS can reclaim the leaked resources.

If my guess is correct, then maxtasksperchild should be seen as a temporary work-around until you fix the resource leaks in your worker function.
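
To illustrate that guess, here is a minimal sketch with a deliberately leaky worker; the function and all the numbers are invented for the example:

import multiprocessing as mp

leaked = []  # module-level list that only ever grows: a deliberate "leak"

def leaky_worker(n):
    # Pretend this is a real task that accidentally holds on to ~1 MB
    # of memory after every call.
    leaked.append(bytearray(10**6))
    return n * n

if __name__ == "__main__":
    # With maxtasksperchild=100, each worker process exits after 100
    # tasks and is replaced by a fresh one, so the OS reclaims whatever
    # the old worker leaked.
    with mp.Pool(processes=4, maxtasksperchild=100) as pool:
        results = pool.map(leaky_worker, range(10_000))
        print(len(results))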

Hello, good morning Steven,

Thank you very much for your response!

Let me answer some of your questions:

  1. The OS is 64-bit Linux.
  2. I am working on a Linux cluster, “ECGATE”, which contains several nodes. In fact, my program runs on a single node consisting of 2 CPUs, each containing 16 cores.
  3. Sorry for the misunderstanding about the “processes” parameter of the “Pool()” class: you are right, it means the number of processes that will be controlled by the Pool object. However, I would like to fully understand the usage of this parameter. Let me show you an extract of my Python script, where I create a Pool object and launch several processes using the “pool.starmap()” function.
    To put you in context, I have a 3D NumPy array containing precipitation data for the whole domain of a meteorological model. Each 2D array within the 3D array corresponds to the data for a specific day and member of the model. I would like to perform some calculations for each grid point of the model. For this reason, I want to launch “ny * nx” processes. So, I define a list of arguments, where each element is itself a list containing the arguments for a specific process. Then, I define the “Pool” object (where I have to specify the “processes” argument). Finally, I call the “pool.starmap()” function to launch the different processes.

import multiprocessing as mp  # imported at the top of my script

z, y, x = meteo_array.shape

# one argument tuple per grid point
argument_list = []
for ny in range(y):
    for nx in range(x):
        argument_list.append((meteo_array, ny, nx, cdf_mode, p_step))

num_processes = 31
pool = mp.Pool(processes=num_processes)
cdf_array_list = pool.starmap(construct_cdf, argument_list)

This code works fine, and it finishes the execution of the “nx * ny = 565 * 469 = 264985” processes in 8 minutes. The thing is that, in the definition of the Pool object, I specified that the number of processes would be 31.

I have performed a little experiment, which consists in modifying the “processes” parameter to see how it affects the execution time of the program.

These are the results:

processes = 1 ==> time = 19 min
processes = 2 ==> time = 10 min
processes = 4 ==> time = 4 min
processes = 8 ==> time = 2 min
processes = 16 ==> time = 4 min
processes = 31 ==> time = 8 min
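
In case anyone wants to reproduce the experiment, this is roughly how each run was timed (simplified; construct_cdf and argument_list are as in the extract above):

import multiprocessing as mp
import time

if __name__ == "__main__":
    for num_processes in (1, 2, 4, 8, 16, 31):
        start = time.perf_counter()
        with mp.Pool(processes=num_processes) as pool:
            pool.starmap(construct_cdf, argument_list)
        elapsed = time.perf_counter() - start
        print(f"processes = {num_processes} ==> time = {elapsed / 60:.1f} min")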

I suspect that these results are closely related to the example you gave about the number of cooks in a kitchen, aren’t they? A larger value of the “processes” parameter doesn’t imply a faster execution time. I think that setting “processes” to a number close to the number of processors of the OS is counterproductive as we may be accessing some processors (or cores) that are actually busy with other processes (not related to the execution of my program), thus increasing the execution time of the program.

So I understand that the “processes” argument indicates the number of processes that will be executed simultaneously (or, equivalently, the number of processors controlled by the Pool object).
Could you please tell me if this explanation makes sense?

Thanks in advance,

Joan


Hi, Joan. Thanks for an interesting result. From the documentation on Process Pools, I had been thinking that the number of cores is the optimal value for the processes argument:

processes is the number of worker processes to use. If processes is None then the number returned by os.cpu_count() is used.

On your system, os.cpu_count() returns 32, right? Then I have a question for you. Is processes=8 always the fastest? Can the result depend on the number of physical and logical cores?
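
For what it’s worth, you can query both counts; psutil is a third-party package, so this assumes it is installed:

import os
import psutil  # third-party: pip install psutil

print("logical cores:", os.cpu_count())                 # what Pool defaults to
print("physical cores:", psutil.cpu_count(logical=False))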

It may be that you are dividing your jobs too finely, causing inter-process communication and synchronisation to dominate your runtime. Suppose that instead of processing a single x, y point per job you processed an entire y line? Then you would have only 565 jobs instead of 264985 jobs. That’s still plenty to take advantage of your 32 cores, but with fewer points of inter-process communication. A rough sketch of what I mean follows.
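
This reuses the names from your extract; construct_cdf_row is a hypothetical helper, not something in your code:

import multiprocessing as mp

def construct_cdf_row(meteo_array, ny, cdf_mode, p_step):
    # One job per y line: loop over x inside the worker instead of
    # creating a separate job for every (ny, nx) pair.
    _, _, x = meteo_array.shape
    return [construct_cdf(meteo_array, ny, nx, cdf_mode, p_step)
            for nx in range(x)]

if __name__ == "__main__":
    z, y, x = meteo_array.shape
    argument_list = [(meteo_array, ny, cdf_mode, p_step) for ny in range(y)]
    with mp.Pool(processes=8) as pool:
        cdf_row_list = pool.starmap(construct_cdf_row, argument_list)

You still get a result for every grid point, just grouped by row.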

Hi Joan,

I really like the little experiment you performed. You have re-discovered parallel slowdown:

The processes argument to Pool lets you specify how many worker processes are created and run; it doesn’t tell you how many will actually be running simultaneously. In principle, up to as many processes as you have available CPU cores can run simultaneously, but in practice that’s a best-case scenario that is rarely achievable and is out of your control.

The actual number will depend on the nature of the problem (perhaps some of the workers spend a lot of time idling, waiting for data to reach them), or on how much the cores are being used by the OS and other processes outside of your control.

You say:

“This code works fine, and it finishes the execution of the “nx * ny = 565 * 469 = 264985” processes in 8 minutes. The thing is that, in the definition of the Pool object, I specified that the number of processes would be 31.”

You don’t have 264985 processes. You have 31 processes, just as you specified when you created the Pool object.

You have 264985 chunks of data to process in argument_list. Each chunk of data is a tuple of five values:

(meteo_array, ny, nx, cdf_mode, p_step)

The Pool object parcels out those 264985 chunks of data between 31 processes, running in parallel. Each process is a worker: it can do some work or computation or calculation on one chunk of data at a time. You have 31 workers, so overall your computer is processing 31 chunks of data more or less at the same time.

It will actually be less on average, since the OS will still be giving some CPU time to the dozens of other processes running on your machine. But to a first approximation, you can ignore those other processes, and assume that your 31 worker processes are running simultaneously.

So, on average, each worker process ends up doing the computation for roughly 8548 of those chunks of data.

With a single worker process, the Python interpreter would need to process those 264985 chunks of data one at a time:

CPU core 1:  process chunk 1; process chunk 2; process chunk 3; ...

With two processes, you get:

CPU core 1:  process chunk 1; process chunk 3; process chunk 5; ...
CPU core 2:  process chunk 2; process chunk 4; process chunk 6; ...

With 31 worker processes:

CPU core 1: process chunk 1; process chunk 32; process chunk 63; ...
CPU core 2: process chunk 2; process chunk 33; process chunk 64; ...
CPU core 3: process chunk 3; process chunk 34; process chunk 65; ...
...
CPU core 31: process chunk 31; process chunk 62; process chunk 93; ...

Naively, this would suggest that the 31-process version should be 31 times faster than the single-process version, but in practice that’s not what happens. The differences between theory and practice are:

  • there are dozens of other processes running on your system, outside of the Python environment; even if they are mostly idling, the OS still has to give them some access to the CPU;

  • there is some overhead needed to manage all those worker processes, and the more workers, the more overhead;

  • more time may be spent passing data from the main Python process that launched the workers than is spent doing the actual computation (see the sketch after this list);

  • it may be that your computation is not well suited to parallel processing, so dividing it up over 31 workers does not come close to speeding it up by 31 times.
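
One knob that may reduce that data-passing overhead is starmap’s chunksize argument, which batches many argument tuples into a single message to a worker. A sketch, reusing the names from your extract (the value 100 is a guess, not a tuned number):

import multiprocessing as mp

if __name__ == "__main__":
    with mp.Pool(processes=8) as pool:
        # Each worker receives 100 argument tuples at a time, so there
        # are far fewer round trips between the main process and the
        # workers than with the default chunking.
        cdf_array_list = pool.starmap(construct_cdf, argument_list, chunksize=100)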

So you have good practical evidence that for your specific problem, the optimum number of worker processes is between 8 and 16 on a 32-core machine.

You say:

“I think that setting “processes” to a number close to the number of processors of the OS is counterproductive as we may be accessing some processors (or cores) that are actually busy with other processes (not related to the execution of my program)”

Yes, that is a very important factor. The busier your OS is, the less CPU time is available for your Pool of worker processes.

Since you are running Linux, you may be able to watch the active processes by running top in a terminal; you should see how many other processes are running and what the total load is.

Ideally the load should approach 32 while your computation is running, which means that all 32 cores are being fully utilized.

If the load is significantly below 32, it means that some of the cores are sitting around doing nothing.

If the load is above 32, it means there is more work needing to be done than cores to do it, and the OS is falling behind managing the allocation of work to cores, and the processes have to wait (potentially a long time!) before they get a chance to run on a core.
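
You can also read the load average from inside Python (Unix only):

import os

# 1-, 5- and 15-minute load averages; on your 32-core node, values
# close to 32 mean all cores are fully utilized.
one, five, fifteen = os.getloadavg()
print(one, five, fifteen)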


Hello Steven,

Thank you very much for all your comments and explanations! They have been very useful and have helped me understand what is really going on inside the CPU when I set the “processes” parameter of the Pool object.

Thanks again for your help!

Joan

Hello Kazuya,

Let me try to answer your three questions:

  1. Yes, the node where I am running my program has two CPUs, each of them with 16 cores.

  2. So far, setting “processes” to 8 has produced the fastest execution time of the program.

  3. I guess the ideal value for the “processes” argument will depend on the total number of cores you are able to use. If you ask Steven, I am sure he will give you an answer with a computational basis.

Regards,

Joan

Hello Anders,

I will take your idea into account. The thing is that the product I am developing is grid-point based, so each grid point must have its own value.

Regards,

Joan