Multiprocessing pool slows down over time on Linux vs. Windows

We are trying to run multiple simulation tasks using a multiprocessing pool in order to reduce the overall runtime compared to running each task individually in series. At the beginning of the run, CPU and GPU utilization are very high, indicating multiple processes running in the background; however, over time both CPU and GPU usage drop to almost 0.


import multiprocessing
import time

import main_mp

def run_sim(process_num, input_list, gpu_device_list):
    """
    multiprocess target function
    :param process_num: process number
    :param input_list: list of input file tasks
    :param gpu_device_list: GPU device ids for each task
    """
    start_time = time.time()
    try:
        print(f"Process number: {process_num} starting")
        main_mp.run_mp_simulation_flow_dist(input_list, gpu_device_list)
        status = "completed"
    except Exception as e:
        print(f"Process number: {process_num} failed: {e}")
        status = "failed"
    finally:
        end_time = time.time()
        elapsed_time = end_time - start_time
        log_process_status(process_num, status, elapsed_time)


if __name__ == '__main__':

    # multi-process parameters
    multi_processes_number = int(multiprocessing.cpu_count())
    gpu_workers = get_cuda_device()
    
    # get file list and split it to tasks
    tuple_file_list = get_files(path)
    data_list, worker_num = split_data(tuple_file_list, args.process_num)
    
    # assign gpus to each task to balance the overall workload
    data_list_sorted = sb.run_sort(data_list, len(gpu_workers))

    total_runtime = 0
    total_start_time = time.time()

    multiprocessing.set_start_method('spawn')
    multiprocessing.freeze_support()

    # run simulation flow
    run_times = []
    with multiprocessing.Pool(worker_num) as pool:
        run_times.append(pool.starmap(run_sim, [(i, data_list_sorted[i][0],
                                                 data_list_sorted[i][1]) for i in range(worker_num)]))
        # close the pool and wait for all workers to finish before exiting the with-block
        pool.close()
        pool.join()

    total_end_time = time.time()
    total_runtime = total_end_time - total_start_time

    save_log(in_file_list, multi_processes_number, gpu_workers)
    print("multi-process flow completed in ", round(total_runtime/60, 3), " minutes")

Environment details: Python 3.12.4 running on an AWS instance (g6.24xlarge, 96 vCPUs, 4 NVIDIA L4 GPUs), CUDA 12.3, Ubuntu Server 24.04.

The simulation is invoked in run_mp_simulation_flow_dist() and utilizes OpenMP and CUDA; however, we didn't write its code ourselves and therefore do not directly control how it allocates the machine's resources.

Each process's simulation task (invoked in run_sim(), which calls run_mp_simulation_flow_dist()) is composed of sub-tasks; after each sub-task finishes, a file is written out to the hard disk. The sub-tasks run in series, but each uses OpenMP and CUDA for its speed-up. Our expectation was that running 96 processes in parallel would speed up the overall runtime of the program. At the beginning, the GPU and CPU usage displayed in the terminal is ~95-100%, but over time it drops to 2-4%. The processes don't share any of the objects/parameters we pass to them; however, more than one process at a time can use the same GPU device, and apparently some of the same CPU resources as well.
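For reference, here is a minimal, self-contained sketch of the per-worker resource layout we have in mind, i.e. giving each pool worker an OpenMP thread budget and a GPU assignment through OMP_NUM_THREADS and CUDA_VISIBLE_DEVICES. This is not our production code: it assumes the simulation library reads these environment variables at startup, and the worker count, thread budget and GPU scheme are illustrative only.

import multiprocessing
import os

N_WORKERS = 40            # illustrative worker count
N_GPUS = 4
OMP_THREADS = 2           # illustrative OpenMP thread budget per worker

def init_worker(omp_threads):
    # runs once in every pool worker before it accepts any task
    os.environ["OMP_NUM_THREADS"] = str(omp_threads)

def run_task(task_id, gpu_id):
    # restrict this task to its assigned GPU; only effective if the
    # simulation library initializes CUDA after this point
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # ... the simulation for task_id would be called here ...

if __name__ == "__main__":
    tasks = [(i, i % N_GPUS) for i in range(N_WORKERS)]
    with multiprocessing.Pool(N_WORKERS, initializer=init_worker,
                              initargs=(OMP_THREADS,)) as pool:
        pool.starmap(run_task, tasks)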

Could this be the result of some process hanging while holding its allocated resources, producing a domino effect that slows the entire program down over time? We tried running the code with 40 processes instead of 96 (leaving more vCPUs free for run_mp_simulation_flow_dist()), but the results were the same. Attached is an image of the instance's CPU monitor for the 40-process case.
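One thing we are considering in order to test the hanging-process theory (a sketch, not something already in our code) is registering the standard-library faulthandler in each worker, so that sending SIGUSR1 to a suspect PID dumps that worker's Python traceback. It won't show where native OpenMP/CUDA code is stuck, but it would tell us whether the worker is still inside run_mp_simulation_flow_dist().

import faulthandler
import signal

def enable_stack_dumps():
    # Unix-only: dump the Python stacks of all threads in this process
    # whenever it receives SIGUSR1 (e.g. `kill -USR1 <worker pid>`).
    faulthandler.register(signal.SIGUSR1, all_threads=True)

Calling enable_stack_dumps() at the top of run_sim(), or from a Pool initializer, would be enough to enable this in every worker.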

When running the same code on a Windows Server 2022 machine (same HW), no performance drops were observed and the processes were parallelized correctly. We tried setting set_start_method to 'spawn' on the Ubuntu machine and setting maxtasksperchild to 1, but this didn't help either. What key difference might be the cause of this?
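For reference, an equivalent, self-contained way to express those two settings, using an explicit spawn context instead of the module-level set_start_method (the task body here is just a placeholder):

import multiprocessing as mp

def run_sim(i):
    return i  # placeholder for the real simulation task

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # explicit spawn start method
    # maxtasksperchild=1 replaces each worker process after a single task
    with ctx.Pool(processes=4, maxtasksperchild=1) as pool:
        print(pool.map(run_sim, range(8)))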

Can anyone offer a way to resolve this? We appreciate any feedback and assistance.

Yes, that is possible, and I'd guess likely given your description.

You can use ps afx to see what processes are running on your system as a tree.

Thank you for your reply.
I ran top and watch -n 0.5 nvidia-smi in addition to the ps afx command as you recommended, and monitored the process tree.
The thing I found strange is that there is a --multiprocessing-fork argument at the end of each process's command line instead of spawn. Is this expected?

\_ /home/ubuntu/miniconda3/envs/sim/bin/python -c from multiprocessing.resource_tracker import main;main(5)
   2941 pts/0    Rl+    8:18  |               \_ /home/ubuntu/miniconda3/envs/sim/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=12) --multiprocessing-fork
   2942 pts/0    Rl+    9:09  |               \_ /home/ubuntu/miniconda3/envs/sim/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=14) --multiprocessing-fork
   2943 pts/0    Rl+    8:33  |               \_ /home/ubuntu/miniconda3/envs/sim/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=16) --multiprocessing-fork

It’s hard to tell the exact difference in the ps output between the beginning of the run, when all of the GPU and CPU resources are used, and the end stage, when they are not utilized. However, it seems that at the end, at any given time, more processes have the Rl+ status, compared to the beginning, where there is a larger mixture of Rl and Dl states. I would assume that the D state is the expected file writing that occurs on each sub-iteration of the simulation, which indicates the software is running as expected. This confirms that the entire parallel operation slows down over time, in the sense that fewer file writes are performed in a given period.

Can this observation be used to solve this problem, and why does this not occur on Windows?

You could add logging to the code that runs in the forked processes so that you know when each one starts and, more importantly, when it finishes.
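Something along these lines would do it (just a sketch; the file name and format are examples, and your real simulation call goes where the ellipsis is). Each worker records when a task starts and finishes, tagged with its PID, so stalled workers stand out:

import logging
import time

logging.basicConfig(filename="workers.log", level=logging.INFO,
                    format="%(asctime)s pid=%(process)d %(message)s")

def run_sim(process_num, input_list, gpu_device_list):
    logging.info("task %d starting", process_num)
    t0 = time.time()
    try:
        ...  # your existing call to main_mp.run_mp_simulation_flow_dist()
        logging.info("task %d completed in %.1fs", process_num, time.time() - t0)
    except Exception:
        logging.exception("task %d failed after %.1fs", process_num, time.time() - t0)

With the spawn start method each worker re-imports this module, so the basicConfig call also takes effect in the workers; several processes appending to one file is crude but fine for coarse start/finish lines.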

It's not unusual for code on Windows to need to be written differently from code on Linux systems, but that does not help explain why you seem to have stuck processes on Linux.

Another tool you can use on a process you think might be stuck is strace.
Run it as strace -p <pid> and you can then see which system calls the process is making, or whether it is waiting on some call to return. If it is waiting to read from or write to an FD, you can use lsof -p <pid> to list the open files and their FDs.