We are trying to run multiple simulation tasks through a multiprocessing pool in order to reduce the overall runtime compared to running each task individually in series. At the beginning of a run, CPU and GPU utilization are very high, indicating multiple processes running in the background; however, over time both CPU and GPU usage drop to almost 0.
import multiprocessing
import time

import main_mp


def run_sim(process_num, input_list, gpu_device_list):
    """
    Multiprocess target function.
    :param process_num: process number
    :param input_list: list of input file tasks
    :param gpu_device_list: gpu device ids for each task
    """
    start_time = time.time()
    try:
        print(f"Process number: {process_num} starting")
        main_mp.run_mp_simulation_flow_dist(input_list, gpu_device_list)
        status = "completed"
    except Exception as e:
        print(f"Process number: {process_num} failed")
        status = "failed"
    finally:
        end_time = time.time()
        elapsed_time = end_time - start_time
        log_process_status(process_num, status, elapsed_time)


if __name__ == '__main__':
    # multi-process parameters
    multi_processes_number = int(multiprocessing.cpu_count())
    gpu_workers = get_cuda_device()
    # get the file list and split it into tasks
    tuple_file_list = get_files(path)
    data_list, worker_num = split_data(tuple_file_list, args.process_num)
    # assign gpus to each task to balance the overall workload
    data_list_sorted = sb.run_sort(data_list, len(gpu_workers))
    run_times = []
    total_runtime = 0
    total_start_time = time.time()
    multiprocessing.set_start_method('spawn')
    multiprocessing.freeze_support()
    # run simulation flow
    with multiprocessing.Pool(worker_num) as pool:
        run_times.append(pool.starmap(run_sim, [(i, data_list_sorted[i][0],
                                                  data_list_sorted[i][1]) for i in range(worker_num)]))
        # close all worker processes
        pool.close()
        pool.join()
    total_end_time = time.time()
    total_runtime = total_end_time - total_start_time
    save_log(in_file_list, multi_processes_number, gpu_workers)
    print("multi-process flow completed in ", round(total_runtime / 60, 3), " minutes")
Environment details: Python 3.12.4, running on an AWS instance (g6.24xlarge, 96 vCPUs, 4 NVIDIA L4 GPUs), CUDA 12.3, Ubuntu Server 24.04.
The simulation itself is invoked in run_mp_simulation_flow_dist() and uses OpenMP and CUDA; however, we did not write its code ourselves and therefore do not control how it allocates the machine's resources. Each process's simulation task (invoked in run_sim() and directly in run_mp_simulation_flow_dist()) is composed of sub-tasks, and after each sub-task finishes, a file is written out to the hard disk. The sub-tasks run in series, but each uses OpenMP and CUDA for its speed-up. Our expectation was that running 96 processes in parallel would reduce the overall runtime of the program. At the beginning, the GPU and CPU usage shown in the terminal is ~95-100%, but over time it drops to 2-4%. The processes do not share any of the objects/parameters we pass to them; however, more than one process at a time can use the same GPU device, and apparently some of the same CPU resources as well.
Could this be the result of some process hanging while holding its allocated resources, causing a domino effect that slows the entire program down over time? We tried running the code with 40 processes instead of 96 (leaving more vCPUs free for run_mp_simulation_flow_dist()), but the results were the same. Attached is an image of the instance's CPU monitor for the 40-process case.
When running the same code on a Windows Server 2022 machine (same hardware), no performance drop was observed and the processes were parallelized correctly. On the Ubuntu machine we tried setting set_start_method to spawn and setting max_tasks_per_child to 1, but this didn't help either.
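For reference, the variation we tried on the Ubuntu machine looked roughly like this (using Pool's maxtasksperchild keyword):

    multiprocessing.set_start_method('spawn')
    # limit each worker to a single task so it is replaced by a fresh process afterwards
    with multiprocessing.Pool(worker_num, maxtasksperchild=1) as pool:
        results = pool.starmap(run_sim, [(i, data_list_sorted[i][0],
                                          data_list_sorted[i][1]) for i in range(worker_num)])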
What key difference might be the cause of this, and can anyone suggest a way to resolve it? We appreciate any feedback and assistance.