Hello,
I have a code snippet that is supposed to take a DataFrame called "data", which includes an ID column and a V column. It uses Fitter() to find the distribution that best fits the data for each ID and stores the distribution type, together with its parameters, in a nested list called distribution_info. Since the data for each ID is independent of the other IDs, I am trying to do all of this in parallel to save time; Fitter() takes a long time to check the fitness of about 80 distributions. However, all of this happens inside a function script that I call from the main body of my code. When I run the main code and it calls the function containing the code below, it returns the error shown underneath. As far as I understand, if I do as the error suggests and put this inside an `if __name__ == '__main__':` block, Python will skip this block of code, because the multiprocessing code is inside a function script that is called from another file. How can I use multiprocessing and the power of my GPU inside a function script? I would be grateful if someone could help me.
import scipy.stats as stats
from fitter import Fitter
import cupy as cp
import dask.dataframe as dd
from dask import compute, delayed
from dask.distributed import Client
import multiprocessing
client = Client()
# Converting Pandas DataFrame to a CuPy array
data = cp.asarray(data)
# Show unique IDs
unique_ids = cp.unique(data['ID'])
# Initialize an empty list to store distribution info
distribution_info = []
# Loop through each unique ID
for id in unique_ids:
    # Extract V for the current ID
    var = data[data['ID'] == id]['V']
    # Fit distributions to the data
    f = Fitter(var, timeout=60)
    f.fit()
    # Get the best fitting distribution and its parameters
    best_fit = f.get_best()
    best_fit_name = list(best_fit.keys())[0]
    best_fit_params = best_fit[best_fit_name]
    # Append distribution info to the list
    distribution_info.append(delayed({'ID': id,
                                      'DistributionType': best_fit_name,
                                      'Parameters': best_fit_params}))
# Compute the results in parallel
distribution_info = compute(*distribution_info)
Error:
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
To fix this issue, refer to the "Safe importing of main module"
section in https://docs.python.org/3/library/multiprocessing.html