How to use Python multiprocessing in a function script (module)?

Hello,
I have a code snippet that is supposed to take a DataFrame called “data” as input, which includes an ID column and a V column. It then uses Fitter() to find the best distribution that fits the data and stores the distribution type, along with its respective parameters, in a nested list called distribution_info. I am trying to do all of this in parallel to save time, since it takes Fitter() a long time to check the fit of 80 distribution functions, and the data for each ID is independent of the other IDs. However, this all happens inside a function script which I call from my main body of code. When I run the main code and it calls the function containing the code below, it returns an error, which is also shown below. As far as I understand, if I do as the error suggests and put this inside a main guard, Python will skip this block of code, since this multiprocessing code is inside a function script (being called from another file). How can I use multiprocessing and the power of my GPU inside a function script then? I would be grateful if someone could help me.

    import scipy.stats as stats
    from fitter import Fitter
    import cupy as cp
    import dask.dataframe as dd
    from dask import compute, delayed
    from dask.distributed import Client

    import multiprocessing

    client = Client()

    # Converting Pandas DataFrame to a CuPy array
    data = cp.asarray(data)

    # Show unique IDs
    unique_ids = cp.unique(data['ID'])


    # Initialize an empty list to store distribution info
    distribution_info = []

    # Loop through each unique ID
    for id in unique_ids:
        # Extract V for the current ID
        var = data[data['ID'] == id]['V']

        # Fit distributions to the data
        f = Fitter(var, timeout=60)
        f.fit()

        # Get the best fitting distribution and its parameters
        best_fit = f.get_best()
        best_fit_name = list(best_fit.keys())[0]
        best_fit_params = best_fit[best_fit_name]

        # Append distribution info to the list
        distribution_info.append(delayed({'ID': id,
                                        'DistributionType': best_fit_name,
                                        'Parameters': best_fit_params}))

    # Compute the results in parallel
    distribution_info = compute(*distribution_info)

Error:

raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

It’s not talking about a main function, but rather about the check if __name__ == "__main__":.

What is __name__?
Let’s take a look at the documentation: __main__ — Top-level code environment — Python 3.12.2 documentation

When a Python module or package is imported, __name__ is set to the module’s name. Usually, this is the name of the Python file itself without the .py extension:
[…]
If the file is part of a package, __name__ will also include the parent package’s path:
[…]
However, if the module is executed in the top-level code environment, its __name__ is set to the string '__main__'.
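
A tiny demonstration (the file name mymodule.py is just for illustration):

    # mymodule.py
    print(f"__name__ here is {__name__!r}")

    if __name__ == "__main__":
        # Only reached when run directly: python mymodule.py
        print("running as the top-level script")

Running python mymodule.py prints __name__ here is '__main__', while doing import mymodule from another file prints __name__ here is 'mymodule'.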

Why would a simple check of whether the code is running in the top-level code environment help?
This is because of Windows.
Windows may technically have an equivalent of Unix fork, but Windows libraries and the kernel are not built to be “forked”; this would lead to some weird bugs and overall make multiprocessing a lot worse, so the multiprocessing module instead starts a new Python process and imports the calling module. If something like Process() (or, in this case, Client() from dask) gets called during that import, this would set off an infinite succession of new processes. But since this case has been anticipated, Python instead raises a RuntimeError.
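
So the fix is exactly the idiom from the error message: anything that starts processes must only run when the file is executed as the main module. A minimal sketch with plain multiprocessing (the work function is just a placeholder):

    import multiprocessing

    def work(x):
        # Placeholder for the expensive per-ID fitting
        return x * x

    if __name__ == "__main__":
        # On Windows each child process re-imports this module; the guard
        # stops that import from recursively spawning more children.
        with multiprocessing.Pool(processes=4) as pool:
            print(pool.map(work, range(10)))

Note that work stays at module level so the child processes can find it after the re-import; only the process creation goes under the guard.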


I am not asking for the reason; I am looking for a solution. I would appreciate it if you could help me solve this problem.

Sorry that I didn’t mention it directly: put your Client() code into the if block, like in python - Dask fails with freeze_support bug - Stack Overflow.
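
To make that concrete: the guard goes in your main script, not in the function script, so the module stays importable. Here is a sketch of how the split could look, with a hypothetical module name fit_distributions.py; it also wraps the expensive Fitter call itself in delayed() (wrapping the already-computed result dict, as in your snippet, runs everything serially) and keeps the data as a pandas DataFrame, since Fitter/scipy run on the CPU:

    # fit_distributions.py -- the importable function script; no Client() here
    from fitter import Fitter
    from dask import compute, delayed

    def _fit_one(id_, values):
        # Runs on a Dask worker: fit the candidate distributions for one ID
        f = Fitter(values, timeout=60)
        f.fit()
        best_fit = f.get_best()
        best_name = list(best_fit.keys())[0]
        return {'ID': id_,
                'DistributionType': best_name,
                'Parameters': best_fit[best_name]}

    def fit_all(data):
        # One lazy task per ID; nothing runs until compute()
        tasks = [delayed(_fit_one)(id_, data.loc[data['ID'] == id_, 'V'].to_numpy())
                 for id_ in data['ID'].unique()]
        return compute(*tasks)

and

    # main.py -- the only place a Client is created
    import pandas as pd
    from dask.distributed import Client
    from fit_distributions import fit_all

    if __name__ == '__main__':
        client = Client()               # safe: only the top-level process runs this
        data = pd.read_csv('data.csv')  # hypothetical input file
        distribution_info = fit_all(data)
        client.close()

While the Client is active, compute() sends the delayed tasks to the distributed workers, so each ID’s Fitter call runs in its own process.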

Well, did you have a look at the Executor classes in Python’s concurrent.futures? :wink:
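
For completeness, that standard-library route looks like this; concurrent.futures needs the same __main__ guard, and fit_one below is just a stand-in for the real Fitter logic:

    from concurrent.futures import ProcessPoolExecutor

    def fit_one(item):
        id_, values = item
        # Stand-in for the per-ID Fitter call
        return {'ID': id_, 'N': len(values)}

    if __name__ == '__main__':
        groups = [(1, [0.1, 0.2, 0.3]), (2, [0.4, 0.5])]  # toy data
        with ProcessPoolExecutor() as pool:
            distribution_info = list(pool.map(fit_one, groups))
        print(distribution_info)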