Using the multiprocessing module to process a large file in Python

The following is part of my code:

    if __name__ == '__main__':
        pool = mp.Pool(mp.cpu_count())
        print("cpu counts:" + str(mp.cpu_count()))
        with open("./test.txt") as f:
            nextLineByte = f.tell()
            for line in iter(f.readline, ''):
                print("nextLineByte" + str(nextLineByte))
                pool.apply_async(processWrapper, args=(nextLineByte, f))
                nextLineByte = f.tell()
        pool.close()
        pool.join()

When I run this code, it produces output, but not the result I need. The issue is with this line:

      pool.apply_async(processWrapper, args=(nextLineByte,f) )

It looks like the target function processWrapper is not reached at all, which is really confusing. Any comments are greatly appreciated.

The task in the subprocess fails because you are passing a file object (f) as an argument. File objects can’t be passed to subprocesses. You probably meant to pass line instead.

The bigger issue is that the code doesn’t look at the result of the subprocess, so that exception is ignored.
The apply_async method returns an AsyncResult. Before joining, you should call get() on each of these to get the return value or exception from the function called in the subprocess.

    results = []
    for line in ...:
        result = pool.apply_async(...)
        results.append(result)
    ...
    for result in results:
        print('Got:', result.get())
    pool.close()
    pool.join()
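
To make that pattern concrete, here is a minimal runnable sketch. The worker `parse_number` and the helper `collect` are hypothetical names made up for illustration, not part of the original code. The third input is deliberately invalid, so one task fails in the worker and the failure only becomes visible when `get()` is called.

```python
import multiprocessing as mp


def parse_number(text):
    # Hypothetical worker: raises ValueError for non-numeric input.
    return int(text)


def collect(async_results):
    """Call get() on each AsyncResult and record either the value or the error."""
    outcomes = []
    for result in async_results:
        try:
            outcomes.append(("ok", result.get(timeout=30)))
        except Exception as exc:  # the worker's exception is re-raised by get()
            outcomes.append(("error", repr(exc)))
    return outcomes


def main():
    lines = ["1", "2", "oops"]
    with mp.Pool(2) as pool:
        pending = [pool.apply_async(parse_number, args=(line,)) for line in lines]
        # All get() calls happen while the pool is still alive.
        for outcome in collect(pending):
            print(outcome)


if __name__ == "__main__":
    main()
```

Without the `get()` calls the script would finish quietly and the `ValueError` from the last task would never be shown, which matches the symptom in the question.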

Yes, thank you for your kind reply. I was a bit confused about it earlier, but now I understand. Do you know why a file object cannot be passed as an argument to a subprocess?

The method to pass data to the subprocess, pickle, is designed for general serialization. It handles objects you can send over the 'net to another computer, or save to disk to be opened a few years later. Open files aren’t that.
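
You can see this without multiprocessing at all by asking pickle directly. The following is a small sketch; `try_pickle` is a hypothetical helper written for this illustration, and the temporary file only stands in for any open file.

```python
import pickle
import tempfile


def try_pickle(obj):
    """Return the pickled size in bytes, or an error message if pickling fails."""
    try:
        return len(pickle.dumps(obj))
    except (TypeError, pickle.PicklingError) as exc:
        return f"cannot pickle: {exc}"


if __name__ == "__main__":
    with tempfile.TemporaryFile("w+") as f:
        print(try_pickle("a line of text"))  # plain strings pickle fine
        print(try_pickle((0, "test.txt")))   # so do offsets and file names
        print(try_pickle(f))                 # an open file object does not
```

The first two calls print a byte count, while the last one prints the error that the pool would otherwise hit when it tries to send the file object to a worker.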

Even if multiprocessing used a custom, interprocess-only way to do this, it’s not clear what passing a file should do. Processes generally can’t share open files. What would the subprocess do if passed a file object? Open a file with the same name and mode? Or also restore the read/write positions? Keep them synchronized? Or also make sure the file has the same content? Or try using a platform-specific mechanism to share the file? There’s no good default for what exactly should happen, and we don’t guess.
But if you know what should happen, you can pass the file name, mode or contents over, and open the file in the subprocess yourself.
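
As one possible way to do that, here is a sketch that sends only the file name and a byte offset to each task and lets the worker open the file itself. The names `process_line_at` and `line_offsets` are hypothetical, the per-line "work" is a placeholder that just measures the line, and the input file is a throwaway temporary file so the example runs on its own. It assumes binary mode is acceptable, which keeps the offsets as plain byte positions.

```python
import multiprocessing as mp
import os
import tempfile


def process_line_at(path, offset):
    """Open the file in the worker, seek to the byte offset, and read one line."""
    with open(path, "rb") as f:
        f.seek(offset)
        line = f.readline()
    # Placeholder for the real per-line work: report the line length in bytes.
    return offset, len(line.rstrip(b"\r\n"))


def line_offsets(path):
    """Yield the starting byte offset of every line in the file."""
    with open(path, "rb") as f:
        offset = f.tell()
        for _ in iter(f.readline, b""):
            yield offset
            offset = f.tell()


def main():
    # Build a small input file; in real use this would be the large file's path.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"first\nsecond line\nthird\n")
        path = tmp.name
    try:
        with mp.Pool(2) as pool:
            pending = [
                pool.apply_async(process_line_at, args=(path, offset))
                for offset in line_offsets(path)
            ]
            for result in pending:
                print("Got:", result.get(timeout=30))
    finally:
        os.remove(path)


if __name__ == "__main__":
    main()
```

Both arguments are a string and an integer, so they pickle without trouble. For a very large file, one task per line adds noticeable overhead, so sending a batch of offsets per task may be worth considering.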