Using multiprocessing to convert files to pdf

Hi Python community,
I have some experience with Python and don’t consider myself a complete beginner. However, this is my first time using multiprocessing. I have watched two videos about using multiprocessing in Python.
I have written a script to convert all xlsx files from one folder to pdf files and save them to another folder. The conversion is done with soffice (the LibreOffice command-line tool). I have about 100 files, so sequential processing takes a while (between 90 and 100 seconds on my PC). Since all files are independent of one another, this looks like an ideal case for multiprocessing. However, the script I wrote hangs and I cannot figure out why. I don’t want the script to overwrite existing files, so I created a list of only those xlsx files that need to be exported/converted to pdf. If needed, I can provide test files in the folder XLSX.
I have tried using imap as well, but the problem remains the same.
I’m using Linux Mint Cinnamon 20 with Python 3.8.

This is the code:

#!/usr/local/bin/python3

import os
import subprocess
import multiprocessing as mp
import time


def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    print('Converting file: ', xlsx_file)
    dev_null = open(os.devnull, 'w')
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file], stderr=dev_null, stdout=dev_null)
    dev_null.close()


start_t = time.time()

input_directory = './XLSX/'
output_directory = './PdfDir/'

# Create folder if not exists
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
# Swap the .pdf extension for .xlsx so the names match the input files
already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

# List of all xlsx files
xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]

# List of xlsx files that actually need to be converted to pdf
xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]

print('Length of the list is ', len(xls_files_to_be_converted))

# Multiprocessing conversion
with mp.Pool() as pool:
    result = pool.imap(convert_pdf_soffice, xls_files_to_be_converted)



end_t = time.time()
duration_t = end_t - start_t
print(f'Duration is {duration_t}')


I’m pretty sure this has something to do with the soffice tool, because I cannot reproduce a similar situation in any other case, for example (I know this is not the correct way to produce a pdf, but it's just for testing purposes, to see whether this would also hang):

import shutil  # needed for the copy below

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    file = os.path.basename(xlsx_file)

    # Just copy the file under a .pdf name instead of really converting it
    shutil.copy(xlsx_file, out_dir + file[:-4] + 'pdf')
    #dev_null = open(os.devnull, 'w')
    #subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file], stderr=dev_null, stdout=dev_null)
    #dev_null.close()

I’d probably start by not directing stdout and stderr to devnull in case it’s saying that it can’t do that because <some reason>.

Maybe it’s complaining that only one instance of soffice can run at a time.

I’d also suggest restricting it to a minimum number of files (2, probably) processed at a time while you investigate.
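
If it does turn out that soffice won’t run multiple instances against the same user profile, one workaround I’ve seen is to give each worker its own profile directory via the -env:UserInstallation option, so the instances don’t fight over the same profile lock. A rough, untested sketch (the profile path is just an example):

import os
import subprocess

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    # Hypothetical per-worker profile directory; each soffice instance
    # gets its own user profile so they don't contend for the profile lock.
    profile = f'file:///tmp/lo_profile_{os.getpid()}'
    subprocess.run(['soffice', f'-env:UserInstallation={profile}',
                    '--headless', '--convert-to', 'pdf',
                    '--outdir', out_dir, xlsx_file])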

Thank you for the reply. I hope you can agree that I didn’t make any conceptual, obvious mistake in implementing Pool.
I’m not able to reproduce the problem when there are only 2 files, but I am when there are, for example, 9 files.
For example, using this code:

def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    print('Started conversion of ', xlsx_file)
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    print('Finished conversion of ', xlsx_file)

would cause execution to hang; here is a screenshot of the output at that point:

I was thinking about other reasons why it isn’t working and overlooked the important detail of how multiprocessing works!

Multiprocessing works by running the given function in another process, which, of course, requires that the process import the module first, so the module must be written using the if __name__ == '__main__': idiom. If you don’t do that, when the process imports the module, it’ll also run the code that starts the multiprocessing. It’ll spawn processes that’ll spawn processes that’ll spawn processes…

Assuming that it isn’t forking, that is.
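
To illustrate the idiom with a minimal, hypothetical example (not your script):

import multiprocessing as mp

def work(x):
    return x * x

if __name__ == '__main__':
    # Without this guard, each spawned worker re-executes the module-level
    # code when it imports the module and would try to start its own Pool.
    with mp.Pool() as pool:
        print(pool.map(work, range(5)))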

All the ‘good’ examples in the multiprocessing — Process-based parallelism documentation (Python 3.11.4) end with

if __name__ == '__main__':
    <do main module stuff only>

because of the explanation in the “Safe importing of main module” section. I believe this is always required on Windows, though perhaps only ‘usually’ on *nix. ‘Always’ is easiest to remember and should never hurt.


Thank you all for your replies. Adding

if __name__ == '__main__':

did not solve the problem. Sometimes everything goes OK and the script runs to the end, but most of the time it hangs; it is somewhat random in nature.
From the screenshot I posted it can be seen that 7 subprocesses are started, so it is not a case of processes spawning more processes.

Anyway, adding the above code didn’t solve the problem.
Here is one characteristic situation. The output pdf directory is empty and I start the conversion process. Two files are created immediately in the output folder: File2 and File5. The script hangs, and this is the output of the ps aux command (with if __name__ == '__main__' added):

As you’re using imap, don’t you need to iterate over it?

.map returns a list of results, whereas .imap returns an iterable that you need to iterate over to get the results, which will all be None in this case.
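
Something like this sketch, reusing your names:

with mp.Pool() as pool:
    # imap hands back results lazily; iterating is what collects them
    # (and re-raises any exception that occurred inside a worker)
    for result in pool.imap(convert_pdf_soffice, xls_files_to_be_converted):
        pass  # each result is None here, since the function returns nothing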

This is only because I tried many different variants, originally with map, and then with imap.
The following code hangs occasionally in the same manner.

#!/usr/local/bin/python3

import os
import subprocess
import multiprocessing as mp
import time
import shutil


def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    print('Started conversion of ', xlsx_file)
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file])
    print('Finished conversion of ', xlsx_file)
    

if __name__ == '__main__':

    start_t = time.time()

    input_directory = './XLSX/'
    output_directory = './PdfDir/'

    # Create folder if not exists
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
    # Swap the .pdf extension for .xlsx so the names match the input files
    already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]

    # List of all xlsx files
    xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]

    # List of xlsx files that actually need to be converted to pdf
    xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]

    print('Length of the list is ', len(xls_files_to_be_converted))

    # Multiprocessing conversion
    with mp.Pool() as pool:
        result = pool.map(convert_pdf_soffice, xls_files_to_be_converted)

    #for file in xls_files_to_be_converted:
    #    convert_pdf_soffice(file)

    end_t = time.time()
    duration_t = end_t - start_t
    print(f'Duration is {duration_t}')


I have managed to make some progress, but still cannot be sure that this is the solution.
Before, the script used to hang in 4 out of 10 tries. Interestingly, it almost never got stuck when all of the files (more than 100) had to be created; usually it hung when 5, 6 or 7 files had to be created in addition to the existing ones.
My conclusion is that it has something to do with the LibreOffice soffice command-line tool, because when I rewrote the code to build the LibreOffice command once and then pass it to the convert function, it really did improve the statistics. Now hangs happen very rarely, perhaps 1 in 20. I’m playing with time.sleep() inside the convert function. My goal is for this to be guaranteed to finish in under a minute. The sequential approach gives me an execution time of about 80 seconds, and the multiprocessing approach lowers this to below 20 seconds.
Anyway, these were the changes:

Build the LibreOffice command once:

libreoffice_cmd = ['soffice', '--headless', '--convert-to', 'pdf']

# Multiprocessing conversion
with mp.Pool() as pool:
    pool.starmap(convert_pdf_soffice, [(libreoffice_cmd, file) for file in xls_files_to_be_converted])
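
The convert function was adapted to take the command list as its first argument; roughly:

def convert_pdf_soffice(libreoffice_cmd, xlsx_file):
    out_dir = './PdfDir/'
    print('Started conversion of ', xlsx_file)
    # Extend the shared command with the per-file arguments
    subprocess.run(libreoffice_cmd + ['--outdir', out_dir, xlsx_file])
    print('Finished conversion of ', xlsx_file)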