Hi Python community,
I have some experience with Python and don’t consider myself a complete beginner. However, this is my first time using multiprocessing. I have watched two videos about using multiprocessing in Python.
I have written a script that converts all xlsx files in one folder to pdf files and saves them to another folder. The conversion is done with soffice (the LibreOffice command-line tool). I have about 100 files, so sequential processing takes a while (between 90 and 100 seconds on my PC). Since all files are independent of one another, this looks like an ideal case for multiprocessing. However, the script I wrote hangs and I cannot figure out why. I don’t want the script to overwrite existing files, so I build a list of only those xlsx files that still need to be converted to pdf. If needed, I can provide test files for the XLSX folder.
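The "convert only what is missing" step described above can be sketched as a small pure function. This is a minimal illustration, not the script itself: the name `files_to_convert` is hypothetical, and it compares file stems with `pathlib` instead of slicing extensions off by hand.

```python
from pathlib import Path


def files_to_convert(xlsx_names, pdf_names):
    """Return the xlsx names that have no pdf with the same stem, sorted."""
    # Stems of the pdf files that already exist, e.g. 'report.pdf' -> 'report'
    done = {Path(name).stem for name in pdf_names if name.endswith('.pdf')}
    return sorted(
        name for name in xlsx_names
        if name.endswith('.xlsx') and Path(name).stem not in done
    )


# 'a.xlsx' already has 'a.pdf', and 'notes.txt' is not an xlsx file
print(files_to_convert(['a.xlsx', 'b.xlsx', 'notes.txt'], ['a.pdf']))
```

In the real script the two arguments would be the results of `os.listdir()` on the input and output folders.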
I have also tried using imap, but the problem remains the same.
I’m using Linux Mint Cinnamon 20 with Python 3.8.
This is the code:
#!/usr/local/bin/python3
import os
import subprocess
import multiprocessing as mp
import time


def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    print('Converting file: ', xlsx_file)
    dev_null = open(os.devnull, 'w')
    subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file], stderr=dev_null, stdout=dev_null)
    dev_null.close()


start_t = time.time()
input_directory = './XLSX/'
output_directory = './PdfDir/'

# Create the output folder if it does not exist
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

existing_pdf_files = [file for file in os.listdir(output_directory) if file.endswith('.pdf')]
# Replace the .pdf extension with .xlsx
already_converted_xlsx = [file[:-4] + '.xlsx' for file in existing_pdf_files]
# List of all xlsx files
xlsx_file_list = [file for file in os.listdir(input_directory) if file.endswith('.xlsx')]
# List of xlsx files that actually need to be converted to pdf
xls_files_to_be_converted = [os.path.join(input_directory, file) for file in xlsx_file_list if file not in already_converted_xlsx]
print('Length of the list is ', len(xls_files_to_be_converted))

# Multiprocessing conversion
with mp.Pool() as pool:
    result = pool.imap(convert_pdf_soffice, xls_files_to_be_converted)

end_t = time.time()
duration_t = end_t - start_t
print(f'Duration is {duration_t}')
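One detail in the listing above may be worth checking, although it may not be what causes the hang. `pool.imap` returns a lazy iterator, and leaving the `with mp.Pool()` block calls `terminate()` on the pool, so the results have to be consumed inside the block. The sketch below shows that pattern with a stand-in worker, `convert_stub`, which is hypothetical and does not call soffice. `run_all` is also a made-up helper name.

```python
import multiprocessing as mp
import os


def convert_stub(xlsx_file):
    """Stand-in worker (hypothetical): returns the pdf name, does not call soffice."""
    return os.path.splitext(os.path.basename(xlsx_file))[0] + '.pdf'


def run_all(pool, worker, files):
    """Submit every file to the pool and wait for all results, in input order."""
    # list() drains the lazy imap iterator, so this blocks until all tasks finish
    return list(pool.imap(worker, files))


if __name__ == '__main__':
    files = ['./XLSX/a.xlsx', './XLSX/b.xlsx']
    with mp.Pool(processes=2) as pool:
        # Results are collected before the with block ends and terminates the pool
        print(run_all(pool, convert_stub, files))
```

The `if __name__ == '__main__':` guard is required when the start method is spawn and is harmless under fork, which is the default on Linux with Python 3.8.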
I’m pretty sure this has something to do with the soffice tool, because I cannot reproduce the hang in any other case. For example, the following version does not hang (I know this is not a correct way to produce a pdf; it is only a test to see whether it would also hang):
import shutil  # needed in addition to the imports above


def convert_pdf_soffice(xlsx_file):
    out_dir = './PdfDir/'
    file = os.path.basename(xlsx_file)
    # 'name.xlsx'[:-4] is 'name.', so this copies to 'name.pdf'
    shutil.copy(xlsx_file, out_dir + file[:-4] + 'pdf')
    #dev_null = open(os.devnull, 'w')
    #subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, xlsx_file], stderr=dev_null, stdout=dev_null)
    #dev_null.close()
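If soffice is indeed the culprit, one possibility is that concurrent soffice instances interfere with each other through the shared user profile directory. This is an assumption and has not been verified for this setup. A minimal sketch of a worker that gives each conversion its own throwaway profile through the `-env:UserInstallation` bootstrap variable, and adds a timeout so a stuck conversion raises instead of hanging, could look like this. The names `build_soffice_command` and `convert_pdf_isolated` and the 120-second timeout are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def build_soffice_command(xlsx_file, out_dir, profile_dir):
    """Build the soffice argument list with a private profile directory."""
    # UserInstallation expects a file:// URL, so convert the path accordingly
    profile_uri = Path(profile_dir).resolve().as_uri()
    return ['soffice',
            f'-env:UserInstallation={profile_uri}',
            '--headless', '--convert-to', 'pdf',
            '--outdir', out_dir, xlsx_file]


def convert_pdf_isolated(xlsx_file, out_dir='./PdfDir/', timeout=120):
    """Convert one file with a temporary profile and return soffice's exit code.

    Raises subprocess.TimeoutExpired if soffice runs longer than `timeout` seconds.
    """
    with tempfile.TemporaryDirectory(prefix='lo_profile_') as profile_dir:
        cmd = build_soffice_command(xlsx_file, out_dir, profile_dir)
        completed = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL, timeout=timeout)
    return completed.returncode
```

Creating a fresh profile for every file adds startup cost. Reusing one profile directory per worker process would be a possible refinement.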