Running code concurrently in multiple processes is a huge advantage when performing CPU-hungry tasks such as rendering PDF pages to images.
As a relatively new Python developer, I managed to implement some kind of parallel processing using concurrent.futures.ProcessPoolExecutor and its map() method, but the resulting code really doesn’t feel nice.
I’ll try to outline what I did using a code snippet based on a real use case from my project pypdfium2:
import concurrent.futures
import os

# PdfContext, render_page_topil, pdfium and OptimiseMode come from pypdfium2
# (their imports are omitted here).

def _process_page(input_obj, index, password, scale, rotation, colour, annotations, greyscale, optimise_mode):
    # Each worker process opens the document itself and renders one page.
    with PdfContext(input_obj, password) as pdf:
        result = render_page_topil(
            pdf, index,
            scale = scale,
            rotation = rotation,
            colour = colour,
            annotations = annotations,
            greyscale = greyscale,
            optimise_mode = optimise_mode,
        )
    return index, result

def _invoke_process_page(args):
    # pool.map() hands over one tuple per task, so unpack it here.
    return _process_page(*args)

def render_pdf(
        input_obj,
        page_indices = None,
        password = None,
        n_processes = os.cpu_count(),
        scale = 1,
        rotation = 0,
        colour = (255, 255, 255, 255),
        annotations = True,
        greyscale = False,
        optimise_mode = OptimiseMode.none,
    ):

    with PdfContext(input_obj, password) as pdf:
        n_pages = pdfium.FPDF_GetPageCount(pdf)

    # Default to all pages, then validate the requested indices.
    if page_indices is None or len(page_indices) == 0:
        page_indices = list(range(n_pages))
    if not all(0 <= i < n_pages for i in page_indices):
        raise ValueError("Out of range page index detected.")

    # Width of the zero-padded page number used as a file name suffix.
    n_digits = len(str( max(page_indices)+1 ))

    args = [(input_obj, i, password, scale, rotation, colour, annotations, greyscale, optimise_mode) for i in page_indices]

    with concurrent.futures.ProcessPoolExecutor(n_processes) as pool:
        for index, image in pool.map(_invoke_process_page, args):
            suffix = str(index+1).zfill(n_digits)
            yield image, suffix
What displeases me most in this code is the necessity of using positional arguments, which tend to become an unreliable mess for functions that take lots of parameters. Similarly, I dislike the extra argument unwrapping function _invoke_process_page().
Apparently, this pool.map() thing is just not made for my use case, as it presumes a function that only takes a single argument. I was unable to find a better alternative while browsing the documentation for multiprocessing and concurrent.futures, but I might have missed something.
Note, though, that an essential requirement of my code is that the function carrying out parallel processing must yield its results step by step instead of accumulating them in a list. Since images can be large, it is very advantageous to only have one such object in memory at a time. That’s why I can’t use multiprocessing.Pool.starmap(), for instance.
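To make the step-by-step requirement concrete, here is a minimal sketch of the consumption pattern I mean. It is not the pypdfium2 code: fake_render, render_lazily and the executor_cls parameter are hypothetical stand-ins. It relies on Executor.submit(), which forwards keyword arguments to the worker, and drops each future once its result has been handed out.

```python
import collections
import concurrent.futures


def fake_render(index, *, scale=1, greyscale=False):
    """Hypothetical stand-in for a page renderer; returns a short description."""
    return index, f"page {index + 1} at scale {scale}, greyscale={greyscale}"


def render_lazily(page_indices, executor_cls=concurrent.futures.ProcessPoolExecutor, **options):
    """Yield (index, result) pairs in page order, one at a time."""
    with executor_cls() as pool:
        # submit() passes keyword arguments through to the worker unchanged.
        pending = collections.deque(
            pool.submit(fake_render, i, **options) for i in page_indices
        )
        while pending:
            # Remove the future from the queue before yielding, so its result
            # can be garbage-collected once the caller is done with it.
            yield pending.popleft().result()
```

One caveat, as far as I understand it: all tasks are submitted up front, so results that finish before they are consumed still wait inside their futures. Only the caller's side is strictly one object at a time. On platforms that spawn worker processes, the calling code also has to sit behind an if __name__ == "__main__": guard.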
I’d be glad to hear any recommendations on how I could improve my implementation of concurrent processing. I would be especially interested in a way to use keyword arguments, as this would save me from code duplication and the potential risks of positional arguments.
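As a starting point for discussion, one candidate I have not fully evaluated is functools.partial: bind the shared options by keyword once, so that map() only has to supply the page index. The sketch below uses hypothetical stand-ins (fake_render, render_pages, executor_cls) rather than the real pypdfium2 functions.

```python
import concurrent.futures
import functools


def fake_render(index, *, scale=1, rotation=0, greyscale=False):
    """Hypothetical stand-in for the real per-page worker."""
    return index, {"scale": scale, "rotation": rotation, "greyscale": greyscale}


def render_pages(page_indices, executor_cls=concurrent.futures.ProcessPoolExecutor, **options):
    """Yield (index, result) pairs in the order of page_indices."""
    # Bind the shared options by keyword; only the page index varies per task.
    worker = functools.partial(fake_render, **options)
    with executor_cls() as pool:
        # map() returns an iterator, so results are passed on one by one.
        yield from pool.map(worker, page_indices)
```

If this holds up, it would remove both the positional argument tuple and the unwrapping helper, since a partial of a module-level function can be pickled and sent to worker processes.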