How to implement parallel processing properly?

Running code concurrently in multiple processes is a huge advantage when performing CPU-hungry tasks such as rendering PDF pages to images.
As a relatively new Python developer, I managed to implement some form of parallel processing using concurrent.futures.ProcessPoolExecutor and its map() method, but the resulting code doesn’t feel nice.
I’ll try to outline what I did using a code snippet based on a real use case from my project pypdfium2:

import os
import concurrent.futures
# PdfContext, render_page_topil, OptimiseMode and the pdfium bindings
# come from pypdfium2.

def _process_page(input_obj, index, password, scale, rotation, colour, annotations, greyscale, optimise_mode):
    with PdfContext(input_obj, password) as pdf:
        result = render_page_topil(
            pdf, index,
            scale = scale,
            rotation = rotation,
            colour = colour,
            annotations = annotations,
            greyscale = greyscale,
            optimise_mode = optimise_mode,
        )
    return index, result

def _invoke_process_page(args):
    return _process_page(*args)

def render_pdf(
        input_obj,
        page_indices = None,
        password = None,
        n_processes = os.cpu_count(),
        scale = 1,
        rotation = 0,
        colour = (255, 255, 255, 255),
        annotations = True,
        greyscale = False,
        optimise_mode = OptimiseMode.none,
    ):
    
    with PdfContext(input_obj, password) as pdf:
        n_pages = pdfium.FPDF_GetPageCount(pdf)
    
    if page_indices is None or len(page_indices) == 0:
        page_indices = list(range(n_pages))
    if not all(0 <= i < n_pages for i in page_indices):
        raise ValueError("Out of range page index detected.")
    
    n_digits = len(str( max(page_indices)+1 ))
    args = [(input_obj, i, password, scale, rotation, colour, annotations, greyscale, optimise_mode) for i in page_indices]
    
    with concurrent.futures.ProcessPoolExecutor(n_processes) as pool:
        for index, image in pool.map(_invoke_process_page, args):
            suffix = str(index+1).zfill(n_digits)
            yield image, suffix

What displeases me most about this code is the need to use positional arguments, which tend to become an unreliable mess for functions that take many parameters. I also dislike the extra argument-unpacking function _invoke_process_page().
Apparently, pool.map() just isn’t made for my use case, as it expects a function that takes a single argument. However, I was unable to find a better alternative while browsing the documentation for multiprocessing and concurrent.futures, though I might have missed something.
Note, though, that an integral requirement of my code is that the function carrying out parallel processing must yield its results one by one instead of accumulating them in a list. Since rendered images can be large, it is a significant advantage to hold only one of them in memory at a time. That’s why I can’t use multiprocessing.Pool.starmap(), for instance.
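For comparison, multiprocessing.Pool.imap() is the lazy counterpart to map(): it still expects a single-argument function, but unlike starmap() it yields each result as soon as it is ready (in input order) rather than building a full list first. A minimal sketch with a toy worker function (the names here are illustrative, not from pypdfium2):

```python
import multiprocessing

def _square(index):
    # Toy stand-in for a per-page rendering job.
    return index, index * index

def squares(indices, n_processes=2):
    # imap() yields results one by one (in input order), so only a
    # single result object needs to be consumed at a time.
    with multiprocessing.Pool(n_processes) as pool:
        for index, value in pool.imap(_square, indices):
            yield index, value

if __name__ == "__main__":
    for index, value in squares(range(4)):
        print(index, value)
```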
I’d be glad to hear any recommendations on how I could improve my implementation of concurrent processing. I would be especially interested in a way to use keyword arguments, as that would save me from code duplication and the risks that come with positional arguments.
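One option that takes keyword arguments directly is ProcessPoolExecutor.submit(), which forwards **kwargs to the worker; concurrent.futures.as_completed() then yields each future as it finishes. The trade-offs: results arrive out of order (so the index must travel with each result), and completed-but-unconsumed futures hold their results in memory. A hedged sketch with made-up worker names:

```python
import concurrent.futures

def _render(index, scale=1, rotation=0):
    # Toy stand-in for rendering one page; the index is returned
    # alongside the result so out-of-order completion can be handled.
    return index, index * scale + rotation

def render_all(indices, **kwargs):
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # submit() accepts keyword arguments, unlike map().
        futures = [pool.submit(_render, i, **kwargs) for i in indices]
        # as_completed() yields futures as they finish, in no
        # particular order.
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

if __name__ == "__main__":
    for index, value in sorted(render_all(range(4), scale=2)):
        print(index, value)
```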

Okay, writing this all down helped me think over the whole problem again. I’ve now come up with a substantial improvement in this commit, but it’s still not perfect.

Have you considered using functools.partial to pre-bind the fixed arguments of your function? It accepts keyword arguments.
You might then implement something like the following change:

diff --git a/pdf_renderer.py b/pdf_renderer.py
index 95179a0..2d8f0b3 100644
--- a/pdf_renderer.py
+++ b/pdf_renderer.py
@@ -3,19 +3,17 @@

 import os
 import concurrent.futures
+import functools
 from pypdfium2 import _pypdfium as pdfium
 from pypdfium2._helpers import page_renderer
 from pypdfium2._helpers.opener import PdfContext


-def _process_page(render_meth, input_obj, index, password, **kws):
+def _process_page(index, render_meth, input_obj, password, **kws):
     with PdfContext(input_obj, password) as pdf:
         result = render_meth(pdf, index, **kws)
     return index, result

-def _invoke_process_page(kws):
-    return _process_page(**kws)
-

 def render_pdf_base(
         render_meth,
@@ -51,10 +49,10 @@ def render_pdf_base(
         raise ValueError("Out of range page index detected.")

     n_digits = len(str( max(page_indices)+1 ))
-    arguments = [dict(render_meth=render_meth, input_obj=input_obj, index=i, password=password, **kws) for i in page_indices]
+    _invoke_process_page = functools.partial(_process_page, render_meth=render_meth, input_obj=input_obj, password=password, **kws)

     with concurrent.futures.ProcessPoolExecutor(n_processes) as pool:
-        for index, image in pool.map(_invoke_process_page, arguments):
+        for index, image in pool.map(_invoke_process_page, page_indices):
             suffix = str(index+1).zfill(n_digits)
             yield image, suffix

I haven’t tested it, but I think you’ll get the idea.
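To illustrate the idea with something runnable, here is a self-contained toy version of the pattern (function names are made up, not from pypdfium2). Note that a functools.partial object pickles fine as long as the wrapped function is defined at module level, which is what lets it cross the process boundary even when the partial itself is created inside another function:

```python
import concurrent.futures
import functools

def _process(index, factor, offset=0):
    # Toy stand-in for _process_page(); the fixed keyword arguments
    # are pre-bound via functools.partial below.
    return index, index * factor + offset

def process_all(indices, **kwargs):
    # The partial is a local variable, but it is still picklable
    # because _process is a module-level function.
    worker = functools.partial(_process, **kwargs)
    with concurrent.futures.ProcessPoolExecutor(2) as pool:
        # map() now only needs to pass the varying index.
        yield from pool.map(worker, indices)

if __name__ == "__main__":
    for index, value in process_all(range(4), factor=3, offset=1):
        print(index, value)
```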

Thanks for the hint! functools.partial looks like a reasonable approach.

I’ve committed it: c242369