Add `filter()` method to `multiprocessing.Pool` class

Pitch

Currently, `multiprocessing.Pool.map` is a parallel replacement for the `map` builtin. It would be nice to add `multiprocessing.Pool.filter` as a parallel replacement for the `filter` builtin.

Implementation

Similar to the `Pool.map()` method, it should divide the input sequence into multiple chunks and feed them to different pool workers. But the filtering after mapping should also happen per chunk and in parallel; finally, the filtered chunks are joined together. This maximizes the filtering parallelism.
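
A minimal sketch of that design, assuming a hypothetical `pool_filter` helper (not part of the stdlib): it chunks the input, filters each chunk inside a worker so rejected items never travel back to the parent, and chains the survivors:

```python
from itertools import chain
from multiprocessing import Pool

def _filter_chunk(args):
    # Runs in a worker: apply the predicate to one chunk and return
    # only the items that pass, so rejects are dropped worker-side.
    predicate, chunk = args
    return [item for item in chunk if predicate(item)]

def pool_filter(pool, predicate, iterable, chunksize=10_000):
    # Hypothetical stand-in for the proposed Pool.filter(): split the
    # input into chunks, filter each chunk in parallel, join the results.
    # The predicate must be a picklable top-level function.
    items = list(iterable)
    chunks = [(predicate, items[i:i + chunksize])
              for i in range(0, len(items), chunksize)]
    return list(chain.from_iterable(pool.map(_filter_chunk, chunks)))
```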

How would you use it? Could you please show a simple realistic example?

@storchaka
Very simple:

For number theory: getting the prime numbers under 10000000:

```python
import multiprocessing

# isprime is a user-supplied primality test
with multiprocessing.Pool(8) as pool:
    primes = pool.filter(isprime, range(2, 10000000))
```

Any place where `filter` is used could be replaced by this.

If a user currently wants to filter a sequence in parallel (see https://stackoverflow.com/a/34059823), they have to add an extra piece of data (a flag or sentinel return value) to the result list, and the actual filtering pass has to run sequentially. This is inefficient.
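
For illustration, a sketch of that flag-based workaround (the `test_item` helper and trial-division `isprime` are stand-ins, not taken from the linked answer):

```python
from multiprocessing import Pool

def isprime(n):
    # Simple trial-division stand-in for an expensive test.
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def test_item(n):
    # Each worker returns a (flag, value) pair instead of dropping misses.
    return (isprime(n), n)

if __name__ == "__main__":
    with Pool(8) as pool:
        flagged = pool.map(test_item, range(2, 100_000))
    # The screening pass runs sequentially in the parent process.
    primes = [n for keep, n in flagged if keep]
```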

I believe this new method is not difficult to implement.

I don’t see how you get a speed-up compared to using map() and then filtering post-execution. Is there a performance win somehow? Or is this just for ergonomics?

If the testing cost is dominant, `Pool.filter()` can be simulated with `Pool.map()`. For example: `primes = filter(None, pool.map(prime_or_none, range(2, 10000000)))`, where `prime_or_none(n)` returns `n` if it is prime and `None` otherwise. It has the overhead of transferring all these `None`s. In this case it is small, but it grows as the ratio of negative to positive test results grows.
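
Spelled out as a runnable sketch (with a trial-division `isprime` standing in for the real test):

```python
from multiprocessing import Pool

def isprime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def prime_or_none(n):
    # Worker-side test: returns n for a hit, None for a miss; every
    # None is still pickled and sent back to the parent process.
    return n if isprime(n) else None

if __name__ == "__main__":
    with Pool(8) as pool:
        primes = list(filter(None, pool.map(prime_or_none, range(2, 100_000))))
```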

But what if the input is expensive too? `filter()` is usually combined with `map()` or another non-trivial generator (a file iterator, a CSV reader generator, etc.). How do you combine `Pool.filter()` with `Pool.map()`? What if you have a more complex flow like `filter(..., map(..., filter(...), filter(..., map(...))))`? Making a simple `Pool.filter()` is easy, but it could only be used in some simple cases; making a tool useful in complex cases is difficult.
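
To make the composition concern concrete, here is a sketch contrasting the lazy builtins with a pooled pipeline (`pred1`, `func`, `pred2` are placeholder callables; the list comprehensions stand in for the proposed `pool.filter`):

```python
from multiprocessing import Pool

def pred1(n): return n % 2 == 0      # placeholder first filter
def func(n):  return n * n           # placeholder mapped function
def pred2(n): return n < 1000        # placeholder second filter

if __name__ == "__main__":
    source = range(100)

    # The builtins compose lazily, streaming one item at a time:
    lazy = list(filter(pred2, map(func, filter(pred1, source))))

    # A pooled pipeline materializes every intermediate stage and
    # round-trips it through the parent process between stages:
    with Pool(4) as pool:
        stage1 = [n for n in source if pred1(n)]  # would be pool.filter(pred1, source)
        stage2 = pool.map(func, stage1)
        pooled = [n for n in stage2 if pred2(n)]  # would be pool.filter(pred2, stage2)

    assert lazy == pooled
```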

I don’t see how you get a speed-up compared to using map() and then filtering post-execution. Is there a performance win somehow? Or is this just for ergonomics?

Because an ideal multiprocessing pool filter kicks negative results out inside each worker, on the fly, as part of the test. With the built-in `filter` you have to wait until all the workers are finished and then run the screening step without any parallelism. The performance difference is significant when the number of workers is large.
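
One way to approximate that on-the-fly behavior today is `imap_unordered`, which lets the parent discard misses while other chunks are still being tested, though it still ships every sentinel back, which `Pool.filter()` would avoid (the trial-division `isprime` is again a stand-in):

```python
from multiprocessing import Pool

def isprime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def prime_or_none(n):
    return n if isprime(n) else None

if __name__ == "__main__":
    with Pool(8) as pool:
        # Results stream back as chunks complete, so misses are dropped
        # while workers are still testing, instead of after a full map().
        # Note: imap_unordered does not preserve input order.
        primes = [n for n in pool.imap_unordered(prime_or_none,
                                                 range(2, 100_000),
                                                 chunksize=1000)
                  if n is not None]
```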

Ergonomics is also an important reason, because Python code should be Pythonic.

When the input is expensive too, I don’t think there is any disadvantage of `Pool.filter(...)` in comparison with `filter(...)`. `Pool.filter(...)` and `filter(...)` are semantically equivalent and can be interchanged in any situation.

I thought about a way to approach the problem without a Pool.filter().

The parallelized function could take in the range of items to process, filter them, and return an iterator or a list that can be merged (`chain`, `zip`) from within the caller, as sketched below.
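
A sketch of that approach, with each worker filtering its own sub-range and the caller chaining the survivors (`primes_in_range` and the trial-division `isprime` are illustrative names):

```python
from itertools import chain
from multiprocessing import Pool

def isprime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def primes_in_range(bounds):
    # Each worker tests and filters its own sub-range; only the
    # survivors are returned to the caller.
    lo, hi = bounds
    return [n for n in range(lo, hi) if isprime(n)]

if __name__ == "__main__":
    step = 10_000
    ranges = [(lo, min(lo + step, 100_000))
              for lo in range(2, 100_000, step)]
    with Pool(8) as pool:
        # Merge the per-range result lists back into one sequence.
        primes = list(chain.from_iterable(pool.map(primes_in_range, ranges)))
```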
