Leaked semaphore objects to clean up error when using concurrent futures

I am trying to download a bunch of files from AWS S3 and then process them. I am using concurrent.futures.ThreadPoolExecutor.

Here is my code:

import concurrent.futures
from typing import Dict

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures_to_filename: Dict[concurrent.futures.Future, str] = {}
    for filename in _glob_files_for_seven_days():
        futures_to_filename[
            executor.submit(self.process_file, filename)
        ] = filename
        if len(futures_to_filename) > _MAX_FILES_TO_PROCESS_CONCURRENTLY:
            # Drain the pending batch before submitting more files
            _process_futures(futures_to_filename)
    # Drain whatever is still pending after the last submission
    _process_futures(futures_to_filename)


def _process_futures(futures: Dict[concurrent.futures.Future, str]):
    # Block until every submitted future has finished and drop it from the
    # dict so it (and any result it holds) can be garbage collected.
    for completed_future in concurrent.futures.as_completed(futures):
        futures.pop(completed_future)

_MAX_FILES_TO_PROCESS_CONCURRENTLY is a parameter I use to make sure that memory doesn't blow up.

Simply by increasing _MAX_FILES_TO_PROCESS_CONCURRENTLY and max_workers, I get a warning that says:

- /usr/local/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

I need some ideas on how to debug this. Also, it only happens on Linux machines; I haven't seen the issue locally on my Mac.

This warning is emitted by the resource tracker when the program exits. It is only a warning, so you could choose to ignore it. It is ugly, though, and it can indicate a genuine resource leak in your code. There were also several related bugs in Python 3.8 and later (and in some third-party libraries such as tqdm) that produced this warning spuriously and were only fixed in later Python versions (see the CPython issue tracker).

But if, as you wrote, this only happens on Linux and only when you increase the number of threads, my guess is that you are running out of system resources. It is difficult to say more, since you don't show all of the code (including the actual values for max_workers, the number of files, and the available memory). If the task is IO-bound (which this one seems to be, assuming it is just downloads and not much more), then you could set max_workers to roughly 2x the number of CPUs, as in the sketch below. Setting it much higher doesn't buy you anything and will cause problems at some point.
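For example, a minimal sketch of that sizing, assuming the submission loop otherwise stays as in your question (the 2x factor is only the rule of thumb above, not a hard rule):

import concurrent.futures
import os

# os.cpu_count() can return None, hence the fallback to 1.
max_workers = 2 * (os.cpu_count() or 1)

with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    ...  # submit and collect the downloads here, as in the question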

By the way, it seems to me that you can simplify your code: you are already setting max_workers, so _MAX_FILES_TO_PROCESS_CONCURRENTLY seems unnecessary; something along the lines of the sketch below should behave the same.
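Here is a minimal sketch of that simplification; process_file and _glob_files_for_seven_days stand in for the functions from your question, and I'm assuming each file's result can be handled (or discarded) as soon as it finishes:

import concurrent.futures
import os
from typing import Dict

with concurrent.futures.ThreadPoolExecutor(max_workers=2 * (os.cpu_count() or 1)) as executor:
    # Submit everything up front; max_workers already caps how many files
    # are downloaded and processed at the same time.
    futures_to_filename: Dict[concurrent.futures.Future, str] = {
        executor.submit(process_file, filename): filename
        for filename in _glob_files_for_seven_days()
    }
    # Handle each result as soon as it is ready and drop the future, so
    # completed results are not kept around.
    for future in concurrent.futures.as_completed(futures_to_filename):
        filename = futures_to_filename.pop(future)
        future.result()  # re-raises any exception from process_file

This still keeps one small Future object per file in the dict, but the results themselves are released as soon as they are consumed, so memory stays bounded as long as process_file doesn't return or hold on to large objects.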