Using multiprocessing to query a bunch of IPs for hostname

I’ve written some code that pulls a list of IPv4 subnets from a config file and then, using Python’s ipaddress module, iterates over every IP in those subnets, trying to connect via SNMP to get each device’s hostname. I’m looking for routers, switches, and the like. This works well, but it is very slow to iterate over the 4,500+ IPs in those ranges.

So I figured I’d use the multiprocessing module to run many of these queries in parallel. My first problem was that I didn’t want Python attempting to create thousands of subprocesses all at once, so right now I’m using a pool to cap that. It works, but it doesn’t appear to be any faster than running the tasks serially. The intent was to run only so many subprocesses at a time until the list of IPs still waiting to be queried whittles down to none.

My current code looks something like this:

import multiprocessing as mp
from itertools import product

and then later…

pool = mp.Pool()
results = pool.starmap(getHostname, product(addrs, cfg))

In this case, getHostname is the function, and addrs is a list of IPs that I want to iterate over. The cfg object has to be passed to the function because it contains the SNMP community name that the getHostname function needs to make the query. The getHostname function returns a tuple of both the original IP and the acquired hostname (or None if the action was unsuccessful).
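
getHostname itself isn’t shown above; for context, here is a minimal sketch of what such a function might look like, assuming pysnmp’s synchronous high-level API and a hypothetical cfg['community'] key holding the community string:

from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

SYSNAME_OID = '1.3.6.1.2.1.1.5.0'  # SNMPv2-MIB::sysName.0

def getHostname(addr, cfg):
    # Query sysName.0 with a short timeout; each dead host costs
    # timeout * (retries + 1) seconds, which is what makes the scan slow.
    errorIndication, errorStatus, _, varBinds = next(getCmd(
        SnmpEngine(),
        CommunityData(cfg['community']),  # hypothetical cfg layout
        UdpTransportTarget((str(addr), 161), timeout=1, retries=0),
        ContextData(),
        ObjectType(ObjectIdentity(SYSNAME_OID)),
    ))
    if errorIndication or errorStatus:
        return (str(addr), None)          # unreachable or SNMP error
    return (str(addr), str(varBinds[0][1]))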

I’ve fiddled with params to mp.Pool() such as processes, but it doesn’t seem to make a difference. I should note that this is running on a Linux VM. Again, the code runs fine, but it is SLOW – it takes many hours to complete. I would expect it to be noticeably faster than the serial version of the same script.

Am I on the right track? Should I be approaching this from an entirely different perspective? I don’t profess to be a guru in this area, so if there’s a better way, I’m all ears.

Thanks!

product(addrs, cfg) produces every possible combination of one element of addrs and one element of cfg. If cfg is, for example, a dictionary:

>>> from itertools import product
>>> addrs = ['example.com', 'example.org', 'example.net']
>>> cfg = {'x': 1, 'y': 2, 'z': 3}
>>> list(product(addrs, cfg))
[('example.com', 'x'), ('example.com', 'y'), ('example.com', 'z'), ('example.org', 'x'), ('example.org', 'y'), ('example.org', 'z'), ('example.net', 'x'), ('example.net', 'y'), ('example.net', 'z')]

It iterated over the keys of the dict.

If you want cfg to be a complete object that is the same for each call, you can wrap it in a 1-tuple:

>>> list(product(addrs, (cfg,)))
[('example.com', {'x': 1, 'y': 2, 'z': 3}), ('example.org', {'x': 1, 'y': 2, 'z': 3}), ('example.net', {'x': 1, 'y': 2, 'z': 3})]

Or you can use one of several techniques to “bind” the cfg value to getHostname, and then just use pool.map since there is only one iterable to map. Note that a lambda will not work here – multiprocessing has to pickle the function it sends to the worker processes, and lambdas can’t be pickled – so use functools.partial instead:

from functools import partial

results = pool.map(partial(getHostname, cfg=cfg), addrs)

(This assumes getHostname’s second parameter is named cfg; if not, a small module-level wrapper function works too.)
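
Putting that together, here is a sketch with an explicit worker count – 64 is an arbitrary number, tune it to your network – and imap_unordered with chunksize=1, so one worker stuck on a run of dead hosts doesn’t strand a whole pre-assigned chunk of addresses the way map’s default chunking can:

import multiprocessing as mp
from functools import partial

def scan(addrs, cfg):
    # For I/O-bound probes the worker count, not the CPU count, is the
    # ceiling on concurrency, so size the pool well above core count.
    with mp.Pool(processes=64) as pool:
        # chunksize=1 hands out one address at a time as workers free up.
        return list(pool.imap_unordered(
            partial(getHostname, cfg=cfg), addrs, chunksize=1))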

MOST of your time is spent waiting for SNMP – or, more likely, waiting for responses that will never come. Parallelism can help, but multiprocessing mostly adds overhead here: by default Pool() starts one worker per CPU core, which is the right sizing for CPU-bound work but leaves only a handful of probes in flight at once on a small VM, while each worker sleeps through multi-second timeouts. I would recommend asyncio here if you can use it, as it scales well to huge numbers of parallel requests; but you may simply find that you are being rate-limited by something on your network, and nothing you do will speed it up.
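
A minimal sketch of the asyncio approach, assuming pysnmp’s asyncio flavor of the same high-level API (pysnmp.hlapi.asyncio) and a community string in hand; the semaphore caps how many probes are in flight at once:

import asyncio
from pysnmp.hlapi.asyncio import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

SYSNAME_OID = '1.3.6.1.2.1.1.5.0'   # SNMPv2-MIB::sysName.0

async def get_hostname(engine, sem, ip, community):
    async with sem:                  # cap the number of in-flight probes
        errorIndication, errorStatus, _, varBinds = await getCmd(
            engine,
            CommunityData(community),
            UdpTransportTarget((str(ip), 161), timeout=1, retries=0),
            ContextData(),
            ObjectType(ObjectIdentity(SYSNAME_OID)),
        )
        if errorIndication or errorStatus:
            return (str(ip), None)   # no answer, or SNMP-level error
        return (str(ip), str(varBinds[0][1]))

async def scan(addrs, community, limit=200):
    engine = SnmpEngine()            # one engine shared across all queries
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(
        *(get_hostname(engine, sem, ip, community) for ip in addrs))

# e.g. results = asyncio.run(scan(addrs, cfg['community']))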

Not sure how you are probing, but I saw this on Stack Overflow: How to auto-detect snmp devices using C/C++?
It shows how to use nmap to find out which hosts have the SNMP port open – something like nmap -sU -p 161 --open <subnet>.

I would send a UDP probe to port 161 of each IP using asyncio and watch to see whether it gets a response. You can send to lots of hosts at the same time. Of course, since UDP is not reliable, you would need to retry hosts that do not respond.
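
A sketch of that idea using asyncio’s datagram support. The probe bytes below are a hand-encoded SNMPv1 GET for sysName.0 with community "public" – agents silently drop requests with the wrong community, so if yours differs, build the packet with your SNMP library instead:

import asyncio

# Hand-encoded SNMPv1 GetRequest, community "public"; rebuild (e.g.
# with pysnmp) if your community string is different.
PROBE = bytes.fromhex(
    '302602010004067075626c6963'        # message header + community
    'a019020101020100020100'            # GetRequest PDU header
    '300e300c06082b060102010105000500'  # varbind: sysName.0 = NULL
)

class SnmpProbe(asyncio.DatagramProtocol):
    def __init__(self, responders):
        self.responders = responders

    def datagram_received(self, data, addr):
        self.responders.add(addr[0])    # any reply means an agent answered

async def probe_all(addrs, wait=2.0):
    responders = set()
    loop = asyncio.get_running_loop()
    transport, _ = await loop.create_datagram_endpoint(
        lambda: SnmpProbe(responders), local_addr=('0.0.0.0', 0))
    for ip in addrs:
        transport.sendto(PROBE, (str(ip), 161))
    await asyncio.sleep(wait)           # collect stragglers, then give up
    transport.close()
    return responders                   # retry non-responders before trusting this

# e.g. alive = asyncio.run(probe_all(addrs))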

Thanks everyone – I appreciate the feedback. I decided to take the multiprocessing code out of the script and instead tweaked the timeout on the SNMP polling. That has proven to be a bit more reliable than what I had, and actually a bit more performant.