Best approach: Multi-process image downloader, one SFTP connection

I’ve got a Python script that connects to a bunch of webcams and downloads still images for a legacy system. The script spawns a process for each camera, pulls the stills using the requests module, and stuffs each one into an in-memory IO object. It then opens a file on a remote SFTP server with the paramiko module and writes the IO object out to the remote file. This happens repeatedly, as fast as the system can handle it.

I can get it to work, but not efficiently. I’m back to the drawing board and am wondering what general approach I should take. The real kicker is that I am trying to limit myself to a single SFTP connection and then let each process send / upload files in parallel over that one connection / login.

I tried passing a single paramiko SFTP object from the main process to each subordinate process, but that didn’t seem to work right. I also tried to create a dedicated SFTP process and pass the image data to it via queues, but the files are large enough that the queue simply got backed up and grew and grew.

I’m looking for a broad or general direction on how to get the SFTP working with a shared connection. When I give each process its own login, the firewall thinks it’s attack traffic and blocks the link preemptively.

Thanks!

Do you need to use multiple processes? Python threads run in parallel when you’re blocked on IO, since blocking calls release the GIL.

If you do have to use multiprocessing, could you have requests dump the response body into shared memory, then simply pass pointers to that shared memory on the queue with the SFTP sender as the single consumer?
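
Untested sketch of what I mean (needs Python 3.8+ for multiprocessing.shared_memory; the function names, camera URL and remote path are placeholders):

    from multiprocessing.shared_memory import SharedMemory

    import requests

    def fetch_worker(cam_url, q):
        # Fetch stills and park each one in its own shared memory segment.
        while True:
            body = requests.get(cam_url, timeout=10).content
            shm = SharedMemory(create=True, size=len(body))
            shm.buf[:len(body)] = body
            # Only the segment name and length go on the queue,
            # not the image data itself.
            q.put((shm.name, len(body)))
            shm.close()

    def sftp_sender(q, sftp):
        # Single consumer: attach to each segment by name and upload it.
        while True:
            name, size = q.get()
            shm = SharedMemory(name=name)
            with sftp.open('/remote/%s.jpg' % name, 'wb') as remote:
                remote.write(bytes(shm.buf[:size]))
            shm.close()
            shm.unlink()    # release the segment once it has been uploaded

Wire those up with one Process per camera, a multiprocessing.Queue, and a single sender Process holding the lone paramiko SFTP login. Giving the Queue a maxsize would also give you backpressure, so the fetchers block briefly instead of the backlog growing without bound.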

The idea behind multiprocessing, at least in my mind, was to prevent one laggy webcam from preventing another webcam from being processed. Right or wrong, that was my thinking.

Your latter suggestion was exactly what I did when I mentioned having a dedicated SFTP process. The problem was that my queue was growing almost exponentially and the SFTP sender simply wasn’t able to keep up. Note, however, that I’m no multiprocessing guru when it comes to Python.

> The idea behind multiprocessing, at least in my mind, was to prevent
> one laggy webcam from preventing another webcam from being processed.
> Right or wrong, that was my thinking.

Multiprocessing does this with multiple subprocesses. Pretty heavyweight
for what is a single http fetch.

I’d use threads instead. The API is similar but they’re far far more
lightweight.
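
Something along these lines, untested; the camera URLs, SFTP host and
credentials are placeholders:

    import queue
    import threading

    import paramiko
    import requests

    CAMERAS = ['http://cam1/still.jpg', 'http://cam2/still.jpg']
    work = queue.Queue(maxsize=20)  # bounded: slow uploads throttle fetching

    def fetch_loop(url):
        # A laggy or broken camera only stalls its own thread.
        while True:
            try:
                body = requests.get(url, timeout=10).content
            except requests.RequestException:
                continue
            work.put((url.split('/')[2] + '.jpg', body))   # e.g. cam1.jpg

    def upload_loop(sftp):
        # Single consumer of the queue: one SFTP login, reused per file.
        while True:
            name, body = work.get()
            with sftp.open('/remote/' + name, 'wb') as remote:
                remote.write(body)

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('sftp.example.com', username='user', password='secret')
    threading.Thread(target=upload_loop, args=(ssh.open_sftp(),)).start()
    for url in CAMERAS:
        threading.Thread(target=fetch_loop, args=(url,)).start()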

Anyway, it seems like sftp is your bottleneck.

> Your latter suggestion was exactly what I did when I mentioned having a
> dedicated SFTP process. The problem was that my queue was growing
> almost exponentially and the SFTP sender simply wasn’t able to keep up.
> Note, however, that I’m no multiprocessing guru when it comes to
> Python.

AFAIK SFTP is a single-threaded protocol. I didn’t think you could run
parallel data streams through a single connection.

Is there any reason to run only one sftp connection?

If there’s a capacity issue, presumably that capacity is >1. You could
keep a Semaphore (or, higher level, a thread pool or subprocess pool) to
run multiple sftp connections, capped at some small limit, e.g. 2 or 4.
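
As a sketch (untested; the host and credentials are placeholders), a
small fixed set of uploader threads, each holding its own login, caps
you at that limit:

    import queue
    import threading

    import paramiko

    UPLOADERS = 2                  # at most this many SFTP logins at once
    work = queue.Queue(maxsize=20)

    def uploader():
        ssh = paramiko.SSHClient()
        ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        ssh.connect('sftp.example.com', username='user', password='secret')
        sftp = ssh.open_sftp()     # one login per worker, reused per file
        while True:
            remote_path, body = work.get()
            with sftp.open(remote_path, 'wb') as remote:
                remote.write(body)

    for _ in range(UPLOADERS):
        threading.Thread(target=uploader).start()

    # The fetchers just do: work.put(('/remote/cam1.jpg', image_bytes))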

Since your image fetching stuff seems to work and outpace your sftp
worker, can you get timings from it? Likely there will be some latency
around making a connection, and around setting up/tearing down the file
transfer itself.

Are you reusing your sftp connection, or making a fresh one every time?
That’s expensive (timewise).

By running a few in parallel you can at least get these latencies to
happen in parallel.
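
A crude way to get those numbers (untested; the host, credentials and
remote path are placeholders):

    import time

    import paramiko

    t0 = time.monotonic()
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('sftp.example.com', username='user', password='secret')
    sftp = ssh.open_sftp()
    print('connect + login: %.3fs' % (time.monotonic() - t0))

    data = b'x' * 1_000_000        # stand-in for a typical still image
    for n in range(5):
        t1 = time.monotonic()
        with sftp.open('/tmp/timing-test-%d.jpg' % n, 'wb') as remote:
            remote.write(data)
        print('transfer %d: %.3fs' % (n, time.monotonic() - t1))

If the connect figure dominates, reusing one connection (or a small pool
of them) gets you most of the win.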

Cheers,
Cameron Simpson cs@cskk.id.au