Sporadic hang in subprocess.run

Hi all!

I am having (very) intermittent problems with calls to subprocess.run() hanging using Python 3.7.3 on Linux.

I have a Python script which, roughly speaking, does this:

import subprocess
import threading

def thread_body():
    subprocess.run(["long_running_process"], close_fds=False)

threading.Thread(target=thread_body).start()

while True:
    subprocess.run(["short_running_process"], capture_output=True, timeout=1)

I find that occasionally the run() of short_running_process never returns. When it happens, short_running_process appears to be a zombie, and the Python script is stuck blocking on the read of the errpipe in subprocess's child-startup code. So I guess something is going wrong such that the errpipe is never written to (or closed), and subprocess ends up waiting forever.

While it would be nice to work out why this happens, it also strikes me that it might be better if this errpipe were non-blocking, so that we could read from it while respecting the timeout. Does that make any sense?
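
To illustrate the idea (just a sketch on my part, not what CPython actually does internally, and read_errpipe_with_timeout is a made-up name), the parent could wait on the pipe with select() before reading, so the read honours a deadline instead of blocking forever:

import os
import select

def read_errpipe_with_timeout(fd, timeout):
    # Wait until the fd becomes readable, but give up after `timeout` seconds.
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        raise TimeoutError("nothing arrived on the pipe within %s seconds" % timeout)
    return os.read(fd, 50000)  # b"" means EOF, i.e. the child exec'd successfully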

Any thoughts would be most welcome.

1 Like

Are you able to reproduce the issue on any newer version of Python? A fix may have been added later in the 3.7 series, or may well be there in the current 3.11.

Hmm. I don’t know if this is necessarily a problem per se, but it’s overkill - spawning a thread which just blocks on another process. You could, instead, just start the process from the main thread, and then work directly with it as a non-blocking process.

Does the actual script do other work in this thread, and if so, can it be reworked as a “process has now terminated” event of some sort? And if you CAN simplify this down, does that solve your problems with the other processes?

Hi Steve,

Are you able to reproduce the issue on any newer version of Python? A fix may have been added later in the 3.7 series, or may well be there in the current 3.11.

As yet the only place I can reproduce the bug is on an embedded system (on which changing the Python version is difficult) running a torture test for many hours. I’m trying to find an easier reproduction recipe but it really doesn’t happen very often.

I backported the fix for this bug, which didn’t help. I think that case is slightly different, as with my bug there is no shell or sub-sub process involved.

Hi Chris,

Does the actual script do other work in this thread, and if so, can it be reworked as a “process has now terminated” event of some sort? And if you CAN simplify this down, does that solve your problems with the other processes?

It doesn’t do much; to flesh it out a little, it’s really more like

def thread_body():
    while True:
        subprocess.run(["long_running_process"], close_fds=False)
        do_some_stuff()

Can you elaborate a little on how I could have some code triggered when long_running_process terminates?

Thanks!

Oh, it’s actually a looped thing. That’s a little harder to transform, although definitely not impossible. Depending on how quickly you need it to respond, and how frequently your other loop iterates (one second in your example), you could do something like this:

import subprocess

def spawn_longrunner():  # give it a better name based on what it actually does
    global longrunner
    longrunner = subprocess.Popen(["long_running_process"], close_fds=False)

spawn_longrunner()
while True:
    subprocess.run(["short_running_process"], capture_output=True, timeout=1)
    if longrunner.poll() is not None:  # poll() updates returncode without blocking
        do_some_stuff()
        spawn_longrunner()

In effect, what this does is: Every time you finish one of the short-running processes, check if the long-running one has finished; if so, do the subsequent work, and then restart. (I’m assuming here that do_some_stuff() is relatively fast and doesn’t itself need to be parallelized against the short-running processes; if that’s not the case, you definitely want threads or something here.)

That might not be suitable, though. Trouble is, to get a more effective event-driven system, you would need to do a bigger transformation of your code. Here’s how you could do it with asyncio. There’s a lot more code here because I’ve gone for fully-runnable rather than any stubs; hopefully that’s useful.

import asyncio

# Be my own short-running process
import sys
if "subproc" in sys.argv:
	print("Hi, here's some output")
	import time
	time.sleep(0.5)
	print("Here's some more")
	# Uncomment to see the timeout in force
	#time.sleep(1.5)
	#print("I'll have timed out before this")
	sys.exit()
# End subprocess code, now back to the main

def do_some_stuff():
	print("Doing some stuff!")

async def long_running_processes():
	while True:
		global proc; proc = await asyncio.create_subprocess_exec("sleep", "10", close_fds=False)
		await proc.wait() # Wait for termination
		do_some_stuff()

async def main():
	thread = asyncio.create_task(long_running_processes())
	while True:
		proc = await asyncio.create_subprocess_exec("python3", "aioparallel.py", "subproc", stdout=asyncio.subprocess.PIPE)
		try:
			out, err = await asyncio.wait_for(proc.communicate(), timeout=1)
		except asyncio.TimeoutError:
			print("Stopped the subprocess after one second")
			proc.kill()
		else:
			print("Got %d lines of output" % len(out.decode().strip().split("\n")))

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
	loop.run_until_complete(main())
finally:
	proc.kill()

The key distinction here is that, instead of threads, we have tasks. Now, it’s entirely possible that this has NOTHING WHATSOEVER to do with your problem, and it’s all been a waste of time; but I have known weird things to happen with threads and subprocesses being mixed on different platforms. (And yes, for once that isn’t a euphemism for “on anything other than Linux”; in fact, subprocess issues happen on basically every platform, but they’re different issues. Isn’t cross-platform coding fun?)

As a side note, this handles the timeout directly, since there’s no subprocess.run() involved. So you have the flexibility to do whatever you wish. I’ve written it to be broadly equivalent to run()'s timeout behaviour (kill the process and then raise rather than returning output) but I don’t know what your actual requirements are here.
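
For instance, if you do want run()-style semantics, a wrapper along these lines would do it (just a sketch assuming you want the child killed, reaped, and TimeoutExpired raised; run_with_timeout is a made-up helper, not anything from the stdlib):

import asyncio
import subprocess

async def run_with_timeout(*argv, timeout):
    # Kill the child on timeout, reap it so it doesn't linger as a zombie,
    # then raise TimeoutExpired much as subprocess.run() would.
    proc = await asyncio.create_subprocess_exec(*argv, stdout=asyncio.subprocess.PIPE)
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise subprocess.TimeoutExpired(argv, timeout)
    return out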

3 Likes

Yes, this is a posix subprocess enhancement I think we should make. I was staring at that line just last week thinking it looked gross.

The intent was that the child is “supposed” to close that pipe (via its close-on-exec flag, once the exec succeeds) to indicate no error, or to report an error rapidly, so the read would never hang; that's nearly always true. But even without knowing specifically why you're observing this, the reality is that the read technically could take time or hang, since other things on the system can interfere with processes and system calls. Other threads could even have bugs of their own and mess up the errpipe_read file descriptor.
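
Roughly, that handshake looks like this (a simplified Python sketch of the pattern, not the actual _posixsubprocess C code; "some_command" is just a placeholder):

import os

errpipe_read, errpipe_write = os.pipe2(os.O_CLOEXEC)
pid = os.fork()
if pid == 0:  # child
    os.close(errpipe_read)
    try:
        os.execvp("some_command", ["some_command"])
    except OSError as exc:
        # exec failed: report the errno to the parent before dying
        os.write(errpipe_write, str(exc.errno).encode())
    os._exit(255)
else:  # parent
    os.close(errpipe_write)
    # This read blocks until the child either writes an error or execs
    # successfully, which closes the close-on-exec write end and gives us EOF.
    data = os.read(errpipe_read, 50000)
    os.close(errpipe_read)
    if data:
        print("exec failed with errno", int(data))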

2 Likes

I filed python/cpython issue #103911, "posix subprocess should support timeout when waiting on the errpipe_read pipe", to track supporting a timeout on that read.

2 Likes

Hi Gregory,

Sounds great, thanks for making the github issue! All being well I can submit a PR for that soon.

Hi Chris,

Thank you so much for taking the time to write such a detailed suggestion! I will take a look and see if it helps with my problem.

No probs! If it helps, awesome. If it doesn’t, well, at least we’ve ruled something out.

I tried adjusting our code to use asyncio, as you suggested, and it seems to have fixed our problems! :smile:

So I guess there is something fishy going on with threads and subprocess (on Linux, at least).

Thank you so much again for your help!

Ah, cool!

Which means: Now you have a mystery waiting to be solved! :slight_smile: Completely up to you how much time you want to sink into it.

But even if you don’t, hey, we can count this as a win as it is.

My pleasure! Always happy to help people who have interesting problems and are willing to document them thoroughly.

1 Like