Allowing pass_fds to be a mapping to pass arbitrary file descriptors to Popen()

It is currently possible to pass specific file descriptors to a subprocess using the pass_fds parameter of subprocess.Popen(). That parameter is a sequence of file descriptors that will be kept open in the child, in addition to 0, 1, and 2.

However, it is not possible to map those file descriptors to other numbers before starting the subprocess, as is possible in the shell (e.g. cmd 7> output.log 4>&3). The fds keep whatever numbers they currently have.

It should be easy to change pass_fds to accept a mapping in addition to a sequence. Another, more compatible option would be to introduce a separate parameter that would be mutually exclusive with pass_fds.

Use case

When a shell script grows in complexity, it is often desirable to rewrite it in Python. This gives easier error handling, support for lists and dictionaries, and more portability. This translation is currently easy and can be done in an iterative fashion (script by script), as Popen() supports everything the shell does, including passing files or pipes from other processes as stdin, stdout, and stderr.

# printf 'Hello world\n' | tr a-z A-Z
from subprocess import Popen, PIPE

p1 = Popen(['printf', 'Hello world\\n'], stdout=PIPE)
p2 = Popen(['tr', 'a-z', 'A-Z'], stdin=p1.stdout)
p1.stdout.close()  # let p1 receive SIGPIPE if p2 exits first

The one blind spot I have found is passing specific file descriptors to a command. Sometimes a command has a parameter or environment variable to specify which file descriptor to use, but often this is not the case, especially with shell scripts. The only options are to call os.dup2() beforehand and hope the target file descriptor isn’t already in use, or to do it in preexec_fn, which is even more error-prone.

# cmd 3> output.log
log = open('output.log', 'w')
p1 = Popen(['cmd'], pass_fds=[log.fileno()])  # might be 3, might be another fd...
p1 = Popen(['cmd'], pass_fds={3: log.fileno()})  # proposed
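
For illustration, the os.dup2() workaround mentioned above might look like this (a sketch only; it assumes fd 3 is free in the parent process, which is exactly the fragile part):

# cmd 3> output.log, with today's tools
import os
from subprocess import Popen

log = open('output.log', 'w')
os.dup2(log.fileno(), 3)           # silently clobbers fd 3 if it was in use
p1 = Popen(['cmd'], pass_fds=[3])  # pass_fds makes fd 3 inheritable in the child
os.close(3)                        # drop the parent's extra copy afterwards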

This was already mentioned by ferringb in the ticket for the pass_fds feature: add pass_fds paramter to subprocess.Popen() · Issue #50808 · python/cpython

You want the redirections etc. to be done in the newly forked process, before it execs the new program?

Which I think is what Unix shells do.

Since a dictionary is a sequence, this would be backward incompatible. Maybe it’d be better to have a separate mapping of “assigned file descriptors”, where you map an integer (subprocess’s fd) to an open file?

Does subprocess use fork and exec? I see the code uses fork_exec as a single call. I’m not sure where that code lives; a C module, maybe. There is also mention of using posix_spawn.

If you cannot run code between the fork and the exec, could this even be implemented?

Yeah, it’s in Modules/_posixsubprocess.c - most of the work is in other functions though, so child_exec is probably where you’d want to go looking.

Since a dictionary is a sequence, this would be backward incompatible.

This is my thought as well; it’s possible that existing code would be broken by such a change. The current pass_fds is run through set() without type-checking, so code that passes {1: 2} today works and passes fd 1, but would change to passing fd 2 (as fd 1 in the subprocess). It all depends on whether we consider such code broken; if not, then introducing a separate parameter to Popen() is more prudent.
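
The key/value behaviour is easy to see in isolation, since iterating a dict yields only its keys:

>>> set({1: 2})  # the value is silently ignored today
{1}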

where you map an integer (subprocess’s fd) to an open file?

The list could include open files (i.e. objects with a fileno()); today’s pass_fds doesn’t accept those, so that seems like an orthogonal discussion.

Actually that raises a curious point. After being run through set(), the fds are mapped through int, sorted, and retained as a tuple. This can result in peculiar behaviour:

>>> proc = subprocess.Popen(["sleep", "3600"], pass_fds=(1, '1', b'1', 2)); print(proc.pid)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/subprocess.py", line 1024, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.12/subprocess.py", line 1850, in _execute_child
    self.pid = _fork_exec(
               ^^^^^^^^^^^
ValueError: bad value(s) in fds_to_keep
>>> proc = subprocess.Popen(["sleep", "3600"], pass_fds=(1, 1, 1, 2)); print(proc.pid)
1130362
>>> proc = subprocess.Popen(["sleep", "3600"], pass_fds=(1, 1, 1, '2')); print(proc.pid)
1130367

It’s okay to duplicate FDs, it’s okay to mix types (honestly, I’ve no idea when anyone would, but sure, you can have some integers and some strings if you want to), but it’s not okay to have duplicates with different representations.
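
For reference, a rough Python model of how that error comes about (the real sanity check lives in Modules/_posixsubprocess.c and requires the resulting tuple to be strictly increasing):

>>> fds = (1, '1', b'1', 2)
>>> sorted(map(int, set(fds)))  # set() keeps 1, '1' and b'1' as distinct members
[1, 1, 1, 2]

The duplicates produced by int() violate the strictly-increasing requirement, hence “bad value(s) in fds_to_keep”.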

You’re right that this is orthogonal, though. But having a basic idea of a mapping, not just a sequence (whether that means a separate parameter or whether it can be done on the same one), opens up a number of possibilities, such as having bytestring values in the mapping become automatic pipes. That’d be pretty handy (I’ve made use of this in other languages, where it’s a feature).
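
For instance (purely hypothetical semantics; nothing like this exists today):

# Hypothetical: a bytes value becomes a pipe pre-filled with that data,
# roughly a here-string: cmd 3<<< 'hello'
p = Popen(['cmd'], pass_fds={3: b'hello\n'})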

fork() and exec() are used, and there is already a lot of logic between the two, as well as warnings about the care needed to avoid deadlocks.

So moving an FD to a new number could be done. That leaves the question of what the API should be for providing this information.

I think using a dict {new_fd: current_fd} would not be controversial; the choice is between:

  1. re-using the argument pass_fds, which could be a change of behavior in (fairly broken) code, or
  2. introducing a new argument, named something like map_fds or with_fds, which would be mutually exclusive with pass_fds (and with close_fds; those two are already exclusive).

My preference would be option 1, re-using pass_fds, to keep the Popen interface as simple as possible, since passing a dict to a (low-level) interface that expects “a sequence” is almost certainly a bug.
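
To make the two options concrete (both hypothetical at this point; map_fds is only a proposed name):

log = open('output.log', 'w')

# Option 1: pass_fds itself accepts a {new_fd: current_fd} mapping
p = Popen(['cmd'], pass_fds={3: log.fileno()})

# Option 2: a new, mutually exclusive argument
p = Popen(['cmd'], map_fds={3: log.fileno()})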

I’d dispute this in general - a dict iterates as its keys, its length is the number of keys in it, and it retains order, so it should be perfectly reasonable as a sequence. (Incidentally, I believe the current implementation merely requires that it be an iterable, not even a sequence.)

It’s a bit iffy in the current situation. An older Python that expects a sequence, when given a dict, would retain the target FDs as they currently are, without remapping them. However, this would silently do the wrong thing if you attempt a remap, which is the entire purpose of using the dictionary. I’m on the fence. Ultimately, it would be tidiest to have the API be just “pass_fds is a dictionary, or a sequence that implicitly means {x: x, y: y, z: z}”; but the silent wrong behaviour isn’t good.
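
A sketch of that equivalence (a hypothetical helper, not actual subprocess code):

def _normalize_pass_fds(pass_fds):
    # A mapping is taken as {new_fd: current_fd}; a plain sequence
    # [x, y, z] is treated as the identity mapping {x: x, y: y, z: z}.
    if isinstance(pass_fds, dict):
        return {int(new): int(cur) for new, cur in pass_fds.items()}
    return {int(fd): int(fd) for fd in pass_fds}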


That’s an excellent point. Unless we want to go through a transition period where dicts are rejected, there is a real risk that code using the new feature would silently do something else on older Python versions.

I must admit I hadn’t considered the forward-compatibility issues. The fact that it’s silent is particularly bad.

Keep in mind that executing code between fork() and exec() is fraught with issues. You can’t just execute Python code without risking deadlocks, and even touching Python objects or allocating memory is problematic. I think dup2() itself is safe, but you can’t just go poking at a Python dict in the child process (because it means executing arbitrary Python code). The implementation would have to make sure to turn the dict into a fork-safe structure before the fork.
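
A minimal sketch of that preparation step, assuming the {new_fd: current_fd} mapping discussed above (names hypothetical):

# Flatten the mapping into a plain tuple of ints *before* the fork, so the
# child never touches a Python dict; after fork() the child then only needs
# async-signal-safe dup2() calls.
fd_moves = tuple(sorted(
    (int(new_fd), int(current_fd))
    for new_fd, current_fd in pass_fds.items()
))
# Conceptually, in the child between fork() and exec():
#     for new_fd, current_fd in fd_moves:
#         os.dup2(current_fd, new_fd)
# A real implementation would also need to handle cycles, e.g. {3: 4, 4: 3}.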

Can you explain why Python code can deadlock after a fork?

Also, why would memory allocation be an issue? It is not an issue in the C runtime.

It seems that there is some traction, and no one seems to object to the idea. I should be able to propose an implementation as well.

Is the next step for me to open a GitHub issue? Or should bike-shedding of the new argument name happen here? (I propose map_fds)