Choosing correct encoding for subprocess.Popen()

I’m working on a Python wrapper around a third-party command-line tool and need to exchange data with it via stdin/stdout. So I use subprocess.Popen() to start the process and then write()/readline() to send data and retrieve results. Here is the simplified code:

import subprocess
 
command = ['/path/to/executable', 'arg1', 'arg2', 'arg3']
instance = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL,
                            universal_newlines=True)
 
# pass data and command to the tool
instance.stdin.write('/path/to/data\n-command\n')
instance.stdin.flush()
 
# get results
result = instance.stdout.readlines()

Unfortunately, this does not work in a Windows environment (Python 3.7.0 and Python 3.8, cmd.exe) if the file path contains non-ASCII characters; the error is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 53-59: character maps to <undefined>

As I understand it, this is because I’m using universal_newlines=True, so Python tries to encode the process input (and decode its output) using locale.getpreferredencoding(False), which is not UTF-8 on Windows. When the file path contains non-ASCII characters that are not available in the current system encoding (e.g. Cyrillic on a German system), this error is raised.

So I tried to remove universal_newlines=True and encode/decode the process input and output manually, but I cannot find a cross-platform way to find out which encoding to use. I would be grateful if you could point me in the right direction or provide a small example.

To clarify: by “if file path contains non-ASCII characters” you mean that the string '/path/to/data\n-command\n' contains characters that can’t be encoded, right?

In that case, you’re essentially asking what the canonical encoding is when communicating with a third-party tool—and there is none. Since input/output are fundamentally all bytes, the encoding to use is entirely between the two processes. There are some general guidelines you can follow (UTF-8 for POSIX and UTF-16 on Windows are common), but ultimately you’ll need to refer to either documentation or implementation of the tool to be entirely sure.
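For instance, once you’ve determined (from documentation or experiment) what the tool expects, one option is to drop text mode entirely and handle the bytes yourself. A minimal sketch, assuming UTF-8 and using sys.executable as a stand-in for the real tool:

```python
import subprocess
import sys

# sys.executable stands in for the hypothetical third-party tool; the
# one-liner echoes raw stdin bytes back to stdout, so the child never
# decodes anything itself.
command = [sys.executable, '-c',
           'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']
instance = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)

# Encode manually with the encoding the tool is assumed to expect (UTF-8 here)
raw, _ = instance.communicate('/path/to/данные\n-command\n'.encode('utf-8'))
print(raw.decode('utf-8'), end='')
```

Without universal_newlines=True or encoding, the pipes carry bytes, so no implicit 'charmap' encoding ever happens in the parent.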

Popen has an encoding parameter that can be used instead of text=True or universal_newlines=True. On Windows, Python supports two locale-dependent encodings: “mbcs” (alias “ansi”) is the process ANSI codepage, and “oem” is the process OEM codepage.
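A sketch of the encoding parameter in use, again with sys.executable standing in for the real tool and 'utf-8' as an assumed encoding (you’d substitute 'mbcs', 'oem', or whatever the tool actually expects):

```python
import subprocess
import sys

# Child echoes raw stdin bytes to stdout, standing in for the real tool
command = [sys.executable, '-c',
           'import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())']
instance = subprocess.Popen(command,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL,
                            encoding='utf-8')  # or 'mbcs'/'oem' on Windows
out, _ = instance.communicate('/path/to/данные\n-command\n')
print(out, end='')
```

Passing encoding enables text mode just like universal_newlines=True, but with an explicit codec instead of locale.getpreferredencoding(False).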

(The process uses the system ANSI and OEM codepages unless they are overridden to UTF-8 in the application manifest. In turn, the system ANSI and OEM codepages are based on the system locale by default, but they can be overridden in the registry – even to UTF-8 in Windows 10.)

We already know that the process ANSI codepage can’t encode the file path, so check whether the program uses the console encoding (discussed below) or UTF-8, or has a setting to force it to use UTF-8. If not, you may be at an impasse unless you can change the ANSI codepage to UTF-8 (Windows 10 only). You can work around such a problem by giving the data an ASCII-only path via a bind mountpoint (i.e. either a junction or a subst drive), a symlink, or a hardlink.
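The link-based workaround looks roughly like this on the Python side – a sketch assuming the filesystem and privileges allow symlinks (on Windows, creating one may require administrator rights or developer mode; a hardlink or junction avoids that):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # A data file whose name the ANSI codepage can't encode (illustrative)
    data = os.path.join(tmp, 'данные.txt')
    with open(data, 'w', encoding='utf-8') as f:
        f.write('payload')

    # ASCII-only alias: pass this path to the tool instead of the original
    alias = os.path.join(tmp, 'data.txt')
    os.symlink(data, alias)

    # The tool would see the same bytes when opening the alias
    with open(alias, encoding='utf-8') as f:
        content = f.read()
print(content)
```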

this does not work in Windows environment (Python 3.7.0 and Python 3.8, cmd.exe)

The shell (cmd, pwsh, bash, etc) isn’t relevant to the encoding the child process expects, but the console (conhost) might be if the current process is attached to the same console as the child process.

Each process that attaches to a console shares the console’s active input and output codepages – i.e. the console doesn’t use a separate set of active codepages for each attached process. In Python, the console’s active input codepage is os.device_encoding(0), and the active output codepage is os.device_encoding(1). They both default to the same value, which defaults to the system OEM codepage unless modified in the registry. However, they can be changed independently of each other via WINAPI SetConsoleCP and SetConsoleOutputCP. Python has no builtin function to change the device encoding, but it’s easy enough to implement via ctypes. For example:

import ctypes

kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

# Set the active input codepage to Windows-1251
if not kernel32.SetConsoleCP(1251):
    raise ctypes.WinError(ctypes.get_last_error())
# Set the active output codepage to UTF-8
if not kernel32.SetConsoleOutputCP(65001):
    raise ctypes.WinError(ctypes.get_last_error())

>>> import os
>>> os.device_encoding(0)
'cp1251'
>>> os.device_encoding(1)
'cp65001'

cmd, for example, uses the active console output codepage when translating to and from its internal UTF-16 text encoding. That said, not many applications follow this pattern. Using a console codepage as a preferred encoding isn’t documented as a suggested practice, plus it’s limited to Windows and just plain weird. To compare with Unix, it’s as if there were a way to query the active encoding used by a terminal emulator and a program chose to use that instead of the LC_CTYPE locale category.
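If you still want to take the console codepage into account when it’s available, a small heuristic helper can be sketched like this – a sketch, not a recommendation, since as noted few programs actually honor the console codepages:

```python
import locale
import os

def guess_child_encoding(fd=1):
    # Prefer the console's active codepage when the stream is attached to a
    # console/tty (os.device_encoding returns None for pipes and files);
    # otherwise fall back to the locale's preferred encoding. This is only a
    # heuristic: the reliable answer comes from the tool's own documentation
    # or implementation.
    return os.device_encoding(fd) or locale.getpreferredencoding(False)

print(guess_child_encoding())
```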