Configurable error handling of subprocess.Popen universal_newlines / text argument

gitmatters · April 22, 2024, 11:05am

ATM the universal_newlines / text args are pretty much useless, because they do not handle decoding errors and instead throw UnicodeDecodeError when something goes wrong with the default encoding="utf-8". This cannot be intended behaviour: who would want the subprocess to crash because of problems with stdout?
The robust workaround is to do the decoding manually, which comes with an option of error handling:

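import logging
import subprocess

# command_list is the caller's argument list; capture raw bytes so nothing is decoded yet.
process = subprocess.Popen(command_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
# Decode manually, replacing undecodable bytes with U+FFFD instead of raising.
stdout = stdout.decode('utf-8', errors='replace')
stderr = stderr.decode('utf-8', errors='replace')
logging.info(f"ProcStdOut: {stdout}")
logging.error(f"ProcStdErr: {stderr}")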

For me, replacing decoding errors would be the expected default behaviour, which should be used for the universal_newlines/text arg as well. The arg could then be used (not as a boolean but) with different modes, e.g. None (for False), Replace (for robustly replacing errors), Error (for True, throwing an error).

storchaka · April 22, 2024, 12:35pm

Did you try to specify the errors argument?

If you propose to change its default value, the current behavior is more reliable. Errors should not be silently ignored by default.

gitmatters · April 22, 2024, 1:57pm

Yeah, errors is kind of what I was looking for, how did I miss that …

Anyway, I find the default behaviour of throwing errors on logging issues problematic for long-running subprocesses that run for hours and then crash because of encoding. Who would want that?

In general I'm completely with you that errors should not be silent, but in this scenario, where it interrupts the subprocess, I find it of little help. But it could be that I am alone with that opinion, in which case I don't want to cause drama…

kknechtel · April 22, 2024, 5:50pm

If you don’t know how the data encodes text, you cannot properly treat it as text. Full stop.

After all, not all data is meant to be treated as text, at all.

utf-8 encoding is guessed because it’s popular. When a simple guess doesn’t work, there needs to be some other source of information. Or at least, a different guess.

There aren’t “problems with stdout”. There is one problem, which is trying to treat the data as if it were encoded in UTF-8, even though it isn’t.

The subprocess doesn’t crash. Something in the Python code raises an uncaught exception. Whatever other program you started up, doesn’t care whether your Python code is correctly interpreting its output data. It doesn’t even know your Python code exists. (It might know that it isn’t outputting directly to a terminal, but that’s about it.)
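A rough sketch of that distinction, assuming a made-up child command (`child_code` is only an illustration, not anything from the original report): the child exits normally, and the failure only shows up when the Python side tries to decode the bytes.

```python
import subprocess
import sys

# Hypothetical child: writes the Latin-1 bytes for "café" plus a newline to stdout.
child_code = "import sys; sys.stdout.buffer.write(b'caf\\xe9\\n')"

# No text/encoding arguments are given, so stdout comes back as raw bytes.
proc = subprocess.run([sys.executable, "-c", child_code], capture_output=True)

child_exit_code = proc.returncode  # the child itself finished without error
raw_output = proc.stdout

# The exception comes from the decoding step in our own code, not from the child.
try:
    raw_output.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(child_exit_code, raw_output, decode_failed)
```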

It’s intended behaviour in that raising an exception is far better than getting wrong input and having no way to know it’s wrong.

If replacing erroneous UTF-8 sequences gives you “good enough” input, that’s your decision to make. It doesn’t need to be this “manual”, because subprocess.Popen has encoding and errors parameters which are forwarded as needed to the encoding steps. So do all the other tools like subprocess.call which share all the common “frequently used arguments” described in the documentation.
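For illustration, a minimal sketch of passing those parameters directly, so the manual decode step goes away. The child command here is a stand-in I made up to produce one invalid byte; substitute your own argument list.

```python
import subprocess
import sys

# Hypothetical child that emits one byte (0xE9) that is not valid UTF-8 here.
child_code = "import sys; sys.stdout.buffer.write(b'caf\\xe9\\n')"

process = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    encoding="utf-8",   # passing encoding (or errors) switches the pipes to text mode
    errors="replace",   # undecodable bytes become U+FFFD instead of raising
)
stdout, stderr = process.communicate()

print(repr(stdout))
```

The same two keyword arguments should work with subprocess.run, subprocess.call and subprocess.check_output, since they are forwarded to Popen.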

I’m moving this to the Help section because it appears to be proposing functionality that already exists.

The problem is: if you are running into errors, it is much more likely that you have the wrong encoding, rather than that there is some random garbage byte in the middle of normal text. There are a lot of “code page” (single-byte) encodings out there that will look perfectly normal in UTF-8 as long as you stick to ASCII, and break constantly as soon as there’s anything else. For example, Latin-1 is defined so that each byte value 0…255 corresponds to the first 256 Unicode code points (and nothing else can be represented). If you try to interpret this as UTF-8, bytes with values A0…BF that perfectly well correspond to ordinary characters (as well as 80…9F that correspond to valid control characters) will be seen as invalid continuation bytes. All of the characters ÀÁõö÷øùúûüýþÿ will be seen as invalid bytes that should never appear anywhere in UTF-8 data. A bunch more (C2…F4) will be seen as the leading bytes of some multi-byte sequence, but the following byte will probably not be a valid continuation byte. If it is, that’s even worse: the sequence will be incorrectly interpreted as a single character. If, for example, your company registered a trademark that ends with É, you could be in for a nasty surprise.
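A small sketch of both failure modes, using plain bytes.decode and no subprocess at all. The sample strings are my own examples: "café" for the loud failure, and É followed by ® for the silent one.

```python
# "café" encoded as Latin-1: the é becomes the single byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Strict UTF-8 decoding rejects it: 0xE9 announces a multi-byte sequence
# that never arrives.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" hides the failure, but the é is lost for good.
replaced = latin1_bytes.decode("utf-8", errors="replace")

# The worse case: É (0xC9) followed by ® (0xAE) happens to form a valid
# two-byte UTF-8 sequence, so it decodes without any error to one wrong character.
trademark_bytes = "\u00c9\u00ae".encode("latin-1")
misread = trademark_bytes.decode("utf-8")
correct = trademark_bytes.decode("latin-1")

print(strict_ok, repr(replaced), repr(misread), repr(correct))
```

Decoding with the encoding the data was actually written in ("latin-1" in this sketch) recovers the text exactly, which is why knowing the real encoding beats picking an error handler.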