Command output capture and character encoding issue

So I have this script that dynamically generates a configuration file for MRTG, the Multi Router Traffic Grapher. The script begins by scanning a predefined list of IPs on the network for hosts that respond to SNMP. It then issues an SNMP GET query for the hostname of each responding system. Once it gets an answer, that hostname is passed along to MRTG’s cfgmaker utility; if the answer is empty or invalid, the host is dropped from consideration.

The script captures the stdout of the cfgmaker command and writes it to disk. This works as expected except in one instance: a specific host (a Cisco router) makes cfgmaker emit content that trips Python up when it is written to disk. The error is:

 'utf-8' codec can't decode bytes in position 145484-145485: invalid continuation byte

I’m trying to handle the content as strings. I’ve tried capturing the content as bytes and writing it that way, but then I run into other technical issues, particularly because I have to prepend all the cfgmaker content with global MRTG parameters. No matter how I slice it, this one host causes me trouble of one kind or another.
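For context, the capture boils down to something like this (a simplified sketch; the host, community string, path and global parameters are placeholders - the real script builds them dynamically):

import subprocess

# Placeholder globals - the real values come from elsewhere in the script.
GLOBAL_PARAMS = 'WorkDir: /var/www/mrtg\nOptions[_]: growright\n\n'

result = subprocess.run(
    ['cfgmaker', 'public@192.0.2.1'],
    capture_output=True,
    text=True,   # asks subprocess to decode stdout to str; the decode error surfaces here
)

with open('mrtg.cfg', 'w') as f:
    f.write(GLOBAL_PARAMS)    # global MRTG parameters have to come first
    f.write(result.stdout)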

My workaround, I thought, was to essentially “sanitize” the string of invalid UTF-8 characters with something like this:

text = bytes(text, 'utf-8').decode('utf-8', 'ignore')

But unfortunately that does not work. And for the record, I’ll admit I don’t know much about continuation bytes. Could the output be UTF-16?

Do any of you wiser folk have some good technical advice for me? This one is kicking my rear.

If the bytestring isn’t UTF-8, it might be ISO-8859-1 (Latin-1) or cp1252, especially if you’re dealing with a US company.

You could try decoding as UTF-8 and, if that fails, fall back to ISO-8859-1 (Latin-1) or cp1252, as sketched below.
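Something like this, where raw_output stands in for whatever bytes you captured:

def decode_with_fallback(raw_output):
    # Try UTF-8 first, then the common Windows/Latin fallbacks.
    for encoding in ('utf-8', 'cp1252', 'latin-1'):
        try:
            return raw_output.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Never reached: latin-1 assigns a character to every byte value,
    # so the last attempt above always succeeds.

cp1252 is tried before latin-1 because it maps the 0x80-0x9F range to printable characters (curly quotes, dashes) rather than control codes, which is usually what legacy Windows data means by those bytes.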

This:

text = bytes(text, 'utf-8').decode('utf-8', 'ignore')

“doesn’t work” because it simply translates a Unicode string to UTF-8 and then back again.
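If the goal is to discard the bad bytes, the 'ignore' has to be applied at the point where the raw bytes are first decoded, e.g. (raw_output again standing in for the captured bytes):

text = raw_output.decode('utf-8', errors='ignore')   # silently drops undecodable bytes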

It appears to have been a Latin-1 issue. I forced everything to Latin-1 and it all just works. I’m actually kind of amazed that’s all it was.
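In case it helps anyone else, the change amounted to something like this (a sketch - the host is a placeholder):

result = subprocess.run(
    ['cfgmaker', 'public@192.0.2.1'],
    capture_output=True,
    encoding='latin-1',   # decode cfgmaker's output as Latin-1 instead of the default
)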

In any event, thank you for the response. Sometimes it takes a bit for me to figure these sorts of things out!

I assume you’re using something from the subprocess module to capture the output? Are you asking that module to treat the output as text (with a text=True keyword parameter), or decoding it yourself?

What encoding are you expecting the output to use, and why? (Why should it make sense to interpret the output as textual?)
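For reference, the two approaches look like this (the cfgmaker arguments are placeholders):

import subprocess

# Option 1: subprocess decodes for you (locale encoding, unless you pass encoding=...)
r = subprocess.run(['cfgmaker', 'public@192.0.2.1'], capture_output=True, text=True)
# r.stdout is a str

# Option 2: capture raw bytes and decode them yourself, with explicit control
r = subprocess.run(['cfgmaker', 'public@192.0.2.1'], capture_output=True)
text = r.stdout.decode('utf-8', errors='replace')   # r.stdout is bytes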

If you are trying to use code like this (assuming you aren’t stuck on 2.x), it means that text is already a string at this point. It can’t fix the problem you describe, because that problem would occur before this code could run. Actually, it can’t fix any problem, because the bytes call will produce valid UTF-8 data regardless of the string that was input (as UTF-8 can encode every Unicode character), so there can’t be any issues to 'ignore' in the decode step.

The term “continuation byte” is just talking about the details of UTF-8 encoding. If the data you’re trying to capture were UTF-16, it would probably fail a lot sooner. If it succeeded, your output would be full of NUL characters (depending on how you display the data, this might not be noticeable).
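For example, UTF-16 data that happens to be pure ASCII decodes “successfully” as UTF-8 - it just comes out full of NULs:

>>> 'Router1'.encode('utf-16-le')
b'R\x00o\x00u\x00t\x00e\x00r\x001\x00'
>>> 'Router1'.encode('utf-16-le').decode('utf-8')
'R\x00o\x00u\x00t\x00e\x00r\x001\x00'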

In the general case, the only way a program can verify whether some data is valid for a particular text encoding is to try decoding the raw bytes and see if it encounters a problem. UTF-8 allows only one “role” for each possible byte value: depending on the value, a byte can only ever be a single-byte character, a “start” byte (the first byte of a multi-byte character), or a “continuation” byte (i.e., a not-first byte of a multi-byte character). A start byte has to be followed by a specific number of continuation bytes (at least one, depending on its exact value). The error message you saw means that byte 145484 was a start byte, but byte 145485 was not a continuation byte.
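Here is the failure in miniature. The byte 0xE9 is “é” in Latin-1, but under UTF-8 it is a start byte, so the plain-ASCII space after it triggers exactly this error:

>>> b'caf\xe9 bar'.decode('latin-1')
'café bar'
>>> b'caf\xe9 bar'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte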

It was an encoding issue. Working with text requires knowing how it was encoded. UTF-8, UTF-16 and Latin-1 are all possibilities, and there are many more.

There is no automatic way to know for sure how text is encoded - only heuristics and metadata. Data doesn’t interpret itself; nothing prevents you from treating a raw sampled audio format as if it represented a raw sampled image, or vice-versa. We have to rely on metadata for that: filename extensions, the header data in a file format, etc.

Some text has a very crude form of header. Python source code can use a coding declaration, which uses characters that are represented the same way in many encodings (so Python can start with a guess, read a little at the beginning of the file, and switch if necessary). Files that use either UTF-8 or UTF-16 sometimes have a “byte order mark” at the beginning - a particular Unicode character that has no meaning aside from being used for this purpose. By looking for specific patterns of bytes, we can decide the encoding.
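A sketch of that kind of BOM sniffing (the function name is mine):

import codecs

def sniff_bom(raw):
    # Return a codec name if a known byte order mark leads the data, else None.
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'   # the 'utf-8-sig' codec strips the BOM while decoding
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'      # the 'utf-16' codec consumes the BOM itself
    return None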

Generally, if all of your text uses only plain ASCII characters, you can get away with guessing wrong to a large extent. UTF-8 is the closest thing we have to a modern “standard” for serialized data (in-memory representation is a whole other matter), and for pure ASCII it produces the same bytes as Latin-1 and many other encodings.

If you think you have legacy data that uses a “single-byte” encoding (every character uses exactly one byte in the data, so there are only 256 possible characters it can represent, rather than the entire Unicode range), a comparison table of the common single-byte encodings can help you figure out which one.
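A quick trick along those lines: take the byte at the failing position and decode it under each candidate (0xE9 here is just an example value):

suspect = b'\xe9'   # the byte Python complained about - example value
for enc in ('latin-1', 'cp1252', 'cp437', 'mac-roman'):
    print(enc, '->', suspect.decode(enc))

Whichever candidate prints the character you’d plausibly expect (an accented letter in a hostname or interface description, say) is the likely culprit.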