I assume you’re using something from the `subprocess` module to capture the output? Are you asking that module to treat the output as text (with a `text=True` keyword parameter), or decoding it yourself?
What encoding are you expecting the output to use, and why? (Why should it make sense to interpret the output as textual?)
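To illustrate the two approaches (a minimal sketch; the child command here is just a stand-in for whatever program you’re actually running):

```python
import subprocess
import sys

# Option 1: let subprocess decode for you, and be explicit about the
# encoding rather than relying on the platform/locale default.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,
    text=True,          # stdout/stderr become str
    encoding="utf-8",   # only correct if the child actually emits UTF-8
)
print(result.stdout)    # already a str

# Option 2: capture raw bytes and decode them yourself.
raw = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    capture_output=True,
).stdout                # this is bytes
print(raw.decode("utf-8"))
```

Either way, the decoding step only works if the encoding you name matches what the child process actually produced.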
If you are trying to use code like this (assuming you aren’t stuck on 2.x), it means that `text` is already a string at this point. It can’t fix the problem you describe, because that problem would occur before this code could run. In fact, it can’t fix any problem: the `bytes` call will produce valid UTF-8 data regardless of the input string (UTF-8 can encode every Unicode character), so there is nothing for `errors='ignore'` to ignore in the `decode` step.
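You can see the round trip directly (any ordinary string works; lone surrogates are the only exception to “encoding to UTF-8 always succeeds”):

```python
text = "naïve café ☃"               # already a str
data = bytes(text, "utf-8")          # encoding succeeds for any normal str
# Decoding the result back is a no-op round trip, so errors='ignore'
# never has anything to discard:
assert data.decode("utf-8", errors="ignore") == text
```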
The term “continuation byte” is just talking about the details of UTF-8 encoding. If the data you’re trying to capture were UTF-16, it would probably fail a lot sooner. If it succeeded, your output would be full of NUL characters (depending on how you display the data, this might not be noticeable).
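For example (a sketch with made-up two-character data): ASCII text encoded as UTF-16 interleaves NUL bytes between the visible characters, so misreading it under a one-byte-per-character assumption “succeeds” but litters the result with NULs:

```python
# UTF-16 little-endian, no byte order mark:
data = "hi".encode("utf-16-le")
assert data == b"h\x00i\x00"

# Decoding those bytes as Latin-1 raises no error, but every other
# character in the result is U+0000:
decoded = data.decode("latin-1")
assert decoded == "h\x00i\x00"
```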
In the general case, the only way a program can verify whether some data is valid for a particular text encoding is to try decoding the raw bytes and see if it encounters a problem. The UTF-8 encoding only allows one “role” for each possible byte value: depending on the value, it can only be either a single-byte character, a “start” byte (the first byte of a multi-byte character), or a “continuation” byte (i.e., a not-first byte of a multi-byte character). The error message you saw means that byte 145484 was a start byte, but byte 145485 was not a continuation byte. A start byte has to be followed by (depending on its exact value) a specific number of continuation bytes (at least one).
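You can reproduce that exact failure in miniature. `0xC3` is a start byte that announces one continuation byte (which must be in the range `0x80`–`0xBF`); follow it with something outside that range and decoding fails the same way your capture did:

```python
# 0x28 is '(' — a plain ASCII byte, not a valid continuation byte:
try:
    b"\xc3\x28".decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    # e.g. "'utf-8' codec can't decode byte 0xc3 in position 0:
    #       invalid continuation byte"

# With a valid continuation byte, the same start byte decodes to
# a two-byte character:
assert b"\xc3\xa9".decode("utf-8") == "é"
```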
It was an encoding issue. Working with text requires knowing how it was encoded. UTF-8, UTF-16 and Latin-1 are all possibilities, and there are many more.
There is no automatic way to know for sure how text is encoded - only heuristics and metadata. Data doesn’t interpret itself - it’s the same as how nothing prevents you from trying to treat a raw sampled audio format as if it represented a raw sampled image, or vice-versa. We have to rely on metadata for that: filename extensions, the header data in a file format, etc.
Some text has a very crude form of header. Python code can use a coding declaration, which uses characters that are represented the same way in many encodings (so Python can start with a guess, read a little at the beginning of the file, and switch encodings if necessary). Files that use either UTF-8 or UTF-16 sometimes have a “byte order mark” at the beginning - this is just a particular Unicode character that has no meaning aside from being used for this purpose. By looking for specific patterns of bytes, we can decide the encoding.
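BOM handling is built into the standard codecs. A sketch: the `utf-8-sig` codec strips a leading UTF-8 byte order mark if one is present, while plain `utf-8` keeps it as a visible U+FEFF character, and the `codecs` module exposes the BOM byte patterns if you want to sniff them yourself:

```python
import codecs

# Simulate file content that starts with a UTF-8 BOM:
data = codecs.BOM_UTF8 + "text".encode("utf-8")

# utf-8-sig removes the BOM during decoding:
assert data.decode("utf-8-sig") == "text"

# plain utf-8 leaves it in the string as U+FEFF:
assert data.decode("utf-8") == "\ufefftext"

# Or sniff the raw bytes directly:
assert data.startswith(codecs.BOM_UTF8)
```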
Generally, if all of your text uses only the plain ASCII characters, you can get away with guessing wrong to a large extent. UTF-8 is the closest thing we have to a modern “standard”, for serialized data (in-memory is a whole other matter), and it will work for ASCII just like Latin-1 (and many other encodings).
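This is easy to check: for pure-ASCII text, UTF-8, Latin-1 and ASCII itself all produce byte-for-byte identical data, so a wrong guess among them is harmless - the encodings only diverge once you go beyond ASCII:

```python
s = "plain ASCII text"
assert s.encode("utf-8") == s.encode("latin-1") == s.encode("ascii")

# Outside ASCII, they differ:
assert "é".encode("utf-8") == b"\xc3\xa9"    # two bytes
assert "é".encode("latin-1") == b"\xe9"      # one byte
```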
If you think that you have legacy data that uses a “single-byte” encoding (every character uses only one byte in the data; there are thus only 256 possible characters the data can represent, rather than the entire Unicode range), this table can help you figure out which one.
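A quick way to narrow it down by hand is to decode the suspicious bytes under a few candidate single-byte encodings and see which result reads sensibly - the same byte value maps to a different character in each:

```python
b = b"\x92"
assert b.decode("cp1252") == "\u2019"   # Windows-1252: right single quote ’
assert b.decode("latin-1") == "\x92"    # Latin-1: an invisible C1 control char
assert b.decode("cp437") == "Æ"         # old DOS code page 437
```

A stray `0x92` that should have been an apostrophe, for instance, is a strong hint the data is Windows-1252 rather than Latin-1.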