In the examples you label “silent data corruption” above, no data is lost either. Fixing them may not be as simple as stripping a single character, but it’s still relatively straightforward.
I agree that a potential cause of corruption which tops out at 1 wrong character per file is less serious than one which can yield an arbitrary number of wrong characters. And you’re of course free to see it as too trivial to be called corruption, compared to the other ones. I still try to avoid it though.
(The reverse situation – an initial sequence of `b"\xef\xbb\xbf"` which is not intended as a UTF-8 BOM – seems vanishingly unlikely. But from the perspective of ‘we can’t rely on “unlikely”’, I get that `utf-8-sig` potentially deleting data (instead of “just” garbling it or erroring out, like the others) feels more problematic.)
I think much of the difference you’re seeing in your benchmark is caused by `utf-8` taking a shortcut through the codecs lookup machinery:

> CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
See also the following plot, which extends the benchmark to longer strings and more encoding aliases (full notebook here):
`U8` is another alias for `utf-8`, just one that happens to not take said shortcut. It’s much closer to `utf-8-sig`, though there’s still a gap. But AFAICS, `utf-8-sig` ultimately delegates to the same `codecs.utf_8_decode` as `utf-8`/`U8` (after handling the BOM), so the performance of the decoding itself should be the same.
(`U8 + lstrip` corresponds to running `bytes.decode("U8").lstrip("\ufeff")`, which is functionally identical to `utf-8-sig`, but it comes out much closer to plain `U8`. So maybe the gap between `U8` and `utf-8-sig` could be made smaller?)
(EDIT: That should be `removeprefix` instead of `lstrip`, of course; `lstrip` would incorrectly remove repeated occurrences.)
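For illustration, assuming an input that has a real U+FEFF immediately after the BOM:

```python
# Two U+FEFF at the start: the first three bytes are the encoded BOM, the second U+FEFF is content.
raw = b"\xef\xbb\xbf" + "\ufeffdata".encode("utf-8")

assert raw.decode("utf-8-sig") == "\ufeffdata"                  # drops only the BOM
assert raw.decode("U8").removeprefix("\ufeff") == "\ufeffdata"  # same result
assert raw.decode("U8").lstrip("\ufeff") == "data"              # too greedy: also drops content
```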
I thought this would be implemented as a single new encoding added to the codecs registry, with encoders from `utf_8.py` and decoders from `utf_8_sig.py` – a sort of hybrid.
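Roughly what I mean, as a sketch using a made-up name `utf-8-hybrid` registered via `codecs.register` (purely illustrative; not how an actual default codec would be wired into CPython):

```python
import codecs
from encodings import utf_8, utf_8_sig

# Hypothetical hybrid: encode like plain utf-8 (never writes a BOM),
# decode like utf-8-sig (strips one leading BOM if present).
_HYBRID = codecs.CodecInfo(
    name="utf-8-hybrid",
    encode=utf_8.encode,
    decode=utf_8_sig.decode,
    incrementalencoder=utf_8.IncrementalEncoder,
    incrementaldecoder=utf_8_sig.IncrementalDecoder,
    streamwriter=utf_8.StreamWriter,
    streamreader=utf_8_sig.StreamReader,
)

def _search(name):
    # The registry hands search functions a normalized name; accept both spellings.
    if name in ("utf-8-hybrid", "utf_8_hybrid"):
        return _HYBRID
    return None

codecs.register(_search)

assert "hi".encode("utf-8-hybrid") == b"hi"               # no BOM on encode
assert b"\xef\xbb\xbfhi".decode("utf-8-hybrid") == "hi"   # BOM stripped on decode
assert b"hi".decode("utf-8-hybrid") == "hi"               # BOM-less input unaffected
```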
But yeah, saying it out loud, I’m starting to see that, while not necessarily technically complicated, it’s probably not worth going through the community process of making a Frankenstein codec like this the default, just for the sake of a legacy/deprecated format.