PEP 686: Make UTF-8 mode default (Round 2)

In the examples you label “silent data corruption” above, no data is lost either. Fixing them may not be as simple as stripping a single character, but it’s still relatively straightforward.

I agree that a potential cause of corruption which tops out at 1 wrong character per file is less serious than one which can yield an arbitrary number of wrong characters. And you’re of course free to see it as too trivial to be called corruption, compared to the other ones. I still try to avoid it though.

(The reverse situation – an initial sequence of b"\xef\xbb\xbf" which is not intended as a UTF-8 BOM – seems vanishingly unlikely. But from the perspective of ‘we can’t rely on “unlikely”’, I get that utf-8-sig potentially deleting data (instead of “just” garbling it or erroring out, like the others) feels more problematic.)
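To make the cases concrete, a quick illustrative snippet (using latin-1 as a stand-in for a mismatched legacy codec):

```python
data = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by ASCII text

# Plain utf-8 keeps the BOM around as one extra character,
# utf-8-sig silently drops it, and a mismatched legacy codec
# turns it into mojibake:
print(repr(data.decode("utf-8")))      # '\ufeffhello'
print(repr(data.decode("utf-8-sig")))  # 'hello'
print(repr(data.decode("latin-1")))    # 'ï»¿hello'
```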

I think much of the difference you’re seeing in your benchmark is caused by utf-8 taking a shortcut through the codecs lookup machinery:

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
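A minimal version of such a timing comparison might look like this (my own quick sketch, not the linked notebook; absolute numbers will of course vary by machine and CPython version):

```python
import timeit

data = ("hello world " * 1000).encode("utf-8")

# "utf-8" hits the C-level fast path, "U8" goes through the generic
# codec lookup, and "utf-8-sig" additionally checks for (and strips) a BOM.
for enc in ("utf-8", "U8", "utf-8-sig"):
    t = timeit.timeit(lambda: data.decode(enc), number=10_000)
    print(f"{enc:10s} {t:.3f}s")
```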

See also the following plot, which extends the benchmark to longer strings and more encoding aliases (full notebook here):

U8 is another alias for utf-8, just one that happens to not take said shortcut. It’s much closer to utf-8-sig, though there’s still a gap. But AFAICS, utf-8-sig ultimately delegates to the same codecs.utf_8_decode as utf-8/U8 (after handling the BOM), so the performance of the decoding itself should be the same.
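For reference, the stateless decode in utf_8_sig.py boils down to roughly this (simplified paraphrase, not the verbatim source):

```python
import codecs

def utf_8_sig_decode(input, errors="strict"):
    # Strip one leading BOM if present, then delegate to the same C-level
    # decoder that plain utf-8 uses; only the reported length is adjusted.
    prefix = 0
    if input[:3] == codecs.BOM_UTF8:
        input = input[3:]
        prefix = 3
    output, consumed = codecs.utf_8_decode(input, errors, True)
    return output, consumed + prefix

print(utf_8_sig_decode(b"\xef\xbb\xbfhello"))  # ('hello', 8)
```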

(U8 + lstrip corresponds to running bytes.decode("U8").lstrip("\ufeff"), which is functionally identical to utf-8-sig, but it comes out much closer to plain U8. So maybe the gap between U8 and utf-8-sig could be made smaller?)

(EDIT: That should be removeprefix instead of lstrip of course, lstrip would incorrectly remove repeated occurrences.)
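To spell out the difference:

```python
s = "\ufeff\ufeffdata"

# lstrip treats its argument as a set of characters and keeps stripping,
# while removeprefix (3.9+) removes at most one occurrence, which is what
# utf-8-sig does with the BOM:
print(repr(s.lstrip("\ufeff")))        # 'data'
print(repr(s.removeprefix("\ufeff")))  # '\ufeffdata'
```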

I thought this would be implemented as a single new encoding added to the codecs registry, with encoders from utf_8.py and decoders from utf_8_sig.py – a sort of hybrid.
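Something along these lines, say (quick sketch only; the utf-8-hybrid name is made up purely for illustration):

```python
import codecs
import encodings.utf_8 as utf_8
import encodings.utf_8_sig as utf_8_sig

# Hypothetical hybrid: encode like plain utf-8 (never write a BOM),
# decode like utf-8-sig (accept and drop one leading BOM).
HYBRID = codecs.CodecInfo(
    name="utf-8-hybrid",
    encode=utf_8.encode,
    decode=utf_8_sig.decode,
    incrementalencoder=utf_8.IncrementalEncoder,
    incrementaldecoder=utf_8_sig.IncrementalDecoder,
    streamwriter=utf_8.StreamWriter,
    streamreader=utf_8_sig.StreamReader,
)

def _search(name):
    # The registry lowercases the requested name; normalize hyphens here too
    # so the check also works on versions that don't do that for us.
    return HYBRID if name.replace("-", "_") == "utf_8_hybrid" else None

codecs.register(_search)

print(b"\xef\xbb\xbfhello".decode("utf-8-hybrid"))  # 'hello'
print("hello".encode("utf-8-hybrid"))               # b'hello' (no BOM)
```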

But yeah, saying it out loud, I’m starting to see that while it’s not necessarily technically complicated, it’s probably not worth going through the community process of considering a Frankenstein codec like this as the default, just for the sake of a legacy/deprecated format.

Just because the UTF-8 BOM might be “legacy/deprecated” does not mean it’s not common in the wild. Gratuitously failing to decode it by default just because we don’t want to forgo some micro-optimization sounds a bit user-hostile.

(other than said micro-optimization, there is no technical reason I can think of for “UTF-8 + BOM” decoding to be significantly slower than plain UTF-8)

The micro-optimization is just one example of the additional complexity; there are others.
For example, TextIOWrapper supports random-access read and write on the same file, so “utf8 for write and utf-8-sig for read” is not so simple.
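Rough illustration of the symmetry that a “read with utf-8-sig, write with utf-8” pairing would have to give up (today’s utf-8-sig strips the BOM on read and writes it back at the start of the stream, so the two directions agree):

```python
import io

buf = io.BytesIO()
f = io.TextIOWrapper(buf, encoding="utf-8-sig", write_through=True)
f.write("hello")
# The utf-8-sig encoder emits the BOM at the start of the stream,
# matching what its decoder would skip when reading the file back:
print(buf.getvalue())   # b'\xef\xbb\xbfhello'
f.detach()              # keep buf usable after discarding the wrapper
```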

Anyway, this idea is not directly related to PEP 686.
It doesn’t solve any problem introduced by the PEP; it is just an additional idea.
So please create a new thread for it.


That’s true, but such support already exists for utf-16 and utf-32:
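(Meaning that those codecs already pair a BOM-writing encoder with a BOM-consuming decoder; a quick illustration of what I mean:)

```python
data = "hello".encode("utf-16")
# The utf-16 encoder writes a BOM in native byte order
# (b'\xff\xfe' on little-endian builds, b'\xfe\xff' on big-endian):
print(data[:2])
# ...and the decoder consumes it instead of handing it back as text:
print(data.decode("utf-16"))   # 'hello'
```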

It is not related to PEP 686.
Please create a new thread for the new ideas.
