PEP 686: Make UTF-8 mode default (Round 2)

In the examples you label “silent data corruption” above, no data is lost either. Fixing them may not be as simple as stripping a single character, but it’s still relatively straightforward.

I agree that a potential cause of corruption which tops out at 1 wrong character per file is less serious than one which can yield an arbitrary number of wrong characters. And you’re of course free to see it as too trivial to be called corruption, compared to the other ones. I still try to avoid it though.

(The reverse situation – an initial sequence of b"\xef\xbb\xbf" which is not intended as a UTF-8 BOM – seems vanishingly unlikely. But from the perspective of ‘we can’t rely on “unlikely”’, I get that utf-8-sig potentially deleting data (instead of “just” garbling it or erroring out, like the others) feels more problematic.)

I think much of the difference you’re seeing in your benchmark is caused by utf-8 taking a shortcut through the codecs lookup machinery:

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
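To reproduce that locally, a minimal timing sketch along these lines should do (stdlib only; the exact numbers will of course vary by machine and build):

import timeit

# ~200k two-byte characters (≈400 kB); any mostly non-ASCII payload shows the same pattern
data = ("лл" * 100_000).encode("utf-8")

for enc in ("utf-8", "U8", "utf-8-sig"):
    t = timeit.timeit(lambda enc=enc: data.decode(enc), number=200)
    print(f"{enc:10} {t:.3f}s")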

See also the following plot, which extends the benchmark to longer strings and more encoding aliases (full notebook here):

U8 is another alias for utf-8, just one that happens to not take said shortcut. It’s much closer to utf-8-sig, though there’s still a gap. But AFAICS, utf-8-sig ultimately delegates to the same codecs.utf_8_decode as utf-8/U8 (after handling the BOM), so the performance of the decoding itself should be the same.

(U8 + lstrip corresponds to running bytes.decode("U8").lstrip("\ufeff"), which is functionally identical to utf-8-sig, but it comes out much closer to plain U8. So maybe the gap between U8 and utf-8-sig could be made smaller?)

(EDIT: That should be removeprefix instead of lstrip of course, lstrip would incorrectly remove repeated occurrences.)
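For concreteness, the equivalence being claimed is just this (a trivial sketch, not the actual benchmark code):

data_with_bom = b"\xef\xbb\xbfhello"
data_without_bom = b"hello"

for data in (data_with_bom, data_without_bom):
    # removeprefix strips at most one leading BOM, matching utf-8-sig's behaviour
    assert data.decode("U8").removeprefix("\ufeff") == data.decode("utf-8-sig")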

I thought this would be implemented as a single new encoding added to the codecs registry, with encoders from utf_8.py and decoders from utf_8_sig.py – a sort of hybrid.
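Roughly like this, purely as a sketch (the name "utf-8-hybrid" and the registration function are invented here for illustration; nothing like it exists today):

import codecs
from encodings import utf_8, utf_8_sig

def search(name):
    # Hypothetical hybrid: encode as plain UTF-8 (never writes a BOM),
    # but tolerate and strip a leading BOM when decoding.
    if name != "utf-8-hybrid":
        return None
    return codecs.CodecInfo(
        name="utf-8-hybrid",
        encode=utf_8.encode,                       # no BOM on output
        decode=utf_8_sig.decode,                   # leading BOM stripped if present
        incrementalencoder=utf_8.IncrementalEncoder,
        incrementaldecoder=utf_8_sig.IncrementalDecoder,
        streamreader=utf_8_sig.StreamReader,
        streamwriter=utf_8.StreamWriter,
    )

codecs.register(search)

assert b"\xef\xbb\xbfabc".decode("utf-8-hybrid") == "abc"
assert "abc".encode("utf-8-hybrid") == b"abc"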

But yeah, saying it out loud, I’m starting to see that while it’s not necessarily technically complicated, it’s probably not worth going through the community process of getting a Frankenstein codec like this considered as the default, just for the sake of a legacy/deprecated format.

Just because the UTF-8 BOM might be “legacy/deprecated” does not mean it’s not common in the wild. Gratuitously failing to decode it by default just because we don’t want to improve some micro-optimization sounds a bit user-hostile.

(other than said micro-optimization, there is no technical reason I can think of for “UTF-8 + BOM” decoding to be significantly slower than plain UTF-8)

The micro-optimization is just one example of the additional complexity; there are others.
For example, TextIOWrapper supports random-access read & write on the same file, so “utf8 for write and utf-8-sig for read” is not so simple.
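To illustrate the kind of question that comes up, this is how (as far as I can tell) the existing utf-8-sig codec behaves on a read/write file (file name arbitrary):

with open("bom.txt", "wb") as f:
    f.write(b"\xef\xbb\xbfhello")

with open("bom.txt", "r+", encoding="utf-8-sig") as f:
    print(f.read())    # 'hello' – the BOM is stripped on read
    f.seek(0)
    f.write("HELLO")   # the encoder is reset at position 0, so the BOM is written again

with open("bom.txt", "rb") as f:
    print(f.read())    # b'\xef\xbb\xbfHELLO'

A hypothetical “utf-8 for write, utf-8-sig for read” combination would have to decide what seek(0) followed by a write should do.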

Anyway, the idea is not directly related to PEP 686.
It doesn’t solve any problem introduced by the PEP; it is just an additional idea.
So please create a new thread for the new idea.

That’s true, but such support already exists for utf-16 and utf-32:
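For example, with utf-16, TextIOWrapper already writes the BOM for you and strips it on read (a quick illustration; file name arbitrary):

with open("demo.txt", "w", encoding="utf-16") as f:
    f.write("hello")          # a BOM is written at the start of the file

with open("demo.txt", "r", encoding="utf-16") as f:
    print(f.read())           # 'hello' – the BOM is consumed, not returned

with open("demo.txt", "rb") as f:
    print(f.read()[:2])       # b'\xff\xfe' on little-endian platforms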

It is not related to PEP 686.
Please create a new thread for the new ideas.

Shouldn’t this PEP recommend that Windows users set the default system encoding to UTF-8 already, where possible?

That way the transition can happen more seamlessly, and it would help in cases like the one below, which will fail silently after Python 3.15. Not to mention that this would bring a world with consistently encoded output a tiny step closer.

Before 3.15:

# Prints 'лл' just fine, as the default encoding is utf-8 when there is no pipe
python -c "print('лл')"
# Because of the pipe, Python falls back to the preferred encoding cp1252
# and the user sees an error:
# UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>
python -c "print('лл')" | more
# No errors, but prints garbage, since more interprets the utf-8 bytes using the current chcp code page
python -X utf8 -c "print('лл')" | more

And after Python 3.15, python -c "print('лл')" | more will stop emitting any errors, and the user might miss that they’re passing garbage values.

If UTF-8 is enabled globally on Windows, then it’s like having both chcp 65001 (utf-8) set in the terminal (e.g. more will interpret values from the pipe correctly) and getpreferredencoding() returning cp65001, so the user gets the effect of PEP 686 even before 3.15.
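For reference, a quick way to check what Python actually sees in a given console, before and after flipping that system setting:

python -c "import locale, sys; print(locale.getpreferredencoding(), sys.stdout.encoding)"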

It cannot be recommended, because it breaks many legacy applications.
Microsoft provides a per-application UTF-8 locale for now.

Python 3.15 doesn’t change more. You can get the effect of PEP 686 before 3.15 by setting PYTHONUTF8=1.

The environments around Python vary:

  • If you use msysgit or Git for Windows, your shell is already UTF-8 based, and PYTHONUTF8=1 will work nicely.
  • If you use modern PowerShell (Core), it is recommended to set the Out-File encoding and the console encoding to UTF-8.
  • If you use legacy PowerShell or cmd.exe, especially with the legacy console, UTF-8 mode may not work well. In such legacy environments, you need to set PYTHONUTF8=0 after Python 3.15.