PEP 597: Use UTF-8 for default text file encoding

With PowerShell, or Python, or any other application that supports the Unicode APIs for writing to/reading from the console.

cmd.exe (and many of the tools you’d use from it) are not such applications.

(If you’re an end-user, then you don’t get to choose to use the console in any way other than how your applications will let you. It’s only as a developer that you get to choose.)

Just my anecdotal 2c:

As a Windows-only user for the past 10+ years, the absolute only time I’ve written/read things in something other than UTF-8 was when burning subtitles created by others into video. In those cases one can only guess the encoding, so I used chardet.

Having the default be UTF-8 would have saved me lots of pain over the years.


Thanks for your reaction.
When I looked at the aws-cli repository for the discussion in the other thread, I found this issue too.

It’s very obvious that this is a common bug, and that many Windows users suffer because the default encoding is not UTF-8.

On the other hand, it is far from obvious how many (or how few) Windows users would be hurt by a backward-incompatible change in the mid-2020s. Showing that nobody would be affected amounts to proving a negative (a devil’s proof).

So my PEP 597 (2nd) proposes an environment variable to configure the default encoding. If it is accepted, you will be able to change the default text encoding yourself, and we can postpone the discussion about when to change the default of the “default text encoding”.

But we have PYTHONUTF8 already. The most important part of PEP 597 is explaining why UTF-8 mode is not enough for Windows users.

So, if you would like to contribute to this discussion, it would be very helpful to try UTF-8 mode now (maybe with chcp 65001).
If it is enough, we don’t need to add yet another configuration option.
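
For anyone trying this, a quick way to see what the interpreter is actually using is a small script like the one below. It is only a sketch: the helper name describe_text_defaults is made up for illustration, and it just reports values from the standard sys and locale modules.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the settings that decide how text is encoded by default."""
    return {
        # True when started with -X utf8 or PYTHONUTF8=1
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding open() falls back to when no encoding= is given
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding of the standard output stream (may be missing if replaced)
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_text_defaults().items():
        print(f"{name}: {value}")
```

Running it once normally and once with `python -X utf8` (or with PYTHONUTF8=1 set) shows which defaults UTF-8 mode changes on your machine.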

Actually it’s not right to say that fopen doesn’t care about encoding – at least not in Windows. For many years, the C runtime in Windows has supported UTF-8 and UTF-16 text files – and even UTF-16LE for console access. [_w]fopen[_s] takes a ccs mode flag, which can be ccs=UTF-8, ccs=UTF-16LE, or ccs=UNICODE – e.g. "a,ccs=UTF-8". At a lower level, these ccs flag values correspond to the _[w]open[_s] flag values _O_U8TEXT, _O_U16TEXT, and _O_WTEXT. The behavior when opened for reading or appending depends on the presence of a BOM. When opened for writing in these Unicode text modes, a BOM is always written.
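
As a rough Python-side analogy only (this is not the CRT and does not call fopen), the "utf-8-sig" codec behaves similarly with respect to the BOM: it always writes one when encoding, and when decoding it strips a BOM if present and otherwise reads plain UTF-8. The function names below are hypothetical.

```python
import codecs


def encode_with_bom(text):
    """Encode text as UTF-8 with a leading BOM, as 'utf-8-sig' always does."""
    return text.encode("utf-8-sig")


def decode_optional_bom(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    payload = encode_with_bom("hi")
    print(payload.startswith(codecs.BOM_UTF8))
    print(decode_optional_bom(payload) == decode_optional_bom(b"hi"))
```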

The catch for the CRT’s UTF-8 text mode is that Unicode text is passed as wchar_t UTF-16LE characters. This is not encoding-neutral support for Unicode via UTF-8 as a sequence of arbitrary bytes. The CRT translates UTF-16LE -> UTF-8 when writing and UTF-8 -> UTF-16LE when reading. This is similar to Python’s str <-> bytes translation between the text and buffered/raw layers. In text mode, the CRT also implements CRLF <-> LF newline translation.
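
The Python side of that comparison can be sketched with an in-memory byte buffer standing in for the buffered/raw layer. The helper names are made up for illustration, and newline="\r\n" is chosen explicitly so the result does not depend on the platform.

```python
import io


def write_text_layer(text):
    """Write str through a text layer; return the bytes that reach the byte layer."""
    raw = io.BytesIO()
    # newline="\r\n" turns each "\n" written into "\r\n"
    layer = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
    layer.write(text)
    layer.flush()
    return raw.getvalue()


def read_text_layer(data):
    """Read bytes back through a text layer with universal newline handling."""
    layer = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline=None)
    return layer.read()


if __name__ == "__main__":
    stored = write_text_layer("naïve\n")
    print(stored)
    print(read_text_layer(stored) == "naïve\n")
```

The caller only ever sees str, while the byte layer only ever sees UTF-8 bytes with CRLF line endings.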