PEP 597: Enable UTF-8 mode by default on Windows

There’s nothing in Windows that’s comparable to the Unix LC_CTYPE environment variable in terms of reliably controlling the standard I/O encoding of child processes. So the problem is relying on text=True. That won’t be reliable until the Windows NLS team supports an environment variable, checked at startup, that sets the active codepage to UTF-8.

The reliable course with subprocess in Windows is to research exactly which standard I/O encoding an application uses by default, and what command-line options and environment variables it supports to change that default, such as PYTHONIOENCODING for a child Python process. Configure it to use UTF-8, if possible. Either way, set encoding explicitly to whatever the child process uses.
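As a minimal sketch of this approach for a child Python process (the child command here is made up for illustration; the key points are setting PYTHONIOENCODING in the child's environment and passing a matching encoding to subprocess.run):

```python
import os
import subprocess
import sys

# Ask the child Python to use UTF-8 for its standard streams via
# PYTHONIOENCODING, then decode its output with the matching encoding.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "print('caf\\u00e9')"],
    capture_output=True,
    encoding="utf-8",   # must match what the child actually emits
    env=env,
)
print(result.stdout)
```

Because encoding is set explicitly, this does not depend on the active code page, so it behaves the same regardless of the console configuration.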

I’d love to find more modern native builds of the GNU tools for Windows that didn’t have these issues, but I’ve been looking for a long time to no avail.

I have doubts about MinGW. I’m more comfortable using a Unix environment such as Cygwin, MSYS2 or WSL to run Unix tools:

>>> import subprocess
>>> grep = r'C:\Msys\usr\bin\grep.exe'
>>> s = ['a£', 'b€', 'c\N{SNOWMAN}']
>>> p = subprocess.run([grep, '\N{SNOWMAN}'], input='\n'.join(s),
...         capture_output=True, encoding='utf-8')
>>>
>>> p.stdout
'c☃\n'

Applications that agree to use the active code page to communicate over pipes will inherit the right thing today, but there’s no good way to configure this per subprocess call. The same goes for applications that agree to use the current console encoding. Your best bet is to pass a command-line argument asking the application to use Unicode, and also to set the encoding parameter to match so you can decode its output.

On your earlier point, even very aware developers are caught out frequently by our breaking changes. In particular, many libraries are not thoroughly tested with new Python versions before some of their users switch to them. This change would be guaranteed to break at least some people, so the evidence is needed to figure out just how many, and the deprecation plan is needed so we can warn the rest as best as we can (though the fact that some people still haven’t realised that Python 2.7 is unsupported means we’ll never reach everyone).

As far as I understand, MinGW just uses msvcrt’s setlocale(LC_ALL, ""), which selects the legacy encoding.

On the other hand, MSYS2 (including “Git for Windows”) and Cygwin use their own locale implementation, and UTF-8 is the default. For example, LANG=ja_JP.UTF-8 is set by default in my environment.

When the UTF-8 console becomes stable, people using such UTF-8 tools will adopt it quickly. And people using legacy-encoding tools will avoid it, because it will cause many problems like this one.

So the current idea is to enable UTF-8 mode automatically when a UTF-8 console is detected. But this is still a rough idea, because the UTF-8 console is not yet stable.

At this point, (GetConsoleCP() == CP_UTF8 && GetConsoleOutputCP() == CP_UTF8) is the proposed way to detect a UTF-8 console. But if a better way is introduced in Windows 10 21H1, I will propose that instead.
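A minimal sketch of that check from Python, using ctypes to call the Windows console APIs (on non-Windows platforms there is no console code page, so the sketch conservatively reports False):

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def console_is_utf8() -> bool:
    """Return True if both the console input and output code pages are UTF-8.

    This mirrors the proposed GetConsoleCP()/GetConsoleOutputCP() check;
    on non-Windows platforms we simply return False.
    """
    if sys.platform != "win32":
        return False
    import ctypes
    kernel32 = ctypes.windll.kernel32
    return (kernel32.GetConsoleCP() == CP_UTF8
            and kernel32.GetConsoleOutputCP() == CP_UTF8)


print(console_is_utf8())
```

Note that the code pages can be changed at any time (e.g. by chcp), so a check like this is only valid at the moment it runs, which is one reason detection at interpreter startup is a rough heuristic.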

I’d be happy to test mingw-built tools. My naive tests using chcp caused the tools to crash, but I now think that was because I was doing the wrong thing. Can you explain to me how I can test this? I have the latest “Windows terminal” build, but that currently doesn’t even support AltGr-4 to enter the € on my keyboard :frowning: So I assume that’s not what you mean here.

PS This has gone a long way off-topic, so I’m equally happy if you just want to drop it.

I think mingw-built tools are not UTF-8 tools, as I said.

I don’t know of any plan for ucrt to support environment variables like LANG or LC_CTYPE.

MSYS2 (including Git for Windows) and Cygwin are UTF-8 tools.