PEP 597: Enable UTF-8 mode by default on Windows

There’s nothing in Windows that’s comparable to the Unix LC_CTYPE environment variable in terms of reliably controlling the standard I/O encoding of child processes. So the problem is relying on text=True. That won’t be reliable until the Windows NLS team supports an environment variable, checked at startup, that sets the active codepage to UTF-8.

The reliable course with subprocess in Windows is to research exactly which standard I/O encoding an application uses by default, and what command-line options and environment variables it supports to change that default, such as PYTHONIOENCODING for a child Python process. Configure it to use UTF-8, if possible. Either way, set encoding explicitly to whatever the child process uses.
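As a minimal sketch of this approach for a child Python process (the child command here is made up for illustration; the key points are setting PYTHONIOENCODING in the child's environment and passing a matching encoding to subprocess.run):

```python
import os
import subprocess
import sys

# Ask the child Python to use UTF-8 for its standard streams via
# PYTHONIOENCODING, then decode its output with the matching encoding.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "print('caf\\u00e9')"],
    capture_output=True,
    encoding="utf-8",   # must match what the child actually emits
    env=env,
)
print(result.stdout)
```

Because encoding is set explicitly, this does not depend on the active code page, so it behaves the same regardless of the console configuration.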

I’d love to find more modern native builds of the GNU tools for Windows that didn’t have these issues, but I’ve been looking for a long time to no avail.

I have doubts about MinGW. I’m more comfortable using a Unix environment such as Cygwin, MSYS2 or WSL to run Unix tools:

>>> import subprocess
>>> grep = r'C:\Msys\usr\bin\grep.exe'
>>> s = ['a£', 'b€', 'c\N{SNOWMAN}']
>>> p = subprocess.run([grep, '\N{SNOWMAN}'], input='\n'.join(s),
...         capture_output=True, encoding='utf-8')
>>>
>>> p.stdout
'c☃\n'

Applications that agree to use the active code page to communicate over pipes will inherit the right thing today, but there’s no good way to configure this per subprocess call. The same goes for applications that agree to use the current console encoding. Your best bet is to pass a command-line argument asking the application to use Unicode, and also to set the encoding parameter to match so you can decode its output.

On your earlier point, even very aware developers are caught out frequently by our breaking changes. In particular, many libraries are not thoroughly tested with new Python versions before some of their users switch to them. This change would be guaranteed to break at least some people, so the evidence is needed to figure out just how many, and the deprecation plan is needed so we can warn the rest as best as we can (though the fact that some people still haven’t realised that Python 2.7 is unsupported means we’ll never reach everyone).

As far as I understand, MinGW just uses msvcrt’s setlocale(LC_ALL, ""), which selects the legacy encoding.

On the other hand, MSYS2 (including “Git for Windows”) and Cygwin use their own locale implementation, and UTF-8 is the default. For example, LANG=ja_JP.UTF-8 is set by default in my environment.

When the UTF-8 console becomes stable, people using such UTF-8 tools will adopt it quickly. And people using legacy-encoding tools will avoid it, because it will cause many problems like this one.

So the current idea is to enable UTF-8 mode automatically when a UTF-8 console is detected. But this is still a rough idea, because the UTF-8 console is not yet stable.

At this point, (GetConsoleCP() == CP_UTF8 && GetConsoleOutputCP() == CP_UTF8) is the proposed way to detect a UTF-8 console. But if a better way is introduced in Windows 10 21H1, I will propose that instead.
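A minimal sketch of that check from Python, using ctypes to call the Windows console APIs (on non-Windows platforms there is no console code page, so the sketch conservatively reports False):

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def console_is_utf8() -> bool:
    """Return True if both the console input and output code pages are UTF-8.

    This mirrors the proposed GetConsoleCP()/GetConsoleOutputCP() check;
    on non-Windows platforms we simply return False.
    """
    if sys.platform != "win32":
        return False
    import ctypes
    kernel32 = ctypes.windll.kernel32
    return (kernel32.GetConsoleCP() == CP_UTF8
            and kernel32.GetConsoleOutputCP() == CP_UTF8)


print(console_is_utf8())
```

Note that the code pages can be changed at any time (e.g. by chcp), so a check like this is only valid at the moment it runs, which is one reason detection at interpreter startup is a rough heuristic.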

I’d be happy to test mingw-built tools. My naive tests using chcp caused the tools to crash, but I now think that was because I was doing the wrong thing. Can you explain to me how I can test this? I have the latest “Windows terminal” build, but that currently doesn’t even support AltGr-4 to enter the € on my keyboard :frowning: So I assume that’s not what you mean here.

PS This has gone a long way off-topic, so I’m equally happy if you just want to drop it.

I think mingw-built tools are not UTF-8 tools, as I said.

I don’t know of any plan for ucrt to support environment variables like LANG or LC_CTYPE.

MSYS2 (including Git for Windows) and Cygwin are UTF-8 tools.