PEP 597: Enable UTF-8 mode by default on Windows

Hi, all.

Here is the 2nd edition of the PEP 597. (The 1st edition is here)

In the previous edition I proposed changing the default text encoding to UTF-8, but that was a backward-incompatible change, and raising a DeprecationWarning whenever encoding is omitted is a huge pain.

This time, I am proposing to utilize the UTF-8 mode by enabling it by default on Windows.

I don’t propose it for Python 3.9 because I want to get feedback about UTF-8 mode on Windows from users. I documented the UTF-8 mode on Windows (link) already. I want to recommend the UTF-8 mode for Windows users in 2020.


Abstract

This PEP proposes to make UTF-8 mode [#]_ enabled by default on
Windows.

The goal of this PEP is to give Windows users the same “UTF-8 by
default” experience that Unix users already have.

Motivation

UTF-8 is the best encoding nowadays

Popular text editors like VS Code use UTF-8 by default.
Even Microsoft Notepad uses UTF-8 by default since the Windows 10
May 2019 Update.
Additionally, the default encoding of Python source files is UTF-8.

We can assume that most Python programmers use UTF-8 for most text
files.

Python is one of the most popular first programming languages.
New programmers may not know about encoding. If the default encoding
for text files is UTF-8, they can learn about encoding when they need
to handle legacy encoding.

People assume the default encoding is UTF-8 already

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, long_description = open("README.md").read() in
setup.py is a common mistake. Many Windows users can not install
the package if there is at least one emoji or any other non-ASCII
character in the README.md file.
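A minimal sketch of the portable form of that setup.py line, assuming the README is stored as UTF-8. The helper name read_long_description is hypothetical and only used for illustration:

```python
from pathlib import Path


def read_long_description(path="README.md"):
    """Read a README as UTF-8 regardless of the platform's default encoding."""
    # Passing encoding explicitly avoids depending on the ANSI code page
    # that Windows would otherwise use when encoding is omitted.
    return Path(path).read_text(encoding="utf-8")
```

In setup.py this would be used as long_description = read_long_description().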

Even Python experts assume that the default encoding is UTF-8.
This creates bugs that happen only on Windows. See [#]_ and [#]_.

Changing the default text encoding to UTF-8 will help many Windows
users.

Specification

Enable UTF-8 mode on Windows unless it is disabled explicitly.

UTF-8 mode affects these areas:

  • locale.getpreferredencoding returns “UTF-8”.

    • open, subprocess.Popen, pathlib.Path.read_text,
      ZipFile.open, and many other functions use UTF-8 when
      the encoding option is omitted.
  • stdio always uses “UTF-8”.

    • Console I/O already uses “UTF-8” [#]_, so this only matters
      when stdio is redirected.

On the other hand, UTF-8 mode doesn’t affect the “mbcs” encoding.
Users can still use the system encoding by choosing the “mbcs”
encoding explicitly.
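As a rough illustration of the areas listed above, the following sketch reports what the running interpreter currently uses. The function name describe_text_defaults is hypothetical; sys.flags.utf8_mode and locale.getpreferredencoding are the standard-library hooks from PEP 540:

```python
import locale
import sys


def describe_text_defaults():
    """Collect the settings that UTF-8 mode influences in this interpreter."""
    return {
        # 1 when UTF-8 mode is enabled, 0 otherwise (Python 3.7+).
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Default used by open() and friends when encoding is omitted.
        "preferred_encoding": locale.getpreferredencoding(False),
        # stdout may be missing or replaced (e.g. under pythonw), hence getattr.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_text_defaults().items():
        print(f"{key}: {value}")
```

Running this once normally and once with UTF-8 mode enabled shows which defaults would change on a given machine.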

Backwards Compatibility

Some existing applications assuming the default text encoding is the
system encoding (a.k.a. ANSI encoding) will be broken by this change.

Users can disable the UTF-8 mode with an environment variable
(PYTHONUTF8=0) or a command line option (-X utf8=0) for backward
compatibility.
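A small sketch of how the environment variable switch can be checked today: it starts a child interpreter with PYTHONUTF8 set and asks it for sys.flags.utf8_mode. The helper name child_utf8_mode is hypothetical:

```python
import os
import subprocess
import sys


def child_utf8_mode(value):
    """Run a child Python with PYTHONUTF8=value and return its utf8_mode flag."""
    env = dict(os.environ, PYTHONUTF8=value)
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # The child prints a single ASCII digit, so decoding as ASCII is safe.
    return int(completed.stdout.decode("ascii").strip())
```

Under this PEP, an application that depends on the legacy default would set PYTHONUTF8=0 in the same way.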

Rejected Ideas

Change the default encoding of TextIOWrapper to “UTF-8”

This idea would always change the default encoding to UTF-8, regardless of
platform, locale, and environment variables.

While this idea looks ideal in terms of consistency, it will cause
backward compatibility problems.

Utilizing the UTF-8 mode seems better than adding one more backward
compatibility option like PYTHONLEGACYWINDOWSSTDIO.

Reference Implementation

To be written.

References

.. [#] `PEP 540 -- Add a new UTF-8 Mode <https://www.python.org/dev/peps/pep-0540/>`_
.. [#] https://github.com/pypa/packaging.python.org/pull/682
.. [#] https://bugs.python.org/issue33684
.. [#] `PEP 528 -- Change Windows console encoding to UTF-8 <https://www.python.org/dev/peps/pep-0528/>`_

To save me doing the diff by hand on my phone, exactly what has changed from the last time this was proposed?

The PEP text says nothing about warnings and provides no suggestion of usage patterns or numbers. Since you are proposing a change to the status quo, you have the responsibility to justify and support it - you don’t get to pass that task on to those who are opposed.

The most important change is that this proposal is 100% backward compatible with the status quo when PYTHONUTF8=0 is set.

The next most important change is that I postponed the target Python version from 3.9 to 3.10.

I want to get feedback about UTF-8 mode on Windows from users. That’s why I published this PEP.
After seeing the feedback, I may postpone the target version again or withdraw the PEP.

So please don’t hurry about making a conclusion.

What do you mean by “warnings”?

The “status quo” was chosen for Python 3.0 (2008). The environment around Python on Windows has changed drastically since then:

  • Popular text editors like VSCode, Atom, and Sublime Text use UTF-8 by default.
  • Even Notepad has changed its default encoding to UTF-8.
  • Recent consoles support UTF-8 (chcp 65001 was broken until Windows 7).
  • UTF-8 is used by 94.8% of all websites now.
  • Even Microsoft calls code pages “legacy code pages” (ref).
  • Emojis are widely used, and most code page encodings don’t support them.

I think there is no doubt that UTF-8 is the best default text encoding in the 2020s. The main issue is when and how to change it.

One thing that hasn’t (to my knowledge) changed is that most command line programs do not produce output in UTF8 (unless we force codepage 65001, which as far as I know the proposal doesn’t suggest, leaving that to user configuration). So, for example, if you are calling git in a subprocess and capturing the output, using UTF-8 will give incorrect results.

Also, from my limited tests, it appears that code compiled with the mingw toolchain has serious issues dealing with chcp 65001. I don’t have any details, but I’d be very cautious about assuming that switching subprocess calls to UTF-8 is something we can do easily :slightly_frowning_face:

If and when the default code page (both ANSI and console) is changed by Microsoft to 65001, then I think this change is safe - because Microsoft is incredibly cautious about stuff like this. If we want to make this change sooner than that, then we need to be prepared to manage the additional risk that we’re taking on.

That’s not quite right on any count. Using codepage 65001 for multibyte console I/O is broken for both input and output in Windows 7. In Windows 8+, writing UTF-8 is mostly fixed. It’s still a bit broken if we’re nitpicking, because it doesn’t support splitting a UTF-8 sequence across two writes. As to the input side of input/output, even in Windows 10 the console does not support reading multibyte input as UTF-8. It is broken for all but ASCII (7-bit) characters. For example:

>>> import os
>>> os.read(0, 11)
aĀbƁcÇ
b'a\x00b\x00c\x00\r\n'

Using an alternate terminal for the user interface doesn’t necessarily help, not unless it hooks ReadConsoleA in client process, routes the call to ReadConsoleW, and manually encodes the result back to the caller. The above example was copied out of ConEmu, so despite all of its API hooking, this issue has slipped by it. The new Windows Terminal app doesn’t hook API functions in client processes, and so it definitely needs the problem to be addressed at the API level in ConHost (aka “OpenConsole”).

Edit:
It may help to have a simple diagram of the relationship here: Client Applications <=> ConDrv files <=> ConHost [<=> Alternate Terminal]. ConHost provides a built-in terminal window for console sessions, but the Unicode support of this legacy window is sad. It’s limited to the BMP and doesn’t support combining codes, complex scripts, or automatic font fallback. This has nothing to do with the console’s UTF-8 support, however. That’s on the API server side of ConHost, and it’s a factor even if we’re using an alternate terminal such as the new Windows Terminal app.

Many commands produce code page output, and many other commands produce UTF-8.
For example, “Git for Windows” produces UTF-8 regardless of the code page. (It uses its own locale, and that locale uses UTF-8.) When I do ls > x in bash in “Git for Windows”, x is encoded in UTF-8.

FWIW, my previous idea (I didn’t write it in the PEP) was “Enable UTF-8 mode when GetConsoleCP returns CP_UTF8”.
I played around with chcp 65001 and its behavior is very strange, because there can be several conhost.exe processes behind one terminal.
Windows Terminal creates OpenConsole, WSL creates another conhost, and a Windows command executed from WSL creates yet another conhost. So GetConsoleCP may return a legacy code page even after chcp 65001.
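The check described here could be sketched roughly as follows. GetConsoleCP and GetConsoleOutputCP are real kernel32 functions reached through ctypes; the function names and the policy in console_requests_utf8 are hypothetical and only illustrate the idea, not any actual CPython behavior:

```python
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def console_code_pages():
    """Return (input_cp, output_cp) for the attached console, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.windll.kernel32
    # Both calls return 0 when the process has no console attached.
    return kernel32.GetConsoleCP(), kernel32.GetConsoleOutputCP()


def console_requests_utf8(pages):
    """Hypothetical policy: an input code page of 65001 signals a UTF-8 preference."""
    return pages is not None and pages[0] == CP_UTF8
```

As the surrounding discussion notes, the value seen by a process depends on which console host it is attached to, so this signal can be misleading.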

Another idea was adding one more option that changes only the text file encoding. But Python already has many xxxencoding options, and Python users are confused by them.

Considering simplicity, ease of learning, and backward compatibility, I concluded that UTF-8 mode is the best option.

I doubt that Microsoft can change the ANSI code page to 65001, at least in the next 5 years.
Many legacy applications embed text in a legacy encoding and cannot run under Windows with a different ANSI code page.
Microsoft added a “process code page” last year. It seems that Microsoft recommends using UTF-8 without waiting for the system code page to change.

Anyway, the console code page and the ANSI code page are different from the encoding used for text files and pipes.
I hope Microsoft will introduce a new way to use UTF-8 that is better than the current “chcp 65001”.
Then I will change this PEP to follow it.


Currently, this PEP enables UTF-8 mode “by default”. But I just want to enable the UTF-8 mode “automatically” based on some nice knob.

Until we find that knob, I want many Windows users to try UTF-8 mode and share feedback.

Thanks for the detailed information.

What I want to say here is that more and more developers use UTF-8 nowadays.
I can find “chcp 65001” recommended in many places (VSCode, Windows Terminal, etc.).

conda also uses chcp 65001, PYTHONIOENCODING=utf-8, and PYTHONUTF8=1 when executing a Python subprocess (ref), to avoid trouble caused by legacy encodings.

Since Python 3.x itself uses the console wide-character API for everything except os.read and os.write, limiting input to 7-bit ASCII isn’t a serious problem for Python itself. But in this case all legacy applications such as Python 2 or PHP will read from the console via ReadFile and will be limited to 7-bit ASCII input. I’d call that more trouble than it’s worth if you’re entering non-English text. (It may be different for East-Asian DBCS locales and using IME input. It might work better, but it still may not work across the full range of the BMP.)

For anyone unfamiliar with this, the new Windows Terminal app spawns a custom build of ConHost named OpenConsole, which is only functioning as an API / console session backend server, i.e. Clients <=> ConDrv <=> OpenConsole <=> Windows Terminal. AFAIK, eventually changes in OpenConsole will be pushed back into the OS ConHost – such as the new command-line options (e.g. --headless, --width, --height, --signal, --server). The “open” version is the one anyone can build from the GitHub repo.

As I said, I find issues with multiple command line utilities I use, so I can’t follow that advice. I have no way of knowing how many people will be in that situation. But more importantly, it’ll be a while yet before we can assume that the user’s default encoding is UTF-8.

The subprocess module assumes the default encoding and a lot of code in my experience just uses that default, because it’s basically impractical to get anything more accurate. Switching Python’s default encoding (and hence that assumption) to not match what the OS sets as the default codepage will result in mojibake or errors when reading subprocess output.
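To make the failure modes concrete, here is a small sketch of what happens when the bytes a child process wrote and the encoding the parent assumes disagree. cp1252 is used only as a stand-in for “the OS code page”; the actual code page depends on the system:

```python
# "café", written with an escape so the example does not depend on the
# encoding of this source file.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")      # what a UTF-8 tool would emit
cp1252_bytes = text.encode("cp1252")   # what a code-page tool would emit

# UTF-8 output decoded with the code page: no error, but mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Code-page output decoded as UTF-8: the lone 0xE9 byte is invalid UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(mojibake), decode_failed)
```

Either direction of mismatch is a problem: one silently corrupts text, the other raises.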

The big problem here is that it can be hard to get good reports of issues. I spent a lot of time on various projects hunting down encoding issues after the Python 2 -> 3 transition, and they trickled in over a period of years. They definitely didn’t all get spotted quickly. While this change won’t be anything like as problematic as that one, I know of specific cases where we tackled the question of subprocess output and basically had to take the position that we assume that the encoding the OS has set is the one subprocesses use. (Invoke was one such project that I recall, and I’m pretty sure we hit this on pip too).

But I’m starting to repeat myself now. The fundamental question is whether we think it’s safe for Python to assume that subprocess output will be in UTF-8 when the OS default encoding is something else. I don’t think that’s a safe assumption for us to make - but I can’t easily prove that by trying out the proposed change as personally I live in a mostly-ASCII world.

Apart from subprocesses, I’m not aware of any particular reason for concern in other areas.

One further data point to consider: Rust apparently uses lossy conversion from UTF-8 if you request a string version of subprocess output without handling the conversion yourself. Does anyone have any data on how well that works in practice? Do Rust programs use stdout_str successfully, or do they normally work at the bytes level?
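For comparison, a lossy conversion of this kind can be approximated in Python by capturing bytes and decoding with errors="replace". This is only a sketch of the general technique, not a description of how the Rust crate is implemented; run_lossy is a hypothetical helper name:

```python
import subprocess


def run_lossy(args):
    """Run a command and decode its stdout as UTF-8, replacing invalid bytes."""
    completed = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    # Invalid sequences become U+FFFD instead of raising UnicodeDecodeError.
    return completed.stdout.decode("utf-8", errors="replace")
```

The trade-off is that non-UTF-8 output never raises, but the replaced characters cannot be recovered afterwards.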

I am still keeping this idea in mind because:

  • Code page 65001 can be considered a sign that the user wants to use UTF-8.
  • Some commands, including dir, will output UTF-8 in this case.

The cons are:

  • When executing Python on Windows from WSL, users might be surprised that the legacy encoding is still used. I hope this is fixed in a future Windows.
  • pythonw still uses the legacy encoding.

While this is not ideal, it seems safer than enabling UTF-8 mode by default regardless of the code page.
What do you think about this idea?

There is no safe default encoding. Some commands follow the console code page. Some commands use the legacy encoding regardless of the console code page. And some commands always use UTF-8.

But when the console CP is 65001, UTF-8 will be safer than the ANSI code page, because many basic commands in the cmd.exe shell (e.g. dir, echo, etc.) use UTF-8.

PowerShell behaves differently, but the subprocess module uses cmd.exe as the default shell.

You mean when CMD is writing output from its internal commands to a non-console file such as a pipe or disk file, without specifying the /U option that makes it write UTF-16. Encodings that use the console input and output codepage could be of limited use. Some applications always use OEM or ANSI for I/O, and we have those covered, but some use either the console input or output codepage.

However, if a program is run as a DETACHED_PROCESS, GetConsoleCP and GetConsoleOutputCP both return 0 (i.e. CP_ACP) because there is no console. Also, if it’s run with CREATE_NEW_CONSOLE or CREATE_NO_WINDOW, the new console will use whatever the current default or per-window-title codepage is. In these cases, the input or output codepage of our own console is irrelevant.


In Windows 10, applications have the option of setting the per-process ANSI/OEM codepages to UTF-8 using the manifest “activeCodePage” setting. In this case, the CRT also defaults to UTF-8 for setlocale(category, ""), as opposed to its normal use of the ANSI codepage from the user (not system) locale.

If anyone is thinking about implementing this for a distribution of Python, I would definitely think twice about using the preferred encoding (UTF-8 now) as the default for the subprocess module. UTF-8 isn’t common enough to justify making it the default for subprocess. Maybe create a new “system_ansi” encoding for this case. The system-locale ANSI codepage can always be queried via GetLocaleInfoEx(LOCALE_NAME_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...), whereas GetACP() is overridden by the “activeCodePage” setting.

You are right and that’s why I chose “Enable UTF-8 mode by default” in this PEP. Console code page is fragile. It is not a perfect approach.

But it is still a nice signal for detecting that the user wants to use UTF-8. “chcp 65001” is a very widely used hack for using UTF-8 tools on Windows.
Only people who prefer UTF-8 to the legacy encoding use “chcp 65001”, I suppose.

If Microsoft introduces a better flag, I will update this PEP to use it. I’m waiting to see what Microsoft introduces next.

I thought about it but I prefer UTF-8 mode because we can use “mbcs” for legacy system encoding.

How about changing “mbcs” from GetACP to the system ANSI codepage?
Since we have not used “activeCodePage” yet, this is a backward compatible change, isn’t it?

The core of the problem is that encoding is an application setting, not a system setting. Any signal from the system about what encoding should be used refers to how your application communicates with the system. (And Windows prefers UTF-16 for that, which is why the configuration setting/code page is deprecated.)

What encoding an application uses is up to the application. In Python’s case, we only get to choose the default, but then we should expect the application (script) to override it. Unfortunately, that rarely happened.

Changing the default is a massively breaking change. It’s a 4.x change, in my opinion, not a 3.y. It’ll break cache and configuration files that apps expect to survive updates, or cross versions (pip.ini, tox.ini, setup.cfg are easy examples).

To change the default, we need to start warning when people use the default. Because it’s not just an application setting, but a library and framework setting too. Every level of program has an expectation of how to read its own files, and they all need to be prepared for it to change - the user can’t just override it one day and expect everything to work. We need to be telling library developers that they need to specify an encoding, and show them how to handle forwards compatibility if they need to handle old versions of their own files.

Alternatively, we could make a more intelligent default decoder for Windows that will read UTF-8 until it fails, then assume ACP. Because that’s what we’re going to tell libraries to do anyway, so may as well make it easy on them.
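A minimal sketch of such a “UTF-8 first, then ACP” decoder, assuming the whole input is available as bytes. The function name is hypothetical, and cp1252 is only a stand-in default for the ANSI code page (on Windows, “mbcs” would be the natural choice):

```python
def decode_utf8_then_fallback(data, fallback="cp1252"):
    """Decode bytes as UTF-8; if that fails, assume the legacy code page."""
    try:
        # Strict decoding: any invalid sequence means "this is not UTF-8".
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the caller knows which legacy encoding to fall back to.
        return data.decode(fallback)
```

This works because text in most legacy code pages is rarely valid UTF-8 once it contains non-ASCII characters, but it is a heuristic: a legacy file that happens to be valid UTF-8 would be decoded differently.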

subprocess is a special case, because the encoding there is an agreement between two applications, including if they both agree to use the ACP. In that case, I’d prefer to not have any default encoding, so if you want str rather than bytes you have to specify it yourself.

That’s also what I’d like for open(), but it’s far too late to force that.

I regularly make the argument that if you don’t specify an encoding when you write text to a file, you can’t possibly read it back. If we change the default, everyone is going to learn that very quickly and painfully.

-1 on this PEP.

I can’t speak for others, but I usually use only std::process::Command (subprocess is a third-party crate). Output is strictly bytes-only.

What do you think about enabling UTF-8 mode automatically when the console code page is 65001 (or based on some better flag, if Microsoft introduces one)?

I cannot wait for Python 4.x. More and more people, including young children, are starting to learn Python. UTF-8 must be the default for them, and setting an environment variable is a complex step for them.

On the other hand, the current default has broken these files many times, because legacy encodings cannot represent some paths and tool authors forget that the default encoding is not always UTF-8.

If the default decoder becomes that, do you think we can change the default encoding for writing files to UTF-8?

Once the default encoding is changed to UTF-8, everyone can omit the encoding without pain. They are not forced to learn what an “encoding” is until they need to handle a legacy encoding.
I want to provide that experience to people who start learning Python in the 2020s.

Not true. Consider:

  • Old files that were created before the default changed (as Steve says, persistent configuration and data files are likely the biggest risk here, rather than user-created files).
  • Older applications that haven’t been updated. For example, mingw (gcc) on Windows still uses msvcrt. Will msvcrt be updated to cleanly handle cp 65001? Will mingw be updated to use a newer CRT (given that the key issue here is licensing, and being “part of the OS”)? Will older applications be recompiled with the revised mingw?

At the point where essentially every user and every application uses UTF-8, then people can ignore encodings (for a while, at least :slightly_smiling_face:). But it’s not about defaults, it’s about actual usage.

Note that I replied to “if you don’t specify an encoding when you write text to a file, you can’t possibly read it back.” here.

People who use legacy applications (e.g. for automation) shouldn’t use UTF-8 mode.

On the other hand, for people who are new to Python, or who work in data science or web development, almost all text is already UTF-8, and almost every omitted “encoding” option is a bug. UTF-8 mode makes them happy.