PEP 597: Enable UTF-8 mode by default on Windows

One further data point to consider: Rust apparently uses lossy conversion from UTF-8 if you request a string version of subprocess output without handling the conversion yourself. Does anyone have any data on how well that works in practice? Do Rust programs use stdout_str successfully, or do they normally work at the bytes level?

I still keep this idea because:

  • The 65001 code page can be considered a sign that the user wants to use UTF-8.
  • Some commands, including dir, will output UTF-8 in this case.

Cons are:

  • When executing Python on Windows from WSL, the user might be surprised that the legacy encoding is still used. I hope this is fixed in a future version of Windows.
  • pythonw still uses the legacy encoding.

While this is not ideal, it seems safer than enabling UTF-8 mode by default regardless of the code page.
What do you think about this idea?

There is no safe default encoding. Some commands follow the console code page. Some commands use the legacy encoding regardless of the console code page. And some commands always use UTF-8.

But when the console CP is 65001, UTF-8 will be safer than the ANSI code page, because many basic commands in the cmd.exe shell (e.g. dir, echo, etc.) use UTF-8.

PowerShell has different behavior, but the subprocess module uses cmd.exe as the default shell.
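For illustration, here is a rough sketch (the helper and its use of ctypes are mine, not part of the PEP) of what picking a decoding for cmd.exe output based on the console output code page could look like:

import ctypes
import locale
import subprocess

def run_dir(path="."):
    # Hypothetical helper: use UTF-8 when the console output code page
    # is 65001, otherwise fall back to the ANSI code page. A detached
    # process gets 0 from GetConsoleOutputCP(), which also falls through
    # to the ANSI branch.
    cp = ctypes.windll.kernel32.GetConsoleOutputCP()
    enc = "utf-8" if cp == 65001 else locale.getpreferredencoding(False)
    result = subprocess.run("dir " + path, shell=True, capture_output=True)
    return result.stdout.decode(enc, errors="replace")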

You mean when CMD is writing output from its internal commands to a non-console file such as a pipe or disk file, without specifying the /U option that makes it write UTF-16. Encodings that use the console input and output codepage could be of limited use. Some applications always use OEM or ANSI for I/O, and we have those covered, but some use either the console input or output codepage.

However, if a program is run as a DETACHED_PROCESS, GetConsoleCP and GetConsoleOutputCP both return 0 (i.e. CP_ACP) because there is no console. Also, if it’s run with CREATE_NEW_CONSOLE or CREATE_NO_WINDOW, the new console will use whatever the current default or per-window-title codepage is. In these cases, the input or output codepage of our own console is irrelevant.


In Windows 10, applications have the option of setting the per-process ANSI/OEM codepages to UTF-8 using the manifest “activeCodePage” setting. In this case, the CRT also defaults to UTF-8 for setlocale(category, ""), as opposed to its normal use of the ANSI codepage from the user (not system) locale.

If anyone is thinking about implementing this for a distribution of Python, I would definitely think twice about using the preferred encoding (UTF-8 now) as the default for the subprocess module. UTF-8 isn’t common enough to justify making it the default for subprocess. Maybe create a new “system_ansi” encoding for this case. The system-locale ANSI codepage can always be queried via GetLocaleInfoEx(LOCALE_NAME_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, ...), whereas GetACP() is overridden by the “activeCodePage” setting.
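As a rough sketch of what that query looks like from Python (the helper below is hypothetical and uses ctypes; the constants are from winnls.h):

import ctypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

LOCALE_NAME_SYSTEM_DEFAULT = "!x-sys-default-locale"
LOCALE_IDEFAULTANSICODEPAGE = 0x1004
LOCALE_RETURN_NUMBER = 0x20000000

def system_ansi_codepage():
    # Query the system-locale ANSI code page directly. Unlike GetACP(),
    # this is not affected by a per-process "activeCodePage" override.
    cp = ctypes.c_uint(0)
    chars = ctypes.sizeof(cp) // ctypes.sizeof(ctypes.c_wchar)
    ok = kernel32.GetLocaleInfoEx(
        LOCALE_NAME_SYSTEM_DEFAULT,
        LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
        ctypes.byref(cp),
        chars,
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return cp.value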

You are right, and that’s why I chose “Enable UTF-8 mode by default” in this PEP. The console code page is fragile. It is not a perfect approach.

But it is still a nice signal for detecting that the user wants to use UTF-8. “chcp 65001” is a very widely used hack for using UTF-8 tools on Windows.
Only people who prefer UTF-8 to the legacy encoding use “chcp 65001”, I suppose.

If Microsoft introduces some better flag, I will update this PEP to use it. I’m waiting to see what Microsoft introduces next.

I thought about it, but I prefer UTF-8 mode because we can use “mbcs” for the legacy system encoding.

How about changing “mbcs” from GetACP to the system ANSI codepage?
Since we have not used “activeCodePage” yet, this is a backward compatible change, isn’t it?

The core of the problem is that encoding is an application setting, not a system setting. Any signal from the system about what encoding should be used refers to how your application communicates with the system. (And Windows prefers UTF-16 for that, which is why the configuration setting/code page is deprecated.)

What encoding an application uses is up to the application. In Python’s case, we only get to choose the default, but then we should expect the application (script) to override it. Unfortunately, that rarely happened.

Changing the default is a massively breaking change. It’s a 4.x change, in my opinion, not a 3.y. It’ll break cache and configuration files that apps expect to survive updates, or cross versions (pip.ini, tox.ini, setup.cfg are easy examples).

To change the default, we need to start warning when people use the default. Because it’s not just an application setting, but a library and framework setting too. Every level of program has an expectation of how to read its own files, and they all need to be prepared for it to change - the user can’t just override it one day and expect everything to work. We need to be telling library developers that they need to specify an encoding, and show them how to handle forwards compatibility if they need to handle old versions of their own files.

Alternatively, we could make a more intelligent default decoder for Windows that will read UTF-8 until it fails, then assume ACP. Because that’s what we’re going to tell libraries to do anyway, so may as well make it easy on them.
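A crude sketch of that fallback (the helper name is made up, and a real implementation would live in the io/codecs machinery rather than re-reading the whole file):

def read_text_utf8_or_acp(path):
    # Try UTF-8 first; only if decoding fails, assume the process ANSI
    # code page ("mbcs" on Windows).
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("mbcs")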

subprocess is a special case, because the encoding there is an agreement between two applications, including if they both agree to use the ACP. In that case, I’d prefer to not have any default encoding, so if you want str rather than bytes you have to specify it yourself.

That’s also what I’d like for open(), but it’s far too late to force that.

I regularly make the argument that if you don’t specify an encoding when you write text to a file, you can’t possibly read it back. If we change the default, everyone is going to learn that very quickly and painfully.
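A minimal demonstration, assuming a machine whose ANSI code page is cp1252:

with open("note.txt", "w") as f:   # encoding defaults to cp1252 on this machine
    f.write("café")

with open("note.txt", encoding="utf-8") as f:
    f.read()                       # raises UnicodeDecodeError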

-1 on this PEP.

I can’t speak for others, but I usually use only std::process::Command (subprocess is a third-party crate). Output is strictly bytes-only.

What do you think about enabling UTF-8 mode automatically when the console code page is 65001 (or based on some better flag, if Microsoft introduces one)?

I can not wait for Python 4.x. More and more people, including children in kindergarten, are starting to learn Python. UTF-8 must be the default for them. And setting an environment variable is a complex step for them.

On the other hand, the current default has broken these files many times, because the legacy encoding can not represent some paths and tool authors forget that the default encoding is not always UTF-8.

If the default decoding becomes that, do you think we can change the default encoding for writing files to UTF-8?

After the default encoding is changed to UTF-8, everyone can omit encoding without pain. They are not forced to learn what an “encoding” is until they are required to handle a legacy encoding.
I want to provide that experience to people who start learning Python in the 2020s.

Not true. Consider:

  • Old files that were created before the default changed (as Steve says, persistent configuration and data files are likely the biggest risk here, rather than user-created files).
  • Older applications that haven’t been updated. For example, mingw (gcc) on Windows still uses msvcrt. Will msvcrt be updated to cleanly handle cp 65001? Will mingw be updated to use a newer CRT (given that the key issue here is licensing, and being “part of the OS”)? Will older applications be recompiled with the revised mingw?

At the point where essentially every user and every application uses UTF-8, then people can ignore encodings (for a while, at least :slightly_smiling_face:). But it’s not about defaults, it’s about actual usage.


Note that I was replying to “if you don’t specify an encoding when you write text to a file, you can’t possibly read it back” here.

People using legacy applications (e.g. for automation) shouldn’t use UTF-8 mode.

On the other hand, for people who are new to Python, data science, or web development, almost all text is UTF-8 already, and almost every omitted “encoding” option is a bug. UTF-8 mode makes them happy.

In my experience, people new to Python are generally fine with the current defaults. (Although I admit I’m in a country which is mostly-ASCII, so my experience is privileged in this regard).

Anyhow, I’ll let others comment here. I’m -1 on this proposal, I’ll leave it at that.

In my experience from a not-only-ASCII country, beginners are not at all fine with Python’s defaults on Windows.
Since code editors use UTF-8 and Python I/O defaults to something else*, they often can’t correctly read in a text file they just saved – unless they’re careful about encodings.

* (That’s not necessarily some common “national codepage”. Many people set computers to English because it has the best localization, and I often meet multilingual people who moved in from neighboring countries.)

In my experience, “always use encoding='utf-8' when opening text files” is the best advice I can give to beginners. When they want to share their files with non-Windows systems or differently-configured Windows, having everything in utf-8 solves a lot of problems (compared to trying to properly keep track of and convert encodings of every piece of text). And exceptions are fine: when they meet a non-UTF8 file (most commonly when interacting with government data/APIs, required by some crusty old standard to use the national encoding), they’re in a good place to learn about encodings. Arguably, teaching encodings separately from file I/O basics is better pedagogy.

IMO, viewing non-UTF8 applications (like mingw or old config files) as exceptions that need to be treated specially is not much different from the status quo, where basically anything is an exception to be treated specially – as soon as you share files. Specifically, if your tox.ini or setup.cfg could be shared across systems, relying on a machine-local encoding is already a bug (unless you rely on a smart sharing tool).
But we can start treating these as exceptions now, rather than wait until everything uses UTF-8.

Whereas now, people learn about the problem very slowly and – eventually – painfully.

If the default changes to a single specific encoding, then everyone will magically start writing their files with a known encoding.
Consider a person who designs a system with a configuration file, but doesn’t know or care about encodings at first. When someone points out the issue, they could just say “Oh, it’s UTF-8! Thanks for asking, let me document that!” rather than “Oops, it’s system-dependent! I should have saved the encoding into the file, or specified it explicitly. Sorry!”

IMO it can be a deprecation-cycle change.
If not specifying an encoding is either a hard-to-detect-at-first data-loss bug, or can be fixed easily since you know the encoding, we should start issuing warnings now.
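As a sketch of the kind of warning I mean (a throwaway shim one could drop into a test suite today, not a proposed implementation):

import builtins
import warnings

_original_open = builtins.open

def _warning_open(file, *args, **kwargs):
    # Warn whenever a text-mode open() relies on the locale-dependent
    # default encoding.
    mode = args[0] if args else kwargs.get("mode", "r")
    encoding_given = len(args) >= 3 or kwargs.get("encoding") is not None
    if "b" not in mode and not encoding_given:
        warnings.warn("open() called without an explicit encoding", stacklevel=2)
    return _original_open(file, *args, **kwargs)

builtins.open = _warning_open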


I’m not too keen on UTF-8 encoding; for me it wastes disk space.

Most Chinese characters are in the BMP; one such character needs to be represented by 3 bytes in UTF-8, but with local encodings (GBK/GB18030), it only takes 2 bytes.

I tested a textbook; the file sizes were:

  • GBK: 116 KB
  • UTF-8: 172 KB

If there are many such files, the storage space is considerable.
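The per-character difference is easy to check (with an arbitrary four-character string):

>>> len("汉字编码".encode("gbk"))      # 2 bytes per character
8
>>> len("汉字编码".encode("utf-8"))    # 3 bytes per character for BMP CJK
12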

A concern is backward compatibility; maybe a lot of code will be broken.
How about adding an open_utf8() convenience function?
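Something like this, just one possible shape for the suggested convenience function (the filename is only an example):

import functools

# open() with the text encoding fixed to UTF-8; everything else unchanged.
open_utf8 = functools.partial(open, encoding="utf-8")

with open_utf8("README.md") as f:
    text = f.read()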

I agree, but the only reasonable warning is “you did not specify an encoding in this call to open”. We can’t base it on the data being decoded (this is what was so bad about str+bytes and auto-decoding); the warning has to appear regardless, on all platforms and in all test suites.

Users can’t necessarily fix the warning themselves, which makes them noise on the level of the invalid escape sequence warning. So the warning needs to reach library maintainers.

At the end of it, only unmaintained libraries will rely on the default encoding to read back their own files, as everyone else will be warned/linted into specifying an encoding. So then we can decide whether to break all the unmaintained code or not - historically, we’ve decided not to break it.

It’s certainly a multi-year effort, including marketing to make people aware that the change is coming and why, none of which is covered in the proposal. PEPs 528 and 529 could slip by (in part because of existing years of deprecation of bytes paths), but I don’t believe this one can without directly causing data loss.

Perhaps we should see whether Jupyter would launch Python kernels in UTF-8 mode? That would reach a lot of people who have to worry about encodings (though they’d probably prefer Latin-1, as they generally just want best effort to read an existing file…)

Maybe that’s the first step? If Pylint, flake8, etc. started to lint on not specifying an ‘encoding’ argument to open() that would hopefully start nudging people slowly towards better practices such that we can then flip on a deprecation warning in a few years and change the semantics in Python 4.

Python uses the active process codepage (i.e. GetACP()) as the preferred encoding. This is normally the legacy codepage of the system locale. The locale in C, on the other hand, uses the user locale. Recently, the CRT has staked out a middle-ground position. If the process active codepage is UTF-8, setlocale(category, "") uses UTF-8 instead of the legacy codepage from the user locale. For example:

C:\>python -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.1252

C:\>python.utf8 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.utf8

Switching the whole process to UTF-8 (not just Python code) during testing may help to highlight code that makes fragile assumptions. Code that assumes that a given default encoding (e.g. process, locale, preferred encoding) is the legacy codepage from the system or user locale will hopefully fail hard on an encoding error and get fixed. However, to be fair, in some cases it could also mask programming errors. This includes extension modules and linked DLLs, which may call MBS ‘ANSI’ functions such as CreateFileA. Python core is guilty of this mistake. _winapi.CreateFile calls CreateFileA with a UTF-8 string. :frowning_face: If the active codepage is a legacy encoding, Windows will happily ‘decode’ the UTF-8 string as mojibake, but this at least can potentially get flagged as an error. If the active codepage is UTF-8, however, this bad code won’t raise any red flags because it just happens to be using the right encoding.

Suggestion

We could replace “python[w].exe” in an installation with a launcher that runs the base executable. This generalizes how venv virtual environments are currently implemented. The base executable would be distributed as both UTF-8 and locale variants, maybe named “python[w].utf8.exe” and “python[w].locale.exe”. The UTF-8 version would set the “activeCodePage” to UTF-8 in the embedded manifest. The launcher would detect UTF-8 mode when it’s enabled by “-X utf8”, PYTHONUTF8, or a new “UTF8” registry value (not set by default). The PYTHONUTF8 environment variable would take precedence over the “UTF8” registry value. The system “py” launcher would also be updated to support this scheme and thus bypass the middle-man launcher.

Also, at startup in the base process, if GetACP() returns CP_UTF8 (65001) and UTF-8 mode isn’t explicitly disabled by the environment/registry, then automatically enable it. Thus UTF-8 mode is automatically set, if it’s not explicitly disabled, when either “python[w].utf8.exe” is run directly or when the system locale is configured to use UTF-8. This supports any code that special cases UTF-8 mode, such as setting the preferred encoding value to “utf-8” instead of “cp65001”.
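In Python terms, the intended rule is roughly as follows (a sketch of the decision only; the real check would happen in C during startup, and this ignores the “-X utf8” option and the registry value for brevity):

import ctypes
import os

CP_UTF8 = 65001

def utf8_mode_enabled():
    # Explicit settings win; otherwise enable UTF-8 mode automatically
    # when the process active code page is already UTF-8.
    env = os.environ.get("PYTHONUTF8")
    if env == "0":
        return False
    if env == "1":
        return True
    return ctypes.windll.kernel32.GetACP() == CP_UTF8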

New “system_ansi”, “system_oem”, “user_ansi”, and “user_oem” encodings could be added that use the system locale and user locale legacy codepages, which can be queried directly via GetLocaleInfoEx. “mbcs” (alias “ansi”, “dbcs”) and “oem” would continue to use the process active codepages from GetACP() and GetOEMCP().

If UTF-8 mode eventually becomes the default (maybe after Windows 8.1 support ends on 2023-01-10 – Python 3.12?), then the only required change would be to set the registry “UTF8” value to true for a particular installation.

So if I’m understanding it, the proposal is now roughly to have UTF8 mode enabled when the user has cp65001 set. Is that correct?

If so, then how is telling people “set cp65001” any different from telling people “set PYTHONUTF8=1”? Both involve making an environment configuration change, and both would have exactly the same impact on Python. Where they differ is in how they affect other aspects of the system, and it’s not up to us to tell users how to configure the non-Python parts of their system.

By the way - one issue here is that the proposal is about enabling UTF-8 mode, but it doesn’t really explain what problems this would solve. I’ve no idea whether any of the encoding issues I’ve encountered over the years would have been fixed by UTF-8 mode, for example. It’s hard to discuss the question without a proper feel for what it will achieve in practice.


Is this in reference to Inada’s suggestion to “enable UTF-8 mode automatically when console code page is 65001”? A few console applications, including CMD, use the console input or output codepage even for non-console files. If they’re run without a console (i.e. as a detached process), they usually just default to the process active codepage because GetConsoleCP() and GetConsoleOutputCP() return 0 (i.e. CP_ACP).

It’s a weird decision. The encoding of a particular file used by a process shouldn’t determine the preferred I/O encoding of the whole process. I’d reluctantly accept this if Microsoft’s own programs consistently did it, and if it was documented as such, but they don’t, and it’s not documented as such.

Many console applications use the legacy OEM codepage of a process for non-console I/O, including:

  • icacls.exe
  • attrib.exe
  • tree.com
  • fc.exe
  • net.exe

When their output is redirected to a pipe or disk file, non-OEM characters in strings (typically filepaths) get corrupted by best-fit and default character ("?") translation.
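A rough illustration using Python’s codecs (errors="replace" only models the default-character part; Windows’ own conversion also applies best-fit mappings, which Python does not emulate):

>>> "résumé_файл.txt".encode("cp437", errors="replace").decode("cp437")
'résumé_????.txt'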

Otherwise, most applications, especially GUI applications, use the process active codepage, assuming they don’t use the wide-character API with UTF-16. This sets a common codepage for non-Unicode text on a multi-user system. If a user’s preferred language isn’t compatible with the active codepage, then a program should at least warn the user and provide a chance to switch to UTF-8 or UTF-16. Even Notepad gets this right.


In Windows 10, both the active codepage and active OEM codepage can be forced to UTF-8. This affects the entire process, including extension modules and DLLs. It’s useful in testing, such as looking for cases where the encoding of files isn’t explicit, or cases where multibyte-string API calls use the wrong encoding of arguments. The MBS API is far less forgiving when UTF-8 is the active codepage. It still doesn’t use a strict decoding of arguments, but invalid sequences get mapped to the replacement character (U+FFFD), so there’s no chance that a string (commonly a filepath) with the wrong encoding will silently round-trip to and from an incorrect Unicode string, as is often the case with legacy codepages.

That said, my suggestion in a previous reply to force the active codepage when UTF-8 mode is enabled was looking at the wrong setting. Forcing the active codepage to UTF-8 is much closer to Unix PYTHONCOERCECLOCALE. Except it’s coercing the system and user locales away from their legacy ANSI codepage instead of coercing the POSIX and C locales away from ASCII. Also, there’s no support for an LC_CTYPE environment variable in Windows to coerce the locale of child processes.

Yes. I want an easy way to enable UTF-8 mode for environments which are popular/recommended for new Python users. Jupyter is executed from the command line, VSCode, and PyCharm.

My idea is adding a “set the PYTHONUTF8=1 environment variable” option to the installer. But the installer is not used when Python is installed from conda or the Windows Store.

Latin-1 can read UTF-8, Shift-JIS, etc. without any error. It makes Python 3 more dangerous than Python 2.
I rarely use latin-1 when I want byte-transparent behavior, but it must be considered a hack. It shouldn’t be recommended, especially for new users.

Maybe it is too early to discuss this.

Currently, chcp 65001 (or [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding in PowerShell Core) is a hack for users who want to use tools like TeX, node.js, GHC, etc…

I expect many terminals will have a “UTF-8 session” by default in the near future. For example, see this issue on Windows Terminal.

So I want to discuss: is there something we should fix before enabling UTF-8 mode automatically, or is UTF-8 mode already ready for the “UTF-8 console on Windows” era?

Much sample code in books, and much production code, omits the encoding option even though it expects UTF-8. For example, when reading JSON, Markdown, reST, and CSV files downloaded from the internet.

I wrote an example of this in the PEP. Even packaging.python.org and the json module omit the encoding option when reading README.md and JSON files.
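For example, the common pattern looks like this (data.json standing in for any UTF-8 file downloaded from the internet), compared with the explicit spelling that UTF-8 mode would make unnecessary:

import json

# Relies on the locale default encoding: fine where that is UTF-8,
# broken on Windows when the file contains non-ASCII text.
with open("data.json") as f:
    data = json.load(f)

# The explicit version:
with open("data.json", encoding="utf-8") as f:
    data = json.load(f)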