PEP 597: Enable UTF-8 mode by default on Windows

This is the second time the same discussion about changing the default encoding has come up. There are basically two groups: supporters of the status quo, who are fine with the currently used encoding, and supporters of UTF-8 everywhere.

The problem with Python 3 was that there was no easy way to switch from Python 2 to Python 3, and that the migration was mandatory. Maybe we need to find a way to support “Python 2” (current encoding) and “Python 3” (UTF-8) in the same Python. For example, if I trust my environment and understand what I’m doing, how can I easily enable UTF-8 mode from my Python script?

There is no sys.set_utf8_mode(True) for technical reasons: the encoding cannot be changed at runtime. Maybe we need a helper function somewhere to opt in to UTF-8 Mode in an application? The function would re-execute Python with UTF-8 Mode enabled, but only for the current process (no effect on child processes: it doesn’t set the PYTHONUTF8 env var).

import sys, os, subprocess

def enforce_utf8_mode(enabled=True):
    # This function must be carefully designed to prevent a fork bomb:
    # only re-execute when the current mode differs from the requested one.
    if sys.flags.utf8_mode == enabled:
        return
    opt = 'utf8' if enabled else 'utf8=0'
    argv = [sys.executable, '-X', opt]
    # Note: _args_from_interpreter_flags() is a private helper.
    argv.extend(subprocess._args_from_interpreter_flags())
    argv.extend(sys.argv)

    os.execv(argv[0], argv)
    # should not return

enforce_utf8_mode()

print(sys.flags.utf8_mode)

enforce_utf8_mode(False) ensures that the code runs with UTF-8 Mode disabled.

Maybe we should somehow ensure that an application doesn’t call the function twice. This function should not be used by Python modules (“import …” should not execute it!).

Python modules using open() without specifying an encoding would use the ANSI code page by default, but UTF-8 once enforce_utf8_mode() has been called.
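To make that concrete, here’s a small check (a sketch; the values printed depend on your platform, locale, and flags) of which default open() would pick up:

```python
import locale
import sys

# open() without an encoding= argument uses locale.getpreferredencoding(False):
# the ANSI code page on Windows, unless UTF-8 mode forces 'utf-8'.
print("utf8_mode:", sys.flags.utf8_mode)
print("default encoding for open():", locale.getpreferredencoding(False))
```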

Does it sound like a bad idea?


Notepad has already changed its default encoding to UTF-8. And VS Code, the most popular editor, defaults to UTF-8 too.
So people learning Python save text from their editor and then, by default, can’t read it back from Python, unless they write only ASCII.

This PEP is trying to fix that use case! I wrote it in the PEP already.

I planned to update this PEP after several months (after Windows Terminal 1.0 is released and people start using a UTF-8 console). But I will update this topic next week.

In short, Windows is in a different situation from Unix. It is difficult to change the system encoding because it would break many legacy applications. Microsoft recommends ignoring the system encoding and using UTF-8 and the wide-character (W) APIs.

This isn’t a 2 vs 3 problem, it’s “what are other applications on my platform trying to speak to me with”. And it’s a big historical problem, in that we could have chosen UTF-8 years ago for Python but didn’t, and now we can’t change it without hurting users.

And developers can’t change it for the whole Python process without hurting themselves or other libraries. The best they can do is not use the default encoding, which they can already do.
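For illustration (not part of any proposal), opting out of the default is a one-argument change:

```python
import os
import tempfile

# Passing encoding= explicitly makes the file's encoding independent of
# the ANSI code page, the locale, and UTF-8 mode.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("snowman: \N{SNOWMAN}\n")
with open(path, encoding="utf-8") as f:
    print(f.read())
```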

Only in the context of a new application.

This isn’t anything to do with what Microsoft recommends, it’s totally about whether we (Python) are prepared to break our own users in order to change it.

Let me give you a concrete example of evidence you could go find: if the vast majority of sdists will “pip install” identically under UTF-8 mode or ACP, then you can argue we wouldn’t be breaking too many people. (I’m not saying it’s enough evidence, but it’s the sort of thing you are obligated to demonstrate as part of your PEP.)

Can we have sys.enable_utf8_mode() only on Windows?

Since UTF-8 mode doesn’t affect the command line and environment variables on Windows, it is far safer there than on Unix. sys.enable_utf8_mode() would set the utf8_mode flag and call sys.std*.reconfigure(encoding="utf-8").

Since Python uses UTF-8 in almost all environments except Windows, this is enough to write an application that expects the default text encoding to be UTF-8.

import sys

if hasattr(sys, "enable_utf8_mode"):
    sys.enable_utf8_mode()

Didn’t we discuss enabling UTF-8 mode only in a UTF-8 console? I think that wouldn’t hurt users.
I haven’t updated the PEP because a UTF-8 console still requires hacks now, but I expect it to arrive this year, before Python 3.10.

Windows Terminal v1.0 is scheduled for 2020H1 and will ship with solid UTF-8 support. (see terminal/doc/terminal-v1-roadmap.md at main · microsoft/terminal · GitHub)
And it is proposed that “we should introduce a flag that starts up the pseudoconsole host in codepage 65001” (see Terminal should force pseudoconsole host into UTF-8 codepage by default · Issue #1802 · microsoft/terminal · GitHub)

OK, I will try it.

FYI, conda does this already. It sets “PYTHONIOENCODING=utf-8” and “PYTHONUTF8=1” when running subprocesses, including pip, to avoid trouble caused by the legacy encoding.

This is one piece of evidence that Windows users are hurt by the default legacy encoding.


That’s a great start! I wonder if @msarahan has anything he can add about conda’s experience here?

Make sure your evidence ends up in the PEP text, as this thread is already too long for anyone to find it again.

I would not support a function which does not re-execute the process. There is a risk that os.fsencode/os.fsdecode or the underlying C functions have been called before. The only safe option is to enable UTF-8 mode at Python startup: with PyConfig (PEP 587), not at runtime in Python.

I know that in practice, it’s ok for most use cases on Windows. But it’s not safe in the general case.


Unless PYTHONLEGACYWINDOWSFSENCODING is used, the FS encoding is UTF-8 anyway.

Additionally, there is sys._enablelegacywindowsfsencoding() already. Adding sys._enablewindowsutf8mode() doesn’t increase the risk much.

I got the top-4000 downloads list from https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json (timestamp “2020-01-17 15:31:43”).

Then I found the packages with a non-ASCII description using the PyPI JSON API. There are 489/4000 such packages.

Then I installed the 489 packages into a venv on Linux with the C locale (LC_CTYPE=C, PYTHONCOERCECLOCALE=0, PYTHONUTF8=0). (script)

Then, 82 packages failed to install with UnicodeDecodeError (list). Some more packages failed due to other causes, like missing dependencies.

Of course, this number doesn’t mean Windows users cannot install these 82 packages, because some packages provide wheels. But this number shows:

  • How often non-ASCII characters are used in texts like README. (489/4000)
  • How often experienced Python developers miss the encoding="utf-8" option. (at least 82/489)
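For context, the failure mode being counted is usually a setup.py that reads a UTF-8 README with the locale default. A minimal simulation of what happens under an ASCII locale (the README text here is made up):

```python
# A UTF-8 README with non-ASCII text, as it would exist as bytes on disk.
readme = "Descripción del paquete\n".encode("utf-8")

# Under LC_CTYPE=C with UTF-8 mode off, open(path).read() effectively
# decodes with an ASCII codec, so the install step blows up:
try:
    readme.decode("ascii")
    print("decoded fine")
except UnicodeDecodeError as exc:
    print("install fails with:", exc)
```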

Thanks for doing this, it’s very useful information. Did you confirm those README files were UTF-8 (as opposed to some other encoding) or is that just an assumption? I think it’s fine to assume that, I’m just interested, and I think it’s something that should be made explicit. I assume you’ll add at least a summary of these findings to the PEP.

Also, can we please clarify what the actual proposal is now for when UTF-8 mode gets set? This experiment is just as good an argument for forcing UTF-8 mode on Unix systems where the encoding is set to something like Latin-1 or Shift_JIS. So given that we’re not doing that on Unix, when precisely are we proposing to fix this issue on Windows? And what’s the justification for that choice? PEP 540 explains the reasons for when and how UTF-8 mode is forced on Unix. I think PEP 597 should explain precisely when UTF-8 mode is forced on, and why that choice was made, in just the same way for Windows. It’s actually a bigger deal on Windows because, unlike Unix, almost no Windows systems default to UTF-8 (yes, yes, I know, it’s not a global setting, “default” doesn’t make sense, etc. All the more reason to be precise about what conditions the PEP is proposing to check!)


Changing the active code page to UTF-8 doesn’t have to be configured at the system level, which at this time is generally a bad choice and requires administrator access and rebooting the system. It can be coerced to UTF-8 at the process level by setting it in the application manifest. This could be implemented by always using the “python.exe” launcher (as used in virtual environments) to select the base executable depending on whether coercion is enabled.
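For reference, the per-application setting being described looks roughly like the fragment below. This is a sketch based on Microsoft’s documented activeCodePage manifest element (supported since Windows 10 1903); treat the exact element placement and namespace as something to verify against the docs.

```xml
<!-- Sketch: application manifest fragment forcing the process
     active code page to UTF-8 at startup. -->
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```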

It should be highlighted that neither locale coercion nor UTF-8 mode is automatically enabled in Unix if a locale other than “C” or “POSIX” is configured, even if the configured locale uses ASCII. Here’s an example in Linux with an “en_CA.ascii” locale:

$ LC_CTYPE=en_CA.ascii ./python -q
>>> import sys; sys.version_info[:3]
(3, 9, 0)
>>> from locale import *             
>>> setlocale(LC_CTYPE)
'en_CA.ascii'
>>> getpreferredencoding()
'ANSI_X3.4-1968'

So enabling UTF-8 by default in Windows can’t be justified solely on the basis of avoiding the use of a legacy encoding as the preferred encoding. Other reasons are required to justify special casing Windows, else ISTM the proposed new behavior needs to be extended to all platforms.

That’s quite a weird setting, so we can assume the user knows what they’re doing. The situation would be different if *.ascii was a common system default.

I don’t understand how this is relevant. The codepage can be set in a lot of ways, what I’m trying to say is that we should focus more on how Python deals with things when the user doesn’t make an effort to change settings. We can assume that if the user knows enough to configure a codepage, they know enough to deal with encodings in general.

Also, I don’t see how application manifests would be relevant here. I’m a reasonably knowledgeable coder, and I wouldn’t even consider trying to change the application manifest for a program I had (for example, GNU grep installed via Scoop) just to get it to talk UTF-8.[1]

That is precisely the point I was trying to make, thanks.

[1] Also, I said before that I think programs compiled with mingw that use msvcrt seem to get in a mess when run in an environment set to use cp 65001. So I’d pretty definitely not want to force subprocesses to use that codepage, if that’s true.

I used ASCII as an extreme example, but it could be any encoding, such as ISO-8859-1 or Shift_JIS. The point is that the Windows system locale (the default source of the active code page) using a legacy codepage instead of UTF-8 should not, by itself, be the deciding factor for automatically enabling UTF-8 mode. That would be inconsistent with Unix, so we need good reasons to justify it.

My point is that UTF-8 is the de-facto default on *nix. If you’re not using UTF-8, you either changed the encoding (so you probably know what you’re doing, and Python respects the decision), or you’re using an old or specialist system (so you probably also know what you’re doing).
The most common case of non-UTF-8-ness on *nix is the POSIX default, the C locale with its ASCII charset. And Python does coerce that to UTF-8! (Granted, that was a much easier decision than here, because UTF-8 doesn’t break code that’s actually correctly ASCII-only.)
That’s the reason Python doesn’t coerce other ASCII locales: en_CA.ascii is something the user set explicitly, so we respect that decision.

On *nix, if you don’t care about encodings, you get UTF-8. That’s what’s inconsistent with Windows.


No, it can only be changed in exactly two ways. An administrator can change the system encoding to UTF-8, but this will likely break many applications. Or an individual application can set its “activeCodePage” to UTF-8 in its manifest.

I brought up locale coercion because it’s a significant component of the startup behavior in Unix, which tries to coerce to a UTF-8 locale if “C” or “POSIX” is configured, and it only falls back on UTF-8 mode if coercing fails or is disabled. To do something similar in Windows, we would need to set the “activeCodePage” to UTF-8 in an alternate base executable and use a “python.exe” launcher to execute the required executable depending on whether coercion is enabled (e.g. “python_utf8.exe” vs “python_locale.exe”). This affects the entire process. It sets the Windows multibyte API to use UTF-8, and the CRT will also set its default locale to use UTF-8 in this case. It’s the closest we can get to the effect of locale coercion in Unix, except since there’s no environment-variable support, we have no consistent, reliable way to influence child processes.

Maybe you’re confusing the console’s input and output codepages with the active codepage of a process (CP_ACP). Once set at startup, the active codepage is locked in for the remainder of the process lifetime (barring low-level hacking of the PEB and private data structures in ntdll). It’s the encoding used by the Windows multibyte-string (i.e. ANSI) API, and many programs use it as a default encoding for files and other I/O, but it is not related to console I/O.

The console is in its own process (conhost.exe) and has an unrelated system for its multi-byte string API (i.e. ReadConsoleA – despite the “A” suffix – is not necessarily using our process active codepage, CP_ACP). Console files (i.e. files opened on “\Device\ConDrv” that are used for I/O, such as “Input”, “Output”, “CurrentIn”, and “CurrentOut”) are internally UTF-16 buffers in the console host process (well UCS-2, but they should be UTF-16), but for convenience of programs that use multibyte strings, the console host has an input (reading) and an output (writing) codepage for use with multibyte-string API functions such as generic ReadFile and WriteFile and console-specific ReadConsoleA and WriteConsoleA. Since 3.6, Python core no longer uses these console codepages because we do our own transcoding between UTF-8 and UTF-16 and use the console’s wide-character API.

Thank you. Yes, I was. Your explanation was very helpful.

Having said that, it’s helped me to understand a lot better the technical details of what the PEP is proposing to do, but it’s made me even less sure of what the practical impact will be :slight_smile: Luckily, I understand well enough now to try some actual experiments.

One thing that bothers me is the following interaction:

>py -Xutf8
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = ["a£", "b€", "c\N{SNOWMAN}" ]
>>> from subprocess import *
>>> p = Popen(["grep", "\N{SNOWMAN}"], stdin=PIPE, stdout=PIPE, text=True)
>>> o, e = p.communicate("\n".join(s))
>>> print(o)

>>>

Note that the output is empty - which is clearly wrong. Of course, without -Xutf8, the program fails with an encoding error, because "\N{SNOWMAN}" is not a valid character in cp1252 (my default codepage). But is an encoding error worse than silently giving the wrong answer?

I assume the issue here is somehow a result of grep not handling UTF-8 correctly. But that’s precisely my point, if Python’s defaults are changed in such a way that they can cause silent errors in code that runs subprocesses, then we are not helping our users (even if it’s technically the subprocess that’s wrong).
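One defensive pattern (a sketch; it uses a byte-echoing child Python as a stand-in for grep) is to pin the pipe encoding explicitly instead of relying on text=True and the process defaults:

```python
import subprocess
import sys

# A child process that echoes its stdin bytes back unchanged.
child = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]

# encoding="utf-8" fixes the codec for the pipes themselves, so the
# parent's behavior no longer depends on -X utf8 or the code page
# (though it can't fix how the child interprets the bytes it receives).
p = subprocess.Popen(child, stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, encoding="utf-8")
out, _ = p.communicate("a£\nb€\nc\N{SNOWMAN}")
print(out)
```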

I don’t know if what’s proposed here would somehow be different from running Python 3.8.1 in -Xutf8 mode. Nor do I know how serious an issue this would be in practice. Certainly most of the other tests I ran with grep “just worked” with -Xutf8. But then, they also worked without -Xutf8.

Let’s just say that I think that this specific interaction should be discussed in the PEP.

And to be clear, the grep command I used was my own build of grep using mingw - available here. Running the same test with ripgrep works perfectly. The default version of grep from Scoop has the same problem as my build, so I suspect it’s common to mingw-built software (ripgrep is written in Rust). My feeling is that there’s a certain class of users/systems that would be hit by this issue[1] but I have no real feel for how big that group would be. (Off-topic: I’d love to find more modern native builds of the GNU tools for Windows that didn’t have these issues, but I’ve been looking for a long time to no avail.)

[1] Well, at least I am in that class, and I’m arrogant enough to think I’m not the only one :wink:

I’d like to know how many users who would upgrade to a new version of Python (where this default encoding would be changed) would be surprised by the change of default encoding. I know this would be a difficult question to answer. My thoughts:

  • Those who run scripts (CI, management, etc.) will run those scripts in the same environment, and know that upgrading that environment will require dev work anyway.
  • Those starting to learn Python are likely to download Python once and stick with it until they know more. Sticking with the same version means their input and output text files will have the same encoding, as they’re created and read by the same executable (unless they download input from the internet / course instructors etc., but at that point it’s already broken, as Python on Linux won’t be able to read those files correctly now).
  • System internals and configuration, which isn’t terribly common on Windows as compared to Linux.
  • Larger applications using Python should be bundling their own Python or calling the system Python with specific configuration anyway.
  • Professional developers using Python, who I hope are at least aware of encoding, could be heavily affected by this change.

Another thing:

This is another discussion, but should Python on Windows be updating the manifests of its subprocesses to set the active code page to be the same as the running Python’s?