PEP 686: Make UTF-8 mode default

When I said “If the first argument is WindowsConsoleIO, UTF-8 is used.”, I meant it.

Technically speaking, TextIOWrapper doesn’t create WindowsConsoleIO. open() does.
So even when PYTHONLEGACYWINDOWSSTDIO is not specified, the console code page is used when TextIOWrapper(file) is used and file.fileno() is 0-2.

Anyway, such technical details don’t affect the conclusion: encoding=None is not always the same as encoding=locale.getpreferredencoding().

  • When WindowsConsoleIO is used, the encoding is UTF-8 regardless of UTF-8 mode, so it may differ from locale.getpreferredencoding().
  • When WindowsConsoleIO is not used, file.fileno() is 0-2, and isatty(fd) returns true, Console(Output)CP() is used, which may also differ from locale.getpreferredencoding().
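
A small illustration of the difference, assuming a Windows console session with the default (non-legacy) stdio; the code pages shown are only examples:

import locale
import os
import sys

# With the default (non-legacy) Windows console stdio, sys.stdout wraps
# io._WindowsConsoleIO and always uses UTF-8, independent of the ANSI code page.
print(sys.stdout.encoding)                 # e.g. 'utf-8'
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' (ANSI code page)

# os.device_encoding() reports the console code page for fds 0-2 when they are
# a TTY, which is what TextIOWrapper(file) picks up when encoding=None.
print(os.device_encoding(1))               # e.g. 'cp850' (console output code page)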

I’ve seen this Q&A on Stack Overflow that uses PYTHONLEGACYWINDOWSSTDIO.

I am not sure that is a valid use case. I am OK with removing it.

Do you think we can drop support for Console(Output)CP in TextIOWrapper too?
Users can pass a console file object to TextIOWrapper even when PYTHONLEGACYWINDOWSSTDIO is not set, but that is really rare.

Since os.device_encoding(fd) is the same as locale.getpreferredencoding(False) on Unix, we can remove it if we don’t care about the console code page.
TextIOWrapper.__init__ is very long and hard to maintain, so I would be happy to reduce some code.
Then encoding=None would be the same as encoding=locale.getpreferredencoding(False) for TextIOWrapper.
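
A quick way to check the Unix side of that claim (assuming fd 1 is an interactive terminal):

import locale
import os

# On Unix, the device encoding of a TTY is just the locale encoding, so dropping
# the os.device_encoding() special case would not change behavior there.
print(os.device_encoding(1))               # e.g. 'UTF-8'
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8'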

We can promote EncodingWarning to them. Isn’t that enough? Note that Java changed its default encoding without any warnings.

Maybe we can add a banner like the following to the Python REPL when locale.getpreferredencoding(False).lower() not in {"utf8", "utf_8", "utf-8", "cp65001"}:

NOTE: Python will change the default encoding to UTF-8 and it may affect this platform.
Please check https://peps.python.org/pep-0686/#backward-compatibility

I am not sure what “promote it to a regular deprecation warning” means. Does this seem OK to you?

import locale
import sys

_encodingwarning_message = """\
The encoding option is omitted. It currently defaults to the locale encoding, but will be
changed to UTF-8 in a future version. Please specify `encoding="utf-8"` or `encoding="locale"`.
See https://peps.python.org/pep-0686/#backward-compatibility for details.
"""

def text_encoding(encoding, stacklevel=2):
    if encoding is None:
        encoding = "locale"
        if sys.flags.warn_default_encoding:
            import warnings
            warnings.warn(_encodingwarning_message, EncodingWarning, stacklevel+1)
        elif locale.getpreferredencoding(False).lower() in {"utf8", "utf_8", "utf-8", "cp65001"}:
            import warnings
            warnings.warn(_encodingwarning_message, DeprecationWarning, stacklevel+1)
    return encoding

This can still produce tons of false positive deprecation warnings if the application needs to read many ASCII files and uses open() without encoding for them.

Massive numbers of false positive deprecation warnings may hide other, real deprecation warnings. That’s why we provide the opt-in EncodingWarning instead of using a deprecation warning.

1 Like

The caching problem has been fixed since I wrote that answer, but the problem remains with using ReadConsoleW() and WriteConsoleW() instead of C read() and write() when a non-console file is duped to the fd. I wouldn’t recommend using legacy mode just to use code like os.dup2(stdio_fp, 1) in new code, as opposed to reassigning sys.std*. That said, you’re right to point out that I’m being hasty in suggesting that legacy support be dropped. Some deployments are probably using it that don’t care about the initial console I/O files, but need to keep using low-level os.dup2(), or os.close() + os.open(), for reasons that aren’t obvious to me.

The encoding has always been wrong for fds above 2 because os.device_encoding() is hard coded. Since os.device_encoding() was never implemented generally to support fds above 2, and consumers of legacy mode don’t seem to care that it’s not using the console encoding for standard console files 0-2, then I think Python really could get away with removing all vestiges of support for console code pages, but keep legacy mode that uses io.FileIO instances for console files.

I think there’s a `not` missing in the sample code (it warns when the preferred encoding is UTF-8, rather than when it isn’t), but that’s the gist of what I was thinking, yeah.

You’re right, I hadn’t considered the “lots of ASCII files opened from different parts of the code” case, though. Handling that gracefully would require a “warn on non-ASCII” capability in the io module, which would be a much more intrusive change, and much harder to use effectively in a test suite.

Which means we’re not likely to be able to do better than what PEP 597 encoding warnings already offer :frowning:

For example, see this pull request.

IPython added 84 instances of encoding="utf-8" to fix the EncodingWarning. They added zero instances of encoding="locale" or locale.getpreferredencoding(False).

I made a similar commit, adding hundreds of encoding="utf-8". Most of the affected files are just ASCII, although a few of them may be real bugs found by the warning. And those few hidden bugs would be fixed anyway once UTF-8 becomes the default.

So I suppose many people don’t want to add dozens or hundreds of encoding="utf-8". They may want to wait until UTF-8 becomes the default. For example, the craft-parts maintainer rejected adding a dozen encoding="utf-8":

One place where the pull request tried to add encoding="utf-8":

    with open(call_fifo, "w") as fifo:
        fifo.write(json.dumps(data))

Although JSON must be UTF-8, json.dumps generates only ASCII unless ensure_ascii=False is specified, so omitting the encoding here is not a bug.
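
A quick demonstration of that default behavior:

import json

data = {"name": "ファイル"}

# By default, json.dumps() escapes non-ASCII characters, so the result can be
# written safely with any ASCII-compatible encoding:
print(json.dumps(data))                      # {"name": "\u30d5\u30a1\u30a4\u30eb"}

# Only with ensure_ascii=False does the output contain non-ASCII characters,
# making the file encoding matter:
print(json.dumps(data, ensure_ascii=False))  # {"name": "ファイル"}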


I re-read PEP 387.

  • Adding a warning is required by the policy, but DeprecationWarning is not required; another warning category or a compiler warning can be used when there is a reason.
  • “Wait for the warning to appear in at least two minor Python versions” is required.

Now I am considering postponing the target to Python 3.13.
Although EncodingWarning was added in Python 3.10, there was no official plan to change the default encoding at that time, so I don’t count Python 3.10 in the deprecation period.

If this PEP is accepted before the 3.11 beta:

  • Fix encoding="locale" in UTF-8 mode in Python 3.11
  • Advertise the EncodingWarning and UTF-8 mode in the release notes of Python 3.11 and 3.12
  • Make UTF-8 mode the default in Python 3.13

If there are no objections, I will update the PEP.

1 Like

Just for the record, some of these changes appear to be incorrect. Blindly assuming UTF-8 is every bit as wrong as assuming "locale" (or anything that may imply), particularly for cases where files are shared between tools (e.g. where IPython reads from Conda’s history file).

This is why I say that developers need to be aware and deliberate about the changes. It’s not as simple as putting encoding="utf-8" everywhere, unless you know for sure that all your files are UTF-8 and therefore your code was incorrect before the change.

In the world we live in, nothing is “just ASCII” unless it’s deliberately encoded to be so. That’s a Python 2.x assumption, and is not correct anymore. We have many users who will be putting non-ASCII characters in all sorts of paths and names that are ending up in these files, and successfully, because code pages actually do cover 99% of the cases that ASCII does not. Without deliberately converting these files from the current code page, encoding="utf-8" is just as wrong as omitting it.

2 Likes

For the record, the conda history file is UTF-8 encoded. So this is an example of fixing a hidden bug rather than breaking existing code.

Of course, I agree that blindly assuming UTF-8 is not 100% correct. But I think it is already better than assuming "locale", and it will become even better in the future.

When a file is shared between tools, UTF-8 really is the better default encoding. The locale encoding is fragile; it can be changed easily.

And the locale encoding is not a good choice even for private files like logs and settings. In Japan, cp932 is truly legacy: it cannot represent characters we use daily. Since Japanese filenames are so common in Japan, writing file paths in cp932 is unsafe. Most Japanese developers hate cp932.
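
For example, a character like “𩸽” (hokke, a common fish name that needs JIS X 0213) fails immediately with cp932; the path below is made up:

# cp932 covers most everyday Japanese text, but not all of it:
path = "C:\\メニュー\\𩸽の開き.txt"
path.encode("utf-8")   # works
path.encode("cp932")   # raises UnicodeEncodeError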

I think it is ideal but not practical.

If we show the warning by default, more and more developers will blindly put encoding="utf-8". That is worse than this PEP, because users cannot fix it with PYTHONUTF8=0.

So I think PEP 597 + PEP 686 is the best way to go forward.

1 Like

Good catch :slight_smile: I found a different section of code that was doing something less important with the file, so I should probably report that issue to them (unless we get the default changed soon enough that they don’t notice :wink: )

I’m not arguing about this point. UTF-8 is superior to code pages in every way, and has been ever since it was invented some years after everyone started using Windows with code pages.

The problem is the transition, not the destination. I don’t want to cause a Python 3-like compatibility issue by suddenly changing this. I really want people to know this is coming, update their code ahead of time, and not be at all hurt when we make a change.

The only way to do that is to get them to think about what their current files are encoded as, write code to change them, and explicitly specify encodings. And they should do this anyway, because their code is probably broken today! But all we’re doing is accelerating the timing; we aren’t going to make it any easier for them with this change. That’s my concern. That’s why I want noisy warnings for existing users. So that we can get to using UTF-8 everywhere.

Just not at the cost of breaking everyone’s code by surprise. It shouldn’t be a surprise.

1 Like

I agree with that. This is why this PEP (and JEP 400) provides a backward-compatible option.

I agree too. We can advertise EncodingWarning and UTF-8 mode before Python 3.13 is released.
And people who cannot prepare for the change within 2.5 years can use Python 3.13 with PYTHONUTF8=0.

I don’t think forcing people to write encoding everywhere, even when they are just using ASCII, is a good idea. A noisy warning will push people to write encoding="utf-8" blindly or to ignore the warning. That’s why I want to keep EncodingWarning opt-in.

For the record, Victor and I are discussing the locale.get_encoding() APIs in bpo-47000.

  • About the underscore:
    • getencoding()
    • get_encoding()
  • About “locale encoding (at Python startup)” vs. “current locale encoding”:
    • locale.getencoding(current=False)
    • locale.get_encoding() + locale.get_current_encoding()
    • sys.getlocaleencoding() + locale.getencoding() (or locale.get_current_encoding())

For subprocess, how about using PYTHONIOENCODING if it’s set? In other words, expand the scope of config->stdio_encoding to all standard I/O (stdin, stdout, stderr), including the default encoding to use for standard I/O with child processes. The values of config->stdio_encoding and config->stdio_errors could be exposed as sys.getstdioencoding() and sys.getstdioencodeerrors(). If PYTHONIOENCODING isn’t set at startup, then config->stdio_encoding is whatever the default is otherwise, based on UTF-8 mode or the locale.
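
As a user-level approximation of that idea (sys.getstdioencoding() does not exist yet; the stdio_encoding() helper below is hypothetical and only parses the PYTHONIOENCODING value, which may be "encoding", "encoding:errors", or ":errors"):

import locale
import os
import subprocess
import sys

def stdio_encoding():
    # Honor PYTHONIOENCODING if set, otherwise fall back to the current default
    # text encoding (the locale encoding, or UTF-8 in UTF-8 mode).
    value = os.environ.get("PYTHONIOENCODING", "")
    encoding = value.split(":", 1)[0]
    return encoding or locale.getpreferredencoding(False)

proc = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    encoding=stdio_encoding(),
)
print(proc.stdout, end="")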

Given _Py_device_encoding(fd) has been ignored for legacy standard I/O in Windows since 3.8 because the interpreter configuration only has one stdio_encoding value, a new “console” pseudo-encoding could be supported in Windows for use cases that need the console I/O encoding, not only for console files, but as a general ‘locale encoding’ for standard I/O. TextIOWrapper would be changed to call _Py_device_encoding(fd) in Windows for the “console” pseudo-encoding. For subprocess, if the encoding is “console”, look up the real encoding to use via os.device_encoding(fd), using the standard files 0-2. If there’s no console, the “console” encoding should be whatever the default is otherwise.

I considered that idea but excluded it from PEP 686 because I want to keep the PEP simple.

Some Windows users may want to keep using a legacy encoding for stdio because the Windows console environment evolves very slowly.

Ideally speaking, PYTHONIOENCODING should be consistent with chcp on cmd.exe, with OutputEncoding + [console]::OutputEncoding on PowerShell (although OutputEncoding and [console]::OutputEncoding are different by default!!!), and with UTF-8 on MSYS2 (Git for Windows bash).

So using it as the default subprocess PIPE encoding seems like a good idea to me.
But how does it relate to this PEP? Should we use PYTHONIOENCODING for subprocess PIPEs when UTF-8 mode is enabled? It is a bit confusing…

Switching to UTF-8 as the default file encoding could unmask an encoding problem that went unnoticed when the default was the process ANSI code page. Python should try to make it easier to diagnose and resolve such problems. This helps to reduce the pain of switching to UTF-8 as the default.

One thing we can improve, and something we really should have implemented from the outset, is to provide a simple way to use the active code page(s) of the current console session for standard I/O, which relates to the suggestion to work around problems by setting PYTHONIOENCODING. Using the active code page is a paradigm from MS-DOS. It’s still used by some Windows console applications, so it needs to be supported. (I don’t want Python to use this behavior by default, however. I prefer to use UTF-8 or the ANSI code page of either the user locale or the system locale, depending on the context.)

I suggested the addition of a “console” pseudo-encoding for this. It’s not a real encoding because it resolves to the current input or output code-page encoding of the console session (e.g. “cp850”). This makes it simple to work around a legacy encoding problem by setting PYTHONIOENCODING=console, or by spawning a child process with subprocess.Popen(args, stdin=PIPE, stdout=PIPE, stderr=PIPE, encoding='console').

The “console” encoding could be evaluated internally by a new _Py_console_encoding(fd) function, for Windows only. This would always call GetConsoleCP() for stdin (0) and GetConsoleOutputCP() for stdout (1) and stderr (2), regardless of the file’s existence or type. If there’s no console session, return None, and let the caller decide how to handle it. For file numbers above 2, explicitly check for an existing input or output console file to determine whether to use GetConsoleCP() or GetConsoleOutputCP().

Since 3.8, the interpreter initialization only supports one standard I/O encoding (i.e. config->stdio_encoding), which defaults to the locale encoding. That there’s only one standard I/O encoding is a non-issue given the evaluation of the “console” encoding. The same applies to the single encoding parameter of subprocess.

The “console” encoding could be supported more generally by a codec search function that calls _winapi.GetConsoleCP(). This could also support “conin” and “conout” encodings that respectively use _winapi.GetConsoleCP() and _winapi.GetConsoleOutputCP(). Python standard I/O providers such as subprocess.Popen could evaluate the generic “console” encoding as “conin” for stdin and “conout” for stdout and stderr.
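
A rough sketch of that codec-search idea (the _winapi helpers mentioned above don’t exist yet, so this uses ctypes; the "conin"/"conout" names are the hypothetical ones from this thread):

import codecs
import ctypes

def _console_search(name):
    # Resolve "conin"/"conout" to the console's current input/output code page
    # (Windows only).
    kernel32 = ctypes.WinDLL("kernel32")
    if name == "conin":
        cp = kernel32.GetConsoleCP()
    elif name == "conout":
        cp = kernel32.GetConsoleOutputCP()
    else:
        return None
    if not cp:
        return None  # no console session attached
    return codecs.lookup(f"cp{cp}")

codecs.register(_console_search)
print(codecs.lookup("conout").name)  # e.g. 'cp850'

One caveat: the codec registry caches the result per requested name, so a later chcp would not be reflected once "conin" or "conout" has been looked up.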

I suggested exposing the standard I/O encoding via sys.getstdioencoding() and sys.getstdioencodeerrors(). Setting PYTHONIOENCODING would thus override UTF-8 mode for not only sys.std*, but also anything that uses sys.getstdioencoding(). I think this should include the default encoding used by subprocess.Popen.

Couldn’t we just register a “console” alias at startup to achieve this? (Presumably nobody supports console encodings changing at runtime anyway…)
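
Presumably something like this startup hook; the "console" alias name and the ctypes call are assumptions, and it freezes whatever code page is active when Python starts:

import codecs
import ctypes
import encodings

# Resolve the console output code page once at startup and register "console"
# as a plain alias for it (Windows only; skipped when no console is attached).
cp = ctypes.WinDLL("kernel32").GetConsoleOutputCP()
if cp:
    encodings.aliases.aliases["console"] = f"cp{cp}"
    print(codecs.lookup("console").name)  # e.g. 'cp850'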

1 Like

I assume you mean to include “conin” and “conout” aliases as well, since they’re not necessarily the same encoding, unless you’d rather ignore the output code page and assume that they’re the same, as they are by default and as set by “chcp.com”.

It’s of course possible to cache the GetConsoleCP() and GetConsoleOutputCP() values at startup. It loses the flexibility of being able to change it at runtime via “chcp.com” or directly calling SetConsoleCP() and SetConsoleOutputCP(). If “console” isn’t assumed to be the input code page, then its evaluation in a standard I/O context (e.g. PYTHONIOENCODING=console) would still need to be dynamic, to use “conin” for stdin and “conout” for stdout/stderr.

I understand your concern.
I was thinking about changing the default text encoding except for subprocess, instead of making UTF-8 mode the default.

But adding yet another “default encoding” and more options would confuse users.
I concluded that subprocess should behave the same as files. Users can manage the migration by:

  1. Disable UTF-8 mode (if the migration happens after Python 3.13).
  2. Use the EncodingWarning option, or other tools like pylint, to find uses of the default encoding (see the sketch below).
  3. Where the locale encoding should be used, add encoding="locale".
  4. Run the tests with UTF-8 mode.
  5. Enable UTF-8 mode in production.

This workflow is the same for both files and subprocess.
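
For step 2, a minimal sketch of surfacing the warning in a test run (the warning is only emitted when Python runs with -X warn_default_encoding or PYTHONWARNDEFAULTENCODING=1, per PEP 597; settings.ini is just a placeholder):

import warnings

# Turn every EncodingWarning into an error so the test suite pinpoints each
# call site that still relies on the default encoding.
with warnings.catch_warnings():
    warnings.simplefilter("error", EncodingWarning)
    with open("settings.ini") as f:   # fails here: no explicit encoding
        data = f.read()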

I think this is a general improvement. It can be implemented regardless of this PEP.

chcp 65001 in cmd.exe and [Console]::OutputEncoding = [Console]::InputEncoding = [Text.Encoding]::UTF8; in PowerShell are common techniques.
Deciding the encoding based on the console (output) code page would already be useful. There is no need to include it in this PEP.

BTW, how about just using the encoding="locale" option instead of adding an encoding="console" option?
TextIOWrapper already uses the console encoding when fd is 0-2 and encoding is None or "locale".
So using the console encoding when encoding="locale" is passed makes sense to me.
It would also make cross-platform scripts slightly easier.

The console and the locale have different encodings on Windows; that’s why. chcp doesn’t change the locale, only the console’s active code page (for non-Unicode applications).

So when reading/writing to the standard IO handles, defaulting to console is slightly better than locale, but for a file on disk that existed before the current console, locale is likely better (obviously Unicode is better if you’re creating the file and you have the choice, so reading an arbitrary file is where we need a reasonable default).

1 Like

Of course, we are talking about stdio here.
My idea is that subprocess.Popen(encoding="locale", stdout=PIPE, stdin=PIPE) would use the console encoding.

If users don’t want the console encoding, they would need to pass encoding=locale.getencoding() to use the ANSI code page.
(The API for getting the ANSI code page / locale encoding is under discussion.)

I am updating PEP 686.

In this pull request, I added using PYTHONIOENCODING as a backward-compatibility option under “Rejected ideas”.

Better console encoding support is a different topic; I won’t include it in the PEP.

1 Like

These functions have already been exposed in Python for a long time. os.device_encoding(0) calls GetConsoleCP() if stdin is a TTY, and os.device_encoding(1) and os.device_encoding(2) call GetConsoleOutputCP() if stdout and stderr, respectively, are TTYs.
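
For reference, a quick way to observe this from Python (the code-page numbers are only examples):

import os

# os.device_encoding() returns the console code page for fds 0-2 on Windows,
# and None when the fd is not attached to a console (e.g. when redirected).
for fd in (0, 1, 2):
    print(fd, os.isatty(fd), os.device_encoding(fd))
# e.g. 0 True cp850
#      1 True cp850
#      2 True cp850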