Add legacy_text_encoding option to make UTF-8 default

methane · March 15, 2022, 7:08am

This is one of the subtopics toward making UTF-8 the default.

Java changed its default encoding and provided the file.encoding option for backward compatibility.
Java users can set -Dfile.encoding=COMPAT to get the old behavior.

I want to add a similar option for forward/backward compatibility:

Spec

Add a PYTHONLEGACYTEXTENCODING environment variable and a -Xlegacy_text_encoding command line option to Python 3.12. These options take 0 or 1.

When legacy_text_encoding=0, Python uses UTF-8 as the default text encoding:

TextIOWrapper() uses UTF-8 by default if the underlying stream is not a TTY.
stdin, stdout, and stderr also use UTF-8 by default if they are not TTYs.
io.text_encoding() returns “UTF-8” instead of “locale”.
subprocess.Popen(..., text=True) uses UTF-8 by default (we may deprecate the text option).
os.device_encoding() and locale.getpreferredencoding() still return the locale-specified encoding.

In Python 3.12, legacy_text_encoding=1 is the default, so the default behavior does not change.
Users can set legacy_text_encoding=0 to test their applications with the new behavior.
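
To make the rule above concrete, here is a minimal sketch of how the default could be resolved. The option does not exist in any released Python. The function names, the precedence of -X over the environment variable, and the choice to fall back to the locale encoding for TTYs are all assumptions made for illustration. sys._xoptions is CPython-specific.

```python
import os
import sys


def legacy_text_encoding_enabled(xoptions=None, environ=None):
    """Hypothetical resolver for the proposed option (1 = legacy, 0 = UTF-8)."""
    xoptions = sys._xoptions if xoptions is None else xoptions
    environ = os.environ if environ is None else environ
    # Assumption: the -X option takes precedence over the environment variable.
    value = xoptions.get("legacy_text_encoding")
    if value is None:
        # Proposed 3.12 default: legacy behavior stays on.
        value = environ.get("PYTHONLEGACYTEXTENCODING", "1")
    if value not in ("0", "1"):
        raise ValueError(f"legacy_text_encoding must be 0 or 1, got {value!r}")
    return value == "1"


def default_text_encoding(legacy, isatty, locale_encoding):
    """Hypothetical rule: UTF-8 unless legacy mode is on or the stream is a TTY."""
    if legacy or isatty:
        return locale_encoding
    return "utf-8"
```

For example, default_text_encoding(False, False, "cp932") gives "utf-8", while the same call with isatty=True keeps "cp932".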

In some future version, Python will make legacy_text_encoding=0 the default, which makes UTF-8 the
default. I have not chosen a specific version for now. We need to discuss the deprecation period in another topic.

Relation to PEP 540 UTF-8 Mode.

legacy_text_encoding=0 is very similar to UTF-8 mode, but there is a significant difference:

legacy_text_encoding=0 only changes the default text encoding. The locale encoding is still used in several places (e.g. fsencoding and TTYs). locale.getpreferredencoding(False) returns the locale encoding (e.g. the one set via LC_CTYPE).
UTF-8 mode emulates Python running in a UTF-8 locale. fsencoding, TTYs, and locale.getpreferredencoding(False) all use UTF-8 regardless of the actual locale.

legacy_text_encoding will cover most use cases of UTF-8 mode.
But UTF-8 mode would still be useful for some environments, like Android.
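
The values that the two modes would treat differently can be inspected on any current Python. This is only a sketch for looking at the status quo: the function name encoding_report is made up here, but every call in it is an existing standard-library API.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that the locale and UTF-8 mode can influence."""
    return {
        # True when running with -X utf8 or PYTHONUTF8=1.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # The "fsencoding" mentioned above.
        "filesystem": sys.getfilesystemencoding(),
        # Locale encoding; reports UTF-8 when UTF-8 mode is on.
        "locale_preferred": locale.getpreferredencoding(False),
        # stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

Running it once normally and once with -X utf8 on a non-UTF-8 locale shows which entries UTF-8 mode overrides.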

Vote

What do you think of this idea?

Of course, adding yet another encoding option may confuse users.
We could change the behavior of UTF-8 mode instead, but that would break some existing use cases of UTF-8 mode.

Add legacy_text_encoding
Repurpose UTF-8 mode
Do not provide forward/backward compat option.


pitrou · March 15, 2022, 9:41am

This sounds a bit fragile to me. Whether or not a stream is a TTY can depend on subtleties (for example piping to cat or grep) and it’s not very nice to users if the default encoding changes based on such subtleties.
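
As a small illustration of how easily the TTY status flips, the sketch below runs a child interpreter with its stdout connected to a pipe. The helper name child_stdout_info is made up for this example.

```python
import subprocess
import sys

# The child reports whether its stdout is a TTY and which encoding it uses.
CHILD = "import sys; print(sys.stdout.isatty(), sys.stdout.encoding)"


def child_stdout_info():
    """Run a child interpreter with stdout piped; return (isatty, encoding)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,  # stdout becomes a pipe, not a terminal
        check=True,
    )
    isatty, encoding = result.stdout.decode("ascii").split()
    return isatty == "True", encoding


if __name__ == "__main__":
    print(child_stdout_info())
```

The same one-liner run directly in a terminal reports a TTY, while appending `| cat` in the shell does not, even though the text still ends up on the same terminal.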

methane · March 15, 2022, 11:43am

Non-UTF-8 platforms are always fragile. There is no way to tell whether the final output is a file, a terminal, or both (tee). A user may use the en_US.US-ASCII locale from a UTF-8 terminal.

Other options:

a. Always use UTF-8 on Unix. os.device_encoding() is not UTF-8 only when PYTHONLEGACYWINDOWSSTDIO is enabled and the fd is a console.

Unix users need to use PYTHONIOENCODING when they want non-UTF-8 stdio.
Works very nicely with tools that use UTF-8 regardless of locale (node, Rust, Go, Java (>=18), etc…)
Doesn’t work nicely with tools that use the locale encoding when the locale is not UTF-8.

b. Do not touch stdio and subprocess.PIPE at all.

somescript.py -o outfile.txt and somescript.py > outfile.txt may use different encodings.
- It would be confusing for new users…
Users need to use PYTHONIOENCODING when they want UTF-8 stdio. (status quo)
Works nicely with tools that use the locale encoding.
If we want to change the stdio encoding in the future, another breaking change is required.
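
Both options rely on PYTHONIOENCODING as the escape hatch. A minimal sketch of its effect on a child process follows. The helper name child_stdout_encoding is made up, and codecs.lookup() is used only to normalize alias spellings.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Return the canonical name of a child's stdout encoding.

    io_encoding, if given, is passed to the child via PYTHONIOENCODING.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a known state
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        check=True,
    )
    # Normalize spellings such as "latin-1" and "iso-8859-1" to one name.
    return codecs.lookup(result.stdout.decode("ascii").strip()).name


if __name__ == "__main__":
    print("default:", child_stdout_encoding())
    print("forced: ", child_stdout_encoding("utf-8"))
```

Under option (a) a Unix user would set the variable to get a non-UTF-8 encoding. Under option (b) it is what they set today to get UTF-8.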

Maybe (b) is the most conservative approach. And that is what I had in mind when I wrote PEP 597 (EncodingWarning)…
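
For reference, the opt-in warning from PEP 597 can be seen with the -X warn_default_encoding option (Python 3.10 and later). The sketch below runs a child that opens a file without an encoding argument and reports the warning categories it caught. The helper name is made up for this example.

```python
import subprocess
import sys

CHILD = """
import os, warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with open(os.devnull) as f:  # no encoding= argument on purpose
        pass
print(",".join(w.category.__name__ for w in caught))
"""


def warnings_for_implicit_encoding():
    """Return the names of warnings raised by open() without encoding=."""
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", CHILD],
        capture_output=True,
        encoding="utf-8",  # explicit, so this call itself does not warn
        check=True,
    )
    output = result.stdout.strip()
    return output.split(",") if output else []


if __name__ == "__main__":
    print(warnings_for_implicit_encoding())
```

Passing encoding="utf-8" (or encoding="locale" to keep the current behavior explicitly) in the child's open() call makes the warning go away.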

pf_moore · March 15, 2022, 11:55am

I don’t really have an opinion here (and hence haven’t voted) other than to say we seem to be continually adding more and more complexity (UTF-8 mode, this new text encoding mode, etc, etc). I feel like we’d be better just making this a clean break. If we’re confident that the end result (UTF-8 everywhere) is worth the cost, then let’s just get on with it and do it. If we’re not confident, then let’s wait.

At a minimum, can anyone clearly state what conditions would have to apply for us to simply switch to UTF-8 everywhere? (“All operating systems that we care about use UTF-8 throughout”, for example).

malemburg · March 15, 2022, 11:55am

I don’t think we need this.

People can already experiment with UTF-8 mode to figure out whether
their applications work in this eventually new default and we should
instead point people in that direction, rather than introducing a new
way to keep the existing behavior.

FWIW: I have been using UTF-8 mode for several years now and it works
much better than relying on locales, OS env vars, UI settings, etc.
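
For anyone who wants to try this, a minimal sketch of such an experiment: start a child interpreter with -X utf8 (setting PYTHONUTF8=1 is equivalent) and check which encoding open() picks by default. The helper name probe is made up for this example.

```python
import os
import subprocess
import sys

# The child opens a file without encoding= and reports what it got.
CHILD = (
    "import os, sys\n"
    "with open(os.devnull) as f:\n"
    "    print(sys.flags.utf8_mode, f.encoding.lower())\n"
)


def probe(utf8_mode):
    """Run a child with UTF-8 mode on or off; return (flag, open() encoding)."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # let the -X option alone decide
    option = "utf8" if utf8_mode else "utf8=0"
    result = subprocess.run(
        [sys.executable, "-X", option, "-c", CHILD],
        env=env,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    flag, encoding = result.stdout.split()
    return int(flag), encoding


if __name__ == "__main__":
    print("UTF-8 mode on: ", probe(True))
    print("UTF-8 mode off:", probe(False))
```

On a UTF-8 locale both lines report the same encoding. The difference only shows up where the locale encoding is something else, such as a legacy Windows code page.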

methane · March 15, 2022, 1:29pm

I think your idea is very close to my “Repurpose UTF-8 mode” idea.

I agree that turning UTF-8 mode on or off is enough for most users, and that adding a new mode brings only a tiny gain.

I will write a draft spec of the new UTF-8 mode behavior.
Thank you!

methane · March 18, 2022, 11:03am

I have now written PEP 686. It uses UTF-8 mode instead of adding yet another option.