Add legacy_text_encoding option to make UTF-8 default

This is one of the subtopics on the way to making UTF-8 the default.

Java changed its default encoding and added a file.encoding option for backward compatibility.
Java users can set -Dfile.encoding=COMPAT to get the old behavior.

I want to add a similar option for forward/backward compatibility:


Add a PYTHONLEGACYTEXTENCODING environment variable and a -Xlegacy_text_encoding command-line option to Python 3.12. Both options take 0 or 1.

When legacy_text_encoding=0, Python uses UTF-8 as the default text encoding:

  • TextIOWrapper() uses UTF-8 by default if the stream is not a TTY.
  • stdin, stdout, and stderr also use UTF-8 by default if they are not TTYs.
  • io.text_encoding() returns “UTF-8” instead of “locale”.
  • subprocess.Popen(..., text=True) uses UTF-8 by default (we may deprecate the text option).
  • os.device_encoding() and locale.getpreferredencoding() still return the locale-specified encoding.

In Python 3.12, legacy_text_encoding=1 is the default, so the default behavior does not change.
Users can set legacy_text_encoding=0 to test their applications with the new behavior.

In some later version, Python will change the default to legacy_text_encoding=0 and make UTF-8 the
default. I have not chosen a specific version for now. We need to discuss the deprecation period in another topic.

Relation to PEP 540 UTF-8 Mode.

legacy_text_encoding=0 is very similar to UTF-8 mode, but there is a significant difference:

  • legacy_text_encoding=0 only changes the default text encoding. The locale encoding is still used in several places (e.g. fsencoding and TTY), and locale.getpreferredencoding(False) returns the locale encoding (e.g. from LC_CTYPE).
  • UTF-8 mode emulates Python running in a UTF-8 locale. fsencoding, TTY, and locale.getpreferredencoding(False) are all UTF-8 regardless of the actual locale.
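
The UTF-8 mode side of this comparison can be checked with the existing -X utf8 option. The following is a rough sketch that starts a child interpreter and prints the three values mentioned above. PROBE and probe are names invented for this example.

```python
import subprocess
import sys

# Code run in the child interpreter: UTF-8 mode flag, fsencoding, locale encoding.
PROBE = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, sys.getfilesystemencoding(), "
    "locale.getpreferredencoding(False))"
)


def probe(*interpreter_args):
    """Run PROBE in a child Python started with the given options.

    Returns a list of three strings: [utf8_mode, fsencoding, preferred encoding].
    """
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", PROBE],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print("default  :", probe())
    print("-X utf8  :", probe("-X", "utf8"))
```

With -X utf8 all three values report UTF-8. Without it, the last two depend on the platform and the locale the child inherits.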

legacy_text_encoding would cover most use cases of UTF-8 mode.
But UTF-8 mode would still be useful for some environments like Android.


What do you think of this idea?

Of course, adding yet another encoding option may confuse users.
We could change the behavior of UTF-8 mode instead, but that would break some existing use cases of UTF-8 mode.

  • Add legacy_text_encoding
  • Repurpose UTF-8 mode
  • Do not provide a forward/backward compatibility option

This sounds a bit fragile to me. Whether or not a stream is a TTY can depend on subtleties (for example piping to cat or grep) and it’s not very nice to users if the default encoding changes based on such subtleties.
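
To make the concern concrete, here is a small model of the TTY-dependent rule as I read it from the proposal. It is a sketch only: proposed_default_encoding and FakeTerminal are hypothetical names, and no such function exists in Python.

```python
import io
import locale


def proposed_default_encoding(stream, legacy_text_encoding=1):
    """Hypothetical model of the proposed rule; not an actual Python API."""
    locale_encoding = locale.getpreferredencoding(False)
    if legacy_text_encoding:
        # Legacy behavior: always follow the locale.
        return locale_encoding
    if stream.isatty():
        # Terminals would keep using the locale encoding.
        return locale_encoding
    # Files and pipes would switch to UTF-8.
    return "utf-8"


class FakeTerminal(io.BytesIO):
    """Stand-in for a stream attached to a terminal (illustration only)."""

    def isatty(self):
        return True


if __name__ == "__main__":
    print("terminal:", proposed_default_encoding(FakeTerminal(), 0))
    print("pipe    :", proposed_default_encoding(io.BytesIO(), 0))
```

Under a non-UTF-8 locale the two lines differ, so the same script would encode its output differently depending on whether it is run directly or piped to cat or grep.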

Non-UTF-8 platforms are always fragile. There is no way to tell whether the final output goes to a file, a terminal, or both (tee). A user may use the en_US.US-ASCII locale from a UTF-8 terminal.

Other options:

a. Always use UTF-8 on Unix. os.device_encoding() returns something other than UTF-8 only when PYTHONLEGACYWINDOWSSTDIO is enabled and the fd is a console.

  • Unix users need to use PYTHONIOENCODING when they want non-UTF-8 stdio.
  • Works very well with tools that use UTF-8 regardless of locale (node, rust, Go, Java (>=18), etc…)
  • Does not work well with tools that use the locale encoding when the locale is not UTF-8.

b. Do not touch stdio and subprocess.PIPE at all.

  • -o outfile.txt and > outfile.txt may use different encodings.
    • It would be confusing for new users…
  • Users need to use PYTHONIOENCODING when they want UTF-8 stdio. (status quo)
  • Works well with tools that use the locale encoding.
  • If we want to change the stdio encoding in the future, another breaking change is required.
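
Both options lean on PYTHONIOENCODING as the override for stdio. A quick sketch of how that variable affects a child interpreter today. child_stdout_encoding is a made-up helper name.

```python
import os
import subprocess
import sys


def child_stdout_encoding(ioencoding=None):
    """Return sys.stdout.encoding as reported by a child interpreter.

    If ioencoding is given, it is passed to the child via PYTHONIOENCODING.
    """
    env = dict(os.environ)
    # Start from a clean slate so an inherited setting does not interfere.
    env.pop("PYTHONIOENCODING", None)
    if ioencoding is not None:
        env["PYTHONIOENCODING"] = ioencoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        encoding="ascii",  # encoding names are plain ASCII
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("default          :", child_stdout_encoding())
    print("PYTHONIOENCODING :", child_stdout_encoding("latin-1"))
```

Note that the child's stdout is a pipe here, so the "default" line shows what a piped process gets, which is exactly the case the two options treat differently.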

Maybe (b) is the most conservative approach. And that is what I had in mind when I wrote PEP 597 (EncodingWarning)…
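
For anyone who has not tried PEP 597 yet: since Python 3.10 the opt-in -X warn_default_encoding option makes open() emit an EncodingWarning when the encoding argument is omitted. A minimal sketch follows. The helper name warnings_from and the two snippets are only illustrative.

```python
import subprocess
import sys

# Opening a text file without an explicit encoding relies on the default.
IMPLICIT = "import os; open(os.devnull).close()"
# Passing encoding explicitly does not depend on the default.
EXPLICIT = "import os; open(os.devnull, encoding='utf-8').close()"


def warnings_from(snippet):
    """Run snippet in a child Python with EncodingWarning enabled; return its stderr."""
    result = subprocess.run(
        [
            sys.executable,
            "-X", "warn_default_encoding",    # opt in to PEP 597 warnings
            "-W", "always::EncodingWarning",  # make sure they are displayed
            "-c", snippet,
        ],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stderr


if __name__ == "__main__":
    print("implicit:", warnings_from(IMPLICIT).strip() or "(no warning)")
    print("explicit:", warnings_from(EXPLICIT).strip() or "(no warning)")
```

This finds the call sites that would change behavior if the default text encoding changed, without touching stdio.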

I don’t really have an opinion here (and hence haven’t voted) other than to say we seem to be continually adding more and more complexity (UTF-8 mode, this new text encoding mode, etc, etc). I feel like we’d be better just making this a clean break. If we’re confident that the end result (UTF-8 everywhere) is worth the cost, then let’s just get on with it and do it. If we’re not confident, then let’s wait.

At a minimum, can anyone clearly state what conditions would have to apply for us to simply switch to UTF-8 everywhere? (“All operating systems that we care about use UTF-8 throughout”, for example).


I don’t think we need this.

People can already experiment with UTF-8 mode to figure out whether
their applications work in this eventually new default and we should
instead point people in that direction, rather than introducing a new
way to keep the existing behavior.

FWIW: I have been using UTF-8 mode for several years now and it works
much better than relying on locales, OS env vars, UI settings, etc.

I think your idea is very close to my “Repurpose UTF-8 mode” idea.

I agree that turning UTF-8 mode on or off is enough for most users and that adding a new mode brings only a tiny gain.

I will write a draft spec of the new UTF-8 mode behavior.
Thank you!

I have now written PEP 686. It uses UTF-8 mode instead of adding yet another option.
