PEP 686: Make UTF-8 mode default

Abstract

This PEP proposes making UTF-8 mode [1] on by default.

With this change, Python uses UTF-8 for default encoding of files, stdio, and pipes consistently.

Motivation

UTF-8 has become the de facto standard text encoding.

  • The default encoding of Python source files is UTF-8.
  • JSON, TOML, and YAML use UTF-8.
  • Most text editors, including VS Code and Windows Notepad, use UTF-8 by default.
  • Most websites and text data on the internet use UTF-8.
  • Many other popular programming languages, including Node.js, Go, Rust, Ruby, and Java, use UTF-8 by default.

Changing the default encoding to UTF-8 makes Python easier to interoperate with them.

Additionally, many Python developers using Unix forget that the default encoding is platform dependent. They omit encoding="utf-8" when they read text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, and Python source files). The inconsistent default encoding has caused many bugs.
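As a minimal sketch of the habit described above, the helpers below always pass encoding="utf-8" when reading or writing JSON, so the result does not depend on the platform's locale. The function names load_json and dump_json are hypothetical and only serve the illustration.

```python
import json


def load_json(path):
    """Read a JSON file as UTF-8, regardless of the locale encoding."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def dump_json(obj, path):
    """Write a JSON file as UTF-8, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)
```

Without the explicit argument, the same code would use the locale encoding (for example a legacy code page on Windows) and could fail or misread non-ASCII text.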

Specification

Changes to UTF-8 mode

Currently, UTF-8 mode affects locale.getpreferredencoding().

This PEP proposes to remove this override. UTF-8 mode will not affect the locale module.

After this change, UTF-8 mode affects:

  • stdin, stdout, stderr
    • Users can override it with PYTHONIOENCODING.
  • filesystem encoding
  • TextIOWrapper and APIs using it, including open(), Path.read_text(), subprocess.Popen(cmd, text=True), etc.
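
The settings in the list above can be inspected from a running interpreter. The following is a small sketch, not part of the PEP; describe_encodings is a hypothetical helper name, and the values it returns depend on the platform, the locale, and whether UTF-8 mode is on.

```python
import io
import sys


def describe_encodings():
    """Collect the encodings that UTF-8 mode influences."""
    return {
        # 1 when UTF-8 mode is enabled, 0 otherwise.
        "utf8_mode": sys.flags.utf8_mode,
        # Standard streams may be replaced or missing, so read defensively.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        # What TextIOWrapper picks when no encoding is given.
        "text_io_default": io.TextIOWrapper(io.BytesIO()).encoding,
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

Running the script once normally and once with -X utf8 is a quick way to see which entries change.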


This change will be introduced in Python 3.11 if possible.

Enable UTF-8 mode by default

Python enables UTF-8 mode by default.

Users can still disable UTF-8 mode by setting PYTHONUTF8=0 or -X utf8=0.
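
To illustrate the opt-out, this sketch starts a child interpreter with a given -X option and reports its sys.flags.utf8_mode value. The helper name utf8_mode_flag is hypothetical; the -X utf8 option and sys.flags.utf8_mode are existing features from PEP 540.

```python
import subprocess
import sys


def utf8_mode_flag(*interpreter_args):
    """Return sys.flags.utf8_mode as seen by a child interpreter."""
    code = "import sys; print(sys.flags.utf8_mode)"
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print(utf8_mode_flag("-X", "utf8=0"))  # UTF-8 mode explicitly disabled
    print(utf8_mode_flag("-X", "utf8=1"))  # UTF-8 mode explicitly enabled
```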

Backward Compatibility

Most Unix systems use a UTF-8 locale, and Python enables UTF-8 mode when its locale is C or POSIX. So this change mostly affects Windows users.

When a Python program depends on the default encoding, this change may cause UnicodeError , mojibake, or even silent data corruption. So this change should be announced very loudly.
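
The two failure modes named above can be reproduced with plain bytes, independent of any file or locale. This is only an illustration; the sample strings and the cp1252 and cp932 code pages are chosen as examples of legacy Windows encodings.

```python
# UTF-8 bytes decoded with a legacy single-byte code page: no error, wrong text.
utf8_bytes = "café".encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# Legacy multi-byte data decoded as UTF-8: usually an exception.
cp932_bytes = "こんにちは".encode("cp932")
try:
    cp932_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(mojibake))  # 'cafÃ©'
print(decode_failed)   # True
```

The first case is the more dangerous one, because nothing signals that the text was misread.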

To resolve this backward incompatibility, users can:

  • Disable UTF-8 mode.
  • Use EncodingWarning to find where the default encoding is used, and use the encoding="locale" option to keep using the locale encoding. [2]
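
A sketch of the second option, assuming Python 3.10 or later (where PEP 597 added -X warn_default_encoding and encoding="locale"): the same one-line program is run twice in a child interpreter, once relying on the default encoding and once opting in to the locale encoding. The names IMPLICIT, EXPLICIT, and encoding_warnings are hypothetical.

```python
import subprocess
import sys

# Child program that opens a text file without naming an encoding.
IMPLICIT = "import os; open(os.devnull).close()"
# The same program after explicitly opting in to the locale encoding.
EXPLICIT = "import os; open(os.devnull, encoding='locale').close()"


def encoding_warnings(code):
    """Run code with EncodingWarning enabled and return the child's stderr."""
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", code],
        capture_output=True,
        text=True,
    )
    return result.stderr


if __name__ == "__main__":
    print(encoding_warnings(IMPLICIT))  # expected to mention EncodingWarning
    print(encoding_warnings(EXPLICIT))  # expected to be free of EncodingWarning
```

Setting the environment variable PYTHONWARNDEFAULTENCODING=1 has the same effect as the -X option.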

Preceding examples

  • Ruby changed the default external_encoding to UTF-8 on Windows in Ruby 3.0 (2020). [3]
  • Java changed the default text encoding to UTF-8 in JDK 18 (2022). [4]

Both Ruby and Java have an option for backward compatibility. They don’t provide any warning like EncodingWarning [2] in Python for use of the default encoding.

Rejected Alternative

Deprecate implicit encoding

Deprecating use of the default encoding was considered.

But there are many cases where users rely on the default encoding when they only need ASCII. And some users use Python only on Unix with a UTF-8 locale.

So forcing users to specify the encoding option everywhere is too painful.

Java also rejected this idea [4].

How to teach this

For new users, this change reduces the number of things that need to be taught.

Users can delay learning about text encoding until they need to handle non-UTF-8 text files.

For existing users, see the Backward Compatibility section.

Resources

[1] PEP 540 – Add a new UTF-8 Mode

[2] PEP 597 – Add optional EncodingWarning

[3] Set default for Encoding.default_external to UTF-8 on Windows

[4] JEP 400: UTF-8 by Default


Very doubtful. At best, we can announce with 3.11 that we’ll be changing it in a future release. Virtually nobody outside of the people who have discussed it will be even aware that it’s coming, and plenty of libraries are going to need time to catch up - ideally well before the .0 release, or else our release will be essentially unusable for many people (which turns into “everyone” by word-of-mouth).

3.12 is probably the earliest. And it’s going to take active campaigning (e.g. to get people to set PYTHONUTF8=1 for their test suites and fix issues over the following year).
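
One way to act on that suggestion is to launch the test suite in a child interpreter with PYTHONUTF8=1 set. The sketch below is an assumption about how a project might do this; run_in_utf8_mode is a hypothetical helper, and the unittest invocation in the usage example stands in for whatever test runner a project uses.

```python
import os
import subprocess
import sys


def run_in_utf8_mode(args):
    """Run the current interpreter with the given arguments and PYTHONUTF8=1."""
    env = dict(os.environ, PYTHONUTF8="1")
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Example: run a unittest-based suite with UTF-8 mode switched on.
    result = run_in_utf8_mode(["-m", "unittest", "discover"])
    print(result.returncode)
```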

We probably need to acknowledge here that this is likely a majority of new users (let’s say, “in their first year”). The number of students in China (GB-18030) alone will far exceed those who can assume UTF-8. Obviously UTF-8 is still preferable based on number of markets (countries), but is only weakly justified on number of users.[1]


  1. I’m not even sure whether Python will default to GB-18030, which may make this whole point irrelevant if everyone in China has had to be specifying their encoding already.


Thanks for writing this up, Inada-san.

I’m +1 on the general idea and have used UTF-8 mode very successfully for several years (our eGenix PyRun has had UTF-8 mode on per default since the Python 3.8 version).

Some comments:

  • I’d add a section on how to switch over to UTF-8 mode, specifically mentioning that we first advertise using UTF-8 actively for, say, two releases (3.11 and 3.12) and then change the UTF-8 default to on in 3.13.
  • It would be good to also add advice to start being explicit about encodings and not to rely on defaults in applications. There is way too much guessing going on when it comes to encodings, and Python cannot easily determine what the user really wants (use OS settings, environment settings, application settings, defaults based on whether a pipe, file, or TTY is being used, hard-coded defaults, etc.). Most of these problems can be avoided by explicitly setting the encoding to use for I/O wherever it occurs and letting the application decide what is right, rather than having Python guess.
  • At some point in the future the UTF-8 mode setting should be removed altogether. The PEP should list a time frame for this, so that people can start to adapt accordingly. I’d suggest waiting another two releases after UTF-8 mode is turned on per default.

Note that this paragraph is not about making UTF-8 mode default.
This paragraph is about making locale.getpreferredencoding() ignore UTF-8 mode.

This function is very old and (ab)used for many purposes. I think excluding it from UTF-8 mode is safe.
Additionally, there is a discussion about deprecating it.
Since the locale module is about the locale, there is no strong reason for it to lie.

I want to promote “Please try UTF-8 mode.” That’s why I want to fix UTF-8 mode behavior before making UTF-8 mode default.


Ah, I misread the paragraph then. Perhaps change the heading to “Changes to locale.getpreferredencoding()”? (Making it the default is a change to UTF-8 mode, and since that’s what the whole document is about, it seems reasonable to infer that this section is about changing that aspect.)

Would we not just recommend locale.getdefaultencoding() instead? That’s the proper source of the actual system encoding. Opting in to UTF-8 is a preference, and so having the preferred encoding reflect that seems to be expected. I’m mainly thinking that in 10 years time, when UTF-8 mode is the default and nobody remembers the old way, it’ll seem weird for the “preferred” encoding to be anything other than UTF-8.


Awesome! I’ll repeat what I said here: PEP 597: Use UTF-8 for default text file encoding


As a Windows-only user for the past 10+ years, the absolute only time I’ve written/read things in something other than UTF-8 was when burning in subtitles to video that were created by others. In these cases one can only guess and therefore chardet was used.

Having the default be UTF-8 would have saved me lots of pain over the years.


I’d like to advertise UTF-8 mode before making it the default too. It’s hard to decide how long the advertising period should be.

I chose 3.12 because PEP 12 says “add a Python-Version header and set the value to the next planned version of Python, i.e. the one your new feature will hopefully make its first appearance in.” UTF-8 mode will hopefully become the default in Python 3.12, but it might be delayed to 3.13.

We can get feedback from:

  • Python users who opt in to UTF-8 mode.
  • Java users who have upgraded to JDK 18.
  • Python 3.12 alpha releases.

If we see a lot of trouble from such users, we can postpone the change by several years.

When I added EncodingWarning, I added the section about it in the io module doc.

I thought I had updated the tutorial, but I had not. I only opened an issue and did not fix it. I will do so.

https://bugs.python.org/issue41507

I want to change the default encoding as fast as possible because many people start learning Python every year.
On the other hand, I am very conservative about removing the backward compatibility option. For example, we have not deprecated PYTHONLEGACYWINDOWSFSENCODING and PYTHONLEGACYWINDOWSSTDIO yet. Mercurial has a plan to support UTF-8 paths on Windows instead of fragile ANSI code page paths, but they have not been able to implement it and still rely on PYTHONLEGACYWINDOWSFSENCODING.

Makes sense. I will add “in their first year”.

I agree that this title is misleading. I will fix the title and add some description why I want to fix UTF-8 mode behavior before making it default.

Getting encoding from locale.getdefaultencoding() is not so straightforward. We may need to add something like locale.get_encoding().

I agree, and that’s why I asked Victor to make getpreferredencoding(False) return UTF-8 in UTF-8 mode.

But when considering making UTF-8 mode the default, it makes the transition difficult. So I am proposing to exclude it from UTF-8 mode.
Fixing locale module should be discussed separately.

I wholeheartedly support the spirit of this proposal, but it needs refinement before it applies to source code. Specifically, UTF-8 default should not apply to identifiers. Allowing non-ascii characters in identifiers (specifically invisible unicode characters) is a serious security vulnerability, that would allow malicious code to pass through visual audits by humans.

https://www.lightbluetouchpaper.org/2021/11/01/trojan-source-invisible-vulnerabilities/
https://www.schneier.com/blog/archives/2021/11/hiding-vulnerabilities-in-source-code.html

PEP 8:

All identifiers in the Python standard library MUST use ASCII-only identifiers

Thank you for your feedback, but this PEP is not about Python source code and identifiers.

That ship already sailed in Python 3.0 - the default source encoding has been utf-8 for more than a decade at this point. The “Lexical analysis” chapter of the Python language reference spells out the Unicode categories that are permitted as part of an identifier (and yes, full Unicode support in source code does lead to security problems: https://trojansource.codes/)

This PEP is about interfaces that aren’t source code specific where the interpreter currently defaults to using the system encoding. The PEP adding UTF-8 mode is a good resource for the changes that enabling UTF-8 mode actually makes: PEP 540 – Add a new UTF-8 Mode

I’m not following the rationale for changing this override.

The key benefit of PEP 540 changing this function is that the open() builtin docs for the encoding option remained accurate even in UTF-8 mode:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

(Tangent: encoding="locale" isn’t currently mentioned either here or in the codecs module docs, it only appears to be in the io module docs)

Having to change the definition of the default encoding for open() when in UTF-8 mode still seems like a bad idea, regardless of whether UTF-8 mode is the default or not.

I do think you’re right that if the default is going to change we’ll want an API that specifically reports “The encoding that encoding="locale" will use”.

I just don’t see a compelling reason for locale.getpreferredencoding() to be that API - a new locale.getlocaleencoding() would seem more reasonable to me (and a new API could default to NOT doing setlocale()). A new API should also be relatively non-controversial to get into 3.11.


Also, see PEP 672, which is about this specific issue in Python and strategies for mitigating it; as others have said, it is entirely orthogonal to this PEP.


I am sending a PR adding a little more background about this change.

Currently, locale.getpreferredencoding(False) returns “UTF-8” when UTF-8 mode is enabled. This is because there was no plan to make UTF-8 mode the default when it was designed, and we wanted as many applications as possible to use UTF-8.
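
The current behaviour can be observed by asking a child interpreter for locale.getpreferredencoding(False) with UTF-8 mode switched on and off. This is a sketch with a hypothetical helper name, preferred_encoding; the value returned with UTF-8 mode off depends on the machine's locale, and the exact spelling of the UTF-8 name varies between Python versions.

```python
import subprocess
import sys


def preferred_encoding(utf8_mode):
    """Return locale.getpreferredencoding(False) from a child interpreter."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, "-X", f"utf8={int(utf8_mode)}", "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(preferred_encoding(True))   # a spelling of UTF-8
    print(preferred_encoding(False))  # whatever the locale reports
```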

But this behavior makes it difficult to make UTF-8 mode default. There is no “one obvious way” to get the locale encoding other than locale.getpreferredencoding(False) .

So this PEP proposes to change the behavior to ease the transition. UTF-8 mode will not affect the locale module anymore. People will need to replace locale.getpreferredencoding(False) with "utf-8" when they want to use UTF-8.

Yes, this PEP affects the definition. (Although the current definition is a bit wrong on Windows: when the first argument is 1, 2, or 3 (stdio), GetConsoleCP/GetConsoleOutputCP() is used. If the first argument is WindowsConsoleIO, UTF-8 is used.)

On the other hand, it reduces the difference between UTF-8 mode on and off. This is important for a smooth transition.

I will update the open() doc. Since encoding="locale" is just an option for TextIOWrapper, I didn’t describe it in the codecs module docs.

I have now come up with a new idea: adding a codec search function to the codecs module that returns an encoding based on the current locale when “locale” is looked up. But this idea is out of scope for this PEP.

Yes, this is an option. I chose the current approach in the PEP because it seems simple and eases the transition.

  • “UTF-8 mode affects default encoding of runtime and IO” is much clearer than “… and locale.getpreferredencoding(False)”
  • Users may use locale.getpreferredencoding(False) to get the locale-specific encoding. Making UTF-8 mode not affect it will reduce backward compatibility issues.
    • Neither EncodingWarning nor DeprecationWarning is emitted for locale.getpreferredencoding(False) for now.

FWIW, there is a discussion about deprecating locale.getpreferredencoding(False).

If locale.getpreferredencoding(False) is deprecated, we cannot use it as the definition of the encoding used when none is specified anyway.
Users of locale.getpreferredencoding(False) can migrate to new APIs slowly if UTF-8 mode doesn’t affect it.

The problem I have with changing what getpreferredencoding() does now is that the function currently serves two purposes:

  • answering “What encoding will encoding=None use in open()?” (regardless of mode)
  • answering “What encoding will encoding='locale' use in open()?” (only in the default mode, not in UTF-8 mode)

Even if you changed what getpreferredencoding() does to cover the second use case, you’d still have to add a second function to restore coverage of the first use case.

Better to leave the existing function alone, and add a new function to cover the case that isn’t being handled adequately.


OK. You are right.

@vstinner and @malemburg suggested to add new API.
https://bugs.python.org/issue47000#msg415146

I re-read the document of the getpreferredencoding(). It is:

Return the locale encoding used for text data, according to user preferences. User preferences are expressed differently on different systems, and might not be available programmatically on some systems, so this function only returns a guess.

Enabling or disabling UTF-8 mode is part of “user preferences”, so having it keep following UTF-8 mode makes sense.

I will update the PEP. And I will add a new API for encoding="locale" without waiting for this PEP to be accepted, because its absence seems like a bug with respect to the purpose of PEP 597.


The PEP update in python/peps pull request #2446 (“PEP 686: Update based on discussion”, by methane) nicely addresses the API concern I raised.

From a Linux perspective, I think the proposed change now largely looks good (on a suitable timeline), with the main concern being the potential impact on legitimate use of locales like GB-18030.

I’m wondering if we could actually make the EncodingWarning machinery a bit smarter, and promote it to a regular deprecation warning in the following case:

  • encoding=None is used in a text-mode call to open() (or any other API that relies on locale.getpreferredencoding()); and
  • locale.getpreferredencoding() returns something other than "utf-8"

Those are the cases where the behaviour will change if this PEP were to be implemented. Cases where either the locale encoding is already UTF-8 or else the C locale is being used and hence triggering UTF-8 mode and/or locale coercion anyway (which I believe is most modern Linux systems outside China at this point, although I admit I don’t have hard data on that point) won’t change, they’ll just get “utf-8” because UTF-8 mode is enabled rather than because the locale said it should be used. Emitting a warning for those cases is genuine noise, hence PEP 597 leaving EncodingWarning off by default.
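
The condition described above could be sketched as a predicate like the one below. This is only an illustration of the proposed rule, not existing interpreter behaviour; the function name would_change_under_utf8_default is hypothetical, and codecs.lookup is used to normalize spellings such as "UTF-8" and "utf8".

```python
import codecs


def would_change_under_utf8_default(encoding, locale_encoding):
    """Return True when a call's behaviour would change once UTF-8 is the default.

    encoding is the value passed to open(); locale_encoding is what
    locale.getpreferredencoding() reports on the current system.
    """
    if encoding is not None:
        # An explicit encoding (including "locale") is unaffected.
        return False
    # Normalize the codec name so "UTF-8", "utf8", and "utf-8" compare equal.
    return codecs.lookup(locale_encoding).name != "utf-8"


if __name__ == "__main__":
    print(would_change_under_utf8_default(None, "cp1252"))   # True
    print(would_change_under_utf8_default(None, "UTF-8"))    # False
    print(would_change_under_utf8_default("cp1252", "cp1252"))  # False
```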

The non-optional deprecation warning could reference three potential resolutions:

  • pass “encoding=‘utf-8’” if the file is meant to be in UTF-8 regardless of platform
  • pass “encoding=‘locale’” if the file is meant to be in the default system encoding, even when Python is in UTF-8 mode
  • pass “encoding=locale.getpreferredencoding()” if the file should keep its current UTF-8-mode-dependent behaviour

This idea doesn’t work well when the developer uses Unix and end users use Windows:

  • DeprecationWarning is suppressed for end users, so end users don’t see the warning.
  • Developers use a UTF-8 locale (macOS or Linux), so EncodingWarning is not emitted.

If we want to show a warning to developers, we would have to emit it even if the locale encoding is UTF-8. But that is too verbose.

If we want to show warnings to end users, we can just emit EncodingWarning even when warn_default_encoding is not true.

But promoting the opt-in EncodingWarning is simpler. We can promote UTF-8 mode and EncodingWarning in the Python 3.11 and 3.12 release notes.

Java changed the default encoding without opt-in or opt-out warning. We can learn from Java users.
If this backward incompatible change is really painful, we can postpone making UTF-8 mode the default and reconsider showing a warning by default.


The cases I’m mainly concerned about improving are:

  • developer on Linux using and developing for the GB-18030 locale (that works fine now, switching to UTF-8 mode by default will break it, so a standard deprecation warning is appropriate)
  • developer on Windows that’s developing for Windows (lots of Windows apps don’t default to UTF-8 yet, and I know Steve Dower has raised concerns about this aspect of the transition)

It definitely won’t catch everything (such as the cross-platform portability concerns for Windows), but it should help, and cross-platform testing will pick up regular deprecation warnings even if the specific encoding warnings aren’t enabled.


No, the console code pages are only supposed to be used for a console file if legacy mode is enabled by setting the environment variable PYTHONLEGACYWINDOWSSTDIO, and only for standard file descriptors 0-2. (This should all just apply to a console, but C isatty() is also true for nul and serial/parallel ports. It’s not intentional in those cases, just sloppy.) Even this hasn’t been implemented for a few versions now, since Victor broke it.

C:\>chcp.com
Active code page: 850

C:\>py -3.6 -c "import sys; print(sys.stdin.encoding)"
utf-8

C:\>set PYTHONLEGACYWINDOWSSTDIO=1
C:\>py -3.6 -c "import sys; print(sys.stdin.encoding)"
cp850
C:\>py -3.7 -c "import sys; print(sys.stdin.encoding)"
cp850

Broken in 3.8+:

C:\>py -3.8 -c "import sys; print(sys.stdin.encoding)"
cp1252
C:\>py -3.9 -c "import sys; print(sys.stdin.encoding)"
cp1252
C:\>py -3.10 -c "import sys; print(sys.stdin.encoding)"
cp1252
C:\>py -3.11 -c "import sys; print(sys.stdin.encoding)"
cp1252

I don’t think anyone uses legacy mode, so it’s not a pressing concern. There would be complaints from someone else if it mattered. I’d prefer to drop support for PYTHONLEGACYWINDOWSSTDIO since no one seems to be using it.