Abstract
This PEP proposes enabling UTF-8 mode [1] by default.
With this change, Python consistently uses UTF-8 as the default encoding for files, stdio, and pipes.
Motivation
UTF-8 has become the de facto standard text encoding.
- The default encoding of Python source files is UTF-8.
- JSON, TOML, and YAML use UTF-8.
- Most text editors, including VS Code and Windows Notepad, use UTF-8 by default.
- Most websites and text data on the internet use UTF-8.
- Many other popular programming languages and runtimes, including Node.js, Go, Rust, Ruby, and Java, use UTF-8 by default.
Changing the default encoding to UTF-8 makes it easier for Python to interoperate with them.
Additionally, many Python developers using Unix forget that the default encoding is platform dependent. They omit encoding="utf-8" when they read text files encoded in UTF-8 (e.g. JSON, TOML, Markdown, and Python source files). This inconsistent default encoding has caused many bugs.
Specification
Changes to UTF-8 mode
Currently, UTF-8 mode affects locale.getpreferredencoding().
This PEP proposes to remove this override. UTF-8 mode will not affect the locale module.
After this change, UTF-8 mode affects:
- stdin, stdout, stderr
  - User can override it with PYTHONIOENCODING.
- filesystem encoding
- TextIOWrapper and APIs using it, including open(), Path.read_text(), subprocess.Popen(cmd, text=True), etc.
This change will be introduced in Python 3.11 if possible.
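To see which of these encodings are in effect on a given interpreter, a small inspection helper can be useful. The following is a minimal sketch rather than part of the proposal; the function name describe_text_encodings and the dictionary keys are hypothetical, while the calls it makes (sys.flags.utf8_mode, sys.getfilesystemencoding(), locale.getpreferredencoding(), io.TextIOWrapper) are standard library APIs.

```python
import io
import locale
import sys


def describe_text_encodings():
    """Report the encodings that UTF-8 mode influences in this interpreter.

    Hypothetical helper for illustration only; the key names are arbitrary.
    """

    def stream_encoding(stream):
        # Standard streams can be None or replaced by objects without an
        # ``encoding`` attribute, so read the attribute defensively.
        return getattr(stream, "encoding", None)

    # A TextIOWrapper created without ``encoding`` picks the same default
    # that open() and Path.read_text() use when no encoding is given.
    wrapper = io.TextIOWrapper(io.BytesIO())
    try:
        text_io_default = wrapper.encoding
    finally:
        wrapper.close()

    return {
        "utf8_mode": bool(sys.flags.utf8_mode),
        "stdin": stream_encoding(sys.stdin),
        "stdout": stream_encoding(sys.stdout),
        "stderr": stream_encoding(sys.stderr),
        "filesystem": sys.getfilesystemencoding(),
        "text_io_default": text_io_default,
        # Currently overridden by UTF-8 mode; this PEP proposes to
        # remove that override.
        "locale_preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_text_encodings().items():
        print(f"{name}: {value}")
```

Running this once normally and once with UTF-8 mode enabled shows which entries change on a particular platform.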
Enable UTF-8 mode by default
Python enables UTF-8 mode by default.
Users can still disable UTF-8 mode by setting PYTHONUTF8=0 or passing -X utf8=0.
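The effect of the two switches can be checked by starting a child interpreter and reading sys.flags.utf8_mode. This is a sketch under the assumption that the child is the same Python executable as the parent; the helper name child_utf8_mode and its parameters are hypothetical.

```python
import os
import subprocess
import sys


def child_utf8_mode(xoption=None, env_value=None):
    """Return sys.flags.utf8_mode as seen by a freshly started interpreter.

    Hypothetical helper: ``xoption`` is passed after ``-X`` (for example
    "utf8=0"), and ``env_value`` is used as the value of PYTHONUTF8.
    """
    env = dict(os.environ)
    # Start from a clean slate so an inherited setting does not leak in.
    env.pop("PYTHONUTF8", None)
    if env_value is not None:
        env["PYTHONUTF8"] = env_value

    cmd = [sys.executable]
    if xoption is not None:
        cmd += ["-X", xoption]
    cmd += ["-c", "import sys; print(sys.flags.utf8_mode)"]

    result = subprocess.run(cmd, env=env, capture_output=True, check=True)
    return int(result.stdout.decode("ascii").strip())


if __name__ == "__main__":
    print("PYTHONUTF8=0 ->", child_utf8_mode(env_value="0"))
    print("-X utf8=0    ->", child_utf8_mode(xoption="utf8=0"))
    print("-X utf8=1    ->", child_utf8_mode(xoption="utf8=1"))
```

Calling the helper with no arguments reports the interpreter's own default, which is the value this PEP proposes to change.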
Backward Compatibility
Most Unix systems use a UTF-8 locale, and Python already enables UTF-8 mode when the locale is C or POSIX. So this change mostly affects Windows users.
When a Python program depends on the default encoding, this change may cause UnicodeError, mojibake, or even silent data corruption. So this change should be announced very loudly.
To resolve this backward incompatibility, users can:
- Disable UTF-8 mode.
- Use EncodingWarning to find where the default encoding is used, and use the encoding="locale" option to keep using the locale encoding. [2]
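The second option can be sketched as follows. EncodingWarning and encoding="locale" were added in Python 3.10 by PEP 597, and the warning is emitted only when the interpreter runs with -X warn_default_encoding (or PYTHONWARNDEFAULTENCODING=1). The helper name default_encoding_warnings is hypothetical, and filtering stderr by the text "EncodingWarning" is an assumption about how one might collect the results, not a prescribed workflow.

```python
import subprocess
import sys


def default_encoding_warnings(script_path):
    """Run a script with EncodingWarning enabled and return the warning lines.

    Hypothetical helper; requires Python 3.10 or later to report anything.
    """
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", script_path],
        capture_output=True,
    )
    stderr = result.stderr.decode("utf-8", errors="replace")
    # Each report names the file and line that omitted ``encoding``.
    return [line for line in stderr.splitlines() if "EncodingWarning" in line]


# Once a call site is found, state the intent explicitly:
#   open(path, encoding="utf-8")   # the file is UTF-8 on every platform
#   open(path, encoding="locale")  # keep using the locale encoding (3.10+)
```

Each reported line points at a call that relies on the default encoding, so the call sites can be reviewed one by one before UTF-8 mode becomes the default.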
Preceding examples
- Ruby changed the default external_encoding to UTF-8 on Windows in Ruby 3.0 (2020). [3]
- Java changed the default text encoding to UTF-8 in JDK 18 (2022). [4]
Both Ruby and Java have an option for backward compatibility. Neither provides a warning like Python's EncodingWarning [2] for use of the default encoding.
Rejected Alternative
Deprecate implicit encoding
Deprecating use of the default encoding was considered.
But there are many cases where users rely on the default encoding when they only need ASCII. And some users use Python only on Unix with a UTF-8 locale.
So forcing users to specify the encoding option everywhere would be too painful.
Java also rejected this idea [4].
How to teach this
For new users, this change reduces the number of things that need to be taught.
Users can delay learning about text encoding until they need to handle non-UTF-8 text files.
For existing users, see the Backward Compatibility section.
Resources
[1] PEP 540: Add a new UTF-8 Mode
[2] PEP 597: Add optional EncodingWarning
[3] Set default for Encoding.default_external to UTF-8 on Windows
[4] JEP 400: UTF-8 by Default