Hi, all.
Microsoft changed the default text encoding of notepad.exe to UTF-8 in the Windows 10 May 2019 Update!
I propose changing Python’s default text encoding too, starting in 2021. I believe 2021 is not too early for this change. (If we release 3.9 in 2020, this PEP will be applied to 3.10, although a deprecation warning is raised from 3.8.)
Abstract
Currently, TextIOWrapper uses locale.getpreferredencoding(False)
(hereinafter called “locale encoding”) when encoding is not specified.
This PEP proposes changing the default text encoding to “UTF-8”
regardless of platform or locale.
Motivation
People assume it is always UTF-8
Package authors using macOS or Linux may forget that the default encoding
is not always UTF-8.
For example, long_description = open("README.md").read() in setup.py
is a common mistake. If there is at least one emoji or any
other non-ASCII character in the README.md file, many Windows users
cannot install the package due to a UnicodeDecodeError.
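As a minimal sketch of the fix, the README can be read with an explicit encoding. The temporary file and the choice of cp1252 as the simulated Windows locale encoding are assumptions made for illustration; they are not part of the PEP.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    readme = os.path.join(tmp, "README.md")

    # Create a README that contains a non-ASCII character, saved as UTF-8.
    with open(readme, "w", encoding="utf-8") as f:
        f.write("Example package \U0001F381\n")

    # Portable: the encoding is explicit, so the locale does not matter.
    with open(readme, encoding="utf-8") as f:
        long_description = f.read()

    # Simulate omitting encoding on a machine whose locale encoding is
    # cp1252 (an assumed example). The read either fails or yields mojibake.
    try:
        with open(readme, encoding="cp1252") as f:
            legacy_result = f.read()
    except UnicodeDecodeError as exc:
        legacy_result = exc
```

With the explicit encoding, long_description holds the original text on every platform, while legacy_result is either an exception or garbled text.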
Active code page is not stable
Some tools on Windows change the active code page to 65001 (UTF-8), and
Microsoft is using UTF-8 and cp65001 more widely in recent versions of
Windows 10.
For example, “Command Prompt” uses the legacy code page by default.
But the Windows Subsystem for Linux (WSL) changes the active code page to
65001, and python.exe can be executed from the WSL. So python.exe
executed from the legacy console and from the WSL cannot read text files
written by each other.
But many Windows users don’t understand which code page is active.
So changing the default text file encoding based on the active code page
causes confusion.
A consistent default text encoding will make Python’s behavior more predictable
and easier to learn.
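The mismatch can be reproduced without changing any code page by encoding the same text two ways. This is only an illustration: cp932 stands in for a legacy Japanese code page, which is an assumption about the user's system.

```python
text = "こんにちは"

# Bytes as they would be written under code page 65001 (UTF-8).
written_as_utf8 = text.encode("utf-8")
# Bytes as they would be written under a legacy code page (cp932 assumed).
written_as_cp932 = text.encode("cp932")

# Reading the legacy file as UTF-8 fails outright.
try:
    written_as_cp932.decode("utf-8")
    legacy_readable_as_utf8 = True
except UnicodeDecodeError:
    legacy_readable_as_utf8 = False

# Reading the UTF-8 file with the legacy code page does not give the
# original text back (errors="replace" keeps the sketch from raising).
utf8_read_as_cp932 = written_as_utf8.decode("cp932", errors="replace")
```

The same text produces different bytes in each environment, so a default that follows the active code page cannot round-trip files between them.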
Using UTF-8 by default is easier on new programmers
Python is one of the most popular first programming languages.
New programmers may not know about encoding. When they download text data
written in UTF-8 from the Internet, they are forced to learn about encoding.
Popular text editors like VS Code or Atom use UTF-8 by default.
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
Update. (Note that Python 3.9 will be released in 2021.)
Additionally, the default encoding of Python source files is UTF-8.
We can assume new Python programmers who don’t know about encoding
use editors which use UTF-8 by default.
It would be nice if new programmers were not forced to learn about encoding
until they need to handle text files in an encoding other than UTF-8.
Specification
From Python 3.9, the default encoding of TextIOWrapper and open() is
changed from locale.getpreferredencoding(False) to “UTF-8”.
When there is a device encoding (os.device_encoding(buffer.fileno())),
it still supersedes the default encoding.
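The selection rule above can be sketched in a few lines. The helper name default_text_encoding is hypothetical and this is not how TextIOWrapper is implemented; it only mirrors the order described in this section.

```python
import os


def default_text_encoding(fileno=None):
    """Hypothetical helper: pick the encoding used when none is given."""
    if fileno is not None:
        try:
            # Returns an encoding name only when fileno is a terminal.
            device = os.device_encoding(fileno)
        except OSError:
            device = None
        if device:
            return device
    # Proposed default, regardless of platform or locale.
    return "utf-8"
```

For a regular file, os.device_encoding returns None, so the helper falls through to "utf-8".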
Unaffected areas
Unlike UTF-8 mode, locale.getpreferredencoding(False) still respects
the locale encoding.
stdin, stdout, and stderr continue to respect the locale encoding
as well. For example, these commands do not cause mojibake regardless of the
active code page:
> python -c "print('こんにちは')" | more
こんにちは
> python -c "print('こんにちは')" > temp.txt
> type temp.txt
こんにちは
Pipes and TTY should use the locale encoding:
- subprocess and os.popen use the locale encoding because the
  subprocess will use the locale encoding.
- getpass.getpass uses the locale encoding when using TTY.
Affected APIs
All other code using the default encoding of TextIOWrapper or open is
affected. This is an incomplete list of APIs affected by this PEP:
- lzma.open, gzip.open, bz2.open, ZipFile.read_text
- socket.makefile
- tempfile.TemporaryFile, tempfile.NamedTemporaryFile
- trace.CoverageResults.write_results_file
These APIs will always use “UTF-8” when opening text files.
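Code that wants the same behavior before and after the change can already pass the encoding explicitly to these wrappers. A small sketch using two of the listed APIs follows; the file name and sample strings are made up for the example.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt.gz")

    # gzip.open accepts encoding in text mode ("wt" / "rt").
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("こんにちは\n")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        gzip_text = f.read()

# tempfile.TemporaryFile also forwards encoding to the text layer.
with tempfile.TemporaryFile("w+", encoding="utf-8") as f:
    f.write("naïve café\n")
    f.seek(0)
    temp_text = f.read()
```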
Deprecation Warning
From 3.8 onwards, DeprecationWarning is shown when encoding is omitted and
the locale encoding is not UTF-8. This helps not only when writing
forward-compatible code, but also when investigating an unexpected
UnicodeDecodeError caused by assuming the default text encoding is UTF-8.
(See “People assume it is always UTF-8” above.)
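The condition for the warning can be sketched as follows. Both function names are hypothetical and the warning text is invented; the sketch only shows the rule "encoding omitted and locale encoding is not UTF-8", not the interpreter's actual implementation.

```python
import codecs
import locale
import warnings


def needs_encoding_warning(encoding, locale_encoding=None):
    """Hypothetical check: True when the proposed warning would apply."""
    if encoding is not None:
        return False  # the caller was explicit, nothing to warn about
    if locale_encoding is None:
        locale_encoding = locale.getpreferredencoding(False)
    # Normalize spellings such as "UTF-8", "utf8", "UTF8".
    return codecs.lookup(locale_encoding).name != "utf-8"


def open_text(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open() that emits the warning."""
    if needs_encoding_warning(encoding):
        warnings.warn(
            "encoding is omitted and the locale encoding is not UTF-8",
            DeprecationWarning,
            stacklevel=2,
        )
    return open(path, mode, encoding=encoding, **kwargs)
```

Passing locale_encoding explicitly makes the check easy to exercise on any machine, for example needs_encoding_warning(None, "cp1252").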
Rationale
Why not just enable UTF-8 mode by default?
This PEP is not mutually exclusive with UTF-8 mode.
If we enable UTF-8 mode by default, even people using Windows will forget
the default encoding is not always UTF-8. More scripts will be written
assuming the default encoding is UTF-8.
So changing the default encoding of text files to UTF-8 would be better
even if UTF-8 mode is enabled by default at some point.
Why not change std(in|out|err) encoding too?
Even when the locale encoding is not UTF-8, there can be many UTF-8
text files. These files could be downloaded from the Internet or
written by modern text editors.
On the other hand, terminal encoding is assumed to be the same as
locale encoding. And other tools are assumed to read and write the
locale encoding as well.
std(in|out|err) are likely to be connected to a terminal or other tools.
So the locale encoding should be respected.
Why not always warn when encoding is omitted?
Omitting encoding is a common mistake when writing portable code.
But when portability does not matter, assuming UTF-8 is not so bad because
Python already implements locale coercion (PEP 538) and UTF-8 mode
(PEP 540).
And these scripts will become portable when the default encoding is changed
to UTF-8.
Backward compatibility
There may be scripts relying on the locale encoding or active code page not
being UTF-8. They must be rewritten to specify encoding explicitly.
- If the script assumes latin1 or cp932, encoding="latin1" or
  encoding="cp932" should be used.
- If the script is designed to respect locale encoding,
  locale.getpreferredencoding(False) should be used.
  There are non-portable short forms of locale.getpreferredencoding(False).
  - On Windows, "mbcs" can be used instead.
  - On Unix, sys.getfilesystemencoding() can be used instead.
Note that such scripts will be broken even without upgrading Python, such as
when:
- Upgrading Windows
- Changing the language setting
- Changing terminal from legacy console to a modern one
- Using tools which do chcp 65001
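Both rewrites described in this section might look like the following sketch. The file names and sample text are invented for the example, and cp932 is used only because the section mentions it.

```python
import locale
import os
import tempfile

# Resolve the locale encoding once, explicitly.
locale_encoding = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    # Case 1: the script assumes a fixed legacy encoding, so it names it.
    legacy_path = os.path.join(tmp, "legacy.txt")
    with open(legacy_path, "w", encoding="cp932") as f:
        f.write("日本語\n")
    with open(legacy_path, encoding="cp932") as f:
        legacy_text = f.read()

    # Case 2: the script is designed to follow the user's locale.
    locale_path = os.path.join(tmp, "locale.txt")
    with open(locale_path, "w", encoding=locale_encoding) as f:
        f.write("plain ASCII text\n")
    with open(locale_path, encoding=locale_encoding) as f:
        locale_text = f.read()
```

Either form keeps working unchanged whether or not the default encoding is switched to UTF-8.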
How to Teach This
When opening text files, “UTF-8” is used by default. It is consistent with
the default encoding used for text.encode().
Open Issues
Alias for locale encoding
encoding=locale.getpreferredencoding(False) is too long, and "mbcs" and
sys.getfilesystemencoding() are not portable.
It may be possible to add a new “locale” encoding alias as an easy and
portable version of locale.getpreferredencoding(False).
The difficulty of this is uncertain because encodings is currently
imported prior to _bootlocale.
Another option is for TextIOWrapper to treat "locale" as a special
case:
    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)
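Until such an alias exists, the special case can be approximated in user code. The helper name resolve_encoding is hypothetical; it is a sketch of the idea, not a proposed API.

```python
import codecs
import locale


def resolve_encoding(encoding):
    """Hypothetical helper: map the pseudo-name "locale" to a real codec."""
    if encoding == "locale":
        encoding = locale.getpreferredencoding(False)
    # codecs.lookup validates the name and returns its normalized form.
    return codecs.lookup(encoding).name
```

A caller could then write open(path, encoding=resolve_encoding("locale")) and get the same result on every platform where the locale encoding is a known codec.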