PEP 597: Use UTF-8 for default text file encoding

methane · June 5, 2019, 12:18pm

Hi, all.

Microsoft changed the default text encoding of notepad.exe to UTF-8 in the Windows 10 May 2019 Update!

I propose changing Python’s default text encoding too, from 2021. I believe 2021 is not too early for this change. (If we release 3.9 in 2020, this PEP will be applied to 3.10, although the deprecation warning is raised from 3.8.)

Abstract

Currently, TextIOWrapper uses locale.getpreferredencoding(False)
(hereinafter called “locale encoding”) when encoding is not specified.

This PEP proposes changing the default text encoding to “UTF-8”
regardless of platform or locale.

Motivation

People assume it is always UTF-8

Package authors using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, long_description = open("README.md").read() in
setup.py is a common mistake. If there is at least one emoji or any
other non-ASCII character in the README.md file, many Windows users
cannot install the package due to a UnicodeDecodeError.
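
A minimal sketch of the portable form of that setup.py line. The helper name read_long_description is hypothetical; the only real change is passing encoding explicitly so the result no longer depends on the locale of the machine running setup.py:

```python
def read_long_description(path="README.md"):
    """Read a README as UTF-8 regardless of the platform's locale encoding."""
    # Passing encoding explicitly avoids falling back to
    # locale.getpreferredencoding(False), e.g. a legacy code page on Windows.
    with open(path, encoding="utf-8") as f:
        return f.read()
```

In setup.py this would be used as long_description=read_long_description().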

Active code page is not stable

Some tools on Windows change the active code page to 65001 (UTF-8), and
Microsoft is using UTF-8 and cp65001 more widely in recent versions of
Windows 10.

For example, “Command Prompt” uses the legacy code page by default.
But the Windows Subsystem for Linux (WSL) changes the active code page to
65001, and python.exe can be executed from the WSL. So python.exe
executed from the legacy console and from the WSL cannot read text files
written by each other.

But many Windows users don’t understand which code page is active.
So changing the default text file encoding based on the active code page
causes confusion.

A consistent default text encoding will make Python’s behavior more predictable
and easier to learn.

Using UTF-8 by default is easier on new programmers

Python is one of the most popular first programming languages.

New programmers may not know about encoding. When they download text data
written in UTF-8 from the Internet, they are forced to learn about encoding.

Popular text editors like VS Code or Atom use UTF-8 by default.
Even Microsoft Notepad uses UTF-8 by default since the Windows 10 May 2019
Update. (Note that Python 3.9 will be released in 2021.)

Additionally, the default encoding of Python source files is UTF-8.
We can assume new Python programmers who don’t know about encoding
use editors which use UTF-8 by default.

It would be nice if new programmers were not forced to learn about encoding
until they need to handle text files in an encoding other than UTF-8.

Specification

From Python 3.9, the default encoding of TextIOWrapper and open() is
changed from locale.getpreferredencoding(False) to “UTF-8”.

When there is device encoding (os.device_encoding(buffer.fileno())),
it still supersedes the default encoding.
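
To see which encoding an installation picks today, something like the following can be run. It is only an illustrative sketch: the temporary file is a stand-in, and the value printed for the implicit default depends on the platform, locale, and Python version.

```python
import locale
import os
import tempfile

# The encoding the Abstract calls the "locale encoding".
locale_encoding = locale.getpreferredencoding(False)

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w") as f:                    # encoding omitted: implicit default
        implicit_encoding = f.encoding
    with open(path, "w", encoding="utf-8") as f:  # explicit: same on every platform
        explicit_encoding = f.encoding
finally:
    os.remove(path)

print("locale encoding:  ", locale_encoding)
print("implicit default: ", implicit_encoding)
print("explicit encoding:", explicit_encoding)
```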

Unaffected areas

Unlike UTF-8 mode, locale.getpreferredencoding(False) still respects
locale encoding.

stdin, stdout, and stderr continue to respect locale encoding
as well. For example, these commands do not cause mojibake regardless of the
active code page:

   > python -c "print('こんにちは')" | more
   こんにちは
   > python -c "print('こんにちは')" > temp.txt
   > type temp.txt
   こんにちは

Pipes and TTY should use the locale encoding:

subprocess and os.popen use the locale encoding because the
subprocess will use the locale encoding.
getpass.getpass uses the locale encoding when using TTY.

Affected APIs

All other code using the default encoding of TextIOWrapper or open are
affected. This is an incomplete list of APIs affected by this PEP:

lzma.open, gzip.open, bz2.open, ZipFile.read_text
socket.makefile
tempfile.TemporaryFile, tempfile.NamedTemporaryFile
trace.CoverageResults.write_results_file

These APIs will always use “UTF-8” when opening text files.
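
Code that wants the same behavior before and after the change can already pass encoding to these APIs. A small sketch with two of the listed functions (the file name is made up for the example):

```python
import gzip
import os
import tempfile

text = "こんにちは\n"

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "greeting.txt.gz")
    # Text mode ("wt"/"rt") wraps the stream in a TextIOWrapper, so the
    # encoding argument matters here just as it does for open().
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write(text)
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        gz_round_trip = f.read()

# NamedTemporaryFile is binary by default; in text mode it accepts encoding too.
with tempfile.NamedTemporaryFile("w+", encoding="utf-8") as f:
    f.write(text)
    f.seek(0)
    tmp_round_trip = f.read()
```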

Deprecation Warning

From 3.8 onwards, DeprecationWarning is shown when encoding is omitted and
the locale encoding is not UTF-8. This helps not only when writing
forward-compatible code, but also when investigating an unexpected
UnicodeDecodeError caused by assuming the default text encoding is UTF-8.
(See “People assume it is always UTF-8” above.)
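
The rule above can be sketched in pure Python. This is not the proposed implementation: warn_if_encoding_omitted is a hypothetical function, and the locale encoding is passed in as a parameter so the logic can be tried on any machine.

```python
import codecs
import warnings


def _normalize(name):
    # codecs.lookup() maps aliases such as "UTF8" or "utf_8" to one canonical name.
    return codecs.lookup(name).name


def warn_if_encoding_omitted(mode, encoding, locale_encoding):
    """Warn when a text-mode open omits encoding and the locale encoding is not UTF-8."""
    is_text = "b" not in mode
    if is_text and encoding is None and _normalize(locale_encoding) != "utf-8":
        warnings.warn(
            f"encoding omitted; the locale encoding {locale_encoding!r} is used now, "
            "but UTF-8 would be used after the proposed change",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```

With this rule, users whose locale encoding is already UTF-8 never see the warning, which is the population for whom the change makes no difference.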

Rationale

Why not just enable UTF-8 mode by default?

This PEP is not mutually exclusive with UTF-8 mode.

If we enable UTF-8 mode by default, even people using Windows will forget
the default encoding is not always UTF-8. More scripts will be written
assuming the default encoding is UTF-8.

So changing the default encoding of text files to UTF-8 would be better
even if UTF-8 mode is enabled by default at some point.

Why not change std(in|out|err) encoding too?

Even when the locale encoding is not UTF-8, there can be many UTF-8
text files. These files could be downloaded from the Internet or
written by modern text editors.

On the other hand, terminal encoding is assumed to be the same as
locale encoding. And other tools are assumed to read and write the
locale encoding as well.

std(in|out|err) are likely to be connected to a terminal or other tools.
So the locale encoding should be respected.

Why not always warn when encoding is omitted?

Omitting encoding is a common mistake when writing portable code.

But when portability does not matter, assuming UTF-8 is not so bad because
Python already implements locale coercion (PEP 538) and UTF-8 mode
(PEP 540).

And these scripts will become portable when the default encoding is changed
to UTF-8.

Backward compatibility

There may be scripts relying on the locale encoding or active code page not
being UTF-8. They must be rewritten to specify encoding explicitly.

If the script assumes latin1 or cp932, encoding="latin1"
or encoding="cp932" should be used.
If the script is designed to respect locale encoding,
locale.getpreferredencoding(False) should be used.

There are non-portable short forms of
locale.getpreferredencoding(False).
- On Windows, "mbcs" can be used instead.
- On Unix, sys.getfilesystemencoding() can be used instead.
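
Both rewrites are one-argument changes. A short sketch with made-up file names, showing a script that knows its file is latin1 and a script that deliberately follows the user's locale:

```python
import locale
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    # Case 1: the script knows the file is latin1, so it says so.
    with open(legacy_path, "w", encoding="latin1") as f:
        f.write("café")
    with open(legacy_path, encoding="latin1") as f:
        legacy_text = f.read()

    locale_path = os.path.join(tmp, "locale.txt")
    # Case 2: the script is designed to follow the user's locale.
    enc = locale.getpreferredencoding(False)
    with open(locale_path, "w", encoding=enc) as f:
        f.write("hello")
    with open(locale_path, encoding=enc) as f:
        locale_text = f.read()
```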

Note that such scripts will be broken even without upgrading Python, such as
when:

Upgrading Windows
Changing the language setting
Changing terminal from legacy console to a modern one
Using tools which do chcp 65001

How to Teach This

When opening text files, “UTF-8” is used by default. It is consistent with
the default encoding used for text.encode().
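
For comparison, this is the existing str.encode() / bytes.decode() behavior that the new default would match:

```python
text = "héllo"
data = text.encode()          # same as text.encode("utf-8")
round_trip = data.decode()    # same as data.decode("utf-8")
print(data, round_trip)
```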

Open Issues

Alias for locale encoding

encoding=locale.getpreferredencoding(False) is too long, and
"mbcs" and sys.getfilesystemencoding() are not portable.

It may be possible to add a new “locale” encoding alias as an easy and
portable version of locale.getpreferredencoding(False).

The difficulty of this is uncertain because encodings is currently
imported prior to _bootlocale.

Another option is for TextIOWrapper to treat "locale" as a special
case:

   if encoding == "locale":
       encoding = locale.getpreferredencoding(False)
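
A slightly fuller sketch of that special case, written outside TextIOWrapper so it can be tried today. resolve_encoding and wrap_text are hypothetical names, and treating None as UTF-8 is this PEP's proposed default, not current behavior:

```python
import io
import locale


def resolve_encoding(encoding):
    """Map the special name "locale" to the real locale encoding; default to UTF-8."""
    if encoding is None:
        return "utf-8"          # the default this PEP proposes
    if encoding == "locale":
        return locale.getpreferredencoding(False)
    return encoding


def wrap_text(buffer, encoding=None):
    # Hypothetical helper: builds a TextIOWrapper with the resolved encoding.
    return io.TextIOWrapper(buffer, encoding=resolve_encoding(encoding))
```

For example, wrap_text(io.BytesIO(data), "locale") opts in to the locale encoding, while wrap_text(io.BytesIO(data)) decodes as UTF-8.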

vstinner · June 6, 2019, 10:34pm

Hi INADA-san,

First, thanks for proposing that. I think it has been a popular request for years, so it’s good to have a document to discuss it in depth.

I made a similar proposal in 2011, but it was way too early: Python-Dev: open(): set the default encoding to ‘utf-8’ in Python 3.3?

I also added UTF-8 Mode (PEP 540), and you were my BDFL-delegate. I chose to disable it by default because I was not brave enough to enable it by default (I was afraid of breaking some legitimate use cases). But the UTF-8 Mode uses surrogateescape by default and makes locale.getpreferredencoding() always return UTF-8 (ignoring the locale, and so lying!), which are big differences; see PEP 540: Encoding and error handler. This PEP uses the UTF-8 encoding with the strict error handler, which makes decoding fail at the first undecodable byte.

It would be interesting to add a new encoding named “locale” which would be locale.getpreferredencoding(False). If you want to decode a text file from the current locale, you would have to opt in to that one. IMHO it deserves to be added because it has been the default for 10 years, so people are used to it and it would be strange to really “lose a feature”. “latin1”, “cp932” and “mbcs” are all different from “locale”, which is a different encoding. “locale” really means the current locale encoding, which changes depending on the platform and the user locale.

Would you mind saying something about the Unicode Byte Order Mark (BOM)? IMHO we should advise against using a BOM. It causes more harm than anything else. And I’m quite sure that someone will come up with the idea of using the BOM to guess the encoding of a file. Note: Python already supports the “utf-8-sig” encoding, which is useful for reading a file which may or may not start with a BOM.
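
A quick illustration of how “utf-8-sig” differs from plain “utf-8”, using in-memory bytes rather than a real file:

```python
import codecs

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

# "utf-8-sig" strips a leading BOM if present and changes nothing otherwise.
sig_with = with_bom.decode("utf-8-sig")
sig_without = without_bom.decode("utf-8-sig")
# Plain "utf-8" keeps the BOM as a U+FEFF character at the start of the text.
plain_with = with_bom.decode("utf-8")
# When encoding, "utf-8-sig" always writes a BOM.
written = "hello".encode("utf-8-sig")
```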

From 3.8 onwards, DeprecationWarning is shown when encoding is omitted and the locale encoding is not UTF-8.

Honestly, I’m not sure about that one. I don’t expect that users will be able to deal with such a warning, whereas DeprecationWarning is now displayed by default in the main module: PEP 565 – Show DeprecationWarning in __main__

I would prefer to either not report any warning, or “hide it” better. For example, only emit a warning in the development mode (-X dev). Or maybe use ResourceWarning rather than DeprecationWarning. I don’t know.

Maybe it should be an opt-in option, rather than opt-out (configure Python to hide it).

I don’t recall any application or other programming language complaining about my locale.

vstinner · June 6, 2019, 10:48pm

Ok, now the unpleasant part. My main worry about this PEP is the risk of mojibake and getting more UnicodeDecodeError exceptions. First, compared to the UTF-8 Mode, this risk is limited by the usage of the strict error handler. That’s a good start.

The general issue with PEP 597 is that it uses different encodings for things which are supposed to be isolated. But data is not isolated: data moves from one place to another. Data commonly impacted by encodings:

file content
command line arguments: sys.argv
standard streams: sys.stdin, sys.stdout, sys.stderr
environment variables: os.environ
filenames: os.listdir(str) for example
pipes: subprocess.Popen using subprocess.PIPE for example
error messages: os.strerror(code) for example
user and terminal names: os, grp and pwd modules
host name, UNIX socket path: see the socket module
etc.

Currently, the PEP uses a different encoding for file content (UTF-8) and standard streams. Example:

ls > files
python3.9 -c 'for name in open("files"): print(name)'

The Unix ls command uses the locale encoding. If a filename cannot be decoded from UTF-8, the loop will raise a UnicodeDecodeError. Python decodes the text file from UTF-8 but then encodes it to the stdout encoding. That might lead to mojibake.
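
The failure can be reproduced without changing the locale. In this sketch the bytes that ls would write under a Latin-1 locale are written by hand; the Latin-1 locale and the filename “café” are assumptions made for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, "files")
    # Stand-in for `ls > files` under a Latin-1 locale: the filename "café"
    # is written as Latin-1 bytes.
    with open(listing, "wb") as f:
        f.write("café\n".encode("latin-1"))

    try:
        with open(listing, encoding="utf-8") as f:
            names = [line.rstrip("\n") for line in f]
        outcome = "decoded"
    except UnicodeDecodeError as exc:
        outcome = f"UnicodeDecodeError: {exc.reason}"

    # Decoding with the encoding the data was written in succeeds.
    with open(listing, encoding="latin-1") as f:
        names = [line.rstrip("\n") for line in f]

print(outcome)
print(names)
```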

In the early days of Python 3, we had the PYTHONFSENCODING environment variable and the sys.setfilesystemencoding() function, which let the user choose the encoding used by Python for “some” operations. These features caused many implementation issues, but even worse, they caused a lot of mojibake. I removed both features and it made the code much simpler and more reliable! I wrote a series of 6 articles about the history of encodings used by Python over the last 10 years; this article is about PYTHONFSENCODING and setfilesystemencoding():
https://vstinner.github.io/painful-history-python-filesystem-encoding.html

I would like to generalize: using two different encodings in Python carries a high risk of mojibake, and that’s what the PEP proposes when the locale encoding is not UTF-8.

See also the “Use Cases” section of an old version of my PEP 540: it tries to explain different practical issues caused by using different encodings between different applications exchanging data:
http://haypo.alwaysdata.net/tmp/pep-0540.html#use-cases

(Oops, I sent an incomplete message. I edited it to finish it.)

methane · June 6, 2019, 11:30pm

Hmm, doesn’t UTF-8 mode use the “strict” error handler for files?
This PEP doesn’t propose changing anything about stdio, pipes to subprocesses, or TTY.

Yes, I added it to the “Open Issues” section of this PEP after you wrote about it on GitHub.
I’ll try to implement it to see how simple it is. If it is simple, I’ll propose backporting it to 3.8 for a smooth transition.

OK, I will add a Rationale section about why we don’t use BOM by default.

Some users can, and some users can not.

Think about programmers who write a script on Unix and copy it to Windows.
The script may raise “UnicodeDecodeError” because the default encoding is not UTF-8. (It is a very common scenario in setup.py.)

They may assume (or believe) the default encoding is UTF-8. They will be very confused by the error.
This warning will teach them that the default encoding is not UTF-8 on Windows, in Python 3.8.

Another use case: a user may report a UnicodeDecodeError on the GitHub issue tracker or Stack Overflow. When it seems they are reading a UTF-8 file (e.g. “… can’t decode byte 0x81 in position …”), we can advise the reporter to use the -X dev option.

The user can then see the warning and report it. Then we can find where the “encoding should be UTF-8 but was omitted” bug is.

steve.dower · June 6, 2019, 11:35pm

I still believe the backwards compatibility impact is bad enough to not make this worthwhile, as we decided while discussing my two encoding PEPs (528 and 529, from memory).

My impression is that the rationales in this proposal have not yet been validated, and are based on a narrow view of Python’s userbase. We should not so lightly cause user data to become “corrupt” between Python 3.8 and Python 3.9.

I always teach that if you don’t know the encoding of a file, you can’t read that file, so make somebody tell you. If you’re writing the file, either someone will demand a particular encoding, or you should choose (and often UTF-8 is a good choice). I would prefer to make the encoding parameter required, if we’re going to make a breaking change here!

That said, if this is going to be voted in anyway like everything else recently I’ve had concerns about, having a “best guess current locale” encoding (which is not the same as the console code page anyway) and a noisy warning for unspecified encoding in open(), etc., is about the best way to make the transition painful enough now that any code currently being maintained will be ready for when 3.9 is released.

vstinner · June 7, 2019, 12:54am

Oops, you’re right! I recall that you argued for preventing users from opening a JPEG binary file in text mode by default, so I changed the open() error handler from surrogateescape to strict.

Why not just enable UTF-8 mode by default?

In which case would the default encoding not be UTF-8 if the UTF-8 mode is enabled by default? Do you mean when a user opts out of the UTF-8 Mode?

methane · June 7, 2019, 12:56am

It may still be bad enough for now, but Python 3.9 will be released in 2021. I expect almost all Python users will use a UTF-8 editor, a UTF-8 terminal, and UTF-8 text files in 2021.

If it is still bad in 2021, we can postpone it to 2023 (Python 3.10).

Would you elaborate on what “corrupt” means here, please? (“Corrupt user data” is a very strong phrase, and I think we should be very explicit when we use it.)

It is one option, of course.

I chose this PEP because there are many users who already live in an “always UTF-8” world, and its popularity will be greater in 2021. They already omit the encoding option for UTF-8 files because they don’t care about running their scripts on Windows.

If we make the encoding option required, we break all scripts written that way.

On the other hand, “UTF-8 if encoding is omitted” is consistent with “str.encode()” or “bytes.decode()”.

vstinner · June 7, 2019, 12:58am

Would it be possible to design a helper function that attempts to decode a text file from UTF-8 and falls back to the locale encoding? The only reliable option would require reading the whole file as a first step (but it doesn’t have to load the whole content into memory; checking the UTF-8 encoding can be done on small chunks).

I’m not sure if it would be a good idea to promote such a function, but it might make the adoption of this PEP easier.
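
A rough sketch of the chunked check, to make the idea concrete. The function names and the 64 KiB chunk size are made up for illustration; an incremental decoder is used so that a multi-byte sequence split across two chunks is not reported as an error:

```python
import codecs
import locale


def is_valid_utf8(path, chunk_size=64 * 1024):
    """Return True if the whole file decodes as UTF-8, reading chunk_size bytes at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)          # keeps partial sequences across chunks
            decoder.decode(b"", final=True)    # fail on a truncated trailing sequence
    except UnicodeDecodeError:
        return False
    return True


def open_utf8_or_locale(path):
    """Open as UTF-8 if the file is valid UTF-8, else with the locale encoding."""
    encoding = "utf-8" if is_valid_utf8(path) else locale.getpreferredencoding(False)
    return open(path, encoding=encoding)
```

The cost is that the file is read twice, and a file in a legacy encoding that happens to be valid UTF-8 (pure ASCII being the harmless case) is always treated as UTF-8.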

Note: Link to early discussions: https://github.com/python/peps/pull/1099

methane · June 7, 2019, 1:05am

Yes. People will assume (or already assume) that “UTF-8 is always the default text file encoding”. But that is not true if we can opt out of UTF-8 mode, unless this PEP is accepted.

steve.dower · June 7, 2019, 1:05am

By “corrupt” I mean it does not produce the same content when read as when it was written. UTF-8 is nice in that it’s very good at detecting errors, but if someone had say a pip-generated config file in 3.8 that was ACP-encoded with their perfectly valid username in it, and then that caused pip on 3.9 to crash on start-up, then we’ve totally broken them.

Having a “smart” encoding isn’t a bad idea, as long as we can be good enough about setting its rules. I’d love an encoding that would read/skip UTF-8 BOM but wouldn’t write it, and also would detect UTF-16 BOM and switch to that.
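
One possible shape for those rules, as a sketch only. The names sniff_bom, smart_decode and smart_encode are hypothetical, and UTF-32 is ignored here even though its little-endian BOM begins with the same two bytes as the UTF-16 one:

```python
import codecs


def sniff_bom(head):
    """Pick a decoding codec from the first bytes of a file."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # skips the BOM when decoding
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"         # the codec reads the BOM to pick the byte order
    return "utf-8"


def smart_decode(data):
    return data.decode(sniff_bom(data[:4]))


def smart_encode(text):
    # Never writes a BOM: read/skip it on input, but do not produce it.
    return text.encode("utf-8")
```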

steve.dower · June 7, 2019, 1:07am

You can test this with a quick Twitter poll - “What is the encoding used for the file contents when I do open(path, "w") in Python?”

utf-8
current locale
system locale
something else

methane · June 7, 2019, 1:25am

If we don’t care about universal newline, it’s easy:

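import locale
from os import PathLike

def decode_file_contents(path: PathLike) -> str:
    # Read raw bytes; "rb" is required because "b" alone is not a valid mode.
    with open(path, "rb") as f:
        b = f.read()
    try:
        return b.decode()  # bytes.decode() defaults to UTF-8
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the locale encoding.
        return b.decode(locale.getpreferredencoding(False))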

methane · June 7, 2019, 1:40am

I think my Twitter followers are very biased…

As far as I can tell from this search result, many people omit the encoding option when reading Markdown files.

https://github.com/search?l=Python&q="open("README.md").read()"&type=Code

methane · June 7, 2019, 1:57am

Oh, I hadn’t read PEP 596. Python 3.9 will be released in 2020, not 2021.

If we are moving to a one-release-per-year cycle, it would be better to postpone this to 3.10 (2021).

steve.dower · June 7, 2019, 4:42am

Maybe we could define another trigger besides the version number? We can deprecate unspecified encodings now and see how people react, and then base it on that.

This is basically why it was okay to change the default filesystem encoding on Windows when I did that - using bytes as paths had already been officially deprecated for 3 releases, and they were the only affected scenario, so we could undeprecate them with different semantics.

methane · June 7, 2019, 5:36am

I think we should have at least a one-year deprecation period. Since Python 3.8 will be released on 2019-10-21, I think “2021 or later” is OK.

But I don’t think showing a deprecation warning every time people omit encoding is a good idea, because a lot of code omits encoding.

It is totally different this time. If we deprecate the default text encoding completely, all users will be affected. Many Python users already live in an “always UTF-8” world, and many Python scripts omit the encoding option. It is not like bytes filesystem paths.

So before we start warning, we should decide which future we want: a default text encoding of UTF-8, or prohibiting omission of the encoding option.

pf_moore · June 7, 2019, 9:09am

Thanks for raising this.

Hopefully, people here will be open to the idea that there are a lot of users of Python whose views are under-represented in the discussions around language changes, and that we should be very cautious about ensuring that we don’t get into a situation where we are consistently making decisions that disadvantage such users.

You missed a possible response:

What’s an encoding?

In my experience, there are a lot of people who even today, even in a world full of emoji and international characters, don’t know what an encoding is. They rely on “the operating system” getting it right for them, and when it doesn’t (for example typing an ANSI-encoded file in an OEM-encoded console window) they are used to the patterns of mojibake that they see, and know how to deal with them, even though they have no real understanding of the underlying issue (and probably don’t even know what the term “mojibake” means). Their concept of a “file” is quite possibly extremely limited - they may not even understand why a “text file” is easier to read in Python than a Word file or a PDF file - after all, they all “just contain text”. And yet these people are competent developers in their field, and may be using Python with little or no formal training, relying on Google and Stack Overflow (which will take years to adapt to a change like this).

As an individual, I’d find it convenient if Python defaulted to UTF-8. But as I said when I suggested that this needs a PEP, it’s not me that suffers from this change, it’s the vast number of Python programmers who don’t know about encodings, and who typically aren’t involved in discussions on changes. I fully expect that many of them will be bitten by this change - not in a massive “OMG Python is broken!” way, but more in a “sigh, I hate dealing with files in Python, I always end up with weird errors about Unicode” way (it wouldn’t surprise me if we see a resurgence of “Unicode error” bug reports in applications, similar to the spike we got with the Python 2 → Python 3 transition).

I believe that the most “encoding naive user” friendly way to deal with encodings is to leave it to the OS. Consistent behaviour with “everything else the user works with” is more important than anything else here - and if that means educating people who write cross-platform code that their assumptions about encodings are not true, then so be it. PEPs like PEP 528 and PEP 529 work with the OS, by in effect making the necessary “(Python) string → bytes → (OS API) string” encoding dance lossless and consistent by using UTF-8. That’s very different from this change, which exposes Python’s encoding choice directly to the user.

So, to summarise, I am -1 on this change, even though I wish for personal reasons that I could be +1.

methane · June 7, 2019, 9:37am

I agree that there can be many users who are bitten by this change.

But please note that, there are many users who are bitten by current behavior too.

And usage of cp65001 will grow quickly (thanks to Microsoft improving the Developer eXperience),
so the number of users bitten by the unstable default encoding will grow in the next few years.

Of course, my prediction may be wrong. If what I worry about does not happen in the next few years, we can postpone the change.

So my question is: are you -1 on this change for any foreseeable future? Or are you just -1 on making this change in 2021?

If the latter, I hope you are not -1 on starting to show a deprecation warning for the future change, regardless of when the change happens. We may be able to change it in Python 4.0.

pf_moore · June 7, 2019, 9:48am

But as that happens, users will learn about it as part of changing behaviour in all of their apps.

I obviously can’t predict the future, but I’m basically -1 because I think the principle should be that Python defaults to “do what the OS does”, not to a Python-specific value. So I’m -1 for the foreseeable future.

To put it another way, I’d prioritise consistency with “other applications on the user’s machine” over “other machines and platforms”. We should be designing for end users, not for library maintainers (and I say that as a library maintainer).

methane · June 7, 2019, 9:53am

How many apps change text file encoding based on the code page?
At least modern text editors (including Notepad from 2019) use UTF-8 by default regardless of the active code page.

So many users are bitten only by Python and some legacy applications. New Python programmers are bitten almost only by Python, because they don’t use any app which uses a legacy encoding.

What does “the OS does” mean here? The OS doesn’t care about text file encoding, does it?