PEP 597: Use UTF-8 for default text file encoding

pf_moore · June 7, 2019, 10:23am

I have no idea. But probably anything that uses a C fopen(), for a start. I suspect that nearly all cross-platform open source code would use the OS default encoding (just like Python does at the moment).

The users I’m referring to are quite probably using an older OS - I have no figures, but my impression is that a significant number of corporate users are not yet using Windows 10, and even if they are, they are quite likely not using the latest releases.

The default ANSI code page, set at the OS level, used in the C runtime. Maybe I should say “the C runtime default” - you’re right, at the lowest level the OS doesn’t deal in text (at least it doesn’t on Unix, and on Windows, PEPs 528 and 529 address the 2 key places where Python interacts directly with OS APIs that do deal with text).

But whether it’s the OS default or the C runtime default makes little difference in practice.

Not many developers use Notepad to write text files anyway. Notepad++ (which is what I see most of my colleagues using) defaults to the ANSI codepage (which is what I mean by the “OS default”, BTW). PyCharm “Global Encoding → System Default” will use the ANSI encoding. VS Code defaults to UTF-8 but some users appear to want the system default encoding because this clashes with the text files they typically read (I hope I got the essence of that bug report right - but it’s worth a look as it gives some insight into the frustration a choice to explicitly prefer UTF-8 over the system codepage can cause). The point being that I don’t think the majority of text files users see and write are in UTF-8, unless (1) the user is encoding-aware enough to configure things that way, or (2) “the OS” defaults to UTF-8 (ANSI codepage on Windows, NLS environment variables on Unix, …).

methane · June 7, 2019, 11:44am

fopen() doesn’t care about encoding. It only translates LF ↔ CRLF in text mode.
Node.js, Go, and Rust use UTF-8 by default too. Many cross-platform open source projects use UTF-8 nowadays.

I agree about that. So this is a problem of balance between new users and users living in legacy environments. How many modern vs legacy users will use Python 3.10 in 2021?
If it is 50 : 50, there is no doubt that backward compatibility is more important than new users.
But if it is 90 : 10, or even 99 : 1?

I think we should change the default for new users eventually, although 2021 may not be the time.

OK, this is the key difference between our opinions.

There is a beta option to change the system code page to cp65001. But I think it will be far in the future before everyone uses cp65001 as the system code page, because some legacy applications are broken by the option.

On the other hand, tools around text files, like Notepad, are moving away from the “ANSI code page” to “UTF-8” without waiting for the system code page to change.

I expect Notepad++ and PyCharm will change the default too in the next several years. And the majority of users will use UTF-8 regardless of whether they know about encodings, and regardless of the system code page.
(I expect the majority of Python users already use UTF-8 by default, but we don’t have data about that.)

Of course, these are just my optimistic expectations. If it doesn’t happen, we can just postpone the change.
There is no problem with a deprecation period longer than 2 years.

But when it happens, almost all tools around text files will use UTF-8 by default except Python. Users will be forced to learn about encodings only for Python. It will be very frustrating.

pf_moore · June 7, 2019, 1:18pm

Doh. My bad - ignore this comment, but my feelings about following the OS standard remain.

I’d argue that if it’s the right default, then we can wait and the OS will switch to that default. So why try to change faster than the OS?

Maybe. Actually, quite probably. But what’s the rush for Python to be the first here? Maybe we should wait until the majority of tools have changed, and only then reconsider if the OS default hasn’t changed too.

I expect the opposite, at least on Windows (and I believe Steve Dower agrees, which I’d put a lot of weight on, as he has access to much more accurate data about Windows users than either of us do) but as you say we don’t have data. And that makes me feel that we should follow the spirit of the Zen and “refuse to guess”.

If I understand the proposal, you are talking about a warning that will affect every single Windows user (as Windows never has UTF-8 as the default encoding currently, and most uses of open() don’t specify an encoding). IMO, there’s a huge problem with having a deprecation warning like that, for any period, much less an extended period. (I actually hadn’t considered what the spec said about the warning until now).

I don’t know what more I can say. I remain a strong -1 on this proposal, because of the impact I expect it to have on other users (I’ll be fine either way). I can’t offer hard data on who will be affected, so the whole discussion basically comes down to whether people agree with my instincts or not. IMO Steve Dower’s talk from PyCon is pretty relevant here, though.

steve.dower · June 7, 2019, 2:32pm

Thanks Paul for arguing so clearly. I agree 100%.

Because Python previously did what the OS did, we need to continue doing that until the OS changes. When they change to UTF-8 by default, we will also change. (FWIW, I believe this applies to all operating systems or we wouldn’t have needed any of Victor’s locale coercion work and UTF-8 mode work.)

I heard recently that the C Runtime on Windows will be getting UTF-8 support soon (they currently don’t really have it, cp65001 is not close enough), so I’ll ask them what their plans are for enabling it by default. It is likely that they have better insight into usage than we do, simply because they can afford to reach out to a range of users and ask them.

methane · June 7, 2019, 2:43pm

As I wrote in the PEP, the active code page can change for various reasons, although the system code page will stay legacy for a long time.

So Python will change its default code page for various reasons. Sometimes Python uses UTF-8, and sometimes the legacy one. Only users who know about the “active code page” can understand what happened.

OK, I cannot find any survey of which encoding is used by the majority of Python users.
But I found one interesting piece of data in the JetBrains survey: which editors Python users use.

Of course, this data is heavily biased toward PyCharm. But it tells us which editors are popular among Python users.

BTW, I’m not in such a rush about this PEP, since the release cycle will become shorter.
I thought “warn from 2021 & change in 2023 will be too late.” That is no longer true.

steve.dower · June 7, 2019, 3:58pm

Though there’s still discussion about how deprecation cycles will work under a shorter schedule…

And the problem isn’t so much what people do use, but when they don’t specify anything. The only real way forward here is to warn on not providing a specific encoding (that still may explicitly be the default code page or locale setting) as it “isn’t reliable cross-platform”. The main concern is libraries who need lead time to make fixes, but eventually end users will need to see the warning to understand any new UnicodeDecodeErrors.

Long term it’s probably worth it. Those languages with no back-compat constraints made the right choice (even .NET suffered an early breaking change for this), but of all the changes we’ve been discussing recently this has the biggest impact on users by far, because they may not be able to read their own files that were written out with an earlier version of Python!

vstinner · June 7, 2019, 10:49pm

Do you have a proof-of-concept (incomplete) implementation to play with? I would like to compare how Python 3.8, (Python 3.8) UTF-8 Mode and PEP 597 behave in corner cases which can lead to decoding errors or mojibake.

methane · June 8, 2019, 4:43pm

DeprecationWarning is suppressed by default in most cases to avoid making too much noise for users, because the warning is only for Python developers.

Note that even pip has shown multiple DeprecationWarnings for several years. DeprecationWarning is useful for changing something over the long term. And most users don’t see the warning.

It’s very interesting. WSL2 and the new MS Terminal are coming this year too. Let’s stop guessing for now what will happen on Windows in the next few years.

OK, while I prefer “UTF-8 by default” to “force everyone to specify encoding”, omitting the encoding is very common, and for now it is a hard mistake to find. The warning can teach Python developers that “default text encoding is not always UTF-8”, and reduce bugs caused by open("...").read().
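To make the open("...").read() pitfall concrete, here is a minimal sketch. The helper names are hypothetical, and it assumes that open() without an encoding falls back to the locale encoding as reported by locale.getpreferredencoding(False). It reports that default and shows the explicit-encoding form that does not depend on it.

```python
import codecs
import locale
import os
import tempfile


def default_text_encoding() -> str:
    """Return the normalized name of the locale encoding.

    Assumption: this is what open() falls back to when encoding is omitted.
    """
    return codecs.lookup(locale.getpreferredencoding(False)).name


def roundtrip_explicit(text: str) -> str:
    """Write and read back text with an explicit encoding, independent of the locale."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("default when encoding is omitted:", default_text_encoding())
    print(roundtrip_explicit("こんにちは"))
```

The explicit form behaves the same on every machine; the reported default is whatever the current locale or code page happens to be.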

Maybe we can use PendingDeprecationWarning instead of DeprecationWarning.
PendingDeprecationWarning makes less noise for end users. And it can be used before we decide what change will happen later.

Note that changing the language setting or the active code page already causes this trouble.
Python may then be unable to read text files that were written out by even the same version of Python.
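The effect of a changed code page between writing and reading can be imitated without touching any system setting. This is only an illustration: the helper name is hypothetical and the encodings (cp932, cp1252, UTF-8) are chosen as examples. It simply encodes with one codec and decodes with another, which gives either a decoding error or mojibake.

```python
def reread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # Bytes written as cp932 are not valid UTF-8: decoding fails outright.
    try:
        reread("日本語", "cp932", "utf-8")
    except UnicodeDecodeError as exc:
        print("decoding error:", exc.reason)

    # UTF-8 bytes read as cp1252 decode "successfully" into mojibake.
    print(reread("café", "utf-8", "cp1252"))
```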

Python users on Windows suffer from more UnicodeDecodeErrors than users on macOS and Linux.

Changing the default encoding to always be UTF-8 is the best way to finally solve this problem, although it will increase trouble during the transition period.

I will create it next week.

steve.dower · June 9, 2019, 12:22am

The rate that we release new versions vs. the rate people learn a new alphabet probably means the version change is more likely. Unless someone is working with lists of files from another machine or has the wrong codepage set (fairly common with international students), they’re unlikely to ever notice it.

And I think once you exclude Python 2 and filesystem encoding related issues, the rate of affected users is probably similar. (Correcting for the higher absolute number of Windows users who will hit this and not know how to deal with it immediately is too difficult for me to guess.)

methane · June 9, 2019, 2:26am

I’m sorry, but I cannot understand what you mean in this paragraph.

Python is used not only for reading text written by Python on the same machine.
Python is also used to read text files written by text editors, or by Python on other machines.
Many Python users are affected by the legacy encoding being the default.

On the other hand, no problem happens when the text files contain only ASCII characters. That greatly reduces the “rate” of people affected by the legacy default encoding. But it reduces the rate of people affected by “change default to UTF-8” too.

Legacy encodings are much hated by Python users, at least in Japan. We don’t want to produce any new text files in cp932 unless it is explicitly specified. We believe “keep using legacy encoding” creates more trouble than changing the default. We recommend that people configure their tools to always use UTF-8. But Python doesn’t have such a configuration (PYTHONUTF8 is too aggressive, because it affects reading from subprocess output).

But as far as I can tell from the reactions in this topic, the situation may be different in other countries.
I wonder why many new text editors that were not born in Japan don’t read latin1 or other legacy encodings by default.

vstinner · June 10, 2019, 11:16am

At Red Hat, we are rebuilding the operating system for Python 3.8. Python 3.8 alpha 4 introduces new deprecation warnings in the inspect module. These warnings broke the build of many packages, because we run the test suite of each project and more and more projects run their test suite using -Werror. It means that adding a single warning can break dozens and dozens of packages.

Sometimes, it’s a deliberate change. Python 3.7 introduced a new syntax warning for invalid escape sequence:

$ python3.7 -Wdefault
>>> "c:\windows"
<stdin>:1: DeprecationWarning: invalid escape sequence \w

The question is whether we want to force all users to make their tiny scripts portable on all platforms with all locales. Some scripts are only run on one computer which has a well known configuration (good or bad, it doesn’t matter) and the script “just works”. Many applications are designed for Unix and don’t make sense on Windows. Some applications are even designed for Linux, and that’s fine.

IMHO the encoding problem is so special that it would deserve its own new warning category like EncodingWarning. The idea would be to enable all warnings but EncodingWarning. Or display EncodingWarning but don’t fail with a hard error.

I made an experiment: introduce a new EncodingWarning (inherit from Warning) which is emitted by open() if encoding=None:
https://bugs.python.org/issue37214

If this warning is treated as an error, at least 32% of tests (136 tests/421) of the Python test suite fail, and it seems like 7 tests hang forever.
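A rough sketch of the idea, not the patch from the linked issue: the category name OmittedEncodingWarning and the wrapper open_checked are hypothetical stand-ins. The wrapper warns when a text-mode open omits the encoding, and the second helper shows the "all warnings are errors, except this category" setup discussed above.

```python
import builtins
import os
import tempfile
import warnings


class OmittedEncodingWarning(Warning):
    """Hypothetical category standing in for the experiment's EncodingWarning."""


def open_checked(file, mode="r", encoding=None, **kwargs):
    """Wrap open() and warn when a text-mode call omits the encoding."""
    if "b" not in mode and encoding is None:
        warnings.warn("encoding omitted; the locale encoding will be used",
                      OmittedEncodingWarning, stacklevel=2)
    return builtins.open(file, mode, encoding=encoding, **kwargs)


def omitted_encoding_warnings(path):
    """Open path once without and once with an encoding; return the warnings seen."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        with open_checked(path) as f:                    # warns
            f.read()
        with open_checked(path, encoding="utf-8") as f:  # silent
            f.read()
    return [w for w in caught if issubclass(w.category, OmittedEncodingWarning)]


def read_strict_except_encoding(path):
    """Treat warnings as errors (like -Werror) but ignore the encoding category."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        # Later filters are inserted in front, so this one takes precedence.
        warnings.simplefilter("ignore", OmittedEncodingWarning)
        with open_checked(path) as f:
            return f.read()


if __name__ == "__main__":
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    try:
        print(len(omitted_encoding_warnings(tmp)), "warning(s) recorded")
        print(repr(read_strict_except_encoding(tmp)))
    finally:
        os.remove(tmp)
```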

I would like to believe that the Python code base is sane in general and that we handle encodings properly (I spent a significant amount of time to ensure that it’s the case ;-)). So if even Python itself emits so many warnings, I don’t think that it would be a good idea to introduce this warning and make it a hard error (exception) in Python 3.9.

I don’t even think that such a warning must be displayed by default. Like DeprecationWarning, it’s more a notice that should only be displayed to developers. Users don’t control the code and so cannot fix the warning just by tuning their locale or locale encoding.

methane · June 10, 2019, 4:26pm

I don’t want to force it. Many text files contain only ASCII characters. I don’t know of any locale encoding which can’t decode ASCII files. So they don’t need to fix anything until they need to process text files that contain non-ASCII characters. And nothing happens when the default encoding is changed from a legacy one to UTF-8.

I still believe UTF-8 is the best default encoding for text files. When people fail to read text files encoded in a legacy locale encoding, they can fix the code or re-encode the text file and continue. But when they fail to write text files, they may lose important data forever.
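One way a failed write can cost data, sketched under assumptions: the helper name is hypothetical, and cp1252 stands in for whatever legacy default a machine might have. Opening a file with mode "w" truncates it before anything is encoded, so if the new text cannot be encoded, the old contents are already gone.

```python
import os
import tempfile


def overwrite(path: str, text: str, encoding: str) -> bool:
    """Replace the file's contents; return False if the text could not be encoded."""
    try:
        with open(path, "w", encoding=encoding) as f:  # truncates the file here
            f.write(text)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    try:
        overwrite(tmp, "old notes", "utf-8")
        ok = overwrite(tmp, "こんにちは", "cp1252")  # cp1252 cannot encode Japanese text
        print("write succeeded:", ok)
        print("bytes left in file:", os.path.getsize(tmp))
    finally:
        os.remove(tmp)
```

A failed read, by contrast, leaves the bytes on disk untouched, so it can be retried with a different encoding.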

My current idea is to change it in Python 4.0 and warn about it only in the docs. We may be able to ask popular linters to add a warning about omitting the encoding too.
Until Python 4, we may be able to advocate UTF-8 mode for Python users who don’t use Python to maintain legacy systems.
Currently, UTF-8 mode is too aggressive. But Microsoft is improving UTF-8 support quickly.
We may be able to advertise “Use a modern terminal, chcp 65001, and set PYTHONUTF8=1” later this year.

vstinner · June 10, 2019, 11:36pm

I really dislike putting any large backward incompatible change in Python 4.0. It should just be the version following the last 3.x version.

According to my quick experiment, all Python projects will get dozens of such warnings and it will take years until all these warnings are fixed. Moreover, in my experience, many projects will not bother and never fix these warnings.

Do you mean that the UTF-8 Mode should not modify locale.getpreferredencoding()?

What about adding a second opt-in UTF-8 Mode which leaves locale.getpreferredencoding() unchanged? I’m not comfortable with this idea. As I said above, UTF-8 Mode has been designed to reduce the risk of mojibake.
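For readers who want to see what UTF-8 Mode does to locale.getpreferredencoding(), here is a small sketch. The helper name is hypothetical. It starts a child interpreter with the -X utf8 option (or -X utf8=0 to switch the mode off) and reports the value the child sees.

```python
import subprocess
import sys


def preferred_encoding_in_child(utf8_mode: bool) -> str:
    """Run a child interpreter with UTF-8 Mode on or off and return its preferred encoding."""
    code = "import locale; print(locale.getpreferredencoding(False))"
    flag = "utf8" if utf8_mode else "utf8=0"
    result = subprocess.run(
        [sys.executable, "-X", flag, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("UTF-8 Mode on: ", preferred_encoding_in_child(True))
    print("UTF-8 Mode off:", preferred_encoding_in_child(False))
```

With the mode off, the result depends on the machine's locale or code page, which is exactly the variability being discussed.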

Why not right now? If we cannot do it right now, maybe it means that it’s a bad idea to change the default encoding to UTF-8.

steve.dower · June 11, 2019, 12:39am

You should research and verify this belief. When you discover that it’s not true, feel free to reevaluate your position.

There’s no judgement for changing your mind given new information, but choosing to remain deliberately ignorant will cause you to propose dangerous changes.

steve.dower · June 11, 2019, 12:50am

“chcp 65001” is not recommended except to work around some (not all) problems in software that assumes the whole world is UTF-8. Python is the exact opposite - we correctly use code pages and Unicode APIs.

How about this: we make TextIOWrapper default to UTF-8 instead of the current locale, and if its read or write methods fail to encode correctly we chain an exception with a clearer message saying what was changed and including the encoding="acp" or whatever specification we decide means “use what the system would have guessed on 3.7”, as well as whatever environment variable we decide will revert the behavior? Hopefully the deprecation warning will break enough test suites that maintained libraries will be fixed, and escape hatches that work both interactively and when using unmaintained libraries should cover user needs. Maybe we could also provide a contextvar so that libraries can “fix” their dependencies as well?
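A minimal sketch of the chained-exception part of this idea, written as a plain helper rather than a change to TextIOWrapper. Everything named here is hypothetical: the exception class, the function, and the wording of the hint. The encoding="acp" spelling and the environment variable mentioned above are not implemented; the hint simply names the current locale encoding.

```python
import locale
import os
import tempfile


class DefaultEncodingChanged(UnicodeError):
    """Hypothetical error explaining that the default is UTF-8 now."""


def read_text_utf8_default(path, encoding=None):
    """Read text with UTF-8 as the default; on failure, chain a clearer error."""
    if encoding is not None:
        with open(path, encoding=encoding) as f:
            return f.read()
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError as exc:
        legacy = locale.getpreferredencoding(False)
        raise DefaultEncodingChanged(
            f"{path!r} is not valid UTF-8. The default encoding is now UTF-8; "
            f"pass encoding={legacy!r} for the previous locale-based behavior."
        ) from exc


if __name__ == "__main__":
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(tmp, "wb") as f:
            f.write("日本語".encode("cp932"))  # a file written with a legacy encoding
        try:
            read_text_utf8_default(tmp)
        except DefaultEncodingChanged as exc:
            print(exc)
            print("caused by:", type(exc.__cause__).__name__)
    finally:
        os.remove(tmp)
```

The original UnicodeDecodeError stays reachable through the exception chain, so the byte offset and reason are not lost.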

Unfortunately, such a big, complex and noisy set of mitigations is necessary for such a big, invasive change.

methane · June 11, 2019, 1:45am

Sorry, what do you mean here? About changing the default encoding? Or about whether to show the warning or not?

Victor and I are talking about the warning. I just meant that the warning can be too noisy for people or programs that use only ASCII files.

If the change breaks 5% of cases, it is clearly backward incompatible. But if the warning is shown for the 95% of harmless cases, it can be too noisy. It is worth considering how we can reduce the noise.

BTW, I don’t think all locale encodings are 100% compatible with ASCII. Some encodings may have different mappings for punctuation or control characters. For example,

>>> b"\x04".decode("cp424")
'\x9c'
>>> b"\x04".decode("ascii")
'\x04'
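The single-byte check above can be extended to all 128 ASCII bytes. This is a sketch with a hypothetical helper name, and it only tests the decoding direction for the codecs listed, which are chosen as examples.

```python
ASCII_BYTES = bytes(range(128))


def is_ascii_compatible(encoding: str) -> bool:
    """True if every ASCII byte decodes to the same code point under encoding."""
    try:
        return ASCII_BYTES.decode(encoding) == ASCII_BYTES.decode("ascii")
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for name in ("utf-8", "cp1252", "cp424", "cp037"):
        print(name, is_ascii_compatible(name))
```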

Currently, some Python users assume the default text encoding is UTF-8.
And more Python users assume open(...).read() is safe for reading ASCII files.

Both assumptions are wrong, and changing the default text encoding to UTF-8 is the long-term solution for fixing these wrong assumptions.

methane · June 11, 2019, 2:14am

UTF-8 mode is OK for Unix systems which don’t have a UTF-8 locale, but where all applications on the system are “UTF-8 everywhere” (Rust, Go, Node, etc.).

On the other hand, on systems which use a legacy-encoding locale, other processes talk non-UTF-8. But UTF-8 mode changes the encoding for pipes between subprocesses.

I don’t think there is a 100% solution here. But at least on Windows, many new Python users suffer because the default text encoding is not UTF-8. They cannot read text files they wrote with VS Code or Atom. They cannot read text files downloaded from the internet. Even pip install may raise UnicodeDecodeError.

So if we add a 2nd UTF-8 mode, changing only the default text encoding to “UTF-8” would be helpful to them. I think it is the best short-term solution.

But when thinking long term, this option will move most Python users in the Python community further into an “always UTF-8” world, leaving behind Python users outside of the community.
More and more scripts and sample code will be written with the assumption “default text encoding is UTF-8”, while this assumption is wrong for users who don’t use the option.

chcp is required to stop other processes from talking in a legacy encoding. It is not ideal, just a workaround:

C:\Users\inada-n>chcp
現在のコード ページ: 932

C:\Users\inada-n>date
現在の日付: 2019/06/11
新しい日付を入力してください: (年-月-日)

C:\Users\inada-n>chcp 65001
Active code page: 65001

C:\Users\inada-n>date
The current date is: 2019/06/11
Enter the new date: (yy-mm-dd)

And this solution is not perfect either. Some legacy applications on Windows are written with a hard-coded legacy encoding. For example, printf("こんにちは\n") in a C source file that is encoded in the ANSI code page. They speak cp932 regardless of the current code page. (Modern Windows applications should use wprintf(L"こんにちは\n");, but that is far from typical C code in the Unix world.)

Additionally, cp65001 support in the current (legacy) console is not perfect either.
We need to wait for the new MS Windows Terminal to use UTF-8 always in the terminal.

steve.dower · June 11, 2019, 2:17am

Read the part I quoted. You can hit the little up arrows at the top of the quote to get back to the original source.

Maybe you aren’t seeing people’s quotes in your client? Because it seems like you’re treating each reply as a reply to the whole post and not the bit that’s been copied in.

methane · June 11, 2019, 2:19am

I read the quoted part, and I cannot tell whether you are talking about the warning or the change…

Victor and I were talking about how the warning can be noisy in many cases. Reading ASCII-only text with the default encoding is just one example.

When thinking about a breaking change, we never ignore other cases. I never said “most people use only ASCII, so changing default text encoding to UTF-8 is safe.”
But I’m afraid you’re mistaking me for that kind of extremist.

steve.dower · June 11, 2019, 4:21am

At the very least go and discover the encodings that are not ASCII compatible. You say you don’t know of any, but they do exist and are supported and used by real operating systems.

You’re also going to find that far more people use non-ASCII characters than you think, which means far more people will be affected than you think, which means “nothing happened when they change to UTF-8” may technically be a majority of experiences, but you may create the next bytes/str debacle if it’s not communicated well. And the first step in communicating that well is showing that you understand those who are going to be affected and are trying to help them. So far, this is not being communicated well.
