PEP 597: Use UTF-8 for default text file encoding

Before updating the PEP, I want to reply to this part.

TL;DR: There are so many non-UTF-8 files in Japanese that they became a strong motivation for “UTF-8 by default, everywhere, every time”. No contradiction here.

I, like many experienced developers in Japan, suffered through a complex situation before UTF-8 became dominant.

We used at least three major encodings (cp932 on Windows, euc_jp on Unix, and ISO-2022-JP for IRC, e-mail, etc.). cp932 uses 0x5c (\) as a trailing byte in multibyte characters, and many applications were broken because they treated it as an escape character. ISO-2022 is stateful. Converting between the encodings was always lossy (no round-tripping). And legacy mobile phones added custom “emoji” in unused code areas. It was a nightmare era.
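Both problems are easy to reproduce in Python today. This is my small demonstration of the claims above; the byte values are properties of the encodings themselves, not assumptions:

```python
# The second byte of 表 in cp932 is 0x5c, which is backslash
# (rendered as the yen sign on Japanese systems). Naive parsers
# treat it as an escape character and corrupt the text.
assert "表".encode("cp932") == b"\x95\x5c"

# ISO-2022-JP is stateful: escape sequences switch character sets
# mid-stream, so slicing the byte stream can corrupt it.
encoded = "こんにちは".encode("iso2022_jp")
assert encoded.startswith(b"\x1b$B")  # ESC $ B enters JIS X 0208
assert encoded.endswith(b"\x1b(B")    # ESC ( B returns to ASCII
```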

Now we (“modern” users) are happy with UTF-8. We use it every day, and we teach new programmers with UTF-8 only. Of course, we are sometimes forced to use cp932 on Windows. (And that’s one of the major reasons Windows is disliked…) But we use UTF-8 in all other cases.

And changing the default encoding doesn’t mean we can no longer handle legacy text files.
Note that text files live longer than “locales”.
Around 2005–2015, some users used a UTF-8 locale and others used a euc_jp locale even on the same server, and there were cp932 and ISO-2022-JP files on that same server.

A single default text encoding never worked well for legacy text files on such systems. We always checked the encoding of the text files manually. That’s why I thought “always specify the encoding when handling legacy text files” makes sense.
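To illustrate (my own sketch, not from the PEP): on such a mixed-encoding server, code that relies on the locale default breaks as soon as the file and the locale disagree, so naming the encoding explicitly is the only reliable approach:

```python
# Sketch: a cp932 file on a server whose locale might be UTF-8,
# euc_jp, or something else entirely.
text = "こんにちは"

with open("legacy.txt", "w", encoding="cp932") as f:
    f.write(text)

# Relying on the locale default here would produce mojibake or raise
# UnicodeDecodeError on most locales; naming the encoding always works.
with open("legacy.txt", encoding="cp932") as f:
    assert f.read() == text
```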

Of course, there are many systems where a “single legacy encoding” works fine. We shouldn’t dismiss them. But Python is not a language only for them. We shouldn’t dismiss the people who would be happy with “UTF-8 by default” either.

This is the background of my motivation. I hope it explains why “I believe UTF-8 is the best default encoding” didn’t mean “dismiss non-UTF-8 users”.

Your motivation is fine, we’re not questioning that. I agree completely that you should always know what encoding something will be read/written with (see my recent PyCon talk where I briefly talked about this topic), and I also agree that if someone doesn’t care what they write with (and neither does the recipient), then UTF-8 is a good default choice.

Where we disagree is in our estimation of the real impact this change would have on users and the amount of pain and frustration it will cause. And all of it will be blamed, quite rightly, on us.

Those of us who have been receiving that user pain for a long time, particularly around encodings, are very nervous that changing the default will cause more pain than it prevents, despite agreeing that UTF-8 would have been a better choice in the first place. And despite the fact that we will have to continue receiving the pain if we don’t change anything.

I guess what I’m saying is we’re making a product management tradeoff, not a technical decision. The correct technical choice is obvious, but the correct choice for serving our users well is not obvious, and sometimes it really is better to just leave it alone.

Really? I don’t think I ever underestimated it. I know that I cannot estimate it.

I think where we disagree is in estimating the pain for new users, maybe because the pain of cp932 is worse than that of the code page you are using.

I saw a lot of pain too. But I conclude that changing the default will prevent more pain than it causes in the 2020s. The current situation (legacy vs. UTF-8) looks like Python 2/3 to me. This is where we couldn’t agree.

Anyway, I have withdrawn the PEP and pushed a second version to stop the blame.
In the new version, I propose adding a PYTHONTEXTENCODING environment variable to override the default encoding.

But, again, I’m not in a hurry with this PEP for now. We should wait and see what happens in Windows 10 20H1 before adding anything to Python 3.9. MS may add a mode for better UTF-8 support. If it is really good and we can assume other processes use UTF-8 too, we can just use UTF-8 mode in that new mode.

I will create a new thread for the second version of my PEP after the PEP 596 discussion ends.
I’m burned out for now. I want to read and write code instead of English.

Until then, please share good news about the future of Windows here.

Post mortem analysis

I intentionally made the first version idiomatic and optimistic: I avoided adding any options, and I used a regular deprecation period before the breaking change.

I wanted to discuss how we can mitigate the pain of the breaking change, and to discuss adding options only after we concluded we couldn’t mitigate the pain enough. It’s my fault that I didn’t share this intention. But I didn’t expect to be blamed so much.

We have different throughput and time zones. I need 3–5+ hours to write a long reply. My wife gets ill quite often (5–10+ days per month), and I need to take care of her and my daughter.

So I cannot reply to every comment from all of you. I can only focus on one subtopic (e.g. the warning) per day. I used almost 100% of my time at the PC on this thread. Writing long English is very hard work for me.

You might have thought I dismissed or ignored some of your concerns. But I simply ran out of weekend.
Please don’t assume someone is intentionally ignoring you without waiting long enough.

Since you felt I ignored you, you sent many replies to my comments about other topics. That confused me more. I didn’t want to ignore your concerns; I just couldn’t understand what you wrote.

While Discourse provides a “reply to this comment” feature, it didn’t help us. I was really confused about why I was blamed while saying “the warning may be too noisy when reading/writing ASCII-only text”, because I was focused on the warning subtopic.

And when I felt blamed, I became too defensive. That was bad for the discussion too.

Additionally, the Python 2/3 metaphor is bad for constructive discussion. It can be abused too easily to shut someone up. For example, I could use the metaphor like this:

  • Still using a legacy encoding <-> still using Python 2.
  • Making UTF-8 the default in the mid-2020s <-> making Python 3 the default from 2020.
  • UTF-8 should be used by new people who don’t know about encodings <-> Python 3 should be used by new people who don’t know about the Python 2/3 problem.

This metaphor explains my personal feelings well. But can we start a constructive discussion from there? I think not.

If I had used this metaphor, you might have assumed I was strongly neglecting legacy-encoding users. But the metaphor just explains how I want to save new people from unnecessary UnicodeErrors by changing the default encoding. I wanted to discuss what we can do for legacy-encoding users from the beginning.

Lessons learned

  • Make the intention clear from the start.
  • Stop assuming bad intentions in others – use Hanlon’s razor.
  • Wait – you are not being ignored. Others have different time zones, throughput, and private lives.
  • When using “reply to this comment”, focus on that subtopic – other subtopics should go in separate comments.
  • Metaphors are misunderstood or abused too easily.

To be honest, none of those bother me as far as the “add checks to linters” proposal goes. As a general principle, linters don’t have to be perfect, they just have to be helpful. We’re all worried, to some degree, about the possible bad effects of changing the default. This proposal will bring closer the day when “enough” of the code we’re reading is immune to a change of default that we’re willing to risk it. The better the linters are, the faster we get there, of course. I don’t think we’re there yet, based on this thread, but we’ll see.

With respect to your specific bullet points, my answer to the encoding=None problem is (1) for open() and io.open(), parse harder, you linters! :wink: and (2) for the implicit cases linters can warn on “encoding=None” in function definitions (perhaps after checking for how it’s used in the body) and also go back and check calls in such a case. I don’t see how subprocess.Popen(text=True) is different from open() in this respect (although the PEP distinguished them!) As far as encodings go, the zip and tar file formats are a goat rodeo, not our problem unless somebody volunteers to deal with them. Heck, I did volunteer to deal with some of the problems with zip, where you currently cannot specify a legacy encoding, but got blocked.

As far as the opt-in warning goes, that’s a great idea. But I never enable those warnings myself. I use linters occasionally, but for me the most effective would be to have the docs consistently specify encodings in examples. The point is not that I’m representative of all programmers, rather that the more widely we cover this, the sooner we get default-proof code.

I also like your proposal for a ‘locale’ (or maybe ‘system_default’) codec that looks up the system default and returns it. Personally, I am happy enough to use sys.getdefaultencoding when that’s what I want, but having it registered in codecs would make a lot of other people happy, I’m sure!


Thank you for you efforts here, Naoki. I am sorry to hear about your wife; please enjoy your time with your family as much as you can. I’m sure I speak for everybody in saying that we respect your efforts, and that you should take care of yourself and your family first, with no regrets.

I don’t know if it will make you feel better, but I proposed making UTF-8 the default encoding for Python programs more than 15 years ago, during the PEP 263 discussions. Guido and MAL convinced me I was wrong then, and though both the particular encoded channel and the arguments for and against defaulting to UTF-8 have changed now with Python 3, I’m still pretty conservative about this for the same “backward compatibility” reason they put forward then.

Finally, I’m a fan of your work (even if sometimes I oppose implementing it :grimacing:). I know your pain: I lecture in Japanese once a week and prepping that hour occasionally costs as much as two days. If I can help to reduce the language burden, I’m always willing! (I’m not so good at answering email, sometimes it goes a couple days. If you want I’ll give you my phone number for LINE or other messaging app, or DM me @yasegumi on Twitter, I’ve been following you.) よろしくお願いいたします。:grinning:


Thank you for this explanation. I now understand better - in Japan, having a single default codepage was not helpful because so many codepages were in use. My experience is where a single codepage is (usually) sufficient, and UTF-8 is useful for occasional cases of “foreign” characters. That explains to me why our views are so different, and I should have thought things through a little further when considering your proposal.

I’m absolutely sure that I am part of the reason for this, so I want to apologise. Thanks for sharing your experience, it gave me a lot to think about. I am definitely guilty of replying too quickly, with too many messages. I’ll try to improve on that in the future.

Thank you for all the effort you put into the discussion.

As for filenames, I would hope most applications use the Unicode APIs?

Usually, a linter is only run on your application’s own code. If an open() call with encoding=None is hidden in a third-party function like read_config_file(), the linter doesn’t help. My intent with a warning is to ensure that issues are spotted anywhere: in your code, in third-party code from PyPI, in the stdlib, etc.
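A minimal sketch of that idea (my hypothetical illustration, not CPython’s actual implementation): wrapping builtins.open so that a warning fires no matter which layer omits the encoding, including third-party helpers:

```python
import builtins
import warnings

_real_open = builtins.open

def _checked_open(file, mode="r", *args, encoding=None, **kwargs):
    # Warn for text-mode opens that silently rely on the locale default.
    # (Binary mode never uses an encoding, so it is exempt.)
    if "b" not in mode and encoding is None:
        warnings.warn("open() without an explicit encoding", stacklevel=2)
    return _real_open(file, mode, *args, encoding=encoding, **kwargs)

# Monkey-patch so even code we don't control triggers the warning.
builtins.open = _checked_open
```

A real runtime warning in CPython would live inside TextIOWrapper rather than patching builtins, but the visible effect would be similar.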

I suggest running your test suite with -X dev; it enables a wide range of extra checks and warnings which help you spot real bugs. I recently modified the option to log close() exceptions on file descriptors. It can spot subtle bugs where a file descriptor is closed twice.

By the way, if we decide to go the opt-in warning way, we should also add a “locale” encoding to turn off the warning for legitimate uses of the locale encoding. INADA-san seems to prefer a Windows code page number like “cp1252” or an encoding name like “latin1”, but in my experience the locale encoding is sadly the best compromise.

Create a filename not encodable in the ANSI code page and try your favorite application. I expect you will be unpleasantly surprised.


FYI, my second version of this PEP doesn’t propose adding any warning. It adds one option, similar to UTF-8 mode. PEP 597 – Add optional EncodingWarning | peps.python.org

And the PEP introduces an encoding="locale" option. It is not a codec name or alias, because the locale may change after Python starts. TextIOWrapper treats this name specially: it calls encoding = locale.getpreferredencoding(False), as if encoding were omitted.
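If I understand the proposal correctly, a rough Python-level equivalent would be the following (a sketch of the proposed semantics only; the real logic lives inside TextIOWrapper, and `open_text` is my hypothetical helper):

```python
import locale

def open_text(path, mode="r", encoding=None):
    # Sketch of the proposed special case: "locale" is resolved at
    # open() time, so it tracks locale changes made after startup --
    # unlike a registered codec alias, which would be fixed forever.
    if encoding == "locale" or encoding is None:
        encoding = locale.getpreferredencoding(False)
    return open(path, mode, encoding=encoding)
```

The point of the string marker is that it lets callers say “yes, I really want the locale encoding” explicitly, which would silence any future opt-in warning.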

Because I use very few developer applications, I expect all the apps I use handle wchar_t filenames.

But… it is different in the console world. While Microsoft has improved the console recently, it neglected the console for a long time.
activate.bat needs the chcp 65001 hack. It is very ugly and introduces bugs like this:
Issue 34144: venv activate.bat reset codepage fails on windows 10 - Python tracker

Pessimistically speaking, WSL (and the coming WSL2!) is the best environment for learning Python on Windows.

In my experience, the console is 100% Unicode capable, even on Windows 7. Powershell is also Unicode capable. I tried both of these using a test like Victor suggested. However, the ancient cmd.exe shell is very definitely not Unicode capable, it uses the OEM codepage. Personally, I haven’t used cmd for years, and I would strongly recommend Windows developers not to use it. But it is still in wide use :sob:

(Edit: From what I recall, cmd.exe uses the OEM code page, which is not even compatible with GUI applications on the same machine!)

Yup, but it’s not UTF-8, at least in my Windows 10 setup. If you pipe some Unicode text to a file in PowerShell, it comes out as UTF-16 or something, with a BOM even.

Making UTF-8 the default encoding would be a great move for Python regardless, IMO.

That’s Powershell doing that, not the console, and you can work around it via | Out-File -Encoding UTF8. But I agree it’s a lousy default. Powershell isn’t perfect, but it is Unicode compatible (UTF-8 isn’t Unicode).

Thanks for following it through this far, I know how draining PEP threads can be, and have definitely dropped a few in my time.

As Stephen said, we’ve all tried to change this default in the past and have been convinced out of it, so we are all going to be looking for the new and unique part of the proposal that will make it work.

We’re all here to make Python better, so focus on that. There’s nothing personal about it.

Would you create a new thread about Unicode on the Windows console?

It seems an interesting topic: how Python behaves there now, and how it should behave in the future.
But this thread is already too long.

It looks like PowerShell 6 provides a UTF-8 world! See this issue:

I confirmed Python uses UTF-8 when stdout is redirected:

PS C:\Users\inada-n> python -c "print('こんにちは')" > x
PS C:\Users\inada-n> cat x
こんにちは
PS C:\Users\inada-n> python3
Python 3.7.3 (tags/v3.7.3:ef4ec6ed12, Mar 25 2019, 22:05:12) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("x").read()
'縺薙s縺ォ縺。縺ッ\n'
>>> open("x", encoding="utf-8").read()
'こんにちは\n'

This is because PowerShell doesn’t detach the console from Python, so os.device_encoding(1) is “cp65001”.

This is not perfect: Python still uses cp932 for communicating with subprocesses, because subprocess doesn’t use os.device_encoding().

But it seems PowerShell 6 + Python in UTF-8 mode is almost a “UTF-8 world”.
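For anyone who wants to check their own setup, these are the knobs involved (all real APIs; the values printed depend on your OS, console, and locale, so I show no expected output):

```python
import locale
import os
import sys

# The encodings that can disagree on Windows:
print("stdout:", sys.stdout.encoding)           # console code page when attached to a console
print("device:", os.device_encoding(1))         # None when fd 1 is redirected to a file/pipe
print("locale:", locale.getpreferredencoding(False))  # ANSI code page on Windows, locale-derived elsewhere
```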

Hm… What’s the difference between “the console” and “the ancient cmd.exe shell”? :thinking:

The console host (conhost.exe) is the UI surface that renders text and processes keyboard and mouse input. Its APIs are all Unicode, but there’s also emulation via file handles so that software written for POSIX can work. Any application can request a console, and will get an identical implementation (or will attach to the parent process’s active console), rather than having to reimplement all the text layout and rendering again.

The cmd.exe shell is one such application, and it converts text into executed commands, including piping and streaming, as well as handling legacy code page changes (which are technically a per-thread setting). It also knows how to process .bat and .cmd files.

It’s actually a very sensible separation IMO, especially when you decide you want a “normal” console alongside the GUI app you’re building. But it can be a little obtuse if you’re used to there only being one shell at a time.

But then, how do you use the “console” as @pf_moore claimed he did? Am I misunderstanding something?