PEP 597: Use UTF-8 for default text file encoding

To be honest, none of those bother me as far as the “add checks to linters” proposal goes. As a general principle, linters don’t have to be perfect, they just have to be helpful. We’re all worried, to some degree, about the possible bad effects of changing the default. This proposal brings closer the day when “enough” of the code we’re reading is immune to a change of default that we’re willing to risk it. The better the linters are, the faster we get there, of course. I don’t think we’re there yet, based on this thread, but we’ll see.

With respect to your specific bullet points, my answer to the encoding=None problem is (1) for open() and io.open(), parse harder, you linters! :wink: and (2) for the implicit cases, linters can warn on “encoding=None” in function definitions (perhaps after checking how it’s used in the body) and then go back and check the calls in such a case. I don’t see how subprocess.Popen(text=True) is different from open() in this respect (although the PEP distinguishes them!). As far as encodings go, the zip and tar file formats are a goat rodeo, and not our problem unless somebody volunteers to deal with them. Heck, I did volunteer to deal with some of the problems with zip, where you currently cannot specify a legacy encoding, but got blocked.
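
For illustration, here is the kind of pattern such a linter check might flag (function and file names are hypothetical):

# A linter could warn on the encoding=None default here, then re-check
# every call site that leaves `encoding` unset:
def read_config_file(path, encoding=None):
    # `encoding` is forwarded untouched, so callers silently inherit the
    # locale-dependent default of open():
    with open(path, encoding=encoding) as f:
        return f.read()

read_config_file("app.cfg")                    # flag: locale-dependent
read_config_file("app.cfg", encoding="utf-8")  # fine: explicit encoding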

As far as the opt-in warning goes, that’s a great idea. But I never enable those warnings myself. I use linters occasionally, but for me the most effective measure would be having the docs consistently specify encodings in examples. The point is not that I’m representative of all programmers; rather, the more widely we cover this, the sooner we get default-proof code.

I also like your proposal for a ‘locale’ (or maybe ‘system_default’) codec that looks up the system default and returns it. Personally, I am happy enough to use sys.getdefaultencoding when that’s what I want, but having it registered in codecs would make a lot of other people happy, I’m sure!


Thank you for your efforts here, Naoki. I am sorry to hear about your wife; please enjoy your time with your family as much as you can. I’m sure I speak for everybody in saying that we respect your efforts, and that you should take care of yourself and your family first, with no regrets.

I don’t know if it will make you feel better, but I proposed making UTF-8 the default encoding for Python programs more than 15 years ago, during the PEP 263 discussions. Guido and MAL convinced me I was wrong then, and though both the particular channel being encoded and the arguments for and against defaulting to UTF-8 have changed with Python 3, I’m still pretty conservative about this, for the same “backward compatibility” reason they put forward then.

Finally, I’m a fan of your work (even if I sometimes oppose implementing it :grimacing:). I know your pain: I lecture in Japanese once a week, and prepping that hour occasionally costs as much as two days. If I can help reduce the language burden, I’m always willing! (I’m not so good at answering email; sometimes it takes me a couple of days. If you want, I’ll give you my phone number for LINE or another messaging app, or DM me @yasegumi on Twitter, I’ve been following you.) よろしくお願いいたします (I look forward to working with you). :grinning:


Thank you for this explanation. I now understand better: in Japan, having a single default codepage was not helpful because so many codepages were in use. My experience is from environments where a single codepage is (usually) sufficient and UTF-8 covers the occasional “foreign” characters. That explains why our views are so different, and I should have thought things through a little further when considering your proposal.

I’m absolutely sure that I am part of the reason for this, so I want to apologise. Thanks for sharing your experience; it gave me a lot to think about. I am definitely guilty of replying too quickly, with too many messages. I’ll try to improve on that in the future.

Thank you for all the effort you put into the discussion.

As for filenames, I would hope most applications use the Unicode APIs?

Usually, a linter is only run on the code of your application. If an open() call with encoding=None is hidden in a 3rd party function like read_config_file(), the linter doesn’t help. My intent with a warning is to ensure that issues are spotted anywhere: in your code, in 3rd party code from PyPI, in the stdlib, etc.
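
A minimal sketch of the idea, hooking open() the way such an opt-in runtime check might behave (illustrative only; the real check would live inside the io module itself):

import builtins
import warnings

_real_open = builtins.open

def warning_open(file, mode="r", buffering=-1, encoding=None, **kwargs):
    # Warn whenever a text-mode open() falls back to the locale encoding,
    # no matter whose code made the call (yours, PyPI, stdlib, ...).
    if "b" not in mode and encoding is None:
        warnings.warn(f"open({file!r}) uses the locale encoding",
                      stacklevel=2)
    return _real_open(file, mode, buffering, encoding, **kwargs)

builtins.open = warning_open

With this hook installed, a read_config_file() buried deep inside a dependency triggers the warning just as readily as a call in your own code.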

I suggest you run your test suite with -X dev; it enables a wide range of extra checks and warnings which help you to spot real bugs. I recently modified the option to log close() exceptions on file objects. It can spot the subtle bug of a file descriptor being closed twice.
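
For example, here is the kind of double-close bug it surfaces (the logged traceback in the comments is what I’d expect from the documented dev-mode behavior, not verified output):

# Save as demo.py and run:  python -X dev demo.py
import os

f = open("demo.txt", "w", encoding="utf-8")
os.close(f.fileno())  # close the underlying descriptor behind io's back
del f                 # dev mode logs the failing close() instead of
                      # silently ignoring it:
                      #   Exception ignored in: <_io.TextIOWrapper ...>
                      #   OSError: [Errno 9] Bad file descriptor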

By the way, if we decide to go down the opt-in warning route, we should also add a “locale” encoding to turn off the warning for legitimate usage of the locale encoding. INADA-san seems to prefer a Windows code page number like “cp1252” or an encoding name like “latin1”, but in my experience, the locale encoding is sadly the best compromise.

Create a filename not encodable to the ANSI code page and try your favorite application. I expect that you will be badly surprised.
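
One quick way to create such a file from Python, assuming a Western (cp1252) ANSI code page where these characters have no representation:

# Python uses the Unicode (wchar_t) file APIs on Windows, so creating the
# file succeeds; applications stuck on the bytes (ANSI) APIs then mangle
# or reject the name.
with open("тест-テスト.txt", "w", encoding="utf-8") as f:
    f.write("hello\n")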


FYI, my 2nd version of this PEP doesn’t propose adding any warning. It adds one option, similar to UTF-8 mode. https://www.python.org/dev/peps/pep-0597/

And the PEP introduces an encoding="locale" option. It is not a codec name or alias, because the locale may change after Python has started. TextIOWrapper treats this name specially: it calls encoding = locale.getpreferredencoding(False), just as when encoding is omitted.
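
A rough sketch of that special-casing (simplified; the real logic lives inside the io module):

import locale

def resolve_encoding(encoding):
    # "locale" is resolved at open() time, not registered as a codec, so a
    # locale change after interpreter startup is still picked up.
    if encoding is None or encoding == "locale":
        return locale.getpreferredencoding(False)
    return encoding

# Intended usage: explicitly opt in to the locale encoding, which also
# documents that the locale-dependence is deliberate.
# f = open("data.txt", encoding="locale")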

Because I use very few applications apart from developer tools, I expect all the apps I use to handle wchar_t filenames.

But… it is a different story in the console world. While Microsoft has improved the console recently, it had neglected it for a long time.
activate.bat needs the chcp 65001 hack. It is very ugly and introduces bugs like this one:
https://bugs.python.org/issue34144

Pessimistically speaking, WSL (and the coming WSL2!) is the best environment to learn Python on Windows.

In my experience, the console is 100% Unicode capable, even on Windows 7. PowerShell is also Unicode capable. I tried both of these using a test like Victor suggested. However, the ancient cmd.exe shell is very definitely not Unicode capable; it uses the OEM codepage. Personally, I haven’t used cmd for years, and I would strongly recommend Windows developers not to use it. But it is still in wide use :sob:

(Edit: From what I recall, cmd.exe uses the OEM code page, which is not even compatible with GUI applications on the same machine!)
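
A quick way to see the mismatch from Python (Windows only; GetACP and GetOEMCP are Win32 calls):

import ctypes

k32 = ctypes.windll.kernel32
# ANSI code page (what GUI apps use for bytes) vs. the OEM code page
# (what cmd.exe inherits): e.g. 1252 vs. 850 on Western systems.
print("ANSI:", k32.GetACP(), "OEM:", k32.GetOEMCP())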

Yup, but it’s not in UTF-8, at least in my Windows 10 setup. If you pipe some Unicode text to a file in PowerShell, it comes out as UTF-16 or something, with a BOM even.

Making UTF-8 the default encoding would be a great move for Python regardless, IMO.

That’s PowerShell doing that, not the console, and you can work around it via | Out-File -Encoding UTF8. But I agree it’s a lousy default. PowerShell isn’t perfect, but it is Unicode compatible (UTF-8 is just one encoding of Unicode, after all).

Thanks for following it through this far. I know how draining PEP threads can be, and I have definitely dropped a few in my time.

As Stephen said, we’ve all tried to change this default in the past and have been convinced out of it, so we are all going to be looking for the new and unique part of the proposal that will make it work.

We’re all here to make Python better, so focus on that. There’s nothing personal about it.

Would you create a new thread about Unicode on the Windows console?

It seems an interesting topic: how Python behaves there now, and how Python should behave in the future.
But this thread is too long already.

It looks like PowerShell 6 provides a UTF-8 world! See this issue:

I confirmed Python uses UTF-8 when stdout is redirected:

PS C:\Users\inada-n> python -c "print('こんにちは')" > x
PS C:\Users\inada-n> cat x
こんにちは
PS C:\Users\inada-n> python3
Python 3.7.3 (tags/v3.7.3:ef4ec6ed12, Mar 25 2019, 22:05:12) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("x").read()
'縺薙s縺ォ縺。縺ッ\n'
>>> open("x", encoding="utf-8").read()
'こんにちは\n'

This is because PowerShell doesn’t detach the console from Python, so os.device_encoding(1) is “cp65001”.
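
You can inspect this from the same session (the values in the comments are what the situation above describes, not guaranteed):

import os
import sys

print(os.device_encoding(1))  # console code page for fd 1, e.g. "cp65001"
print(sys.stdout.encoding)    # the encoding the text layer actually uses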

This is not perfect. Python still uses cp932 for communicating with subprocesses, because subprocess doesn’t use os.device_encoding().

But it seems PowerShell 6 + Python with UTF-8 mode is almost a “UTF-8 world”.

Hm… What’s the difference between “the console” and “the ancient cmd.exe shell”? :thinking:

The console host (conhost.exe) is the UI surface that renders text and processes keyboard and mouse input. Its APIs are all Unicode, but there’s also emulation via file handles so that software written for POSIX can work. Any application can request a console, and will get an identical implementation (or will attach to the parent process’s active console), rather than having to reimplement all the text layout and rendering again.

The cmd.exe shell is one such application, and it converts text into executed commands, including piping and streaming, as well as handling legacy code page changes (which are technically a per-thread setting). It also knows how to process .bat and .cmd files.

It’s actually a very sensible separation IMO, especially when you decide you want a “normal” console alongside the GUI app you’re building. But it can be a little obtuse if you’re used to there only being one shell at a time.

But then, how do you use the “console” like @pf_moore claimed he did? Am I misunderstanding something?

With PowerShell, or Python, or any other application that supports the Unicode APIs for writing to/reading from the console.

cmd.exe (like many of the tools you’d use from it) is not such an application.

(If you’re an end-user, then you don’t get to choose to use the console in any way other than how your applications will let you. It’s only as a developer that you get to choose.)

Just my anecdotal 2c:

As a Windows-only user for the past 10+ years, the only time I’ve written or read things in something other than UTF-8 was when burning subtitles created by others into video. In those cases one can only guess the encoding, so I used chardet.
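
That guessing step looks roughly like this (chardet is a third-party package; the file name is hypothetical):

import chardet  # pip install chardet

with open("subtitles.srt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
text = raw.decode(guess["encoding"])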

Having the default be UTF-8 would have saved me lots of pain over the years.


Thanks for your reaction.
When I looked at the aws-cli repository during a discussion in another thread, I found this issue too.

It’s very obvious that this is a common bug, and that many Windows users suffer because the default encoding is not UTF-8.

On the other hand, it’s very unclear how many (or how few) Windows users would suffer from the backward incompatible change in the mid-2020s. It’s a devil’s proof (you can’t prove the absence of harm).

So my PEP 597 (2nd version) proposes an environment variable to configure the default encoding. If it is accepted, you can change the default text encoding yourself. We can postpone the discussion about when to change the default of the “default text encoding”.

But we have PYTHONUTF8 already. The most important part of PEP 597 is explaining why UTF-8 mode is not enough for Windows users.

So, if you would like to contribute to this discussion, it would be very helpful to try UTF-8 mode now (maybe with chcp 65001).
If it is enough, we don’t need to add yet another configuration option.
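
A quick way to confirm UTF-8 mode is active in your session (run with python -X utf8, or with PYTHONUTF8=1 set):

import locale
import sys

print(sys.flags.utf8_mode)                 # 1 when UTF-8 mode is enabled
print(locale.getpreferredencoding(False))  # reports UTF-8 under UTF-8 mode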

Actually it’s not right to say that fopen doesn’t care about encoding – at least not on Windows. For many years, the C runtime on Windows has supported UTF-8 and UTF-16 text files – and even UTF-16LE for console access. [_w]fopen[_s] takes a ccs mode flag, which can be ccs=UTF-8, ccs=UTF-16LE, or ccs=UNICODE – e.g. "a,ccs=UTF-8". At a lower level, these ccs flag values correspond to the _[w]open[_s] flag values _O_U8TEXT, _O_U16TEXT, and _O_WTEXT. The behavior when opened for reading or appending depends on the presence of a BOM. When opened for writing in these Unicode text modes, a BOM is always written.

The catch for the CRT’s UTF-8 mode is that, on the program side, Unicode text means wchar_t UTF-16LE characters. This is not encoding-neutral support for Unicode via UTF-8 as a sequence of arbitrary bytes. The CRT translates UTF-16LE -> UTF-8 when writing and UTF-8 -> UTF-16LE when reading. This is similar to Python’s str <-> bytes translation between the text and buffered/raw layers. In text mode, the CRT also implements CRLF <-> LF newline translation.