PEP 597: Enable UTF-8 mode by default on Windows

In my experience, people new to Python are generally fine with the current defaults. (Although I admit I’m in a country which is mostly-ASCII, so my experience is privileged in this regard).

Anyhow, I’ll let others comment here. I’m -1 on this proposal, I’ll leave it at that.

In my experience from a not-only-ASCII country, beginners are not at all fine with Python’s defaults on Windows.
Since code editors use UTF-8 and Python I/O defaults to something else*, they often can’t correctly read in a text file they just saved – unless they’re careful about encodings.

* (That’s not necessarily some common “national codepage”. Many people set computers to English because it has the best localization, and I often meet multilingual people who moved in from neighboring countries.)

In my experience, “always use encoding='utf-8' when opening text files” is the best advice I can give to beginners. When they want to share their files with non-Windows systems or differently-configured Windows, having everything in UTF-8 solves a lot of problems (compared to trying to properly keep track of and convert the encoding of every piece of text). And exceptions are fine: when they meet a non-UTF-8 file (most commonly when interacting with government data/APIs, required by some crusty old standard to use the national encoding), they’re in a good place to learn about encodings. Arguably, teaching encodings separately from file I/O basics is better pedagogy.

IMO, viewing non-UTF-8 applications (like mingw or old config files) as exceptions that need to be treated specially is not much different from the status quo, where basically anything is an exception to be treated specially – as soon as you share files. Specifically, if your tox.ini or setup.cfg could be shared across systems, relying on a machine-local encoding is already a bug (unless you rely on a smart sharing tool).
But we can start treating these as exceptions now, rather than wait until everything uses UTF-8.

Whereas now, people learn about the problem very slowly and – eventually – painfully.

If the default changes to a single specific encoding, then everyone will magically start writing their files with a known encoding.
Consider a person who designs a system with a configuration file, but doesn’t know or care about encodings at first. When someone points out the issue, they could just say “Oh, it’s UTF-8! Thanks for asking, let me document that!” rather than “Oops, it’s system-dependent! I should have saved the encoding into the file, or specified it explicitly. Sorry!”

IMO it can be a deprecation-cycle change.
If not specifying an encoding is either a hard-to-detect-at-first data-loss bug, or something that can be fixed easily because you know the encoding, we should start issuing warnings now.
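
As a rough sketch of what such a warning could look like (a hypothetical shim for illustration, not a concrete implementation proposal; it assumes encoding is passed by keyword, as is typical):

    import builtins
    import warnings

    _real_open = builtins.open

    def warn_open(file, mode="r", *args, encoding=None, **kwargs):
        # Warn whenever a text-mode open() relies on the locale default encoding.
        if "b" not in mode and encoding is None:
            warnings.warn(
                f"'encoding' not specified for {file!r}; "
                "falling back to the locale default",
                stacklevel=2,
            )
        return _real_open(file, mode, *args, encoding=encoding, **kwargs)

    builtins.open = warn_open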

4 Likes

I’m not too keen on UTF-8 encoding; to me it wastes disk space.

Most Chinese characters are in the BMP. Such a character takes 3 bytes in UTF-8, but only 2 bytes in the local encodings (GBK/GB18030).

I tested a text book; the file sizes were:

  • GBK: 116 KB
  • UTF-8: 172 KB

If there are many such files, the extra storage space is considerable.
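
The per-character difference is easy to check in a REPL (a quick illustration):

    >>> "中".encode("gbk")      # 2 bytes in the legacy encoding
    b'\xd6\xd0'
    >>> "中".encode("utf-8")    # 3 bytes in UTF-8
    b'\xe4\xb8\xad'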

A concern is backward compatibility; maybe a lot of code would be broken.
How about adding an open_utf8() convenience function?
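
For what it’s worth, such a helper would be nearly a one-liner (a sketch; the name open_utf8 is just the suggestion above, and it is for text mode only, since passing an encoding in binary mode is an error):

    import functools

    # Hypothetical convenience wrapper: open() with the text encoding
    # pinned to UTF-8.
    open_utf8 = functools.partial(open, encoding="utf-8")

    with open_utf8("notes.txt", "w") as f:
        f.write("你好")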

I agree, but the only reasonable warning is “you did not specify an encoding in this call to open”. We can’t base it on the data being decoded (this is what was so bad about str+bytes and auto-decoding), the warning has to appear regardless, on all platforms and in all test suites.

Users can’t necessarily fix the warning themselves, which makes them noise on the level of the invalid escape sequence warning. So the warning needs to reach library maintainers.

At the end of it, only unmaintained libraries will rely on the default encoding to read back their own files, as everyone else will be warned/linted into specifying an encoding. So then we can decide whether to break all the unmaintained code or not - historically, we’ve decided not to break it.

It’s certainly a multi-year effort, including marketing to make people aware that the change is coming and why, none of which is covered in the proposal. PEPs 528 and 529 could slip by (in part because of existing years of deprecation of bytes paths), but I don’t believe this one can without directly causing data loss.

Perhaps we should see whether Jupyter would launch Python kernels in UTF-8 mode? That would reach a lot of people who have to worry about encodings (though they’d probably prefer Latin-1, as they generally just want best effort to read an existing file…)

Maybe that’s the first step? If Pylint, flake8, etc. started to lint on not specifying an ‘encoding’ argument to open(), that would hopefully start nudging people slowly towards better practices, such that we can then flip on a deprecation warning in a few years and change the semantics in Python 4.
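
To illustrate, a toy version of such a lint check fits in a few lines (illustrative only; a real linter would also need to skip binary-mode calls and handle aliased or shadowed names):

    import ast

    def find_unspecified_encoding(source):
        # Flag open() calls that pass no 'encoding' keyword argument.
        for node in ast.walk(ast.parse(source)):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == "open"
                    and not any(kw.arg == "encoding" for kw in node.keywords)):
                yield node.lineno

    code = "f = open('x.txt')\ng = open('y.txt', encoding='utf-8')\n"
    print(list(find_unspecified_encoding(code)))  # -> [1]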

Python uses the active process codepage (i.e. GetACP()) as the preferred encoding. This is normally the legacy codepage of the system locale. The locale in C, on the other hand, uses the user locale. Recently, the CRT has staked out a middle-ground position. If the process active codepage is UTF-8, setlocale(category, "") uses UTF-8 instead of the legacy codepage from the user locale. For example:

C:\>python -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.1252

C:\>python.utf8 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.utf8

Switching the whole process to UTF-8 (not just Python code) during testing may help to highlight code that makes fragile assumptions. Code that assumes that a given default encoding (e.g. process, locale, preferred encoding) is the legacy codepage from the system or user locale will hopefully fail hard on an encoding error and get fixed. However, to be fair, in some cases it could also mask programming errors. This includes extension modules and linked DLLs, which may call MBS ‘ANSI’ functions such as CreateFileA. Python core is guilty of this mistake. _winapi.CreateFile calls CreateFileA with a UTF-8 string. :frowning_face: If the active codepage is a legacy encoding, Windows will happily ‘decode’ the UTF-8 string as mojibake, but this at least can potentially get flagged as an error. If the active codepage is UTF-8, however, this bad code won’t raise any red flags because it just happens to be using the right encoding.

Suggestion

We could replace “python[w].exe” in an installation with a launcher that runs the base executable. This generalizes how venv virtual environments are currently implemented. The base executable would be distributed as both UTF-8 and locale variants, maybe named “python[w].utf8.exe” and “python[w].locale.exe”. The UTF-8 version would set the “activeCodePage” to UTF-8 in the embedded manifest. The launcher would detect UTF-8 mode when it’s enabled by “-X utf8”, PYTHONUTF8, or a new “UTF8” registry value (not set by default). The PYTHONUTF8 environment variable would take precedence over the “UTF8” registry value. The system “py” launcher would also be updated to support this scheme and thus bypass the middle-man launcher.

Also, at startup in the base process, if GetACP() returns CP_UTF8 (65001) and UTF-8 mode isn’t explicitly disabled by the environment/registry, then automatically enable it. Thus UTF-8 mode is automatically set, if it’s not explicitly disabled, when either “python[w].utf8.exe” is run directly or when the system locale is configured to use UTF-8. This supports any code that special cases UTF-8 mode, such as setting the preferred encoding value to “utf-8” instead of “cp65001”.
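
For reference, the check is straightforward from Python (Windows only; a sketch of the startup condition described above):

    import ctypes

    CP_UTF8 = 65001
    # GetACP() returns the active code page of the current process.
    if ctypes.windll.kernel32.GetACP() == CP_UTF8:
        print("active code page is UTF-8; enable UTF-8 mode")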

New “system_ansi”, “system_oem”, “user_ansi”, and “user_oem” encodings could be added that use the system locale and user locale legacy codepages, which can be queried directly via GetLocaleInfoEx. “mbcs” (alias “ansi”, “dbcs”) and “oem” would continue to use the process active codepages from GetACP() and GetOEMCP().

If UTF-8 mode eventually becomes the default (maybe after Windows 8.1 support ends on 2023-01-10 – Python 3.12?), then the only required change would be to set the registry “UTF8” value to true for a particular installation.

So if I’m understanding it, the proposal is now roughly to have UTF8 mode enabled when the user has cp65001 set. Is that correct?

If so, then how is telling people “set cp65001” any different from telling people “set PYTHONUTF8=1”? Both involve making an environment configuration change, and both would have exactly the same impact on Python. Where they differ is in how they affect other aspects of the system, and it’s not up to us to tell users how to configure the non-Python parts of their system.

By the way - one issue here is that the proposal is about enabling UTF-8 mode, but it doesn’t really explain what problems this would solve. I’ve no idea whether any of the encoding issues I’ve encountered over the years would have been fixed by UTF-8 mode, for example. It’s hard to discuss the question without a proper feel for what it will achieve in practice.

1 Like

Is this in reference to Inada’s suggestion to “enable UTF-8 mode automatically when the console code page is 65001”? A few console applications, including CMD, use the console input or output codepage even for non-console files. If they’re run without a console (i.e. as a detached process), they usually just default to the process active codepage, because GetConsoleCP() and GetConsoleOutputCP() return 0 (i.e. CP_ACP).

It’s a weird decision. The encoding of a particular file used by a process shouldn’t determine the preferred I/O encoding of the whole process. I’d reluctantly accept this if Microsoft’s own programs consistently did it, and if it was documented as such, but they don’t, and it’s not documented as such.

Many console applications use the legacy OEM codepage of a process for non-console I/O, including:

  • icacls.exe
  • attrib.exe
  • tree.com
  • fc.exe
  • net.exe

When their output is redirected to a pipe or disk file, non-OEM characters in strings (typically filepaths) get corrupted by best-fit and default character (“?”) translation.

Otherwise, most applications, especially GUI applications, use the process active codepage, assuming they don’t use the wide-character API with UTF-16. This sets a common codepage for non-Unicode text on a multi-user system. If a user’s preferred language isn’t compatible with the active codepage, then a program should at least warn the user and provide a chance to switch to UTF-8 or UTF-16. Even Notepad gets this right.


In Windows 10, both the active codepage and the active OEM codepage can be forced to UTF-8. This affects the entire process, including extension modules and DLLs. It’s useful in testing, such as looking for cases where the encoding of files isn’t explicit, or cases where multibyte-string API calls use the wrong encoding for arguments. The MBS API is far less forgiving when UTF-8 is the active codepage. It still doesn’t use a strict decoding of arguments, but invalid sequences get mapped to the replacement character (U+FFFD), so there’s no chance that a string (commonly a filepath) with the wrong encoding will silently round-trip to and from an incorrect Unicode string, as is often the case with legacy codepages.
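
Python’s own codecs show the same property: with errors='replace', a wrong-encoding decode is visibly lossy instead of silently round-tripping (a small illustration):

    bad = "café".encode("cp1252")          # b'caf\xe9' – not valid UTF-8
    text = bad.decode("utf-8", errors="replace")
    print(text)                            # 'caf�' – the error is visible
    assert text.encode("utf-8") != bad     # no silent round-trip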

That said, my suggestion in a previous reply to force the active codepage when UTF-8 mode is enabled was looking at the wrong setting. Forcing the active codepage to UTF-8 is much closer to Unix PYTHONCOERCECLOCALE. Except it’s coercing the system and user locales away from their legacy ANSI codepage instead of coercing the POSIX and C locales away from ASCII. Also, there’s no support for an LC_CTYPE environment variable in Windows to coerce the locale of child processes.

Yes. I want an easy way to enable UTF-8 mode for environments which are popular/recommended for new Python users. Jupyter is executed from the command line, VSCode, and PyCharm.

My idea is adding a “set PYTHONUTF8=1 environment variable” option to the installer. But the installer is not used when Python is installed from conda or the Windows Store.

Latin-1 can read UTF-8, Shift-JIS, etc. without any error. That makes Python 3 more dangerous than Python 2.
I rarely use latin-1, and only when I want byte-transparent behavior; even then it’s a hack. It shouldn’t be recommended, especially to new users.
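
The reason latin-1 never raises is that it maps every byte value directly to a code point, so any byte stream “decodes” (a quick demonstration):

    data = bytes(range(256))
    # latin-1 maps bytes 0x00–0xFF straight to U+0000–U+00FF, so decoding
    # can never fail – even for UTF-8 or Shift-JIS data.
    text = data.decode("latin-1")
    assert text.encode("latin-1") == data  # lossless, but semantically wrong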

Maybe it is too early to discuss this.

Currently, chcp 65001 (or [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding in PowerShell Core) is a hack for users who want to use tools like TeX, node.js, GHC, etc…

I expect many terminals will have “UTF-8 sessions” by default in the near future. For example, see this issue on Windows Terminal.

So I want to discuss whether there is something we should fix before enabling UTF-8 mode automatically, or whether UTF-8 mode is already ready for the “UTF-8 console on Windows” era.

Much sample code in books, and much production code, omits the encoding option even though it expects UTF-8 – for example, when reading JSON, Markdown, reST, and CSV files downloaded from the internet.

I wrote the example in this PEP. Even packaging.python.org and the json module omit the encoding option when reading README.md and JSON files.
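
The pattern in question is just the first form below; the fix is the second (file name hypothetical):

    import json

    # Relies on the locale default encoding – wrong for UTF-8 data on a
    # legacy-codepage Windows system:
    with open("data.json") as f:
        data = json.load(f)

    # Explicit and portable:
    with open("data.json", encoding="utf-8") as f:
        data = json.load(f)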

Can it be done without warnings, though? If there were a “Python 4.0” that allowed backcompat breaks without a deprecation period, we’d probably need to support “3.x” for ten years again. And if we can’t skip a deprecation period anyway, what’s the reason to wait for 4.0?

But yeah, that’s a lot of warnings.
I live in a very comfortable bubble of UTF-8 everywhere (with a few explicit exceptions), and I don’t think I can convince everyone else in this bubble that such warnings would be useful :‍(

I’m afraid that this is a more general problem with Python’s warnings :‍(

But that still only describes the symptom. What’s the problem? I use

    with open("filename") as f:
        for line in f:
            ...  # whatever

quite often in my scripts. It’s perfectly fine, because I know the data, know the program, and have no intention of distributing my code. People not ensuring that they know the correct encoding when consuming textual data is a problem, yes, but it’s a people problem, not a technical one.

I don’t want to make it harder to do the convenient thing when it’s correct. That damages one of Python’s strengths for me (that it’s useful for quick scripts as well as for large programs).

2 Likes

I’ve opened an issue to consider changing the default in Jupyter, as @steve.dower suggested above.

For the record, I’m strongly in favour of eventually making UTF-8 the default for text-mode file I/O everywhere. This is a significant Python wart: open('blah', 'r') is almost always wrong, but it works for me, so it’s hard to remember that it’s wrong, and even then it’s easy to go ‘eh, good enough for now’.

Maybe the way forwards is to split up the different ‘default encoding’ parts more? At present, UTF-8 mode affects at least:

  • Text mode file I/O
  • Text mode streams to/from subprocesses
  • Python’s own std* streams
  • Filesystem encoding (file and folder names)
  • Command line arguments passed to Python
  • Environment variables

For things like command line arguments or talking to subprocesses, a locale-dependent default makes more sense. For text-mode file I/O, UTF-8 seems a clearly better choice, because it’s so common now to move files between computers.

3 Likes

Yes, it’s fine. And if you want to run a program which expects a legacy encoding in a UTF-8 console, you can disable UTF-8 mode anyway.

Or are you against raising a warning when the encoding option is omitted?
Maybe this should be checked statically by linters instead of with a runtime warning.

If you do that, then for each part you need to know or guess correctly, for all possible systems and configs. If you guess wrong and release that guess in a library, you can’t correct it later without a painful deprecation process. What’s more, if there are edge cases that could reasonably fit into two or more of those categories, all systems would need to draw the “line” the same way.

People developing on “UTF-8 everywhere” systems will not appreciate the added complexity at all, and will probably choose wrongly.

On Windows:

  • When the std* streams are attached to a console, they are UTF-8 already. UTF-8 mode affects only pipes and redirected files.
  • The filesystem encoding is UTF-8 already, too.
  • Command line arguments passed to Python are parsed from wchar_t, so UTF-8 mode doesn’t affect them.
  • Environment variables are also processed as wchar_t.

So subprocess is the main part of the stdlib to consider.

PYTHONIOENCODING can override the encoding for stdio.

Adding one more option to the subprocess module is an option worth considering.
But I’m -0.5 on it because it makes Python more complex…
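
For completeness, subprocess already accepts an explicit encoding for its pipes today. A sketch of agreeing on UTF-8 with a child Python, by forcing its stdio via PYTHONIOENCODING so the parent’s decode matches:

    import os
    import subprocess

    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(
        ["python", "-c", "print('héllo')"],
        capture_output=True,
        encoding="utf-8",   # decode the pipes as UTF-8
        env=env,            # make the child write UTF-8
    )
    print(result.stdout)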

Yes, I am.

Yes, at an absolute minimum, having linters report this as an issue would mean that we can find out if it’s a reasonable position to take before we make changes to Python.

But once again I’m confused. I’m not using a UTF-8 console because, as I said a number of times, it breaks my other applications. I thought we’d established that we were only talking about a world where UTF-8 mode would only be enabled when the user has set cp 65001. So what I’m asking here is: what problem is solved by asking users to set cp 65001 that would not just as well be solved by asking them to set PYTHONUTF8=1?

Or maybe I’m asking what problems people who have set cp 65001 see that people like me, who don’t, are not encountering. And why aren’t we telling those people either to set PYTHONUTF8=1 or to change their code page to something other than 65001? The latter is not a serious suggestion – I don’t expect to ask anyone to change their default codepage just to satisfy Python!

What I’m really asking is for an example of a program that someone could write, which gives the wrong results, so that I can understand the options for fixing that issue and why PEP 597 is a better option than the alternatives.

The problem that Inada wants to solve is pretty much just the result of locale.getpreferredencoding(). Currently it’s the active codepage of the process, CP_ACP. Inada wants to change it to CP_UTF8.
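
Concretely, it’s a one-liner to check; on a typical Western European Windows system this prints 'cp1252' today, and under the proposal it would print 'utf-8':

    import locale

    print(locale.getpreferredencoding())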

What do you mean by “set cp 65001 on”? If you mean the input and output codepages of the console, I can hardly believe that a program would use the terminal encoding as a locale setting. This is an oddity that a small fraction of console programs, such as CMD, follow for their own historical reasons. The console is just a set of files. It’s as if the encoding of a configuration file dictated the preferred encoding of the entire process. That’s just too strange to seriously consider.

But why is the current value a problem? No-one seems to be able to explain that :slightly_frowning_face: It’s the default encoding used in open(), but the current value matches what (for example) notepad uses by default so changing this would mean I can no longer write a file with the default settings in notepad, and read it in Python with the default settings. Maybe that’s not the use case that the PEP considers to be the most important, but if we’re proposing to break one usage while fixing another, surely it’s necessary to detail both in the PEP so that the trade-offs can be made clear?

Sorry. I know the situation is more complicated than this in reality. I was over-simplifying in the interests of not getting bogged down in detail.

The point here is that we seem to be proposing that Python deliberately choose a different default encoding from the one that the user has set (accepting that “setting a default encoding” is a lot more nuanced than it might seem to the user who just sets the “system locale” in control panel). But there’s no justification given for why we should go against the system setting, other than a bald “UTF-8 is better” comment.

To be clear, I prefer UTF-8. I’ve done a lot of customisation on my PC to use UTF-8 wherever I can. I would always recommend that programs write files in UTF-8. But I do not think that we should unilaterally default to UTF-8 in contradiction of the user having selected a different value in the system settings.

For context, PEP 540 only proposes using UTF-8 mode on Unix if the POSIX locale is in use. So I don’t see why we’re being more aggressive than that on Windows. Or were there later changes to how UTF-8 mode works on Unix that I didn’t locate? And again, why do I need to do all this research myself? Surely the PEP should explain the problem and the context better in the first place? At this point, all I’m trying to ask is that the PEP justify the proposal, rather than appealing to a simple “UTF-8 is good” claim.

The closest to a justification I can see is “loads of other programs ignore the default, so we should too”.

1 Like

I assume Thomas understands this already (I didn’t read the issue he created), but Jupyter is special because it doesn’t need UTF-8 mode enabled by the user; it needs it added to the kernel spec for IPyKernel.

So it’s a change within that tool to affect how it creates Python subprocesses (by enabling UTF-8 mode), which also then means it could safely use UTF-8 decoding, as it has explicitly agreed on an output encoding with the subprocess. This is the “perfect” case, but not one that really benefits from a different default value.

The rest of the discussion is more interesting than this point, so I don’t think there’s any further to go on it. Paul’s last point – that the PEP should do the justification rather than the reader having to do their own research – is also my primary criticism.