"JEP 400: UTF-8 by Default" and future of Python

JDK 18 is now a release candidate, and will be final this month.
JDK 18 contains JEP 400: UTF-8 by Default. This JEP is interesting for Python too.

Summary of the JEP:

  • JDK 17 introduced the “native.encoding” system property.
  • JDK 18 changed the default encoding to UTF-8. Users can restore the old behavior by setting the file.encoding property to “COMPAT”.
  • No deprecation. This alternative was rejected:

    Deprecate all methods in the Java API that use the default charset — This would encourage developers to use constructors and methods that take a charset parameter, but the resulting code would be more verbose.

So, what can Python learn from this JEP?

native.encoding

Python has had a counterpart (encoding="locale") since Python 3.10. No problem here.

file.encoding

I think Python should provide such a backward-compatible option too.

Python on Windows provided similar backward-compatible options when it changed the stdio and filesystem encodings (e.g. PYTHONLEGACYWINDOWSFSENCODING and PYTHONLEGACYWINDOWSSTDIO).

Additionally, I want to make it a “forward compatible” option: users can opt in to “UTF-8 as the default encoding” with it. I want to add the option to Python 3.11 if possible, as a preview/experimental feature.

We have UTF-8 mode (PYTHONUTF8), but its semantics are slightly different from “change Python’s default encoding to UTF-8”. It means “Python works as if in a UTF-8 locale, regardless of the real locale”.

For example, encoding="locale" becomes UTF-8 in UTF-8 mode, which completely defeats the purpose of encoding="locale". And UTF-8 mode changes some other edge cases too, such as os.device_encoding() and _locale._get_locale_encoding().
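The difference is easy to observe by asking a child interpreter what the “preferred” (locale) encoding is with UTF-8 mode forced on — a sketch, assuming PYTHONUTF8 support (Python 3.7+):

```python
import os
import subprocess
import sys

# Run a child interpreter with UTF-8 mode forced on via PYTHONUTF8=1 and
# ask it what locale.getpreferredencoding(False) reports.
env = dict(os.environ, PYTHONUTF8="1")
out = subprocess.run(
    [sys.executable, "-c",
     "import locale; print(locale.getpreferredencoding(False))"],
    env=env, capture_output=True, text=True,
).stdout.strip()

# In UTF-8 mode the "locale" encoding is reported as UTF-8 regardless of
# the real locale -- which is exactly why encoding="locale" loses its
# meaning there.
utf8_mode_result = out
print(utf8_mode_result)
```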

Instead of changing the semantics of UTF-8 mode, we may be able to add yet another option for forward/backward compatibility.

  • io.text_encoding() becomes UTF-8
  • locale.getpreferredencoding() becomes UTF-8
    • We may add another API (e.g. locale.get_locale_encoding()) and deprecate locale.getpreferredencoding() in the future
  • files, stdio, pipes (discuss later) become UTF-8

My idea for the new option name is -X text_encoding / PYTHONTEXTENCODING:

  • case insensitive
  • “UTF-8”, “UTF8” – Change the default text encoding to UTF-8. Users can use this as the “opt-in”/“forward compatible” option.
  • “locale” – Keep the status quo. Users will be able to use this as the “backward compatible” option after Python changes the default text encoding.

No deprecation

We already have EncodingWarning, which is disabled by default. So we are already more conservative than Java!

I don’t want to start a discussion about showing the warning by default or not. Before that discussion, I want to see:

  • How Java users feel about JEP 400 – how many Java users want a warning for use of the default encoding?
  • How Python users fix EncodingWarning – how often the fix is encoding="utf-8" vs. encoding="locale" (*)

I have created a feedback thread, and I will create a new discussion thread after Python 3.11 becomes beta.

(*) I have fixed several EncodingWarnings in Python and in some essential tools around Python, like pip and tox. I very rarely used encoding="locale". I think changing the default encoding will fix more programs than it breaks. But I want to see a wider range of Python OSS.

stdio and pipe encoding

stdio and pipes in Java are byte streams. Java users need to wrap them (e.g. in InputStreamReader / OutputStreamWriter) to get text streams, and the default encoding for those is file.encoding – UTF-8 by default since JDK 18.

When I wrote PEP 597, I excluded subprocess from EncodingWarning. That is because I think the PIPE encoding should be consistent with stdio, and I was afraid of changing the stdio encoding.

But now I think we should change the encoding of PIPE and stdio when we change the default text encoding. Keeping the legacy encoding there would confuse users more than changing it.

So, if there is no opposition, I will change subprocess to emit EncodingWarning, like open() does, in Python 3.11.
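The explicit spelling such a warning would push users toward already works today: pinning the pipe encoding with encoding= instead of text=True.

```python
import subprocess
import sys

# text=True decodes pipes with the locale encoding; passing encoding=
# explicitly pins the pipe encoding instead (and implies text mode).
proc = subprocess.run(
    [sys.executable, "-c", "print('hello from child')"],
    capture_output=True,
    encoding="utf-8",   # explicit: no dependence on the locale
)
print(proc.stdout)
```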

4 Likes

subprocess.Popen should just deprecate any way to switch to text mode other than specifying encoding. Then once those are gone, the only way users encounter encoding is by explicitly setting it.

For file reads, I’d still like to see a smarter default encoding that:

  • starts in UTF-8
  • if there’s a BOM, silently consumes it (like utf-8-sig) and commits to UTF-8 codec
  • at the first invalid UTF-8 character, emits a warning (visible by default) and switches to the locale encoding

Anyone who specifies an encoding doesn’t get this behaviour, so if you don’t want the warning then the “workaround” is to specify the encoding when you open the file. We can also have an invisible-by-default warning anytime you open a file without specifying the encoding, so that developers get a warning regardless of the content of the file. That warning can suggest specifying encoding, while the other warning could suggest converting the file to UTF-8.
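A rough sketch of that heuristic for the whole-file case. The name read_text_guessing is hypothetical, and the fallback encoding is a parameter here so the demo doesn’t depend on the machine’s real locale; the actual proposal would use the locale encoding and a streaming implementation that switches codecs mid-read.

```python
import codecs
import locale
import os
import tempfile
import warnings

def read_text_guessing(path, fallback_encoding=None):
    """Hypothetical sketch of the proposed default: start in UTF-8,
    silently consume a BOM, and fall back on invalid UTF-8."""
    if fallback_encoding is None:
        fallback_encoding = locale.getpreferredencoding(False)
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        # A BOM commits us to the UTF-8 codec (like utf-8-sig).
        return data.decode("utf-8-sig")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # First invalid UTF-8 byte: warn and switch to the fallback.
        warnings.warn(f"{path}: not UTF-8, falling back to {fallback_encoding}")
        return data.decode(fallback_encoding)

# Exercise the three paths with throwaway files.
d = tempfile.mkdtemp()
paths = {}
for name, payload in [
    ("plain.txt", "héllo".encode("utf-8")),
    ("bom.txt", codecs.BOM_UTF8 + "héllo".encode("utf-8")),
    ("legacy.txt", "héllo".encode("latin-1")),
]:
    paths[name] = os.path.join(d, name)
    with open(paths[name], "wb") as f:
        f.write(payload)

r_plain = read_text_guessing(paths["plain.txt"])
r_bom = read_text_guessing(paths["bom.txt"])
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Fallback pinned to latin-1 here so the demo is deterministic.
    r_legacy = read_text_guessing(paths["legacy.txt"],
                                  fallback_encoding="latin-1")
```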

For file writes I don’t think we have a choice but to deprecate (essentially) open(file, "w") - no encoding argument or 'b' mode means you don’t get to open a file anymore. After a full deprecation, we can bring it back defaulting to a different encoding.

Text encoding is just too complex for us to guess correctly. We’ve already provided the APIs for callers to get it right, so we probably just have to force them to use them.

2 Likes

I think the deprecation period for this would be measured in decades, not years. Whenever I watch a beginner discover how to write a file, I never see them research text encodings. Even in my case, when I know I don’t have to worry about inter-platform differences, I have lots of production code that doesn’t bother.

But text=True is so convenient :grin:


Why is default UTF-8 so bad for all of these APIs (as in, changing the default to a static value "UTF-8", away from "locale")?

What do you mean by this? How would the community know if changing the default would break things without actually changing the default?

We know that the default encoding on Windows has never been UTF-8, and so any files written with default encoding by any version of Python could be using the current locale encoding. Which means any existing file that contains non-ASCII characters (such as paths, usernames, real names, addresses, etc.) becomes unreadable in the new version, or non-transferable between Python versions.

Given a time machine, we’d have changed the default file encoding to UTF-8 with 3.0. But it’s too late now. I fear this change would be as disruptive as the rest of the bytes->str changes, if not more (at least if it showed up in a minor release).

They still don’t have to. Deprecation just means that they’ll get a warning displayed, and if they don’t care they can continue to not care. If they do care, they’ll learn something new. One day, the warning goes away and their files are no longer compatible with earlier versions of Python unless they use the correct encoding for those scripts, whatever that happens to be.

Library developers are the ones who need to make changes. Anyone storing configuration files outside of a Python-version-specific location may find that the encoding changes between releases, and they’re unlikely to be handling it gracefully. They need big loud warnings on that so they can figure out a transition plan to get to a known encoding (either by encoding the encoding somehow, or choosing a universal one and migrating existing files to use it).

2 Likes

Makes sense. The difference between text=True and encoding="utf-8" is only 7 characters.
I will create a new vote about deprecating text=True.

I think we discussed this last year. It is not safe, and an encoding cannot behave like that, because some byte sequences can be interpreted as both UTF-8 and the locale encoding.

We can have something like read_file_contents_with_utf8_or_locale_encoding(path).
But it must be discussed separately from changing the default encoding.

We already have EncodingWarning.

Again, Java changed its default encoding without such a warning, so we are already more conservative than Java.

Java rejected the idea:

Deprecate all methods in the Java API that use the default charset — This would encourage developers to use constructors and methods that take a charset parameter, but the resulting code would be more verbose.

I agree with Java at the moment. In my experience fixing many EncodingWarnings, most of them were simply forgetting encoding="utf-8" when reading JSON, TOML, or Python sources. Changing the default encoding soon will fix such problems without a long deprecation period and without stressing Python users.

I will gather information on how I and other users on GitHub have fixed EncodingWarning, and report it. Please wait.

2 Likes

We have provided EncodingWarning since Python 3.10. Pylint provides a similar warning too.

Some Python users, including me, have started fixing them. How we fix them can be used to estimate how much changing the default encoding would fix or break existing code.

If we fix a warning by adding encoding="utf-8", changing the default encoding would also have fixed the (potential) bug.
If we fix it by adding encoding="locale" or encoding=locale.getpreferredencoding(False), changing the default encoding would break that code.

I will gather that information. If you are interested, here is an incomplete list:

I created a new topic about the text option.

I created another subtopic about adding a new option for forward/backward compatibility.
This is the counterpart of file.encoding=UTF-8/COMPAT in Java.

For maintainers of platform-specific scripts on platforms where the legacy encoding is UTF-8:

  • the existing code is correct, so switching from text=True to encoding="utf-8" would be unnecessary busywork, and
  • getting warnings about having to specify encoding explicitly, because the default is switching from UTF-8 to UTF-8, is silly.

I don’t know how much breakage and/or user complaints the proposed switch will get, but as a maintainer of a UTF-8-by-default platform, I’m worried I might need to side with such users and do things like patch out the warnings in our builds of Python.
I wish there was a good way to target the warnings to cross-platform code only.

1 Like

To the extent that encoding="locale" is exactly equivalent to text=True, this is just churn for no practical benefit. We’re asking users to change their existing code, which expresses the intent (“I want to get the result as text”) to functionally identical code, that expresses mechanism instead (“I want to use the locale encoding”). And given that the “locale” encoding is obscure (it’s not mentioned in the list of standard codecs), the new alternative is much less discoverable, particularly for casual users.

I’d confidently expect that if we did this, Unix users will start using encoding="utf-8", creating a whole new form of non-portable code that Windows users[1] will need to start raising "please use the locale encoding rather than utf-8" bug reports for.

Do we have any figures for how often text=True caused actual bugs which weren’t easy to fix (e.g., by specifying the encoding explicitly!)? This feels awfully like we’re abandoning “practicality beats purity” in favour of treating every program as if it’s mission-critical with a multilingual user base…


  1. And users of Unix systems that don’t default to UTF-8, which still exist, I believe… ↩︎

We are talking about “if we want to change the default encoding used for text=True”. So encoding="locale" and text=True are not the same in this discussion.

I thought that changing the stdio and PIPE encodings at the same time as the default text file encoding would be simple and easy to understand for everyone.
But from a “practicality beats purity” point of view, not changing stdio and PIPE is still an option.

And that’s what I want to discuss before I start writing next PEP. Thank you for your response.

Not changing subprocess also seems fine, though "locale" is an equally bad default for Windows, where it really depends on exactly which process you’re running (e.g. if you’re launching some versions of Java, you’ll want “locale”; for others you’ll want “utf-8”).

Best we have here is off-by-default warnings that are enabled by test runners, on the assumption that platform-specific scripts are far less likely to have unit tests than code that is developed entirely on macOS but intended for cross-platform use. (Or alternatively, that cross-platform developers are more responsible than single-platform scripters… but I’m not at all confident in that assumption!)

If it’s not safe to have a default encoding that accidentally interprets locale-encoded text as if it’s UTF-8 encoded, then it’s also not safe to change the default to UTF-8, because it will fail in exactly the same way on the same text. The only option you’re leaving here is a noisy deprecation and then undeprecating in a few releases time with changed behaviour (which is fine, though everyone else here is opposed to the noise).

It hasn’t caused any bugs yet. The bugs would start arriving when text=True changes from implying encoding="locale" to encoding="utf-8" (which is what was proposed here).

Perhaps we should just plan to change the default in >=3.13, and do a campaign to raise awareness outside of actually running code? Banners/warnings on download pages, possibly add a message to the start of the interactive console, regular tweets, conference talks, etc.

Of course, we actually have to do those things, and probably build in some kind of escape hatch whereby we abort the change if they haven’t been done. But it does mean that people’s code is not at all impacted until that release, so there are no warnings to deal with.

1 Like

Then why are we suggesting that we do that? What’s the benefit? I’m genuinely confused here. I can 100% see the value in everything being UTF-8, but it feels like the current reality (on Windows at least) is that it’s not. So what’s the point in Python denying that reality?

1 Like

Mainly consistency, I think. Also, subprocess currently uses the same default as anyone else who instantiates TextIOWrapper, so it requires a special case to not change it.

It’s definitely more justifiable to have file contents be UTF-8 by default, but the rest merely follows from there. And really, subprocess defaulting to a different probably-incorrect encoding (as it currently does) isn’t much worse than a probably-incorrect UTF-8.

So being able to say “everything defaults to UTF-8, and you probably need to override subprocess calls on Windows” is slightly nicer than “open() defaults to UTF-8, Popen() defaults to mbcs, and you probably need …”

3 Likes

OK, that’s fair. As long as we document how people should override subprocess calls on Windows, otherwise we’ll get a bunch of code that is no longer cross platform :slightly_frowning_face:

And yes, I know “it depends”. I’m not asking for 100% reliability, I’m asking that we document something that’s at least as good as the current text=True behaviour. Anything less than that is not only a regression, it’s also a regression that we haven’t provided a workaround for.

Edit: Just to clarify why this bothers me so much, I spent a big chunk of time when projects were migrating to Python 3, helping people to make their code cross-platform to Windows - precisely because they assumed the default behaviour was sufficient. I have no wish to go through that whole exercise again.

1 Like

Java provides the file.encoding=COMPAT option, and I am proposing a similar option for backward compatibility. So I am not proposing a breaking change without a workaround.

Sadly, there is no best “default” encoding for Windows.
Some tools output UTF-8, some tools use the ANSI code page, and some use the console output code page.
PowerShell uses UTF-8 ($OutputEncoding) to pass text to an external command (stdin in Python), but uses the OEM code page ([console]::OutputEncoding) to read text from an external command (stdout/stderr in Python). (See this issue.)

One obvious point: more and more tools will use UTF-8, and fewer and fewer tools will use legacy encodings. Go, Node.js, and Rust tools use UTF-8. And now Java uses UTF-8 by default.

We can keep the PIPE and stdio encodings legacy and change only the file encoding for now. But we would need to make the breaking change at some point, and then we would need two backward-compat options: one for the file encoding and one for the stdio/PIPE encoding.

Another idea is using ConsoleCP for stdin and ConsoleOutputCP for stdout/stderr. But should we do the same thing for subprocess PIPEs? There is no guarantee that Python has an attached console.
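For reference, those code pages come from the Win32 GetConsoleCP / GetConsoleOutputCP calls, which fail when no console is attached — which is exactly the problem for PIPEs. A hedged ctypes sketch (returns None off Windows or without a console):

```python
import sys

def console_code_pages():
    """Return (input_cp, output_cp) from the attached Windows console,
    or None on other platforms or when no console is attached."""
    if sys.platform != "win32":
        return None
    import ctypes
    kernel32 = ctypes.windll.kernel32
    in_cp = kernel32.GetConsoleCP()         # code page for console input
    out_cp = kernel32.GetConsoleOutputCP()  # code page for console output
    if in_cp == 0 or out_cp == 0:
        # The calls return 0 on failure, e.g. no attached console --
        # precisely the case a subprocess PIPE cannot rely on.
        return None
    return in_cp, out_cp

cps = console_code_pages()
print(cps)
```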

1 Like

I’d prefer for the “locale” encoding to be the ANSI code page of the user locale. By default, the user locale is set to match the user’s preferred UI language. The latter determines application resource strings returned by FindResourceExW() / LoadResource() or LoadStringW() and localized system messages from FormatMessageW() (e.g. system error messages). The locale itself also has strings such as localized weekday names and symbol characters, from GetLocaleInfoEx().

A multilingual user might set a different preferred UI language and start a new session, or create multiple accounts on a machine, one for each language, and switch between running sessions. Or a machine could be shared by multiple users, each with an account that uses a different language. Changing the system code page in these cases isn’t practical. It requires administrator access and restarting the machine. It’s not a user preference.

Facts on the ground are stubborn, however. The application ecosystem in Windows is stuck in the mentality of Windows 9x (1995-2000) when it comes to using system code pages for text files and IPC.

  • The Windows API in Windows 9x didn’t support Unicode, so it was critical to use the system ANSI code page.
  • A “DOS box” in Windows 9x literally ran MS-DOS in a VM. This always used the system OEM encoding; SetConsoleCP() and SetConsoleOutputCP() weren’t implemented.
  • Windows 9x systems also didn’t enable user profiles by default, and, even if enabled, the logon screen could be easily bypassed by pressing escape.

What I’d like is for the “locale” encoding to track the current LC_CTYPE encoding in Windows, exactly as it does on most POSIX systems. Python calls setlocale(LC_CTYPE, "") at startup, which sets a default locale that combines the user locale and the system ANSI code page, so the current default behavior would remain the same.

What changes is that a script can call locale.setlocale(locale.LC_ALL, ".ACP") to get consistent support for the user locale as the “locale” encoding. Or call locale.setlocale(locale.LC_ALL, ".utf8") to use UTF-8. One might use the latter if the process ANSI code page is UTF-8, which indicates a non-legacy process or system.

Later on, if Python wants to make UTF-8 the default in Windows, just call setlocale(LC_CTYPE, ".utf8") and setlocale(LC_TIME, ".utf8") [*] at startup instead of setlocale(LC_CTYPE, "").


[*] Setting LC_TIME to the same locale as LC_CTYPE is required by time.strftime() in Windows. This doesn’t have to be the case. time.strftime() calls C strftime() instead of wcsftime() due to an old bug in the “%Z” format code (time zone name) that was fixed in ucrt a very long time ago. time.strftime() should revert to calling wcsftime() in Windows.

This would also make the “locale” encoding basically consistent [*] with the C API locale encoding, sans support for UTF-8 mode. The C API calls mbstowcs() and wcstombs(), which use the current LC_CTYPE encoding.


[*] Except the C API has a bug in Windows. The implementation in Python/fileutils.c assumes that wchar_t is a 32-bit value with respect to lone surrogates and the “surrogateescape” error handler, but wchar_t is 16-bit in Windows. A surrogate code in a high-low pair is a non-BMP ordinal. Encoding non-BMP ordinals is relevant if the locale encoding is UTF-8.

I heard that you want to support LC_CTYPE on Windows. I’m neutral on that.

On the other hand, I want to make Python use UTF-8 by default regardless of locale. Users should be able to use UTF-8 by default without knowing about locales.

This is the same as Go, Rust, Node.js, Ruby, and now Java. Users don’t need to learn about locales until they really need locale support.

Locales are difficult, and not only for young developers. I find locales difficult too.

I do not propose “forcing UTF-8 as the default”.
Ruby and Java changed the default encoding to UTF-8, but they provide a way to opt out. I think Python should do the same.

1 Like