PEP 686: Make UTF-8 mode default

See also the painful history of the PYTHONFSENCODING env var: "Python 3.2 Painful History of the Filesystem Encoding" on Victor Stinner's blog (fortunately, it was never part of any Python release).

2 Likes

Just to note, your whole series of blog articles on this topic (linked in the post above) is invaluable for understanding the current situation, how we got here and why things are the way they are now.

4 Likes

After due consideration, the Steering Council is in favour of this change, and we are inclined to accept the PEP. However, we’re worried about the amount of churn in this area already, as well as the reliance on a feature new in 3.10 for emulating old behaviour, and so we would like the change to the default to be for Python 3.15 instead of 3.13. Unless you have strong objections to the longer timeline, please update the PEP accordingly, and consider it accepted.

3 Likes

I would like to make it the default as soon as possible, but we can promote UTF-8 mode more widely once there is an official schedule for making it the default.
So I happily accept postponing the target version by 2 years.

FWIW, RFC 2044 (the first UTF-8 RFC) was published in October 1996. October 2026 (the expected release date for Python 3.15) will be the 30th anniversary of UTF-8.

Thank you!

2 Likes

Python 3.15 is far in the future. I don’t get why this change cannot be done in Python 3.12. Do you expect issues? Do you expect that in 2026, these issues will be behind us? If yes, what will fix these issues in the meanwhile?

I don’t get the reference to “churn in this area”. Would you mind elaborating?

Are you talking about fixing projects which expect to use UTF-8, but call open(filename) without specifying an encoding and so use the locale encoding instead? IMO passing encoding="utf-8" explicitly is useful on all Python versions (even with PEP 686). PEP 686 will make that unnecessary, since UTF-8 will be the default (unless the Python UTF-8 Mode is explicitly disabled, in which case you still need to pass encoding="utf-8").

Are you talking about encoding="locale"? To use the locale encoding, you can write encoding=locale.getpreferredencoding(False), which works on Python 2.7-3.11. Or you can get the locale encoding once (ENCODING = locale.getpreferredencoding(False)) and then reuse it (open(filename, encoding=ENCODING)).

Obviously, the devil is in the details. locale.getpreferredencoding(False) ignores the locale and always returns "utf-8" if the Python UTF-8 Mode is enabled. That’s why the PEP adds locale.getencoding() (already added to Python 3.11).
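A minimal sketch of the two approaches discussed above. The helper names `locale_encoding` and `roundtrip` are hypothetical and only for illustration. The sketch assumes that falling back to locale.getpreferredencoding(False) is acceptable on Python versions that lack locale.getencoding().

```python
import locale
import os
import tempfile


def locale_encoding() -> str:
    """Return the locale encoding (hypothetical helper, for illustration only)."""
    getencoding = getattr(locale, "getencoding", None)  # added in Python 3.11
    if getencoding is not None:
        # Ignores the Python UTF-8 Mode and reports the real locale encoding.
        return getencoding()
    # Older Pythons: this reports UTF-8 whenever the UTF-8 Mode is enabled.
    return locale.getpreferredencoding(False)


def roundtrip(text: str, encoding: str) -> str:
    """Write text and read it back, always passing the encoding explicitly."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, encoding=encoding) as f:
            return f.read()


if __name__ == "__main__":
    print("locale encoding:", locale_encoding())
    # Explicit UTF-8 behaves the same with or without the UTF-8 Mode.
    print(roundtrip("h\u00e9llo", "utf-8") == "h\u00e9llo")
```

Code that genuinely wants the locale encoding could pass encoding=locale_encoding() to open(), while code that expects UTF-8 keeps passing encoding="utf-8".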

5 Likes

Unless the SC approval is contingent on the longer timeline, I also don’t understand why this can’t be done sooner.

2 Likes

I think 3.12 is too early. Python’s backward compatibility policy requires a deprecation period of two releases.
Although EncodingWarning has been emitted since Python 3.10, we didn’t have an official plan to change the default encoding when 3.10 was released. So Python 3.13 is the earliest possible release.

I wanted to schedule it for Python 3.13 because we can postpone it if moving to UTF-8 mode turns out to be more difficult than we think now.
On the other hand, once we schedule it for Python 3.15, it will be difficult to move it forward even if we learn that moving to UTF-8 mode is really straightforward.

I think Python 3.14 (π) is also a good version for such an important change because it is easy to remember.

1 Like

You can have the change earlier: just enable UTF-8 mode in your Python installations :slight_smile:
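As a rough sketch of what "enabling UTF-8 mode" looks like in practice, the snippet below starts a child interpreter with the -X utf8 option or the PYTHONUTF8 environment variable and reports sys.flags.utf8_mode. The helper name `utf8_mode_in_child` is hypothetical.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print the UTF-8 Mode flag (0 or 1).
CHECK = "import sys; print(sys.flags.utf8_mode)"


def utf8_mode_in_child(interpreter_args=(), pythonutf8=None):
    """Return sys.flags.utf8_mode as seen by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)  # start from a known state
    if pythonutf8 is not None:
        env["PYTHONUTF8"] = pythonutf8
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", CHECK],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("-X utf8      ->", utf8_mode_in_child(["-X", "utf8"]))
    print("PYTHONUTF8=1 ->", utf8_mode_in_child(pythonutf8="1"))
    print("-X utf8=0    ->", utf8_mode_in_child(["-X", "utf8=0"]))
```

The same two switches can also be used to opt out (-X utf8=0 or PYTHONUTF8=0) while a script is being adapted.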

For the general public, and especially for Windows users, more time is needed to raise awareness and let users adapt, though, since the change can cause silent data corruption.

We first have to tell people that the change is coming, point them to the UTF-8 mode to test their installations and give them time to make any necessary changes.

By reminding people of the upcoming major change for a few releases we make sure that everyone is aware and can plan for the change. This is esp. important for larger companies with millions of LOC written in Python.
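To make the "silent data corruption" risk concrete, here is a small sketch. It assumes cp1252 as the legacy Windows encoding and "café" as sample text; both are illustrative choices, not taken from the discussion.

```python
def reinterpret(text: str, written_with: str, read_with: str) -> str:
    """Encode text with one encoding and decode the same bytes with another."""
    return text.encode(written_with).decode(read_with)


if __name__ == "__main__":
    # UTF-8 bytes read with a legacy encoding: no exception, just wrong text.
    print(ascii(reinterpret("caf\u00e9", "utf-8", "cp1252")))

    # Legacy bytes read as UTF-8: this direction fails loudly instead.
    try:
        reinterpret("caf\u00e9", "cp1252", "utf-8")
    except UnicodeDecodeError as exc:
        print("decode failed:", exc)
```

The first case is the dangerous one, because nothing signals that the data was misread.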

1 Like

FWIW, what Ruby did is:

  • RubyInstaller 2.4 provided option to set external encoding to UTF-8 by environment variable. (e.g. RUBYOPT=-Eutf-8)
  • RubyInstaller 2.7 made the option the default.
  • Ruby 3.0 changed the default external encoding to UTF-8 on Windows.

ref: https://bugs.ruby-lang.org/issues/16604#Will-it-work

Maybe we can do something similar to get more feedback from Windows users.

  • Add an “Enable UTF-8 Mode” checkbox to the Python 3.11 installer. It is not checked by default. It sets the PYTHONUTF8=1 environment variable.
  • The Python 3.13 installer checks the option by default.
  • Python 3.15 makes UTF-8 mode the default.

1 Like

Sorry, I have not been following this discussion in detail.

As a Linux user, do I need to care about this UTF-8 mode? I expect that I am already using UTF-8 on any mainstream Linux distro.

Hmm… As a redistributor (Linux distro), can I make UTF-8 mode the default early?

No and yes.
If you need to care, you’re most likely using a pretty special system and you wouldn’t be asking the question :)

1 Like

What will be done to raise awareness? My fear is that if nothing is done, the situation will be the same in 2026 as in 2022: nobody will fix any issue in the meantime.

How do you plan to tell people?

My colleague Tomáš Hrnčiar rebuilt all Python packages in Fedora with PYTHONWARNDEFAULTENCODING enabled. Out of 3891 packages, 143 packages (4%) failed to build: Please try PYTHONWARNDEFAULTENCODING (PEP 597) - #5 by hrnciar

Do we need to wait until this number (143) or this ratio (4%) drops below a threshold? For example, is it ok if only 50 fail to build? Or 10?

We don’t have access to private code written behind closed doors, but we can use public code to estimate how private code will be affected. Also, we can make sure that the most important packages on PyPI are ready for these changes. For example, are the top 50 PyPI packages ready for PEP 686? What about the top 100, or the top 5000?

If there are important milestones and actions that must be completed, can we put such a migration path in PEP 686?

PEP 387 usually requires emitting a warning at runtime. But EncodingWarning is quiet by default. Should we emit EncodingWarning by default starting at a specific Python version to “raise awareness”? For example, can we emit EncodingWarning by default in Python 3.11?

Honestly, most developers ignore warnings. So if EncodingWarning is also quiet by default, nobody will pay attention to it and nothing will happen in the next 4 years.

That is not a reason to ignore our policy. It might be a reason to change our policy - feel free to make that case in a separate thread if you want to. But we don’t just ignore policy because we disagree with it :slightly_frowning_face:

1 Like

Maybe we can add an option like --enable-utf8mode to test an “Enable UTF-8 Mode by default” environment.

Note that they are broken by EncodingWarning, not by UTF-8 mode.

I checked a few of the failures. They failed because their tests check subprocess output.
So EncodingWarning broke their tests.

I expect that less than 1% of packages will be broken by UTF-8 Mode. --enable-utf8mode would help to check that.

Note that DeprecationWarning is also quiet by default unless running tests or in the main script.
Showing the warning by default is not a strict requirement of PEP 387.

And I explicitly declined to show it by default in PEP 686 because it would be too noisy and most of the warnings would be false positives.

As you know, we enable UTF-8 mode automatically when the C locale is used.
There were a few complaints about C locale coercion, but I don’t know of any complaints about UTF-8 mode.
So EncodingWarning will be just noise for most users.

I expect that this PEP will mostly affect scripts running on Windows. And many of them may not have tests.
Users of such scripts need to disable UTF-8 mode until their scripts support UTF-8 mode.
And they can enable EncodingWarning while they are making their scripts support UTF-8 Mode.
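A minimal sketch of opting in to EncodingWarning while auditing a script, assuming Python 3.10 or newer (older versions have no EncodingWarning and will report nothing). The helper names `stderr_with_warning_enabled` and `demo` are hypothetical. It also shows why tests that compare subprocess output can break: the warning text lands on the child's stderr.

```python
import os
import subprocess
import sys
import tempfile

# Child code that calls open() without an explicit encoding= argument.
CHILD = "import sys; open(sys.argv[1]).close()"


def stderr_with_warning_enabled(path: str) -> str:
    """Run CHILD with -X warn_default_encoding and return what it wrote to stderr."""
    env = dict(os.environ)
    env.pop("PYTHONWARNINGS", None)  # avoid inherited warning filters
    result = subprocess.run(
        [sys.executable, "-X", "warn_default_encoding", "-c", CHILD, path],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.stderr


def demo() -> str:
    """Create a small text file and return the child's stderr for it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("hello\n")
        return stderr_with_warning_enabled(path)


if __name__ == "__main__":
    print(demo())
```

Setting PYTHONWARNDEFAULTENCODING=1 in the environment has the same effect as the -X option used here.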

We will need to educate people by publishing blog posts, putting prominent notices on python.org, giving talks at conferences, setting up an FAQ website to point people to, etc. You know: public relations and marketing work :slight_smile:

Since this is not necessarily something we are particularly good at (both core developers and the PSF), we should find someone willing to orchestrate this campaign and have the PSF pay the person.

It would be a good precedent to set, since as an organization (or two depending on how you see things), we need such a communication guru to help get the word out about new features or changes we have in the making.

1 Like

I’m not asking to ignore PEP 387. I’m trying to understand why PEP 686 cannot be done as soon as Python 3.11. I’m not sure that I understand well the relationship between PEP 686 (use UTF-8 by default), PEP 597 (EncodingWarning if the encoding parameter is omitted or is None), and PEP 387 (deprecation policy). If we must follow a specific deprecation process, I would prefer to make it more explicit.

As for whether it has an effect on Linux, I ran into a case the
other day where a user reported they were getting an encoding
exception when trying to run a script. As it turns out, their system
locale was for some reason set to ISO-8859-1/Latin-1 rather than a
UTF-8 based locale. I didn’t ask what distro it was, but I had them
re-run the script with PYTHONUTF8=1 in the calling environment
(after upgrading to a new enough interpreter to support that) and
they reported that it completed without error.

I didn’t dig much deeper, but their exception indicated the script
was attempting to print() a string which could not be successfully
encoded with their chosen encoding. It probably resulted in some
mojibake output in places, if I were to guess.
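The kind of failure described above can be reproduced in a few lines. This is only a sketch: the actual string from that report is unknown, so the euro sign is an assumed example of a character that Latin-1 cannot represent, and `try_encode` is a hypothetical helper.

```python
def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    euro = "\u20ac"  # EURO SIGN, not present in ISO-8859-1
    print("latin-1:", try_encode(euro, "latin-1"))
    print("utf-8:  ", try_encode(euro, "utf-8"))
```

With a Latin-1 locale, print() hits the first case when writing to stdout; with PYTHONUTF8=1, stdout uses UTF-8 and the second case applies.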

1 Like

By the way, the latest discussion thread was here.

Maybe creating a new thread after updating the PEP was a bad idea.

It’s worth requesting an administrator close the first thread when you create a second. Normally I see Brett making new threads and closing the first one himself :wink:

Not a chance. Environment variables bleed into other processes too easily and can’t be easily suppressed. The most I’d consider is some kind of marker file that only affects a particular install, so that embedded copies of the same version aren’t affected, and neither are other versions.

One of the most common reasons for Python failing to start is that people have set PYTHONHOME or PYTHONPATH globally, and the referenced directories are for a different version of Python. We should discourage environment variables for this sort of thing, not create more.

I agree. This was one of my conditions for supporting the PEP. If nobody does any promotional work, then I’ll argue to delay the change. If we’re going to decide now that it won’t be done, I’ll argue to reject the PEP right away.

1 Like