JDK 18 is now a release candidate and will be final this month.
JDK 18 contains JEP 400: UTF-8 by Default. This JEP is interesting to Python too.
Summary of the JEP:
- JDK 17 introduced the “native.encoding” system property.
- JDK 18 changed the default encoding. Users can get the old behavior by setting the file.encoding property to “COMPAT”.
- No deprecation. The JEP lists the following as an alternative it did not adopt:
Deprecate all methods in the Java API that use the default charset — This would encourage developers to use constructors and methods that take a charset parameter, but the resulting code would be more verbose.
So, what can Python learn from the JEP?
native.encoding
Python has had a counterpart, encoding="locale", since Python 3.10. No problem.
file.encoding
I think Python should provide such a backward-compatible option too.
Python on Windows provided similar backward-compatible options when it changed the stdio and filesystem encodings (e.g. PYTHONLEGACYWINDOWSFSENCODING and PYTHONLEGACYWINDOWSSTDIO).
Additionally, I want to make it a “forward-compatible” option: users can opt in to “UTF-8 as the default encoding” with it. If possible, I want to add the option to Python 3.11 as a preview/experimental feature.
We have UTF-8 mode (PYTHONUTF8), but its semantics are slightly different from “change Python’s default encoding to UTF-8”. It means “Python works as if in a UTF-8 locale, regardless of the real locale”.
For example, encoding="locale" becomes UTF-8 in UTF-8 mode, which completely defeats the purpose of encoding="locale". UTF-8 mode also changes some other edge cases, such as os.device_encoding() and _locale._get_locale_encoding().
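One visible effect of UTF-8 mode can be checked with a small sketch like the following. It starts a child interpreter with -X utf8 and reports what locale.getpreferredencoding(False) returns there; the helper name is made up for illustration.

```python
import subprocess
import sys

# Child script: report the UTF-8 mode flag and the "preferred" encoding.
CHILD = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)


def preferred_encoding_in_utf8_mode():
    """Run a child interpreter with -X utf8 and return (flag, encoding)."""
    out = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHILD],
        capture_output=True,
        encoding="utf-8",
        check=True,
    ).stdout
    flag, encoding = out.split()
    return int(flag), encoding


flag, encoding = preferred_encoding_in_utf8_mode()
print(flag, encoding)
```

The reported encoding is UTF-8 whatever the real locale is, which is why UTF-8 mode is not the same thing as only changing the default text encoding.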
Instead of changing the semantics of UTF-8 mode, we may be able to add yet another option for forward/backward compatibility. With that option:
- io.text_encoding() becomes UTF-8
- locale.getpreferredencoding() becomes UTF-8
  - We may add another API (e.g. locale.get_locale_encoding()) and deprecate locale.getpreferredencoding() in the future
- files, stdio, pipes (discuss later) become UTF-8
My idea for the new option name is -X text_encoding / PYTHONTEXTENCODING:
- Case-insensitive.
- “UTF-8”, “UTF8”: Change the default text encoding to UTF-8. Users can use this as the opt-in, “forward-compatible” option.
- “locale”: Keep the status quo. Users will be able to use this as the “backward-compatible” option after Python changes the default text encoding.
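The value handling described above could be sketched roughly as follows. This is only an illustration of the proposal, not existing CPython behavior: resolve_text_encoding is a hypothetical helper, and giving the -X value priority over the environment variable is an assumption of this sketch.

```python
import os


def resolve_text_encoding(xoption=None, environ=None):
    """Return "utf-8" or "locale" for the proposed option (hypothetical).

    xoption models the value of -X text_encoding; environ models os.environ.
    Letting the command-line value win over PYTHONTEXTENCODING is an
    assumption of this sketch, not part of the proposal.
    """
    if environ is None:
        environ = os.environ
    value = xoption if xoption is not None else environ.get("PYTHONTEXTENCODING")
    if value is None:
        return "locale"  # status quo when the option is not set
    normalized = value.lower()  # the option is case-insensitive
    if normalized in ("utf-8", "utf8"):
        return "utf-8"
    if normalized == "locale":
        return "locale"
    raise ValueError(f"unsupported text encoding option: {value!r}")


print(resolve_text_encoding("UTF8", {}))
```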
No deprecation
We already have EncodingWarning, which is disabled by default. So we are already more conservative than Java!
I don’t want to start a discussion yet about whether to show the warning by default. Before that discussion, I want to see:
- How Java users feel about JEP 400: how many of them want a warning for use of the default encoding?
- How Python users fix EncodingWarning: how often they use encoding="utf-8" versus encoding="locale" (*)
I have created a feedback thread, and I will create a new discussion thread after Python 3.11 reaches beta.
(*) I fixed several EncodingWarnings in Python and in some essential tools around Python, such as pip and tox. I very rarely used encoding="locale". I think changing the default encoding will fix more programs than it breaks, but I want to look at a wider range of Python OSS projects.
stdio and pipe encoding
stdio and pipes in Java are byte streams. Java users need to use InputStreamReader and OutputStreamWriter to get text streams, and their default encoding is file.encoding, which is UTF-8 by default since JDK 18.
When I wrote PEP 597, I excluded subprocess from EncodingWarning, because I thought the PIPE encoding should be consistent with stdio and I was afraid of changing the stdio encoding.
But now I think we should change the encoding of PIPE and stdio when we change the default text encoding. Continuing to use the legacy encoding would confuse users more than the change would.
So, if there is no objection, I will change subprocess to emit EncodingWarning, like open(), in Python 3.11.
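In the meantime, pipe decoding can already be made independent of the locale by passing encoding explicitly to subprocess. A minimal sketch (the helper name is made up; PYTHONIOENCODING is set only so that the child Python writes UTF-8 to its pipe):

```python
import os
import subprocess
import sys


def run_python_utf8(code):
    """Run code in a child Python and decode its stdout as UTF-8."""
    # Make the child encode its stdout as UTF-8 regardless of the locale.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        encoding="utf-8",  # explicit pipe encoding instead of the default
        env=env,
        check=True,
    )
    return proc.stdout


result = run_python_utf8(r"print('caf\u00e9')")
print(ascii(result))
```

Without the encoding argument, text=True would decode the pipe with the locale encoding, which is the case the proposed warning is meant to surface.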