PEP-597: Emit a Warning when encoding is omitted

This is a draft of the third version of PEP 597.

In this version, I propose just adding an option to raise a warning.

I am still deciding whether to introduce a new option to opt in to this warning or to just use dev mode.


Abstract

This PEP proposes:

  • TextIOWrapper raises a PendingDeprecationWarning when the
    encoding option is not specified and dev mode is enabled.

  • Add an encoding="locale" option to TextIOWrapper. It behaves
    like encoding=None but doesn’t raise a warning.

Motivation

People assume the default encoding is UTF-8

Developers using macOS or Linux may forget that the default encoding
is not always UTF-8.

For example, long_description = open("README.md").read() in
setup.py is a common mistake. Many Windows users cannot install
the package if there is at least one emoji or any other non-ASCII
character in the README.md file.
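
A minimal sketch of this failure mode. Here encoding="ascii" is an assumption that stands in for a platform whose locale encoding is ASCII, and the temporary README.md is purely illustrative:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    readme = Path(tmp) / "README.md"
    readme.write_text("My package \N{ROCKET}\n", encoding="utf-8")

    # encoding="ascii" simulates open(readme) under an ASCII locale.
    try:
        with open(readme, encoding="ascii") as f:
            f.read()
        ascii_read_failed = False
    except UnicodeDecodeError:
        ascii_read_failed = True

    # Passing the encoding explicitly works regardless of the locale.
    with open(readme, encoding="utf-8") as f:
        long_description = f.read()
```

The explicit encoding="utf-8" is what setup.py should pass so that the result does not depend on the user's locale.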

I found 489 packages that use non-ASCII characters in their README,
and 82 of them cannot be installed from the source package
when the locale encoding is ASCII [#]_.

… [#] GitHub - methane/pep597-pypi-ascii

Another example is logging.basicConfig(filename="log.txt").
Some people expect UTF-8 to be used by default, but the locale encoding
is actually used. [#]_

… [#] Issue 37111: Logging - Inconsistent behaviour when handling unicode - Python tracker

Even Python experts assume that the default encoding is UTF-8.
This assumption creates bugs that happen only on Windows. See [#]_ [#]_.

… [#] Use utf-8 to read README by methane · Pull Request #682 · pypa/packaging.python.org · GitHub
… [#] Issue 33684: parse failed for mutibytes characters, encode will show in \xxx - Python tracker

Raising a warning when the encoding option is omitted will
help to find such mistakes.

Prepare to change the default encoding to UTF-8

We chose to use the locale encoding as the default text encoding
in Python 3.0. But UTF-8 has been adopted very widely since then.

We might change the default text encoding to UTF-8 in the future.
But this change will affect many applications and libraries.
Many DeprecationWarnings would be raised if we started raising
the warning by default. It would be too noisy.

While this PEP doesn’t cover the change, it will help to reduce
the number of DeprecationWarnings in the future.

Specification

Raising a PendingDeprecationWarning

TextIOWrapper raises a PendingDeprecationWarning when the
encoding option is omitted and dev mode is enabled.

encoding="locale" option

When encoding="locale" is passed to TextIOWrapper, it
behaves the same as encoding=None. In detail, the encoding is
chosen by:

  1. os.device_encoding(buffer.fileno())
  2. locale.getpreferredencoding(False)

This option can be used to suppress the PendingDeprecationWarning.
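
The two-step lookup above can be sketched as follows. resolve_locale_encoding is a hypothetical helper name used only for illustration; it is not part of the proposal, and the real TextIOWrapper logic may differ in detail:

```python
import io
import locale
import os


def resolve_locale_encoding(buffer):
    """Hypothetical helper mirroring the two-step lookup listed above."""
    try:
        fd = buffer.fileno()
    except (AttributeError, OSError):
        # In-memory buffers have no file descriptor
        # (io.UnsupportedOperation is an OSError subclass).
        fd = None
    # Step 1: the encoding of the attached terminal, if there is one.
    encoding = os.device_encoding(fd) if fd is not None else None
    # Step 2: fall back to the locale's preferred encoding.
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return encoding


in_memory = resolve_locale_encoding(io.BytesIO(b"data"))
```

For an in-memory buffer the first step yields nothing, so the result is whatever locale.getpreferredencoding(False) reports on the current machine.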

io.text_encoding

TextIOWrapper is used indirectly in most cases. For example, open() and pathlib.Path.read_text() use it. Warning about these
functions themselves doesn’t make sense. The callers of these functions should be warned instead.

io.text_encoding(encoding, stacklevel=1) is a helper function for this purpose.
A pure Python implementation would look like this::

   import sys

   def text_encoding(encoding, stacklevel=1):
       """
       Helper function to choose the text encoding.

       When encoding is not None, just return it.
       Otherwise, return the default text encoding ("locale" for now),
       and raise a PendingDeprecationWarning in dev mode.

       This function can be used in APIs having encoding=None option.
       But please consider encoding="utf-8" for new APIs.
       """
       if encoding is None:
           if sys.flags.dev_mode:
               import warnings
               warnings.warn(
                       "'encoding' option is not specified. The default encoding "
                       "will be changed to 'utf-8' in the future",
                       PendingDeprecationWarning, stacklevel + 2)
           encoding = "locale"
       return encoding

pathlib.Path.read_text() can use this function like this::

   def read_text(self, encoding=None, errors=None):
       """
       Open the file in text mode, read it, and close the file.
       """
       encoding = io.text_encoding(encoding)
       with self.open(mode='r', encoding=encoding, errors=errors) as f:
           return f.read()
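
A self-contained sketch of how a third-party API might use the helper and how the warning surfaces. The helper is re-declared here in simplified form; the dev_mode parameter is a hypothetical stand-in for sys.flags.dev_mode, and read_config is an invented example API:

```python
import warnings


def text_encoding(encoding, stacklevel=1, *, dev_mode=True):
    # dev_mode is a hypothetical stand-in for sys.flags.dev_mode.
    if encoding is None:
        if dev_mode:
            warnings.warn(
                "'encoding' option is not specified. The default encoding "
                "will be changed to 'utf-8' in the future",
                PendingDeprecationWarning, stacklevel + 2)
        encoding = "locale"
    return encoding


def read_config(path, encoding=None):
    # Invented library API: resolve the encoding before opening anything.
    return text_encoding(encoding)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    omitted = read_config("app.cfg")                     # warns
    explicit = read_config("app.cfg", encoding="utf-8")  # does not warn
```

Only the call that omits encoding records a warning, and stacklevel + 2 is meant to attribute it to the caller of read_config rather than to the helper.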

subprocess module

Although the subprocess module uses TextIOWrapper, it doesn’t raise
PendingDeprecationWarning. It uses the “locale” encoding
by default.

Rationale

“locale” is not a codec alias

We don’t add “locale” as a codec alias because the locale can be
changed at runtime.

Additionally, TextIOWrapper checks os.device_encoding()
when encoding=None. This behavior cannot be implemented in
a codec.

subprocess module doesn’t warn

The default encoding for PIPE is related to the stdio encoding.
It should be discussed later.

Reference Implementation

Copyright

This document has been placed in the public domain.


If we add a dedicated option like PYTHONWARNTEXTENCODING, users need to use it with an option like -Wd because DeprecationWarning is suppressed by default.
So, enabling this warning with dev mode looks easy and simple for users.

On the other hand, if some users don’t like this warning but want to use dev mode, the dedicated option is better.


I like this proposal, and I think it should just be a regular deprecation warning (no extra options for it). Great job :+1:

Maybe we should also emphasise that the plan is to eventually bring back a default value of (presumably) UTF-8. But we need to deprecate the old default first because of the high risk of data loss when the change happens.


I am afraid that it would produce too many noisy warnings. How about this plan?

  • Python 3.9a6 – Implement the PEP with regular warning (no option).
  • Python 3.9bN – We may remove the warning from the 3.9 branch, depending on feedback.
  • Python 3.10~ – DeprecationWarning

In this plan, Python 3.9 may not raise DeprecationWarning. But Python 3.9 supports encoding="locale" option and io.text_encoding(). Users can use them in Python 3.9+ code in the future.

By the way, PendingDeprecationWarning looks better than DeprecationWarning at the moment. We don’t have an actual plan to change the default encoding yet.


This has the potential to be another case where working code in libraries generates warnings that end users of those libraries end up needing to deal with. And libraries that still support Python 2 will have to switch to io.open, as open doesn’t have an encoding argument in Python 2.

I just did a quick check of pip, and we have a few such cases. And our vendored dependencies have quite a lot as well. I doubt all of those will get fixed for 3.9, so if I’m understanding correctly, pip will be flagging this warning to 3.9 users.

I don’t know the best answer here, but please be aware of the “end user who has no real control over the libraries used in apps they need” issue when deciding how to make this transition.

I’m not against this idea in principle, I’ve just had bad experience in the past of being stuck with annoying warnings for extended periods.


I can never remember whether these warnings are on or off by default. Off by default is fine, and turned on with other deprecation warnings.

Chances are end users will be impacted negatively by this (eventually) if their dependencies haven’t been updated, so it’s probably not terrible to warn them too.

I’m also a big believer in being noisier during prereleases. So let’s go on by default as soon as it lands, then turn them off for the final release (maybe this should just be the overall policy?)


DeprecationWarning and PendingDeprecationWarning are suppressed by default.
So end users will not see the warnings.
But it makes sense to me. We should wait to enable the warning by default until we fix all warnings in the stdlib, the tests, and the bundled pip.
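
A small sketch of how the warning filter decides whether such a warning is seen. The filters are installed explicitly here so the result does not depend on how the interpreter was started; the "ignore" action is an assumption that mimics a default configuration, and "default" roughly corresponds to running with -Wd:

```python
import warnings


def count_recorded(action):
    """Emit one PendingDeprecationWarning under the given filter action."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action, PendingDeprecationWarning)
        warnings.warn(
            f"'encoding' option is not specified (filter: {action})",
            PendingDeprecationWarning)
    return len(caught)


hidden = count_recorded("ignore")    # suppressed, as for end users
shown = count_recorded("default")    # visible, as with -Wd
```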


I updated the draft to exclude the subprocess module.

I found that Python’s test suite uses subprocess heavily to run Python in child processes. The locale encoding is used here for now because Python uses the locale encoding for stdio.

Should we change the stdio encoding when we change the default text encoding? I don’t want to discuss it for now. That’s why I exclude PIPE encoding in this PEP.


Thanks, I was in the same situation as @steve.dower - I can never remember whether these are off or on by default :slightly_smiling_face:


The locale encoding is only used for process pipes because that’s the current TextIOWrapper default. So when you change one it should change the other.

If subprocess keeps the locale default then you won’t be able to communicate with subprocesses of itself (except in bytes mode) unless you also avoid changing TextIOWrapper. Since the latter is the point, I think you need to change both.


This PEP provides a way to suppress the warning without breaking backward compatibility.
The subprocess module can use encoding = encoding or "locale" and pass the result to TextIOWrapper.

sys.stdin, sys.stdout, and sys.stderr do not use the default text encoding either. (ref)

So if subprocess changes its default encoding, we need to change the stdio encoding too, or we cannot communicate with a child Python process by default.

I am not against changing the default PIPE encoding. But I want to postpone warning about the PIPE encoding because we cannot yet provide a recommended way to communicate with a child process in text mode.

subprocess.run([sys.executable, script], encoding="utf-8")   # (1)
subprocess.run([sys.executable, script], encoding="locale")  # (2)
subprocess.run([sys.executable, script], encoding=sys.stdin.encoding)  # (3)
  • (1) may not work now, because the child Python process will use the locale encoding for stdio.
  • (2) might not work in Python 4.0 if Python changes the stdio encoding to UTF-8.
  • (3) may not work when the current process doesn’t have a valid stdin.

subprocess will raise a warning after we decide how future Python versions will change the default encodings.

On the other hand, warning when users open text files (JSON, YAML, TOML, Markdown, reST, etc.) without an encoding is worthwhile even though we have not yet decided to change the default text encoding.


Okay, that’s fair enough.

I wonder if we could safely add PYTHONIOENCODING to the environment for subprocesses, to help close the divide for at least our own processes? That seems like something that could go into 3.9 anyway in Popen.
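
A sketch of what that could look like from the caller's side today, assuming the child is also a Python process. Nothing here is a proposed Popen change; the environment variable is set by hand and the encoding is passed explicitly so both ends agree:

```python
import os
import subprocess
import sys

# Ask the child Python process to use UTF-8 for its stdio.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", "print('caf\\u00e9')"],
    env=env,
    encoding="utf-8",      # decode the pipe with the same encoding
    capture_output=True,
    check=True,
)
child_output = result.stdout.strip()
```

Without PYTHONIOENCODING, the child would encode its output with its own stdio encoding, which is the mismatch described for case (1) above.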


I am worried that I cannot fix all the warnings in the stdlib and tests at once.
(Current progress: https://github.com/python/cpython/pull/19481/files)

So I don’t think I can enable it by default as soon as it lands.
I think we should enable the warning by default after 3.9b1 and before 3.10a1.


In some cases we may have to expose new encoding/errors parameters - anywhere we’re decoding from a user provided file that doesn’t have a clear spec. And those will likely need a default or a warning.

But it’s probably worth getting the PEP more widely reviewed and accepted before sinking too much effort in. Once the idea is approved, we can share the effort of fixing warnings.


Great idea! I’ll repeat what I said here: PEP 597: Use UTF-8 for default text file encoding


how about

# io.py
import enum

class _Locale(enum.Enum):
    LOCALE = enum.auto()

LOCALE = _Locale.LOCALE

in io.py

so you pass with open(fn, encoding=io.LOCALE) as f:

this way there’s no confusion with the codecs module, as I can see someone trying to run a py3.10 script and seeing LookupError: unknown encoding: locale and running codecs.register(...) to fix it

I think AttributeError: module 'io' has no attribute 'LOCALE' would be a clearer issue to investigate

and universal py3.6,7,8,9,10 would be easier:

with open(fn, encoding=getattr(io, "LOCALE", None)) as f:


I don’t want to use enum in io.py because io is a very low-level module. enum is a high-level, slow, and heavy module. Importing it makes Python startup slower. (Note that _io.open is implemented in C.)

Additionally, I don’t want to make the signature of open complicated. The encoding parameter should be Optional[str].

A string constant like LOCALE = "locale" in io.py should then be enough. There is no need for an enum.
Do you have any suggestions about the constant name and value? How about LOCALE_ENCODING = "__locale_encoding__"?
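
A minimal sketch of how version-portable code could consume such a constant. io.LOCALE_ENCODING is only the name floated in this thread (an assumption, not an existing attribute), so the lookup falls back to None, which is today's spelling of "use the locale encoding":

```python
import io
import tempfile
from pathlib import Path

# Hypothetical constant: use it if this Python provides it, else None.
LOCALE_ENCODING = getattr(io, "LOCALE_ENCODING", None)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    with open(path, "w", encoding=LOCALE_ENCODING) as f:
        f.write("hello")
    with open(path, encoding=LOCALE_ENCODING) as f:
        text = f.read()
```

Because the constant is a plain string (or None), the encoding parameter stays Optional[str] and no new type is needed.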


I like LOCALE_ENCODING = "__locale_encoding__" and I like LOCALE = "__locale__"


If you don’t want an enum you can use a singleton of a class:

# io.py
class _Locale:
    pass

LOCALE = _Locale()

def open(..., encoding: None | str | _Locale, ...) -> ...: ...

or a singleton class:

# io.py
class LOCALE:
    def __init__(self):
        raise Something("cannot be constructed directly")

def open(..., encoding: None | str | Type[LOCALE], ...) -> ...: ...

Even a singleton seems like over-engineering. Its only purpose is to get an AttributeError instead of a LookupError. A string constant is enough.
