Use UTF-8 as default text file encoding?

How about changing the default encoding for open() to UTF-8, starting with Python 3.9 (or even 3.8)?

  • People who program in Python use UTF-8 in most cases.
  • Since Windows 10 version 1903, Notepad uses UTF-8 (without BOM) as its default encoding.

Using a legacy encoding by default is confusing. For example: https://bugs.python.org/issue37111
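
To make the current behaviour concrete, here is a minimal sketch of which encoding open() falls back to on the Python versions discussed here, and how an explicit encoding avoids the dependency. The file name example.txt is made up for illustration.

```python
import locale
import os
import tempfile

# With no encoding argument, open() uses the locale's preferred encoding
# (unless UTF-8 mode is enabled). On Windows this is usually an ANSI code page.
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding:", default_encoding)

# Passing encoding explicitly makes the result independent of the platform.
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write("café")
with open(path, encoding="utf-8") as f:
    text = f.read()
```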

We probably need to be careful, but I agree that utf-8 as default makes sense eventually.

The problem is more likely to be with reading files on Windows. I don’t know what proportion of tools will write text files in the default ANSI encoding, but we don’t want Python to start reporting “malformed UTF8” errors on such files.
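
A small sketch of the failure mode I mean, assuming the file was written in Windows-1252 (cp1252 is only an example code page here):

```python
# "é" in Windows-1252 is the single byte 0xE9, which is not valid UTF-8 on its own.
ansi_bytes = "café".encode("cp1252")

try:
    ansi_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("Cannot decode as UTF-8:", exc)

# Decoding with the code page the data was written in works.
recovered = ansi_bytes.decode("cp1252")
```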

For example, writing a CSV file from an older version of Excel with non-ASCII characters present. (New versions of Excel have a “UTF-8 encoded CSV” format, but I think that’s pretty new).

Also, it’s fine to say that new Windows 10 Notepad uses UTF-8, but we still support users on Windows 8 or Vista, so breaking things on those platforms must only be done with careful consideration.

I agree UTF-8 makes sense “eventually”, but I’m not sure we should rush.

PS I assume this isn’t just targeting Windows, and we’d also force UTF-8 on Linux systems which had non-UTF8 LC_CTYPE values (for example LC_CTYPE=es_ES.iso88591)?

Of course, they can use “mbcs” or “cpNNNN” explicitly.
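
Roughly like this. It is only a sketch: the file name legacy.csv and the choice of cp1252 are assumptions for illustration, and "mbcs" (the current ANSI code page) is available on Windows only.

```python
import os
import tempfile

# Name the legacy code page explicitly instead of relying on the default.
# On Windows, encoding="mbcs" would select the current ANSI code page.
legacy_encoding = "cp1252"

path = os.path.join(tempfile.mkdtemp(), "legacy.csv")  # hypothetical file
with open(path, "w", encoding=legacy_encoding) as f:
    f.write("name\nJosé\n")

with open(path, encoding=legacy_encoding) as f:
    rows = f.read().splitlines()
```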

We can agree on changing the default eventually. So the question is: when does the inconvenience of using a legacy encoding by default become larger than the inconvenience of using UTF-8?

I think it’s now. And the inconvenience of using the code page will grow quickly, because MS will use cp65001 more often.

Currently, Python’s default encoding changes when the user changes the language setting.
But by the time Python 3.9 is released, the code page may also depend on how Python is started or how it is installed.

So I think it’s time to show a warning when people use the default encoding and it is not UTF-8, like “Python 3.9 will use UTF-8 as the default encoding for text files. Use ‘mbcs’ if you want to use the current code page”.
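
As a rough illustration of the idea, not how it would be implemented in CPython: open_text below is a hypothetical wrapper, and the warning text and category are assumptions.

```python
import locale
import os
import tempfile
import warnings


def open_text(path, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open() that warns on an implicit non-UTF-8 default."""
    if encoding is None and "b" not in mode:
        default = locale.getpreferredencoding(False)
        # Normalize spellings such as "UTF-8", "utf_8" and "utf8".
        if default.lower().replace("-", "").replace("_", "") != "utf8":
            warnings.warn(
                f"No encoding given; the locale default {default!r} is not UTF-8. "
                "Pass encoding= explicitly.",
                DeprecationWarning,
                stacklevel=2,
            )
    return open(path, mode, encoding=encoding, **kwargs)


# An explicit encoding never triggers the warning.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # hypothetical file
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with open_text(path, "w", encoding="utf-8") as f:
        f.write("héllo")
    with open_text(path, encoding="utf-8") as f:
        written = f.read()
```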

If we decide not to change it in 3.9, we can just change the version in the warning message from 3.9 to 3.10.

I think so. It’s a common mistake to assume the default encoding is UTF-8.
There are some packages on PyPI that do long_description=open("README.md").read() while README.md is UTF-8.
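
A sketch of the safer pattern for that case. The README content is a stand-in written to a temporary directory just so the example runs on its own.

```python
import tempfile
from pathlib import Path

# Stand-in for a project's README.md (hypothetical content), saved as UTF-8.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_text("# Démo\n", encoding="utf-8")

# Fragile: open(readme).read() decodes with the locale encoding and can fail
# or produce mojibake on a non-UTF-8 system.
# Robust: name the encoding the file was actually saved in.
with open(readme, encoding="utf-8") as f:
    long_description = f.read()
```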

Not using UTF-8 by default is a big pitfall even now, and it will get bigger as most people start using UTF-8 even on Windows.

So how does this link to PEP 540, which says that utf-8 mode will “use the utf-8 encoding, regardless of the locale currently set by the current platform”, but that utf-8 mode is off by default? It seems as if this proposal is more or less saying that on Unix, Python should set utf-8 mode on by default. (At least in a broad sense, the details aren’t exactly the same).
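
For reference, a minimal sketch of how to see whether utf-8 mode is active in a given interpreter, and of turning it on for a child process with -X utf8 (setting PYTHONUTF8=1 in the environment is the other documented switch):

```python
import subprocess
import sys

# sys.flags.utf8_mode is 1 when UTF-8 mode (PEP 540) is active.
utf8_mode_enabled = bool(sys.flags.utf8_mode)
print("UTF-8 mode enabled here:", utf8_mode_enabled)

# Start a child interpreter with UTF-8 mode switched on explicitly.
child = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", "import sys; print(sys.flags.utf8_mode)"],
    capture_output=True,
    text=True,
    check=True,
)
child_flag = child.stdout.strip()
```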

While I understand the principle here, and I do believe that “UTF8 everywhere” is a good model, I’m not sure it’s right for Python to enforce the principle. At a minimum, I think it needs a PEP - after all, we have PEPs 528, 529, 538 and 540, and I don’t think this proposal is any less controversial than those.

I wonder if there are still systems like that. I think Windows is the main stake here.

I wonder if there are still systems like that.

I have no idea :slight_smile: But I found that example in a Stack Overflow question (I didn’t check the date on it though).

I think Windows is the main stake here.

PEP 540 suggests to me that Unix isn’t quite as “UTF8 everywhere” as we’d like, but maybe I’m misinterpreting it.

It’s most definitely not (I’ve dealt with a fair share of non-UTF8 servers). A lot of software already assumes UTF-8 anyway. I’ve lost count of how many times I needed to add encoding='utf8' to a setup.py because it reads a UTF-8 (non-ASCII) README.

I would very much welcome UTF-8 being the default myself, even on non-UTF8 machines. That’s only one data point though, I can’t say about others.

Addendum: There are still cases where one may want to access the platform encoding, so encoding=None should continue to work as it does now.
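
Whatever happens to the default, the platform encoding can also be requested by name. A minimal sketch, with native.txt as a made-up file name:

```python
import locale
import os
import tempfile

# Ask for the locale's encoding by name instead of relying on the default.
platform_encoding = locale.getpreferredencoding(False)

path = os.path.join(tempfile.mkdtemp(), "native.txt")  # hypothetical file
with open(path, "w", encoding=platform_encoding) as f:
    f.write("plain ASCII text")
with open(path, encoding=platform_encoding) as f:
    content = f.read()
```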

Personally, I would too. But my concerns are more about how such a change would affect all the people who use Python but don’t even know what an encoding is, let alone be able to deliberately choose UTF-8 in tools that don’t override the platform default.

That’s why I think this is worth a PEP - to allow such users to be properly represented, rather than needing me to argue a position that would actually be less beneficial for me personally :slight_smile:

I completely agree. My comment was mainly to present an anecdote on the non-UTF8 Unix machine side, and to express that more points of view are needed.

I kind of fear that nothing (PEP or not) would really help represent such users. If they don’t (need to) know much about encoding—considering the Python 3 switch would’ve tripped them up on this—they could be behind too closed a door to be reached by any discussion. The discussion should be had nonetheless though.

Thanks for your suggestion. I wrote an initial version of a PEP.

After it is merged, I will create a new topic for the PEP.

The newer issue is more about things like containers, where no locale is installed or the default locale doesn’t use UTF-8. PEP 538 and PEP 540 are designed to “enforce” UTF-8 in this case. UTF-8 mode (PEP 540) is disabled by default, but it is enabled automatically in such cases: when the LC_CTYPE locale is “C” or “POSIX”.

There’s no new topic yet that I can see, and discussions are starting against the PR. When you create the topic, can you move those discussions onto Discourse, as they are too hard to follow as review comments :frowning:

Sorry, I’m still editing my PEP, especially to address this comment.

No problem, I just noticed that @vstinner was making comments on the PR which seemed worthy of wider discussion, and I didn’t want them to get lost.

I created a new topic for the PEP: