The problem is more likely to be with reading files on Windows. I don't know what proportion of tools will write text files in the default ANSI encoding, but we don't want Python to start reporting "malformed UTF8" errors on such files.
For example, writing a CSV file from an older version of Excel with non-ASCII characters present. (New versions of Excel have a "UTF-8 encoded CSV" format, but I think that's pretty new.)
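To make the failure mode concrete, here is a minimal sketch of what happens when bytes written in an ANSI code page are decoded as UTF-8. The choice of cp1252 and the sample text are assumptions for illustration, and try_decode is a hypothetical helper, not anything from the proposal.

```python
# Sample CSV line as an older Windows tool might write it, assuming the
# ANSI code page is cp1252 (an assumption made for this illustration).
ansi_bytes = "café,naïve\n".encode("cp1252")


def try_decode(data, encoding):
    """Hypothetical helper: decode the bytes, or describe why decoding failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# Decoding with the encoding the file was written in round-trips cleanly.
as_cp1252 = try_decode(ansi_bytes, "cp1252")

# Decoding the same bytes as UTF-8 fails, because 0xE9 is not followed
# by valid UTF-8 continuation bytes.
as_utf8 = try_decode(ansi_bytes, "utf-8")

print(repr(as_cp1252))
print(as_utf8)
```

With a UTF-8 default, the second case is what a user reading such a file without an explicit encoding would hit.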
Also, it's fine to say that the new Windows 10 Notepad uses UTF-8, but we still support users on Windows 8 or Vista, so breaking things on those platforms must only be done with careful consideration.
I agree UTF-8 makes sense "eventually", but I'm not sure we should rush.
PS I assume this isn't just targeting Windows, and we'd also force UTF-8 on Linux systems which had non-UTF8 LC_CTYPE values (for example LC_CTYPE=es_ES.iso88591)?
Of course, they can use "mbcs" or "cpNNNN" explicitly.
We can agree on changing the default eventually. So the question is: when does the inconvenience of using a legacy encoding by default become larger than the inconvenience of using UTF-8?
I think it's now. And the inconvenience of using a code page will grow quickly, because MS will use cp65001 more often.
Currently, Python changes its encoding when the user changes the language setting.
But when Python 3.9 is released, the code page may be changed by how Python is started, or how Python is installed.
So I think it's time to show a warning when people use the default encoding and it is not UTF-8, like "Python 3.9 will use UTF-8 for default encoding of text files. Use 'mbcs' if you want to use current codepage".
If we decide not to change it in 3.9, we can just rewrite the warning message from 3.9 to 3.10.
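A rough sketch of what such a warning could look like, written as a wrapper rather than a change to open() itself. The names is_utf8, check_default_encoding and open_text are hypothetical, the choice of DeprecationWarning is an assumption, and the message wording is only loosely based on the one suggested above.

```python
import codecs
import locale
import warnings


def is_utf8(name):
    """Return True if `name` is an alias of the UTF-8 codec."""
    try:
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        return False


def check_default_encoding(encoding, default=None):
    """Warn when no encoding is given and the platform default is not UTF-8.

    Returns True if a warning was issued. The `default` parameter exists
    only so the check can be exercised without changing the real locale.
    """
    if encoding is not None:
        return False
    if default is None:
        default = locale.getpreferredencoding(False)
    if is_utf8(default):
        return False
    warnings.warn(
        "Python 3.9 will use UTF-8 for default encoding of text files "
        f"(current default: {default!r}). Pass encoding explicitly to keep "
        "the current behaviour.",
        DeprecationWarning,
        stacklevel=3,
    )
    return True


def open_text(path, mode="r", encoding=None, **kwargs):
    """Hypothetical stand-in for open() that runs the check for text modes."""
    if "b" not in mode:
        check_default_encoding(encoding)
    return open(path, mode, encoding=encoding, **kwargs)
```

Code that passes an encoding explicitly, or that runs where the default is already UTF-8, would see no warning in this sketch.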
I think so. It's a common mistake to assume the default encoding is "UTF-8".
There are some packages on PyPI which do long_description=open("README.md").read() while README.md is UTF-8.
Not using UTF-8 by default is a big pitfall even now, and it will be bigger when most people start using UTF-8 even on Windows.
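For the README case above, the pitfall goes away today if the encoding is passed explicitly. A minimal sketch, where read_long_description is a hypothetical helper name and the README is assumed to be UTF-8:

```python
from pathlib import Path


def read_long_description(readme="README.md"):
    """Read the README as UTF-8 regardless of the platform default encoding."""
    # Passing encoding="utf-8" avoids depending on the locale or code page
    # of the machine that happens to build or install the package.
    return Path(readme).read_text(encoding="utf-8")
```

In a setup.py this would be used as long_description=read_long_description(), in place of the bare open("README.md").read().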
So how does this link to PEP 540, which says that utf-8 mode will "use the utf-8 encoding, regardless of the locale currently set by the current platform", but that utf-8 mode is off by default? It seems as if this proposal is more or less saying that on Unix, Python should set utf-8 mode on by default. (At least in a broad sense, the details aren't exactly the same.)
While I understand the principle here, and I do believe that "UTF8 everywhere" is a good model, I'm not sure it's right for Python to enforce the principle. At a minimum, I think it needs a PEP - after all, we have PEPs 528, 529, 538 and 540, and I don't think this proposal is any less controversial than those.
It's most definitely not (I've dealt with a fair share of non-UTF8 servers). A lot of software also already assumes UTF-8 anyway. I've lost count of how many encoding='utf8' I needed to add to a setup.py because it's reading a UTF-8 (non-ASCII) README.
I would very much welcome UTF-8 being the default myself, even on non-UTF8 machines. That's only one data point though, I can't speak for others.
---
Addendum: There are still cases where one may want to access the platform encoding, though, i.e. encoding=None should continue to work as it does now.
Personally, I would too. But my concerns are more about how such a change would affect all the people who use Python but don't even know what an encoding is, let alone being able to deliberately choose to use UTF-8 in tools that don't override the platform default.
That's why I think this is worth a PEP - to allow such users to be properly represented, rather than needing me to argue a position that would actually be less beneficial for me personally.
I completely agree. My comment was mainly to present an anecdote on the non-UTF8 Unix machine side, and to express that more points of view are needed.
I kind of fear that nothing (PEP or not) would really help represent such users. If they don't (need to) know much about encoding - considering the Python 3 switch would've tripped them up on this - they could be behind too closed a door to be reached by any discussion. The discussion should be had nonetheless though.
The new issues are more things like containers where there is no locale installed or the default locale doesn't use UTF-8. PEP 538 and PEP 540 are designed to "enforce" UTF-8 for this case. The PEP 540 UTF-8 mode is disabled by default, but is enabled in such a case: when the LC_CTYPE locale is "C" or "POSIX".
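To see which behaviour a given interpreter ended up with, something like the following sketch can be run inside the container. describe_text_defaults is a hypothetical helper; sys.flags.utf8_mode is the flag added by PEP 540 in Python 3.7, and the mode can also be forced with -X utf8 or PYTHONUTF8=1.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the encodings this interpreter uses by default."""
    return {
        # True when the PEP 540 UTF-8 mode is active for this process.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding used when text files are opened without encoding=...
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Encoding of standard output, if it has one.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_text_defaults().items():
        print(f"{key}: {value}")
```

Comparing the output of a plain run with a run under LC_ALL=C, or with -X utf8, shows what the interpreter actually picked in each environment.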
There's no new topic yet that I can see, and discussions are starting against the PR. When you create the topic, can you move those discussions onto Discourse, as they are too hard to follow as review comments?