Python uses the process active codepage (i.e. GetACP()) as the preferred encoding. This is normally the legacy codepage of the system locale. The default locale in C, i.e. setlocale(category, ""), is instead based on the user locale. Recently, the CRT has staked out a middle-ground position. If the process active codepage is UTF-8, setlocale(category, "") uses UTF-8 instead of the legacy codepage from the user locale. For example, comparing a regular build with a “python.utf8” build that runs with UTF-8 as the active codepage:
C:\>python -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.1252
C:\>python.utf8 -c "from locale import *; print(setlocale(LC_CTYPE, ''))"
English_Canada.utf8
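To compare these values from inside a running interpreter, something like the following sketch may help. The function name report_encodings is made up for illustration, and GetACP() is only called on Windows (via ctypes), so on other platforms the "acp" entry stays None.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings discussed above for the current process."""
    info = {
        # The preferred encoding as reported by the locale module.
        "preferred": locale.getpreferredencoding(False),
        # The current LC_CTYPE setting, queried without changing it.
        "lc_ctype": locale.setlocale(locale.LC_CTYPE),
        # The process active codepage; filled in on Windows only.
        "acp": None,
    }
    if sys.platform == "win32":
        import ctypes

        info["acp"] = ctypes.windll.kernel32.GetACP()
    return info


if __name__ == "__main__":
    for key, value in report_encodings().items():
        print(f"{key}: {value}")
```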
Switching the whole process to UTF-8 (not just Python code) during testing may help to highlight code that makes fragile assumptions. Code that assumes that a given default encoding (e.g. the process, locale, or preferred encoding) is the legacy codepage from the system or user locale will hopefully fail hard on an encoding error and get fixed. However, to be fair, in some cases switching to UTF-8 could also mask programming errors. This applies in particular to extension modules and linked DLLs that call MBS ‘ANSI’ functions such as CreateFileA with strings that aren’t encoded in the active codepage. Python core is guilty of this mistake: _winapi.CreateFile calls CreateFileA with a UTF-8 string.
If the active codepage is a legacy encoding, Windows will happily ‘decode’ the UTF-8 string as mojibake, but this at least can potentially get flagged as an error. If the active codepage is UTF-8, however, this bad code won’t raise any red flags because it just happens to be using the right encoding.
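The masking effect can be imitated in pure Python. This is only a model of what an ‘ANSI’ function does with its byte-string argument (decode it with the active codepage), not a call into the Windows API; the helper name ansi_to_unicode and the file name are made up, and cp1252 stands in for whatever the legacy codepage happens to be.

```python
def ansi_to_unicode(raw: bytes, acp_encoding: str) -> str:
    """Model an 'ANSI' API: interpret the bytes using the active codepage."""
    return raw.decode(acp_encoding)


name = "café.txt"
raw = name.encode("utf-8")  # what the buggy caller passes to the 'A' function

legacy = ansi_to_unicode(raw, "cp1252")  # active codepage is a legacy one
utf8 = ansi_to_unicode(raw, "utf-8")     # active codepage is UTF-8

print(legacy)  # cafÃ©.txt  (mojibake, so the bug is at least visible)
print(utf8)    # café.txt   (looks correct, so the bug goes unnoticed)
```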
Suggestion
We could replace “python[w].exe” in an installation with a launcher that runs the base executable. This generalizes how venv virtual environments are currently implemented. The base executable would be distributed as both UTF-8 and locale variants, maybe named “python[w].utf8.exe” and “python[w].locale.exe”. The UTF-8 version would set the “activeCodePage” to UTF-8 in the embedded manifest. The launcher would detect UTF-8 mode when it’s enabled by “-X utf8”, PYTHONUTF8, or a new “UTF8” registry value (not set by default). The PYTHONUTF8 environment variable would take precedence over the “UTF8” registry value. The system “py” launcher would also be updated to support this scheme and thus bypass the middle-man launcher.
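As a rough sketch of the launcher's decision, assuming the command line wins over the environment (only the environment-over-registry order is stated above) and that the “UTF8” registry value has already been read into a plain Python value. Both function names are hypothetical, and parsing is simplified to the "0"/"1" forms.

```python
def launcher_utf8_mode(x_options, environ, registry_value=None):
    """Return True/False if UTF-8 mode is explicitly set, else None.

    x_options: values given to -X, e.g. ["utf8"] or ["utf8=0"].
    environ: a mapping such as os.environ.
    registry_value: the hypothetical "UTF8" registry value, or None if unset.
    """
    # 1. Command line (assumed to have the highest precedence).
    for opt in reversed(x_options):
        if opt in ("utf8", "utf8=1"):
            return True
        if opt == "utf8=0":
            return False
    # 2. PYTHONUTF8 takes precedence over the registry value.
    env = environ.get("PYTHONUTF8")
    if env == "1":
        return True
    if env == "0":
        return False
    # 3. Registry value, which is not set by default.
    if registry_value is not None:
        return bool(registry_value)
    return None


def choose_executable(mode, windowed=False):
    """Pick the base executable variant that the launcher would run."""
    stem = "pythonw" if windowed else "python"
    variant = "utf8" if mode else "locale"
    return f"{stem}.{variant}.exe"


print(choose_executable(launcher_utf8_mode(["utf8"], {})))  # python.utf8.exe
print(choose_executable(launcher_utf8_mode([], {})))        # python.locale.exe
```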
Also, at startup in the base process, if GetACP() returns CP_UTF8 (65001) and UTF-8 mode isn’t explicitly disabled by the environment/registry, then UTF-8 mode would be enabled automatically. Thus UTF-8 mode is automatically set, if it’s not explicitly disabled, when either “python[w].utf8.exe” is run directly or when the system locale is configured to use UTF-8. This supports any code that special-cases UTF-8 mode, such as setting the preferred encoding value to “utf-8” instead of “cp65001”.
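The startup rule amounts to a small truth table. In this sketch, "explicit" is True or False when UTF-8 mode was set or disabled by an option, the environment, or the registry, and None when nothing was specified; the function names are illustrative only.

```python
CP_UTF8 = 65001


def startup_utf8_mode(explicit, acp):
    """Decide UTF-8 mode in the base process from the explicit setting and GetACP()."""
    if explicit is not None:
        return explicit        # an explicit enable/disable always wins
    return acp == CP_UTF8      # otherwise follow the active codepage


def preferred_encoding_name(utf8_mode, acp):
    """Example of code that special-cases UTF-8 mode."""
    return "utf-8" if utf8_mode else f"cp{acp}"


for explicit, acp in [(None, 65001), (False, 65001), (None, 1252), (True, 1252)]:
    mode = startup_utf8_mode(explicit, acp)
    print(explicit, acp, mode, preferred_encoding_name(mode, acp))
```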
New “system_ansi”, “system_oem”, “user_ansi”, and “user_oem” encodings could be added that use the system locale and user locale legacy codepages, which can be queried directly via GetLocaleInfoEx. “mbcs” (alias “ansi”, “dbcs”) and “oem” would continue to use the process active codepages from GetACP() and GetOEMCP().
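One way the new encodings might be prototyped is with a codec search function that resolves each name to an existing codec. This is a sketch under assumptions: the names make_search and query_codepage are invented here, the ctypes declaration of GetLocaleInfoEx is my own (with LOCALE_RETURN_NUMBER the buffer receives a DWORD), and it assumes Python has a "cpNNN" codec for whatever codepage is reported.

```python
import codecs
import sys

LOCALE_IDEFAULTCODEPAGE = 0x000B      # OEM codepage of a locale
LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # ANSI codepage of a locale
LOCALE_RETURN_NUMBER = 0x20000000     # return the value as a DWORD
LOCALE_NAME_USER_DEFAULT = None
LOCALE_NAME_SYSTEM_DEFAULT = "!x-sys-default-locale"

# Proposed encoding name -> (locale name, LCTYPE) to query.
_QUERIES = {
    "system_ansi": (LOCALE_NAME_SYSTEM_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE),
    "system_oem": (LOCALE_NAME_SYSTEM_DEFAULT, LOCALE_IDEFAULTCODEPAGE),
    "user_ansi": (LOCALE_NAME_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE),
    "user_oem": (LOCALE_NAME_USER_DEFAULT, LOCALE_IDEFAULTCODEPAGE),
}


def query_codepage(locale_name, lctype):
    """Ask GetLocaleInfoEx for a locale's legacy codepage (Windows only)."""
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # The buffer is declared as a DWORD pointer because of LOCALE_RETURN_NUMBER.
    kernel32.GetLocaleInfoEx.argtypes = (
        wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), ctypes.c_int)
    kernel32.GetLocaleInfoEx.restype = ctypes.c_int
    value = wintypes.DWORD()
    # The buffer size is given in WCHARs, not bytes.
    size = ctypes.sizeof(value) // ctypes.sizeof(ctypes.c_wchar)
    if not kernel32.GetLocaleInfoEx(locale_name, lctype | LOCALE_RETURN_NUMBER,
                                    ctypes.byref(value), size):
        raise ctypes.WinError(ctypes.get_last_error())
    return value.value


def make_search(query=query_codepage):
    """Build a codec search function for the proposed encoding names."""
    def search(name):
        params = _QUERIES.get(name)
        if params is None:
            return None  # not one of ours; let other search functions try
        codepage = query(*params)
        target = "utf-8" if codepage == 65001 else f"cp{codepage}"
        return codecs.lookup(target)
    return search


if sys.platform == "win32":
    codecs.register(make_search())
```

On Windows, "text".encode("user_ansi") would then use the user locale's legacy codepage regardless of what GetACP() returns for the process.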
If UTF-8 mode eventually becomes the default (maybe after Windows 8.1 support ends on 2023-01-10 – Python 3.12?), then the only required change would be to set the registry “UTF8” value to true for a particular installation.