PEP 686: Make UTF-8 mode default (Round 2)

The utf-8-sig codec doesn’t solve silent data corruption, because it also accepts UTF-8 without a BOM.

>>> s="こんにちは".encode('utf-8').decode('latin1')
>>> s  # artificial "correct" string.
'ã\x81\x93ã\x82\x93ã\x81«ã\x81¡ã\x81¯'

>>> b = s.encode('latin1')
>>> b  # artificial "correct" latin1 bytes.
b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'
>>> b.decode('utf-8') == s  # silent data corruption
False
>>> b.decode('utf-8-sig') == s  # still silent data corruption
False
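
For reference (my own illustration, not from the PEP): when decoding, utf-8-sig differs from utf-8 only in stripping a leading BOM if one is present. BOM-less input is decoded exactly like plain UTF-8, so there is no error to catch:

>>> import codecs
>>> (codecs.BOM_UTF8 + b).decode('utf-8-sig') == b.decode('utf-8')
True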

UTF-8 with a BOM helped some Microsoft applications choose an encoding: when a text file starts with a BOM, it is UTF-8(-sig); otherwise, it is a legacy encoding.
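
As a rough sketch (guess_encoding is a hypothetical helper, not code from any Microsoft product), the heuristic looks roughly like this: sniff the first bytes for a UTF-8 BOM, and otherwise fall back to the legacy ANSI code page.

import codecs
import locale

def guess_encoding(path):
    # Hypothetical sketch of the BOM-sniffing heuristic described above.
    with open(path, 'rb') as f:
        head = f.read(len(codecs.BOM_UTF8))
    if head == codecs.BOM_UTF8:
        return 'utf-8-sig'  # BOM present: treat the file as UTF-8
    # No BOM: assume the legacy code page (e.g. cp1252, cp932, ...)
    return locale.getpreferredencoding(False)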

That heuristic worked well when UTF-8 files without a BOM were a minority on Windows.
But it no longer works well: Microsoft now uses UTF-8 without a BOM by default.