I have been working on Windows recently, parsing files that contain ASCII-encoded text.
I was lazy and didn’t specify an encoding with open and, long story short, I found that the cp1252 default codec on Windows is much slower at parsing ASCII data than the ascii codec (which is not that surprising) but also slower than the utf8 codec (which is perhaps a little surprising).
import tempfile
import timeit

f = tempfile.NamedTemporaryFile(delete=False, encoding='ascii', mode='wt')
for _ in range(2048):
    for d in range(10):
        f.write(str(d) * 128 + '\n')
f.close()
PATH = f.name

# read the file a few times to warm up caches and make things fairer
for _ in range(100):
    open(PATH, 'r').read()

# prove the parsed contents are the same no matter what the encoding used is
assert [l for l in open(PATH, 'rt', encoding='ascii')] == [l for l in open(PATH, 'rt', encoding='utf8')] == [l for l in open(PATH, 'rt', encoding='cp1252')]

def readlines_ascii():
    with open(PATH, 'rt', encoding='ascii') as f:
        for l in f:
            pass

def readlines_utf8():
    with open(PATH, 'rt', encoding='utf8') as f:
        for l in f:
            pass

def readlines_cp1252():
    with open(PATH, 'rt', encoding='cp1252') as f:
        for l in f:
            pass

print('Read as ASCII:')
print(timeit.timeit(readlines_ascii, number=10000))
print('Read as UTF-8')
print(timeit.timeit(readlines_utf8, number=10000))
print('Read as CP-1252')
print(timeit.timeit(readlines_cp1252, number=10000))
It’s my understanding that cp1252 is a superset of ascii and that all 7-bit values decode identically in the two encodings. It seems to me, without much thinking, that if utf8 can be about as fast as ascii when parsing ascii-only data, then cp1252 could be too. Why isn’t it?
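To make that concrete, here is a minimal sketch. The first part checks that the three codecs agree on every 7-bit byte value. The second part is a pure-Python illustration of the kind of shortcut I have in mind: `decode_cp1252_fast` is a hypothetical helper I made up for this post, not an existing API, and a real change would presumably live inside the codec rather than in user code.

```python
# All 128 seven-bit byte values.
ascii_bytes = bytes(range(128))

# The three codecs should agree on every one of them.
same_for_7bit = (
    ascii_bytes.decode('ascii')
    == ascii_bytes.decode('cp1252')
    == ascii_bytes.decode('utf8')
)
print(same_for_7bit)


def decode_cp1252_fast(data: bytes) -> str:
    """Hypothetical helper: decode cp1252, taking the ascii path when possible."""
    # bytes.isascii() (Python 3.7+) is True when every byte is below 0x80.
    if data.isascii():
        return data.decode('ascii')
    # Anything with a high byte goes through the normal cp1252 decoder.
    return data.decode('cp1252')
```

I have not measured whether this particular helper is faster in practice; it only shows that the two code paths give the same result for ascii-only input.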
Yes, of course, you should explicitly specify the correct encoding when you open your files. But I bet there is an enormous quantity of Python code running on Windows that does not, and that largely deals with ascii data, in which case such an improvement would be generally useful.
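For anyone who wants to check what they get when they leave the encoding out, this is a small sketch. As far as I know, `locale.getpreferredencoding(False)` reports the encoding that open falls back to when none is given; the value depends on the platform and on settings such as UTF-8 mode, so I am not showing an expected output. The temporary file and the `demo_path` name are only there for illustration.

```python
import locale
import os
import tempfile

# The encoding text-mode open() is expected to use when none is specified.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Passing encoding= explicitly avoids depending on that default.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wt', encoding='ascii') as out:
    out.write('0123456789\n')
with open(demo_path, 'rt', encoding='ascii') as inp:
    lines = inp.readlines()
os.remove(demo_path)
```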
If it’s possible, I can raise this as a feature request on GitHub and possibly have a go at implementing it myself, but I am not sure whether it is possible, so I opened this thread first.