The cp1252 codec is significantly slower than ascii or utf8 when reading a file that only contains ASCII-encoded text; could it be improved?

I have been working on Windows recently parsing files that contain ASCII encoded text.

I was lazy and didn’t specify an encoding with open() and, long story short, I found that the cp1252 default codec on Windows is much slower at parsing ASCII data than the ascii codec (which is not that surprising), but also slower than the utf8 codec (which is perhaps a little surprising).

import tempfile
import timeit

f = tempfile.NamedTemporaryFile(delete=False, encoding='ascii', mode='wt')
for _ in range(2048):
    for d in range(10):
        f.write(str(d) * 128 + '\n')
f.close()

PATH = f.name

# read the file a few times to warm up caches and make things fairer
for _ in range(100):
    open(PATH, 'r').read()

# prove the parsed contents are the same no matter what the encoding used is
assert [l for l in open(PATH, 'rt', encoding='ascii')] == [l for l in open(PATH, 'rt', encoding='utf8')] == [l for l in open(PATH, 'rt', encoding='cp1252')]


def readlines_ascii():
    with open(PATH, 'rt', encoding='ascii') as f:
        for l in f:
            pass

def readlines_utf8():
    with open(PATH, 'rt', encoding='utf8') as f:
        for l in f:
            pass

def readlines_cp1252():
    with open(PATH, 'rt', encoding='cp1252') as f:
        for l in f:
            pass

print('Read as ASCII:')
print(timeit.timeit(readlines_ascii, number=10000))
print('Read as UTF-8:')
print(timeit.timeit(readlines_utf8, number=10000))
print('Read as CP-1252:')
print(timeit.timeit(readlines_cp1252, number=10000))

It’s my understanding that cp1252 is a superset of ascii and all 7-bit values are the same in the two encodings. It seems to me, without having thought about it very hard, that if it’s possible for utf8 to be about as fast as ascii when parsing ascii-only data, then cp1252 ought to be able to match it too.
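
(A quick check confirms the superset claim for the 7-bit range; this is just an illustrative one-liner, not part of the benchmark above.)

# Every 7-bit byte decodes to the same character under ascii and cp1252.
assert all(bytes([i]).decode('cp1252') == chr(i) for i in range(0x80))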

Yes, of course, you should explicitly specify the correct encoding when you open your files, but I bet there is an enormous quantity of Python code out there running on Windows which does not and largely deals with ascii data, in which case such an improvement would be generally useful.

If it’s possible, I can raise this as a feature request on GitHub, and possibly have a go at implementing it myself, but I am not sure whether it is indeed possible, so I opened this discussion first.

It is not that the cp1252 codec is slow. It is that the ASCII codec is insanely fast. It reads 4 or 8 bytes at a time, checks that they all have the high bit (0x80) clear, and writes them all at once. The UTF-8 codec first tries to decode the data as ASCII, so it has the same speed for ASCII-only data. It is the default codec, and most encoded/decoded data is ASCII-only, so this case has high priority. The cp1252 codec is a charset codec. It is also highly optimized, but for encoding/decoding non-ASCII data. Another difference is that the ASCII and UTF-8 codecs are built in, whereas for most other codecs the data is passed through the Python layer.
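
To illustrate the idea in Python (just a sketch of the word-at-a-time check; the real fast path lives in C inside CPython's decoders, and the function name here is made up):

import struct

def chunk_is_ascii(data: bytes) -> bool:
    # Look at eight bytes at a time: AND-ing a 64-bit word with 0x80 repeated
    # picks out the high bit of every byte, so a non-zero result means the
    # chunk contains at least one non-ASCII byte.
    n = len(data) - len(data) % 8
    for (word,) in struct.iter_unpack('<Q', data[:n]):
        if word & 0x8080808080808080:
            return False
    # Handle the trailing bytes that don't fill a whole word.
    return all(b < 0x80 for b in data[n:])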

In the future, UTF-8 will be the default encoding for open(), so you will automatically get a performance bonus.

The same kind of optimization (the high-bit-clear check) could be made for cp1252 in principle, right?

The cp1252 codec is just an instance of the charmap codec with a particular mapping applied. The way this works is somewhat different from the ascii or utf-8 codecs, which don’t have to apply lookups to find the correct mapping.
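
You can see this in the stdlib: as far as I can tell, Lib/encodings/cp1252.py is essentially a 256-entry decoding_table handed to the generic charmap machinery, e.g.:

import codecs
import encodings.cp1252 as cp1252

data = b'caf\xe9'  # 0xE9 is 'e' with acute accent in cp1252
text, consumed = codecs.charmap_decode(data, 'strict', cp1252.decoding_table)
assert text == data.decode('cp1252') == 'café'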

That said, feel free to send in PRs to speed up the charmap codec 🙂 People are certainly always keen to get faster codecs.

Are there any charmap codecs that aren’t ASCII-transparent? Or a reason to support that capability? If not, then the issue would be resolved by just adding the same kind of ASCII fast path, right? (if there are, perhaps individual charmap mappings could be pre-tested for ASCII-transparency?)

Yes, codecs include things like base64. If there’s enough of a performance benefit, it might be worth creating a dedicated subcategory for ASCII-compatible encodings, but it’s also worth noting that there would be a performance loss any time a non-ASCII byte is found, potentially making this of minimal value.

Try UTF-8 Mode. It will be the default from Python 3.15.
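
UTF-8 Mode can already be enabled explicitly, for example:

# Enable UTF-8 Mode for a single run:
#     python -X utf8 myscript.py
# or via the environment:
#     PYTHONUTF8=1 python myscript.py

import sys
print(sys.flags.utf8_mode)  # 1 when UTF-8 Mode is active
# With UTF-8 Mode on, open() defaults to UTF-8 instead of the locale
# encoding (cp1252 on many Windows systems).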

Not all charset codecs are ASCII-transparent. Yes, a special flag could be added for this, but is it worth it? 8-bit encodings are rarely used now, mostly for reading legacy data. You use one instead of ascii if you expect non-ASCII data; otherwise you would just use ascii.

Also, there are other reasons why builtin codecs are faster than externally defined codecs.

Most of the charmap codecs we have in Python are ASCII-transparent, but there are a few which are not. Out of the ones I checked, these are not (there may be more):

  • cp424
  • cp037
  • cp500
  • cp875
  • cp1026

The majority pass through ASCII chars unchanged.
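
(If anyone wants to check for themselves, a small loop over the 7-bit range is enough; the list of codecs below is just a sample:)

ascii_bytes = bytes(range(128))
expected = ascii_bytes.decode('ascii')

for name in ('cp1252', 'latin-1', 'cp424', 'cp037', 'cp500', 'cp875', 'cp1026'):
    transparent = ascii_bytes.decode(name, errors='replace') == expected
    print(name, 'is' if transparent else 'is NOT', 'ASCII-transparent')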

If someone is willing to work on a PR to add a flag for the charmap codec to pass through ASCII, we could speed them up. The existing codecs would then have to be adapted to make use of the flag (easiest would be to have the helpers used for building the tables automatically set this flag when detecting ASCII transparency).

An alternative is to first try the ascii codec and only fall back to the more specific codec in case there are non-ASCII chars in your string.
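
Something like the following (an illustrative helper, not an existing API):

def decode_with_fallback(data: bytes, fallback: str = 'cp1252') -> str:
    # Fast path: most of the data is expected to be pure ASCII.
    try:
        return data.decode('ascii')
    except UnicodeDecodeError:
        # Slow path: at least one byte >= 0x80, use the full charmap codec.
        return data.decode(fallback)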

Notably though, this would only speed them up when the text is entirely ASCII, right? Or would it improve performance of file reading when (say) one line of the file is all ASCII?

It depends. If you first try to decode as ASCII and then switch to slower byte-by-byte decoding, it can speed them up when the whole buffer (8 KiB by default) or at least a significant prefix of the buffer is ASCII-only. If it is more sophisticated, it can also speed them up when the data merely contains large ASCII-only sequences. But it requires writing a large amount of specialized, repetitive code, and the existence of this code can affect other parts of the charset codec, other codecs, and other string operations. Not to mention that switching between the “fast” and “slow” paths will in general slow down the codec.

I just had a look at the decoder of the charmap codec. I don’t think there is a lot to gain. Most of the charmap-based codecs use 2-byte Unicode output buffers, so regardless of whether you improve the copying, the codec will always have to allocate more memory and spend more time copying to a two-byte destination than to a one-byte one (as the ascii codec does).

If your input is likely ASCII, you’ll get better average performance overall (and lower memory usage on the Unicode side of things), if you first try the ascii codec and fall back to the full 8-bit charmap based codec in case of an error.

That’s something only the application can determine, so there isn’t much point in applying such a strategy at the low codec level.
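
At the application level that might look like this (a sketch; the helper name and the choice of cp1252 as the fallback are just for illustration):

def read_lines(path, fallback_encoding='cp1252'):
    # Assume ASCII first; reopen with the fallback codec only if a
    # non-ASCII byte turns up anywhere in the file.
    try:
        with open(path, 'rt', encoding='ascii') as f:
            return f.readlines()
    except UnicodeDecodeError:
        with open(path, 'rt', encoding=fallback_encoding) as f:
            return f.readlines()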