codecs.register_error() does not seem to be able to replace a built-in handler

I’d like to be able to modify the behaviour of the surrogateescape codec error handler when decoding, but codecs.register_error() does not seem to actually change a built-in handler.

For example:

import codecs

builtin_handler = codecs.lookup_error('surrogateescape')

def myhandler(e):
    print(f'myhandler(): {e=}')
    if isinstance(e, (UnicodeDecodeError, UnicodeTranslateError)):
        if e.object[e.start:e.start+2] == b'\xc0\x80':
            return '\0', e.start+2
    return builtin_handler(e)

print('Before codecs.register_error()')
print(f'{codecs.lookup_error("surrogateescape")=}')

codecs.register_error('surrogateescape', myhandler)
codecs.register_error('surrogateescape2', myhandler)

print('After codecs.register_error()')
print(f'{codecs.lookup_error("surrogateescape")=}')
print(f'{codecs.lookup_error("surrogateescape2")=}')

sb = b'hello\xc0\x80world'
print(f'{sb.decode(errors="surrogateescape")=}')
print(f'{sb.decode(errors="surrogateescape2")=}')

Prints:

Before codecs.register_error()
codecs.lookup_error("surrogateescape")=<built-in function surrogateescape>
After codecs.register_error()
codecs.lookup_error("surrogateescape")=<function myhandler at 0x9462b960540>
codecs.lookup_error("surrogateescape2")=<function myhandler at 0x9462b960540>
sb.decode(errors="surrogateescape")='hello\udcc0\udc80world'
myhandler(): e=UnicodeDecodeError('utf-8', b'hello\xc0\x80world', 5, 6, 'invalid start byte')
sb.decode(errors="surrogateescape2")='hello\x00world'

So the non-built-in “surrogateescape2” handler works and decodes \xc0\x80 to \x00, but the behaviour of the built-in “surrogateescape” handler is unchanged, even though codecs.lookup_error("surrogateescape") now returns myhandler.

Is this to be expected?

Thanks.

Yes, this is expected. PEP 293 states:

All encoders and decoders are allowed to implement the callback functionality themselves, if they recognize the callback name (i.e. if it is a system callback like “strict”, “replace” and “ignore”).

And this includes the handlers that were introduced by PEP 293, i.e. "backslashreplace" and "xmlcharrefreplace", as well as "namereplace", which was added later in Python 3.5.

… and of course "surrogateescape", which was part of your question. :wink:
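
Since a codec may handle a built-in name itself without consulting the registry, the dependable route is the one your "surrogateescape2" test already takes: register the customised handler under a name of its own and pass that name as errors=. A minimal sketch of that approach follows; the name "surrogateescape_nul" and the helper names are made up for illustration, not part of the codecs module.

```python
import codecs

# Keep a reference to the stock handler so unrelated errors can be delegated to it.
_builtin_surrogateescape = codecs.lookup_error("surrogateescape")


def surrogateescape_nul(exc):
    """Decode the overlong sequence C0 80 as NUL; otherwise act like surrogateescape.

    "surrogateescape_nul" is a hypothetical name chosen for this example.
    """
    if (
        isinstance(exc, UnicodeDecodeError)
        and exc.object[exc.start:exc.start + 2] == b"\xc0\x80"
    ):
        # Return the replacement text and the position at which decoding resumes.
        return "\0", exc.start + 2
    return _builtin_surrogateescape(exc)


# A name no codec recognises by itself is always looked up in the registry.
codecs.register_error("surrogateescape_nul", surrogateescape_nul)

decoded = b"hello\xc0\x80world".decode("utf-8", errors="surrogateescape_nul")
fallback = b"a\xffb".decode("utf-8", errors="surrogateescape_nul")
print(repr(decoded))
print(repr(fallback))
```

Any other undecodable byte still goes through the stock surrogateescape behaviour, so the only difference from the built-in handler is the special case for \xc0\x80.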

Ah, that’s good to know.

Many thanks for clarifying.