I’d like to be able to modify the behaviour of the surrogateescape
codec error handler when decoding, but codecs.register_error()
does not seem to actually change a built-in handler.
For example:
import codecs

# Keep a reference to the original handler so it can be used as a fallback.
builtin_handler = codecs.lookup_error('surrogateescape')

def myhandler(e):
    print(f'myhandler(): {e=}')
    if isinstance(e, (UnicodeDecodeError, UnicodeTranslateError)):
        # Treat the overlong two-byte encoding of NUL as '\0'.
        if e.object[e.start:e.start+2] == b'\xc0\x80':
            return '\0', e.start+2
    # Anything else is left to the built-in handler.
    return builtin_handler(e)

print('Before codecs.register_error()')
print(f'{codecs.lookup_error("surrogateescape")=}')

# Register the same handler under the built-in name and under a new name.
codecs.register_error('surrogateescape', myhandler)
codecs.register_error('surrogateescape2', myhandler)

print('After codecs.register_error()')
print(f'{codecs.lookup_error("surrogateescape")=}')
print(f'{codecs.lookup_error("surrogateescape2")=}')

sb = b'hello\xc0\x80world'
print(f'{sb.decode(errors="surrogateescape")=}')
print(f'{sb.decode(errors="surrogateescape2")=}')
Prints:
Before codecs.register_error()
codecs.lookup_error("surrogateescape")=<built-in function surrogateescape>
After codecs.register_error()
codecs.lookup_error("surrogateescape")=<function myhandler at 0x9462b960540>
codecs.lookup_error("surrogateescape2")=<function myhandler at 0x9462b960540>
sb.decode(errors="surrogateescape")='hello\udcc0\udc80world'
myhandler(): e=UnicodeDecodeError('utf-8', b'hello\xc0\x80world', 5, 6, 'invalid start byte')
sb.decode(errors="surrogateescape2")='hello\x00world'
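So using the non-built-in “surrogateescape2” handler works and decodes \xc0\x80 to \x00, but decoding with the built-in “surrogateescape” name is unchanged: myhandler() is never called for it, even though codecs.lookup_error() reports that myhandler is now registered under that name.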
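For reference, the variant that does work for me can be reduced to the sketch below. It only registers the handler under a new name and leaves the built-in name alone. The name 'surrogateescape_nul' and the helper decode_modified_utf8() are made up for this illustration and are not part of the codecs module.

```python
import codecs

# Original handler, used as the fallback for every other undecodable byte.
_builtin = codecs.lookup_error('surrogateescape')

def nul_aware_surrogateescape(exc):
    # Map the overlong two-byte NUL encoding to '\0' and resume after it.
    if (isinstance(exc, UnicodeDecodeError)
            and exc.object[exc.start:exc.start + 2] == b'\xc0\x80'):
        return '\0', exc.start + 2
    return _builtin(exc)

# 'surrogateescape_nul' is a hypothetical name chosen for this sketch.
codecs.register_error('surrogateescape_nul', nul_aware_surrogateescape)

def decode_modified_utf8(data):
    # Hypothetical helper: decode UTF-8 using the custom handler name.
    return data.decode('utf-8', errors='surrogateescape_nul')

print(repr(decode_modified_utf8(b'hello\xc0\x80world')))  # 'hello\x00world'
print(repr(decode_modified_utf8(b'bad\xffbyte')))         # 'bad\udcffbyte'
```

The catch is that every call site has to pass the new handler name explicitly, which is exactly what I was hoping to avoid by overriding the built-in one.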
Is this to be expected?
Thanks.