Print and surrogates not allowed

Many people encounter UnicodeEncodeError: surrogates not allowed when printing a filename that is encoded with something other than UTF-8. It can happen when you walk a filesystem that still contains a leftover latin1 filename.

I found that this issue doesn’t happen when printing a list of str.

from pathlib import Path
import os.path
import sys

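def make_file(path: Path | str, path_encoding: str = 'utf8') -> None:
    # Encode the name ourselves so the bytes stored on disk use path_encoding
    encoded_path = str(path).encode(path_encoding)
    print(f"make file '{encoded_path}'")
    if not os.path.exists(encoded_path):
        # Create an empty file under the encoded name
        with open(encoded_path, 'w') as fh:
            fh.write('')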

# Create a directory
ROOT = Path('./path-encoding')
ROOT.mkdir(exist_ok=True)

# Create a UTF-8 filename
éléphant = ROOT / 'éléphant-utf8'
make_file(éléphant, 'utf8')

# Create a bad encoding filename
éléphant = ROOT / 'éléphant-latin1'
make_file(éléphant, 'latin1')

# Path.walk() requires Python 3.12 or later
for root, directories, files in ROOT.walk():
    print(files)
    #  ['éléphant-utf8', '\udce9l\udce9phant-latin1']
    for name in files:
        #! print(name)
        #  UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
        # Write the original bytes straight to stdout, one name per line
        sys.stdout.buffer.write(name.encode('utf8', 'surrogateescape') + b'\n')

I’m not sure I understand what your question is. What is your question? Are you getting an error you don’t understand? Are you expecting an error that you aren’t getting?

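When you print a list of strings, the list's text is built by calling repr() (the __repr__ dunder method) on each item, which shows strings quoted and with unprintable characters, lone surrogates included, escaped. The escaped form contains no surrogates, so it encodes without error.

When you print a string, it uses the __str__ dunder method, which gives the contents of the string unchanged. Those contents then have to be encoded for stdout, and the strict UTF-8 encoder refuses lone surrogates, hence the UnicodeEncodeError.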
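The difference can be reproduced without touching the filesystem. This is only a minimal sketch: the string literal is the latin1 name as it appears in the listing above, and the strict UTF-8 encode stands in for what a UTF-8 stdout does.

```python
# The latin1 filename as the walk returned it: each 0xE9 byte became U+DCE9
name = '\udce9l\udce9phant-latin1'

# repr() escapes the lone surrogates, so the text is safe to encode and print
escaped = repr(name)
print(escaped)
print(escaped.isascii())

# The raw contents still hold the surrogates; a strict UTF-8 encode rejects them
try:
    name.encode('utf-8')
except UnicodeEncodeError as exc:
    reason = exc.reason
    print(reason)
```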

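Strictly speaking, strings shouldn’t contain surrogates - they belong to the UTF-16 encodings - but they’re a useful way of handling filenames that are encoded “wrongly”. Python puts them there itself: when it decodes a filename, any byte that isn’t valid in the filesystem encoding goes through the surrogateescape error handler, so the latin1 byte 0xE9 becomes the lone surrogate U+DCE9, and encoding with the same handler gives the original byte back. They let you write tools to handle filesystems more robustly. Just don’t try to print them!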
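If you do need to show such a name to a user, one option is to turn the surrogates back into bytes and decode those with a lenient error handler. A rough sketch, assuming UTF-8 is the encoding you expect; printable is a made-up helper name, not something from the standard library:

```python
def printable(name: str) -> str:
    """Return a copy of name that is safe to print (hypothetical helper)."""
    # surrogateescape turns each U+DCxx surrogate back into its original byte
    raw = name.encode('utf-8', 'surrogateescape')
    # Decode again, showing undecodable bytes as \xNN escapes
    return raw.decode('utf-8', 'backslashreplace')

clean = printable('éléphant-utf8')
broken = printable('\udce9l\udce9phant-latin1')

print(clean == 'éléphant-utf8')  # True: valid names pass through unchanged
print(broken)                    # \xe9l\xe9phant-latin1
```

The string stays lossless inside the program and only the displayed copy is escaped. If you would rather send the original bytes to the terminal, as the sys.stdout.buffer.write() call in the question does, sys.stdout.reconfigure(errors='surrogateescape') should have a similar effect for plain print() calls.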