Paths with non-standard characters cannot be encoded with windows-1250

Hi guys, I hit an issue with encoding paths. I am generating a CSV based on the contents of some PDF files; each row represents one file, so I would like to include the file path as a field in the row. The application is going to run on Windows and will be used in Poland, so it needs to support Polish characters and the windows-1250 encoding. To add some more context (and complexity), I write and test this app on a Mac.
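For context, the writing side looks roughly like this (a minimal sketch; the output file name and the second field are made up):

import csv
from pathlib import Path

with open("report.csv", "w", encoding="windows-1250", newline="") as f:
    writer = csv.writer(f)
    for pdf in Path("resources/pdf").glob("*.pdf"):
        # writing the row encodes str(pdf) with windows-1250,
        # which is where the error below surfaces
        writer.writerow([str(pdf), "parsed content..."])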

Here is what I am dealing with:

>>> path = Path("resources/pdf/111_1111_Testyńska sz.pdf")
>>> path.exists()
True
>>> str(path).encode(encoding='windows-1250')                 
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 31: character maps to <undefined>
encoding with 'windows-1250' codec failed

BEWARE: copying this code will most likely not reproduce the error, as the character has probably been normalized to its precomposed form while publishing this post. I only hit it when fetching the paths with pathlib.Path.glob().
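If you want to reproduce it without touching the file system, you can build the decomposed form explicitly; you should get the same UnicodeEncodeError:

>>> import unicodedata
>>> unicodedata.normalize('NFD', 'Testyńska').encode('windows-1250')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 6: character maps to <undefined>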

In this particular example, the problem is the character ń. It is stored as two code points: n and a combining acute accent. The base letter encodes fine, but a UnicodeEncodeError is raised for the combining acute accent.

I managed to find a solution by using unicodedata.normalize().
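For reference, with path as above, something along these lines does the trick (the precomposed ń then maps to a single byte):

>>> import unicodedata
>>> unicodedata.normalize('NFC', str(path)).encode('windows-1250')
b'resources/pdf/111_1111_Testy\xf1ska sz.pdf'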

What I want to know now is: why does this happen? Why are paths treated differently from other strings?

macOS normalises filenames to NFD, which is why ń is two code points in the name, but cp1250 doesn't encode ń like that - it uses a single byte.

So, you’re going to have to normalise to NFC before encoding as cp1250.
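You can see both parts of this directly:

>>> 'ń'.encode('windows-1250')   # the precomposed (NFC) form maps to a single byte
b'\xf1'
>>> import unicodedata
>>> len(unicodedata.normalize('NFD', 'ń'))   # the decomposed (NFD) form is two code points
2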


Thanks for the explanation. It took me a while to understand what was causing the problem.
I understand that this is caused by Unicode itself, which allows multiple representations of the same character, and there are reasons for that. But then I am thinking: is this something that could potentially be handled better in Python, and if so, how?

For illustration:

>>> import unicodedata
>>> [unicodedata.name(x) for x in unicodedata.normalize('NFD', 'ń')]
['LATIN SMALL LETTER N', 'COMBINING ACUTE ACCENT']
>>> [unicodedata.name(x) for x in unicodedata.normalize('NFC', 'ń')]
['LATIN SMALL LETTER N WITH ACUTE']

Python supports writing CSV content in Unicode (e.g. UTF-8). Maybe that would be better overall? (Unless the program that will read the file does not support it.)
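Something like this (a sketch; the file name and row contents are invented):

import csv

with open("report.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["resources/pdf/111_1111_Testyńska sz.pdf", "some field"])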

Thanks for the example; it perfectly illustrates the issue.

I agree that would be better overall, but the program that will consume this data requires windows-1250, so Unicode output is a no-go in my case. I guess I will just have to normalize all the strings (or paths, in my case) to NFC.