Hi guys, I've hit an issue with encoding file paths. I'm generating a CSV from the contents of some PDF files. Each row represents one file, so I'd like to include the file path as a field in the row. The application will run on Windows and be used in Poland, so it needs to support Polish characters and the windows-1250 encoding. To add some more context (and complexity), I write and test the app on a Mac.
Here is what I am dealing with:
>>> from pathlib import Path
>>> path = Path("resources/pdf/111_1111_Testyńska sz.pdf")
>>> path.exists()
True
>>> str(path).encode(encoding='windows-1250')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u0301' in position 31: character maps to <undefined>
encoding with 'windows-1250' codec failed
BEWARE: copying this code will most likely not reproduce the error, because the character has already been re-encoded while publishing this post. I only hit it when fetching the files with pathlib.Path.glob().
In this particular example, the problem is the character ń. It is treated as two characters: n and a combining acute accent (the '\u0301' in the traceback). The letter n is encoded correctly, but a UnicodeEncodeError is thrown when the codec reaches the combining acute.
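Since pasting the path from this post probably won't reproduce the problem, here is a minimal sketch that builds both forms of the name with explicit escapes. The shortened name "Testyńska" and the helpers describe() and can_encode() are only for illustration, they are not part of my app.

```python
import unicodedata

# Illustrative strings: escapes keep the decomposed form intact through copy and paste.
decomposed = "Testyn\u0301ska"  # 'n' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Testy\u0144ska"     # single code point U+0144 (n with acute)


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def can_encode(text, encoding="windows-1250"):
    """Report whether text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(decomposed == composed)          # the two strings look alike but differ
print(len(decomposed), len(composed))  # the decomposed form is one character longer
print(describe(decomposed)[5:7])       # the 'n' and the combining accent
print(can_encode(decomposed), can_encode(composed))
```

Both strings render the same way on screen, but only the composed one fits into windows-1250, because that code page has a slot for ń and none for the standalone combining accent.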
I managed to find a solution by using unicodedata.normalize().
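Roughly, the workaround looks like the sketch below. I'm assuming the NFC form here (it composes n + combining acute into the single character ń), and csv_safe_path() is a made-up helper name. The CSV is written to an in-memory buffer just to keep the example self-contained.

```python
import csv
import io
import unicodedata
from pathlib import Path


def csv_safe_path(path, encoding="windows-1250"):
    """Return str(path) in NFC form after checking it fits the target encoding."""
    text = unicodedata.normalize("NFC", str(path))
    text.encode(encoding)  # still raises UnicodeEncodeError for unsupported characters
    return text


# Decomposed path written with an escape, standing in for what glob() gave me.
path = Path("resources/pdf/111_1111_Testyn\u0301ska sz.pdf")

# Write one CSV row encoded as windows-1250 into an in-memory buffer.
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="windows-1250", newline="")
csv.writer(wrapper).writerow([csv_safe_path(path)])
wrapper.flush()
row_bytes = buffer.getvalue()

print(row_bytes)
```

Normalizing only at the point where the path goes into the CSV leaves the original Path object untouched, so it can still be used to open the file.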
What I want to know now is why this happens. Why are paths treated differently from other strings?