I understand that pathlib insisting on strs might be advantageous on Windows or macOS, but what about Linux?
Linux filenames have no encoding.
For application programming on Linux, using strs that are UTF-8 might make some sense. But for systems programming, it seems inappropriate. E.g.: what if you write a backup program, and some file you need to back up doesn’t have a valid UTF-8 filename?
$ cat t
below cmd output started 2023 Thu Jul 06 05:24:55 PM PDT
above cmd output done 2023 Thu Jul 06 05:24:55 PM PDT
dstromberg@tp-mini-c:~/src/experiments/pathlib-touch x86_64-pc-linux-gnu 2939
below cmd output started 2023 Thu Jul 06 05:24:58 PM PDT
Traceback (most recent call last):
File "/home/dstromberg/src/experiments/pathlib-touch/./t", line 5, in <module>
File "/usr/lib/python3.9/pathlib.py", line 1071, in __new__
self = cls._from_parts(args, init=False)
File "/usr/lib/python3.9/pathlib.py", line 696, in _from_parts
drv, root, parts = self._parse_args(args)
File "/usr/lib/python3.9/pathlib.py", line 685, in _parse_args
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
above cmd output done 2023 Thu Jul 06 05:24:58 PM PDT
Python supports special surrogate values that can be used when something actually isn’t valid text. They will be returned when you list the directory’s contents, and can be used to open the file. If I create a file whose name is b"81-\x81" (four bytes, three are ASCII and one is 0x81) and check os.listdir(), I see that file as "81-\udc81". You can pass that to pathlib.Path() and it will accept it and successfully open the file.
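That round trip can be sketched in a few lines (the file name b"81-\x81" is the one from above; this assumes a Linux filesystem such as ext4 or tmpfs that permits arbitrary bytes in names, and a UTF-8 locale):

```python
import os
import pathlib
import tempfile

def roundtrip_weird_name():
    """Create a file whose name isn't valid UTF-8, then list and reopen it."""
    with tempfile.TemporaryDirectory() as d:
        # Build the raw name at the bytes level; 0x81 is not valid UTF-8.
        raw = os.path.join(os.fsencode(d), b"81-\x81")
        with open(raw, "wb") as f:
            f.write(b"hello")
        # Listing with a str path yields str names; the stray 0x81 byte
        # comes back as the surrogate U+DC81.
        name = os.listdir(d)[0]
        # pathlib.Path accepts the surrogate-laden str and opens the file.
        data = pathlib.Path(d, name).read_bytes()
        return name, data
```

On a UTF-8 locale, the returned name is "81-\udc81", and reading through pathlib.Path gives back b"hello" with no special handling.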
Your backup program shouldn’t even need to care about this distinction, unless it needs to filter based on such files (in which case their special encoding will have to factor into your filtering rules).
That’s fair! xargs itself shouldn’t have a problem (since subprocess invocation accepts bytestrings for args), but perhaps you’re writing an archive extractor or something. The file names come from deep inside a binary file, and you want to faithfully recreate them. That CAN be done, but now you need to be explicit that you really do want to accept potentially-broken file names:
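On the subprocess point, a quick sketch showing that argv elements may be bytes on POSIX, so an invalid-UTF-8 name passes into the child untouched (the /bin/echo path is an assumption about the host):

```python
import subprocess

# On POSIX, subprocess accepts bytes in argv, so the raw 0x81 byte
# survives the trip into the child process unchanged.
result = subprocess.run(
    [b"/bin/echo", b"some-file-\x81-oopsie"],
    capture_output=True,
)
# echo writes its argument followed by a newline.
assert result.stdout == b"some-file-\x81-oopsie\n"
```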
>>> fn = b"some-file-\x81-oopsie"
>>> fn.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 10: invalid start byte
>>> fn.decode("utf-8", "surrogateescape")
'some-file-\udc81-oopsie'
This is a much less common use case than the situations where you want real text in file names (file names are for humans, and humans want text they can understand), so doing it requires a clear, intentional declaration that you want this surrogate handling.
Yep, I should have mentioned those. Went with the more basic decode() method to show what’s happening, and forgot to mention that you need to use the right parameters, which is what the os.fsencode()/os.fsdecode() helpers do for you. Thanks for filling that part in.
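For reference, those helpers apply the filesystem encoding together with the surrogateescape handler, so arbitrary bytes round-trip losslessly. A sketch, assuming the usual UTF-8 filesystem encoding:

```python
import os

raw = b"some-file-\x81-oopsie"
# fsdecode: bytes -> str, smuggling undecodable bytes through as surrogates.
as_str = os.fsdecode(raw)
# fsencode: str -> bytes, turning the surrogates back into the raw bytes.
assert os.fsencode(as_str) == raw
```

With a UTF-8 filesystem encoding, as_str comes out as "some-file-\udc81-oopsie", the same value os.listdir() would hand you.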
In itself, that shouldn’t be a problem ("mañana.pdf" is a perfectly reasonable file name), but it’s possible that it was errantly encoded as Latin-1 or Windows-1252, resulting in the N-with-tilde being stored as a single byte instead of the two-byte sequence C3 B1 that would be valid UTF-8.
Indeed, which is why the surrogateescape system exists.
Took me a while to figure out how $(echo | tr '\012' '\361') works. So, byte 0xF1 is expected to represent ñ? That matches Latin-1 (ISO-8859-1), at least… maybe the file is left over from a legacy system?
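The encoding mixup being discussed can be checked directly. A small sketch, using the "mañana.pdf" example from above:

```python
# "ñ" is two bytes in UTF-8 but the single byte 0xF1 in Latin-1.
assert "ñ".encode("utf-8") == b"\xc3\xb1"
assert "ñ".encode("latin-1") == b"\xf1"

# A Latin-1-encoded name fails strict UTF-8 decoding, so surrogateescape
# smuggles the stray byte through as the surrogate U+DCF1...
mangled = b"ma\xf1ana.pdf".decode("utf-8", "surrogateescape")
assert mangled == "ma\udcf1ana.pdf"

# ...while decoding with the encoding the name was actually written in
# recovers the intended text.
assert b"ma\xf1ana.pdf".decode("latin-1") == "mañana.pdf"
```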