Why does pathlib refuse byte strings?

Hi people.

I understand that pathlib insisting on str’s might be advantageous for Windows or macOS, but what about Linux?

Linux filenames have no encoding; the kernel treats them as opaque bytes.

For application programming on Linux, using str’s that are UTF-8 might make some sense. But for systems programming, it seems inappropriate. E.g., what if you write a backup program, and some file you need to back up doesn’t have a valid UTF-8 filename?

For example:

$ cat t
below cmd output started 2023 Thu Jul 06 05:24:55 PM PDT
#!/usr/bin/env python3

import pathlib

pathlib.Path(b"/tmp/abc").touch()
above cmd output done    2023 Thu Jul 06 05:24:55 PM PDT
dstromberg@tp-mini-c:~/src/experiments/pathlib-touch x86_64-pc-linux-gnu 2939

$ ./t
below cmd output started 2023 Thu Jul 06 05:24:58 PM PDT
Traceback (most recent call last):
  File "/home/dstromberg/src/experiments/pathlib-touch/./t", line 5, in <module>
    pathlib.Path(b"/tmp/abc").touch()
  File "/usr/lib/python3.9/pathlib.py", line 1071, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.9/pathlib.py", line 696, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.9/pathlib.py", line 685, in _parse_args
    raise TypeError(
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
above cmd output done    2023 Thu Jul 06 05:24:58 PM PDT

Thanks.

Python supports special surrogate values that can be used when something actually isn’t valid text. They will be returned when you list the directory’s contents, and can be used to open the file. If I create a file whose name is b"81-\x81" (four bytes, three are ASCII and one is 0x81) and check os.listdir(), I see that file as "81-\udc81". You can pass that to pathlib.Path() and it will accept it and successfully open the file.
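Here is a sketch of that behavior (assuming Linux, where bytes paths are permitted for the low-level os calls; the directory and file names are made up for the demonstration):

```python
import os
import pathlib
import tempfile

# Create a file whose name is NOT valid UTF-8 (0x81 is an invalid start byte),
# then observe the surrogate-escaped str that os.listdir() returns for it.
with tempfile.TemporaryDirectory() as d:
    raw = os.path.join(os.fsencode(d), b"81-\x81")  # bytes path, invalid UTF-8
    with open(raw, "wb") as f:
        f.write(b"hello")

    names = os.listdir(d)            # str results use the surrogateescape handler
    assert names == ["81-\udc81"]    # byte 0x81 surfaces as lone surrogate U+DC81

    # The surrogate-bearing str is perfectly acceptable to pathlib:
    p = pathlib.Path(d) / names[0]
    assert p.read_bytes() == b"hello"
```

So the bytes never round-trip through a lossy decode: the un-decodable byte is smuggled through as a surrogate code point and restored when the name goes back to the filesystem.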

Your backup program shouldn’t even need to care about this distinction, unless it needs to filter based on such files (in which case their special encoding will have to factor into your filtering rules).


You give a somewhat convincing argument.

But what if you want to write xargs in Python?

That’s fair! xargs itself shouldn’t have a problem (since subprocess invocation accepts bytestrings for args), but perhaps you’re writing an archive extractor or something. The file names come from deep inside a binary file, and you want to faithfully recreate them. That CAN be done, but now you need to be explicit that you really do want to accept potentially-broken file names:

>>> fn = b"some-file-\x81-oopsie"
>>> fn.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 10: invalid start byte
>>> fn.decode("utf-8", "surrogateescape")
'some-file-\udc81-oopsie'

This is a much less common use-case than the situations where you want real text in file names (file names are for humans, and humans want text that they can understand), so the way to do it does require a clear and intentional declaration that you want this surrogate handling.
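To show that the declaration buys you a lossless round trip, here is a minimal sketch: decoding with "surrogateescape" and re-encoding with the same handler reproduces the original bytes exactly.

```python
fn = b"some-file-\x81-oopsie"

# The invalid byte 0x81 becomes the lone surrogate U+DC81 on the way in...
s = fn.decode("utf-8", "surrogateescape")
assert s == "some-file-\udc81-oopsie"

# ...and becomes 0x81 again on the way out, byte-for-byte.
assert s.encode("utf-8", "surrogateescape") == fn
```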

To add to this, you can use os.fsencode() and os.fsdecode() to manually convert between bytes and str.

>>> sys.getfilesystemencoding()
'utf-8'
>>> sys.getfilesystemencodeerrors()
'surrogateescape'
>>> s = os.fsdecode(b'81-\x81')
>>> s
'81-\udc81'
>>> os.fsencode(s)
b'81-\x81'

The above example fails on Windows, where bytes paths are decoded as UTF-8, except that lone surrogate codes are allowed. The error handler there is thus “surrogatepass” instead of “surrogateescape”.
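The difference between the two handlers is visible from any platform. “surrogateescape” maps a lone surrogate back to the single raw byte it stands for, while “surrogatepass” encodes the surrogate code point itself as a three-byte UTF-8 sequence:

```python
# Lone surrogate, as produced by os.fsdecode(b'\x81') on a UTF-8 Linux system:
s = "\udc81"

# surrogateescape restores the original single byte:
assert s.encode("utf-8", "surrogateescape") == b"\x81"

# surrogatepass encodes U+DC81 itself, as if it were an ordinary code point:
assert s.encode("utf-8", "surrogatepass") == b"\xed\xb2\x81"
```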


Yep, I should have mentioned those. Went with the more basic decode() method to show what’s happening, and forgot to mention that you need to be using the correct parameters, which come from os.fs*. Thanks for filling that part in.

Is it normal that such files exist? I was able to create one from Python, but I couldn’t subsequently find a way to interact with it meaningfully from a terminal window.

I’ve seen an old JPilot .pdb that had such a “character” in it. Not inside the .pdb, but in the filename of the .pdb. I think it was man~ana.pdb, where the n had a tilde above it.

They’re not common, but I think people would be justified in avoiding tools that couldn’t handle them.

In itself, that shouldn’t be a problem ("mañana.pdb" is a perfectly reasonable file name), but it’s possible that it was errantly encoded as Latin-1 or Windows-1252, resulting in the N-with-tilde being encoded as a single byte instead of the two-byte sequence C3 B1 which would be valid UTF-8.

Indeed, which is why the surrogateescape system exists.

The example to which I was referring used an 8-bit character set. I’ve been using this to create the filename for testing:
echo foo > to-be-saved/"Ma$(echo | tr '\012' '\361')ana"

So yeah, high bit set.

You can just use the os and os.path functions with bytes and sidestep the issue of encodings entirely.
You will need to be careful to use Unicode, not bytes, in any logs you create.

As you say, this is a systems-programming problem, and the tool should work with any filename Linux allows.
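A minimal sketch of that bytes-only approach (assuming Linux; the file name is invented for the demonstration): pass bytes to the os functions and they return bytes, so no decoding ever happens, and conversion for logging is done explicitly at the boundary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    root = os.fsencode(d)  # work in bytes from here on
    # 0xF1 is n-with-tilde in Latin-1, but not valid UTF-8 on its own:
    open(os.path.join(root, b"Ma\xf1ana.pdb"), "wb").close()

    # os.listdir() with a bytes argument returns bytes names, untouched:
    entries = os.listdir(root)
    assert entries == [b"Ma\xf1ana.pdb"]

    # Only when logging do we convert, explicitly and losslessly:
    printable = os.fsdecode(entries[0])  # 'Ma\udcf1ana.pdb' on a UTF-8 system
```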

Took me a while to figure out how $(echo | tr '\012' '\361') works. So, byte 0xF1 is expected to represent ñ? That matches Latin-1 (ISO-8859-1), at least… maybe the file is left over from a legacy system?
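Quick confirmation of that reading: byte 0xF1 decodes to ñ under Latin-1, whereas the same character takes the two bytes C3 B1 in UTF-8.

```python
# 0xF1 is LATIN SMALL LETTER N WITH TILDE in Latin-1 (ISO-8859-1)...
assert b"\xf1".decode("latin-1") == "ñ"
assert b"Ma\xf1ana".decode("latin-1") == "Mañana"

# ...but UTF-8 encodes that character as two bytes:
assert "ñ".encode("utf-8") == b"\xc3\xb1"
```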

Yes, it’s from JPilot, which is old Linux software that synchronized and edited Palm Pilot .pdb’s.

But there’s little more than convention to stop a Linux developer from using Latin-1 today.