Why does pathlib refuse byte strings?

Hi people.

I understand that pathlib insisting on str’s might be advantageous for Windows or macOS, but what about Linux?

Linux filenames have no encoding; the kernel treats them as opaque bytes.

For application programming on Linux, using str’s that are UTF-8 might make some sense. But for systems programming, it seems inappropriate. E.g., what if you write a backup program, and some file you need to back up doesn’t have a valid UTF-8 filename?

For example:

$ cat t
below cmd output started 2023 Thu Jul 06 05:24:55 PM PDT
#!/usr/bin/env python3

import pathlib

pathlib.Path(b"/tmp/abc").touch()
above cmd output done    2023 Thu Jul 06 05:24:55 PM PDT
dstromberg@tp-mini-c:~/src/experiments/pathlib-touch x86_64-pc-linux-gnu 2939

$ ./t
below cmd output started 2023 Thu Jul 06 05:24:58 PM PDT
Traceback (most recent call last):
  File "/home/dstromberg/src/experiments/pathlib-touch/./t", line 5, in <module>
    pathlib.Path(b"/tmp/abc").touch()
  File "/usr/lib/python3.9/pathlib.py", line 1071, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.9/pathlib.py", line 696, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.9/pathlib.py", line 685, in _parse_args
    raise TypeError(
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
above cmd output done    2023 Thu Jul 06 05:24:58 PM PDT

Thanks.

Python supports special surrogate values that can be used when something actually isn’t valid text. They will be returned when you list the directory’s contents, and can be used to open the file. If I create a file whose name is b"81-\x81" (four bytes, three are ASCII and one is 0x81) and check os.listdir(), I see that file as "81-\udc81". You can pass that to pathlib.Path() and it will accept it and successfully open the file.
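Here is a sketch of that behavior (assuming Linux, where bytes paths are permitted for the low-level os calls; the directory and file names are made up for the demonstration):

```python
import os
import pathlib
import tempfile

# Create a file whose name is NOT valid UTF-8 (0x81 is an invalid start byte),
# then observe the surrogate-escaped str that os.listdir() returns for it.
with tempfile.TemporaryDirectory() as d:
    raw = os.path.join(os.fsencode(d), b"81-\x81")  # bytes path, invalid UTF-8
    with open(raw, "wb") as f:
        f.write(b"hello")

    names = os.listdir(d)            # str results use the surrogateescape handler
    assert names == ["81-\udc81"]    # byte 0x81 surfaces as lone surrogate U+DC81

    # The surrogate-bearing str is perfectly acceptable to pathlib:
    p = pathlib.Path(d) / names[0]
    assert p.read_bytes() == b"hello"
```

So the bytes never round-trip through a lossy decode: the un-decodable byte is smuggled through as a surrogate code point and restored when the name goes back to the filesystem.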

Your backup program shouldn’t even need to care about this distinction, unless it needs to filter based on such files (in which case their special encoding will have to factor into your filtering rules).


You give a somewhat convincing argument.

But what if you want to write xargs in Python?

That’s fair! xargs itself shouldn’t have a problem (since subprocess invocation accepts bytestrings for args), but perhaps you’re writing an archive extractor or something. The file names come from deep inside a binary file, and you want to faithfully recreate them. That CAN be done, but now you need to be explicit that you really do want to accept potentially-broken file names:

>>> fn = b"some-file-\x81-oopsie"
>>> fn.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 10: invalid start byte
>>> fn.decode("utf-8", "surrogateescape")
'some-file-\udc81-oopsie'

This is a much less common use-case than the situations where you want real text in file names (file names are for humans, and humans want text that they can understand), so the way to do it does require a clear and intentional declaration that you want this surrogate handling.
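To show that the declaration buys you a lossless round trip, here is a minimal sketch: decoding with "surrogateescape" and re-encoding with the same handler reproduces the original bytes exactly.

```python
fn = b"some-file-\x81-oopsie"

# The invalid byte 0x81 becomes the lone surrogate U+DC81 on the way in...
s = fn.decode("utf-8", "surrogateescape")
assert s == "some-file-\udc81-oopsie"

# ...and becomes 0x81 again on the way out, byte-for-byte.
assert s.encode("utf-8", "surrogateescape") == fn
```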

To add to this, you can use os.fsencode() and os.fsdecode() to manually convert between bytes and str.

>>> sys.getfilesystemencoding()
'utf-8'
>>> sys.getfilesystemencodeerrors()
'surrogateescape'
>>> s = os.fsdecode(b'81-\x81')
>>> s
'81-\udc81'
>>> os.fsencode(s)
b'81-\x81'

The above example fails on Windows, where bytes paths are decoded as UTF-8, except that lone surrogate codes are allowed. The error handler there is thus “surrogatepass” instead of “surrogateescape”.
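The difference between the two handlers is visible from any platform. “surrogateescape” maps a lone surrogate back to the single raw byte it stands for, while “surrogatepass” encodes the surrogate code point itself as a three-byte UTF-8 sequence:

```python
# Lone surrogate, as produced by os.fsdecode(b'\x81') on a UTF-8 Linux system:
s = "\udc81"

# surrogateescape restores the original single byte:
assert s.encode("utf-8", "surrogateescape") == b"\x81"

# surrogatepass encodes U+DC81 itself, as if it were an ordinary code point:
assert s.encode("utf-8", "surrogatepass") == b"\xed\xb2\x81"
```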


Yep, I should have mentioned those. Went with the more basic decode() method to show what’s happening, and forgot to mention that you need to be using the correct parameters, which come from os.fs*. Thanks for filling that part in.

Is it normal that such files exist? I was able to create one from Python, but I couldn’t subsequently find a way to interact with it meaningfully from a terminal window.

I’ve seen an old JPilot .pdb that had such a “character” in it. Not inside the .pdb, but in the filename of the .pdb. I think it was man~ana.pdb, where the n had a tilde above it.

They’re not common, but I think people would be justified in avoiding tools that couldn’t handle them.

In itself, that shouldn’t be a problem ("mañana.pdb" is a perfectly reasonable file name), but it’s possible that it was errantly encoded as Latin-1 or Windows-1252, resulting in the N-with-tilde being encoded as a single byte instead of the two-byte sequence C3 B1 which would be valid UTF-8.

Indeed, which is why the surrogateescape system exists.

The example to which I was referring used an 8-bit character set. I’ve been using this to create the filename for testing:
echo foo > to-be-saved/"Ma$(echo | tr '\012' '\361')ana"

So yeah, high bit set.

You can just use the os and os.path functions with bytes and sidestep the issue of encodings entirely.
You will need to be careful to use Unicode, not bytes, in any logs you create.

As you say, this is a systems-programming problem, and the tool should work with any filename Linux allows.
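A minimal sketch of that bytes-only approach (assuming Linux; the file name is invented for the demonstration): pass bytes to the os functions and they return bytes, so no decoding ever happens, and conversion for logging is done explicitly at the boundary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    root = os.fsencode(d)  # work in bytes from here on
    # 0xF1 is n-with-tilde in Latin-1, but not valid UTF-8 on its own:
    open(os.path.join(root, b"Ma\xf1ana.pdb"), "wb").close()

    # os.listdir() with a bytes argument returns bytes names, untouched:
    entries = os.listdir(root)
    assert entries == [b"Ma\xf1ana.pdb"]

    # Only when logging do we convert, explicitly and losslessly:
    printable = os.fsdecode(entries[0])  # 'Ma\udcf1ana.pdb' on a UTF-8 system
```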

Took me a while to figure out how $(echo | tr '\012' '\361') works. So, byte 0xF1 is expected to represent ñ? That matches Latin-1 (ISO-8859-1), at least… maybe the file is left over from a legacy system?
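Quick confirmation of that reading: byte 0xF1 decodes to ñ under Latin-1, whereas the same character takes the two bytes C3 B1 in UTF-8.

```python
# 0xF1 is LATIN SMALL LETTER N WITH TILDE in Latin-1 (ISO-8859-1)...
assert b"\xf1".decode("latin-1") == "ñ"
assert b"Ma\xf1ana".decode("latin-1") == "Mañana"

# ...but UTF-8 encodes that character as two bytes:
assert "ñ".encode("utf-8") == b"\xc3\xb1"
```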

Yes, it’s from JPilot, which is old Linux software that synchronized and edited Palm Pilot .pdb’s.

But there’s little more than convention to stop a Linux developer from using Latin-1 today.