Pathlib: preserve trailing slash

I find this issue intimidating:

It seems to me that we should solve it, as trailing slashes are meaningful to path resolution. Quoting man path_resolution on my local system:

If a pathname ends in a ‘/’, that forces resolution of the preceding component as in Step 2: it has to exist and resolve to a directory. Otherwise, a trailing ‘/’ is ignored. (Or, equivalently, a pathname with a trailing ‘/’ is equivalent to the pathname obtained by appending ‘.’ to it.)

And POSIX 4.12:

A pathname that contains at least one non-slash character and that ends with one or more trailing slash characters shall not be resolved successfully unless the last pathname component before the trailing slash characters names an existing directory or a directory entry that is to be created for a directory immediately after the pathname is resolved.

The fact that pathlib strips the trailing slash flies in the face of its otherwise conservative normalisation rules – for example, pathlib.Path.absolute() will not remove .. segments, whereas os.path.abspath() does remove these segments, which might change the meaning of a path.
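
To make the contrast concrete, here is a small sketch. It uses the pure POSIX flavours so that it behaves the same on any platform: `posixpath.abspath()` stands in for `os.path.abspath()`, and `PurePosixPath` shows the normalisation pathlib applies when a path is constructed. The example paths are made up.

```python
import posixpath
from pathlib import PurePosixPath

# pathlib keeps '..' segments, because collapsing them lexically could
# change which file the path refers to when symlinks are involved.
kept = str(PurePosixPath("/foo/bar/../baz"))

# posixpath.abspath() (os.path.abspath() on POSIX) collapses them.
collapsed = posixpath.abspath("/foo/bar/../baz")

# Yet pathlib silently drops a trailing slash.
stripped = str(PurePosixPath("/foo/bar/"))

print(kept)       # /foo/bar/../baz
print(collapsed)  # /foo/baz
print(stripped)   # /foo/bar
```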

I’m not sure how to fix the issue. My current thinking is to add a keep_trailing_sep argument to the PurePath and Path initialisers. Perhaps it could be None by default, which would raise a deprecation warning but otherwise act like False. In a future version of Python, we could then change the default to True.

Would that be worthwhile, or is it too disruptive? Any better ideas? Thanks

As an alternative to the keep_trailing_sep idea, we could instead emit a warning when a path with a trailing slash is given to PurePath() or Path(), and after a couple of versions, remove the warning and begin preserving the trailing slash.

To ask a naïve question, what would break if we switched immediately to preserving any trailing slashes?

Two things in the standard library:

An importlib.metadata.PathDistribution object may point to an egg-info file rather than a directory. Its read_text() method accepts a filename, which may be set to the empty string. The method calls joinpath() to join these:

If we preserve trailing slashes, this method adds a trailing slash to the egg-info file path (via joinpath('')), which makes it impossible to open as a file.
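
The failure mode can be approximated without touching importlib.metadata at all. In the sketch below, the file name `example.egg-info` and its contents are invented for illustration, and `os.path.join()` is used only to imitate what a slash-preserving `joinpath('')` might produce; it is not the actual importlib.metadata code.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical egg-info *file* (not a directory).
    egg_info = Path(tmp) / "example.egg-info"
    egg_info.write_text("Metadata-Version: 2.1\n")

    # Today, joining with '' is a no-op, so the file can still be read.
    joined = egg_info.joinpath("")
    same_path = joined == egg_info
    text = joined.read_text()

    # os.path.join() appends a trailing separator instead. Opening that
    # string is expected to fail, since the target is not a directory.
    with_sep = os.path.join(str(egg_info), "")
    try:
        open(with_sep).close()
        open_failed = False
    except OSError:
        open_failed = True

print(same_path, open_failed)
```

On Linux the failed open surfaces as NotADirectoryError; other platforms may report a different OSError.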

And zipfile.Path actually relies on pathlib to strip trailing slashes (which it otherwise uses to indicate directories):

The zipfile.Path bit could be resolved in future by making it a subclass of the upcoming pathlib.PathBase class.

I suspect a lot of code will break. This would be a change in semantics that is backwards incompatible. It would have to be opt-in.

For example what would name return if there is a trailing pathsep?

What POSIX API does this rule apply to? Taken out of context, I do not know what it means.

I think it should return the empty string, similarly to PosixPath('/').name or WindowsPath('C:/').name. This allows the following idiom to be used to remove any trailing slash:

if not path.name:
    path = path.parent
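
For reference, this is how the idiom behaves with pathlib as it stands today, where only roots have an empty name. The helper name `strip_trailing` is made up for this sketch; the proposed empty name for `foo/` is not implemented here.

```python
from pathlib import PurePosixPath, PureWindowsPath


def strip_trailing(path):
    """Apply the idiom: fall back to the parent when the name is empty."""
    if not path.name:
        path = path.parent
    return path


# Roots already have an empty name today...
root_name = PurePosixPath("/").name
drive_name = PureWindowsPath("C:/").name

# ...and the idiom is harmless on them, because a root is its own parent.
root_unchanged = strip_trailing(PurePosixPath("/")) == PurePosixPath("/")

# With current pathlib the trailing slash is already gone by this point.
today_name = PurePosixPath("foo/").name

print(repr(root_name), repr(drive_name), root_unchanged, repr(today_name))
```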

I should have provided a link, and not misquoted the section number. It’s in the “general concepts” chapter:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_13

(Cross-posted to GitHub: "pathlib strips trailing slash", Issue #65238, python/cpython)

I’m sorry Antoine chose to let pathlib deviate from the os.path module’s behavior for this. I recall thinking long and hard about edge cases like this and carefully implementing what I thought was best. I haven’t read that POSIX standard but I presume I was influenced by actual behavior of various UNIX utilities.

But I agree that changing this will break plenty of user code relying on the current behavior, and we can’t have that. I also don’t think that some kind of deprecation path makes sense here. The best we can do is have some way to indicate the preferred behavior when a path is created, and have that inherited by operations that return new paths.

I’m not sure what form the user preference should take – I’d say it shouldn’t be global, so it could take the form of either an alternative class, an alternative constructor, or a flag keyword argument to path-constructing operations.

Even so, there could be problems – suppose we have a library that accepts Path arguments and expects them to behave the old way, and a user constructs paths using the alternate constructor and passes those in. It might be quite a while before the user ends up passing a path that causes the library to crash or misbehave.

So maybe an alternative approach could be not to have the behavior be indicated by some property of the Path instance but by using different attributes. So maybe e.g. Path("foo/").name would return "foo" but Path("foo/").alt_name would return "". This would avoid the scenario I described just above.

Now we just have to decide on names for the attributes that could have this alternate behavior (are there others besides .parent and .name?). And probably the implementation will have to keep track of the trailing slash somehow.

tail looks like a good name candidate for .name. The term is already used by the os.path.split docs to mean exactly what is expected here. head could also be the alternative name to .parent with the new semantics.
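
For anyone unfamiliar with the head/tail terminology, a quick illustration of what `os.path.split()` returns next to the current `.parent` and `.name`. `posixpath` is used so the output does not depend on the platform; the attribute names `head` and `tail` on a path object remain only a suggestion.

```python
import posixpath
from pathlib import PurePosixPath

# os.path.split() treats a trailing slash as "empty tail".
head, tail = posixpath.split("/foo/bar/")

# pathlib drops the slash first, so name and parent shift up one level.
p = PurePosixPath("/foo/bar/")
parent, name = str(p.parent), p.name

print((head, tail))    # ('/foo/bar', '')
print((parent, name))  # ('/foo', 'bar')
```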

I think there are too many affected methods for that to work. For example, most methods defined in Path are affected by this bug:

>>> import pathlib, os
>>> pathlib.Path('/etc/hosts/').stat()
os.stat_result(st_mode=33188, st_ino=82182074, st_dev=16777233, st_nlink=1, st_uid=0, st_gid=0, st_size=213, st_atime=1690969486, st_mtime=1690969486, st_ctime=1692390716)
>>> os.stat('/etc/hosts/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotADirectoryError: [Errno 20] Not a directory: '/etc/hosts/'

There’s also __str__() and __fspath__().

Ah. But those errors are legit, and I don’t mind getting those where we didn’t before. The trailing slash is only valid for directories (or when we don’t know).

I think __str__ and __fspath__ are also fine to change.

I like the current pathlib behavior. I use pathlib in preference to os.path because it allows me to think less about finicky details, such as whether a path I receive contains a trailing slash. I don’t want to be testing each function both with and without trailing slashes.

For example, I might write code like this:

import argparse
from pathlib import Path

p = argparse.ArgumentParser()
p.add_argument("path", type=Path)
args = p.parse_args()

print(f"Processing {args.path}/ ...")

If pathlib started to keep trailing slashes, python myscript.py foo/ would start printing Processing foo// .... I would have to check for trailing slashes in the input.

Another example: suppose I cache information on certain directories in a database, and I write a utility update-info that updates the cache. If I use raw paths, update-info foo-directory/ could store info about foo-directory/ but leave stale cache info for foo-directory. pathlib allows me to solve this problem very conveniently.
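
Both points can be checked against current pathlib in a few lines. This is a sketch with a plain dict standing in for the database; the directory name is the one from the example above.

```python
from pathlib import PurePosixPath

with_slash = PurePosixPath("foo-directory/")
without_slash = PurePosixPath("foo-directory")

# The formatted message never ends up with a doubled slash.
message = f"Processing {with_slash}/ ..."

# Both spellings compare and hash equal, so they share one cache entry.
cache = {without_slash: "stale info"}
cache[with_slash] = "fresh info"

print(message)     # Processing foo-directory/ ...
print(len(cache))  # 1
```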

Thanks for the link. Having read that section in full I am still none the wiser about where that rule for trailing / applies.
Operations like open() or mkdir() do not need this rule for example.
I wonder what specific API function implements the rule?

Without knowing which API is affected it will not be possible to test.
I would like to write a test program that changes behaviour on the trailing / being present or absent on each OS to see it in action.

If this only changes command-line utility behaviour, then it's not pathlib's problem, I would argue.
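
One possible shape for such a test program, based on the `stat()` example earlier in the thread. It is only a sketch: `stat_ok` and the file name `hosts` are invented here, and it probes `os.stat()` alone, so it says nothing about other calls.

```python
import os
import tempfile
from pathlib import Path


def stat_ok(path):
    """Return True if os.stat() succeeds for the given path."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "hosts")
    Path(file_path).write_text("127.0.0.1 localhost\n")

    results = {
        "dir": stat_ok(tmp),
        "dir/": stat_ok(tmp + "/"),
        "file": stat_ok(file_path),
        "file/": stat_ok(file_path + "/"),
        # pathlib strips the slash before the OS ever sees it.
        "Path(file/)": stat_ok(Path(file_path + "/")),
    }

for label, ok in results.items():
    print(f"{label:12} {'ok' if ok else 'error'}")
```

On Linux I would expect only the `file/` row to report an error; running it on other systems would show whether they agree.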

I see a lot of talk about POSIX compatibility, which is nice I guess, but I wonder if someone could shed light on what Windows has to say about this topic?

Experimentally, the trailing slash on Windows seems to have the same effect (i.e. forcing resolution as a directory):

>>> os.stat('c:/windows')
os.stat_result(st_mode=16895, st_ino=1407374883832841, st_dev=10709685299559133658, st_nlink=1, st_uid=0, st_gid=0, st_size=16384, st_atime=1701822404, st_mtime=1700700385, st_ctime=1575709424)
>>> os.stat('c:/windows/')
os.stat_result(st_mode=16895, st_ino=1407374883832841, st_dev=10709685299559133658, st_nlink=1, st_uid=0, st_gid=0, st_size=16384, st_atime=1701822404, st_mtime=1700700385, st_ctime=1575709424)
>>> os.stat('c:/windows/notepad.exe')
os.stat_result(st_mode=33279, st_ino=562949954686276, st_dev=10709685299559133658, st_nlink=3, st_uid=0, st_gid=0, st_size=201216, st_atime=1701822285, st_mtime=1700340994, st_ctime=1700340994)
>>> os.stat('c:/windows/notepad.exe/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'c:/windows/notepad.exe/'

I’ve been making a bit of progress towards addressing this problem. The key realisations:

  1. (Per Guido’s earlier comment) everything related to splitting dirnames and basenames needs to work as it did before, e.g.
    • Path('/foo/bar/').parent == Path('/foo')
    • Path('/foo/bar/').name == 'bar'
  2. Pathlib’s joining should deviate slightly from os.path.join() by ignoring empty segments, e.g. Path('foo', '') == Path('foo'), and not Path('foo/')
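
Written as checks, the two constraints look roughly like this. All of them pass on current pathlib; the assumption is that any patch preserving trailing slashes would need to keep them passing. `posixpath.join()` is included only to show the behaviour pathlib would deviate from.

```python
import posixpath
from pathlib import PurePosixPath

p = PurePosixPath("/foo/bar/")

# 1. Splitting into dirname and basename ignores the trailing slash.
parent_ok = p.parent == PurePosixPath("/foo")
name_ok = p.name == "bar"

# 2. An empty segment does not introduce a trailing slash when joining...
join_ok = PurePosixPath("foo", "") == PurePosixPath("foo")

# ...unlike os.path.join(), which appends a separator.
os_join = posixpath.join("foo", "")

print(parent_ok, name_ok, join_ok, repr(os_join))
```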

I believe these constraints ensure good backwards compatibility. I have a WIP patch here, if anyone’s interested:

I’m unsure how trailing slashes should affect the PurePath.parts property. Any thoughts? Present behaviour:

>>> PurePath('foo/bar/').parts
('foo', 'bar')
>>> PurePosixPath('/foo/bar/').parts
('/', 'foo', 'bar')
>>> PureWindowsPath('c:/foo/bar/').parts
('c:\\', 'foo', 'bar')

We could ignore any trailing slash (like now), but it would make parts lossy. As __reduce__() is implemented using parts, we’d need to change its implementation so that trailing slashes survive pickling.

Or we could add an extra dot part on the end, like ('foo', 'bar', '.'), or a separator to the final part, like ('foo', 'bar/'). But that might break user code. Perhaps we do this in a new full_parts attribute?
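
To illustrate the trade-off, the sketch below first shows that `parts` and pickling are lossy today, then models the "extra dot part" option with a standalone helper. `parts_with_trailing_dot` is purely hypothetical and works on the raw string, since a path object no longer knows whether it had a trailing slash.

```python
import pickle
from pathlib import PurePosixPath


def parts_with_trailing_dot(raw):
    """Hypothetical: like .parts, but append '.' for a trailing slash."""
    path = PurePosixPath(raw)
    # path.name is '' for a bare root such as '/', which needs no marker.
    if raw.endswith("/") and path.name:
        return path.parts + (".",)
    return path.parts


# Today both spellings yield the same parts, and pickling loses the slash.
lossy = PurePosixPath("foo/bar/").parts == PurePosixPath("foo/bar").parts
round_tripped = pickle.loads(pickle.dumps(PurePosixPath("foo/bar/")))

print(lossy)                                # True
print(round_tripped)                        # foo/bar
print(parts_with_trailing_dot("foo/bar/"))  # ('foo', 'bar', '.')
```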

Grateful for any guidance, cheers.

@pitrou would love to hear your view on the above ^

To me, the one question is: which use case needs trailing slashes to be significant in pathlib?

FTR, I don’t count “rigorous compliance with the POSIX specification” as a use case :) The fact that os.stat('/etc/hosts/') raises an error doesn’t seem like a useful feature for day-to-day programming.
