Pathlib: preserve trailing slash

Maybe not, but I have frequently used the fact that mv somefile dir/ will error out if dir does not exist or is not a directory, rather than renaming the file to dir, overwriting any preexisting file of that name. So in that particular context, it’s of value to be able to say “this has to be a directory”. Does that translate into a need for pathlib to retain the trailing slash? Not sure - it could be treated as a feature of the command line parsing instead - but it’s definitely something that is of significant safety value.

2 Likes

Then I would suggest something more explicit than a mere trailing slash, which is easy to omit or overlook. For example we could add a method Path.move_into(target_dir) or a function shutil.move_into(src, target_dir) that would error out if target_dir is not a directory.
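A minimal sketch of what such a helper might look like. Note that `move_into` is hypothetical: it exists in neither pathlib nor shutil today. Only `shutil.move()` and `Path.is_dir()` are real APIs here.

```python
import shutil
from pathlib import Path


def move_into(src, target_dir):
    """Move src into target_dir, which must already be a directory.

    Hypothetical helper for illustration; not part of pathlib or shutil.
    """
    src = Path(src)
    target_dir = Path(target_dir)
    # The explicit check replaces the "trailing slash" convention.
    if not target_dir.is_dir():
        raise NotADirectoryError(f"{target_dir} is not an existing directory")
    destination = target_dir / src.name
    shutil.move(str(src), str(destination))
    return destination
```

With this shape, the intent ("the target has to be a directory") lives in the function name rather than in one easily missed character.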

4 Likes

Yeah, which is why I said it could be a feature of command parsing (“if the last character of the destination is a slash, record that we want a target directory”). But that does mean the path would have to be preprocessed as a string before converting into a Path.

I’m sorry, but I don’t understand what “command parsing” means here.

The mv command receives arguments: ["somefile", "dir/"]. It needs to figure out what to do with them. At the moment when I typed that on the command line, I was expressing my wish for it to move that file into that directory - not to rename somefile to dir, and definitely not to replace a file named dir - which it will respect (if dir doesn’t exist, or exists and isn’t a directory, mv will error out). However, if the incoming arguments had been ["somefile", "dir"], the behaviour would have been identical in the case where the target is an existing directory, but would have been different if the target didn’t exist or was a file.

A naive way to implement something like this would be to immediately convert all arguments into Paths, and then proceed to move files. But that loses the information about the trailing slash. Thus, one of two things must happen: Either the Path needs to retain the trailing slash, or the mv command has to first check its arguments to see if there was a trailing slash, and only THEN convert everything into Paths.

The “command parsing” part is the bit before it starts working in Path objects, where it would have to explicitly check for the presence of a trailing slash, even though the argument is a perfectly valid Path either way.
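Roughly, the second option could be sketched like this. The helper names `parse_destination` and `check_destination` are made up for illustration; the only assumption is that the raw argument is inspected as a string before it becomes a Path.

```python
import os
from pathlib import Path


def parse_destination(arg):
    """Return (path, must_be_dir) for a raw command-line argument.

    Hypothetical helper: the check has to happen on the string, because
    Path() drops the trailing separator.
    """
    # os.altsep is "/" on Windows and None on POSIX.
    separators = (os.sep, os.altsep) if os.altsep else (os.sep,)
    must_be_dir = arg.endswith(separators)
    return Path(arg), must_be_dir


def check_destination(arg):
    """Raise if the argument ended in a separator but is not a directory."""
    path, must_be_dir = parse_destination(arg)
    if must_be_dir and not path.is_dir():
        raise NotADirectoryError(f"{arg!r} must be an existing directory")
    return path
```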

I’m not interested in the mv command. pathlib is deliberately not modeled on the Unix shell and does not aim to expose a shell-like experience.

What I am interested in discussing is how to make pathlib better for use as a Python library for manipulating, and acting on, filesystem paths.

2 Likes

If you were writing a Python script that manipulates files, would you use Pathlib? Would you accept command line arguments? If so, this situation does come up, just the same. I used mv as an example since everyone who’s used any Unix shell will be familiar with it, but the same behaviour is followed in a lot of other contexts too.

1 Like

If you want to have a convention that a trailing slash means your code will check that the path is a directory, that is fine. But it is not pathlib’s job to do that for your code.

5 Likes

That’s the exact part I was uncertain of, since this IS a very common convention (trailing slash asserts that it’s a directory). But when I originally posted it, I did say that this could equally well be a matter for command parsing, not Pathlib.

I’m regretting saying anything now. Just caused nothing but confusion. Sigh.

1 Like

Trailing slashes are significant to os.path.join(). I don’t know what the pathlib equivalent would be.

Trailing slashes are significant to os.path.join().

Whoops, they aren’t. I was remembering relative path resolution in HTML (or maybe it’s technically a Web browser feature).
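A quick way to see the difference, using `posixpath` so the output is the same on every platform, and `urllib.parse.urljoin` as a stand-in for the URL resolution a browser performs (that pairing is my own choice of illustration):

```python
import posixpath
from urllib.parse import urljoin

# A trailing slash on an earlier component does not change the joined path.
with_slash = posixpath.join("a/b/", "c")
without_slash = posixpath.join("a/b", "c")
print(with_slash, without_slash)  # a/b/c a/b/c

# A trailing slash on the last component is simply kept in the output.
kept = posixpath.join("a", "b/")
print(kept)  # a/b/

# URL reference resolution, by contrast, does depend on the trailing slash.
url_without = urljoin("http://example.com/a/b", "c")
url_with = urljoin("http://example.com/a/b/", "c")
print(url_without)  # http://example.com/a/c
print(url_with)     # http://example.com/a/b/c
```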

1 Like

These are the use cases I have to hand:

  1. Path.glob() return values. Trailing slashes are already meaningful in the pattern argument of glob(), forcing only directories to be returned. For compatibility with glob.glob(), the returned paths should also have trailing slashes. See also GH-106747.
  2. PurePath.match() pattern language. Unlike in glob(), trailing slashes are not meaningful to match(), making its pattern-matching language subtly different to glob().
  3. Add Path.chown(). I’d like to add a Path.chown() method, moving the implementation from shutil into pathlib. The old shutil.chown() function would call through to pathlib. However, pathlib stripping the trailing slash would change the behaviour of shutil.chown('foo/', ...). See GH-64978.
  4. Add Path.move(). Just like the above. We can’t move any implementations from shutil to pathlib because we’d be introducing a bug to the longstanding shutil functions. See GH-73991.
  5. Optimize PurePath.__fspath__(). If pathlib’s normalization routine retains trailing slashes, then returning an unnormalized path from __fspath__() is equivalent to returning a normalized path. This makes most Path methods much cheaper. See GH-102783.
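To make the first two items concrete, here is a small sketch comparing `PurePath.match()`, `glob.glob()` and `Path.glob()`. The directory layout (`d`, `f`) is invented for the example, and the "directories only" behaviour of `Path.glob()` applies to Python 3.11 and later.

```python
import glob
import os
import tempfile
from pathlib import Path, PurePosixPath

# match() parses the pattern as a path, so its trailing slash is dropped
# and "b/" behaves the same as "b".
match_result = PurePosixPath("a/b").match("b/")
print(match_result)  # True

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "d").mkdir()
    Path(tmp, "f").write_text("")

    # glob.glob(): a trailing slash selects directories only, and the
    # separator is kept on each returned string.
    string_results = glob.glob(os.path.join(tmp, "*/"))

    # Path.glob(): results are Path objects, so no trailing slash survives.
    # On Python 3.11+ only the directory "d" is returned.
    path_results = sorted(p.name for p in Path(tmp).glob("*/"))

print(string_results)
print(path_results)
```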

The more general problem is that existing users of string paths can’t adopt pathlib without introducing this bug. Most of the time that doesn’t really matter, but in some cases it does, e.g. we can’t use pathlib from shutil. I’d like folks using string paths to be able to adopt pathlib without having to worry about their paths being mangled!

3 Likes

Oh, I see, that is a new thing that was introduced in version 3.11. That strikes me as unexpected, and it violates pathlib’s design as I intended it.

I also see little actual discussion in that PR, which is unsettling (even though I did express doubts back in 2014 in the issue discussion). Changing the behavior of an established API should IMHO be done with a bit more care and understanding of why the API had the original behavior in the first place.

Let me correct this for you: Path.glob and Path.match used to be consistent with each other. Then someone decided in version 3.11 that Path.glob would be changed to give a special meaning to a trailing slash in the pattern argument, which - in addition to being a questionable decision - had the additional side effect of making Path.glob inconsistent with Path.match. :wink:

The behavior you’re talking about is not even documented in the shutil or os module documentation (Python 3.12.1). How is the user supposed to know about it?

Understanding of pathlib APIs should preferably not require arcane knowledge about system APIs. Yes, this can deviate from shutil and os module semantics, which I consider to be a good thing in this instance, because good APIs should not be arcane.

See what I suggested in Pathlib: preserve trailing slash - #22 by pitrou

(a similar suggestion holds for chown if that is ever considered for inclusion in pathlib, though I don’t think chown is common enough to warrant that)

I don’t understand what this means. Which “bug” is being introduced exactly? If someone is expecting undocumented behavior from non-pathlib libraries to be mirrored by pathlib, then it does not sound like a bug to me.

Pathlib ostensibly does not emulate what the shutil and os modules do (and, of course, especially not what they do without even saying). I think you know it full well, given that you’ve worked on pathlib for years and understand that I deliberately made the semantics different in some places.

Bottom line: I understand that Unix shell and POSIX aficionados might be disappointed by the fact that a trailing slash doesn’t have the behavior that they are used to. I am one of the people who don’t have much appreciation for the beauty of POSIX C and shell APIs, and who think the arcane knowledge of old-timers is not a good point of reference when designing new APIs. That is reflected in the way I designed pathlib.

7 Likes

Thanks very much, Antoine.

IMHO shutil is so widely-used and longstanding that Hyrum’s Law overrules other considerations. Even though it’s undocumented, we can’t change the behaviour of shutil.chown('foo/', ...) unless we’re fixing a clear bug. And this precludes us from using pathlib.

I’ve been trying to (carefully, gradually) bring pathlib and os.path behaviour together and de-duplicate the implementations wherever possible.

In Python 3.10 we added a strict parameter to os.path.realpath() and removed pathlib’s implementation. Path.resolve() now calls os.path.realpath(). We also made ntpath.expanduser() slightly stricter when guessing Windows home directories, and again removed pathlib’s near-duplicate implementation. In Python 3.12 we added os.path.splitroot() for parsing paths into (drive, root, tail) - the same division underlying pathlib’s path model.

I think this work has improved both pathlib and os.path, as we’ve been able to take the best of both worlds.

7 Likes

To summarise my view with an example: users should be able to safely refactor os.stat(some_string) to pathlib.Path(some_string).stat() without changing the result of the stat(). Only trailing slashes (and arguably empty strings) make that refactor strictly unsafe; in all other cases pathlib’s normalization is sufficiently conservative.
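A sketch of that refactor going wrong, using a temporary regular file named `foo` (the name and contents are arbitrary). The exact exception type from `os.stat()` is platform-dependent, so the code only catches `OSError`; on POSIX it is `NotADirectoryError`.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "foo")
    Path(file_path).write_text("not a directory")
    with_slash = file_path + os.sep

    # The string reaches the OS unchanged, so the trailing separator makes
    # the call fail for a regular file.
    try:
        os.stat(with_slash)
        string_stat_failed = False
    except OSError as exc:
        string_stat_failed = True
        print("os.stat failed:", type(exc).__name__)

    # pathlib drops the separator during normalization, so the same input
    # stats the regular file successfully.
    path_stat = Path(with_slash).stat()
    print("Path.stat succeeded, size:", path_stat.st_size)
```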

1 Like

I don’t have time for a longer response but please don’t do this. I share your point of view that having the underlying complex logic be used by all consumers is a great thing and also I think your work recently to improve pathlib has truly been awesome! At the same time I agree with @pitrou that pathlib should expose an easy API for users as that is its entire purpose, not compatibility.

The short example below shows how I and most others manipulate paths: join is almost always the only os.path function that is used. As the final join operation shows, os.path’s preservation of the trailing slash in the preceding call is quite unexpected: based on the earlier operations, one might assume that the result would be a mangled path, but instead there is some normalization going on underneath.

On the other hand, pathlib makes this explicit at the object level which is much more user-friendly. By user-friendly I mean that they should be able to reason about what the API does rather than concerning themselves with interoperability (especially important for junior developers).

>>> import os
>>>
>>> p = os.getcwd()
>>> os.path.join(p, 'foo')
'C:\\Users\\ofek\\Desktop\\foo'
>>> os.path.join(p, 'foo\\')
'C:\\Users\\ofek\\Desktop\\foo\\'
>>> os.path.join(p, 'foo\\', 'bar')
'C:\\Users\\ofek\\Desktop\\foo\\bar'
>>>
>>> from pathlib import Path
>>>
>>> p = Path.cwd()
>>> p / 'foo'
WindowsPath('C:/Users/ofek/Desktop/foo')
>>> p / 'foo\\'
WindowsPath('C:/Users/ofek/Desktop/foo')
>>> p / 'foo\\' / 'bar'
WindowsPath('C:/Users/ofek/Desktop/foo/bar')

2 Likes

C:\Users\ofek\Desktop\foo is not equivalent to C:\Users\ofek\Desktop\foo\. You can check this by creating a text file called foo on your desktop and attempting to os.stat() both paths. This is the reality of the situation and I think pathlib should reflect reality, not a slightly-sanitised version.

1 Like

Yes, I’m aware of that. I am talking specifically about the API, and my point is that a trailing slash should not be used as a proxy for file system metadata. The paths in my example in fact do not exist.

You’re of course free to argue that trailing slashes shouldn’t be used this way, but they are used this way by all the OS APIs that pathlib wraps, and there’s nothing any of us can do about it!

2 Likes

When trailing slashes are preserved, how would a user normalize a path to remove them? I don’t see a method for that, because right now all paths are normalized on construction.
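For reference, this is roughly how a trailing slash gets removed today. `strip_trailing_sep` is a hypothetical helper for string paths, not an existing pathlib or os.path API; the pathlib and `normpath()` lines show current behaviour.

```python
import posixpath
from pathlib import PurePosixPath


def strip_trailing_sep(path):
    """Drop trailing slashes from a POSIX path string.

    Hypothetical helper for illustration only.
    """
    stripped = path.rstrip("/")
    # A path made only of slashes is a root: leave it untouched.
    return stripped if stripped else path


print(strip_trailing_sep("foo/bar/"))     # foo/bar
print(strip_trailing_sep("/"))            # /

# pathlib currently strips the slash as part of its normalization.
print(str(PurePosixPath("foo/bar/")))     # foo/bar

# normpath() also strips it, but additionally collapses ".." segments,
# which is a larger change than just removing the slash.
print(posixpath.normpath("foo/../bar/"))  # bar
```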