Pathlib: preserve trailing slash

The work-in-progress PR adds with_trailing_sep() and without_trailing_sep() methods. I wouldn’t characterise this as “normalizing” though, as it can change the meaning of the path.

I guess if we have to… we have to.

Please consider, in an effort to retain some semblance of discoverability, adding the without_trailing_sep method to the table at the bottom as the analog to os.path.normpath.

We don’t have to. I’m hoping to persuade folks it’s the right thing to do! I don’t think I’m doing very well so far :slight_smile:

Are you anticipating this would break your code? I’m aiming to keep stuff like path.name and path.parent working as before. In the PR I haven’t needed to adjust many test expectations, which made me a little hopeful that this would be sufficiently backwards-compatible!

Yes, specifically I worry about joining paths based on user input. Right now I can act on whatever the user enters without calling any other methods. With this change I would have to normalize with your proposed method, and only on versions of Python that have it.
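
To make the concern concrete, here is a minimal sketch of the kind of version guard this would require. It assumes the method name without_trailing_sep() from the work-in-progress PR, which is not in any released Python; drop_trailing_sep is a hypothetical helper, not an existing API.

```python
from pathlib import PurePosixPath


def drop_trailing_sep(path):
    """Return path without a trailing separator, on any Python version.

    without_trailing_sep() is only the name proposed in the PR (an
    assumption here). Where it is missing, pathlib has already dropped
    the trailing separator, so the path is returned unchanged.
    """
    method = getattr(path, "without_trailing_sep", None)
    return method() if method is not None else path


# User input that happens to end with a slash.
user_input = "build/output/"
base = drop_trailing_sep(PurePosixPath(user_input))

# A string-based tool that "just adds stuff at the end".
print(str(base) + ".tar")
```

Every call site that hands a string to such a tool would need a guard like this, which is the adoption cost being described.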

Why would you need to do this - what would break if you didn’t?

Notably, converting to strings for use by other tools that don’t have robust path manipulation abilities and rather just add stuff at the end. By other tools I also mean potentially internal utilities of large code bases that just act upon strings.

I don’t have a particular example at the ready.

Well, if that becomes a design goal for pathlib, then it means you’re turning pathlib into something else. From “a library to handle filesystem paths” to “an object-oriented replacement for os and shutil”.

This complicates the conceptual model quite a bit for very little added value.

I’ll stress that the trailing slash thing is only a (mis)feature of POSIX APIs. It doesn’t have anything to do with filesystem paths per se. /etc and /etc/ are not different objects.
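
A minimal sketch of where the difference lives, assuming a POSIX system: the trailing slash only matters once a raw string reaches the OS call, while pathlib has already dropped it. The temporary file name is made up for the example.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "hosts")
    Path(file_path).touch()

    # The raw string keeps the trailing slash, so the OS sees it.
    try:
        os.stat(file_path + "/")
        raw_result = "ok"
    except OSError as exc:
        raw_result = type(exc).__name__

    # pathlib drops the trailing slash before calling the OS.
    Path(file_path + "/").stat()
    pathlib_result = "ok"

print(raw_result, pathlib_result)
```

On Linux the raw call fails with NotADirectoryError, while the pathlib call succeeds because both spellings name the same path object.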

Agreed. As a Windows user, I find the “add a slash to change the behaviour” behaviour unfamiliar and confusing. I can see that it’s convenient if you know it, but I’d rather something more explicit (to be in the spirit of Python). One of the major advantages of pathlib for me has always been that it’s careful to be cross-platform. Adding Unix-specific “conveniences” like this goes against that benefit (as well as not actually being convenient for me!)

To be fair, I don’t think I’ve ever written code where this would make an actual difference, but having to keep this edge case in mind would be an inconvenience when reasoning about code (particularly code that others wrote, when reviewing PRs, for example).

It also applies on Windows; see “Pathlib: preserve trailing slash - #16 by barneygale” :slight_smile:

I agree though that practicality should beat purity here.

Understood. And I’ll be honest, I would find it odd if doing a stat on the string 'c:/windows/notepad.exe/' were to return the stat data for the file notepad.exe. But that’s at least in part because I already know that notepad.exe is a file.

Pathlib is a higher level interface, and it abstracts paths. I don’t think pathlib should consider the following to be different paths:

C:/Windows
C:/Windows/
C:\Windows
C:\Windows\

Fundamentally, that’s the most important aspect for me.

It’s worth noting that the stat example is about concrete paths, not pure paths. I have a little more sympathy for the arguments when they are framed in terms of concrete paths, but I don’t think behaviour should differ depending on whether a path is concrete or pure.

And it’s worth remembering that converting a Path to a string is already lossy:

>>> from pathlib import Path  # run on Windows, where Path is a WindowsPath
>>> Path("C:/Windows") == Path("C:\\Windows")
True
>>> str(Path("C:/Windows")), str(Path("C:\\Windows"))
('C:\\Windows', 'C:\\Windows')

So preserving a trailing slash can’t really be justified in terms of round tripping (which I think was mentioned somewhere in this thread).
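
The same point can be checked on any platform with the pure classes. This is just a sketch of current behaviour: all four spellings listed earlier collapse to one string, and a trailing slash is dropped on POSIX-flavoured paths too.

```python
from pathlib import PurePosixPath, PureWindowsPath

spellings = ("C:/Windows", "C:/Windows/", "C:\\Windows", "C:\\Windows\\")

# Every spelling is rendered with backslashes and no trailing separator.
rendered = [str(PureWindowsPath(raw)) for raw in spellings]
for raw, text in zip(spellings, rendered):
    print(repr(raw), "->", repr(text))

# The trailing slash is not part of the parsed path on POSIX flavours either.
posix_equal = PurePosixPath("/etc/") == PurePosixPath("/etc")
print(posix_equal)
```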

Of course they can. They check for a trailing slash, and handle it before invoking the pathlib method. What they can’t do is simply drop in pathlib and expect it to work. So yes, it increases the cost of adopting pathlib, but it certainly doesn’t make it impossible.
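
Something like the following is what I have in mind: a minimal sketch in which the caller records the trailing slash before pathlib normalizes it away. parse_with_dir_hint is a hypothetical helper name, and POSIX-style separators are assumed.

```python
from pathlib import PurePosixPath


def parse_with_dir_hint(raw):
    """Split a raw string into (path, had_trailing_slash).

    The flag is captured from the string itself, because the parsed
    path no longer carries the trailing slash.
    """
    had_trailing_slash = raw.endswith("/")
    return PurePosixPath(raw), had_trailing_slash


path, wants_dir = parse_with_dir_hint("/tmp/w/")
print(path, wants_dir)
```

The caller then decides what the flag means, for example by refusing to proceed unless the target is a directory.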

And in terms of pathlib subclasses, does it even make sense to look at trailing slashes? If I recall, VMS paths used [foo] for a directory named foo, but a plain foo for a file named foo. I’m not suggesting anyone wants to support VMS paths, just that it’s not self-evident that “trailing slash equals directory” is universal. What if someone wants to support a filesystem where a filename of an empty string is valid? Would foo/ be the directory foo, or the file “” in that directory?

Maybe, if we want a pathlib abstraction for “pure path that’s a directory” then we could add that. But trying to signal it via an empty final path component seems like it’s a bad workaround for something that should be addressed at the design level.

Just on this point: normalizing path separators doesn’t change the meaning of the path as far as I know. Both pathlib and ntpath treat back- and forward-slashes as equivalent. Pathlib also removes extraneous slashes and '.' components as they’re not meaningful to path resolution. By contrast, pathlib retains '..' components and distinguishes POSIX paths starting with two slashes (like '//etc'), because these features are meaningful to path resolution. Two-slash POSIX roots are much more obscure than trailing slashes IMO.
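
For reference, a short sketch of those normalization rules using the pure classes, so it behaves the same on any platform:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Extraneous slashes and '.' components are removed.
collapsed = str(PurePosixPath("a//b/./c"))

# '..' components are retained, since they affect resolution.
dotdot = str(PurePosixPath("a/../b"))

# Exactly two leading slashes are preserved; three collapse to one.
two_slash = str(PurePosixPath("//etc"))
three_slash = str(PurePosixPath("///etc"))

# Back- and forward-slashes are equivalent on Windows-flavoured paths.
same_sep = PureWindowsPath("C:/Windows") == PureWindowsPath("C:\\Windows")

print(collapsed, dotdot, two_slash, three_slash, same_sep)
```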

It makes them visibly different, which matters for UI. Older versions of cmd.exe had problems with paths containing slashes, users unfamiliar with POSIX filename conventions can find slashes confusing, etc. I’m not trying to say that pathlib should preserve separators, simply that round-tripping a string to a Path and back shouldn’t be a design goal (maybe no one ever suggested that it was - I thought I’d seen that argument made, but I can’t find it now).

Going back to the original post:

My view is that “this path is required to be a directory” should be an attribute of a Path. This constraint is orthogonal to the list of path components, and therefore should be kept separate in the implementation.

Whether a trailing slash means to set this attribute is then a choice that the implementation of the “construct a path from a string” method makes. We can debate whether it should be or not, but at least then, we are just debating a question of parsing, not mixing semantics into the equation.

There’s a separate question, which is what methods should take account of the “required to be a directory” attribute. That may or may not be controversial - I’m honestly not sure. But I think the main controversy is over the parsing question, not the semantics (UI is always the hard bit!)

Whether you then have a constructor argument trailing_slash_is_directory, what the default should be, whether we change the default after a transition period, or whether the behaviour depends on the OS-specific path implementation, is all debatable. And it’s possible the true answer is “it depends”, so we would need to apply the “in case of doubt, refuse the temptation to guess” principle to the constructor. I honestly don’t know. UX is hard.

And whether this is all too much work to be justified, I can’t say. But implementing a suboptimal solution just because it’s easier, definitely isn’t the right thing to do (IMO).
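
As a rough illustration of keeping the constraint separate from the components, here is a minimal sketch. DirHintedPath and must_be_dir are hypothetical names, and trailing_slash_is_directory is only the constructor argument floated above, not an existing pathlib parameter.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DirHintedPath:
    """A path plus a separate 'required to be a directory' flag."""

    path: PurePosixPath
    must_be_dir: bool = False

    @classmethod
    def parse(cls, raw, trailing_slash_is_directory=False):
        # The parsing choice is isolated here: the flag is set from the
        # trailing slash only when the caller opts in.
        must_be_dir = trailing_slash_is_directory and raw.endswith("/")
        return cls(PurePosixPath(raw), must_be_dir)


plain = DirHintedPath.parse("/tmp/")
hinted = DirHintedPath.parse("/tmp/", trailing_slash_is_directory=True)
print(plain.path == hinted.path, plain.must_be_dir, hinted.must_be_dir)
```

In this sketch the components compare equal either way; whether equality of the wrapper should also look at the flag (as the generated dataclass comparison does here) is one of the debatable questions.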

I don’t think this is a good idea per PEP 20 (The Zen of Python).

pathlib already has a method, is_dir(), to determine if something is a directory. If I’m reading code, it’s really clear a directory is expected. A slash at the end would be easy to overlook and, per the current pathlib documentation, would be expected to be stripped off.

What would the following return?

Path('/tmp', keep_trailing_sep=True) == Path('/tmp', keep_trailing_sep=False)
Path('/tmp', keep_trailing_sep=True) == Path('/tmp/', keep_trailing_sep=False)
Path('/tmp', keep_trailing_sep=True) == Path('/tmp/', keep_trailing_sep=True)

Given that /etc/hosts is a file, does

Path('/etc/hosts/', keep_trailing_sep=True)

raise an exception? If /etc/hosts doesn’t exist as a file, but is created outside of Python while a script is running, what would the behavior be?

Incidentally, on my system (Linux, Python 3.8.10), the current implementation of replace recognizes the trailing slash in a string:

    from pathlib import Path

    tx = Path('/tmp/x')
    w = Path('/tmp/w')
    w.unlink(missing_ok=True)
    tx.unlink(missing_ok=True)
    tx.touch()
    tx.replace(Path('/tmp/w/'))  # Path() drops the slash: makes file /tmp/w
    w.unlink(missing_ok=True)
    tx.touch()
    tx.replace('/tmp/w/')  # raw string keeps the slash: raises NotADirectoryError

specifically

  File ".../ptest.py", line 27, in <module>
    tx.replace('/tmp/w/')
  File "/usr/lib/python3.8/pathlib.py", line 1374, in replace
    self._accessor.replace(self, target)
NotADirectoryError: [Errno 20] Not a directory: '/tmp/x' -> '/tmp/w/'

I’m not sure whether this is as designed.

Also, what’s the behaviour for a PurePath, which has no link to an actual filesystem…?

On 19/12/23 3:35 am, Paul Moore via Discussions on Python.org wrote:
> My view is that “this path is required to be a directory” should be an attribute of a Path.

The meaning of a trailing slash is not always “required to be a
directory”. For example, in ls -l it determines whether to follow
symbolic links, regardless of whether it’s a directory.
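
The same effect is visible at the system-call level. This is a minimal sketch, assuming a POSIX system (Linux in particular) where creating symlinks is permitted; the directory and link names are made up.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "target")
    link = os.path.join(tmp, "link")
    os.mkdir(target)
    os.symlink(target, link)

    # Without the slash, lstat() describes the symlink itself.
    bare = os.lstat(link)
    # With the slash, the final symlink is resolved first.
    slashed = os.lstat(link + "/")

    bare_is_link = stat.S_ISLNK(bare.st_mode)
    slashed_is_dir = stat.S_ISDIR(slashed.st_mode)

print(bare_is_link, slashed_is_dir)
```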

There are also options to ls that say how symlinks are handled, -H and -L for example, which are less obscure.

If Path("asdf") == Path("asdf/") no longer holds true, it will break code I’ve written (without an obvious cross-version fix).

Could you expand a bit on how your code would break please? I’m sorry for being so slow on the uptake here!

Thanks for your responses everyone, by the way; I’ve found them quite persuasive. I won’t push this any further, and I’ll re-resolve GH-65238 shortly I think.

I’m separately working on adding some ABCs for rich path objects (GH-110109), and so I could attempt to implement some sort of PedanticPurePath atop PurePathBase - it would be a good test that pure functionality is readily customizable. This would be an external package - not something for Python itself.

I should also add more test cases for trailing slash handling in pathlib. My PR (soon to be discarded) causes surprisingly few tests to fail - it’s only really equivalence tests and a single match() test case.

The code I could remember takes some input (that may or may not have trailing slash) and checks whether it’s in a known set of Paths.
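
Roughly this shape, as a minimal sketch with made-up path names. It relies on the current behaviour where a trailing slash does not affect equality or hashing:

```python
from pathlib import PurePosixPath

# A known set of paths, built without trailing slashes.
known = {PurePosixPath("asdf"), PurePosixPath("data/cache")}


def is_known(raw):
    """Check user input against the known set, slash or no slash."""
    return PurePosixPath(raw) in known


print(is_known("asdf"), is_known("asdf/"), is_known("other/"))
```

If Path("asdf") and Path("asdf/") stopped comparing equal, the second lookup would silently start returning False.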

Thank you for exploring this, though! I can also remember seeing a bug resulting from not preserving trailing slashes. I don’t know that I’d have a strong opinion if this were day 1 of pathlib, but I do think that the compatibility story here is a little tricky / could cause subtle issues / makes me nervous.
