Making paths absolute in os.path and pathlib

On a POSIX system with symlinks, if os.getcwd() returns "/home/me", what is the absolute version of the relative path "../you/.."?

If you answered "/home", you’re in agreement with os.path.abspath(). But if "/home/you" is a symlink to somewhere else in the filesystem, then your result is wrong, as the second ".." component is relative to the symlink target. This behaviour is called “plain wrong” in PEP 428[1].

If you answered "/home/me/../you/..", you’re in agreement with pathlib.Path.absolute(). But you could have missed that the current working directory is always a real path, with no symlinks in any path component[2].

What’s the right answer? I think it’s "/home/you/..". Any leading ".." components in the relative path can be safely elided (unlike pathlib), and any other ".." components should be retained (unlike os.path).
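To make the three candidate answers concrete, here is a minimal sketch of the rule described above. It assumes POSIX-style paths and a cwd that contains no symlinks; `hybrid_absolute` is a hypothetical name for illustration, not an existing API.

```python
import posixpath
from pathlib import PurePosixPath


def hybrid_absolute(rel, cwd):
    """Hypothetical helper: collapse only the leading '..' components.

    Assumes `cwd` is an absolute POSIX path with no symlinks in it.
    """
    if posixpath.isabs(rel):
        return rel
    # Drop empty and '.' components; both are always safe to remove.
    parts = [p for p in rel.split("/") if p not in ("", ".")]
    base = cwd
    i = 0
    # Leading '..' components walk up a symlink-free cwd, so they can
    # be resolved lexically.
    while i < len(parts) and parts[i] == "..":
        base = posixpath.dirname(base)
        i += 1
    # Any later '..' may follow a symlink, so it is kept as-is.
    return posixpath.join(base, *parts[i:])


cwd = "/home/me"  # assumed value of os.getcwd()
rel = "../you/.."

print(posixpath.normpath(posixpath.join(cwd, rel)))  # /home (what abspath computes)
print(PurePosixPath(cwd) / rel)                      # /home/me/../you/..
print(hybrid_absolute(rel, cwd))                     # /home/you/..
```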

Should we teach os.path.abspath() and/or pathlib.Path.absolute() to produce this answer? And is the picture any different on Windows? I have no specific proposal, but figured it might be worth discussing. Thanks!


  1. ↩︎

  2. https://pubs.opengroup.org/onlinepubs/9699919799/functions/getcwd.html ↩︎


Whatever you do, don’t add stat() calls where there were none before.


IMO that’d be a good default; though I think it’d be even nicer to add follow_symlinks= (like you did for glob) to opt into resolving any eventual symlinks – which would then also enable resolving all trailing .. instances.

That way the default is reasonable (and stays performant), but if a user is willing to pay the price of a couple of “is this component of the path a symlink?” stat calls, why not give them a canonical way to do that?

Wouldn’t that be the same as os.path.realpath() / pathlib.Path.resolve()?

Sure. Though I think abspath() and absolute() have the much better (self-explanatory) name. Purely for abspath itself, IMO the least surprising result would be a path without symlinks (kinda what’s happening now with your example yielding /home, even though potential symlinks are ignored). The only real argument I see against that is performance, i.e. the cost of the OS stat calls.

If we prioritize performance by default, then abspath(..., follow_symlinks=False) would make sense. If we were on a green field, I’d say “least surprising result by default, performance optimization as an opt-in” (i.e. default to follow_symlinks=True).

Back to your question, the proximity to realpath is a given (e.g. if the response to your OP had been “follow the symlink”, abspath would effectively become realpath) – so if realpath turns out to be a special case of abspath, we could make it a wrapper around that?


What is the practical advantage of doing this change?

I’m afraid it could lead developers to make some wrong assumptions. If I wanted to quickly check whether .absolute() collapses .. components, I’d probably type Path("..").absolute() in a REPL. With this change, it would not give me the full story.


There is a difference in outcomes if you take the symlink into account or not.

$ cd $TMPDIR
$ mkdir -p dir1 dir3/sub1
$ ln -sf dir3/sub1 dir2
$ python3 getcwd.py
os.chdir(tmpdir) - cwd= /home/barry/tmpdir
os.chdir("dir1") - cwd= /home/barry/tmpdir/dir1
os.path.abspath("../dir2/..") - cwd= /home/barry/tmpdir
os.chdir("../dir2/..") - cwd= /home/barry/tmpdir/dir3

Where getcwd.py is

import os

tmpdir = os.environ["TMPDIR"]

os.chdir(tmpdir)
print('os.chdir(tmpdir) - cwd=', os.getcwd())

os.chdir("dir1")
print('os.chdir("dir1") - cwd=', os.getcwd())

# Lexical result: abspath collapses ".." without looking at the filesystem
print('os.path.abspath("../dir2/..") - cwd=', os.path.abspath("../dir2/.."))

# OS result: the kernel follows the dir2 symlink before applying ".."
os.chdir("../dir2/..")
print('os.chdir("../dir2/..") - cwd=', os.getcwd())

I wrote the same chdir/getcwd sequence in C++, and it behaves the same as Python’s os.chdir/os.getcwd.

However this is what happens in bash (and zsh on macOS):

$ cd $TMPDIR
$ cd dir1
$ pwd
/home/barry/tmpdir/dir1
$ cd ../dir2/..
$ pwd
/home/barry/tmpdir

All code was run on Fedora 39 and macOS 14.2.1.

I think the biggest problem here is the presumption that there even is a “right answer”. What should be the correct thing to do is highly context-dependent.

“Wrong” is very absolute. What if your code is being run in a context that doesn’t have access to the filesystem? Just because PEP 428 expresses an opinion, doesn’t make it correct.

Personally, I think the default behaviour should be what works without access to the filesystem. Checking the actual filesystem should be an opt-in choice, especially given that some filesystems (e.g., NFS) can be extremely slow, and may not even support symlinks anyway, so that the check is not only slow but also pointless.

This is a plausible approach, but is it worth the compatibility cost? It will potentially still be the wrong answer for some use cases, so you’d have to find evidence that it improves things more often than it makes things worse. Also, as @jeanas mentions, it makes the answer to the question “does abspath remove .. components?” much more complex and less useful.

I agree with @guido - at an absolute minimum, there’s no justification here for adding stat calls that aren’t already being made.

Not just Windows. With the new extensibility of pathlib, we need to allow for the possibility that an arbitrary pathlib implementation might have different rules. For example, can a provider support unresolved values like /home/me/.. for getcwd(), or a value containing symlinks? If so, your statement “But you could have missed that the current working directory is always a real path, with no symlinks in any path component” is demonstrably false, and so your proposed behaviour is wrong.


I think there are three important functions here that users ought to know how to use together (pathlib might be a special case, will come to that later), and the one that needs the most clarification is normpath.

(For context, I’ve been thinking about this issue a lot recently as we work through a number of related bugs, and I presume Barney is in the same place.)

The only breakdown of responsibilities that seems to make overall sense (disregarding back-compat) is:

  • abspath knows how to retrieve the current working directory
  • realpath knows how to resolve each segment to find the actual final path (with filesystem access)
  • normpath knows how to collapse segments to produce a probable path that is easy to read

Right now, abspath does an implicit normpath, which is where the problem actually arises, since a norm-ed path isn’t necessarily the real path.

If we didn’t have that, abspath("../file") might return C:\Users\me\../file instead of C:\Users\file. Meanwhile, join(os.getcwd(), "../file") returns the former, and normpath(_) returns the latter.

So I think the fundamental question is whether abspath is about path calculation or path display. normpath is clearly about displaying paths, and so implicitly is abspath (right now), but we could change that by simply removing the normalisation.

When we bring in compatibility, however, it is probably less impactful to leave it as it is and clearly document that abspath may produce incorrect results for the sake of readability, and join(os.getcwd(), ...) is recommended for correctness in the presence of symlinks or other name aliasing.
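The join-versus-normpath distinction can be checked on any platform with ntpath, which applies Windows path rules purely on strings. A small sketch, with `C:\Users\me` standing in as an assumed current working directory:

```python
import ntpath

cwd = r"C:\Users\me"  # assumed stand-in for os.getcwd() on Windows

# join() only concatenates; the ".." segment is left for the OS to interpret.
joined = ntpath.join(cwd, "../file")
print(joined)  # C:\Users\me\../file

# normpath() collapses ".." lexically (and converts "/" to "\"), which is
# what abspath() currently does implicitly.
collapsed = ntpath.normpath(joined)
print(collapsed)  # C:\Users\file
```

The two results name the same file only if `me` is not a symlink, junction, or other alias.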


pathlib is a bit more interesting. You could argue that Path.__str__[1] implies “for display” and so it should be normalised, while Path.__fspath__ implies “for use” and so it shouldn’t. Compatibility-wise, I’m pretty sure Path already collapses segments on creation though, and so things could get quite messy if we change that (e.g. iteration over Path.parents will return the same directory multiple times). I’m not sure how best to handle it.
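For reference, a quick check of what pathlib collapses on construction, using PurePosixPath so that it runs without filesystem access. The path used is an arbitrary example:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home//me/./../you")

# Doubled slashes and "." components are dropped; ".." is kept.
print(p)  # /home/me/../you

# parents is purely lexical, so ".." is treated like any other name.
print([str(parent) for parent in p.parents])
# ['/home/me/..', '/home/me', '/home', '/']
```

So the collapsing that happens today is limited to separators and single dots, at least for the pure path classes.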


My final thought is that I’ve never seen anyone do this on purpose, and I suspect that any user asking for join("root/A", "../B") actually wants to remove “A” - the equivalent of join(dirname("root/A"), "B") - rather than to navigate to “A” and then go up one level. I’d love to understand better whether this is true, but path manipulation generally seems to be understood in terms of modifying the path and only once that’s done do we try to find what it refers to.

So are our path functions about manipulating path strings or are they about navigating the filesystem? And more importantly, what do our users currently think they do?


  1. I’ll use Path as my example class, but strictly I think it’s on PurePath, and realistically it just applies to all of them. ↩︎


This is very much my intuition as well. I’m pretty sure I’ve seen (maybe even written) code that uses .. to mean “remove a level”. I did a quick github search and Path(__file__).parent / "../<other stuff>" seems relatively common. While this may be OK because __file__ is fully resolved[1], the intent is clearly that .. removes a level (I read this as someone doing “parent” to get the containing directory for the file, then navigating from there using “..” for “up”).


  1. I’m not 100% sure if it is or not, TBH ↩︎
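A sketch of that pattern next to the purely lexical spelling of the same intent. The file location `/proj/pkg/mod.py` is a made-up stand-in for `__file__`:

```python
from pathlib import PurePosixPath

module_file = PurePosixPath("/proj/pkg/mod.py")  # hypothetical __file__

# The common idiom: ".." is stored as a component and left to the OS.
via_dotdot = module_file.parent / "../data"
print(via_dotdot)  # /proj/pkg/../data

# "Remove a level" expressed lexically, with no ".." in the result.
via_parent = module_file.parent.parent / "data"
print(via_parent)  # /proj/data
```

The two refer to the same directory unless `/proj/pkg` is a symlink.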

Note that pathlib as designed in PEP 428 didn’t expose an absolute operation on paths, because of the problems you explain. Unfortunately, that decision didn’t turn out very popular :slight_smile:

(also unfortunately, Bitbucket shut down its Mercurial repositories, so I can’t point to any point in the history of the standalone pathlib package where this was introduced, nor to the original discussion)

I don’t understand why leading .. components could be safely elided, but I’m probably missing something.

It also seems to me that any person that wants .. components to be elided doesn’t want such a hybrid solution.

By now, I don’t think we can change the behaviour of either function/method. We could however add a separate pathlib.Path.lexical_absolute method (perhaps under a shorter name :slight_smile: ) that collapses all .. components.

@pitrou

The key is that when concatenating cwd + path, collapsing leading .. components in path is correct because cwd doesn’t contain symlinks.
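The premise can be checked with a short POSIX-only sketch: change into a directory through a symlink and compare getcwd() with the resolved path. It needs permission to create symlinks, and the helper name is made up for illustration.

```python
import os
import tempfile


def getcwd_is_real_after_symlink_chdir():
    """Return True if os.getcwd() reports the symlink's target, not the link."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        real = os.path.join(tmp, "real")
        link = os.path.join(tmp, "link")
        os.mkdir(real)
        os.symlink(real, link)
        try:
            os.chdir(link)
            # realpath() also resolves any symlink in the temp dir itself.
            return os.getcwd() == os.path.realpath(link)
        finally:
            # Leave the temp dir before it is deleted.
            os.chdir(original)


print(getcwd_is_real_after_symlink_chdir())
```

On Linux and macOS this is expected to print True; it says nothing about Windows or about other pathlib providers.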

@barry-scott

Thanks, but I have understood what the proposed change is and what difference it makes. What I’d like to understand is what the practical reasons for making this change would be. Does anyone have real-world code that would be made simpler by this, or where it would fix some subtle bug or such?

PS I’m mostly talking about pathlib here. I can understand that changing os.path.abspath would be nice, because os.path.abspath doesn’t always yield an equivalent path, which is indeed a subtle bug magnet, but I’m not sure this is wise for compatibility.

I missed this as well, and it isn’t true on Windows. Provided the directory exists when you set it as current, getcwd will return it as it was specified.


Uh? This discussion isn’t specifically about cwd, is it?

It is, because what os.path.abspath() and pathlib.Path.absolute() prepend is the cwd. Currently, os.path.abspath() collapses all .. components and pathlib.Path.absolute() doesn’t collapse any. @barneygale suggested making both of them collapse leading .. components but preserve later components.

I guess this is a pretty strong argument against the proposal, right?


Sorry, was not trying to teach egg sucking… I was wondering what the answer to your question was and explored it in code as I was not sure of the behavior.

If there is any code where the lexical handling of .. vs the OS handling of .. matters, then I could imagine a CVE in that code’s future. But I do not have an example.

Having said that, a bash or zsh script will get different results from a C program today.

Offering both versions of absolute would give a place in the docs to warn of the reason for two functions and why the difference may matter to an app.


I would be unhappy if any of these path manipulations would produce a path that leads to a different inode than the filesystem’s own interpretation. E.g. since foo/bar/../baz may be different from foo/baz (if foo/bar is a symlink to another directory), the bar/.. part shouldn’t be elided by a function that does merely syntactic manipulation, no matter what we think the user wanted.
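A small POSIX-only sketch of the foo/bar/../baz case, built in a temporary directory. The directory names `elsewhere` and `target` and the helper name are invented for the demonstration.

```python
import os
import tempfile


def lexical_vs_filesystem():
    """Return (lexical, actual) for foo/bar/../baz, relative to the temp root."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)  # the temp dir itself may sit behind a symlink
        os.makedirs(os.path.join(tmp, "foo"))
        os.makedirs(os.path.join(tmp, "elsewhere", "target"))
        # foo/bar is a symlink to a directory outside foo.
        os.symlink(os.path.join(tmp, "elsewhere", "target"),
                   os.path.join(tmp, "foo", "bar"))

        path = os.path.join(tmp, "foo", "bar", "..", "baz")
        lexical = os.path.normpath(path)  # string manipulation only
        actual = os.path.realpath(path)   # follows the symlink, then applies ".."
        return os.path.relpath(lexical, tmp), os.path.relpath(actual, tmp)


print(lexical_vs_filesystem())  # ('foo/baz', 'elsewhere/baz')
```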

In practice, at least part of the path was likely passed in to the program by some user input mechanism, e.g. sys.argv, so we can’t make strong assumptions about the intention of the author of the code; and the user passing the path in to the code likely expected to get whatever the filesystem would do.


I could also imagine this, and it’s not hard to contrive an example, but I suspect we’d say it belongs to the Python code and not the CPython implementation.

Perhaps we need the middle-ground normpath_but_not_double_dot() that can do everything else normpath does (replace altsep, collapse multiple adjacent seps and single dot) and switch abspath (and maybe join) to use it.[1] Then both os.path.abspath and pathlib.Path.absolute can have a behaviour change, and we don’t have to choose a favourite :wink:


  1. I’m pretty sure I reviewed an issue recently where that would’ve solved a problem, though that problem was related to using altsep and then looking for sep. ↩︎
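A rough sketch of what such a middle-ground function might look like. The name `normpath_keep_dotdot` and its `sep`/`altsep` parameters are hypothetical, and Windows drives and UNC prefixes are deliberately ignored.

```python
def normpath_keep_dotdot(path, sep="/", altsep=None):
    """Hypothetical: normalise separators and '.', but never touch '..'."""
    if altsep:
        path = path.replace(altsep, sep)
    # Remember whether the path was rooted before splitting it up.
    leading = sep if path.startswith(sep) else ""
    # Empty parts come from repeated separators; '.' is a no-op component.
    parts = [p for p in path.split(sep) if p not in ("", ".")]
    return (leading + sep.join(parts)) or "."


print(normpath_keep_dotdot("/home//me/./../you/.."))           # /home/me/../you/..
print(normpath_keep_dotdot("a\\b/./../c", sep="\\", altsep="/"))  # a\b\..\c
```

Because `..` is never removed, the result should resolve to the same location as the input even when some component is a symlink.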


You really can’t. str() and bytes() of a pathlib type are intended for use by APIs that don’t accept pathlib types. We even document it as such, though they would’ve been used for this even if we hadn’t.
