pathlib.Path.joincomponent()

Hello,

pathlib.PurePath.joinpath() uses os.path.join() to join new path components to the target path and returns a new path. An unfortunate side effect of this is that it interprets the characters os.sep and os.altsep in a special way, e.g.:

`from pathlib import PurePath

a = PurePath(“/foo/bar”)
b = a.joinpath(“/baz/quux”) # → results in PurePath(“/baz/quux”)`

This requires users of PurePath to intentionally check for os.sep and os.altsep when joining paths that come from untrusted sources. This is awkward for a few reasons but it’s also an insecure default. I propose a new method tentatively called joincomponent() that does no special interpretation of characters within the path components being joined and interprets them purely as components along the path.

Thank You,
Herbal Mist

This has nothing to do with special casing os.sep or using os.path.join. This is an intentional design choice: Absolute paths reset the root of the path currently being build. If you don’t want that, you will need to manually special case absolute paths in the arguments list. This is not an extra security risk, if you want to make sure that a path is within a specific folder, you need to check that. After all, users could also provide .. in the given path.

2 Likes

This is an intentional design choice: Absolute paths reset the root of the path currently being build.

Yes, that’s why I’m suggesting the design of a new method with new semantics.

It’s not just the handling of absolute paths, it’s also the interpretation of os.sep as a path separator and adding multiple components in a single call. e.g.:

a = PurePath(“/foo/bar”)
a.joinpath(“baz/quux”) # this call adds two path components instead of one
a.parts # == (‘/’, ‘foo’, ‘bar’, ‘baz’, ‘quux’)

I’m suggesting that adding multiple components in a single call to a.joinpath() is more surprising than not doing that. It’s fine to preserve that behavior but it would be nice if there were a method that didn’t do that.

If you don’t want that, you will need to manually special case absolute paths in the arguments list.

Yes but that requires users of PurePath to explicitly do that, as I mentioned in my original post.

After all, users could also provide .. in the given path.

Yes but you can check for .. by using the parts property and disallow that specifically before interpreting the path. You cannot do that for components that contain os.sep or os.altsep. Right now there is an inconsistency between how .. is treated and how os.sep is treated.

Well… yes. That is what / means in paths. baz/quux is not a valid path component. a.parts will never be ('/', 'foo', 'bar', 'baz/quux'). That just isn’t valid. If you don’t want this behavior, IDK what you are trying to do, but pathlib is not the correct library for you (since you aren’t working with paths apparently)

2 Likes

edit: never mind, misread what you were saying in the second post

I understand that it’s not valid. Right now the caller of PurePath is forced to check if path components that are to be added to joinpath contain os.sep or os.altsep beforehand to avoid unintended behavior. This creates a tight binding between the semantics of PurePath and os.sep/os.altsep.

If the path object is PureWindowsPath and I’m on a POSIX system, I have no way of knowing that I should check if the path component contains “\”, since that is not contained in os.sep or os.altsep on POSIX systems. This means that PurePath objects cannot be used polymorphically or as duck types because there is not enough information exposed to use them securely. That is the crux of the issue.

If a method existed like .joincomponent() that didn’t add any special handling to components that contained path separators, then I could look at path.parts before path was used to see if any component contained any invalid character and fail at that point.

Alternatively PurePath objects could expose os.sep/os.altsep as properties.

/cc @barneygale

You can do a few checks purely in pathlib (check that the resulting path is a subpath of the expected parent and doesn’t resolve to somewhere else via attacking a symlink you didn’t anticipate)

but I think there’s room for better here. I don’t think this needs to be a different method though, is there a reason this shouldn’t raise? joining a path here implies to me that one path must be relative to another, and I can’t think of an intended use here that raising on this would break.

Oh, quick meaningful clarification: even with fixing this, if the input is untrusted you Still need to check that the path ends up where expected (.. use, symlinks…) I only mean is there a good reason why join should allow completely resetting the root?

A couple of ideas for workarounds. I haven’t yet thought about the proposed new API (but I will do).

If you need to check that the user has supplied a purely relative path:

if user_path.anchor:
    raise ValueError('path is not relative!')
path / user_path

If you need to check that the user path doesn’t contain any drives, separators, etc:

path.joinpath('_').with_name(user_path)
1 Like

That idiom is essentially what I’m looking for. Doing a sort of faux dimensional analysis, of the current source code, that’s one of the few methods that looks at sep and altsep in the implementation, so it makes sense.

The semantics I was suggesting wouldn’t throw. In my head there should be an “AbstractPurePath” that has no concept of platform-specific separators or invalid characters. It just carries path components. Since PurePath is always either an alias to PureWindowsPath or PurePosixPath, then throwing makes sense. No abstract PurePath is exposed by pathlib right now anyway, (i.e. something that could be used as an parent class for a new file system type that used completely different path separators, etc.).

I still would suggest directly addressing the issue here though. I’m happy to use this workaround but it’s not ideal due to being wordy. Any combination of either adding a .joincomponent() or exposing .sep and .altsep properties or both would be sufficient. I think it also makes sense to add a note to the documentation that joinpath() shouldn’t be used with untrusted input since that seems like an easy error to make.

This is true for all pathlib operations. You should always make sure that the end result matches whatever security concerns you have. I don’t know what makes joinpath any more dangerous that anything else.

1 Like

Of all the methods on PurePath, I think joinpath() is more likely by a large amount than any of them to be mistakenly used with untrusted input. with_stem(), with_suffix(), relative_to(), __truediv__ all seem like they would ordinarily be used with programmer supplied constants. Additionally joinpath() (and with_pathsegments()) are unique in that they silently interpret their input with special processing, which is easy to miss in the documentation.

1 Like

No: __truediv__ behaves just like joinpath (PurePath('/foo/bar') / '/a/b' results in PurePath('/a/b'). Same with __init__ (PurePath('/foo/bar', '/a/b') results in the same).

You seem to be of the opinion that / and/or \\ should not be treated specially in paths. This is just wrong: They are special operators, basically by definition of path. If you don’t want to allow them, that is extra user pre-processing that you have to do.

In fact, it is quite reasonable for programmers to want to allow their users to specify sub folders. So this “special casing” (i.e. treating strings representing paths as paths) is to 100% the expected behavior

1 Like

I think you misinterpreted what I meant when I said that __truediv__ is mostly used with programmer supplied constants. I am aware that __truediv__ is a synonym for a.joinpath(b) but in most code that I see where __truediv__ is used, it’s used like this:

path / "foo" / "bar" / "baz"

Not:

path / "foo/bar/baz"

We see this type of usage in the documentation:

p / 'init.d' / 'apache2'

I can’t address the rest of your comment because it is projecting a position onto me that I did not express here.

You said “joinpath() is unique”: No, it’s not. The default behavior for methods on Path objects is that they “silently interpreter their input with special processing”. This is the default since, you know, pathlib works with paths.

The only methods that don’t treat the seperator as path separator in their string arguments are with_suffix (which errors) and write_name (which errors).

You said “joinpath() is unique”:

In the context in which I was explaining my reasoning for why it may be a good idea to add a warning in the documentation when using joinpath, it is unique. In any case, it seems to be a case of a misunderstanding, thanks for your input.

Well, if you are going to make false statements and dismiss people who call you out, then yes, I don’t see any value in continuing this discussion with you.

I’m sorry, I didn’t mean to dismiss you nor did I realize you were calling me out. It just seemed like the conversation turned into clarifying past comments. Your feedback has been appreciated. It’s important to look at the documentation as a whole before making any isolated change and indeed that caused me to assess the the other methods of PurePath. Thanks! If you have more opinions on this topic, please don’t feel discouraged from voicing them.

I think my preferred solution is for PurePath gain a classmethod is_valid_part_name (name pending) which checks if the passed in string contains the separator, if it’s equal to . or .. or if it’s any of the illegal names in Windows Paths. The intended code path for users where they only want to allow users to specify a single path component would be to first run the input through this function. That would be a more clean alternative to the with_name pattern above.