This doesn’t seem feasible, in that I’m pretty sure the Path is not supposed to be able to represent a broken “path” whose parts contain invalid characters.
I think what you really want to do is verify that the argument is a valid path component before joining it with .joinpath (or /, for that matter).
It looks like you’ve been given an idiom for that already: joinpath a dummy component, and then attempt to replace the dummy component using with_name. However, I would argue that this is not exactly intuitive. From my testing, it also has weird handling for . and ..:
>>> Path('/foo/bar').joinpath('_').with_name('..')
PosixPath('/foo/bar/..')
>>> Path('/foo/bar').joinpath('_').with_name('../baz')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/pathlib.py", line 856, in with_name
raise ValueError("Invalid name %r" % (name))
ValueError: Invalid name '../baz'
>>> Path('/foo/bar').joinpath('_').with_name('.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/pathlib.py", line 856, in with_name
raise ValueError("Invalid name %r" % (name))
ValueError: Invalid name '.'
>>> Path('/foo/bar').joinpath('_').with_name('./baz')
PosixPath('/foo/bar/./baz')
Accepting leading ./ components seems probably benign but still not exactly desirable. Accepting .. by itself means that this doesn’t avoid a clear security risk, and is thus critically error-prone.
As multiple people have pointed out already, restricting something here may be useful in some way, but between symlinks and valid path components that allow traversal, any solution is still going to require validation when handling untrusted user input.
In the interest of adding something more constructive than just saying it needs further validation, such validation may not be best done from within python in all cases. If your application handles untrusted user input, and you are concerned about the safety of this, defense in depth would have you take reasonable steps to validate the selected path is reasonable. One way to to this would simply be to resolve the path, and then reject it if it didn’t resolve in a place your application should allow users to write or would cause overwriting something they should not be able to overwrite. As for an additional layer, you could use the os’s controls (things like AppArmor, SELinux policies, etc) to restrict where the application can write to.
PurePosixPath("foo").with_name("\0") # no exception
PureWindowsPath("foo").with_name('<>:"|?*') # no exception
You could use .joinpath() instead in the above examples and it would be no different. Given this state of affairs, it seems reasonable to expect similar treatment for path separators (which are also invalid in path components), at least when not using .joinpath(). I don’t think it’s right for the semantics of .joinpath() to change.
Yes but as already mentioned, the issue is that there is no way to do that without looking at os.sep or os.altsep. This makes it impossible to securely use PurePath.joinpath() in a duck-typing context.
This is a bug, though it seems to be fixed from Python 3.12 onwards:
>>> Path('/foo/bar').joinpath('_').with_name('./baz')
ValueError: Invalid name './baz'
By design .. segments are usually not special in pathlib, and constructing a path like foo/.. via foo.joinpath('_').with_name('..') is perfectly valid.
You could call resolve() on both paths and then check is_relative_to() perhaps?
One could do all these things, but I would really rather have explicit functionality for these tasks:
(lower-level) Split at path separators
(lower-level) Validate that a path component is not . or .. (or any other “special” value that would have a special interpretation - agnostic to what the filesystem actually contains)
(higher-level) Iterate over path components in a string and validate them; raise an exception on any special values; otherwise produce the concatenated path
(really high-level) Open a file at a specified location, creating intervening folders if necessary; but only if the path components all validate (don’t contain things like . or ..) and no symlinks are followed (with the option to disable the latter check).
If some or all of that needs to happen in a third-party PyPI package instead, I guess that’s fine - but I’d really like to avoid wheel reinvention for tasks this common with these kinds of security implications.
(For even more bonus points, it’d be nice if standard archive-extraction tools could leverage the above.)
Not that there wouldn’t be value in such things, but there are more useful things that require cross-platform support as well as methods of querying the system about the underlying filesystem and how it is mounted if you want to have everything for this be done correctly and securely for handling untrusted user input from within the process rather than at a layer constraining the process.
The steps you need here include atomically creating directories and files to ensure there isn’t a symlink confusion attack. This would be much more useful even for the trusted input situation but is a prerequisite building block (and not even the only one) for guarding the effects of untrusted input from within the process that Python doesn’t have.
Add onto this that a module or methods that are marketed as safe to handle user input attach notions of a security guarantee (Even an implied one), and I don’t think it’s a good addition, especially because it isn’t generally even the correct first tool to reach for.
The first one would be finding a way not to use user input at all for this if the user input is untrusted… You could have a database (including, but not limited to, just a small sqlite lookup) that matches what the user names a file to where the application actually stored it using safely generated file paths.
Beyond this, you can restrict the process so that it can’t write to anything other than pre-specified directories.
I’m assuming somewhat by the notion of an untrusted user here that the user doesn’t have direct access to the filesystem and this is a web app, if the user does have access to the filesystem, we could look at a lot more, but the fundamentals here don’t change.
I don’t intend to protect against race conditions, just malicious input in this execution context. Once mkdir etc. is invoked, we know that the newly created directories are empty, so they can’t contain malicious symlinks.
The goal I imagine here is to be able to do things like unpacking archives, search for plugins within a subfolder (letting the user specify names, but avoiding escape from the directory using existing symlinks), etc.
But maybe that differs from what OP had in mind.
I don’t think web apps are at all the common case for Python code, or really any non-Javascript code. Unless you’re talking about the server, in which case I’m not really clear on why it matters.
Trusted users certainly can manipulate untrusted input. For example, if I run code on my machine intended to process a document that I downloaded separately.
Then what you want is not comprehensive enough to attach any sort of promise about safely handling untrusted input.
Please read the rest of the statement you left out of the quote, and keep in mind that the context for this thread is user-provided paths. If the user is the one launching the app, they have at least as much permission as the app with regard to the host. So the only way this input can be viewed as untrusted in a way which is a concern of the app fits in that model or a similar one.
Pathlib would still be the wrong tool for this.
https://docs.python.org/3/library/tarfile.html#tarfile-extraction-filter includes handling for archives you have a reason not to trust, though admittedly support here is limited to 3.12+ (prior to this, you should not use python to extract tar files you don’t trust. call out to system’s tar)
When it comes to making security claims in a module, people have a lot more that has to be considered. I would be extremely cautious of trying to advocate for a version of pathlib or even a subset of it that was “safe to handle untrusted input” because I don’t think the problem space makes sense, at least not in a general sense, and the problems should be addressed at another layer in design.
Looking at the different arguments in this thread, it seems the following could make sense to improve path traversal issues in Python:
Suggestion 1: Add a new Path.open_beneath(basepath) API to the pathlib.Path class, which would verify that the path represented by the object is relative to a basepath.
Suggestion 2: Provide a public API similar to with_segments [1] to allow PurePosixPath (or PureWindowsPath) subclasses to modify the behaviour of path constructions and path join operations, to e.g. make path segments relative before joining, or prevent use of ../. This way, developers and companies can create their own safer subclasses if they want to.
I also agree with the general feeling that the behaviour of path join operations in os.path.join and pathlib classes, of ignoring previous base and path segments when a path segment is “absolute”, is counterintuitive for most people (and seems to be a python only behaviour, as for example, this isn’t the case in go [2]).
This is illustrated by the fact that some vulnerabilities were ONLY vulnerable to the absolute path traversal issue and not the ../ pattern, for example cuckoo sandbox evasion [3] or CVE-2019-14322 [4]. This has also been discussed many times [5] by security pple. In addition, controls can also be implemented upstream or downstream to limit the use of the ../ pattern, in which case the behaviour of python join path functions might make those controls bypassable. For example, some controls could happen during file opening operations or at custom file APIs level.
In general, I think changing the behaviour of path join operations as a whole would still add some security benefit (but shouldn’t be advocated for being completely safe), while actually being more intuitive for developers. Except being a breaking change in extremely widely used APIs, it seems to be relatively reasonable to me.