Make pathlib extensible

I’ve just discovered secret option 3: copy-paste __fspath__() where it’s needed, no extra class required.

Alex Waygood pointed me towards winning option 0: set __fspath__ = None in AbstractPath, and then set it back to __fspath__ = PurePath.__fspath__ in Path. I considered this option originally and discounted it because it felt like a hack, and because the exception message was misleading. Well, it turned out every other option was an even uglier hack, and that Alex was prepared to fix the exception message. So now I’m aiming for this:

Isn’t it beautiful? :heart_eyes: Thanks Alex!
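
For anyone who wants to see the mechanism in isolation, here is a minimal sketch of the `__fspath__ = None` trick using stand-in classes. The names `PureSketch`, `AbstractSketch`, `ConcreteSketch` and `fspath_fails` are hypothetical and are not pathlib's own classes, and the wording of the exception depends on the Python version.

```python
import os


class PureSketch:
    """Stand-in for PurePath: supports the os.PathLike protocol."""

    def __init__(self, text):
        self._text = text

    def __fspath__(self):
        return self._text


class AbstractSketch(PureSketch):
    """Stand-in for AbstractPath: opts out of os.PathLike."""

    # Setting the special method to None blocks the inherited implementation.
    __fspath__ = None


class ConcreteSketch(AbstractSketch):
    """Stand-in for Path: opts back in by restoring the original method."""

    __fspath__ = PureSketch.__fspath__


def fspath_fails(obj):
    """Return True if os.fspath() rejects obj with TypeError."""
    try:
        os.fspath(obj)
    except TypeError:
        return True
    return False
```

With these definitions, `os.fspath()` accepts a `ConcreteSketch` but rejects an `AbstractSketch`, and `isinstance(obj, os.PathLike)` follows the same split.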

The tarfile.TarPath implementation is working nicely. I’ll put a PR up as soon as I’ve landed a handful of pathlib test improvements (GH-106060 through 64 if anyone’s up for reviewing :pray:)

In addition to a public tarfile.TarPath class, the above PR adds a private pathlib._VirtualPath class (!!)

(I recently realised that “VirtualPath” is a way cooler, more accurate name than “AbstractPath”, given its lack of true abstract methods. It should also put users in mind of virtual file systems, which is exactly what we want, I think.)

That PR is the largest I’ve ever logged, and it might take several months to land. But once it’s in, there’s not much more work left to make pathlib.VirtualPath public. Someone pinch me :sweat_smile:

Q: Should the path flavour (either ntpath or posixpath) be a class or instance attribute?

# Option 1
class MyPath(pathlib.VirtualPath):
    flavour = posixpath

# Option 2
class MyPath(pathlib.VirtualPath):
    def __init__(self, *pathsegments):
        super().__init__(*pathsegments, flavour=posixpath)

    def with_segments(self, *pathsegments):
        # override to prevent passing *flavour* to initialiser.
        return type(self)(*pathsegments)

Option 2 allows a subclass to determine the flavour in __init__() and pass it to super(), whereas in option 1 the flavour is fixed on the class. Once we choose an option, it will (edit: might) be hard to change our minds later.

Does anyone see any merit in allowing the flavour to be determined in __init__()? So far I haven’t come up with any specific use case, but I’m not confident enough to rule it out either. Thank you!

Note that the flavour only determines the syntax/structure of the path. It’s not used for filesystem access.
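
To illustrate the point about syntax, here is a small sketch that parses the same string under both flavours using only the standard library. Nothing here touches the filesystem; the variable names are just for illustration.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

text = "dir\\file.txt"

# POSIX syntax: the backslash is an ordinary character inside one name.
posix_name = posixpath.basename(text)
posix_parts = PurePosixPath(text).parts

# Windows syntax: the backslash is a separator between two names.
windows_name = ntpath.basename(text)
windows_parts = PureWindowsPath(text).parts
```

The POSIX flavour sees a single component, while the Windows flavour sees a directory and a file name.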

I think it should be a class attribute

Barney, a while back I suggested the flavor should be a class parametrization through __init_subclass__. I still think this is a valid option.

Then you can guarantee it’s set and the subclass doesn’t have to go overriding anything

I’m not sure I follow! The flavour attribute is set to os.path in PurePath, which will be the superclass of VirtualPath. And so subclasses of VirtualPath (including Path) will use the current OS’s flavour by default. Users should only need to override the flavour if their subclass always uses POSIX or Windows path semantics.
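
A rough sketch of that inheritance, assuming the class-attribute design (option 1). `SketchPurePath` and its subclasses are hypothetical stand-ins, not the real pathlib classes, and the `name` property exists only to show the flavour being used.

```python
import ntpath
import os.path
import posixpath


class SketchPurePath:
    """Hypothetical stand-in for PurePath; not the real class."""

    # Default: follow the current operating system's path syntax.
    flavour = os.path

    def __init__(self, text):
        self._text = text

    @property
    def name(self):
        # All parsing is delegated to whichever module the class names.
        return self.flavour.basename(self._text)


class SketchWindowsPath(SketchPurePath):
    # Overridden only because this subclass always uses Windows semantics.
    flavour = ntpath


class SketchPosixPath(SketchPurePath):
    # Overridden only because this subclass always uses POSIX semantics.
    flavour = posixpath
```

A subclass that says nothing inherits `os.path`; only subclasses with fixed semantics need the one-line override.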

I’m severely misunderstanding then, my fault :sweat_smile:

No worries! Thanks for your feedback on the class vs instance attribute bit. Éric has also written in favour of a class attribute on GitHub.

At the risk of opening a Pandora’s box:

Making flavour (or pathmod or whatever) a PurePath initialiser argument would allow us to remove the PurePosixPath, PureWindowsPath, PosixPath and WindowsPath classes in future, if that’s something we wanted to do. Many folks have observed over the years that pathlib’s class hierarchy is perhaps needlessly complex. The official docs put an inheritance diagram front-and-center and explain that, when you call PurePath() or Path(), you don’t get an object of type PurePath or Path back. It’s difficult to justify using subclasses to set path flavour when a simpler approach (initialiser argument) would work.

If we drive the flavour via an initialiser argument instead, it might look like this:

# pure path using current OS path semantics
path = PurePath('foo')
assert type(path) is PurePath
assert path.pathmod is os.path

# pure path using POSIX semantics
path = PurePath('foo', pathmod=posixpath)
assert type(path) is PurePath
assert path.pathmod is posixpath

# pure path using Windows semantics 
path = PurePath('foo', pathmod=ntpath)
assert type(path) is PurePath
assert path.pathmod is ntpath

# concrete path does not accept pathmod argument
path = Path('foo')
assert type(path) is Path
assert path.pathmod is os.path

All __new__() shenanigans are rendered unnecessary. Personally I find this a more appealing design than what we have now, and it would gel perfectly with the proposed VirtualPath. But deprecating and removing 2/3rds of the pathlib public interface is about as churny as things come, so there would need to be strong support for this course of action. That doesn’t seem likely to me, but I figured I’d mention it just in case.

I wouldn’t like that.
I found PurePosixPath pretty useful for generic path-like operations – URL fragments, archive contents, even nested dict access. The minimal spec (only / and \0 are special, leading / or // are more special) works great even outside Posix.
Spelling PurePosixPath as PurePath('foo', pathmod=posixpath) sounds like unnecessary delving into implementation details.
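
A short sketch of the kind of generic use described above, with PurePosixPath applied to a URL path and to nested dict access. The `lookup` helper is hypothetical and written only for this example.

```python
from pathlib import PurePosixPath

# URL path manipulation, with no filesystem involved.
url_path = PurePosixPath("/docs/api/index.html")
parent = url_path.parent
sibling = url_path.with_name("intro.html")
suffix = url_path.suffix

# Exactly two leading slashes are kept distinct; three or more collapse.
double_root = PurePosixPath("//host/share").root
triple_root = PurePosixPath("///host/share").root


def lookup(mapping, path):
    """Hypothetical helper: walk nested dicts along the path's parts."""
    node = mapping
    for key in path.parts:
        node = node[key]
    return node


config = {"server": {"tls": {"enabled": True}}}
tls_enabled = lookup(config, PurePosixPath("server/tls/enabled"))
```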

Questions about churn aside, this is IMO equivalent in terms of “delving into implementation details”. It’s just more front-and-center in the name PurePosixPath. But as a beginner user, I’d certainly prefer not to have to think how my paths relate to posix (what’s that?), and minimize that exposure to a dim awareness (resp. short-and-sweet documentation) that there are different “path styles” between posix & windows.

Pulling on that thread a bit more, I don’t find pathmod= to be a good name for that kwarg, but PurePath('foo', style=posix|windows) looks like an appealing API to me.

Of course, the churn would be substantial, but if we are now able to envision / implement a way better long-term API, I don’t think we should forgo such improvements indefinitely (as long as we can provide users with an easy way to migrate).

I am unsure about the name VirtualPath.

In other contexts I have seen, vfs or virtual file systems are all about letting you type paths that look like regular filesystem paths, and some software does custom things to return metadata or file contents.
In the pathlib ecosystem, a TarPath class, or SshPath, or S3Path would be examples of virtual filesystems.
But their base class does not itself implement a virtual path.

Barney suggested PathBase when I commented this on the PR, which feels great and easy to understand to me!

Thank you for your feedback @encukou and @h-vetinari! The idea of adding a flavour argument to PurePath has been floating around in my head for years now, and I’m glad to finally be able to rule it out!

@merwok thanks, I’ve gone with PathBase in the PR! I think the docs (when we write them) could still mention virtual paths, right? e.g. “PathBase can be used to implement virtual path objects…”

:sparkles: July 2023 progress report :sparkles:

Thank you to all those who have been providing feedback on naming, hierarchies, etc. It’s so useful to bounce ideas off such talented and experienced devs!

As I mentioned in a previous post, I’ve put up a PR that adds a private _PathBase class:

That PR has been slimmed down: it originally added tarfile.TarPath too, but the expected behaviour of paths involving symlinks wasn’t clear, and so I’m going to work on TarPath in a PyPI package first.

When that PR lands, the remaining work is:

  • Add a public PurePath.pathmod class attribute (PR: GH-106533)
  • Figure out what to do with _PathBase.__hash__(), __eq__(), __lt__(), etc (any opinions?)
  • Make pathlib.PathBase public!

For the first time, I feel confident that this project will succeed. There are no architectural problems remaining in pathlib that would prevent it, nor any major decisions to be made (touch wood). It will be immediately useful upon release, and I think it could grow into one of Python’s best-loved features as third-party APIs begin to accept os.PathLike | pathlib.PathBase for path arguments. Eventually users should be able to do things like:

shutil.copytree(FTPPath(...), TarPath(...))
pandas.read_csv(S3Path(...))
image.save(TarPath(...))  # PIL

We’re doing for path objects what PEP 3116 and the io module did for file objects :slight_smile:

That’s it for now. Thanks again to everyone who has helped with this!

For my fellow visual learners, here’s a venn diagram showing os.PathLike and pathlib.PathBase:

The patch in review includes a PathBase.as_uri() method that raises UnsupportedOperation. I expect that some subclasses of PathBase will override that method, e.g. to return s3:// or ftp:// URIs.

Q: Should we add a symmetrical PathBase.from_uri() classmethod? This would provide an explicit means to construct a path object (and its backend) from a URI - for example, an FTPPath.from_uri() method could parse the host/port/user/passwd from the URI, construct an ftplib.FTP object, and then wrap it in an FTPPath object.
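
A minimal sketch of what such a classmethod might look like, under stated assumptions: `FTPPathSketch` is a hypothetical class, it does not subclass anything from pathlib, and it only records the connection details rather than constructing an ftplib.FTP object, so that the example runs without a network.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


class FTPPathSketch:
    """Hypothetical FTP path; stores connection details without connecting."""

    def __init__(self, *pathsegments, host, port=21, user=None, passwd=None):
        self.path = PurePosixPath(*pathsegments)
        self.host = host
        self.port = port
        self.user = user
        self.passwd = passwd

    @classmethod
    def from_uri(cls, uri):
        # One URI fills several initialiser parameters at once.
        parts = urlsplit(uri)
        if parts.scheme != "ftp":
            raise ValueError(f"not an ftp URI: {uri!r}")
        return cls(
            unquote(parts.path),
            host=parts.hostname,
            port=parts.port or 21,
            user=parts.username,
            passwd=parts.password,
        )


example = FTPPathSketch.from_uri(
    "ftp://user:secret@example.com:2121/pub/file%20name.txt"
)
```

The initialiser keeps the familiar "path segments plus keyword arguments" shape, and the URI parsing lives entirely in the classmethod.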

For Path.from_uri() (local paths), I have a local branch that handles RFC 8089 file: URIs, including the weird ones with 4 or 5 leading slashes, such as those produced by urllib.request.pathname2url().

I don’t think there should be a PathBase.from_uri(). Instead I feel the subclasses should support URIs directly in their __init__ function. This is already the case for some pathlib implementations:

from etils import epath
import upath

epath.Path('gs://xxx/yyy')
upath.UPath('s3://test_bucket')

An additional method would add redundancy/confusion and feel less natural I think. But that’s just my opinion.

Thanks! Assuming upath.UPath('s3:...') delegates to an S3Path class, how should users call its initialiser? So far I’ve been gunning for something like this:

import boto3

# S3Path is the hypothetical class under discussion, not an existing API.
client = boto3.client('s3')
path = S3Path('downloads', 'foo.tar.gz', client=client, bucket='foo')

The positional arguments are specified just like in PurePath and Path - a list of path segments to join. The keyword arguments add to the existing interface, rather than replacing it.

I’m not sure how to add URIs into this mix. In some cases you can distinguish URIs from file paths, but not always* and so positional arguments don’t seem right. A uri keyword argument might work, but it makes other arguments redundant and complicates the interface IMO. This is why I still lean towards a from_uri() classmethod, as the URI may be used to fill several initialiser parameters. What do you think?

(*) for example, file:/etc/hosts is both a valid file URI and a valid relative POSIX path
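
The ambiguity in that footnote can be checked with the standard library alone; this sketch reads the same string once as a URI and once as a POSIX path.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

text = "file:/etc/hosts"

# Read as a URI: scheme "file" and an absolute path.
as_uri = urlsplit(text)

# Read as a POSIX path: a relative path whose first component is "file:".
as_path = PurePosixPath(text)
```

Both readings are valid, which is why a positional argument alone cannot tell them apart.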

To be honest I didn’t know that was a thing, I almost always construct a Path with a single argument, and add on with / if needed.

I think it would be nice if the client wasn’t needed for s3 or gcs paths to work, since you don’t necessarily want to interact with your cloud storage when you’re working with paths. e.g. if you’re just formatting some metadata for documentation, or something.

Would PurePosixPath work? It’s designed not to perform any (virtual) filesystem access.
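
As a rough illustration of that suggestion, here is a sketch that treats an object key as a PurePosixPath. The key itself is made up, and no client or network access is involved.

```python
from pathlib import PurePosixPath

# Build and inspect a key purely lexically.
key = PurePosixPath("downloads") / "foo.tar.gz"

name = key.name
suffixes = key.suffixes
renamed = key.with_suffix(".zip")  # replaces only the last suffix
```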