Make pathlib extensible

Having implemented a few pathlib APIs myself, I’m quite interested in this feature.

epath: A pathlib API for Google Cloud Storage access.

from etils import epath

path = epath.Path('gs://some-bucket/path/to/file')
imgs = [f for f in path.iterdir() if f.suffix == '.jpg']

github_api: A pathlib API to explore remote github repositories.

path = GithubPath('github://tensorflow/datasets/tree/master/docs/')
assert path.repo == 'tensorflow/datasets'
assert path.branch == 'master'
content = (path.parent / 'README.md').read_text()

I’ll be interested to try the feature & give feedback if needed.

:sparkles: March 2022 progress report :sparkles:

Slightly less to report this month, so I’m going into detail on my current task.

Brett Cannon made a helpful suggestion in PR 31085:

My current thinking (while I work through my PR review backlog to reach this PR) is that this should be used for zipfile.Path. I also wouldn’t be opposed to that happening in this PR if @barneygale wanted to give that a shot while he waits for a review from me.

This clarifies the order of things: I need to address the three issues I mentioned in the last post before we can add an AbstractPath class.

The first is to do with subclassing and flavours: users expect to be able to subclass pathlib.Path, instantiate their subclass, and have their new object use the local system’s path flavour (Windows or POSIX). This is what bpo-24132 is all about. Currently this falls over because many PurePath / Path methods expect to find a _flavour attribute, which is only set in PurePosixPath, PureWindowsPath and their subclasses.

How can we solve this? An obvious solution is to set _flavour in PurePath (e.g. switching on os.name), but if we look a little deeper we can spot a nice simplification.

It’s worth considering the possible values _flavour can take: pathlib._posix_flavour or pathlib._windows_flavour, which are singleton instances of _PosixFlavour and _WindowsFlavour respectively, those types being subclasses of _Flavour. I usually refer to them as “flavour classes”. Here’s how things look today:

PurePath._flavour        = xxx not set!! xxx
PurePosixPath._flavour   = _PosixFlavour()
PureWindowsPath._flavour = _WindowsFlavour()

What are flavour classes? In my view, they’re essentially re-implementations of posixpath and ntpath, with a few changes and improvements. Here are the most striking correspondences (from the 3.7.0 source tree):

flavour         os.path
=======         =======

sep             sep
altsep          altsep
casefold()      normcase()
splitroot()     splitdrive()    
gethomedir()    expanduser()
resolve()       realpath()

In the case of splitroot(), gethomedir() and resolve(), the implementations clearly derive from the implementations in posixpath and ntpath. But the implementations were not kept in sync after pathlib landed in CPython, meaning every bug needed to be fixed in two places, and that didn’t always happen. In PRs over the last year or two I’ve removed the gethomedir() and resolve() implementations, which solved some pathlib bugs.
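To make the table a little more concrete, here is a small sketch that calls two of the os.path counterparts for both flavours. Both posixpath and ntpath can be imported on any platform, so this should run anywhere. The sample paths are made up for illustration.

```python
import ntpath
import posixpath

# normcase() is the os.path counterpart of the flavour's casefold().
posix_case = posixpath.normcase("Foo/Bar")   # POSIX is case-sensitive: unchanged
windows_case = ntpath.normcase("Foo/Bar")    # lower-cased, "/" becomes "\"

# splitdrive() is the closest counterpart of the flavour's splitroot().
posix_split = posixpath.splitdrive("/usr/lib")       # POSIX paths have no drive
windows_split = ntpath.splitdrive("C:\\Users\\me")   # the drive is split off

print(posix_case, windows_case, posix_split, windows_split)
```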

Here’s the rub: these classes don’t need to exist. We can make PurePosixPath reference posixpath directly, and do the same thing with PureWindowsPath and ntpath. Flavour classes don’t bring enough to the table to really justify their existence. I can see why they did when pathlib was a standalone package, mind!

And that brings me to the second observation: we already have a handy dandy attribute that points to either posixpath or ntpath depending on the system - it’s called os.path! So we can solve the problem elegantly by setting _flavour as follows:

PurePath._flavour        = os.path
PurePosixPath._flavour   = posixpath
PureWindowsPath._flavour = ntpath
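
A minimal sketch of that idea, using toy classes rather than pathlib’s real ones. ToyPurePath and its subclasses are hypothetical names for illustration only, and the actual change in the PR may look quite different.

```python
import ntpath
import os
import posixpath


class ToyPurePath:
    # os.path is already posixpath or ntpath, depending on the host system.
    _flavour = os.path

    def __init__(self, *segments):
        self._raw = self._flavour.join(*segments) if segments else ""

    def __str__(self):
        return self._raw

    @property
    def name(self):
        return self._flavour.basename(self._raw)


class ToyPurePosixPath(ToyPurePath):
    _flavour = posixpath


class ToyPureWindowsPath(ToyPurePath):
    _flavour = ntpath


class MyPath(ToyPurePath):
    """A user subclass: it inherits the local flavour with no extra work."""


print(MyPath("docs", "README.md"))
print(ToyPureWindowsPath("C:\\docs", "README.md").name)
```

Because the base class already carries a usable _flavour, the user subclass does not need to know about flavours at all.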

And so that’s what I’ve attempted to implement in PR 31691. Brett has kindly assigned himself as a reviewer, and I expect he’s still working his way through his mighty review queue. That’s all the news! o/

Very interesting observation!

:sparkles: April 2022 progress report :sparkles:

Brett has a chonk of work to do for 3.11 beta 1, not least for PEP 594, and so hasn’t had the opportunity to look at PR 31691. As such I’m afraid I have no progress to report this month.

I’d like to appeal to core devs who might have the time and inclination to help review that PR. I suspect Brett would appreciate one fewer thing in his to-do list! If I can return the favour somehow (review your code, write some tests for your shiny new feature) I’d be happy to do so! I hope I’m not being ungracious to Mr Cannon, who has been a massive help so far.

Til next time!

^ @brettcannon

:sparkles: May 2022 progress report :sparkles:

  • Addressed a round of feedback from Brett Cannon (thank you!) on PR 31691.
  • Created PR 91882, which factors out a near-duplicate implementation of ntpath.splitdrive() from pathlib. Thank you Eryk Sun and Steve Dower for the reviews!
  • Participated in discussions about possible additions to pathlib. If/when we introduce AbstractPath, I think some of these will become pretty compelling.

Happy to be making progress again :slight_smile: . See you next month!

Hey folks. A question for you all: how should users supply state (like a backing socket, fileobj, etc) to their Path types?

Taking a potential TarPath class as an example (see gh-89812), here are some ideas:

import tarfile

mytar = tarfile.open('sample.tar.gz')

readme = mytar.TarPath('README.txt')                   # Idea 1
readme = tarfile.TarPath[mytar]('README.txt')          # Idea 2
readme = tarfile.TarPath('README.txt', backend=mytar)  # Idea 3
# xxx your idea here? :)

Ideas 1 and 2 generate a new TarPath type for each instance of TarFile; this type has the TarFile instance stored as a class attribute. Idea 2 is probably a patent abuse of __class_getitem__. The advantages of these ideas are:

  • The type’s interface, including its constructor, is exactly compatible with Path.
  • It doesn’t require much internal work in pathlib.

Idea 3 doesn’t generate a new TarPath type for every TarFile instance, but it does require some significant work on pathlib’s internals to facilitate passing the backend around to new TarPath objects (e.g. from iterdir()). This work might remove private constructors like _make_child_relpath() that assume the input is already normalized, which would have a performance impact. On the positive side, it could open up customization of how pathlib normalizes paths, which has been requested a few times over the years. E.g. folks might want to retain the leading ./ or trailing / in a path like ./foo/bar/baz/ as these can be meaningful to shells.
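
A rough sketch of the “one type per backend” approach behind ideas 1 and 2, for illustration. The helper bind_path_type() and the BoundPath name are hypothetical, and PurePosixPath stands in for a real TarPath base class just to show how the state travels.

```python
from pathlib import PurePosixPath


def bind_path_type(backend):
    """Return a new path type with `backend` stored as a class attribute.

    Hypothetical helper: a real TarPath would derive from a class that can
    do I/O, but PurePosixPath is enough to show the mechanism.
    """
    return type("BoundPath", (PurePosixPath,), {"backend": backend})


backend = object()  # stands in for an open TarFile, socket, etc.
BoundPath = bind_path_type(backend)

readme = BoundPath("docs/README.txt")
sibling = readme.parent / "LICENSE.txt"

# Derived paths keep the generated type, so they see the same backend.
print(type(sibling) is BoundPath, sibling.backend is backend)
```

Since pathlib already creates derived paths with the same type as the original, the backend follows along without any changes to pathlib’s internals, which is the second advantage listed above.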

Any feedback? Other ideas/thoughts? Cheers.

I think letting the user define a custom __init__ signature is quite an important feature.

  1. For the end user, this is the most natural option.
  2. This is how existing pathlib-like APIs are currently implemented. So not supporting it will either break existing code or prevent codebases from migrating (defeating the purpose of AbstractPath).

For example:

p = upath.UPath('gs://bucket/f.txt', asynchronous=True)
p = zipfile.Path('archive.zip', at='folder/f.txt')

Internally, zipfile uses a _next function to create other paths sharing the same state. Maybe something similar could be used here.
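
A toy sketch of that pattern, to show the shape of it. StatefulPath is a hypothetical class, not zipfile’s or pathlib’s actual code: every derived path is created through one _next() hook, which forwards the shared state to the custom constructor.

```python
class StatefulPath:
    """Toy path type with a custom __init__ signature (hypothetical)."""

    def __init__(self, path, *, backend):
        self.path = path.strip("/")
        self.backend = backend

    def _next(self, path):
        # Single hook for creating related paths. A subclass whose
        # constructor takes different arguments only overrides this method.
        return type(self)(path, backend=self.backend)

    def __truediv__(self, name):
        return self._next(f"{self.path}/{name}" if self.path else name)

    @property
    def parent(self):
        head, _, _ = self.path.rpartition("/")
        return self._next(head)


backend = {"asynchronous": True}  # stands in for a connection, archive, etc.
folder = StatefulPath("bucket/folder", backend=backend)
child = folder / "f.txt"
print(child.path, child.parent.path, child.backend is backend)
```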

Thanks @Conchylicultor, I think you’re right.

The only tricky bit is that, at the moment, there are a couple of different methods for constructing PurePath objects: _from_parts() and _from_parsed_parts(). The latter method doesn’t perform normalization and is used by things like glob() and parent for speed. Strictly speaking __new__() is a third method, because it redirects to the POSIX/Windows variants where necessary before calling _from_parts().

I’m not sure how to introduce a _next() method into the mix without unifying the constructors, which is liable to wreck performance. I have some ideas though :slight_smile:. I’ll start seriously working on this problem once PR 31691 lands.

:sparkles: June 2022 progress report :sparkles:

PR 91882 has landed, which factors out a near-duplicate implementation of ntpath.splitdrive() from pathlib. To achieve this, we improved support for \\?\UNC\ prefixes in ntpath.splitdrive(). Big thanks to Eryk Sun, Steve Dower and Serhiy Storchaka for their reviews and contributions. I’m not planning any further work to merge the pathlib + os.path implementations, but I think there’s potential to add os.path.isreserved() and os.path.path2fileuri() in future.
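
For readers unfamiliar with ntpath.splitdrive(), here is a small sketch of what it does with drive-letter and UNC paths. The sample paths are made up. The result for the \\?\UNC\ form depends on whether the running Python includes the change described above, so no expected value is given for it.

```python
import ntpath

# Drive-letter path: the drive is split from the rest.
drive_split = ntpath.splitdrive("C:\\dir\\file.txt")

# UNC path: the "drive" is the \\server\share part.
unc_split = ntpath.splitdrive("\\\\server\\share\\dir\\file.txt")

# Extended UNC path: with the improved support, the whole
# \\?\UNC\server\share prefix should be treated as the drive.
# Older Python versions may split this differently.
extended_split = ntpath.splitdrive("\\\\?\\UNC\\server\\share\\dir\\file.txt")

print(drive_split, unc_split, extended_split)
```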

This leaves pathlib’s “flavour classes” as something of a relic, and I’m looking to remove them in PR 31691, which will have the effect of making PurePath and Path directly subclassable (!!). Brett Cannon reckons he might have some time coming up to review once he’s got through some bugfixes - thanks Brett!

Once that lands, I’ll start seriously engaging with the last major roadblock: adding some kind of _next() method, customizable in subclasses of AbstractPath, that is called by iterdir(), parent, etc, to generate new path objects. This appears to require us to unify the path constructors, which is likely to affect performance. Thus I might need to dig even deeper to offset the performance losses with gains elsewhere. It’s quite possible I’ll need to write a PEP to justify these changes and the addition of AbstractPath - terrifying and exciting in equal measure! I’m convinced that the use cases are hugely compelling so will give it my best effort :slight_smile:

Til next time o/

:sparkles: July 2022 progress report :sparkles:

@Ovsyanka added pathlib.Path.walk() :tada:. It’s a great way to walk a directory tree. When we add AbstractPath, this method will come “for free” when users implement iterdir() and stat().
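
To illustrate the “for free” point, here is a simplified sketch of a top-down walk written purely in terms of iterdir() and is_dir() (which the concrete Path class implements using stat()). The function simple_walk() is hypothetical and is not CPython’s implementation: it ignores symlink loops, error handling and bottom-up traversal.

```python
import tempfile
from pathlib import Path


def simple_walk(top):
    """Yield (dirpath, dirnames, filenames) top-down, using only
    iterdir() and is_dir(). Symlink loops and errors are not handled."""
    dirnames, filenames = [], []
    for entry in top.iterdir():
        (dirnames if entry.is_dir() else filenames).append(entry.name)
    dirnames.sort()
    filenames.sort()
    yield top, dirnames, filenames
    for name in dirnames:
        yield from simple_walk(top / name)


# Build a tiny throwaway tree and walk it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg").mkdir()
    (root / "pkg" / "mod.py").write_text("x = 1\n")
    (root / "README.txt").write_text("hello\n")
    listing = [
        (dirpath.relative_to(root).as_posix(), dirnames, filenames)
        for dirpath, dirnames, filenames in simple_walk(root)
    ]

print(listing)
```

Any path type that provides those two methods could reuse a function like this unchanged, whether it is backed by a local disk, an archive or a remote store.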

PR 31691, which removes “flavour classes” and makes PurePath and Path directly subclassable, is awaiting core review. Brett Cannon still intends to take a look, but I worry it has become an anchor for him to drag around, so I’m planning to put a message on the python-dev mailing list next week appealing for more reviewers (Brett has blessed this).

I’ve played around with unifying the PurePath internal constructors – it’s simpler than I thought. I expect that PurePath.parents will become a true tuple, rather than a tuple-like immutable sequence. Everything else user-facing stays the same. Performance impact still TBD. While testing this I found a niche PureWindowsPath bug and logged it as GH-94909.

That’s it for July! Thanks for reading, cheers.

:sparkles: August 2022 progress report :sparkles:

Really good progress this month :slight_smile:

GH-31691, which makes PurePath and Path subclassable, seems to be nearing completion! Brett Cannon has been an enormous help here and has contributed many reviews over the last month. I’m excited for this to land - it will fulfil my goal of making all path behaviour customizable from within a path subclass, without the need for flavour or accessor subclasses. From here, pathlib.AbstractPath is within touching distance.

GH-95450, which makes path initialisers use os.path.join() to join arguments, is merged. Thanks again to Brett for this one. This eliminates a little bit of low-level OS-specific code from pathlib, and sets us up nicely to add some sort of makepath() method in short order.
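
A short sketch of what “joining like os.path.join()” means in practice, comparing path construction with the matching join() function for each flavour. The sample segments are made up, and the results shown by the comparison are the documented pathlib behaviour, which this change keeps rather than alters.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

# An absolute segment discards everything before it, as in os.path.join().
posix_args = ("etc", "/usr", "lib")
posix_path = PurePosixPath(*posix_args)
posix_joined = posixpath.join(*posix_args)

# On Windows, a rooted segment without a drive keeps the earlier drive.
windows_args = ("c:/Windows", "/Program Files")
windows_path = PureWindowsPath(*windows_args)
windows_joined = ntpath.join(*windows_args)

print(posix_path, posix_joined)
print(windows_path, windows_joined)
```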

GH-95486, which adds a new os.path.isreserved() function (and makes pathlib use it), is in review with Eryk Sun and Steve Dower (thanks guys). Reserved filenames on Windows are quite subtle, so I think it will run for a while.

Exciting times! Bye for now o/

Hi Barney,

First thanks for all this work. I really look forward to these updates.

Quick question: with your changes, will someone be able to customize the return type of __truediv__? For example, I have a subclass of PosixPath that represents a folder with a specific structure on the lfs. I found myself having to override __truediv__ on this subclass and on other PosixPath subclasses I made to represent specific file structures.

The problem is, I’m using subclasses to represent specific folders, but I don’t want the use of / to create new instances of the subclass; I want it to create a PosixPath instead.

To overcome this I did something like


from pathlib import PosixPath


class _PosixPath(PosixPath):
    # Joining with "/" returns a plain PosixPath, not the subclass.
    def __truediv__(self, other):
        return PosixPath(self) / other


class SpecialDirectory(_PosixPath):
    @property
    def path_in_special_dir(self):
        return self / "path_foo"

I’m wondering if anything in your changes makes subclassing the way I want more seamless. I will never have a situation where there is a SpecialDirectory inside of another SpecialDirectory.

Hey - could you elaborate on why you don’t want to create new instances of your subclass? Personally I’d expect path / 'foo' to have the same type as path.

Maybe this is a bad use case, but my subclass is a special directory with a known file structure. There are many of them but they’re never nested. As in my example, SpecialDirectory would never contain another SpecialDirectory, so path / 'foo' returning a SpecialDirectory doesn’t make sense (assuming path is an instance of SpecialDirectory). It makes more sense to return a Path instance.

Edit: elaborating more on “maybe this is a bad use case”: was there ever a mention of why someone would want to subclass Path? My initial use case was to make subclasses for special file structures, but after thinking about your comment, Barney, and my use case, maybe that isn’t correct?

See barneygale/pathlab on GitHub (“Extends Pathlib to archives, images, remote filesystems, etc”), which was also made by Barney, for use cases. Essentially, subclassing it allows you to traverse any filesystem-like structure using this familiar interface. Many libraries that allow you to traverse remote resources already do that.

Brilliant, I was unaware, thank you!

I have nothing to report since my update at the end of August. The bottleneck is core dev review time. I feel quite frustrated and down about the situation tbh. I hope my experience is not representative; if it is, CPython might eventually face a demographic crisis when core devs retire without anyone to backfill. Or maybe I picked a duff horse when I decided to work on pathlib, idk.

:sparkles: Festive progress report :sparkles:

I’m dead pleased to announce that gh-68320 is now resolved, and so Python 3.12 users can directly subclass PurePath and Path:

import pathlib

class MyPath(pathlib.Path):
    pass

path = MyPath()

I’d like to express my gratitude to @brettcannon, who stepped up to the plate to review pathlib changes and provided invaluable feedback week after week. It couldn’t have been done without him! Thank you also to @eryksun, who regularly shared his immense Windows expertise. And thank you to everyone else who contributed ideas, reviews and code. Despite the simple repro case, this took nearly 3 years of work to address!

I’m still looking forward to adding an AbstractPath class that users can subclass to provide S3Path, TarPath, etc. We’re really close now. The shortest possible route looks like this:

  1. Add support for sharing state between related path objects (gh-100479)
  2. Write a PEP and implementation for the introduction of AbstractPath

There are at least a handful of other tasks that may be considered blockers (TBD):

  • Add support for detecting the flavour of user subclasses (gh-100502)
  • Add support for customizing or disabling pathlib’s path normalization routines
  • Optimisation pass for path object construction

With Brett taking a step back from pathlib maintenance, I’m looking for core devs to sponsor this development effort. Any level of involvement would be appreciated, from giving feedback in this thread, to code reviews, to co-development and co-authoring of patches, documents, etc. I’d be happy to chat over email/IRC/discord/discourse/VoIP/whatever. Drop me a PM if you’d like! Even if you’re not a core dev, I’m always very happy to talk about this effort, so feel free to reach out.

Thanks for reading, til next time o/

Feel free to request my review on pathlib stuff. I’m not a pathlib expert (so I’ll probably want additional reviews from other core devs before merging if it’s something complex), but I’m a core dev who regularly uses pathlib, so I definitely have an interest in the topic. I can’t promise I’ll always be prompt in responding, but I don’t mind being pinged for pathlib-related stuff :slight_smile:
