Make pathlib extensible

barneygale · February 24, 2020, 12:15am

Hi all,

I saw that Python 3.8 adds zipfile.Path, which provides a partial pathlib-compatible interface to zip files. This implementation doesn’t actually use any pathlib machinery, rather just mimics its interface. Previously I’ve noted packages like artifactory that do subclass from pathlib classes, but the implementation is necessarily messy due to pathlib.Path not being intended to be subclassed in user code.

But what if we changed that?

The pathlib source code already contains notions of flavours (implementing posix or windows path semantics) and accessors, but pathlib contains only a single accessor implementation, and in the last few years pathlib has accumulated code that bypasses the accessor.

I propose that we change instances where pathlib.Path directly calls methods like os.close() and os.getcwd() to use the accessor instead, drop the underscores from _Accessor and _NormalAccessor, and document the accessor interface.

I’ve played around with these ideas and published a package called pathlab that shows what might be possible. It turns out implementing the equivalent of os.stat() and os.listdir() in an accessor gets you pretty far!

Some questions arise:

Is this something we want users to do? For me it seems obvious that a common interface to the local filesystem, zip and tar files, FTP, WebDAV etc would be desirable, but I’m keen to hear feedback!
Should we provide a pathlib.StatResult class for use in users’ Accessor.stat() implementations? os.stat_result is a little tricky to instantiate as-is.
What does this mean for os.DirEntry, if anything? Its interface is a subset of pathlib.Path, with the exception of the addition of a path attribute.
How do we suggest users bind an accessor instance (which, unlike a local filesystem accessor, requires some context like a URL or an open file pointer) to a path type? In my implementation I’m doing some things in Accessor.__new__() to create a new Path subclass with the accessor instance bound in. But perhaps there’s a better way…
In Accessors, do we want to stick to the os API as much as possible, or perform some sort of simplification (i.e. unlink() and rmdir(), or just delete()?)
How do we support transfers between different kinds of filesystems? with path1.open('rb'), path2.open('wb') ... would work but would not preserve permissions/metadata/etc, so would there by scope to add a Path.transfer() method?

Penny for your thoughts! Hope this isn’t too crazy.

The changes needed to pathlib.Path can be seen here, albeit implemented in a subclass for now.

Barney

pitrou · February 24, 2020, 9:14am

@barneygale That’s… interesting. The original intent for the “accessor” thing was to have a variant that did all accesses under a filesystem tree in a race condition-free way using openat and friends. It turned out to be much too hairy to actually implement, so was entirely abandoned, but the accessor abstraction was left there.

It would be nice if you could directly modify pathlib and post the diff (or PR) somewhere. Some parts of pathlib use shortcurts for speed (especially Path construction), and I’m curious if you can retain the shortcuts while respecting the accessor abstraction.

encukou · February 24, 2020, 11:14am

I would advise making it easier to subclass PurePath, with the subclass supplying the filesystem access methods just like pathlib.Path does.
Why? Subclasses aren’t likely to really support the full Path API; for example home() or expanduser() doesn’t quite make sense in a zip file, or a filesystem that’s always read-only (like ISO) might not implement any of the mutating methods. The subclass could customize __new__ quite trivially. Also, Path is a path on the local filesystem; if making subclass of it would represent something else, I’d call it a violation of the Liskov substitution principle.
For methods like glob and {read,write}_{text,_bytes}, pathlib could expose some common helper functions for subclasses to use.

ISTM that some non-trivial thought (or just checking?) would also need to go into what happens when calling joinpath on/with disparate Path objects.

(FWIW, I’ve played around with similar ideas working on githpathlib, a partially pathlib-compatible library for accessing Git repos – i.e. you can read data from a commit directly from Git’s database, without having to checkout all of the commit’s files on disk.)

barneygale · February 24, 2020, 3:33pm

Thank you both for the thoughtful and detailed feedback!

It would be nice if you could directly modify pathlib and post the diff (or PR) somewhere. Some parts of pathlib use shortcurts for speed (especially Path construction), and I’m curious if you can retain the shortcuts while respecting the accessor abstraction.

I will give this a go! Should have a branch within a week or two.

I would advise making it easier to subclass PurePath , with the subclass supplying the filesystem access methods just like pathlib.Path does.

For me, a lot of the appeal of pathlib is that you get a lot of functionality out of few underlying os calls. Yes, you can create your own PurePath subclass with methods is_file, is_dir, is_symlink, is_char_device, is_block_device, is_mount etc, but it’s much simpler to implement Accessor.stat() instead.

Why? Subclasses aren’t likely to really support the full Path API; for example home() or expanduser() doesn’t quite make sense in a zip file, or a filesystem that’s always read-only (like ISO) might not implement any of the mutating methods.

Similarly, Path.owner() and Path.group() raise NotImplementedError on Windows. ISO files aren’t always read-only AFAIK - a quick test with file-roller shows I can edit ISO files. I don’t think there’s a clear division between a full path API and a partial one, unless we’re treating POSIX as gospel.

Also, Path is a path on the local filesystem; if making subclass of it would represent something else, I’d call it a violation of the Liskov substitution principle.

True. Suggest we provide AbstractPath or UserPath or something, with Path as a subclass.

ISTM that some non-trivial thought (or just checking?) would also need to go into what happens when calling joinpath on/with disparate Path objects.

Indeed! This isn’t something I’ve given much consideration yet. Will have a think.

Thanks for the gitpathlib link, cool stuff! I’d also draw folks attention to the s3path package.

pf_moore · February 24, 2020, 4:44pm

FWIW, pyfilesystem is a much more comprehensive “extensible filesystem abstraction” API. It’s been round for some time (longer than pathlib, I believe) and probably has some ideas worth investigating. But making pathlib more extensible is definitely an idea worth exploring in its own right.

barneygale · February 25, 2020, 1:44am

I wasn’t previously aware of pyfilesystem, looks really handy with lots of fine-grained control. The essential methods to implement are:

getinfo() Get info regarding a file or directory.

listdir() Get a list of resources in a directory.

makedir() Make a directory.

openbin() Open a binary file.

remove() Remove a file.

removedir() Remove a directory.

setinfo() Set resource information.

There are also non-essential methods which “have a default implementation in the base class, but may be overridden if you can supply a more optimal version.”

The author(s) have designed a nice, fine-grained API for FS operations, and so avoid some of the kinks like emulating os.stat(). That said, I’d suggest that modeling the accessor API on os has certain advantages:

The os module (and perhaps Unix APIs more generally) are already somewhat familiar to users
I worry that attempting to design a filesystem API that properly defines all reasonable capabilities and variations in support while remaining accessible is a particularly difficult task.
It’s what pathlib does already, so we don’t need to do much refactoring

My proposal is that we keep the (more-or-less) os-like API for accessors, and perhaps provide some default implementations (e.g. Accessor.listdir() calls Accessor.scandir()). That said, this is my first attempt and contributing to cpython and I’m not sure how big we’re allowed to dream!

encukou · February 25, 2020, 12:58pm

I suggest you copy pathlib and its tests, and publish it on PyPI with a new name (maybe pathlab.pathlib?). Then, add the new features, tests, documentation, test it in the real world, and after that propose changes into pathlib.
This way, you’re not blocked by anyone until the very end, it can be tested easily, and the improvements will be available even for older versions of Python.

(This is not meant to push you away! It’s a comfortable branch-merge workflow for big changes. Core devs also use similar workflows, see for example importlib-metadata, asyncio/tulip. Or compileall improvements from my colleagues.)

barneygale · February 25, 2020, 2:18pm

Great, thank you! I suppose my main aim with this thread was to gauge support for the idea and get help identifying possible implementation roadblocks. Thanks for the help, all. I think I’m right in saying there’s some cautious interest in the idea. I’ll now set up a branch of pathlib and write an initial implementation. This will probably lack some of the possible niceties mentioned in my original post - like a user-oriented version of os.stat_result.

barneygale · February 25, 2020, 9:33pm

For anyone interested in following my progress, my first attempt is available in a branch here: https://github.com/barneygale/cpython/tree/pathlib-abc. Currently lacking: tests, docs, and niceties like a one-liner to generate a UserPath subclass with a particular accessor instance bound.

barneygale · February 29, 2020, 6:02am

Hi all,

I think I’ve now implemented as much as I can without further design input. You can see my changes here.

Noddy example of how this can be used:

import pathlib
import stat
import os

class MyPath(pathlib.UserPosixPath):
    pass

class MyAccessor(pathlib.AbstractAccessor):
    path_type = MyPath

    def __init__(self, children):
        self.children = children

    def stat(self, path, *, follow_symlinks=True):
        return os.stat_result((stat.S_IFDIR, 0, 0, 0, 0, 0, 0, 0, 0, 0))

    def scandir(self, path):
        for child in self.children:
            yield self.path_type(path, child)

def main():
    accessor = MyAccessor(('foo', 'bar'))
    MyPath = accessor.path_type
    for p in (MyPath('/'), MyPath('/mlem')):
        print(p)
        print(repr(p))
        print(list(p.iterdir()))
        print(p.exists())
        print(p.is_file())
        print(p.is_dir())

main()

Changes:

Added UserPath, UserPosixPath and UserWindowsPath classes. These are intended to be subclassed by users.
Renamed _Accessor to AbstractAccessor
Renamed _NormalAccessor to NormalAccessor
Renamed _normal_accessor to normal_accessor
Added AbstractAccessor.path_type class attribute. In user subclasses of AbstractAccessor, this can be set to (a subclass of) UserPosixPath or UserWindowsPath. After an accessor is instantiated, accessor.path_type points to a subclass associated with the accessor instance. I’m not 100% sold on this tbh.
Added accessor.getcwd()
Added accessor.expanduser() (replaces flavour.gethomedir())
Added accessor.fsencode()
Added accessor.fspath(), which is called to implement __fspath__ and might perform a download/extraction in subclasses.
Added accessor.owner() and accessor.group()
Added accessor.touch()
Changed accessor.open() such that it is now expected to function like io.open() and not os.open()
Renamed accessor.link_to() to accessor.link() (for consistency with os)
Removed accessor.listdir() (now uses scandir())
Removed accessor.lstat() (now uses stat())
Removed accessor.lchmod() (now uses chmod())
Removed support for using a Path object as a context manager (bpo-39682)

I’d appreciate some feedback on the work so far - does it look generally OK? Am I breaking things I shouldn’t be?

barneygale · March 10, 2020, 3:36pm

Hi all,

I’ve put in a number of pull requests that lay some groundwork. There’s a small spreadsheet summarizing the changes here. Most of the changes are geared towards moving impure functionality into the _NormalAccessor class.

Question: how should __fspath__() work in Path subclasses? Is it reasonable to delegate to the accessor (with a default implementation of str(path)), or should __fspath__() stay as it is?

My initial feeling was that __fspath__() should return a path that obeys local filesystem semantics, and therefore it’s reasonable to delegate to the accessor in case the accessor-writer wants to download/extract the file and return a local filesystem path. But re-reading PEP 519, and considering that PurePosixPath and PureWindowsPath are path-like irrespective of local system, I’m now leaning towards __fspath__() always being implemented as return str(self), as it is currently.

Does this sound right? And does the same argument apply to Path.__bytes__()?

Thanks

pitrou · March 10, 2020, 6:48pm

Well… that’s ultimately the implementer’s responsibility, but I view __fspath__ as a light-weight, side-effect free inquiry. If it’s necessary to do a background copy to return a local filesystem path, then perhaps you shouldn’t implement __fspath__ at all, and instead let the user open the file (or a middle layer) explicitly for reading.

pitrou · March 10, 2020, 6:50pm

It does look generally ok, though validation will come through actually running the test suite against those changes.

One question about your example: are you expecting users to have knowledge of the Accessor class or subclass? For me, that was purely an implementation detail.

barneygale · March 10, 2020, 8:38pm

Thank you!

On __fspath__(): I think I agree, thanks. I’ll withdraw the relevant bug report + PR.

On exposing Accessor, I had two ideas:

Expose and document Accessor, and make this the principal class to subclass for implementing a custom Path subclass. It’s a kinda handy to keep all the filesystem-accessing stuff separate, and keep Path objects as an immutable view. It also means users don’t have to deal with any __new__ magic, because accessors are pretty boring :-). But it does mean users would probably need to subclass both Accessor and Path (or to call some other factory function, like MyPath = accessor.path_type in the example above), which might be asking too much.
Move all Accessor methods into Path, but this might make it awkward to keep state around, and I’m not currently a fan of this idea.

I appreciate this wasn’t the original intention for accessors, but they seem to fit this problem pretty nicely, and provide a path forward without any big rewrites.

pitrou · March 10, 2020, 8:51pm

I agree with making Accessor a hook for subclass implementers. Your example seemed to show it being invoked by the end user, though, which doesn’t sound necessary.

barneygale · March 10, 2020, 8:55pm

I’m imagining that a single Accessor instance is created by the user for a particular filesystem, e.g.:

accessor = S3Accessor(
    username='aws_user', 
    token='aws_token', 
    bucket='aws_bucket')
S3Path = accessor.path_type  # Evaluates to a `Path` subclass with this particular accessor instance bound
root = S3Path('/')
assert root.exists()

pitrou · March 10, 2020, 9:15pm

I’m a bit skeptical about this. My first hunch is that implementers can create another public-facing abstraction, for example a FileSystem object. If we expose Accessor to the public, people will inevitably be using it directly and we may end up more constrained that it we make it an implementer-only API (where we can probably be less strict when it comes to e.g. API stability).

barneygale · March 10, 2020, 10:10pm

Very fair. If it helps, I was planning to only expose AbstractAccessor, and keep _NormalAccessor private.

Authors of libraries that extend pathlib may still choose not to expose their custom accessor to user code, so a (theoretical) s3lib library could be used like this:

S3Path = s3lib.make_path_type(bucket='...')
root = S3Path('/')

and under-the-hood:

def make_path_type(bucket):
    class S3Path(pathlib.UserPath):
        _accessor = _S3Accessor(bucket)
    return S3Path

Marco_Sulla · March 10, 2020, 11:28pm

What about an ABC class?

barneygale · March 29, 2020, 8:12pm

Some renames that might make the API clearer:

_Accessor → AbstractFileSystem
_NormalAccessor → LocalFileSystem

I’d suggest both AbstractFileSystem and LocalFileSystem gets a __new__ method that prevents direct instantiation (subclass + instantiate would be fine) to guard against direct usage, per @pitrou

The key question for me is:

I have this FileSystem object stored in a myfs variable, how can I get a Path type that uses it?

I personally haven’t come up with anything good here. Options include:

Leave it entirely up to the user or library to construct a Path type with their FileSystem instance attached
myfs.Path(...) - Path is an attribute that stores a subclass of pathlib.Path with our FS instance bound.
pathlib.make_path_type(myfs)
pathlib.Path(..., fs=myfs)
Do away with FileSystem (i.e. accessors) altogether, and merge their methods into Path?
Something else?

Thanks!