Make pathlib extensible

I suggest you copy pathlib and its tests, and publish it on PyPI with a new name (maybe pathlab.pathlib?). Then, add the new features, tests, documentation, test it in the real world, and after that propose changes into pathlib.
This way, you’re not blocked by anyone until the very end, it can be tested easily, and the improvements will be available even for older versions of Python.

(This is not meant to push you away! It’s a comfortable branch-merge workflow for big changes. Core devs also use similar workflows, see for example importlib-metadata, asyncio/tulip. Or compileall improvements from my colleagues.)

Great, thank you! I suppose my main aim with this thread was to gauge support for the idea and get help identifying possible implementation roadblocks. Thanks for the help, all. I think I’m right in saying there’s some cautious interest in the idea. I’ll now set up a branch of pathlib and write an initial implementation. This will probably lack some of the possible niceties mentioned in my original post - like a user-oriented version of os.stat_result.

For anyone interested in following my progress, my first attempt is available in a branch here: https://github.com/barneygale/cpython/tree/pathlib-abc. Currently lacking: tests, docs, and niceties like a one-liner to generate a UserPath subclass with a particular accessor instance bound.


Hi all,

I think I’ve now implemented as much as I can without further design input. You can see my changes here.

Noddy example of how this can be used:

import pathlib
import stat
import os

class MyPath(pathlib.UserPosixPath):
    pass

class MyAccessor(pathlib.AbstractAccessor):
    path_type = MyPath

    def __init__(self, children):
        self.children = children

    def stat(self, path, *, follow_symlinks=True):
        return os.stat_result((stat.S_IFDIR, 0, 0, 0, 0, 0, 0, 0, 0, 0))

    def scandir(self, path):
        for child in self.children:
            yield self.path_type(path, child)

def main():
    accessor = MyAccessor(('foo', 'bar'))
    MyPath = accessor.path_type
    for p in (MyPath('/'), MyPath('/mlem')):
        print(p)
        print(repr(p))
        print(list(p.iterdir()))
        print(p.exists())
        print(p.is_file())
        print(p.is_dir())

main()

Changes:

  • Added UserPath, UserPosixPath and UserWindowsPath classes. These are intended to be subclassed by users.
  • Renamed _Accessor to AbstractAccessor
  • Renamed _NormalAccessor to NormalAccessor
  • Renamed _normal_accessor to normal_accessor
  • Added AbstractAccessor.path_type class attribute. In user subclasses of AbstractAccessor, this can be set to (a subclass of) UserPosixPath or UserWindowsPath. After an accessor is instantiated, accessor.path_type points to a subclass associated with the accessor instance. I’m not 100% sold on this tbh.
  • Added accessor.getcwd()
  • Added accessor.expanduser() (replaces flavour.gethomedir())
  • Added accessor.fsencode()
  • Added accessor.fspath(), which is called to implement __fspath__ and might perform a download/extraction in subclasses.
  • Added accessor.owner() and accessor.group()
  • Added accessor.touch()
  • Changed accessor.open() such that it is now expected to function like io.open() and not os.open()
  • Renamed accessor.link_to() to accessor.link() (for consistency with os)
  • Removed accessor.listdir() (now uses scandir())
  • Removed accessor.lstat() (now uses stat())
  • Removed accessor.lchmod() (now uses chmod())
  • Removed support for using a Path object as a context manager (bpo-39682)

I’d appreciate some feedback on the work so far - does it look generally OK? Am I breaking things I shouldn’t be?


Hi all,

I’ve put in a number of pull requests that lay some groundwork. There’s a small spreadsheet summarizing the changes here. Most of the changes are geared towards moving impure functionality into the _NormalAccessor class.

Question: how should __fspath__() work in Path subclasses? Is it reasonable to delegate to the accessor (with a default implementation of str(path)), or should __fspath__() stay as it is?

My initial feeling was that __fspath__() should return a path that obeys local filesystem semantics, and therefore it’s reasonable to delegate to the accessor in case the accessor-writer wants to download/extract the file and return a local filesystem path. But re-reading PEP 519, and considering that PurePosixPath and PureWindowsPath are path-like irrespective of local system, I’m now leaning towards __fspath__() always being implemented as return str(self), as it is currently.

Does this sound right? And does the same argument apply to Path.__bytes__()?
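As a quick check of the point about pure paths, here is a small sketch that uses only the existing pathlib API. It shows that pure paths of either flavour are already path-like on any host system, with `__fspath__()` agreeing with `str()` and `__bytes__()` agreeing with `os.fsencode()`. The example paths are made up.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths never touch the filesystem, yet both flavours are path-like
# regardless of the operating system this runs on.
posix = PurePosixPath("/srv/data/report.txt")
windows = PureWindowsPath("C:/Users/me/report.txt")

# os.fspath() calls __fspath__(), which gives the same text as str().
assert os.fspath(posix) == str(posix) == "/srv/data/report.txt"
assert os.fspath(windows) == str(windows) == r"C:\Users\me\report.txt"

# bytes() calls __bytes__(), which encodes the same string.
assert bytes(posix) == os.fsencode(str(posix))
```

Neither call has side effects, which is the behaviour a download-on-`__fspath__()` accessor hook would have changed.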

Thanks


Well… that’s ultimately the implementer’s responsibility, but I view __fspath__ as a light-weight, side-effect free inquiry. If it’s necessary to do a background copy to return a local filesystem path, then perhaps you shouldn’t implement __fspath__ at all, and instead let the user open the file (or a middle layer) explicitly for reading.


It does look generally ok, though validation will come through actually running the test suite against those changes.

One question about your example: are you expecting users to have knowledge of the Accessor class or subclass? For me, that was purely an implementation detail.


Thank you!

On __fspath__(): I think I agree, thanks. I’ll withdraw the relevant bug report + PR.

On exposing Accessor, I had two ideas:

  1. Expose and document Accessor, and make this the principal class to subclass for implementing a custom Path subclass. It’s kinda handy to keep all the filesystem-accessing stuff separate, and keep Path objects as an immutable view. It also means users don’t have to deal with any __new__ magic, because accessors are pretty boring :-). But it does mean users would probably need to subclass both Accessor and Path (or to call some other factory function, like MyPath = accessor.path_type in the example above), which might be asking too much.
  2. Move all Accessor methods into Path, but this might make it awkward to keep state around, and I’m not currently a fan of this idea.

I appreciate this wasn’t the original intention for accessors, but they seem to fit this problem pretty nicely, and provide a path forward without any big rewrites.


I agree with making Accessor a hook for subclass implementers. Your example seemed to show it being invoked by the end user, though, which doesn’t sound necessary.


I’m imagining that a single Accessor instance is created by the user for a particular filesystem, e.g.:

accessor = S3Accessor(
    username='aws_user', 
    token='aws_token', 
    bucket='aws_bucket')
S3Path = accessor.path_type  # Evaluates to a `Path` subclass with this particular accessor instance bound
root = S3Path('/')
assert root.exists()
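
One way accessor.path_type could evaluate to a bound subclass is sketched below. This is a guess at the mechanism, not the code in the branch: SketchAccessor, SketchPath and the `known` argument are made-up names, and PurePosixPath stands in for the proposed UserPath so that the sketch runs on stock Python.

```python
from pathlib import PurePosixPath


class SketchPath(PurePosixPath):
    """Hypothetical stand-in for UserPath: delegates I/O to an accessor."""

    _accessor = None  # filled in on the per-accessor subclass

    def exists(self):
        return self._accessor.exists(self)


class SketchAccessor:
    """Hypothetical accessor for a fake filesystem made of known paths."""

    def __init__(self, known):
        self.known = set(known)
        # Build a Path subclass with this accessor instance bound to it.
        self.path_type = type("BoundPath", (SketchPath,), {"_accessor": self})

    def exists(self, path):
        return str(path) in self.known


accessor = SketchAccessor(known={"/", "/data"})
BoundPath = accessor.path_type
root = BoundPath("/")
assert root.exists()
# Derived paths keep the same class, so they keep the same accessor.
assert (root / "data").exists()
assert not (root / "missing").exists()
```

Because the accessor lives on the class rather than on each instance, joining or taking a parent preserves it without any extra bookkeeping.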

I’m a bit skeptical about this. My first hunch is that implementers can create another public-facing abstraction, for example a FileSystem object. If we expose Accessor to the public, people will inevitably be using it directly and we may end up more constrained than if we make it an implementer-only API (where we can probably be less strict when it comes to e.g. API stability).


Very fair. If it helps, I was planning to only expose AbstractAccessor, and keep _NormalAccessor private.

Authors of libraries that extend pathlib may still choose not to expose their custom accessor to user code, so a (theoretical) s3lib library could be used like this:

S3Path = s3lib.make_path_type(bucket='...')
root = S3Path('/')

and under-the-hood:

def make_path_type(bucket):
    class S3Path(pathlib.UserPath):
        _accessor = _S3Accessor(bucket)
    return S3Path

What about an ABC class?


Some renames that might make the API clearer:

  • _Accessor --> AbstractFileSystem
  • _NormalAccessor --> LocalFileSystem

I’d suggest both AbstractFileSystem and LocalFileSystem get a __new__ method that prevents direct instantiation (subclass + instantiate would be fine) to guard against direct usage, per @pitrou.
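
A minimal sketch of such a guard, assuming the proposed names. Neither class exists in pathlib today, and MyFileSystem and its `root` argument are purely illustrative.

```python
class AbstractFileSystem:
    """Proposed name from this thread; not an existing pathlib class."""

    def __new__(cls, *args, **kwargs):
        # Refuse direct instantiation, but let subclasses through.
        if cls is AbstractFileSystem:
            raise TypeError("AbstractFileSystem must be subclassed")
        return super().__new__(cls)


class LocalFileSystem(AbstractFileSystem):
    """Proposed name from this thread; not an existing pathlib class."""

    def __new__(cls, *args, **kwargs):
        if cls is LocalFileSystem:
            raise TypeError("LocalFileSystem must be subclassed")
        return super().__new__(cls)


class MyFileSystem(LocalFileSystem):
    """Hypothetical user subclass, which can be instantiated normally."""

    def __init__(self, root):
        self.root = root


fs = MyFileSystem("/srv")
assert fs.root == "/srv"
```

Each guard compares against its own class, so a subclass of LocalFileSystem passes both checks while the two library classes themselves stay uninstantiable.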

The key question for me is:

I have this FileSystem object stored in a myfs variable, how can I get a Path type that uses it?

I personally haven’t come up with anything good here. Options include:

  1. Leave it entirely up to the user or library to construct a Path type with their FileSystem instance attached
  2. myfs.Path(...) - Path is an attribute that stores a subclass of pathlib.Path with our FS instance bound.
  3. pathlib.make_path_type(myfs)
  4. pathlib.Path(..., fs=myfs)
  5. Do away with FileSystem (i.e. accessors) altogether, and merge their methods into Path?
  6. Something else?

Thanks!

Here’s the current state of play. I’ve opened a bunch of bug reports and PRs that lay the groundwork for making pathlib extensible, but avoid any big design decisions. They are as follows:

Already in review and nearing completion (thanks @pitrou and @brettcannon!):

  • PR 18909: Add Path.hardlink_to() method
  • PR 18846: Remove Path context manager support

Awaiting review:

  • PR 19220: Simplify handling of missing os methods
  • PR 18839: Remove needless cast to Path in Path.is_mount()
  • PR 18836: Use accessor to implement Path.samefile()
  • PR 18838: Use accessor to implement Path.touch()
  • PR 18834: Use accessor to implement Path.cwd() and Path.absolute()
  • PR 18844: Use accessor to implement Path.owner() and Path.group()
  • PR 18841: Use accessor to implement Path.home() and Path.expanduser()
  • PR 18864: Add follow_symlinks parameter to Path.stat() and Path.chmod()

Currently blocked by other tickets:

  • bpo-40038: Remove Path._init()
  • bpo-39895: Make _Accessor.open wrap io.open rather than os.open

Once these land I’ll put in a really simple PR that exposes accessors to the public and adds docs. Hopefully that can act as a launch pad for some further discussion on the design.

I don’t want those to become public “filesystem” APIs. There are filesystem APIs out there on PyPI that people can use. _Accessor wasn’t designed for that. Actually, it wasn’t designed at all. It’s just an implementation detail, and should remain so.

The _Accessor API is almost a strict subset of the os API. I personally think this is an ideal filesystem API to expose to users, as the functions they need to implement have familiar and exact equivalents in the os module. In my PRs I’ve also removed some accessor functions like lstat(), lchmod() and utime() to reduce the surface area for implementors.

I’m also keen to ease the migration for existing subclassers of _Accessor, and to avoid the heartache of designing a “good” filesystem API when the os module already does the job well enough.

I’m having another look at implementing this by removing accessors altogether - which is hopefully more up your street, @pitrou. The idea is to add a UserPath, which Path then subclasses. Certain methods (e.g. stat()) will raise NotImplementedError, whereas others like glob() - which rely on those lower-level functions and don’t directly use os - have default implementations.

The tricky bits are:

  • How do I instantiate my subclass of UserPath and pass it some state/context that is then retained in ‘derived’ paths, like children of a directory? My current thinking is that we can support passing arbitrary keyword arguments to the constructor, like MyPathClass('/foo/bar', fileobj=myfileobj). The constructor (__new__()) calls cls = cls.bind(**kwargs), which allows a subclass to return a new class with the context attached as a class variable. This makes the context naturally ‘sticky’, but I wonder if there’s a better approach using metaclasses. Note that pathlib.Path currently accepts and discards keyword arguments.
  • What do we do with UserPath.stat() and methods that call it, like exists() and is_dir()? This currently returns an os.stat_result, which is a little tricky to initialize in user code. A helper function or a compatible pure-Python class using dataclasses might be nice.
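
The bind() idea from the first bullet can be sketched roughly as follows. This is a toy class rather than a pathlib subclass: SketchPath, its `context` attribute and the `fileobj` value are all assumptions made for illustration, and the real constructor would have more to do.

```python
import posixpath


class SketchPath:
    """Toy stand-in for the proposed UserPath; not a pathlib class."""

    context = None  # replaced on subclasses returned by bind()

    def __new__(cls, *segments, **kwargs):
        if kwargs:
            # Swap in a subclass that carries the context.
            cls = cls.bind(**kwargs)
        self = object.__new__(cls)
        self._path = posixpath.join(*segments) if segments else "."
        return self

    @classmethod
    def bind(cls, **kwargs):
        # Return a new class with the context attached as a class variable.
        return type(cls.__name__, (cls,), {"context": dict(kwargs)})

    def __truediv__(self, name):
        # Derived paths reuse type(self), so the context sticks.
        return type(self)(self._path, name)

    def __str__(self):
        return self._path


path = SketchPath("/foo", fileobj="archive.zip")
child = path / "bar"
assert str(child) == "/foo/bar"
assert child.context == {"fileobj": "archive.zip"}
```

The base class itself is left untouched, so unrelated SketchPath instances created without keyword arguments see no context.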

Hopefully this is a more amenable approach!

A long overdue update!

With help from my reviewers the following changes have been merged:

  • PR 18846: Remove Path context manager support
  • PR 18839: Remove needless cast to Path in Path.is_mount()
  • PR 18836: Use accessor to implement Path.samefile()
  • PR 18844: Use accessor to implement Path.owner() and Path.group()

Awaiting merge:

  • PR 19220: Simplify handling of missing os methods
  • PR 18838: Use accessor to implement Path.touch()

Awaiting review:

  • PR 19342: Remove partial support for preserving accessor when modifying a path
  • PR 18841: Use accessor to implement Path.home() and Path.expanduser()
  • PR 18834: Use accessor to implement Path.cwd() and Path.absolute()
  • PR 18909: Add Path.hardlink_to() method
  • PR 18864: Add follow_symlinks parameter to Path.stat() and Path.chmod()

Upcoming:

  • Remove Path._opener() and Path._raw_open() (bpo-40107)
  • Add username parameter to Path.home()

Once these are merged, my plan from there is as follows:

  1. Ensure each accessor method is only called from one place, e.g. either absolute() or cwd() should make a getcwd() accessor call, but not both.
  2. Move any functions that don’t make accessor calls to a new UserPath class that derives from PurePath. This includes things like glob(), read_text(), is_dir(). For any functions that do make accessor calls, add equivalent functions to UserPath that simply raise NotImplementedError. Make Path derive from UserPath.
  3. Remove accessors altogether - replace with direct calls of os functions

With all this done, we end up with a new pathlib.UserPath class with the following abstract methods:

  • ‘read’ operations: stat(), owner(), group(), iterdir(), readlink(), cwd(), home()
  • ‘create’ operations: touch(), mkdir(), symlink_to(), hardlink_to()
  • ‘move’ / ‘delete’ operations: rename(), replace(), unlink(), rmdir()
  • other operations: open(), chmod()

For local filesystem paths, these methods are implemented in pathlib.Path (a subclass).

UserPath contains implementations of methods that rely on these abstract methods, e.g. is_*, read_*, write_*, glob, exists, absolute, expanduser.

With this you can write your own subclass of UserPath that implements some of these methods (probably at least stat() and iterdir()). There’s still plenty of docs work and a couple of niceties to do after that.
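
To give a rough idea of the shape, here is a sketch of the split between abstract and derived methods. It is a toy model and not the planned implementation: SketchUserPath and DictPath are made-up names, the `tree` dictionary is a fake filesystem, and only stat(), iterdir(), exists(), is_dir() and is_file() are modelled.

```python
import os
import stat


class SketchUserPath:
    """Toy stand-in for the proposed pathlib.UserPath."""

    def __init__(self, path):
        self.path = path

    # Abstract methods: subclasses are expected to override these.
    def stat(self):
        raise NotImplementedError

    def iterdir(self):
        raise NotImplementedError

    # Derived methods: written once, in terms of stat().
    def exists(self):
        try:
            self.stat()
        except FileNotFoundError:
            return False
        return True

    def is_dir(self):
        return self.exists() and stat.S_ISDIR(self.stat().st_mode)

    def is_file(self):
        return self.exists() and stat.S_ISREG(self.stat().st_mode)


class DictPath(SketchUserPath):
    """Hypothetical subclass backed by a dict instead of a real disk."""

    # Directories map to a list of child names, files map to None.
    tree = {"/": ["a.txt", "sub"], "/sub": [], "/a.txt": None}

    def stat(self):
        if self.path not in self.tree:
            raise FileNotFoundError(self.path)
        is_dir = self.tree[self.path] is not None
        mode = stat.S_IFDIR if is_dir else stat.S_IFREG
        # os.stat_result needs a 10-item sequence; only st_mode is meaningful.
        return os.stat_result((mode, 0, 0, 0, 0, 0, 0, 0, 0, 0))

    def iterdir(self):
        for name in self.tree[self.path]:
            yield DictPath(self.path.rstrip("/") + "/" + name)


assert DictPath("/").is_dir()
assert DictPath("/a.txt").is_file()
assert not DictPath("/nope").exists()
```

Building the os.stat_result by hand is the awkward part, which is why a helper or a friendlier result class keeps coming up in this thread.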

Feedback very welcome :-)


A bit more thinking out loud.

I’m playing around with this class hierarchy:

The superclass of Path is now named AbstractPath. Its API is identical to Path, except some methods raise NotImplementedError.

Note: What I referred to as UserPath in my previous post is now AbstractPath

I’ve added a new class called AbstractUserPath into the mix here, which simplifies some of the AbstractPath API. Specifically:

  • stat() is expected to return UserFileAttributes, which adds fields for owner, group and directory children
  • Adds create(), move() and delete() abstract methods
  • owner(), group() and iterdir() are implemented via stat()
  • touch(), mkdir(), symlink_to() and hardlink_to() are implemented via create()
  • rename() and replace() are implemented via move()
  • unlink() and rmdir() are implemented via delete().

Thus AbstractUserPath would have the following abstract methods: home(), cwd(), stat(), chmod(), open(), create(), move(), delete()
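
A rough sketch of how those methods might funnel into the three primitives. Everything about the signatures is an assumption: the `kind` and `target` parameters, the fields of UserFileAttributes beyond owner, group and children, and the RecordingPath subclass are invented for illustration, and home(), cwd(), chmod() and open() are left out.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class UserFileAttributes:
    """Hypothetical return type of stat()."""
    is_dir: bool = False
    owner: str = ""
    group: str = ""
    children: list = field(default_factory=list)


class SketchAbstractUserPath:
    """Toy stand-in for the proposed AbstractUserPath."""

    def __init__(self, path):
        self.path = path

    # Primitives a subclass would implement (signatures are assumptions).
    def stat(self):
        raise NotImplementedError

    def create(self, kind, target=None):
        raise NotImplementedError

    def move(self, target):
        raise NotImplementedError

    def delete(self):
        raise NotImplementedError

    # Implemented via stat().
    def owner(self):
        return self.stat().owner

    def group(self):
        return self.stat().group

    def iterdir(self):
        for name in self.stat().children:
            yield type(self)(posixpath.join(self.path, name))

    # Implemented via create().
    def touch(self):
        self.create("file")

    def mkdir(self):
        self.create("dir")

    def symlink_to(self, target):
        self.create("symlink", target)

    def hardlink_to(self, target):
        self.create("hardlink", target)

    # Implemented via move().
    def rename(self, target):
        self.move(target)

    def replace(self, target):
        self.move(target)

    # Implemented via delete().
    def unlink(self):
        self.delete()

    def rmdir(self):
        self.delete()


class RecordingPath(SketchAbstractUserPath):
    """Hypothetical subclass that records calls instead of doing I/O."""

    log = []  # shared class-level log, for demonstration only

    def stat(self):
        return UserFileAttributes(True, "alice", "staff", ["a", "b"])

    def create(self, kind, target=None):
        self.log.append(("create", self.path, kind))

    def move(self, target):
        self.log.append(("move", self.path, target))

    def delete(self):
        self.log.append(("delete", self.path))


p = RecordingPath("/x")
p.touch()
p.rename("/y")
p.unlink()
assert RecordingPath.log == [
    ("create", "/x", "file"), ("move", "/x", "/y"), ("delete", "/x")]
```

A subclass written this way implements four small methods and gets the familiar Path vocabulary on top of them.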