Make pathlib extensible

Here’s the current state of play. I’ve opened a bunch of bug reports and PRs that lay the groundwork for making pathlib extensible, but avoid any big design decisions. They are as follows:

Already in review and nearing completion (thanks @pitrou and @brettcannon!):

  • PR 18909: Add Path.hardlink_to() method
  • PR 18846: Remove Path context manager support

Awaiting review:

  • PR 19220: Simplify handling of missing os methods
  • PR 18839: Remove needless cast to Path in Path.is_mount()
  • PR 18836: Use accessor to implement Path.samefile()
  • PR 18838: Use accessor to implement Path.touch()
  • PR 18834: Use accessor to implement Path.cwd() and Path.absolute()
  • PR 18844: Use accessor to implement Path.owner() and Path.group()
  • PR 18841: Use accessor to implement Path.home() and Path.expanduser()
  • PR 18864: Add follow_symlinks parameter to Path.stat() and Path.chmod()

Currently blocked by other tickets:

  • bpo-40038: Remove Path._init()
  • bpo-39895: Make _Accessor.open wrap io.open rather than os.open

Once these land I’ll put in a really simple PR that exposes accessors to the public and adds docs. Hopefully that can act as a launch pad for some further discussion on the design.

I don’t want those to become public “filesystem” APIs. There are filesystem APIs out there on PyPI that people can use. _Accessor wasn’t designed for that. Actually, it wasn’t designed at all. It’s just an implementation detail, and should remain so.

The _Accessor API is almost a strict subset of the os API. I personally think this is an ideal filesystem API to expose to users, as the functions they need to implement have familiar and exact equivalents in the os module. In my PRs I’ve also removed some accessor functions like lstat(), lchmod() and utime() to reduce the surface area for implementors.

I’m also keen to ease the migration for existing subclassers of _Accessor, and to avoid the heartache of designing a “good” filesystem API when the os module already does the job well enough.

I’m having another look at implementing this by removing accessors altogether, which is hopefully more up your street, @pitrou. The idea is to add a UserPath class, which Path then subclasses. Certain methods (e.g. stat()) will raise NotImplementedError, whereas others like glob(), which rely on those lower-level methods and don’t directly use os, have default implementations.

The tricky bits are:

  • How do I instantiate my subclass of UserPath and pass it some state/context that is then retained in ‘derived’ paths, like children of a directory? My current thinking is that we can support passing arbitrary keyword arguments to the constructor, like MyPathClass('/foo/bar', fileobj=myfileobj). The constructor (__new__()) calls cls = cls.bind(**kwargs), which allows a subclass to return a new class with the context attached as a class variable. This makes the context naturally ‘sticky’, but I wonder if there’s a better approach using metaclasses. Note that pathlib.Path currently accepts and discards keyword arguments.
  • What do we do with UserPath.stat() and methods that call it, like exists() and is_dir()? This currently returns an os.stat_result, which is a little tricky to initialize in user code. A helper function or a compatible pure-Python class using dataclasses might be nice.
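
One way to picture the bind() idea is the toy class below. It is a minimal sketch, not pathlib code: BoundPath, its path and context attributes, and the string joining in __truediv__ are all made up for illustration. Only the bind(**kwargs) hook and the fileobj keyword come from the description above.

```python
class BoundPath:
    """Toy stand-in for the proposed UserPath (hypothetical, not pathlib)."""

    context = None  # replaced on the bound subclass created by bind()

    def __new__(cls, path, **kwargs):
        if kwargs:
            # Swap in a subclass that carries the context as a class variable.
            cls = cls.bind(**kwargs)
        self = object.__new__(cls)
        self.path = path
        return self

    @classmethod
    def bind(cls, **kwargs):
        # A subclass could override this to validate or transform the context.
        return type(cls.__name__, (cls,), {"context": kwargs})

    def __truediv__(self, name):
        # type(self) is the bound subclass, so the context is inherited
        # without being passed again.
        return type(self)(self.path.rstrip("/") + "/" + name)


parent = BoundPath("/foo/bar", fileobj="<open tarfile>")
child = parent / "baz"
print(child.path, child.context)
```

The child shares the parent's class, so the context stays ‘sticky’ without any per-instance plumbing. The cost is one new class per bound context, which is part of why a metaclass-based alternative might be worth exploring.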

Hopefully this is a more amenable approach!

A long overdue update!

With help from my reviewers the following changes have been merged:

  • PR 18846: Remove Path context manager support
  • PR 18839: Remove needless cast to Path in Path.is_mount()
  • PR 18836: Use accessor to implement Path.samefile()
  • PR 18844: Use accessor to implement Path.owner() and Path.group()

Awaiting merge:

  • PR 19220: Simplify handling of missing os methods
  • PR 18838: Use accessor to implement Path.touch()

Awaiting review:

  • PR 19342: Remove partial support for preserving accessor when modifying a path
  • PR 18841: Use accessor to implement Path.home() and Path.expanduser()
  • PR 18834: Use accessor to implement Path.cwd() and Path.absolute()
  • PR 18909: Add Path.hardlink_to() method
  • PR 18864: Add follow_symlinks parameter to Path.stat() and Path.chmod()

Upcoming:

  • Remove Path._opener() and Path._raw_open() (bpo-40107)
  • Add username parameter to Path.home()

Once these are merged, my plan from there is as follows:

  1. Ensure each accessor method is only called from one place, e.g. either absolute() or cwd() should make a getcwd() accessor call, but not both.
  2. Move any functions that don’t make accessor calls to a new UserPath class that derives from PurePath. This includes things like glob(), read_text(), is_dir(). For any functions that do make accessor calls, add equivalent functions to UserPath that simply raise NotImplementedError. Make Path derive from UserPath.
  3. Remove accessors altogether - replace with direct calls of os functions

With all this done, we end up with a new pathlib.UserPath class with the following abstract methods:

  • ‘read’ operations: stat(), owner(), group(), iterdir(), readlink(), cwd(), home()
  • ‘create’ operations: touch(), mkdir(), symlink_to(), hardlink_to()
  • ‘move’ / ‘delete’ operations: rename(), replace(), unlink(), rmdir()
  • other operations: open(), chmod()

For local filesystem paths, these methods are implemented in pathlib.Path (a subclass).

UserPath contains implementations of methods that rely on these abstract methods, e.g. is_*, read_*, write_*, glob, exists, absolute, expanduser.

With this you can write your own subclass of UserPath that implements some of these methods (probably at least stat() and iterdir()). There’s still plenty of docs work and a couple of niceties to do after that.
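
To make that concrete, here is a rough sketch of what such a subclass could look like. UserPath does not exist yet, so the version below is a hypothetical stand-in built on PurePosixPath, and DictPath with its in-memory tree is invented purely for illustration. The point is only that exists() and is_dir() can be written once in terms of stat().

```python
import os
import stat
from pathlib import PurePosixPath


class UserPath(PurePosixPath):
    """Hypothetical base class: the low-level methods are abstract."""

    def stat(self):
        raise NotImplementedError

    def iterdir(self):
        raise NotImplementedError

    # Higher-level methods rely only on the abstract ones above.
    def exists(self):
        try:
            self.stat()
        except FileNotFoundError:
            return False
        return True

    def is_dir(self):
        try:
            return stat.S_ISDIR(self.stat().st_mode)
        except FileNotFoundError:
            return False


class DictPath(UserPath):
    """Toy backend: nested dicts are directories, bytes objects are files."""

    tree = {"docs": {"readme.txt": b"hello"}, "setup.py": b""}

    def _lookup(self):
        node = self.tree
        for part in self.parts:
            if part == "/":
                continue
            if not isinstance(node, dict) or part not in node:
                raise FileNotFoundError(str(self))
            node = node[part]
        return node

    def stat(self):
        node = self._lookup()
        if isinstance(node, dict):
            mode, size = stat.S_IFDIR | 0o755, 0
        else:
            mode, size = stat.S_IFREG | 0o644, len(node)
        # 10-tuple: mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime
        return os.stat_result((mode, 0, 0, 1, 0, 0, size, 0, 0, 0))

    def iterdir(self):
        node = self._lookup()
        if not isinstance(node, dict):
            raise NotADirectoryError(str(self))
        for name in node:
            yield self / name


root = DictPath("/")
print(sorted(p.name for p in root.iterdir()))
print((root / "docs").is_dir(), (root / "missing").exists())
```

Building the os.stat_result by hand from a 10-tuple works, but it is exactly the kind of awkwardness that a helper or a friendlier stat type would remove.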

Feedback very welcome :slight_smile:

A bit more thinking out loud.

I’m playing around with this class hierarchy:

The superclass of Path is now named AbstractPath. Its API is identical to Path, except some methods raise NotImplementedError.

Note: What I referred to as UserPath in my previous post is now AbstractPath

I’ve added a new class called AbstractUserPath into the mix here, which simplifies some of the AbstractPath API. Specifically:

  • stat() is expected to return UserFileAttributes, which adds fields for owner, group and directory children
  • Adds create(), move() and delete() abstract methods
  • owner(), group() and iterdir() are implemented via stat()
  • touch(), mkdir(), symlink_to() and hardlink_to() are implemented via create()
  • rename() and replace() are implemented via move()
  • unlink() and rmdir() are implemented via delete().

Thus AbstractUserPath would have the following abstract methods: home(), cwd(), stat(), chmod(), open(), create(), move(), delete()
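
As a very rough sketch of how the funnelling could work (the signatures of create(), move() and delete() are guesses made up for this example, as are the "kind" strings and the overwrite flag; nothing here is settled):

```python
class AbstractUserPath:
    """Toy illustration only; not part of pathlib."""

    def __init__(self, path):
        self.path = path

    # The three abstract high-level operations (hypothetical signatures).
    def create(self, kind, target=None):
        raise NotImplementedError

    def move(self, target, overwrite):
        raise NotImplementedError

    def delete(self, kind):
        raise NotImplementedError

    # pathlib-style methods funnelled into the operations above.
    def touch(self):
        self.create("file")

    def mkdir(self):
        self.create("dir")

    def symlink_to(self, target):
        self.create("symlink", target)

    def hardlink_to(self, target):
        self.create("hardlink", target)

    def rename(self, target):
        self.move(target, overwrite=False)

    def replace(self, target):
        self.move(target, overwrite=True)

    def unlink(self):
        self.delete("file")

    def rmdir(self):
        self.delete("dir")


class RecordingPath(AbstractUserPath):
    """Records calls instead of touching any filesystem."""

    def __init__(self, path):
        super().__init__(path)
        self.calls = []

    def create(self, kind, target=None):
        self.calls.append(("create", kind, target))

    def move(self, target, overwrite):
        self.calls.append(("move", target, overwrite))

    def delete(self, kind):
        self.calls.append(("delete", kind))


p = RecordingPath("/archive/member")
p.mkdir()
p.replace("/archive/other")
p.unlink()
print(p.calls)
```

An implementor then writes three methods instead of eight, at the price of baking in the assumption that files, directories and links can share one code path.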

Another diagram showing the broader ecosystem of classes, with some method calls shown:

Adding the AbstractPath class is the first phase. This provides a low-level user-extensible class.

Adding the *User* classes is the second phase, and much more open to discussion and design. The intention is to provide high-level user-extensible classes. I’ll prototype this in an external package.

That looks great!
I recommend default implementations for home() and cwd() that either return the vanilla Path or raise a suitable error. Those are related to the environment rather than just a filesystem, so it’s reasonable to assume exotic filesystems won’t have them.
And perhaps also make owner() , group(), symlink_to() and hardlink_to() raise by default, assuming that if they’re not implemented, the filesystem doesn’t support that operation. This would make error messages more consistent for the users of the various classes.
I could also imagine owner() & group() getting a default argument similar to dict.get's or getattr's, seeing that they sometimes raise KeyError even on vanilla Path.
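
A sketch of what that default argument could look like, written as a standalone helper rather than a change to pathlib. The name owner_or_default and the OrphanedFile stub are invented for this example; the only real behaviour relied on is that Path.owner() raises KeyError when the uid is not in the user database.

```python
_MISSING = object()  # sentinel, so that None can be a legitimate default


def owner_or_default(path, default=_MISSING):
    """Hypothetical helper mirroring dict.get(): path.owner(), or default."""
    try:
        return path.owner()
    except KeyError:
        # Raised by Path.owner() when the uid has no user database entry.
        if default is _MISSING:
            raise
        return default


class OrphanedFile:
    """Stub standing in for a path whose uid is not in the user database."""

    def owner(self):
        raise KeyError("uid not found")


print(owner_or_default(OrphanedFile(), default="unknown"))
```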

AbstractUserPath, on the other hand, looks like an unnecessary generalization. It seems to have a very specific set of assumptions (none of which hold for my use case, FWIW):

  • the filesystem stores owner (user/group) information
  • directories, links and regular files are treated similarly when creating, moving and deleting them
  • listing directory contents doesn’t slow down stat? (I might be misunderstanding this one)

If those don’t hold, AbstractUserPath could still be used, but it wouldn’t really help.
I suggest writing a few different concrete user paths first, and then seeing what could be simplified (with a superclass or otherwise).

Something like UserFileAttributes looks potentially useful, but I’d rather make it into a good standard stat_result lookalike for “exotic filesystems” that might not have all the data (or might have some extra data).
IMO, a stat result should be an immutable, data-only object that’s very fast to retrieve. Attributes like “user name” only make sense on it if they’re readily available. In general, I’d keep owner(), group() etc. as methods.
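
For illustration, an immutable, data-only lookalike could be as small as the frozen dataclass below. This is only a sketch: SimpleStatResult is a made-up name, and which fields to include, and whether missing data should be None, are open questions rather than anything decided in this thread.

```python
import stat
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class SimpleStatResult:
    """Hypothetical, immutable stand-in for os.stat_result."""

    st_mode: int                      # file type and permission bits
    st_size: int = 0
    st_mtime: Optional[float] = None  # None when the backend has no timestamp
    st_uid: Optional[int] = None      # None when there is no owner concept
    st_gid: Optional[int] = None


info = SimpleStatResult(st_mode=stat.S_IFDIR | 0o755)
print(stat.S_ISDIR(info.st_mode), info.st_uid)
```

Because the field names match os.stat_result, helpers such as stat.S_ISDIR() keep working on st_mode, while filesystems without owners or timestamps can simply leave those fields unset.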

I hope the post is helpful, even though I’m all talk and no action.

Thanks very much for the excellent feedback! Very helpful! I think I agree with all your points on AbstractPath. I found the AbstractUserPath assumptions to work reasonably well for my pet use cases (tar, zip, iso9660 and artifactory) but I’ll look again at what’s actually needed. If I still think the User classes are an important part, I’ll attempt to put together a well-argued case for them :slight_smile:

On a more procedural note, is there more I could be doing to speed this work along, or get others involved? I’m very grateful for Antoine and Brett’s time and expertise in code reviews, but so far only 4 of the ~15 PRs needed to get to AbstractPath have landed (list here). Maybe I should post to a python-ideas / python-dev mailing list? Is this worthy of a PEP? Or perhaps just an adjustment to my eagerness?

Thanks again

Another update on this work. I’ve been plugging away with bugs and PRs to improve the pathlib internals. Huge thanks to my reviewers and those who chipped in with suggestions.

We’re now in a position where:

  • Flavours do not access the filesystem.
  • Paths access the filesystem only via their accessor

The internal abstractions are now tight enough for us to start considering larger changes. The first of these is to remove accessors entirely!

We’re then in a great position to add an AbstractPath class; the core diff will be straightforward, but it opens several cans of worms I’ve mentioned previously (binding state, os.DirEntry compatibility, os.stat_result enhancements, impact on zipfile, tarfile). I’m leaning towards writing a PEP to cover these things.

Any thoughts on how different Path types should compare?

You can currently compare pure and non-pure paths of the same flavour, e.g. comparisons between PurePosixPath and PosixPath objects work as expected.

I’m not sure how to generalise this to user-defined AbstractPath subclasses. Some ideas:

  • Allow comparisons if one operand is a subclass of the other?
  • Disallow comparisons between different types in general, but special-case PosixPath and WindowsPath to allow comparisons with their pure variants?
  • Something else?
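
For reference, this is how the existing classes behave today. The snippet only exercises current pathlib behaviour; the variable names are mine.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

# Same flavour, pure vs concrete: equality and ordering both work.
same_flavour_equal = Path("a") == PurePath("a")
same_flavour_ordered = PurePath("a") < Path("b")

# Different flavours: never equal, and ordering raises TypeError.
cross_flavour_equal = PurePosixPath("a") == PureWindowsPath("a")
try:
    PurePosixPath("a") < PureWindowsPath("b")
except TypeError:
    cross_flavour_ordering_fails = True
else:
    cross_flavour_ordering_fails = False

print(same_flavour_equal, same_flavour_ordered)
print(cross_flavour_equal, cross_flavour_ordering_fails)
```

So the current rule is keyed on flavour rather than on the class itself, which is the part that doesn’t obviously carry over to user-defined subclasses.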

A quick update for anyone following along. @encukou and I had a chat about the work so far, the tentative plan, and the PRs currently open. This was immensely helpful, thank you Petr! The short term plan is as follows:

  • #26153 - Test and document Path.absolute()
  • #25701 - Remove pathlib._Accessor
  • #26141 - Remove pathlib._Flavour
  • (no PR yet) - Add pathlib._AbstractPath as an experimental, unstable API.

In the medium term:

  • Write a draft PEP detailing the addition of _AbstractPath and the known design questions that must be answered before we promote it to AbstractPath.
  • Share the PEP with interested developers, particularly those maintaining libraries that already subclass pathlib.Path in naughty ways
  • In my own third-party package(s), use _AbstractPath to implement as many popular filesystem/archive/etc formats as I can.
  • Incrementally improve the _AbstractPath implementation as we discover its rough edges.
  • Incrementally expand the PEP, discussing design questions and proposed answers.

Petr and I also discussed the Windows and Posix path syntax. Given _AbstractPath subclasses from PurePath, and not PurePosixPath or PureWindowsPath, we talked about how subclasses of _AbstractPath can opt in to posix-style or windows-style syntax.

The affected methods are:

  • _splitroot(), which extracts the drive, root and parts from a string sequence
  • _casefold(), which makes the path lowercase on Windows for use in comparisons.
  • is_absolute(), which must consider the drive on Windows
  • is_reserved(), which returns True for certain filenames on Windows, like AUX
  • as_uri()

My current view (reflected in the #26141 patch) is that the default implementations of these methods in PurePath (and consequently _AbstractPath) should follow posix syntax, with the exception of as_uri(), which should raise.

In all relevant cases the posix behaviour is more primitive and straightforward, whereas the Windows behaviour is more elaborate. This is especially apparent in _casefold() (always returns its input) and is_reserved() (always returns False).

Subclasses of _AbstractPath that need Windows syntax can get it via:

class MyLovelyPath(pathlib._AbstractPath, pathlib.PureWindowsPath):
    ...

For posix syntax (which is much more common in my experience), the incantation is a little simpler:

class MySuperPath(pathlib._AbstractPath):
    ...
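
Since _AbstractPath isn’t available yet, the difference between the two syntaxes can be seen with the existing pure classes. This is just a demonstration of current PurePosixPath and PureWindowsPath behaviour, not of the proposed API; the variable names are mine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Case folding: Windows-style comparisons ignore case, posix-style do not.
windows_ignores_case = PureWindowsPath("FOO") == PureWindowsPath("foo")
posix_ignores_case = PurePosixPath("FOO") == PurePosixPath("foo")

# Drives: only the Windows flavour splits "c:" off the front of the path.
windows_drive = PureWindowsPath("c:/foo").drive
posix_drive = PurePosixPath("c:/foo").drive

# Absoluteness: on Windows a root without a drive is not absolute.
windows_rooted_is_absolute = PureWindowsPath("/foo").is_absolute()
windows_drive_is_absolute = PureWindowsPath("c:/foo").is_absolute()
posix_rooted_is_absolute = PurePosixPath("/foo").is_absolute()

print(windows_ignores_case, posix_ignores_case)
print(repr(windows_drive), repr(posix_drive))
print(windows_rooted_is_absolute, windows_drive_is_absolute, posix_rooted_is_absolute)
```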

I’d really appreciate review from core devs of:

  • PR 26153 - document and test pathlib.Path.absolute()
  • PR 25701 - remove pathlib._Accessor

Thanks :slight_smile:

I would appreciate if Bernie contributed a bit more before demanding core reviews :wink:

Just in case you’re being at least partly serious, I am doing my best to monitor the bug tracker + github + forums etc to support other people using pathlib! I am trying to be unselfish in this work but if I’m coming across as more of a hindrance than a help I do apologise and will reflect on where I spend my time. I find this a little hard to judge.

If we’re just making fun of Bernie’s contributions, let’s not discount his great meme potential! :slight_smile:

This was just in jest :slight_smile:

Sorry for the radio silence - I’ve been pretty busy the last few months!

I’d really like to get this PR over the line. Would a core dev be willing to help? It removes an internal abstraction called _NormalAccessor that needlessly complicated pathlib’s internals.

Accessors are:

  • Lacking any internal purpose - ‘_NormalAccessor’ is the only implementation
  • Lacking any firm conceptual difference to Path objects themselves (inc. subclasses)
  • Non-public, i.e. underscore prefixed - ‘_Accessor’ and ‘_NormalAccessor’
  • Unofficially used to implement customized Path objects, but once bpo-24132 is addressed there will be a supported route for that.

This patch preserves all existing behaviour.

Thank you.

:sparkles: January 2022 progress report :sparkles:

Many thanks to Éric Araujo, Brett Cannon and Zachary Ware for their help reviewing my PRs. I’ve landed a small performance improvement, and two other important PRs look to be nearing completion.

To summarise some of the foregoing discussion in this thread, I reckon there are two main places where pathlib could be extended by users:

  1. Path-like objects that operate on S3, zip files, etc. This was the original focus of this thread.
  2. Path subclasses that operate on the local filesystem, but add new methods or modify existing ones

I’m currently focusing on the former case, but the use cases will become more related as work progresses.

Zooming in a bit, we’re two PRs away from entering a “phase 2” of the project where experiments can be undertaken outside the CPython source tree much more readily (e.g. in PyPI packages). My previous comment describes the first of those PRs.

That’s all, cheers!

:sparkles: February 2022 progress report :sparkles:

Big thanks as ever to folks who have reviewed PRs and contributed ideas – Alex, Brett, Ethan, Éric, Eryk, Petr, Zachary and probably others I’ve missed!

There’s been quite a lot of progress over the last month. Here’s a rundown:

  • PR 25701 - pathlib._Accessor has been removed. Its presence was a big pain point when refactoring the pathlib internals. Its removal marks the beginning of a “phase 2” to this work!
  • PR 26153 - pathlib.Path.absolute() is now fully documented and tested.
  • PR 30971 - pathlib.Path.__enter__() now raises DeprecationWarning and is scheduled for removal in 3.13.
  • PR 31085 - pathlib._AbstractPath is introduced by this (open) PR. It has abstract stat(), open() and iterdir() methods.

Before we can remove the underscore prefix from _AbstractPath (and add full docs + tests), we need to do at least three more things:

  1. Remove the need to further subclass PurePosixPath or PureWindowsPath in order to set _flavour.
  2. Figure out how to pass state (sockets, file objects, etc) into objects generated by path / 'foo', path.iterdir(), etc.
  3. Figure out if we’d expect users to construct os.stat_result objects from their stat() implementations – and if not, what?

I think I have a handle on the first of these – look out for a future PR named "Remove pathlib._Flavour". Feedback/ideas very welcome on the second and third!

My focus for the next month will be on getting pathlib._AbstractPath over the line, and removing pathlib._Flavour.

Thanks for reading o/
