Fixing Subclassing in Pathlib

I submit for your consideration what I believe is the first working version (with documentation) of an extensible, subclassable PurePath/Path to close this 6-year-old bug(1) that has spawned both a StackOverflow thread(2) and a page on CodeReview(3). I present the fix here both as a PR and an idea because as I understand it, this is the proper forum to discuss making additions (which my fix requires) to the standard library.

I believe that to make sense of what I did to fix it, it is important to understand at a deep level what the real cause of this bug report is. To do so you have to go down a little bit of a rabbit hole of logic. Iā€™m hoping you will indulge me and join me as I walk you through why the bug exists and has persisted for so long. For the points that I am about to make I will examine Path, and its derivatives PosixPath and WindowsPath, but the points I make apply equally to PurePath and its derivatives.

The reason that Path is not naively subclassable (by design) is because it is a factory class which when instantiated returns one of two entirely different classes. As such, despite what the dependency diagram indicates at the top of the Pathlib documentation(4), Path inverts the normal dependency relationship with its derivative subclasses. To produce them, it must know that they exist somehow. Right now they are just hard-coded references in the new method. This means however if we naively subclass path with something like:

class MyPath(Path):
    pass

It will fail because we didnā€™t also create new derivative classes to correspond to each of the Posix and Windows system flavours and tell our new MyPath about them so that it can generate them.

To my knowledge, there are four solutions to this problem, and I believe only one of them should be preferred. Before I show you that, Iā€™m hoping you will go with me and take a moment to consider why Path is a factory that returns different classes in the first place.

From what I gather from the documentation and the PEP, the reason for this is to facilitate writing platform-agnostic code.

From PEP 428:
It is expected that, in most uses, using the Path class is adequate, which is why it has the shortest name of all.

From the Pathlib documentation:
If youā€™ve never used this module before or just arenā€™t sure which class is right for your task, Path is most likely what you need.

Using Path one could for instance write:

path = Path(get_path_from_somewhere())
do_stuff(path)

This will run on both Posix and Windows as long as do_stuff is polymorphic with respect to both PosixPath and WindowsPath. This might all seem trivial, but if youā€™ll go with me a moment, letā€™s think about how that polymorphism would be achieved in actual practice. One could do the preferred:

def do_stuff(path):
    path.really_do_the_stuff()

And then define the platform-specific logic on really_do_the_stuff on our customized derivatives of PosixPath and WindowsPath. Except that wonā€™t work because Path isnā€™t subclassable.

So alternatively, we could do this:

def do_stuff(path):
    if isinstance(path, PosixPath):
        really_do_unix_stuff()
    else:
        really_do_windows_stuff()

Ok, but then, this could instead be written as:

def do_stuff(path):
    import os
    if os.name == "posix":
        really_do_unix_stuff()
    else:
        really_do_windows_stuff()

However, if we donā€™t need to reference the class names with isinstance to facilitate polymorphism, we are left wondering what having a separately named PosixPath and WindowsPath actually does for us.

This hints that this whole factory design of Path/PurePath maybe isnā€™t achieving what one might think that it would. But it must be necessary right? It canā€™t just be that it is only there complicating things and thereby preventing subclassing, right? Pathlib is brilliant with its object-oriented goodness and convenience methods (infinite praise goes out to Antoine Pitrou for this), but if the factory were necessary for all of that I wouldnā€™t be writing this. However, it turns out we donā€™t need the factory. Path works fine without it.

I know this because I have constructed a complete replacement of PurePath and Path which omit the factory design but instead attach the flavour to the class at the time of instantiation. These newly designed classes still pass all of the test cases that are run against PurePath and Path on both Windows and Linux*. Because they omit the factory design, they donā€™t have inverted dependencies and therefore are naively subclassable and extensible in any way you see fit. You can use them to achieve platform-agnostic code either using os.name as above, or by attaching platform-specific code to subclasses and calling it that way.

So now weā€™ve come to what the real crux of the problem is. Itā€™s not an issue hidden in the implementation, itā€™s an issue hidden in the design. Of course this design public, and also codified in a PEP. As such the only real way out is to introduce an alternative set of classes to pathlib. To this end I give you:

SimplePath (subclassable alternative to PurePath)
SimplePosixPath (subclassable alternative to PurePosixPath)
SimpleWindowsPath (subclassable alternative to PureWindowsPath)
FilePath (subclassable alternative to Path/PosixPath/WindowsPath)
PathIOMixin (**explained further below)

On Windows, SimplePath behaves as if it were PureWindowsPath. On Posix, it behaves as PurePosixPath. Similarly, FilePath behaves like WindowsPath on Windows and PosixPath on Posix. These four classes combined (less PathIOMixin) could, if one were so inclined, act as a complete replacement for the existing six Path/PurePath classes.

Iā€™ve attached all the code discussed above as part of this PR to close this issue. If you are still in doubt, Iā€™m hoping youā€™ll take the time to look at my code. Despite being 11 commits, itā€™s essentially just two refactors. The first splits the two responsibilities PurePath has, separating the base class methods into _PurePathBase. The second does the same with Path, moving the base class methods into PathIOMixin. The other commits are just minor tweaks and documentation to account for all of that.

The advantage this design has is that it allows subclassable PurePath/Path-like objects that people can work with while abiding by the existing standards framework and simultaneously not breaking anyoneā€™s existing code. If at some point one decided to deprecate via official methods the less functional versions of these classes and fix all of the surrounding documentation, then that is something that could be pursued.

In the beginning, I mentioned that I am aware of four solutions to this problem. Iā€™ve left the other three until now because I wanted to make sure that you really understood the problem and how it could be avoided. The difference between the solution I gave above and the following, is that all of the following bolster the existing problem in design. Every one of them is built on top of the existing problem, pouring concrete around the assumption that the factory is necessary and fixing it in place for years and years to come. I prefer to leave the option open to remove that at some future date if it is not offering any functional benefit. All that said, in good faith, here are the other options:

A)

# Make Public in __all__

class SubclassablePurePath1(type(PurePath())):
    pass

class SubclassablePath1(type(Path())):
    pass

or alternatively:

import os
# Make Public in __all__

class SubclassablePurePath2(
    PurePosixPath if os.name == "posix" else PureWindowsPath
): pass

class SubclassablePath2(
    PosixPath if os.name == "posix" else WindowsPath
): pass

B)

Have Path be aware of its subclasses via registration with for example
init_subclass. Then, upon instantiation, in the new method, check whether a subclass of the appropriate flavour exists. If not create it with type, and then instantiate an instance of that with the pathsegment arguments. Then there are all sorts of little caveats to worry about. First, you have to worry about what if there are already existing multiple subclasses with the right flavour, how do you decide which is the class to use? Also, when instantiating the naively subclassed

class MyCoolPath(PurePath): pass

How do we choose the names of the derived classes if they donā€™t exist? On windows would we create a WindowsMyCoolPath? Or should it be MyCoolWindowsPath? (It turns out it would have actually been more straightforward if Pathlib used WindowsPurePath instead of PureWindowsPath, but Iā€™m not sure how that plays with the English grammar rules for combining adjectives.) Also, this begs the question, should we create an overrideable function that allows users to customize how the name of their derived classes are chosen?

The answer is no. Just no. I started writing this, and I deleted it because I realized this merely kicks the can down the road and ignores the real problem. People are going to run into all sorts of derived class naming issues at the very least.

C)

So on to the fourth and last solution. Recently, there has been a lot of reorganization in pathlib. Barney Gale has been putting in a lot of work, fixing inconsistencies in its organization as well as various bugs. I admire how he recognized that _accessor is a vestigial abstraction buried in pathlib and has put in a PR that through a series of commits removes it. He also has a desire to make Path subclassable, and an idea (as I understand it) that he was proposing was to do this not directly, but adding a class that inherits from PurePath and from which Path is derived called AbstractPath. (His thread on that here.) He has several open PRs out there for Pathlib and was put in a lot of work in rewriting the bulk of the functions in pathlib to facilitate his vision. I donā€™t want to misrepresent his ideas, so Barney, if you are reading this, Iā€™m hoping you can explain better than I can what your end solution is going to look like. Also, I apologize, Iā€™m not trying to put you on the spot, but just want to make sure that I make space for an alternative solution and acknowledge all the work you have done towards solving this problem as well.

Incidentally, Barney, I hope youā€™ll see that everything that you are working to achieve falls out naturally from the PR I submitted above.

For instance:

class LimitedIOMixin:
""" Depends on SimplePath """
    def open(self):
    ...
    # Any other specific I/O methods you want to implement

class MyPlatformAgnositcLimitedZipPath(SimplePath, LimitedIOMixin):
    pass

So thatā€™s my argument. I submit my idea and code because I think it is the way to truly solve this problem. Hopefully, others see this as well and also see how everyone gets out of this solution what they were hoping pathlib would provide. If you have made it all the way through this, thank you for your time and consideration. I welcome any feedback at all you have for me.

*Full disclosure, there are a couple of tests that referenced class names and were inappropriate to run, but this statement is otherwise true.

** Why is PathIOMixin Necessary?

The answer is that to write custom IO operations for paths that are dissimilar in flavour to the machine that you are running the code on, you need to be able to combine the flavoured SimplePath with a mixin that provides the IO operations. You could write an entirely new IOMixin, but because of the work Barney is doing simplifying the organization of Path, if you just override 11 of the methods in PathIOMixin then all of the other 21 methods from Path will just automatically work (because they depend on them / will depend on them). For example:

#Called from Posix
class AzureWinIOMixin(PathIOMixin):
    def stat(self, *, follow_symlinks=True):
        ...
    # Also define owner, group, iterdir, readlink, cwd, home, touch, mkdir,
    # symlink_to, hardlink_to, rename, repace, unlink, rmdir, open, chmod

class AzureWinServerPath(SimpleWindowsPath, AzureWinIOMixin):
    pass

Referenced Links:

  1. h-t-t-p-s://bugs.python.org/issue24132
  2. h-t-t-p-s://stackoverflow.com/questions/29850801/subclass-pathlib-path-fails
  3. h-t-t-p-s://codereview.stackexchange.com/questions/162426/subclassing-pathlib-path
  4. h-t-t-p-s://docs.python.org/3/_images/pathlib-inheritance.png
2 Likes

Firstly, I really appreciate all this - itā€™s excellent to have a pathlib comrade in arms thinking about the same sorts of problems!

Secondly, itā€™s worth distinguishing two use cases:

  1. Using pathlib for non-local filesystems - zipfile.Path is a good example here. In this case the host OS is usually irrelevant, and the path syntax is baked into the format itself (e.g. ISO9660 is case-insensitive but POSIX-style). For this Iā€™m proposing a new AbstractPath class that sits between PurePath and Path.
  2. Extending pathlib functionality for local filesystems - e.g. adding the much-requested rmtree() method in a subclass. This is the use case in bpo-24132. For this Iā€™m proposing users go with your option (B).

If I were to design pathlib from scratch, I think Iā€™d arrive at your option (C). The dependency inversion inherent in (B) is precisely the cause of this 6-year delay in implementing something that sounds easy.

However, I think itā€™s too late in the day to switch to (C) now. Your proposal is essentially (B + C), as we need to keep B around for compatibility. That necessarily increases complexity because we have two parallel class hierarchies with slightly different semantics, and no deprecation plan for either.

I think itā€™s possible to make (B) work nicely, despite the difficult issues you raise.

To illustrate how we can refine (B), I think we can get the necessary boilerplate down to:

class TaggedPath(Path):
    def add_tag(self, tag):
        ...

@TaggedPath.register
class TaggedWindowsPath(TaggedPath, WindowsPath):
    pass

@TaggedPath.register
class TaggedPosixPath(TaggedPath, PosixPath):
    pass

(Naming TBD)

Although verbose, this doesnā€™t seem world-endingly awful to me.

If youā€™re going to provide a class decorator, why not just go the route of paramterizing your classes with a required keyword arg from __init_subclass__?

Require __init_subclass__ to specify if posix or windows

1 Like

Only omitted because I didnā€™t know it existed! :sweat_smile: From the docs it looks ideal, thanks for the pointer.

A bit more thinking out loud:

Maybe (C) is possible without breaking backwards compatibility or creating duplicate classes?

It might be possible to functionally remove/deprecate PureWindowsPath, PurePosixPath, WindowsPath and PosixPath, which simplifies the whole ecosystem. You could either do the OS/flavour check at runtime in relevant PurePath methods, or we retain _Flavour and formalise it. The key thing is we end up with a simple hierarchy: PurePath ā†’ AbstractPath ā†’ Path.

The *WindowsPath and *PosixPath classes stick around, but attempting to instantiate one results in a PurePath or Path object. Weā€™d also need to define __instancecheck__ (and others?).

Iā€™m going to withdraw my PR removing flavours until/unless I can rule out this approach.

Barney and John, thank you both for taking the time to give this serious and
deep consideration. I really appreciate the feedback and working with other people with the earnest desire to make pathlib even greater.

Barney I think that the idea to remove/deprecate the flavoured versions of
these classes is brilliant. This encapsulates the gist of what I was trying to
do and simplifies it even further to its core essence. It makes it more clear
that all the flavoured versions of both (Pure)Path are superfluous.

John, speaking of your slick idea with __init_subclass__, I think using that method is how you could make Barneyā€™s idea work. First you could introduce a flavour keyword argument to PurePath. Then you could use __init_subclass__ to examine the mro and check all superclasses for a flavour keyword argument in order to preserve the flavour in subsequent subclasses. I think you would have to do this because it gives a deprecation path for the (admittedly probably rare) cases where someone has explicitly used a *PurePath (or subclass thereof) that is opposite of their system flavour.

# On Windows, instead of PosixPurePath  
PurePath(*pathsegments, flavour="posix")

Then Abstract path could be defined as:

class AbstractPath(PurePath, flavour="generic"):
    ...

With ā€œgenericā€ just an alias to ā€œposixā€ but existing for code clarity.
This has the further advantage that documentation would only have to be limited to the methods and attributes of each class because extensibility works the way that you would expect.

The only barrier I see is that deprecating these classes and adding in AbstractPath
involves several changes to the public API. I am not sure if it will need a formal proposal or not. If so we could combine both ideas into one.

The only barrier I see is that deprecating these classes and adding in AbstractPath
involves several changes to the public API.

Anything backwards-incompatible that comes to mind?

Then you could use __init_subclass__ to examine the mro and check all superclasses for a flavour keyword argument in order to preserve the flavour in subsequent subclasses.

We have a more general problem to solve w.r.t. making the type ā€˜stickyā€™. For AbstractPath subclasses (say, FTPPath), many implementations need to bind some state onto a further type (say, FTPPath<ftp.python.org>. The shared state needs to be sufficiently general to handle shared sockets, requests.Sessions, file objects for .zip files, etc. Preserving flavour can be considered part of this mechanism to make type state sticky.

I think we can make this work by adding a classmethod that generates a subclass with the given state attached. The idea is to allow something like this:

class FTPPath(pathlib.AbstractPath):
   ...

host = 'ftp.python.org'

def main():
    with socket.socket((host, ...)) as sock:
        PythonFTPPath = FTPPath.bind(sock=sock)
        root = PythonFTPPath('/')

        # or perhaps
        root = FTPPath('/', sock=sock)  # __new__ generates a subclass

        # or if we fancy being verbose
        class PythonFTPPath(FTPPath):
            sock = sock
        root = PythonFTPPath('/')    

I may be missing something obvious hereā€¦

Amazingly, I think technically no. It appears everything is overridable / fixable by making PurePath/Path masquerade as all of the flavoured variants when called that way. My biggest concern was how to override the return value of type() calls, but even this is possible though Iā€™ve never even thought to try before. Also in the cases that PosixPath and WindowsPath were subclassed even __mro__ and mro() can be fixed up in the subsequent subclasses as well.

Moreover, implementing all of the machinery to make PurePath/Path directly subclassable by masquerading as *PurePath/*Path would touch all of the places where one would want to insert deprecation notices if one were to actually go through with simplifying the public classes. Even without going through with that, I think bpo-24132 can be done without breaking changes.

Iā€™m going to start working on this. Iā€™ll list you as co-author on all the commits and ping you when Iā€™m done if that sounds fair. Also, Iā€™ll work around anything that seems to be involved in AbstractPath and your current PRs.

Indeed, and now I see better some of the nuances of what you have been working to accomplish. It does seem essential to allow for that extra shared state to be exist across multiple instances of the derivatives of AbstractPath as in the examples you indicate. To me this state seems like a distinct abstraction from flavour (which needs to be inherited by subclasses), but similar in that the combination works to together to define the type of the class.

Between 1 & 2, Itā€™s a toss up to me, but they both seem like they would be straightforward for the end user. 1) is less magical I suppose, however, 2) looks cleaner to my eye.

Iā€™m assuming that for both 1 & 2 in the case where a new instance is returned by a method, the plan is to just alter _from_parts() to also add in the stateful information. Like perhaps store sock in this case internally in a class variable like _state because you would want to avoid having library authors when implementing FTPPath, RequestsPath, etc to have to override _from_parts. However, because what you are building is so abstract, it seems possible to me that state will contain multiple parameters so _state might have to be a dict. Though, youā€™ve probably thought of all of this already.

The third, like you said, seems unnecessarily verbose and will probably lead to additional class definitions sprinkled throughout procedural code. However, unlike the others, if you were to subclass it, sock/state stays with it. I have doubts about the usefulness of that though.

For me, generating a subclass is appealing because it can then be used identically to pathlib.Path. The initializer arguments are exactly the same. You can always initalise one path from another via p2 = type(p1)('/some/path')

But I havenā€™t given enough consideration to the alternative, i.e. that we edit _from_parts() etc as you suggest to pass state through. Thereā€™s actually a number of different codepaths for initialising paths internally - e.g. youā€™ll also see self.__class__() used. Iā€™m not sure how these various methods would know what to pass to the initialiser. We could generalise it as a state attribute (type: Any? dict?) so the code can do:

self.__class__(pth, state=self.state)

Weā€™d also need to factor the state into things like __eq__.

I guess this is why I lean towards a two levels of types for users of AbstractPath, so the state is baked into the concrete type.

Also, very excited to read your patch when youā€™re ready! No need to make me co-author - ideas are free. Do prepare yourself for a number of revisions on your patch though. We have to walk quite a narrow path to get everything working nicely.

One more thing to consider:

Iā€™ve been trying to make pathlib rely more on os and os.path where possible, mostly to eliminate bugs in pathlibā€™s parallel implementations. Path.resolve() and Path.expanduser() have had this treatment in the last few months.

This has the handy effect of removing OS-specific code in pathlib, which in turn simplifies any future juggling of the Windows and Posix subclasses.

My next target is to experiment with storing paths internally as a normalised string (via os.path.normpath()), rather than the current _drv, _root and _parts setup. This promises to dramatically simplify the internals, with a potential performance cost that I still need to measure.

1 Like

The majority of direct instantiations of Path I do are with normalised relative paths, and indirect instantiations are via iterdir and /. I would say thereā€™s a performance gain in not processing the path into parts, and simply caching the known parents when instantiating child paths.

class Path:
    def __init__(self, path):
        self.path = os.path.normpath(path)
        self._cached_parents = []

    def __truediv__(self, name):
        path = self.__class__(os.path.join(self.path, name))
        path._cached_parents[:] = [self] + self._cached_parents
        return path

    @property
    def parents(self):
        parent = self
        for parent in self._cached_parents:
            yield parent
        while True:
            parent = self.__class__(os.path.split(parent)[0])
            if not parent.path:
                break
            self._cached_parents.append(parent)
            yield parent

As an idea

2 Likes

I think the path forward looks like:

  1. Add functions/arguments to os.path such that it has feature parity with pathlib:
    a. Add an argument to normpath() that preserves path meaning (doesnā€™t strip .. entries)
    b. Add os.path.isreserved().
    c. Add os.path.asuri()
  2. Remove pathlibā€™s implementation of these in PureWindowsPath, PurePosixPath, _WindowsFlavour, _PosixFlavour
  3. Remove flavours

Somewhere on the way we may want to rework Path to store the path as a single string, rather than a _drv / _root / _parts combo.

Iā€™ll be directing some effort towards os.path feature parity over the coming weeks.

Is there a way currently to subclass pathlib.Path that Mypy accepts? Iā€™m at a loss.

Can you show us an example of what you tried? It seems to work fine when
I try it.

Here is my code:

from pathlib import Path

class MyPath(Path):
    pass

def func(arg:Path) -> str:
    pass

func(MyPath())

and when I run mypy over it, it reports:

[steve ~]$ mypy pathlib_demo.py 
Success: no issues found in 1 source file

This works now:

import pathlib
import sys

# There is special recognition in Mypy for `sys.platform`, not `os.name`
# https://github.com/python/cpython/blob/09d7319bfe0006d9aa3fc14833b69c24ccafdca6/Lib/pathlib.py#L957
if sys.platform == 'win32':
    _PathBase = pathlib.WindowsPath
else:
    _PathBase = pathlib.PosixPath


class Path(_PathBase):
    ...

Itā€™s a little silly, but this might be the best bet for now:


class MyPath(type(pathlib.Path())):
    ...

That doesnā€™t satisfy Mypy.