Make pathlib extensible

Support for subclassing pathlib.Path was added in 3.12, so this should just work:

class MyPath(pathlib.Path):
    pass

The cost of the check in Path.__new__() is a shame, but unavoidable without breaking APIs as far as I can tell.

1 Like

On POSIX compliant systems, POSIX is the name of the API that’s mostly used to implement the os module, and os.name is “posix”. On Windows systems, Win32 is the name of the API that’s mostly used to implement the os module, but os.name is the name of the base system/kernel, “nt”. On POSIX systems, sys.platform is the name of the base system/kernel, such as “linux” and “darwin”, while on Windows it’s the name of the API, “win32”.

1 Like

Ah I totally missed that Python 3.12 already made pathlib.Path subclassable when I tested my suggested code in Python 3.10.

I understand why for backwards compatibility you need to keep the relative class hierarchy as is and the if cls is Path: cls = WindowsPath if os.name == 'nt' else PosixPath check in Path.__new__, and I also like that the new implementation drops the whole _Flavour class by simply reusing os.path.

Thanks for all the good work. :slight_smile:

4 Likes

At many places, functions support filenames, path-like objects and file handles like this:

if hasattr(filename, "read") or hasattr(filename, "write"):
    fp = filename
    closefp = False
elif isinstance(filename, (str, bytes, os.PathLike)):
    fp = open(filename, mode)
    closefp = True

But this code rules real generalized path-objects out. So how should we improve this?

Is it ok, to expect, that any object with a open-method can be used

if hasattr(filename, "read") or hasattr(filename, "write"):
    fp = filename
    closefp = False
elif hasattr(filename, "open"):
    fp = filename.open(mode)
    closefp = True
elif isinstance(filename, (str, bytes, os.PathLike)):
    fp = open(filename, mode)
    closefp = True

or is it better to check for a instance of type PathBase?

if hasattr(filename, "read") or hasattr(filename, "write"):
    fp = filename
    closefp = False
elif isinstance(filename, pathlib._abc.PathBase):
    fp = filename.open(mode)
    closefp = True
elif isinstance(filename, (str, bytes, os.PathLike)):
    fp = open(filename, mode)
    closefp = True

It think, to allow a broader support for path implementations, there should be one way to handle this case.

Path objects work just fine with open(), they don’t need a separate branch.

But I don’t love this pattern in general–I would rather write one function that takes a file-like object, and another that takes paths and opens them for use by calling the first function.

:sparkles: Sparkly update! :sparkles:

I’ve resolved GH-73991 by adding new Path.copy(), copy_into(), move() and move_into() methods. Publicly these can only be used for local filesystem copies/moves, but secretly these methods are implemented in PathBase and allow any other instance of PathBase as the destination path. With appropriate implementations of PathBase, it should be possible to move a file from local storage to a .zip file, thence to a .tar file, and finally back to local storage, with three calls to move().

There are underscore-prefixed methods for preserving metadata when copying/moving between different types of PathBase, but they need much refinement. As mentioned in my last update, I also need to develop a FlatPathBase class to better support filesystems that aren’t directory-oriented. To make progress on both of these things, I’m planning to write my own private PathBase-derived version of zipfile.Path that passes all/almost all its tests. I’ll extract the generic bits (e.g. generation of implied directories) into FlatPathBase, and ensure we can round-trip metadata in copy() and move(). If I can get it working, it should allow me to finalize the pathlib._abc APIs.

I think that’s it for now. Cheers!

21 Likes

I haven’t closely followed the new developments but I noticed a really useful method is now deprecated: pathlib — Object-oriented filesystem paths — Python 3.14.0a0 documentation

What was the reason for this? I can’t find anything using GitHub’s search.

See Deprecation of pathlib.PurePath.is_reserved()

1 Like

:sparkles: Ooo an update! :sparkles:

I’ve written a hacky implementation of zipp.Path atop PathBase and found the experience very informative! I’m looking forward to sharing the results with @jaraco. One thing that’s become obvious: the PathBase interface is too large, at ~50 methods and attributes.

I’m thinking of eliminating around 20 methods from the PathBase interface, such as is_fifo(), hardlink_to() and group(). Implementations may wish to add those methods, but they shouldn’t be guaranteed by the PathBase interface I don’t think.

I’m also considering splitting PathBase into read-only and read-write classes. The read-only class would define ~20 methods (including three abstract methods: open(), iterdir() and stat()), and the read-write class would add ~10 methods (including three abstract: mkdir(), symlink_to() and _delete()). These could be made true ABCs rather than quasi-ABCs-that-raise-UnsupportedErrors-by-default, which might be nice.

I don’t know what to name these classes, maybe ReadablePathBase and RWPathBase? Meh…

Here’s my working, if it’s useful: PathBase pruning - Google Sheets

It’s an exciting stage of the project for me: the pathlib ABCs are feature-complete, and the remaining technical work amounts to pruning and neatening.

Thanks for reading!

7 Likes

I looked at your table, and one thing makes me uneasy about this split: open is a method that can be used to modify or write, in addition to reading. It doesn’t look like it belongs to a “read” class. The same is true to a lesser extent for copy and copy_into, because they modify the target path. I understand that in practice you need open to be able to get the content of the file, so I don’t see another way to do it while keeping the same separation.

How about replacing ReadablePathBase with something like InfoPathBase, which would only keep the methods that gather information about the path, like metadata, kind of file, directory scanning… RWPathBase could become an ActionPathBase that act upon the path, with the rest of the set. (Sorry, I’m not great ant naming things!)

I adapted your table to show what I mean: Another PathBase pruning - Google Sheets

I didn’t touch the deletion, maybe some of them look like they could be useful, like unlink?

1 Like

I’m not sure if this works in the static type system, but I think you can express the open int the base class as

class BasePath:
    def open(mode: Literal["r"] = "r", ...):
        ...

While you can express open in the child classes as

class RwPath:
    def open(mode: Literal["r", "w", "rw"] = "r", ...):
        ...

But I’m not sure if an inherited method can extend the allowed literals in the static type system.

In any case, I don’t see much of an issue with the open method.

2 Likes

I appreciate the feedback. Just on this point:

At the moment, copy() and copy_into() allow you to pass another kind of PathBase object as the target path, and only the target path needs to be writable, so:

# This would work even if the zipfile.Path is read-only
source = zipfile.Path('src', ...)
target = pathlib.Path('blah')
source.copy_into(target)
1 Like

Just a nitpick that mode has a lot more options (e.g. rt and rb)

1 Like

… Can you tell I don’t work with binary files too often?

I agree that this one is debatable. I was thinking in terms of “is there a risk of losing data somehow?”, and in my mind a ReadablePathBase as opposed to a RWPathBase should be safe in this regard. Maybe I took it too much as a ReadOnlyPathBase, but it’s actually not what you proposed.

move seems like reasonable default implementation for any kind of path, i.e. subclasses would get it “for free” if they implement the underlying operations (but they could also optimize it for copies within a “filesystem”). It would be sad if it went away.
Similar for methods like rename or replace. Implementing them in the base class could be a good opportunity to ensure various Paths can work well together for users, without unnecessary effort from authors of the subclasses.

Most of the is_* methods marked for removal are simple wrappers around stat(). I don’t think it makes much sense to remove these default implementations if stat() is still required.
Which brings to mind what to do about stat(). It seems that stat_result itself could use a protocol (or generic superclass) – perhaps one with only attributes, not the old tuple-like interface?
Or perhaps another class for with stat-related methods (including is_*, chmod, touch)?

Besides read/write, there’s another split that affects what kind of operations are available/allowed. It might be reasonable to encode this in the type system:

  • directory (glob, walk, iterdir; mkdir; move_into destination)
  • data (open, read_*; write_*)
  • link (readlink, lstat)

(For example, importlib.abc.Traversable is essentially ReadablePathBase without links, stat, and some ease-of-life utilities that it could get “for free” by subclassing the base, like glob.)

On the other hand, such a class hierarchy could easily become too complex to be usable.

4 Likes

Sorry Petr, I think you might have seen a version of the spreadsheet where I was experimenting with evicting PathBase.move(). I’ve now moved it back into RWPathBase section.

I reckon we should keep PathBase.move() but evict rename() and replace(), because move() supports most use cases for the latter two methods, adds support for directories, and allows an arbitrary PathBase as a destination.

Even for “rare” file types? (specifically: block/character devices, FIFOs, unix sockets, arguably Windows junctions.) I’m not ruling out adding these back later in response to user demand, but I’m somewhat doubtful that folks will need them soon. They’re not representable in quite a few kinds of virtual filesystem.

Good idea, thank you, I will do this soon :slight_smile:

2 Likes

Hmm, there’s a judgement call here. But there’s some value in keeping a simpler function that can’t copy a whole directory tree across a network. That can be an expensive operation (literally, with things like cloud storage).


They are representable in stat() results, though. As long as you keep mode, things like stat.S_ISBLK need to work on it (i.e. return False in most virtual filesystems). And if that’s part of the API, you might as well add the wrappers to the base class.
(I don’t think the raw number of methods is a problem – the metric to minimize is things that users need to override or think about.)

1 Like

Because I like drawing diagrams, here’s a sketch of the class hierarchy that I’m aiming for:

7 Likes

Happy new year all!

Here’s a revised diagram showing the classes I’m now aiming for:

I’m not planning to add DeletablePath yet. I’ve moved its prospective methods (_delete(), move() and move_into()) from PathBase to Path. We might come back to it later.

WritablePath inherits directly from JoinablePath rather than ReadablePath, which seems much neater to me. There’s a fly in the ointment: PathBase.open(), which seems to apply to both readable and writable paths. I think this could be solved if we replace the PathBase.open() method with a pathlib._abc.open() function. This would work like built-in open(), but additionally accept objects with __open_rb__() or __open_wb__() methods, depending on the mode used. When opening in text mode, it would try __open_r__() or __open_w__() first, and fall back to wrapping the binary dunder methods in io.TextIOWrapper.

I still need to work out exactly how we convert a mode string to a dunder method name, and what modes should be allowed. When I come to write the PEP for making the ABCs public, it will include adjusting built-in open() to understand the new “openable” protocol.

That’s my current thinking anyway, happy to hear feedback. Cheers!

7 Likes