Make pathlib extensible

barneygale · May 23, 2023, 1:34pm

You’re right about backwards compatibility. Foo. I may have to do the diamond thing!

It’s a frustrating edge case, as os.fsencode() doesn’t do I/O, but does vary by system. It’s the only part of PurePath that does this, and (apart from fspath()) the only os function called.

barneygale · May 23, 2023, 11:32pm

PR up! GH-89812: Add `pathlib._BasePurePath` by barneygale · Pull Request #104810 · python/cpython · GitHub

On naming: AbstractPurePath doesn’t seem right, as the class doesn’t have any abstract methods. And to me, PureLocalPath indicates the presence (rather than absence) of the __fspath__() method! I quite like BasePurePath, LexicalPath, maybe LexicalPurePath. Thoughts?

barneygale · May 24, 2023, 11:44pm

Here’s my current thinking:

_BasePurePath is like PurePath, but lacking __fspath__(), __bytes__() and as_uri()
_AbstractPath is like _BasePurePath, but adds abstract stat(), open() and iterdir() methods, plus all methods that can be derived (like is_dir(), read_text(), glob()). It will be extended by tarfile.TarPath. If that goes well, we can drop the underscore prefix
_BasePath is like _AbstractPath, but it uses os functions to implement stat(), open() and iterdir(). It exists mainly to make _AbstractPath testable.
_BasePath is like Path, but lacking some methods like expanduser(), hardlink_to(), etc.

I may omit _BasePath if it doesn’t prove useful enough.

liormizr · June 8, 2023, 10:24am

Hi All,

My name is Lior and I’m the author of s3path.
First I got to say that it’s very cool that you guys are thinking about how to make pathlib better for extensions.

When I started s3path one of the reasons I choose to base the project on pathlib was the simplicity to extend pathlib with other protocols and the fact that pathlib is a standard library so I can use that same api instead of reinventing a new one.
And the potential that there will be more extensions like that.

From my point of view about how to improve pathlib extensionality, most of our issues are about how other packages react to pathlib and when to use that native path string vs uri.
For example:
Some packages expect a native string if it’s a local file but the same methods expect uri if you want s3 path.

I’ll give two examples you can see issues:
copy files between S3 locations using S3Path · Issue #98 · liormizr/s3path · GitHub - using shutil.copy
And
Pandas compatibility · Issue #59 · liormizr/s3path · GitHub - integration with pandas

Also, we are missing some methods like
Path.from_uri classmethod and Path.copy method but I think that this is off topic

Thanks
Very interesting discuss
Please let me know if I can help with anything

barneygale · June 8, 2023, 11:45am

Thanks for the feedback!

If you’re working with an API the requires a local file path, I expect you’ll need something like:

remote_path = S3Path("foo", "bar", bucket="woot")
local_path = remote_path.download()

or perhaps:

remote_path = S3Path("foo", "bar", bucket="woot")
with tempfile.TemporaryDirectory() as temp_dir:
    shutil.copytree(remote_path, temp_dir)

The latter assumes that we adjust shutil.copytree() to accept os.PathLike | pathlib.AbstractPath. I slightly prefer this to a Path.copytree() method, as it’s clearer that it can work with different kinds of path object.

IIRC, s3path implements __fspath__() and attempts to automatically download files just before they’re used. IMHO that’s the wrong approach, as you’ll never be able to account for all the situations where os.fspath() is called.

liormizr · June 8, 2023, 12:14pm

Hi @barneygale
I think that you are missing something in our usage

The beautiful thing here is that all the path objects have the same api.
So a developer can for write a function that get in his machine work with windows path object, in CI ran the unit tests in linux path and in staging/production the same code get s3 path object.

So do for that I’m very careful changing pathlib api and keeping it strict.
Also keeping it explicit like the base pathlib api.

So for your example:

remote_path = S3Path("foo", "bar", bucket="woot")

Need to stay:

remote_path = S3Path("/woot", "foo", "bar")

and

local_path = remote_path.download()

is not something that will work if remote_path is not a for example PosixPath

Now fspath have a very specific design - PEP 519 – Adding a file system path protocol | peps.python.org
Adding IO inside it is very very not explicit and will add a lot of voodoo.
It won’t be trivial to debug and in general understand what will happened and when
Specially when you are talking on a hay IO operations.

Now about the Path.copytree() method
I love it!
This is something that in Path level I can modify and then we can implement inside the right implementation for every use case.

So for example in s3path there will be multiple cases:

s3 to s3
local (or some oder flavor) to s3
s3 to local (or some oder flavor)

barneygale · June 8, 2023, 12:43pm

Say it quietly, but I think arbitrary nesting will be possible:

# download markdown file in zip file in tar file in S3 bucket.
path0 = S3Path('bucket/foo.tar')
with path0.open('rb') as fileobj0:
    tf = tarfile.open(fileobj=fileobj0)
    path1 = tarfile.TarPath('arc.zip', tarfile=tf)
    with path1.open('rb') as fileobj1:
        zf = zipfile.ZipFile(fileobj1)
        source = zipfile.ZipPath('README.md', zipfile=zf)
        target = pathlib.Path('Downloads', 'README.md')
        shutil.copytree(source, target)

liormizr · June 8, 2023, 8:11pm

What do you guys think about this

# download markdown file in zip file in tar file in S3 bucket.
source_path = S3Path('bucket/foo.tar')
target = Path('Downloads', 'README.md')

zip_file = TarPath(source_path).joinpath('arc.zip')
source = ZipPath(zip_file).joinpath('README.md')

shutil.copytree(source, target)
# OR
target.copytree(source)
# OR
source.copytree(target)
# you get the point

Now when I see this code lets make it a real production code with test and everything
If I’ll split the code to functions I will get something like this:

def extract(path, zip_name: str, file_name: str):
    zip_file = TarPath(path).joinpath(zip_name)
    return ZipPath(zip_file).joinpath(file_name)

def copy(source: Path, target: Path, convert_format_callback: Callable):
    extracted_file_object = convert_format_callback(source)
    shutil.copytree(extracted_path, target)

def main():
    copy(
        source=S3Path('bucket/foo.tar'),
        target=Path('Downloads'),
        convert_format_callback=partial(extract, zip_name='arc.zip', file_name='README.md'))

Now when I look at copy method
it can use the STD Path object, it can use s3path, if can use GoogleCloudStoragePath, AzurePath, FTPath of what ever we will implement

it doesn’t really matter as long that we will have the right strict api to folow

Now about Zip and Tar.
Personally I don’t like adding this idea in to path objects becomes we mix between Path objects that manage File Systems and file formats.
But this is my

barneygale · June 16, 2023, 6:28pm

June 2023 progress report

My revised target for adding pathlib.AbstractPath and tarfile.TarPath is Python 3.13. I have a plan and working (local) implementations of both

An inheritance diagram, with new classes in yellow:

There are 2-3 more PRs I need to land before I add AbstractPath:

GH-104810: Add pathlib._BasePurePath (thanks Alex and Éric for the reviewing!)
GH-105794: Add follow_symlinks argument to pathlib.Path.is_dir() and is_file()
(Probably) final pass of pathlib.Path tests

(Reviews appreciated as ever!)

The tarfile.TarPath class will follow soon after AbstractPath is added. The implementation is short and sweet, which is nice and helps validate the AbstractPath design, but I have a lot of tests to write!

The AbstractPath class name will initially have an underscore prefix. I’ll solicit feedback from maintainers of pathlib-y pypi packages, make any adjustments, and hopefully drop the prefix in time for 3.13 beta 1.

That’s all. Ciao for now!

barneygale · June 17, 2023, 8:33pm

Q: should any/all of the following methods form part of the AbstractPath interface, and if so, how should they be implemented?

absolute
expanduser
symlink_to
hardlink_to
touch
mkdir
rename
replace
chmod
unlink
rmdir
owner
group

The existing implementations of these methods (in PurePath and Path) aren’t suitable for AbstractPath. I’ve excluded lchmod(), cwd() and home() from the list as their fates depends on what we do with chmod(), absolute() and expanduser().

Ideas:

Don’t include them in the AbstractPath interface
Make them (all?) abstract
Default implementations that raise an appropriate OSError indicating an unsupported operation.
Something else?
Some combination of the above?

As a duck typing-fancier I lean towards option 1, as it naturally supports feature detection and doesn’t place much burden on implementors of AbstractPath. On the other hand, it arguably muddles the interface and makes fewer guarantees to users of path objects. It prevents us, I think, from adding methods like rmtree() and move() to the pathlib.AbstractPath interface, as the required methods wouldn’t necessarily be available. They could still work as functions that perform their own feature detection.

In support of option 2, it’s worth noting that NotImplementedError is already raised by pathlib.Path methods in some cases, e.g. it’s always raised by Path.owner() and Path.group() on Windows!

In support of option 3, arguably users should handle EROFS the same whether it’s generated by the OS or artificially by pathlib. OTOH, the interface will be complicated (for us and for users) if we need to report different errors in different scenarios, and I’m not sure there’s even appropriate errors for some operations (e.g. absolute()).

Thoughts anyone?

h-vetinari · June 18, 2023, 1:22am

I think 2. is appealing (perhaps without unix-isms like owner & group). It says “this is the expected baseline for what a Path can do”. If some exotic platforms cannot implement certain operations, they can “implement” a method that just raises an informative NotImplementedError of why that’s not possible.

That seems better than severely restricting the interface for the vast majority of setups where that’s a reasonable baseline. Of course, the balance here is in the trade-off of what “exotic” is. Windows certainly isn’t, but a complicated network-only FS probably is.

EpicWink · June 18, 2023, 4:29am

I prefer option 1, as most of those methods don’t make sense in many cases (including tar-path and cloud object storage), and not-implemented-error-raising functions are harder to deal with typing.

Is the goal to remove the need for people to subclass type(pathlib.Path())? If so, could people who need these methods simply subclass pathlib.Path?

I like this approach the most, when in combination with new public typing.Protocols which declares the interfaces used by these functions, eg SupportsMove which is a union of class with rename and class with read, write, and unlink).

I think absolute, replace, touch and unlink could be part of AbstractPath, perhaps also read_bytes and write_bytes.

If adding write_bytes, then touch could have a default implementation of write_bytes(b"").

If adding *_bytes, then maybe *_text could also be added with default implementations.

barneygale · June 18, 2023, 4:04pm

Already fixed in 3.12! The following “just works”:

class MyPath(pathlib.Path):
    ...

If so, could people who need these methods simply subclass pathlib.Path?

In some cases, the methods I listed above apply even to non-local filesystems. For example, .tar files support symlinks, so a TarPath.symlink_to() method might be reasonable, but this doesn’t apply for .zip files. FTP has a notion of a current directory, so FTPPath.absolute() might also make sense, but S3 has no such notion. With so many possibilities, I worry that protocols like SupportsMove will become numerous and interdependent.

barneygale · June 18, 2023, 4:49pm

Thinking about a possible rmtree() method or function that works on AbstractPath objects:

If we go with option 1, then AbstractPath would not have rmdir() nor unlink() methods. This prevents us from adding an AbstractPath.rmtree() method. No matter! We can write a module-level function like shutil.rmtree() that operates on abstract paths and performs its own feature detection. Presumably it would raise an exception if the given path object doesn’t define the needed methods.

I like this in certain respects. It seems similar to checking for the availability of functions in the os module. It’s also the simplest to implement for me!

… but users would be unable to optimize rmtree() for their particular path type. For example, rmtree() could be implemented efficiently for .tar files by iterating over the linked list of file metadata and dropping any entries with a certain path prefix. But if we went with option 1, the shutil.rmtree() algorithm would instead walk directories in a (probably cached) virtual tree, and call unlink() / rmdir() as needed. Slower + more pointlesser. If we make rmtree() a method, this problem goes away.

The same applies for possible future additions like move(), copytree(), etc. And with all that in mind, I start leaning towards option 2 or 3. But then I recoil at the thought of someone’s IDE auto-completing ZipPath.hardlink_to(), given the .zip format doesn’t support hardlinks and so that method can only ever raise.

@AlexWaygood, oh typing guru, how do you think we should handle these “soft requirements”?

pf_moore · June 18, 2023, 7:56pm

When you say “like shutil.rmtree()” are you suggesting that we move the rmtree function from shutil into pathlib (but keep it as a function)? Even if we leave a copy available as shutil.rmtree, that seems like pointless churn. So I hope that what you mean is having shutil.rmtree itself be the function that acts on abstract paths. Or do you intend shutil to be specifically about filesystem paths, with a more generic version in pathlib?

Which makes me wonder where os.PathLike will figure in all of this. shutil functions mostly accept “path like objects” - i.e., os.PathLike. If AbstractPath subclasses also have the potential to support shutil-style functionality, but for a subtly different set of paths, it’s going to be really awkward for developers to know which to choose.

I’m assuming you have a plan for how all this will work, but I don’t recall if you’ve ever specifically explained it (and if you have, I’ve forgotten, sorry!)

barneygale · June 18, 2023, 9:35pm

This is indeed what I mean. Roughly:

# shutil.py

def rmtree(path: os.PathLike | pathlib.AbstractPath):
    if not isinstance(path, pathlib.AbstractPath):
        path = pathlib.Path(path)
    elif not hasattr(path, 'unlink') or not hasattr(path, 'rmdir'):
        raise TypeError(f"Path object {path!r} does not support unlink() and rmdir()") 
    for dirpath, _, filenames in path.walk(top_down=False):
        for filename in filenames:
            dirpath.joinpath(filename).unlink()
        dirpath.rmdir()

… but if we decide to include rmdir() and unlink() in the AbstractPath interface (option 2 or 3), then we can also add an rmtree() method, and so shutil.rmtree() becomes:

# shutil.py

def rmtree(path: os.PathLike):
    if not isinstance(path, pathlib.Path):
        path = pathlib.Path(path)
    path.rmtree()

In either case, implementing rmtree() in or atop pathlib would also solve a recursion error.

encukou · June 19, 2023, 9:04am

For (1), you’d then want to say something like “if a path implements owner(), it must match the semantics of Path.owner()”. That’s a pretty weird requirement – it would beg the question of why it’s not part of the base interface in the first place.

Maybe raise a pathlib-specific exception, similar to io.UnsupportedOperation? If it was a subclass of NotImplementedError, you could make owner() raise it on Windows.

barneygale · June 19, 2023, 7:16pm

I think this is spot on, and a perfect precedent to draw upon. Thank you!

barneygale · June 23, 2023, 1:11pm

Q: is it important for AbstractPath to inherit from PurePath?

In the class diagram I shared earlier, AbstractPath is a sibling of PurePath, and not a subclass:

The reason for this setup is that PurePath needs __fspath__() and __bytes__() methods for backwards compatibility, but these methods must not be inherited by AbstractPath.

I realised yesterday that there’s a way to make AbstractPath inherit from PurePath, if we thought that important. The trick is that PurePath objects aren’t directly instantiable (you always get a PurePosixPath or PureWindowsPath back), and so the __fspath__() and __bytes__() methods can be moved to a private subclass (_PurePathExt below) that sits between PurePath and its Windows- and Posix-specific variants. If we went down this route, the public view of the class hierarchy perfectly squares with my intuition:

(note how AbstractPath is sandwiched between PurePath and Path)

However, if I show the new _PurePathExt class in the diagram, things get exciting, perhaps too exciting?

EDIT: here’s a diagram comparing both approaches from a public and private perspective:

So, what do you think? Should PurePath and AbstractPath be subclasses of a private superclass (as in my earlier post), or superclass and subclass as this post suggests, or something else? Thanks!

barneygale · June 23, 2023, 1:26pm

(If anyone balks at the idea of a new private class in pathlib, please bear in mind that I’ve removed nine of them since I started maintaining pathlib, which I hope buys me a little room!)