Incrementally move high-level path operations from shutil to pathlib

I’m never quite sure whether to motivate each change in isolation or focus on the big picture.

Short term, I’m only looking to make pathlib.PurePath and pathlib.Path subclassable. You can already subclass the Windows and Posix variants, but you get an AttributeError for the generic classes when you try to instantiate them. This was originally reported as a bug in 2015 and personally I still consider it a bug rather than a feature request. The PR does not introduce a notion of an abstract filesystem. It fixes an exception by simplifying the pathlib internals.

Maybe I should put together a youtube video for the longer-term motivation. In my day job I’m regularly mirroring and munging files between the local filesystem, tarballs, S3, Artifactory, etc. I firmly believe these DevOps-y tasks are major and growing use cases for Python, and that a common object-oriented path interface would be a “killer feature” on par with pathlib’s original introduction.

8 Likes

I’m in the process of converting all my programs to use pathlib. For shell operations, I typically use the sh module, which make it easy to interface with the shell without the complexity of using the SubProcess module. Not having to use os, os.path, sh, shutil and pathlib would simplify the process of updating several hundred programs a great deal; reducing the modules to two, sh and pathlib, would be a delight. I’m not certain that I’d use AbstractPath, but I can think of at least a few sections of code where I would at least examine it in detail to see if AbstractPath could simplify things.

Barney makes a good case to move the implementation of path operations into pathlib; I’d agree that shutil shouldn’t be removed, and I’d think that implementing shutil as a bunch of thin wrappers around pathlib would be ideal as a way to make the transfer simple without breaking existing code (of which I’m sure there is a vast amount).

(18 months later): we now have a private pathlib._abc.PathBase class that can be used to implement virtual filesystems, and a pathlib-abc PyPI package that makes it available publicly.

There’s a major shortcoming at the moment: to copy from one PathBase object to another, I need to do this:

with source_path.open('rb') as source_f:
    with target_path.open('wb') as target_f:
        shutil.copyfileobj(source_f, target_f)

… and that only works for files (not directories, or symlinks that I don’t want to follow), and it only copies (whereas I may want to move a file/directory tree).

I suggest the solution is to add PathBase.copy(), copytree(), rmtree() and move() methods as originally suggested in this thread. That would allow some tasty constructions like:

# Copy from tarball to FTP server without touching the local filesystem
source = TarPath('images', archive=tarfile.open(...))
target = FTPPath('public_html', ftp=ftplib.FTP(...))
source.copytree(target)

Do folks agree? Perhaps doing so only makes sense if we make PathBase public at the same time, as I think Eryk suggested earlier? Or perhaps folks feel that shutil.copytree() should be able to handle PathBase objects, and so new methods aren’t necessary? (My concern here is that it limits subclasses ability to implement these operations optimally, e.g. TarPath.rmtree() shouldn’t need to use walk())

1 Like

I like the idea of this sort of capability - although I don’t know how common it will be in practice. I don’t really care how it gets spelled, but I think the bigger question is what’s going to happen to shutil.

Ultimately, the question that needs to be answered is what we want people to reach for if they want to copy a directory structure from A to B. The “one obvious way”, to put it in terms of the Zen. At the moment that’s shutil, because that contains “high level file operations”, implemented in a way that’s maximally efficient. Will src.copytree(dst) be maximally efficient? For example, will it use os.sendfile on Linux? What if dst is a tarfile? Is it the responsibility of the implementer of src or dst to make sure the optimal approach is used?

What will happen to the 5 options to shutil.copytree? Will PathBase.copytree need to support them (so all subclasses will, as well), or will pathlib be “the obvious approach unless you want to ignore certain files in the tree (for example)”? And what about shutil.copy vs shutil.copy2 - which one will PathBase.copy replace?

I’m inclined to say that yes, PathBase should have copy, copytree, rmtree and move methods. To me, that’s the most attractive and flexible API. But that assumes that shutil as it stands, with its 2 copy functions, 5 optional args to copytree, and 4 options to rmtree, is massively over-engineered. And I’m not sure that’s true - I’ve seen a lot of those options used.

Sorry. More questions than answers here, I’m afraid…

3 Likes

Thanks for the questions (really!). Thoughts as follows:

Will src.copytree(dst) be maximally efficient? For example, will it use os.sendfile on Linux? What if dst is a tarfile? Is it the responsibility of the implementer of src or dst to make sure the optimal approach is used?

Yes. The Path.copytree(target) implementation will check if target is os.PathLike. If it is, it will use the efficient code that currently resides in shutil.py. A low-effort way to implement this is:

    def copy(self, target, follow_symlinks=True):
        if not isinstance(target, PathBase):
            target = self.with_segments(target)
        if isinstance(target, Path):
            # Copy between local files using efficient algorithms in shutil
            target = shutil.copy2(self, target, follow_symlinks=follow_symlinks)
            return self.with_segments(target)
        # Fall back to generic algorithm.
        return PathBase.copy(self, target, follow_symlinks=follow_symlinks)

What will happen to the 5 options to shutil.copytree?

I’m working on the assumption that the shutil functions will stick around forever, with their current arguments, as it’s too much churn to deprecate them (per Raymond’s reply).

In the equivalent PathBase.copytree() method:

  • symlinks: replace with follow_symlinks, with an inverted meaning. This matches the spelling/meaning in other methods.
  • ignore: I plan to ignore this :wink: at least until someone requests it. It might be better to provide a callable that accepts a path and returns true/false to filter what gets copied, e.g. copytree(filter=lambda path: path.match('*.py'))
  • copy_function: too low-level for pathlib. If folks need to control which metadata is copied, I think we’ll need to design new APIs, and it’s a bit hard to foresee what we’ll need.
  • ignore_dangling_symlinks: could be implemented a filter (see ‘ignore’ above)
  • dirs_exist_ok: keep. This gets forwarded to PathBase.mkdir(exist_ok=...).

And what about shutil.copy vs shutil.copy2 - which one will PathBase.copy replace?

copy2() to preserve as much metadata as possible. We could provide an API to specify which metadata to copy, but I’m finding the details of that difficult to pin down at the moment - it feels a bit chicken-and-egg.

That’s in Path, not PathBase? So it won’t help (for example) with an FTP transfer (which could also use os.sendfile)? Nor if target is an FTP location (which is harder to fix as it requires changing the implementation in Path)?

I’m not particularly saying it should, but I do think it’s worth ensuring that the API doesn’t prohibit efficient implementations - that was my point about this becoming the “one obvious way”.

Good point! :sweat_smile: Perhaps we could extract this code into a new function that PathBase.copy() could use instead of shutil.copyfileobj()? If both source and target paths return a file object with a backing file descriptor from their open() methods, the sendfile() path will be picked up.

Also just to be clear, an implementation like S3Path.copy() could issue a CopyObject request if the target is also an S3 path or a string, and fall back to PathBase.copy() if any other kind of PathBase object is given as the target.

What bothers me about this is that it’s clearly multiple dispatch “in disguise” - the best implementation depends on the combination of src and dst. There’s obviously a relatively simplistic and reasonable-for-most-cases base implementation, but there may be efficient approaches for various pairs of arguments.

This applies not just to copy but also to copytree and move (you mentioned TarPath not using walk, and there’s various renaming possibilities for move). But at least rmtree is easier as it is just single dispatch.

2 Likes

It bothers me too, tbh. I think what I’m proposing here will provide optimal performance where shutil does currently (i.e. local → local operations), and allow optimal performance in user subclasses of PathBase when the types of the source and target match. When they don’t match, my hope is to provide “good enough” performance, and your hint about using os.sendfile() should make it gooder enougher. Beyond that I’m not sure – it feels at risk of exploding complexity if we start registering adaptors to (source_type, target_type) pairs or anything fancy like that. I’m hoping to avoid thinking about that just yet, but if I must I will!

2 Likes

although I don’t know how common it will be in practice.

(I meant to reply to this earlier:) I think copy() particularly is important to making PathBase attractive (both in PyPI form and perhaps in the standard library), as it functions as both an “upload” and “download” method if you specify a pathlib.Path object as one operand and some other kind of PathBase object as the other. Without it the pathlib-abc PyPI package is hobbled because transferring data is so awkward. It’s one of only a handful of issues that are keeping me from releasing a 1.0 that I think users would adopt more readily.

1 Like

There’s some interesting discussion on the tracker about whether it makes sense for pathlib to import shutil.copy2, and whether copy2() and friends are best placed in shutil or some other module.

I think it makes sense to continue the discussion here. My take:

The implementations of shutil.copy2(), copyfile(), and copystat() deal with some low-level OS and filesystem details (code), whereas pathlib normally delegates all that stuff to os and os.path. In these instances, shutil seems lower-level than anything in pathlib.

The implementations of shutil.copytree(), rmtree() and move() are mostly without any OS-specific details - they string together the lower-level copy() functions, and other functions from os, in just the right way. They seem to be on about the same conceptual level as pathlib’s features, like Path.glob() and Path.walk().

The archiving functions provided by shutil seem to be on a higher level still, as they provide a generic interface to multiple formats. I’d say they’re higher level than anything in pathlib.

What to make of all this? @steve.dower suggested that the copy functions might be best placed in the os module. @storchaka pushed back on this. I feel ambivalent - I agree with Steve that the copy functions feel more at home in os, but they don’t seem obviously out-of-place in shutil either, and I’d rather not move ~1000 lines of code unless I have plenty of buy-in! @thesamesam noted an issue and PR about using copy_file_range() on Linux to speed up copying.

A shiny penny for anyone’s thoughts! Should the copy functions remain in shutil? If so, is it (un)reasonable to import and use them from pathlib?

I’ve never felt comfortable with the set of copy functions in shutil. I get the point that they are not “thin wrappers over OS functionality”, so I’m not sure os is the right place for them, but it still feels like they were put into shutil for the lack of anywhere better.

The problem is that when I reach for shutil I very much want functions that “just get the job done” - call it “one obvious way”, if you like. The amount of choice over which copy function[1] to pick is the exact opposite of what I expect to find in shutil.

I’m inclined to say that the functions should stay in shutil purely for backward compatibility reasons. Maybe move the documentation for them to a dedicated section entitled “low-level file copying operations”, to emphasise that they don’t really fit on the same level as the rest of shutil (much like the archive operations get their own section because they are higher level). But otherwise leave well alone.


  1. There’s more than you mention, actually - copyfile, copymode, copystat, copy and copy2. There’s also copyfileobj, which feels out of place as it operates on file-like objects, rather than named files ↩︎

4 Likes

Thanks very much! I’ll attempt to rework the shutil docs as you suggest, but it’s going to be difficult to reconcile the current title and synopsis (“High-level file operations”) with a section describing the copying operations as low-level. I might take some inspiration from the name and original purpose of the module – “utility functions usable in a shell-like program” in the first commit of Python.

Separately I’ll try to put together better-researched proposal to move the copying functions out of shutil, retaining the old names for backwards-compatibility. I suspect that a survey of other languages will show our eccentricity in placing file copying functions in a “shell utilities” module rather than a “io” / “fs” / “os” module. I vaguely remember my own confusion on first learning that I need to shutil.copy2() to copy a file in Python. And I already knew a little bit about Unix shells (and shell scripts), whereas many folks learning Python today will have never written a shell script that calls cp!

1 Like