Incrementally move high-level path operations from shutil to pathlib

barneygale · September 17, 2022, 10:30pm

This might be an enormous can of worms, but I’d like to suggest that certain high-level path operations in the (longstanding) shutil module might be more at home in the (relative newcomer) pathlib module, and that we could move them without breaking backwards compatibility, and unlock other benefits along the way.

Why?

shutil long predates pathlib. Its “shell utilities” remit is broad and overlaps with other modules, including pathlib. I wonder if Guido might be able to comment, but I get the impression it was the “module of last resort” for things that:

Were written in Python (so couldn’t be added to os), and
Didn’t need platform-specific implementations (so couldn’t be added to os.path), and
Were too small to deserve their own module (unlike glob, shlex, etc)

In PEP 428, Antoine suggested that pathlib might provide a good home for these functions:

More operations could be provided, for example some of the functionality of the shutil module.

These pathlib features have been requested perennially ever since.

By also introducing pathlib.AbstractPath (see this topic), we’d unlock the potential to apply some of these functions to different filesystem backends, such as S3 and its ilk. Users would be able to write path.move() without caring about the backing filesystem(s), which, like nature, is pretty neat!

What?

In my view, the functions in question are:

copy*(), including copytree() but excluding copyfileobj()
move()
rmtree()
chown()

How?

These functions could be added as methods of pathlib.Path, and in turn implemented using lower-level methods like Path.stat(), Path.open(), etc in many cases. When pathlib.AbstractPath is introduced, users would be able to supply their own implementations of these lower-level methods.
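As a rough illustration of building a high-level operation on the lower-level methods, here is a minimal sketch of a file copy that touches the filesystem only through Path.open(). The helper name copy_via_open is hypothetical and is not part of pathlib or of this proposal; copyfileobj() is used because it is the one copy function that would stay in shutil.

```python
import shutil
import tempfile
from pathlib import Path


def copy_via_open(src: Path, dst: Path) -> Path:
    """Hypothetical helper: copy file contents using only Path.open()."""
    with src.open("rb") as fsrc, dst.open("wb") as fdst:
        # copyfileobj() works on file objects, so it never sees the paths.
        shutil.copyfileobj(fsrc, fdst)
    return dst


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "source.txt")
    source.write_text("hello")
    target = copy_via_open(source, Path(tmp, "target.txt"))
    copied_text = target.read_text()
```

Because the helper only calls open() on its arguments, any object offering a compatible open() could be passed in, which is the property the AbstractPath idea relies on.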

The Path.copy*() API may benefit from some revision, e.g. merging methods and adding arguments to control behaviour. Perhaps not.

The original implementations in shutil would call through to pathlib ~~and probably undergo an extended deprecation period due to their high level of usage~~.

To make the implementation fully backwards compatible, we’d need to make the following (highly controversial!) changes to pathlib:

Support for bytes paths. I’m pretty sure it’s a settled question that pathlib should not support bytes, but I’d like to unsettle it. The shutil functions support bytes; glob.glob() supports bytes; any POSIX application built with portability and correctness in mind should use bytes. Correctness and ease of use are not enemies in Python, and we shouldn’t make them enemies in pathlib. On a technical level this is totally doable, and indeed lately we’re moving more towards treating the underlying “raw” path as an opaque object, and leaning more on posixpath and ntpath for low-level stuff.
Support for supplying directory/file descriptors. I believe Antoine intended to add support for this in pathlib but never finished it; remnants of this implementation survive in the pathlib codebase to this day!
Support for disabling path normalization. This is to ensure that shutil.rmtree('...') etc aren’t affected by subtle quirks in pathlib’s normalization logic, particularly on Windows.
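The bytes point in the list above can be checked against current behaviour. The sketch below is only a demonstration with throwaway file names in a temporary directory: shutil.copy() and glob.glob() accept bytes paths, while constructing a pathlib.Path from bytes raises TypeError.

```python
import glob
import os
import pathlib
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.txt")
    with open(src, "w") as f:
        f.write("data")

    # shutil accepts bytes paths and returns the destination as bytes.
    bytes_dst = shutil.copy(os.fsencode(src),
                            os.fsencode(os.path.join(tmp, "b.txt")))

    # glob returns bytes results when given a bytes pattern.
    matches = glob.glob(os.fsencode(os.path.join(tmp, "*.txt")))

# pathlib rejects bytes arguments outright.
try:
    pathlib.Path(b"a.txt")
except TypeError as exc:
    pathlib_error = exc
else:
    pathlib_error = None
```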

When?

There’s a lot to do first. For me, this only becomes compelling once we’ve introduced AbstractPath. I’ll also make the case for supporting bytes separately to this proposal when the time comes.

Still, is this worth (eventually) working towards? Thoughts? Thanks!

rhettinger · September 17, 2022, 10:51pm

If your goal is to expand what pathlib does, that seems reasonable.

If you want to remove capabilities from shutil, that is problematic. This module is very old and was widely used to replace shell scripts. Often this was done without tests. Removing the functionality would likely break lots of old infrastructure that has been quietly doing its job.

Also, not everyone likes pathlib and instead prefers the simpler tech that mirrors what they already know.

barneygale · September 17, 2022, 10:55pm

That’s fair. I wouldn’t mind keeping the shutil names around indefinitely as a shortcut for Path(x).copytree(y) etc. There’s already a correspondence between pathlib methods and functions in os, os.path, glob, fnmatch and others.
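To illustrate the existing correspondence, here is a small sketch pairing a few pathlib methods with their long-standing function equivalents. The file name is arbitrary and the checks only compare the two spellings against each other.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp, "example.txt")
    p.write_text("x")

    # Path.exists() <-> os.path.exists()
    same_exists = p.exists() == os.path.exists(p)

    # Path.stat() <-> os.stat()
    same_size = p.stat().st_size == os.stat(p).st_size

    # Path.iterdir() <-> os.listdir()
    same_listing = (sorted(child.name for child in Path(tmp).iterdir())
                    == sorted(os.listdir(tmp)))
```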

Rosuav · September 17, 2022, 11:17pm

That seems reasonable. I’d be inclined to reword it as “implementing shutil functionality in pathlib”, with the deduplication (by having shutil call on pathlib) more of an underlying detail than an announced feature, but I think it’d be handy to have the functionality in both places.

barneygale · September 17, 2022, 11:43pm

I suppose I put the emphasis on the “move” because there’s such a high bar to add things to Python. One implementation is asking a lot; two is surely too many! My wording in the title is ambiguous on the “implementation” vs “public API” bit though. Hope that makes some sense.

steven.daprano · September 18, 2022, 2:41am

I’m not sure why you think that functions written in Python can’t be added to the os.py module

I think that there is a somewhat arbitrary distinction between file system routines which are part of the OS versus those which are part of the shell, but given that distinction does exist, it makes sense to have a module for functions which are thin routines provided by the OS itself, and another module for more substantial functions which emulate routines that are provided by the shell.

shutil is not a grab-bag of miscellaneous routines that were placed in a single module because they didn’t fit anywhere else. They’re shell utilities. The name is kinda a hint

I’m not sure that we should be overloading pathlib with every routine that operates on a path. The module started life as a way to manipulate path names that was a bit easier than string manipulation, and now seems to be growing to the point that people want anything and everything that touches the file system to be a method on a path object

That leaves at least 12 other functions, so even if we follow your plan, we cannot remove shutil.

What use-cases are there for users reimplementing (probably badly…) these functions? Aside from Windows, Posix and Mac OS, are there enough common file systems to justify the added complexity and engineering to support this level of abstraction?

barneygale · September 18, 2022, 3:09am

Thanks for the feedback!

I’m not sure why you think that functions written in Python can’t be added to the os.py module

D’oh! Thanks.

shutil is not a grab-bag of miscellaneous routines that were placed in a single module because they didn’t fit anywhere else. They’re shell utilities. The name is kinda a hint

Most things which qualify as “shell utilities” aren’t part of shutil. I mentioned glob and shlex, but there are also things like os.makedirs(), os.path.expandvars() and subprocess. Conversely, most of the things in shutil are more commonly used to implement things other than shells.

I’m not sure that we should be overloading pathlib with every routine that operates on a path. The module started life as a way to manipulate path names that was a bit easier than string manipulation

It started life in CPython with glob(), open(), mkdir(parents=False) etc already present. The PurePath side is only half the story.

now seems to be growing to the point that people want anything and everything that touches the file system to be a method on a path object

Its purpose per its docstring and documentation is high-level path operations. Things like move() fall more comfortably into pathlib than shutil IMHO. PEP 428 briefly mentions this.

What use-cases are there for users reimplementing (probably badly…) these functions?

Users would implement the low level methods like open() and iterdir() in their AbstractPath subclass, and by doing so would gain methods like glob() and, in this proposal, move() for free. We’re doing for path-like objects what the io module did for file-like objects. cloudpathlib is my go-to example for this, but there are plenty of other examples of users applying the pathlib API to things other than the local filesystem on PyPI and GitHub.
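A minimal sketch of that idea, under loud assumptions: SketchPath is a hypothetical stand-in for the proposed AbstractPath (it is not a real pathlib class), and MemoryPath is a toy backend that keeps file contents in a dict. The backend implements only open() and unlink(); copy() and move() come from the base class.

```python
import io
from abc import ABC, abstractmethod


class SketchPath(ABC):
    """Hypothetical stand-in for the proposed AbstractPath."""

    @abstractmethod
    def open(self, mode="rb"): ...

    @abstractmethod
    def unlink(self): ...

    # High-level operations written once, in terms of the primitives above.
    def copy(self, target):
        with self.open("rb") as src, target.open("wb") as dst:
            dst.write(src.read())
        return target

    def move(self, target):
        self.copy(target)
        self.unlink()
        return target


class MemoryPath(SketchPath):
    """Toy backend: file contents live in a dict shared between instances."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def open(self, mode="rb"):
        if mode == "rb":
            return io.BytesIO(self.store[self.name])
        if mode == "wb":
            return _MemoryWriter(self)
        raise ValueError(f"unsupported mode: {mode!r}")

    def unlink(self):
        del self.store[self.name]


class _MemoryWriter(io.BytesIO):
    """Buffer that saves its contents back to the store when closed."""

    def __init__(self, path):
        super().__init__()
        self._path = path

    def close(self):
        if not self.closed:
            self._path.store[self._path.name] = self.getvalue()
        super().close()


store = {"a.txt": b"payload"}
moved = MemoryPath("a.txt", store).move(MemoryPath("b.txt", store))
```

After the last line the dict holds the payload under "b.txt" only, even though MemoryPath never defined move() itself.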

steven.daprano · September 20, 2022, 3:25am

Agreed that file/directory handling is scattered all over the place. I’m not sure that we need to do anything about it.

shutil is not a module for implementing shells, it is a module of utilities commonly found in shells. As I mentioned earlier, it is fairly arbitrary what bits belong to the os, a shell, or whether (like glob) it should be in its own module. Much of that is due to historical accidents.

pathlib is a clear win when it comes to being able to operate on paths as a data structure, e.g. joining paths. But when performing file/directory operations, it is a subjective matter of taste whether we prefer path.do_something(arg) or do_something(path, arg).

That’s why I have little or even negative interest in the churn of moving things from shutil, which uses a function API, to pathlib, which uses a method API. Whether you spell it sugar, suiker, сахар, or zucchero, it’s still got the same amount of calories.

If shutil supports path objects, and it should, then it should support any object which provides the same interface as path objects. There’s little need to move the implementation into pathlib if we use duck-typing.

storchaka · September 20, 2022, 6:41am

Do you want to deprecate the path-like protocol or what? Because the purpose of introducing that protocol was that you don’t need to add more and more methods to pathlib.Path and can just use existing functions with path-like objects. It is also easy to add support for path-like objects in your code without explicitly depending on pathlib.
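For readers unfamiliar with the protocol, a short sketch: the class ProjectFile below is hypothetical and never imports pathlib, yet open() and shutil.copyfile() accept it because it defines __fspath__().

```python
import os
import shutil
import tempfile


class ProjectFile:
    """Hypothetical path-like object with no dependency on pathlib."""

    def __init__(self, root, name):
        self.root = root
        self.name = name

    def __fspath__(self):
        # The path-like protocol: return the path as a str.
        return os.path.join(self.root, self.name)


with tempfile.TemporaryDirectory() as tmp:
    src = ProjectFile(tmp, "in.txt")
    dst = ProjectFile(tmp, "out.txt")
    with open(src, "w") as f:
        f.write("path-like")
    shutil.copyfile(src, dst)
    with open(dst) as f:
        copied = f.read()
```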

barneygale · September 20, 2022, 12:56pm

The path-like protocol could be expanded once we add AbstractPath. It currently covers two use cases:

You need a string because you’re doing string manipulation (e.g. os.path.join())
You need a string because you’re going to use an OS API (e.g. os.readlink())

At the moment these use cases are unified, so there is no issue. When we introduce AbstractPath we may want to introduce the distinction in the API. See this cloudpathlib issue for more.

It’s not directly relevant to moving shutil things, but only indirectly via the association I’ve made with the AbstractPath work.

barneygale · September 20, 2022, 8:59pm

A simpler example of the above is using the “wrong” OS flavour of PurePath. For example:

```python
import os, ntpath, pathlib

# perfectly reasonable on all OSs as we're just doing string manip
ntpath.join(pathlib.PureWindowsPath('c:/users'), 'me')

# what is the expected behaviour on non-Windows platforms here?
os.makedirs(pathlib.PureWindowsPath('c:/users/you'))
```

At the moment this is an edge case, but if we introduce AbstractPath it could become more of a problem.
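To make the edge case concrete, this sketch shows what os.fspath() hands to an OS API for a PureWindowsPath. The string uses the Windows spelling on every platform, so on a POSIX system os.makedirs() would receive a single name containing backslashes rather than a nested path. Nothing is created on disk here.

```python
import ntpath
import os
import pathlib

win_path = pathlib.PureWindowsPath("c:/users/you")

# os.fspath() yields the Windows spelling regardless of the host OS.
as_string = os.fspath(win_path)

# String manipulation with the matching flavour module is consistent everywhere.
joined = ntpath.join(pathlib.PureWindowsPath("c:/users"), "me")

print(repr(as_string))  # 'c:\\users\\you'
print(repr(joined))     # 'c:\\users\\me'
```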

Otherwise I have no objection to os.PathLike and os.fspath(). Very happy for folks to implement path-like objects using a single method; the AbstractPath business is only there for folks who want something richer. I see no reason to deprecate them, but perhaps there might be a reason to expand them to distinguish the above two cases.

toddrjen · October 24, 2022, 2:51pm

I don’t think they should be moved, but I think at least some of the functions should be wrapped in pathlib. For better or worse, pathlib has become the recommended tool for working with filesystems, so lacking basic filesystem operations is a problem for pathlib.

The three main operations I see being relevant are some version of copy, move, and rmtree.

For copy and move, these are elementary filesystem operations not supported in pathlib. They have been treated as basic filesystem operations as far back as Unix 1, OS/2, and PC-DOS 1. So as basic operations I think it is important to have them in pathlib. Some people make the argument that these are I/O operations, but pathlib already has I/O operations like read/write for text and binary.

rmtree is a basic operation as well, but I think more importantly there is a matter of symmetry. In pathlib it is possible to create a tree of directories, but not possible to remove the tree you created. If you can do it, you should be able to also undo it.
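The asymmetry is easy to show. In this sketch (directory names are arbitrary) pathlib creates a nested tree in one call, Path.rmdir() refuses to remove it because it is not empty, and shutil.rmtree() is needed to undo the creation.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, "a")
    leaf = root / "b" / "c"

    # Creating the whole tree is a single pathlib call.
    leaf.mkdir(parents=True)
    (leaf / "file.txt").write_text("x")

    # Path.rmdir() only removes empty directories.
    try:
        root.rmdir()
    except OSError:
        rmdir_failed = True
    else:
        rmdir_failed = False

    # Removing the tree currently means reaching for shutil.
    shutil.rmtree(root)
    root_exists = root.exists()
```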

For copy, rather than duplicating the entire API, I would suggest a single function, copy, with the following call signature:

Path.copy(dst, *, follow_symlinks=True, recursive=True, dir_exist_ok=True)

If recursive is True and the path is a directory, it would use copytree internally. Otherwise it would use copy2. There could also be a copy_metadata option to pick between copy and copy2, but I think if someone wants to do something as specialized as that they can use shutil. Alternatively there could be two methods, copy and copytree. But I think having a single general-purpose function fits better with pathlib.
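A minimal sketch of that dispatch, written as a free function so it runs today. The function is hypothetical and only mirrors the signature above; mapping follow_symlinks onto copytree's symlinks argument (as its inverse) and dir_exist_ok onto dirs_exist_ok are my assumptions about the intended behaviour.

```python
import shutil
import tempfile
from pathlib import Path


def copy(src, dst, *, follow_symlinks=True, recursive=True, dir_exist_ok=True):
    """Hypothetical free-function version of the proposed Path.copy()."""
    src = Path(src)
    if recursive and src.is_dir():
        # Assumption: symlinks=True preserves links, so invert follow_symlinks.
        return Path(shutil.copytree(src, dst,
                                    symlinks=not follow_symlinks,
                                    dirs_exist_ok=dir_exist_ok))
    # Single files (and directories with recursive=False) go to copy2,
    # which raises an error if src is a directory.
    return Path(shutil.copy2(src, dst, follow_symlinks=follow_symlinks))


with tempfile.TemporaryDirectory() as tmp:
    tree = Path(tmp, "tree")
    (tree / "sub").mkdir(parents=True)
    (tree / "sub" / "f.txt").write_text("x")

    copied_tree = copy(tree, Path(tmp, "tree_copy"))
    copied_file = copy(tree / "sub" / "f.txt", Path(tmp, "f_copy.txt"))

    tree_ok = (copied_tree / "sub" / "f.txt").read_text() == "x"
    file_ok = copied_file.read_text() == "x"
```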

For rmtree and move I think we can copy the API directly. We could add a recursive operation to Path.remove, but because this is such a destructive operation it is probably safer to have a separate method.

toddrjen · October 31, 2022, 1:59pm

I would be willing to implement these if people think they are useful.

Melendowski · October 31, 2022, 2:44pm

I think the biggest hurdle is reviewer availability. Barney has an open PR that no core member has had time to review, give feedback on, or approve.

toddrjen · November 2, 2022, 3:27pm

Can you please point me to the pull request? I wasn’t able to find it.

merwok · November 2, 2022, 3:40pm

That PR is about making Path easier to subclass (for virtual filesystems): gh-68320, gh-88302 - Allow for `pathlib.Path` subclassing by barneygale · Pull Request #31691 · python/cpython · GitHub

The biggest hurdle to move operations from shutil to pathlib is that the core team does not think it’s a good direction. You can read the discussion on another PR starting at this comment: Add `rmtree` & `copy` method to pathlib · Issue #92771 · python/cpython · GitHub

eryksun · November 2, 2022, 7:13pm

The arguments against implementing these high-level operations directly in pathlib haven’t addressed the suggestion to add AbstractPath, which would support hierarchical storage systems that aren’t OS filesystems. The high-level os.path and shutil functions are useless in such cases. Implementing high-level path operations such as move(), copy(), copytree(), and rmtree() in pathlib would allow a realized subclass of AbstractPath to get these complex operations without having to implement them from scratch.

This is quite a design challenge, however, which has to be weighed against the potential benefits. Are there enough strong use cases for AbstractPath to justify it? I think it’s better in general if storage systems interface with the OS for this – e.g. a FUSE filesystem on Unix, or a UNC provider on Windows.

EpicWink · November 2, 2022, 7:29pm

I’ve used S3 FUSE, and it’s a huge pain to deal with, including setup, secure credentials, etc, not to mention the extra effort required for use in Docker. Having native S3 interaction in Python is much more convenient and secure, with less friction.

merwok · November 2, 2022, 7:30pm

Indeed, the goal of my message was to separate these two things, which a previous message merged together.

It seems to me that there is support for the AbstractPath PR, but not for copying everything from shutil to pathlib, nor for deprecating functions in shutil.

eryksun · November 2, 2022, 9:09pm

I think the idea of moving high-level shutil functions into pathlib depends on the development of AbstractPath. I don’t view them as separate or something that should be implemented in stages. Getting copy(), move(), copytree(), and rmtree() for free would encourage projects to use AbstractPath. This translates to more real-world cases that harden the implementation and to more potential maintainers. I wouldn’t want AbstractPath to be added if it’s barely used and just becomes more cruft in the standard library.