Incrementally move high-level path operations from shutil to pathlib

This might be an enormous can of worms, but I’d like to suggest that certain high-level path operations in the (longstanding) shutil module might be more at home in the (relative newcomer) pathlib module, and that we could move them without breaking backwards compatibility, and unlock other benefits along the way.

Why?

shutil long predates pathlib. Its “shell utilities” remit is broad and overlaps with other modules, including pathlib. I wonder if Guido might be able to comment, but I get the impression it was the “module of last resort” for things that:

  • Were written in Python (so couldn’t be added to os), and
  • Didn’t need platform-specific implementations (so couldn’t be added to os.path), and
  • Were too small to deserve their own module (unlike glob, shlex, etc)

In PEP 428, Antoine suggested that pathlib might provide a good home for these functions:

“More operations could be provided, for example some of the functionality of the shutil module.”

These pathlib features have been requested perennially ever since.

By also introducing pathlib.AbstractPath (see this topic), we’d unlock the potential to apply some of these functions to different filesystem backends, such as S3 and its ilk. Users would be able to write path.move() without caring about the backing filesystem(s), which, like nature, is pretty neat!

What?

In my view, the functions in question are:

  • copy*(), including copytree() but excluding copyfileobj()
  • move()
  • rmtree()
  • chown()

How?

These functions could be added as methods of pathlib.Path, and in turn implemented using lower-level methods like Path.stat(), Path.open(), etc in many cases. When pathlib.AbstractPath is introduced, users would be able to supply their own implementations of these lower-level methods.
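
As a rough illustration of that layering, here is a minimal sketch of a file copy and a tree removal written only in terms of existing Path methods. The helper names `copy_file` and `remove_tree` are hypothetical and not part of pathlib or of this proposal; the sketch ignores metadata, permissions and the race-condition handling that shutil.rmtree() performs.

```python
import pathlib
import tempfile


def copy_file(source: pathlib.Path, target: pathlib.Path, chunk_size: int = 64 * 1024) -> None:
    """Copy file contents using only Path.open() (hypothetical helper)."""
    with source.open("rb") as src, target.open("wb") as dst:
        # Read and write in chunks so large files are not loaded at once.
        while chunk := src.read(chunk_size):
            dst.write(chunk)


def remove_tree(root: pathlib.Path) -> None:
    """Delete a directory tree using iterdir(), unlink() and rmdir() (hypothetical helper)."""
    for child in root.iterdir():
        # Recurse into real directories only; a symlink is unlinked, not followed.
        if child.is_dir() and not child.is_symlink():
            remove_tree(child)
        else:
            child.unlink()
    root.rmdir()


with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / "src").mkdir()
    (base / "src" / "a.txt").write_text("hello")

    copy_file(base / "src" / "a.txt", base / "b.txt")
    copied_text = (base / "b.txt").read_text()

    remove_tree(base / "src")
    src_exists_after = (base / "src").exists()
```

Because both helpers touch the filesystem only through methods on the path object, a subclass that overrides those methods would, in principle, get the same behaviour on a different backend.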

The Path.copy*() API may benefit from some revision, e.g. merging methods and adding arguments to control behaviour. Perhaps not.

The original implementations in shutil would call through to pathlib and probably undergo an extended deprecation period due to their high level of usage.

To make the implementation fully backwards compatible, we’d need to make the following (highly controversial!) changes to pathlib:

  • Support for bytes paths. I’m pretty sure it’s a settled question that pathlib should not support bytes, but I’d like to unsettle it :sweat_smile:. The shutil functions support bytes; glob.glob() supports bytes; any POSIX application built with portability and correctness in mind should use bytes. Correctness and ease of use are not enemies in Python, and we shouldn’t make them enemies in pathlib. On a technical level this is totally doable, and indeed lately we’re moving more towards treating the underlying “raw” path as an opaque object, and leaning more on posixpath and ntpath for low-level stuff.
  • Support for supplying directory/file descriptors. I believe Antoine intended to add support for this in pathlib but never finished it; remnants of this implementation survive in the pathlib codebase to this day!
  • Support for disabling path normalization. This is to ensure that shutil.rmtree('...') etc aren’t affected by subtle quirks in pathlib’s normalization logic, particularly on Windows.
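
To make two of these gaps concrete, the sketch below shows current behaviour as I understand it: glob accepts and returns bytes while pathlib rejects them, and pathlib normalizes a path string that a function like shutil.rmtree() would receive verbatim. The file name and path strings are made up for illustration.

```python
import glob
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "data.txt"), "w").close()
    # glob accepts a bytes pattern and returns bytes results.
    bytes_matches = glob.glob(os.path.join(os.fsencode(tmp), b"*.txt"))

# pathlib refuses bytes outright.
try:
    pathlib.PurePath(b"data.txt")
    bytes_accepted = True
except TypeError:
    bytes_accepted = False

# pathlib drops "." segments, repeated slashes and the trailing slash,
# so the string that comes back is not the string that went in.
raw = "./logs//app/./current/"
normalized = str(pathlib.PurePosixPath(raw))  # 'logs/app/current'

# ".." segments are kept, since collapsing them could change the meaning
# of the path when symlinks are involved.
kept_dotdot = str(pathlib.PurePosixPath("a/../b"))  # 'a/../b'
```

Today the usual workaround for bytes is to convert with os.fsdecode() before constructing a path and os.fsencode() afterwards.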

When?

There’s a lot to do first. For me, this only becomes compelling once we’ve introduced AbstractPath. I’ll also make the case for supporting bytes separately to this proposal when the time comes.

Still, is this worth (eventually) working towards? Thoughts? Thanks!

If your goal is to expand what pathlib does, that seems reasonable.

If you want to remove capabilities from shutil, that is problematic. This module is very old and was widely used to replace shell scripts. Often this was done without tests. Removing the functionality would likely break lots of old infrastructure that has been quietly doing its job.

Also, not everyone likes pathlib and instead prefers the simpler tech that mirrors what they already know.

That’s fair. I wouldn’t mind keeping the shutil names around indefinitely as a shortcut for Path(x).copytree(y) etc. There’s already a correspondence between pathlib methods and functions in os, os.path, glob, fnmatch and others.
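
A small sketch of that existing correspondence, plus the "function as shortcut for a method" pattern. Everything here uses APIs that exist today except the module-level `rename` wrapper, which is a hypothetical stand-in for how a shutil function might delegate to a Path method.

```python
import fnmatch
import glob
import os
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "notes.txt").write_text("x")

    # The function API and the method API give the same answers today.
    same_exists = os.path.exists(root / "notes.txt") == (root / "notes.txt").exists()
    same_glob = sorted(glob.glob(os.path.join(tmp, "*.txt"))) == sorted(
        str(p) for p in root.glob("*.txt")
    )
    same_match = fnmatch.fnmatch("notes.txt", "*.txt") == pathlib.PurePath("notes.txt").match("*.txt")

    def rename(src, dst):
        """Hypothetical function-style shortcut that delegates to the Path method."""
        return os.fspath(pathlib.Path(src).rename(dst))

    moved = rename(root / "notes.txt", root / "renamed.txt")
    moved_exists = os.path.exists(moved)
```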

That seems reasonable. I’d be inclined to reword it as “implementing shutil functionality in pathlib”, with the deduplication (by having shutil call on pathlib) more of an underlying detail than an announced feature, but I think it’d be handy to have the functionality in both places.

I suppose I put the emphasis on the “move” because there’s such a high bar to add things to Python. One implementation is asking a lot; two is surely too many! My wording in the title is ambiguous on the “implementation” vs “public API” bit though. Hope that makes some sense.

I’m not sure why you think that functions written in Python can’t be added to the os.py module :slight_smile:

I think that there is a somewhat arbitrary distinction between file system routines which are part of the OS versus those which are part of the shell, but given that distinction does exist, it makes sense to have a module for functions which are thin routines provided by the OS itself, and another module for more substantial functions which emulate routines that are provided by the shell.

shutil is not a grab-bag of miscellaneous routines that were placed in a single module because they didn’t fit anywhere else. They’re shell utilities. The name is kinda a hint :slight_smile:

I’m not sure that we should be overloading pathlib with every routine that operates on a path. The module started life as a way to manipulate path names that was a bit easier than string manipulation, and now seems to be growing to the point that people want anything and everything that touches the file system to be a method on a path object :frowning:

That leaves at least 12 other functions, so even if we follow your plan, we cannot remove shutil.

What use-cases are there for users reimplementing (probably badly…) these functions? Aside from Windows, Posix and Mac OS, are there enough common file systems to justify the added complexity and engineering to support this level of abstraction?

Thanks for the feedback!

“I’m not sure why you think that functions written in Python can’t be added to the os.py module :slight_smile:”

D’oh! Thanks.

“shutil is not a grab-bag of miscellaneous routines that were placed in a single module because they didn’t fit anywhere else. They’re shell utilities. The name is kinda a hint :slight_smile:”

Most things which qualify as “shell utilities” aren’t part of shutil. I mentioned glob and shlex, but there are also things like os.makedirs(), os.path.expandvars() and subprocess. Conversely, most of the things in shutil are more commonly used to implement things other than shells.

“I’m not sure that we should be overloading pathlib with every routine that operates on a path. The module started life as a way to manipulate path names that was a bit easier than string manipulation”

It started life in CPython with glob(), open(), mkdir(parents=False) etc already present. The PurePath side is only half the story.

“now seems to be growing to the point that people want anything and everything that touches the file system to be a method on a path object :frowning:”

Its purpose, per its docstring and documentation, is high-level path operations. Things like move() fall more comfortably into pathlib than shutil IMHO. PEP 428 briefly mentions this.

“What use-cases are there for users reimplementing (probably badly…) these functions?”

Users would implement the low level methods like open() and iterdir() in their AbstractPath subclass, and by doing so would gain methods like glob() and, in this proposal, move() for free. We’re doing for path-like objects what the io module did for file-like objects. cloudpathlib is my go-to example for this, but there are plenty of other examples of users applying the pathlib API to things other than the local filesystem on PyPI and GitHub.
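
To sketch the idea (and only the idea): the base class below is a hypothetical stand-in, not the proposed pathlib.AbstractPath API. `SketchPath`, `MemoryPath`, `child()` and `copy_to()` are names invented for this example. A backend implements a handful of primitives, and the high-level operation is written once in terms of them.

```python
import abc


class SketchPath(abc.ABC):
    """Hypothetical abstract path: subclasses supply the primitives only."""

    @property
    @abc.abstractmethod
    def name(self): ...

    @abc.abstractmethod
    def child(self, name): ...

    @abc.abstractmethod
    def is_dir(self): ...

    @abc.abstractmethod
    def iterdir(self): ...

    @abc.abstractmethod
    def mkdir(self): ...

    @abc.abstractmethod
    def read_bytes(self): ...

    @abc.abstractmethod
    def write_bytes(self, data): ...

    def copy_to(self, target):
        """High-level operation built purely from the primitives above."""
        if self.is_dir():
            target.mkdir()
            for entry in self.iterdir():
                entry.copy_to(target.child(entry.name))
        else:
            target.write_bytes(self.read_bytes())


class MemoryPath(SketchPath):
    """Toy backend: a dict mapping path parts to bytes (file) or None (directory)."""

    def __init__(self, store, parts=()):
        self._store = store
        self._parts = tuple(parts)

    @property
    def name(self):
        return self._parts[-1] if self._parts else ""

    def child(self, name):
        return MemoryPath(self._store, self._parts + (name,))

    def is_dir(self):
        return self._parts in self._store and self._store[self._parts] is None

    def iterdir(self):
        depth = len(self._parts)
        # sorted() takes a snapshot, so the store can change while we iterate.
        for parts in sorted(self._store):
            if len(parts) == depth + 1 and parts[:depth] == self._parts:
                yield MemoryPath(self._store, parts)

    def mkdir(self):
        self._store[self._parts] = None

    def read_bytes(self):
        return self._store[self._parts]

    def write_bytes(self, data):
        self._store[self._parts] = data


source_store = {
    (): None,
    ("docs",): None,
    ("docs", "a.txt"): b"alpha",
    ("b.txt",): b"beta",
}
target_store = {}
MemoryPath(source_store).copy_to(MemoryPath(target_store))
```

Since `copy_to()` never asks which backend it is talking to, the source and target could in principle be different `SketchPath` subclasses, which is the cross-filesystem case mentioned earlier in the thread.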

Agreed that file/directory handling is scattered all over the place. I’m not sure that we need to do anything about it.

shutil is not a module for implementing shells, it is a module of utilities commonly found in shells. As I mentioned earlier, it is fairly arbitrary what bits belong to the os, a shell, or whether (like glob) it should be in its own module. Much of that is due to historical accidents.

pathlib is a clear win when it comes to being able to operate on paths as a data structure, e.g. joining paths. But when performing file/directory operations, it is a subjective matter of taste whether we prefer path.do_something(arg) or do_something(path, arg).

That’s why I have little or even negative interest in the churn of moving things from shutil, which uses a function API, to pathlib, which uses a method API. Whether you spell it sugar, suiker, сахар, or zucchero, it’s still got the same amount of calories.

If shutil supports path objects, and it should, then it should support any object which provides the same interface as path objects. There’s little need to move the implementation into pathlib if we use duck-typing.

Do you want to deprecate the path-like protocol or what? Because the purpose of introducing that protocol was that you don’t need to add more and more methods to pathlib.Path and can just use existing functions with path-like objects. It is also easy to add support for path-like objects in your code without explicitly depending on pathlib.
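
For readers unfamiliar with the protocol, here is a minimal sketch of what it allows. `ProjectFile` is a hypothetical class invented for this example; defining `__fspath__()` is all it takes for open(), os and shutil functions to accept it, with no pathlib import involved.

```python
import os
import shutil
import tempfile


class ProjectFile:
    """Hypothetical path-like object: only __fspath__() is required."""

    def __init__(self, directory, name):
        self.directory = directory
        self.name = name

    def __fspath__(self):
        return os.path.join(self.directory, self.name)


with tempfile.TemporaryDirectory() as tmp:
    source = ProjectFile(tmp, "in.txt")
    target = ProjectFile(tmp, "out.txt")

    # Built-in open() and shutil.copyfile() both accept path-like objects.
    with open(source, "w") as handle:
        handle.write("payload")
    shutil.copyfile(source, target)
    with open(target) as handle:
        copied = handle.read()

    # os.fspath() returns the underlying string representation.
    as_string = os.fspath(target)
    is_pathlike = isinstance(target, os.PathLike)
```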

The path-like protocol could be expanded once we add AbstractPath. It currently covers two use cases:

  1. You need a string because you’re doing string manipulation (e.g. os.path.join())
  2. You need a string because you’re going to use an OS API (e.g. os.readlink())

At the moment these use cases are unified, so there is no issue. When we introduce AbstractPath we may want to introduce the distinction in the API. See this cloudpathlib issue for more.

This is only indirectly relevant to moving shutil things, via the association I’ve made with the AbstractPath work.

A simpler example of the above is using the “wrong” OS flavour of PurePath. For example:

import os, ntpath, pathlib

# perfectly reasonable on all OSs as we're just doing string manip
ntpath.join(pathlib.PureWindowsPath('c:/users'), 'me')

# what is the expected behaviour on non-Windows platforms here?
os.makedirs(pathlib.PureWindowsPath('c:/users/you'))

At the moment this is an edge case, but if we introduce AbstractPath it could become more of a problem.

Otherwise I have no objection to os.PathLike and os.fspath(). Very happy for folks to implement path-like objects using a single method; the AbstractPath business is only there for folks who want something richer. I see no reason to deprecate them, but perhaps there might be a reason to expand them to distinguish the above two cases.