Make pathlib extensible

Ovsyanka · December 26, 2022, 8:42pm

@barneygale I am obviously not a core dev, but am always happy to help with anything I can. I love pathlib and pathlab so I would love to assist with anything you need.

ajoino · December 26, 2022, 10:55pm

Out of curiosity: does the addition of AbstractPath, assuming AbstractPath is added to pathlib and doesn’t change other parts of Python, require a PEP?

barneygale · December 27, 2022, 1:03am

I strongly suspect it will affect or interact with other bits of Python. Two examples:

We may wish to make AbstractPath.stat() an abstract method, but what would it return? os.stat_result works, but it has an OS-specific implementation (for OS-specific fields) and no public constructor. We could add a compatible class to pathlib, but its interface may still be too low-level for pathlib users. My present thinking is that we add an AbstractPath.status() method that returns a rich object, perhaps pathlib.Status, that can be converted to and from a stat_result, and that features high-level methods for determining file type, permissions, etc. Arguably this class could live in the stat module.

Unless we make a serious intervention, open(ZipPath('README.md')) will open a local file in the current directory, potentially causing serious confusion to users. Technically they should be using ZipPath('README.md').open() instead, but I’d rather not lay such a trap in the first place. To fix this, we’d need to set AbstractPath.__fspath__ = None, thus making AbstractPath un-PathLike. This bewildering state of affairs would make os.path.splitext(ZipPath('README.md')) also fail, despite it only doing lexical work on the input. Other purely lexical functions that accept os.PathLike arguments would be similarly unable to accept AbstractPath. This might require us to make a distinction between pure and concrete paths in os.PathLike, os.fspath() and the __fspath__() magic method.
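To make the trade-off concrete, here is a minimal sketch. Both classes are hypothetical stand-ins (neither is a real or proposed pathlib class); the only difference between them is whether `__fspath__` is implemented or set to None.

```python
import os


class LocalLookingPath:
    """Hypothetical path class that implements __fspath__ (illustration only)."""

    def __init__(self, name):
        self.name = name

    def __fspath__(self):
        return self.name


class OptedOutPath:
    """Hypothetical path class that opts out of os.PathLike."""

    __fspath__ = None

    def __init__(self, name):
        self.name = name


local = LocalLookingPath('README.md')
opted_out = OptedOutPath('README.md')

# With __fspath__ defined, os functions treat the object as a local path.
local_string = os.fspath(local)
local_ext = os.path.splitext(local)[1]

# With __fspath__ = None, even purely lexical helpers reject the object.
try:
    os.path.splitext(opted_out)
    lexical_call_failed = False
except TypeError:
    lexical_call_failed = True

print(local_string, local_ext, lexical_call_failed)
print(isinstance(local, os.PathLike), isinstance(opted_out, os.PathLike))
```

The first class is the one that would let open() silently reach for a local file; the second avoids that, at the cost of losing os.path.splitext() and friends.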

Rosuav · December 27, 2022, 1:52am

Barney Gale:

Unless we make a serious intervention, open(ZipPath('README.md')) will open a local file in the current directory, potentially causing serious confusion to users. Technically they should be using ZipPath('README.md').open() instead, but I’d rather not lay such a trap in the first place. To fix this, we’d need to set AbstractPath.__fspath__ = None, thus making AbstractPath un-PathLike. This bewildering state of affairs would make os.path.splitext(ZipPath('README.md')) also fail, despite it only doing lexical work on the input. Other purely lexical functions that accept os.PathLike arguments would be similarly unable to accept AbstractPath. This might require us to make a distinction between pure and concrete paths in os.PathLike, os.fspath() and the __fspath__() magic method.

Crazy thought: What if a path-like object had an inherent “context”? If it’s None (the default, for backward compatibility), it represents a file system path, and can be treated as such. But a ZipPath could set the context to be the zip file, and you could have a concept of remote paths (maybe in an SCP transfer, where the “context” would be the server connection). The context object would have to be responsible for opening files and any other concrete operations, but purely-lexical functions could ignore (and maintain) it.

barneygale · December 28, 2022, 1:01am

It’s a good thought, and something I’ve explored myself, but I found that the differences between the ‘path’ and the ‘context’ classes are too slight, and that users would need to implement both in a range of scenarios, so it gets clunky/boilerplate-y fast. As @Conchylicultor points out, it’s more natural for users to customize __init__() to their liking than to pass in a ‘context’ object.

barneygale · December 28, 2022, 6:58pm

I might take you up on that actually!

We’re in need of a performance benchmarking suite for pathlib. It’s necessary for evaluating the impact of gh-100481 and the targets in ideas-194. I’d love for it to have:

Real-world vectors, e.g. from public source code
Benchmarking of PurePath construction, joining, parent, parents, name, suffix, with_name(), with_suffix() and any other super-common operations
Benchmarking of Path iterdir(), glob(), walk(), absolute()
Comparisons with equivalent os.path/etc code for the above.
Benchmarking of representative snippets that combine multiple operations (e.g. read a few files in a ‘config’ directory; check whether .tar and .tar.gz siblings exist of a directory; attempt to determine a file type by checking the file extension and/or first few bytes; etc)

If you wanted to have a go at any of those I’d greatly appreciate it! But if not, I’ll find some time over the coming weeks.
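As a rough idea of the shape such comparisons could take, here is a minimal sketch using the standard timeit module. It is not a pyperformance benchmark, the sample path is made up rather than a real-world vector, and the `per_call` helper is a hypothetical name.

```python
import os.path
import timeit
from pathlib import PurePath


def per_call(func, number=20_000):
    """Return the average seconds per call of func over `number` runs."""
    return timeit.timeit(func, number=number) / number


base = PurePath('cpython', 'Lib', 'pathlib.py')
base_str = os.path.join('cpython', 'Lib', 'pathlib.py')

timings = {
    'PurePath(...)': per_call(lambda: PurePath('cpython', 'Lib', 'pathlib.py')),
    'os.path.join(...)': per_call(lambda: os.path.join('cpython', 'Lib', 'pathlib.py')),
    'PurePath.with_suffix': per_call(lambda: base.with_suffix('.pyc')),
    'os.path.splitext + concat': per_call(lambda: os.path.splitext(base_str)[0] + '.pyc'),
}

for label, seconds in timings.items():
    print(f'{label:28s} {seconds * 1e9:10.0f} ns/call')
```

The absolute numbers depend on the machine and Python version, so only the ratios between a pathlib operation and its os.path counterpart are meaningful.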

Ovsyanka · December 28, 2022, 7:33pm

Yeah, I’d love to help you with that. Let me start with point 2 after the holidays, and then I’ll update you on their status and, if I have time, I’ll move on to points 3 and 4:

Benchmarking of PurePath construction, joining, parent, parents, name, suffix, with_name(), with_suffix() and any other super-common operations

Benchmarking of Path iterdir(), glob(), walk(), absolute()

Comparisons with equivalent os.path/etc code for the above.

One question though: where would we store the benchmarks? My personal separate github repo? Or somewhere under python/ organization? And if so, where?

TeamSpen210 · December 28, 2022, 9:15pm

Perhaps expanding on the existing pyperformance benchmark? That’s used to test the performance of interpreter optimisations in general, and is aiming to have more real-world tests.

pitrou · January 2, 2023, 10:55am

Hey,

It’s 2023 now and I don’t have anything concrete to contribute to this discussion, but as the original author of pathlib I would like to congratulate you all (and especially @barneygale ) for advancing this despite my inactivity. Happy new year, and keep up the good work!

brettcannon · January 4, 2023, 10:37pm

Easiest way is to update .github/CODEOWNERS in the exact opposite way I am doing it in “Drop myself from pathlib maintenance” (python/cpython pull request #100757). There is also the python/cpython issue tracker on GitHub, which anyone can use to quickly see pathlib-related issues.

barneygale · January 15, 2023, 5:20pm

Some thoughts on pathlib performance:

PurePath objects have two main constructors: _from_parts() and _from_parsed_parts().

_from_parts() is used in the majority of cases, including when you call PurePath('foo', 'bar'). It performs the following (expensive) normalization + parsing routine:

Join the arguments together with os.path.join()
On Windows, convert forward slashes to backward slashes
Partition the path into drive, root, tail segments
Split the tail on path separators into ‘parts’
a. Remove ‘.’ segments and empty segments
b. Prepend the drive + root, if not empty
Create the PurePath object and assign _drv, _root and _parts.

Thus path objects are fully normalized + parsed on construction.
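The routine above can be sketched roughly as follows for POSIX-style paths. This is an illustration of the listed steps, not pathlib's actual code: `parse_posix` is a hypothetical helper, and it ignores the special case of a leading double slash.

```python
import posixpath


def parse_posix(*args):
    """Hypothetical helper: join, partition and split POSIX path arguments."""
    # Step 1: join the arguments; a later absolute argument discards earlier ones.
    path = posixpath.join(*args) if args else ''
    # Step 2 (slash conversion) only applies on Windows and is skipped here.
    # Step 3: partition into drive, root and tail. POSIX paths have no drive.
    drive = ''
    root = '/' if path.startswith('/') else ''
    tail = path[len(root):]
    # Step 4: split the tail, dropping '.' and empty segments,
    # then prepend drive + root when non-empty.
    parts = [segment for segment in tail.split('/') if segment and segment != '.']
    if drive or root:
        parts.insert(0, drive + root)
    return drive, root, parts


print(parse_posix('foo', './bar//baz'))  # ('', '', ['foo', 'bar', 'baz'])
print(parse_posix('/usr', 'lib/'))       # ('', '/', ['/', 'usr', 'lib'])
print(parse_posix('a', '/b'))            # ('', '/', ['/', 'b'])
```

Every step runs on every construction, which is where the cost comes from.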

_from_parsed_parts() is used in cases where we can skip the above routine and instead directly assign _drv, _root and _parts. These are:

When iterating or walking directories with iterdir(), glob(), walk(). Under the hood, these use os.listdir() and os.scandir(), which return names that are guaranteed not to contain drives, path separators, etc, so they can be naively appended to _parts.
When computing parent directories in .parent and .parents. In this case we can safely pop the items off _parts when constructing the new paths. But this isn’t usually in performance-sensitive areas of code.
To a certain extent, with_name() and with_suffix(), though some parsing + error checking is still performed.

The result is that it’s “cheap” to keep paths fully normalized when walking directories. The following code only runs _from_parts() once:

import pathlib

path = pathlib.Path('cpython/Lib')
for py_path in path.rglob('*.py'):
    print(py_path.name)

You may then ask “when are normalized paths useful?”.

In PurePath, most operations (such as suffix, with_name(), __hash__() and match()) require a fully-normalized path. There are some notable exceptions: joinpath() and __truediv__() could be made to work without even os.path.join()!

In Path, passing an unnormalized path to the OS should be equivalent to a normalized path, otherwise pathlib’s normalization logic is broken! Hence normalization confers no benefit, though we still need to call os.path.join().

Putting these pieces together, we can conclude that pathlib is currently optimized for the following use case: iterate a directory (or walk a directory tree) and perform pure operations on the directory children, e.g. name, with_suffix(), match(), etc. The previous code example demonstrates this.

Question to the audience: how common does that use case seem to you? Is it worth us slowing down some other parts of pathlib by keeping paths fully normalized + parsed?

Ovsyanka · January 15, 2023, 5:24pm

It is quite common for me, actually. Though I will most likely perform more operations than just the pure ones, I definitely remember writing a few functions that did exactly what you described.

Ovsyanka · January 15, 2023, 5:30pm

Here’s the first draft of purepath benchmark extension. Honestly, it seems like I am doing something wrong but that’s because I’ve never written pyperformance benchmarks.

If anyone has suggestions on improving my draft, I’d be happy to hear them.

barneygale · January 15, 2023, 6:04pm

Yeah, it’s a common use case for me too. I’m really on the fence.

The problems I have with this optimization:

It slows down cases where you’re either not walking directories, or not doing relevant PurePath operations. E.g. (Path('foo') / 'bar').read_text() performs two rounds of normalization, and they’re both pointless. I think this is the reason that pathlib is considered slow. And aside from micro-optimizations and possibly re-implementing it in C, there’s little room to improve.
It precludes us from sharing state between path objects in user subclasses, because we can’t provide a single method that users can override to customize construction of derivative paths. There are alternatives (e.g. deriving new types on-the-fly) but they feel unnatural to me.

And so your pyperformance benchmarks have arrived at the perfect moment for me to stop theorising and actually measure the impact of implementing a version of PurePath that defers joining, parsing and normalizing until it’s needed.
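For readers wondering what deferring might look like, here is a minimal sketch under loud assumptions: `LazyPosixPath` is a hypothetical class, it handles POSIX-style strings only, and it says nothing about how pathlib would actually implement this.

```python
import os
import posixpath
from functools import cached_property


class LazyPosixPath:
    """Hypothetical sketch of deferred parsing; not pathlib's implementation."""

    def __init__(self, *args):
        # Only store the raw arguments: no joining, parsing or normalizing yet.
        self._raw = args

    def __truediv__(self, other):
        # Joining just extends the stored arguments.
        return type(self)(*self._raw, other)

    def __fspath__(self):
        # I/O needs only the joined string, never the parsed parts.
        return posixpath.join(*self._raw) if self._raw else '.'

    @cached_property
    def parts(self):
        # Parsing happens on first access, and the result is cached.
        path = posixpath.join(*self._raw) if self._raw else ''
        root = ['/'] if path.startswith('/') else []
        return root + [s for s in path.split('/') if s and s != '.']


path = LazyPosixPath('foo') / 'bar'
joined = os.fspath(path)               # no parsing triggered
parsed_before = 'parts' in vars(path)
parts = path.parts                     # parsing happens here
parsed_after = 'parts' in vars(path)

print(joined, parsed_before, parts, parsed_after)
```

In this shape, something like (Path('foo') / 'bar').read_text() would pay for one join and no parsing, while name, suffix and similar properties pay the parsing cost on first use.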

Ovsyanka · January 15, 2023, 6:20pm

Let’s hope that my benchmarks can actually prove useful. If they do not yet – I’m ready to improve them in any way I can.

barneygale · January 22, 2023, 7:37pm

January 2023 progress report

GH-100351 has landed, which improved the Python implementations of ntpath.splitdrive() and ntpath.normpath(), and brought the latter more in line with the native NT behaviour. Thanks to Steve Dower and Eryk Sun for the reviews!

I’ve opened GH-101002, which adds a new os.path.splitroot() function. The function parses paths into a (drive, root, tail) tuple, using OS-specific rules. By calling it from pathlib we considerably increase performance of WindowsPath construction. It’s also pretty useful in a variety of non-pathlib scenarios. Thank you Alex Waygood and Eryk Sun for helping review this.

If/when that lands, I’ll open two PRs focusing on pathlib performance. The first will tune the performance of path construction - basically a series of micro-optimizations. The second will be more radical: I want to see what sort of performance we can achieve from deferring path joining/parsing/normalization. It’s likely to have an adverse effect on the speed of directory walking, but it should be either performance-enhancing or performance-neutral everywhere else. It’s going to be really interesting to see!

If that lands, we can add a makepath() method without a performance hit. I expect a chorus of angels to accompany whoever hits “merge” on that one. It will have been a long time coming!

I’m following @Ovsyanka’s GH-100282 with excitement, and wondering whether we could implement Path.glob() using Path.walk(), and thereby make it safe from recursion errors on deep trees. It might also allow us to fix a glob() performance problem – I think currently unlogged – that every “**” wildcard in your pattern introduces an extra scandir() call on all visited directories.
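To illustrate the general idea (not the code in GH-100282, and not a proposal for glob() itself), here is a minimal sketch of a stack-based recursive match. `iter_rglob` is a hypothetical helper that only supports a single name pattern, uses an explicit stack instead of recursion, and calls scandir() once per visited directory.

```python
import fnmatch
import os


def iter_rglob(top, pattern):
    """Hypothetical helper: yield paths under `top` whose name matches `pattern`."""
    stack = [os.fspath(top)]
    while stack:
        current = stack.pop()
        # One scandir() call per visited directory.
        with os.scandir(current) as entries:
            for entry in entries:
                # Queue subdirectories instead of recursing into them.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                if fnmatch.fnmatchcase(entry.name, pattern):
                    yield entry.path


for found_path in iter_rglob('.', '*.py'):
    print(found_path)
```

Because the pending directories live in a list rather than on the call stack, tree depth is limited by memory rather than by the recursion limit.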

I’m also reviewing @jugmac00’s GH-101223, which adds an explanation of match(), glob() and rglob() patterns beyond “see fnmatch”. Although fnmatch is used, it is called only to match individual path segments, and so the “*” wildcard doesn’t match path separators in pathlib.
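A small example of the difference, comparing fnmatch applied to a whole string with PurePath.match() applied per segment (the sample path is made up):

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

path = PurePosixPath('a/b/c.py')

# fnmatch on the whole string: "*" happily matches across "/".
whole_string = fnmatchcase('a/b/c.py', 'a/*.py')

# pathlib matches segment by segment, so "*" stays within one segment.
per_segment = path.match('a/*.py')
explicit_depth = path.match('a/*/*.py')
name_only = path.match('*.py')

print(whole_string, per_segment, explicit_depth, name_only)
# True False True True
```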

Finally, I’m excited to share that I’m now part of the Python Triage team. Thank you Zachary Ware and Alex Waygood for sponsoring me! Honestly it’s probably going to take a year or two to get the pathlib issues/PRs backlog to a more manageable size. We’ll get there though!

Ciao for now o/

barneygale · February 19, 2023, 2:43pm

February 2023 progress report

GH-101002 has landed, and so Python 3.12 has gained an os.path.splitroot() function, which can split a path into a tuple of (drive, root, tail). Pathlib uses this function to efficiently parse paths according to OS-specific rules. Thanks again to Alex Waygood and Eryk Sun for their invaluable input, and respect to Antoine Pitrou for identifying the importance of three-part division when he created pathlib.
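A quick look at what the function returns, as a sketch that assumes Python 3.12 or later (the example paths are made up, and `splitroot_examples` is a hypothetical wrapper). It calls the ntpath and posixpath variants directly so the results don't depend on the host OS.

```python
import ntpath
import posixpath


def splitroot_examples():
    """Return sample splitroot() results; requires Python 3.12 or later."""
    return {
        'windows drive': ntpath.splitroot('C:/Users/Sam'),
        'windows UNC': ntpath.splitroot('//Server/Share/Users'),
        'posix absolute': posixpath.splitroot('/home/sam'),
        'posix relative': posixpath.splitroot('home/sam'),
    }


# Guarded so the snippet is a no-op on older Python versions.
if hasattr(posixpath, 'splitroot'):
    for label, result in splitroot_examples().items():
        print(f'{label:15s} {result}')
```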

My plan now looks like this:

Address GH-101362: Optimize pathlib path construction
- I’ve opened three PRs that make small individual improvements: GH-101664, GH-101665 and GH-101667. I have one more of these on the way.
- I’ll then open a larger PR that makes pathlib defer parsing/normalization until it’s needed
Address GH-76846: pathlib.Path._from_parsed_parts() should call cls.__new__(cls) and GH-85281: subclasses of pathlib.PurePosixPath never call __init__() or __new__()
- This will reduce performance of some pathlib operations, notably iterdir(), glob() and walk().
- I’m hoping to make this performance loss as small as possible through the optimisations in step #1.
Address GH-100479: Support for sharing state between pathlib subclasses
Add pathlib.AbstractPath

I’m also looking at issues and feature requests related to glob() – the largest category of pathlib issues on GitHub. There are three lines of work that I think will converge:

Make glob() treat symlinks consistently – see GH-77609 for discussion
@Ovsyanka’s fast iterative implementation of walk() – PR: GH-100282
My fast regex-based implementation of match() – PR: GH-101398

With these in place, we can write a fast implementation of glob(), including a really chonky speedup for recursive globs. This should help relieve any lingering pain caused by the main plan (see step 2 above).

Thanks for reading! Bye for now

eorojas · February 27, 2023, 9:04pm

I’ve been looking around at this because I found it annoying that when I create a pathlib.Path() on Windows I get a WindowsPath, and that its str produces a path with backslashes (\).
The Windows OS recognizes forward slashes (/); it is only a display or UI data entry issue that requires a backslash!
It would seem to me that if you used this fact, it would simplify things (and incidentally inform programmers of the fact). The only Windows requirement would be to accept paths with backslashes and a drive letter. The internal representation should always be posix and an optional drive letter. And I guess a special win-path printing option.
I have not dug into the code, so I apologize if this is in some way misguided.

brettcannon · February 27, 2023, 10:00pm

FYI be very careful about taking the string representation of a pathlib.Path object and using it as an argument to something; you don’t want None to be a valid path. os.fspath() and os.fsdecode() both exist to get the string representation of a path-like object in the proper format. You can also use pathlib.PurePath.as_posix() to get a path with forward slashes.

domdfcoding · February 27, 2023, 10:30pm

For pathlib’s Path, PurePath, etc., os.fspath ends up calling str(self) on the path object, so the result ends up the same regardless. Not necessarily true for other PathLike objects though.
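A short example of the points in the last two posts, using PureWindowsPath so it behaves the same on any OS (the sample path is made up):

```python
import os
from pathlib import PureWindowsPath

path = PureWindowsPath('C:/Users/sam/notes.txt')

as_str = str(path)            # backslashes on Windows flavours
as_fspath = os.fspath(path)   # same string for pathlib objects
as_posix = path.as_posix()    # forward slashes

# str() accepts anything, including None; os.fspath() refuses non-paths.
none_as_str = str(None)
try:
    os.fspath(None)
    none_rejected = False
except TypeError:
    none_rejected = True

print(as_str, as_fspath, as_posix, none_as_str, none_rejected)
```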