Make pathlib extensible

It is quite common for me, actually. I will most likely perform more operations than just the pure ones, but I definitely remember writing a few functions that did exactly what you described.


Here’s the first draft of purepath benchmark extension. Honestly, it seems like I am doing something wrong but that’s because I’ve never written pyperformance benchmarks.

If anyone has suggestions on improving my draft, I’d be happy to hear them.


Yeah, it’s a common use case for me too. I’m really on the fence.

The problems I have with this optimization:

  1. It slows down cases where you’re either not walking directories, or not doing relevant PurePath operations. E.g. (Path('foo') / 'bar').read_text() performs two rounds of normalization, and they’re both pointless. I think this is the reason that pathlib is considered slow. And aside from micro-optimizations and possibly re-implementing it in C, there’s little room to improve.
  2. It precludes us from sharing state between path objects in user subclasses, because we can’t provide a single method that users can override to customize construction of derivative paths. There are alternatives (e.g. deriving new types on-the-fly) but they feel unnatural to me.

And so your pyperformance benchmarks have arrived at the perfect moment for me to stop theorising and actually measure the impact of implementing a version of PurePath that defers joining, parsing and normalizing until it’s needed.
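To make the idea concrete, here is a minimal sketch of deferred parsing. It is a toy, not pathlib's implementation: the class name LazyPurePath and its attributes are made up for illustration, and it only handles relative POSIX-style paths.

```python
import os
import posixpath


class LazyPurePath:
    """Hypothetical toy class: stores raw segments, parses only on demand."""

    def __init__(self, *segments):
        self._raw = segments   # kept verbatim; no joining or parsing here
        self._parts = None     # filled in lazily by the `parts` property

    def __truediv__(self, other):
        # Joining only records one more raw segment.
        return LazyPurePath(*self._raw, other)

    @property
    def parts(self):
        # Parsing and normalizing happen the first time parts are requested.
        if self._parts is None:
            joined = posixpath.join(*self._raw) if self._raw else ""
            self._parts = tuple(p for p in joined.split("/") if p and p != ".")
        return self._parts

    def __fspath__(self):
        # Filesystem calls get a plain, unnormalized join.
        return posixpath.join(*self._raw) if self._raw else "."


p = LazyPurePath("foo") / "bar"
text_for_os = os.fspath(p)      # no parsing has happened yet
parsed_later = p.parts          # parsing happens here, once
```

With this shape, something like (Path('foo') / 'bar').read_text() would never pay for normalization, while code that inspects parts pays once per object.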

Let’s hope that my benchmarks can actually prove useful. If they do not yet – I’m ready to improve them in any way I can.

:sparkles: January 2023 progress report :sparkles:

GH-100351 has landed, which improved the Python implementations of ntpath.splitdrive() and ntpath.normpath(), and brought the latter more in line with the native NT behaviour. Thanks to Steve Dower and Eryk Sun for the reviews!

I’ve opened GH-101002, which adds a new os.path.splitroot() function. The function parses paths into a (drive, root, tail) tuple, using OS-specific rules. By calling it from pathlib we considerably increase performance of WindowsPath construction. It’s also pretty useful in a variety of non-pathlib scenarios. Thank you Alex Waygood and Eryk Sun for helping review this.

If/when that lands, I’ll open two PRs focusing on pathlib performance. The first will tune the performance of path construction - basically a series of micro-optimizations. The second will be more radical: I want to see what sort of performance we can achieve from deferring path joining/parsing/normalization. It’s likely to have an adverse effect on the speed of directory walking, but it should be either performance-enhancing or performance-neutral everywhere else. It’s going to be really interesting to see!

If that lands, we can add a makepath() method without a performance hit. I expect a chorus of angels to accompany whoever hits “merge” on that one. It will have been a long time coming!

I’m following @Ovsyanka’s GH-100282 with excitement, and wondering whether we could implement Path.glob() using Path.walk(), and thereby make it safe from recursion errors on deep trees. It might also allow us to fix a glob() performance problem – I think currently unlogged – that every “**” wildcard in your pattern introduces an extra scandir() call on all visited directories.

I’m also reviewing @jugmac00’s GH-101223, which adds an explanation of match(), glob() and rglob() patterns beyond “see fnmatch”. Although fnmatch is used, it is called only to match individual path segments, and so the “*” wildcard doesn’t match path separators in pathlib.
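A small sketch of that difference, using only the pure path classes so it runs on any OS. The paths and pattern are made-up examples.

```python
import fnmatch
from pathlib import PurePosixPath

pattern = "a/*.py"

# fnmatch treats the whole string as one unit, so "*" can span "/".
whole_string = fnmatch.fnmatchcase("a/b/c.py", pattern)

# pathlib matches segment by segment, so "*" stays within one segment.
by_segment_deep = PurePosixPath("a/b/c.py").match(pattern)
by_segment_flat = PurePosixPath("a/c.py").match(pattern)

print(whole_string, by_segment_deep, by_segment_flat)  # True False True
```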

Finally, I’m excited to share that I’m now part of the Python Triage team. Thank you Zachary Ware and Alex Waygood for sponsoring me! Honestly it’s probably going to take a year or two to get the pathlib issues/PRs backlog to a more manageable size. We’ll get there though! :slight_smile:

Ciao for now o/


:sparkles: February 2023 progress report :sparkles:

GH-101002 has landed, and so Python 3.12 has gained an os.path.splitroot() function, which can split a path into a tuple of (drive, root, tail). Pathlib uses this function to efficiently parse paths according to OS-specific rules. Thanks again to Alex Waygood and Eryk Sun for their invaluable input, and respect to Antoine Pitrou for identifying the importance of three-part division when he created pathlib.
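For anyone who wants to try the three-part split, here is a hedged sketch. It calls ntpath.splitroot() when it exists (Python 3.12+) and otherwise falls back to a rough approximation built on ntpath.splitdrive(). The helper name splitroot_compat is hypothetical, and the fallback is my own simplification rather than the stdlib algorithm.

```python
import ntpath


def splitroot_compat(path):
    """Split a Windows-style path into (drive, root, tail)."""
    if hasattr(ntpath, "splitroot"):
        # Python 3.12+: use the real function.
        return ntpath.splitroot(path)
    # Older Pythons: approximate it with splitdrive().
    drive, rest = ntpath.splitdrive(path)
    if rest[:1] in ("\\", "/"):
        return drive, rest[0], rest[1:]
    return drive, "", rest


absolute = splitroot_compat("C:/Users/Barney")    # drive, root and tail
drive_relative = splitroot_compat("C:foo")        # drive but no root
relative = splitroot_compat("Windows/notepad")    # neither drive nor root
```

On Windows, os.path is ntpath, so os.path.splitroot() gives the same results there. On POSIX it applies POSIX rules instead.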

My plan now looks like this:

  1. Address GH-101362: Optimize pathlib path construction
    • I’ve opened three PRs that make small individual improvements: GH-101664, GH-101665 and GH-101667. I have one more of these on the way.
    • I’ll then open a larger PR that makes pathlib defer parsing/normalization until it’s needed
  2. Address GH-76846: pathlib.Path._from_parsed_parts() should call cls.__new__(cls) and GH-85281: subclasses of pathlib.PurePosixPath never call __init__() or __new__()
    • This will reduce performance of some pathlib operations, notably iterdir(), glob() and walk().
    • I’m hoping to make this performance loss as small as possible through the optimisations in step #1.
  3. Address GH-100479: Support for sharing state between pathlib subclasses
  4. Add pathlib.AbstractPath :partying_face:

I’m also looking at issues and feature requests related to glob() – the largest category of pathlib issues on GitHub. There are three lines of work that I think will converge:

  • Make glob() treat symlinks consistently – see GH-77609 for discussion
  • @Ovsyanka’s fast iterative implementation of walk() – PR: GH-100282
  • My fast regex-based implementation of match() – PR: GH-101398

With these in place, we can write a fast implementation of glob(), including a really chonky speedup for recursive globs. This should help relieve any lingering pain caused by the main plan (see step 2 above).
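As a rough illustration of how those pieces could fit together (this is not the code in any of the PRs above), here is a sketch of a recursive glob that walks with an explicit stack, matches names with a precompiled regex, and does not follow symlinked directories. The function name iter_rglob is hypothetical, and it only matches the final path segment.

```python
import fnmatch
import os
import re
import tempfile


def iter_rglob(top, pattern):
    """Yield paths under `top` whose final segment matches `pattern`."""
    match = re.compile(fnmatch.translate(pattern)).match  # compile once
    stack = [top]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        # One scandir() call per visited directory.
        with os.scandir(current) as entries:
            for entry in entries:
                if match(entry.name):
                    yield entry.path
                # Symlinked directories are not descended into.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)


# Small demonstration on a throwaway tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in (("a", "b", "c.txt"), ("a", "d.txt"), ("e.py",)):
        open(os.path.join(root, *rel), "w").close()
    found = sorted(os.path.relpath(p, root) for p in iter_rglob(root, "*.txt"))
```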

Thanks for reading! Bye for now


I’ve been looking into this because I found it annoying that when I create a pathlib.Path() on Windows I get a WindowsPath, and that its str produces a path with backslashes (\).
The Windows OS recognizes forward slashes (/); it is only a display or UI data-entry issue that requires a backslash!
It would seem to me that if you used this fact, it would simplify things (and incidentally inform programmers of the fact). The only Windows requirement would be to accept paths with backslashes and a drive letter. The internal representation should always be POSIX plus an optional drive letter. And I guess a special win-path printing option.
I have not dug into the code, so I apologize if this is in some way misguided.

FYI be very careful about taking the string representation of a pathlib.Path object and using it as an argument to something; you don’t want None to be a valid path. :wink: os.fspath() and os.fsdecode() both exist to get the string representation of a path-like object in the proper format. You can also use pathlib.PurePath.as_posix() to get a path with forward slashes.
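A quick sketch of the difference, using PureWindowsPath so it behaves the same on any OS. The file name is made up.

```python
import os
from pathlib import PureWindowsPath

path = PureWindowsPath("C:/Users/barney/notes.txt")
native = os.fspath(path)    # backslashes, the native Windows form
posix = path.as_posix()     # forward slashes

as_text = str(None)         # the string 'None', which looks like a file name
try:
    os.fspath(None)         # non-path-like objects are rejected
    rejected = False
except TypeError:
    rejected = True
```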


For pathlib’s Path, PurePath etc., os.fspath() ends up calling str(self) on the path object, so the result is the same either way. That is not necessarily true for other PathLike objects, though.

Most of the functions in the Windows file API first normalize a path into native NT form before making a system call such as NtCreateFile() or NtOpenFile(). Among other things, path normalization replaces forward slashes with backslashes. There are exceptions.

Of course, normalization is intentionally skipped for “\\?\” device paths. For example, r"\\?\C:\Windows/System32" is an invalid path because NTFS reserves forward slash as an invalid name character. Like all code in the native NT API and system services, the NTFS filesystem only handles backslash as a path separator.

>>> os.stat(r'\\?\C:\Windows/System32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\\\\?\\C:\\Windows/System32'

None of the Path* API functions, such as PathCchSkipRoot(), handle forward slash as a path separator.

When creating a relative symbolic link, CreateSymbolicLinkW() (i.e. os.symlink()) does not replace forward slashes with backslashes in the target path. This creates a broken symlink since paths in the kernel only use backslash as a path separator.

>>> os.mkdir('spam')
>>> open('spam\\eggs', 'w').close()
>>> os.symlink('spam/eggs', 'eggslink')
>>> os.stat('eggslink')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'eggslink'

There are several other functions in the Windows API that take file paths and fail to normalize forward slashes as backslashes, such as NeedCurrentDirectoryForExePathW().

It’s really a laundry list of exceptions to the rule. Better to just use the native path separator than to worry about what does and does not support forward slashes.
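If a path arrives with forward slashes, converting it to native separators before handing it to something like os.symlink() is cheap. A minimal sketch, using ntpath and PureWindowsPath directly so it can be tried on any OS:

```python
import ntpath
from pathlib import PureWindowsPath

target = "spam/eggs"

# Both approaches rewrite "/" as "\" under Windows rules.
via_pathlib = str(PureWindowsPath(target))
via_normpath = ntpath.normpath(target)

print(via_pathlib, via_normpath)  # spam\eggs spam\eggs
```

On Windows itself, os.path.normpath() is the same function as ntpath.normpath().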

Also, when paths are parsed as command-line arguments, applications may fail to handle paths that use forward slashes. Notably, the CMD shell has this problem. For example:

>>> os.system('dir C:/Windows')
Parameter format not correct - "Windows".

What I’m saying is that normalization is not required.

The key thing is os.fspath() prevents you from accidentally calling str() on a non-path-like object like None.

The idea is that the string representation is like an encoding of a path just like some integer can be an encoding for a Unicode code point, and thus not something to directly think about if you’re using pathlib.


:sparkles: March 2023 progress report :sparkles:

Thank you to @AlexWaygood, @hauntsaninja and @steve.dower for reviewing and merging performance improvements to path construction. There’s one remaining PR to land on that issue, after which it can be resolved. I’ve logged an issue for optimizing PurePath.__fspath__() by returning an unnormalized path, and another for implementing os.path.splitroot() in C. I’m also looking at an issue with glob() performance.

Adding AbstractPath is a multi-year yak-shaving exercise, and with some of those performance improvements now in place, I can approach the yak with shears in hand:

That PR unifies and simplifies path construction, and opens the door to adding AbstractPath in short order. It’s something of a milestone for this project! I’m beginning to believe we could land AbstractPath in time for Python 3.13 :slight_smile:

Thanks as ever for reading, ta ra!


Oh, and lest we forget to mention, perhaps one of the most important updates is to congratulate @barneygale on his nomination to core developer (and pathlib maintainer) on the basis of his exceptionally diligent, thoughtful and tireless work on pathlib and beyond!


And now it’s official!


Congratulations, @barneygale . You really deserve it and it has been a pleasure working with you on pathlib so far!


:sparkles: April 2023 progress report :sparkles:

Big thanks to Steve Dower for reviewing GH-102789, which shaved the aforementioned yak. Path object construction now uses a single code path, so user subclasses can override __new__() and __init__() and expect that their methods will actually be called when new path objects are created.

For Python 3.12 beta 1, I’m hoping to get one more improvement in: GH-100481, which adds a new makepath() method. User subclasses can override this method to customize how path objects are created. Among other things, this allows users to share objects such as sockets or fileobjs between path objects. When this lands I will consider support for subclassing of pathlib classes complete :partying_face:
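To show the shape of the idea without depending on any particular pathlib version, here is a toy class that is not pathlib at all. The name makepath() comes from the post above; the SessionPath class, its session attribute and the exact signature are assumptions made for illustration.

```python
class SessionPath:
    """Hypothetical toy path class carrying a shared `session` object."""

    def __init__(self, *segments, session=None):
        self.segments = tuple(segments)
        self.session = session

    def makepath(self, *segments):
        # Single factory hook: every derivative path is built here,
        # so shared state is passed along in exactly one place.
        return type(self)(*segments, session=self.session)

    def __truediv__(self, name):
        return self.makepath(*self.segments, name)

    @property
    def parent(self):
        return self.makepath(*self.segments[:-1])


shared = object()  # stands in for a socket or file object
base = SessionPath("archive", session=shared)
child = base / "member.txt"
```

Because joining and parent both go through makepath(), a subclass only needs to override that one method to control how derived paths are created.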

For Python 3.13, I’ll aim to add a tarfile.TarPath class (see GH-89812) utilizing something like pathlib._AbstractTraversable or pathlib._AbstractPath. This should bring any remaining shortcomings to the fore; once resolved, we can drop the _ prefix(es); that could happen in time for 3.13, but more likely it will be a 3.14 or 3.15 thing.

In mostly-unrelated pathlib news, I have two open PRs that implement glob()-related features (GH-101398, GH-102710), and two that slightly improve performance (GH-103526, GH-103549). Appreciative of any reviews!

That’s all for now, cheers!


Hey, I am not sure if this is an appropriate thread, but I have an idea as to how to make Pathlib even better.

When I write files to disk, I seem to repeat this pattern very often:

import itertools
from pathlib import Path

file_name, file_counter = Path("file_name.here"), itertools.count(1)
stem, suffix = file_name.stem, file_name.suffix
while file_name.is_file():
    # suffix already includes the leading dot
    file_name = Path(f"{stem} ({next(file_counter)}){suffix}")

Would it be possible if Pathlib contained a built-in method which gave you the first name which doesn’t overwrite an already existing file? I would imagine that this is a very common problem to run into, not just by me.

Would tempfile.mkstemp() work for that use case? It avoids a race condition.

import tempfile
import pathlib

fd, path = tempfile.mkstemp(prefix="file_name.")
path = pathlib.Path(path)

Very interesting solution here, and also a very good way of avoiding a race condition. Would it be possible for something like this to be called from a Path instance itself? e.g.

import pandas as pd
from pathlib import Path

path = Path(r"/path/to/csv/test_csv.csv")
df = pd.read_csv(path)

...

# Given right flags, writes to mkstemp-style path like "/path/to/csv/test_csv_x0u3hb3m.csv", thus keeping both the original and a modified copy of the CSV file
df.to_csv(path.mkstemp(...))
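
Until something like that exists, a small helper can get close with today's tempfile API. This is only a sketch: mkstemp_sibling is a hypothetical name, not a pathlib method, and it assumes the target directory already exists.

```python
import os
import tempfile
from pathlib import Path


def mkstemp_sibling(path):
    """Create a uniquely named empty file next to `path`, keeping its suffix."""
    path = Path(path)
    fd, name = tempfile.mkstemp(
        prefix=path.stem + "_",  # e.g. "test_csv_"
        suffix=path.suffix,      # e.g. ".csv"
        dir=path.parent,         # same directory as the original
    )
    os.close(fd)  # the file now exists; reopen it by name to write
    return Path(name)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    original = Path(root) / "test_csv.csv"
    copy_path = mkstemp_sibling(original)
    same_dir = copy_path.parent == original.parent
    created = copy_path.is_file()
    copy_name, copy_suffix = copy_path.name, copy_path.suffix
```

The returned path could then be passed to df.to_csv(), which overwrites the empty placeholder file.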