@barneygale I am obviously not a core dev, but I am always happy to help with anything I can. I love pathlib and pathlab, so I would love to assist with anything you need.
Out of curiosity: does the addition of AbstractPath, assuming AbstractPath is added to pathlib and doesn’t change other parts of Python, require a PEP?
I strongly suspect it will affect or interact with other bits of Python. Two examples:
We may wish to make AbstractPath.stat() an abstract method, but what would it return? os.stat_result works, but it has an OS-specific implementation (for OS-specific fields) and no public constructor. We could add a compatible class to pathlib, but its interface may still be too low-level for pathlib users. My present thinking is that we add an AbstractPath.status() method that returns a rich object, perhaps pathlib.Status, that can be converted to and from a stat_result, and that features high-level methods for determining file type, permissions, etc. Arguably this class could live in the stat module.
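To make the idea a little more concrete, here is a minimal sketch of what such a rich status object might look like. The class name Status, its fields and its method names are assumptions made for illustration; nothing like this exists in pathlib, and only the conversion from a stat_result is shown.

```python
import os
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class Status:
    """Hypothetical high-level wrapper around the portable stat fields."""

    mode: int
    size: int

    @classmethod
    def from_stat_result(cls, st: os.stat_result) -> "Status":
        # Copy only fields that exist on every platform.
        return cls(mode=st.st_mode, size=st.st_size)

    def is_dir(self) -> bool:
        return stat.S_ISDIR(self.mode)

    def is_file(self) -> bool:
        return stat.S_ISREG(self.mode)

    def is_symlink(self) -> bool:
        return stat.S_ISLNK(self.mode)

    def permissions(self) -> int:
        # Permission bits only, e.g. 0o644.
        return stat.S_IMODE(self.mode)
```

A non-local implementation (a zip archive member, say) could then build Status(mode=stat.S_IFREG | 0o644, size=...) directly, without needing an OS-specific stat_result.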
Unless we make a serious intervention, open(ZipPath('README.md')) will open a local file in the current directory, potentially causing serious confusion to users. Technically they should be using ZipPath('README.md').open() instead, but I’d rather not lay such a trap in the first place. To fix this, we’d need to set AbstractPath.__fspath__ = None, thus making AbstractPath un-PathLike. This bewildering state of affairs would make os.path.splitext(ZipPath('README.md')) also fail, despite it only doing lexical work on the input. Other purely lexical functions that accept os.PathLike arguments would be similarly unable to accept AbstractPath. This might require us to make a distinction between pure and concrete paths in os.PathLike, os.fspath() and the __fspath__() magic method.
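The trade-off can be demonstrated with two toy classes. NaivePath, OptOutPath and usable_as_local_path() are hypothetical names invented for this sketch; only os.fspath(), os.PathLike and os.path.splitext() are real.

```python
import os


class NaivePath:
    """Hypothetical non-local path that still implements __fspath__()."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __fspath__(self) -> str:
        return self.name


class OptOutPath(NaivePath):
    """Same class, but explicitly opted out of the os.PathLike protocol."""

    __fspath__ = None


def usable_as_local_path(obj) -> bool:
    """Return True if os.fspath() accepts obj, as open() and os.path would."""
    try:
        os.fspath(obj)
    except TypeError:
        return False
    return True
```

With these definitions, a NaivePath is silently accepted anywhere a local path is expected, including os.path.splitext(). An OptOutPath is rejected with a TypeError everywhere, including in purely lexical functions, which is the downside described above.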
Crazy thought: What if a path-like object had an inherent “context”? If it’s None (the default, for backward compatibility), it represents a file system path, and can be treated as such. But a ZipPath could set the context to be the zip file, and you could have a concept of remote paths (maybe in an SCP transfer, where the “context” would be the server connection). The context object would have to be responsible for opening files and any other concrete operations, but purely-lexical functions could ignore (and maintain) it.
It’s a good thought, and something I’ve explored myself, but I found that the differences between the ‘path’ and the ‘context’ classes are too slight, and that users would need to implement both in a range of scenarios, so it gets clunky/boilerplate-y fast. As @Conchylicultor points out, it’s more natural for users to customize __init__() to their liking than to pass in a ‘context’ object.
I might take you up on that actually!
We’re in need of a performance benchmarking suite for pathlib. It’s necessary for evaluating the impact of gh-100481 and the targets in ideas-194. I’d love for it to have:
1. Real-world vectors, e.g. from public source code
2. Benchmarking of PurePath construction, joining, parent, parents, name, suffix, with_name(), with_suffix() and any other super-common operations
3. Benchmarking of Path iterdir(), glob(), walk(), absolute()
4. Comparisons with equivalent os.path/etc code for the above.
5. Benchmarking of representative snippets that combine multiple operations (e.g. read a few files in a ‘config’ directory; check whether .tar and .tar.gz siblings of a directory exist; attempt to determine a file type by checking the file extension and/or first few bytes; etc)
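As a rough illustration of the PurePath and os.path comparison items, here is a small timeit-based sketch. It is not a pyperformance benchmark; the sample segments, the CASES table and the run() helper are all assumptions made for this example.

```python
import os.path
import timeit
from pathlib import PurePath

# Hypothetical sample input; a real suite would draw vectors from public source trees.
SEGMENTS = ("cpython", "Lib", "pathlib.py")
JOINED = os.path.join(*SEGMENTS)
PURE = PurePath(*SEGMENTS)

CASES = {
    "PurePath(...)": lambda: PurePath(*SEGMENTS),
    "os.path.join(...)": lambda: os.path.join(*SEGMENTS),
    "PurePath.with_suffix()": lambda: PURE.with_suffix(".pyi"),
    "os.path.splitext() + concat": lambda: os.path.splitext(JOINED)[0] + ".pyi",
}


def run(number: int = 100_000) -> dict:
    """Return the best-of-three wall time in seconds for each case."""
    return {
        name: min(timeit.repeat(func, number=number, repeat=3))
        for name, func in CASES.items()
    }


if __name__ == "__main__":
    for name, seconds in run().items():
        print(f"{name:32} {seconds:.4f}s")
```

Taking the minimum of several repeats is the usual timeit convention for reducing noise from other processes.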
If you wanted to have a go at any of those I’d greatly appreciate it! But if not, I’ll find some time over the coming weeks.
Yeah, I’d love to help you with that. Let me start with point 2 after the holidays, and then I’ll update you on their status and, if I have time, I’ll move on to points 3 and 4:
- Benchmarking of PurePath construction, joining, parent, parents, name, suffix, with_name(), with_suffix() and any other super-common operations
- Benchmarking of Path iterdir(), glob(), walk(), absolute()
- Comparisons with equivalent os.path/etc code for the above.
One question though: where would we store the benchmarks? In my personal, separate GitHub repo? Or somewhere under the python/ organization? And if so, where?
Perhaps expanding on the existing pyperformance benchmark? That’s used to test the performance of interpreter optimisations in general, and is aiming to have more real-world tests.
Hey,
It’s 2023 now and I don’t have anything concrete to contribute to this discussion, but as the original author of pathlib I would like to congratulate you all (and especially @barneygale) for advancing this despite my inactivity. Happy new year, and keep up the good work!
The easiest way is to update .github/CODEOWNERS in the exact opposite way I am doing it in “Drop myself from pathlib maintenance” (python/cpython pull request #100757, by brettcannon). There is also the python/cpython issues list on GitHub, which anyone can use to quickly see pathlib-related issues.
Some thoughts on pathlib performance:
PurePath objects have two main constructors: _from_parts() and _from_parsed_parts().
_from_parts() is used in the majority of cases, including when you call PurePath('foo', 'bar'). It performs the following (expensive) normalization + parsing routine:
1. Join the arguments together with os.path.join()
2. On Windows, convert forward slashes to backward slashes
3. Partition the path into drive, root, tail segments
4. Split the tail on path separators into ‘parts’
   a. Remove ‘.’ segments and empty segments
   b. Prepend the drive + root, if not empty
5. Create the PurePath object and assign _drv, _root and _parts.
Thus path objects are fully normalized + parsed on construction.
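The routine above can be approximated in a few lines. This is a simplified sketch, not pathlib’s actual code: parse_args() is a hypothetical helper, it takes ntpath or posixpath as its flavour argument, and it ignores edge cases such as the special meaning of exactly two leading slashes on POSIX.

```python
import ntpath
import posixpath


def parse_args(flavour, *args):
    """Return (drive, root, parts) for args, using ntpath or posixpath as flavour."""
    if not args:
        return "", "", []
    # 1. Join the arguments together.
    path = flavour.join(*args)
    # 2. On Windows, convert forward slashes to backslashes.
    if flavour.altsep:
        path = path.replace(flavour.altsep, flavour.sep)
    # 3. Partition into drive, root and tail (simplified: "//" is not special-cased).
    drive, rest = flavour.splitdrive(path)
    root = flavour.sep if rest.startswith(flavour.sep) else ""
    tail = rest.lstrip(flavour.sep)
    # 4. Split the tail, dropping "." and empty segments, then prepend drive + root.
    parts = [part for part in tail.split(flavour.sep) if part and part != "."]
    if drive or root:
        parts.insert(0, drive + root)
    # 5. A real PurePath would now store these as _drv, _root and _parts.
    return drive, root, parts
```

For example, parse_args(posixpath, "foo", "./bar//baz") yields the parts foo, bar and baz, with the "." and empty segments removed.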
_from_parsed_parts() is used in cases where we can skip the above routine and instead directly assign _drv, _root and _parts. These are:
- When iterating or walking directories with iterdir(), glob(), walk(). Under the hood, these use os.listdir() and os.scandir(), which return names that are guaranteed not to contain drives, path separators, etc, so they can be naively appended to _parts.
- When computing parent directories in .parent and .parents. In this case we can safely pop the items off _parts when constructing the new paths. But this isn’t usually in performance-sensitive areas of code.
- To a certain extent, with_name() and with_suffix(), though some parsing + error checking is still performed.
The result is that it’s “cheap” to keep paths fully normalized when walking directories. The following code only runs _from_parts() once:
    import pathlib

    path = pathlib.Path('cpython/Lib')
    for py_path in path.rglob('*.py'):
        print(py_path.name)
You may then ask “when are normalized paths useful?”.
In PurePath, most operations (such as suffix, with_name(), __hash__() and match()) require a fully-normalized path. There are some notable exceptions: joinpath() and __truediv__() could be made to work without even os.path.join()!
In Path, passing an unnormalized path to the OS should be equivalent to a normalized path, otherwise pathlib’s normalization logic is broken! Hence normalization confers no benefit, though we still need to call os.path.join().
Putting these pieces together, we can conclude that pathlib is currently optimized for the following use case: iterate a directory (or walk a directory tree) and perform pure operations on the directory children, e.g. name, with_suffix(), match(), etc. The previous code example demonstrates this.
Question to the audience: how common does that use case seem to you? Is it worth us slowing down some other parts of pathlib by keeping paths fully normalized + parsed?
It is quite common for me, actually. Though I will most likely perform more operations than just the pure ones but I definitely remember writing a few functions that did exactly what you described.
Here’s the first draft of purepath benchmark extension. Honestly, it seems like I am doing something wrong but that’s because I’ve never written pyperformance benchmarks.
If anyone has suggestions on improving my draft, I’d be happy to hear them.
Yeah, it’s a common use case for me too. I’m really on the fence.
The problems I have with this optimization:
- It slows down cases where you’re either not walking directories, or not doing relevant PurePath operations. E.g. (Path('foo') / 'bar').read_text() performs two rounds of normalization, and they’re both pointless. I think this is the reason that pathlib is considered slow. And aside from micro-optimizations and possibly re-implementing it in C, there’s little room to improve.
- It precludes us from sharing state between path objects in user subclasses, because we can’t provide a single method that users can override to customize construction of derivative paths. There are alternatives (e.g. deriving new types on-the-fly) but they feel unnatural to me.
And so your pyperformance benchmarks have arrived at the perfect moment for me to stop theorising and actually measure the impact of implementing a version of PurePath that defers joining, parsing and normalizing until it’s needed.
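For readers wondering what “deferring” could mean in practice, here is a toy sketch. LazyPurePath is a hypothetical, POSIX-only class written for this illustration and is not the implementation being proposed: it keeps the raw segments, joins them only when the OS needs a string, and parses them only when a pure operation asks for parts.

```python
import os
import posixpath
from functools import cached_property


class LazyPurePath:
    """Hypothetical POSIX-only path that stores raw segments and parses on demand."""

    def __init__(self, *segments: str) -> None:
        self._raw = segments  # no joining or parsing happens here

    def __truediv__(self, other: str) -> "LazyPurePath":
        # Joining just extends the tuple of raw segments.
        return type(self)(*self._raw, other)

    def __fspath__(self) -> str:
        # The OS only needs the joined string, not normalized parts.
        return posixpath.join(*self._raw) if self._raw else "."

    @cached_property
    def parts(self) -> tuple:
        # Parsed and normalized the first time a pure operation needs it.
        path = os.fspath(self)
        root = "/" if path.startswith("/") else ""
        names = [name for name in path.split("/") if name and name != "."]
        return tuple([root] + names) if root else tuple(names)

    @property
    def name(self) -> str:
        parts = self.parts
        return parts[-1] if parts and parts[-1] != "/" else ""
```

Under this design, something like open(LazyPurePath('foo') / 'bar') never triggers parsing at all, whereas accessing .name pays the parsing cost once and caches the result.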
Let’s hope that my benchmarks can actually prove useful. If they do not yet – I’m ready to improve them in any way I can.
January 2023 progress report
GH-100351 has landed, which improved the Python implementations of ntpath.splitdrive() and ntpath.normpath(), and brought the latter more in line with the native NT behaviour. Thanks to Steve Dower and Eryk Sun for the reviews!
I’ve opened GH-101002, which adds a new os.path.splitroot() function. The function parses paths into a (drive, root, tail) tuple, using OS-specific rules. By calling it from pathlib we considerably increase performance of WindowsPath construction. It’s also pretty useful in a variety of non-pathlib scenarios. Thank you Alex Waygood and Eryk Sun for helping review this.
If/when that lands, I’ll open two PRs focusing on pathlib performance. The first will tune the performance of path construction - basically a series of micro-optimizations. The second will be more radical: I want to see what sort of performance we can achieve from deferring path joining/parsing/normalization. It’s likely to have an adverse effect on the speed of directory walking, but it should be either performance-enhancing or performance-neutral everywhere else. It’s going to be really interesting to see!
If that lands, we can add a makepath() method without a performance hit. I expect a chorus of angels to accompany whoever hits “merge” on that one. It will have been a long time coming!
I’m following @Ovsyanka’s GH-100282 with excitement, and wondering whether we could implement Path.glob() using Path.walk(), and thereby make it safe from recursion errors on deep trees. It might also allow us to fix a glob() performance problem – I think currently unlogged – that every “**” wildcard in your pattern introduces an extra scandir() call on all visited directories.
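As a rough illustration of why an iterative walk helps, here is a sketch of a recursive glob built on an explicit stack rather than recursive calls. iter_rglob() is a hypothetical helper, it uses os.scandir() directly instead of Path.walk(), and it only matches the final path segment; it is not how pathlib implements glob().

```python
import fnmatch
import os


def iter_rglob(root, pattern):
    """Yield paths under root whose final segment matches pattern, without recursion."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if fnmatch.fnmatchcase(entry.name, pattern):
                    yield entry.path
                # Each directory is scanned exactly once, however deep the tree is.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
```

Because pending directories live in a list rather than on the call stack, tree depth is limited by memory and not by the interpreter’s recursion limit.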
I’m also reviewing @jugmac00’s GH-101223, which adds an explanation of match(), glob() and rglob() patterns beyond “see fnmatch”. Although fnmatch is used, it is called only to match individual path segments, and so the “*” wildcard doesn’t match path separators in pathlib.
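A short comparison shows the difference between whole-string fnmatch matching and pathlib’s per-segment matching. The variable names are just for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

path = "a/b/c.py"

# Plain fnmatch treats the whole string as one unit, so "*" can span "/".
whole_string = fnmatchcase(path, "a/*.py")

# pathlib matches segment by segment, so "*" stays within one path component.
per_segment = PurePosixPath(path).match("a/*.py")
last_segment = PurePosixPath(path).match("*.py")
```

Here whole_string is True, because “*” swallows “b/c”, while per_segment is False and last_segment is True.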
Finally, I’m excited to share that I’m now part of the Python Triage team. Thank you Zachary Ware and Alex Waygood for sponsoring me! Honestly it’s probably going to take a year or two to get the pathlib issues/PRs backlog to a more manageable size. We’ll get there though!
Ciao for now o/
February 2023 progress report
GH-101002 has landed, and so Python 3.12 has gained an os.path.splitroot() function, which can split a path into a tuple of (drive, root, tail). Pathlib uses this function to efficiently parse paths according to OS-specific rules. Thanks again to Alex Waygood and Eryk Sun for their invaluable input, and respect to Antoine Pitrou for identifying the importance of three-part division when he created pathlib.
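For reference, this is roughly how the three-part split behaves on each flavour. The demo() helper and the HAVE_SPLITROOT flag are just scaffolding for this example; the guard is there because splitroot() is only available from Python 3.12.

```python
import ntpath
import posixpath

# splitroot() exists only on Python 3.12+, so guard the lookup.
HAVE_SPLITROOT = hasattr(posixpath, "splitroot")


def demo():
    """Return example (drive, root, tail) tuples, or None on older Pythons."""
    if not HAVE_SPLITROOT:
        return None
    return {
        "posix": posixpath.splitroot("/home/sam"),
        "windows": ntpath.splitroot("C:/Users/Sam"),
        "unc": ntpath.splitroot("//Server/Share/Users/Sam"),
    }
```

On 3.12 and later, the POSIX example has an empty drive and a “/” root, the Windows example has drive “C:”, and the UNC example treats “//Server/Share” as the drive.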
My plan now looks like this:
1. Address GH-101362: Optimize pathlib path construction
2. Address GH-76846: pathlib.Path._from_parsed_parts() should call cls.__new__(cls) and GH-85281: subclasses of pathlib.PurePosixPath never call __init__() or __new__()
   - This will reduce performance of some pathlib operations, notably iterdir(), glob() and walk().
   - I’m hoping to make this performance loss as small as possible through the optimisations in step #1.
3. Address GH-100479: Support for sharing state between pathlib subclasses
4. Add pathlib.AbstractPath
I’m also looking at issues and feature requests related to glob() – the largest category of pathlib issues on GitHub. There are three lines of work that I think will converge:
- Make glob() treat symlinks consistently – see GH-77609 for discussion
- @Ovsyanka’s fast iterative implementation of walk() – PR: GH-100282
- My fast regex-based implementation of match() – PR: GH-101398
With these in place, we can write a fast implementation of glob(), including a really chonky speedup for recursive globs. This should help relieve any lingering pain caused by the main plan (see step 2 above).
Thanks for reading! Bye for now
I’ve been looking around at this because I found it annoying that when I create a pathlib.Path() on Windows I get a WindowsPath, and that its str produces a path with backslashes (\).
The Windows OS recognizes forward slashes (/); it is only a display or UI data-entry issue that requires a backslash!
It would seem to me that if you relied on this fact, it would simplify things (and incidentally inform programmers of the fact). The only Windows requirement would be to accept paths with backslashes and a drive letter. The internal representation should always be POSIX plus an optional drive letter. And I guess a special win-path printing option.
I have not dug into the code, so I apologize if this is in some way misguided.
FYI be very careful about taking the string representation of a pathlib.Path object and using it as an argument to something; you don’t want None to be a valid path. os.fspath() and os.fsdecode() both exist to get the string representation of a path-like object in the proper format. You can also use pathlib.PurePath.as_posix() to get a path with forward slashes.
For pathlib’s Path, PurePath etc., os.fspath ends up calling str(self) on the path object, so the result ends up the same regardless. Not necessarily true for other PathLike objects though.
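The points in the last two posts can be checked with a few lines. The variable names and the fspath_rejects() helper are only for this example; PureWindowsPath is used so that the snippet behaves the same on any OS.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

win = PureWindowsPath("C:/Users/me/notes.txt")

as_str = str(win)          # backslash-separated
as_posix = win.as_posix()  # forward-slash-separated

# For pathlib objects, os.fspath() and str() agree.
same = os.fspath(PurePosixPath("a/b")) == str(PurePosixPath("a/b"))


def fspath_rejects(value) -> bool:
    """Return True if os.fspath() refuses value with a TypeError."""
    try:
        os.fspath(value)
    except TypeError:
        return True
    return False
```

Here fspath_rejects(None) is True, whereas str(None) would quietly produce the string 'None', which is the hazard with using str() on something you merely assume is a path.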