Make pathlib extensible

barneygale · February 19, 2023, 2:43pm

February 2023 progress report

GH-101002 has landed, and so Python 3.12 has gained an os.path.splitroot() function, which can split a path into a tuple of (drive, root, tail). Pathlib uses this function to efficiently parse paths according to OS-specific rules. Thanks again to Alex Waygood and Eryk Sun for their invaluable input, and respect to Antoine Pitrou for identifying the importance of three-part division when he created pathlib.
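
For anyone who hasn't tried it yet, here is a small sketch of the three-part split. The `split_three()` helper is a hypothetical name of mine, not part of the standard library: it calls `splitroot()` where available (Python 3.12+) and otherwise uses a rough fallback built on `splitdrive()`, which only approximates the real function.

```python
import ntpath
import posixpath


def split_three(flavour, path):
    """Split path into (drive, root, tail) for the given os.path flavour.

    Hypothetical helper for illustration only.
    """
    if hasattr(flavour, "splitroot"):  # Python 3.12 and later
        return flavour.splitroot(path)
    # Rough fallback for older Pythons. It ignores special cases such as
    # the POSIX rule for exactly two leading slashes.
    drive, rest = flavour.splitdrive(path)
    root = rest[:1] if rest[:1] in (flavour.sep, flavour.altsep) else ""
    return drive, root, rest[len(root):]


# Windows rules: drive letter, root separator, then the rest.
print(split_three(ntpath, "C:/Users/barney"))
# Windows UNC share: the drive is the whole '//server/share' prefix.
print(split_three(ntpath, "//server/share/file.txt"))
# Drive-relative Windows path: there is a drive but no root.
print(split_three(ntpath, "C:foo"))
# POSIX rules: the drive is always empty.
print(split_three(posixpath, "/home/barney"))
```

The drive-relative case (`C:foo`) is one where a two-part drive/path split isn't enough on its own: you still have to work out separately whether the path has a root.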

My plan now looks like this:

1. Address GH-101362: Optimize pathlib path construction
   - I’ve opened three PRs that make small individual improvements: GH-101664, GH-101665 and GH-101667. I have one more of these on the way.
   - I’ll then open a larger PR that makes pathlib defer parsing/normalization until it’s needed.
2. Address GH-76846: pathlib.Path._from_parsed_parts() should call cls.__new__(cls) and GH-85281: subclasses of pathlib.PurePosixPath never call __init__() or __new__()
   - This will reduce performance of some pathlib operations, notably iterdir(), glob() and walk().
   - I’m hoping to make this performance loss as small as possible through the optimisations in step #1.
3. Address GH-100479: Support for sharing state between pathlib subclasses
4. Add pathlib.AbstractPath
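
To give a flavour of what deferring parsing/normalization could look like, here is a minimal sketch. `LazyPurePath` is a made-up class for illustration, it only handles POSIX-style paths, and it is not how pathlib is (or will necessarily be) implemented.

```python
import posixpath


class LazyPurePath:
    """Hypothetical illustration of deferred parsing; not pathlib's code."""

    def __init__(self, *segments):
        # Construction is cheap: just remember the raw segments.
        self._raw = segments
        self._parsed = None

    def _parse(self):
        # Parse and normalize on first use, then cache the result.
        if self._parsed is None:
            joined = posixpath.join(*self._raw) if self._raw else ""
            root = "/" if joined.startswith("/") else ""
            parts = [p for p in joined.split("/") if p and p != "."]
            self._parsed = (root, parts)
        return self._parsed

    @property
    def parts(self):
        root, parts = self._parse()
        return tuple(([root] if root else []) + parts)

    def __str__(self):
        root, parts = self._parse()
        return (root + "/".join(parts)) or "."


p = LazyPurePath("/usr", "lib/./python3")
print(p._parsed)  # nothing has been parsed yet
print(p.parts)    # first access triggers parsing
print(str(p))     # reuses the cached result
```

The idea is that objects created in bulk (for example by directory iteration) never pay the parsing cost unless something actually asks for their parts.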

I’m also looking at issues and feature requests related to glob() – the largest category of pathlib issues on GitHub. There are three lines of work that I think will converge:

- Make glob() treat symlinks consistently – see GH-77609 for discussion
- @Ovsyanka’s fast iterative implementation of walk() – PR: GH-100282
- My fast regex-based implementation of match() – PR: GH-101398

With these in place, we can write a fast implementation of glob(), including a really chonky speedup for recursive globs. This should help relieve any lingering pain caused by the main plan (see step 2 above).
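
As a rough illustration of the regex-based matching idea, here is a simplified sketch. It is not the code from the PR: `compile_glob()` and `match_right()` are hypothetical names, only `*` and `?` wildcards are handled, and absolute patterns are not treated specially.

```python
import re


def compile_glob(pattern):
    """Translate a '/'-separated glob into a compiled regex (hypothetical)."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append("[^/]*")  # '*' never crosses a path separator
        elif ch == "?":
            out.append("[^/]")   # '?' matches one non-separator character
        else:
            out.append(re.escape(ch))
    return re.compile("".join(out))


def match_right(path, pattern):
    """Match a relative pattern against the rightmost segments of path."""
    n = pattern.count("/") + 1
    segments = path.strip("/").split("/")
    if len(segments) < n:
        return False
    tail = "/".join(segments[-n:])
    # One regex call covers all segments, instead of a per-segment loop.
    return compile_glob(pattern).fullmatch(tail) is not None


print(match_right("a/b/c.py", "*.py"))
print(match_right("a/b/c.py", "b/*.py"))
print(match_right("a/b/c.py", "a/*.py"))
```

In a real implementation the compiled pattern would presumably be cached, so that matching many paths against the same pattern costs one regex call each.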

Thanks for reading! Bye for now