Make pathlib extensible

:sparkles: February 2023 progress report :sparkles:

GH-101002 has landed, and so Python 3.12 has gained an os.path.splitroot() function, which can split a path into a tuple of (drive, root, tail). Pathlib uses this function to efficiently parse paths according to OS-specific rules. Thanks again to Alex Waygood and Eryk Sun for their invaluable input, and respect to Antoine Pitrou for identifying the importance of three-part division when he created pathlib.

My plan now looks like this:

  1. Address GH-101362: Optimize pathlib path construction
    • I’ve opened three PRs that make small individual improvements: GH-101664, GH-101665 and GH-101667. I have one more of these on the way.
    • I’ll then open a larger PR that makes pathlib defer parsing/normalization until it’s needed.
  2. Address GH-76846: pathlib.Path._from_parsed_parts() should call cls.__new__(cls) and GH-85281: subclasses of pathlib.PurePosixPath never call __init__() or __new__()
    • This will reduce performance of some pathlib operations, notably iterdir(), glob() and walk().
    • I’m hoping to make this performance loss as small as possible through the optimisations in step #1.
  3. Address GH-100479: Support for sharing state between pathlib subclasses
  4. Add pathlib.AbstractPath :partying_face:
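
To give a rough idea of what "defer parsing until it's needed" means in step 1, here is a minimal sketch. `LazyPurePath` and its `parse_count` counter are hypothetical names invented for this illustration. This is not pathlib's actual code, and it only handles simple POSIX-style segments.

```python
import posixpath


class LazyPurePath:
    """Hypothetical illustration of deferred parsing (not pathlib itself)."""

    parse_count = 0  # how many times a path has actually been parsed

    def __init__(self, *segments):
        # Construction only stores the raw arguments; no parsing happens here.
        self._raw = segments
        self._parts = None

    def _parse(self):
        # Parse on first use, then reuse the cached result.
        if self._parts is None:
            type(self).parse_count += 1
            joined = posixpath.join(*self._raw) if self._raw else ""
            self._parts = tuple(p for p in joined.split("/") if p and p != ".")
        return self._parts

    @property
    def name(self):
        parts = self._parse()
        return parts[-1] if parts else ""


# Creating many paths is cheap because none of them is parsed yet.
paths = [LazyPurePath("data", f"file{i}.txt") for i in range(1000)]
first_name = paths[0].name  # only this one path gets parsed
```

The trade-off is that construction becomes cheap while the first attribute access pays the parsing cost. That helps operations which create many path objects and only inspect a few of them.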

I’m also looking at issues and feature requests related to glob() – the largest category of pathlib issues on GitHub. There are three lines of work that I think will converge:

  • Make glob() treat symlinks consistently – see GH-77609 for discussion
  • @Ovsyanka’s fast iterative implementation of walk() – PR: GH-100282
  • My fast regex-based implementation of match() – PR: GH-101398

With these in place, we can write a fast implementation of glob(), including a really chonky speedup for recursive globs. This should help relieve any lingering pain caused by the main plan (see step 2 above).
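
As a rough illustration of the regex-based approach, here is a sketch that compiles a whole glob pattern into a single regular expression. `translate_glob` is a hypothetical helper written for this post, not the code in the PR. It only understands `*`, `?` and a whole-segment `**`, and it assumes `/` as the separator.

```python
import re


def translate_glob(pattern: str) -> re.Pattern:
    """Hypothetical helper: compile a '/'-separated glob into one regex."""
    segments = pattern.split("/")
    regex_parts = []
    for i, segment in enumerate(segments):
        last = i == len(segments) - 1
        if segment == "**":
            # '**' stands for zero or more whole directory levels.
            regex_parts.append(".*" if last else "(?:[^/]+/)*")
            continue
        # '*' and '?' never cross a '/'; everything else is literal.
        piece = "".join(
            "[^/]*" if ch == "*" else "[^/]" if ch == "?" else re.escape(ch)
            for ch in segment
        )
        regex_parts.append(piece if last else piece + "/")
    return re.compile("".join(regex_parts))


python_sources = translate_glob("src/**/*.py")
is_match = python_sources.fullmatch("src/pkg/sub/mod.py") is not None
```

Compiling once and testing many candidate paths against the same compiled pattern is where I'd expect the saving to come from, compared with matching segment by segment.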

Thanks for reading! Bye for now
