This PR makes glob.glob()
and glob.iglob()
use several performance tricks developed in pathlib
. In fact it unifies the implementations of globbing across the two modules! The speedup depends on many factors including the pattern itself, but on my machine I see at least a 1.5x improvement for the majority of test cases I try.
I’ve already laid some groundwork with private glob._GlobberBase
and _StringGlobber
classes, which power pathlib’s globbing. The PR adds support for dir_fd
and include_hidden
in those classes, and makes glob()
and iglob()
use them.
I’d be grateful for reviews. Here’s the PR:
python:main
← barneygale:gh-116380
opened 11:13PM - 05 Mar 24 UTC
Speed up `glob.glob()` and `glob.iglob()` by reducing the number of system calls… made.
This unifies the implementations of globbing in the `glob` and `pathlib` modules.
## Depends on
- #117589
- #117732
## Filtered recursive walk
Expanding a recursive `**` segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, `glob.glob("foo/**/*.py", recursive=True)` recursively walks `foo/` with `os.scandir()`, and then filters paths through a regex based on "`**/*.py`, with no further filesystem access needed.
This solves #104269 as a side-effect.
## Tracking path existence
We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:
- Certain special pattern segments (`""`, `"."` and `".."`) leave the flag unchanged
- Literal pattern segments (e.g. `foo/bar`) set the flag to false
- Wildcard pattern segments (e.g. `*/*.py`) set the flag to true (because children are found via `os.scandir()`)
- Recursive pattern segments (e.g. `**`) leave the flag unchanged for the root path, and set it to true for descendants discovered via `os.scandir()`.
If the flag is false at the end, we call `lstat()` on each path to filter out missing paths.
## Minor speed-ups
We:
- Exclude paths that don't match a non-terminal non-recursive wildcard pattern _prior_ to calling `is_dir()`.
- Use a stack rather than recursion to implement recursive wildcards.
- Addresses #89727 for the `glob` module.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from `bytes` (a minor use-case) in `iglob()` rather than supporting `bytes` throughout. This particularly simplifies the code needed to handle relative bytes paths with `dir_fd`.
- Avoid calling `os.path.join()`; instead we keep paths in a normalized form and append trailing slashes when needed.
- Avoid calling `os.path.normcase()`; instead we use case-insensitive regex matching.
## Implementation notes
Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:
1. Support for `dir_fd`
2. Support for `include_hidden`
3. Support for generating paths relative to `root_dir`
## Results
Speedups via `python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)"` from CPython source directory on Linux:
| pattern | speedup |
|------------------------|---------|
| `Lib/*` | 1.48x |
| `Lib/*/` | 1.82x |
| `Lib/*.py` | 1.15x |
| `Lib/**` | 4.98x |
| `Lib/**/` | 1.31x |
| `Lib/**/*` | 1.82x |
| `Lib/**/**` | 14.9x |
| `Lib/**/*/` | 2.25x |
| `Lib/**/*.py` | 1.81x |
| `Lib/**/__init__.py` | 1.08x |
| `Lib/**/*/*.py` | 2.38x |
| `Lib/**/*/__init__.py` | 1.19x |
* Issue: gh-116380
Thanks all!
6 Likes
barneygale
(Barney Gale)
August 26, 2024, 5:29pm
2
Hey, I’d like to re-request code review for the above PR. Thankings!
3 Likes