Is there a `pathlib` equivalent of `os.scandir()`?

bittner · February 22, 2024, 10:37am

I stumbled across os.scandir just recently while refactoring a code base from using os.path to pathlib.

After doing a bit of research I’m still not sure whether an equivalent of it exists in the pathlib module. My question could, as an alternative, be phrased, “Is there a solution with pathlib that is as fast as os.scandir()?”

There is no mention of os.scandir() in the table of correspondence in pathlib’s documentation, and Path.walk() is listed as being equivalent to os.walk(), hence traditionally slow. The only mention of scandir is in the Path.walk section, stating:

By default, errors from os.scandir() are ignored.

Taking a peek at pathlib’s implementation confirms that os.scandir() is used by Path.walk() under the hood.

Does that mean that Path.walk is as efficient as os.scandir, and one can simply disregard scandir and use pathlib’s walk instead?

kknechtel · February 22, 2024, 1:35pm

My recommendation is to start by trying it and seeing if it performs well enough for your purposes, or if you can discern a performance difference implementing your overall code each way.

JamesParrott · February 22, 2024, 3:13pm

I can’t say how the performance compares. But I would say Path.iterdir is the canonical pathlib equivalent to os.scandir: pathlib — Object-oriented filesystem paths — Python 3.12.2 documentation

If you don’t need literally everything, only results matching a known pattern, e.g. for the file names or extensions, globbing is very handy too: pathlib — Object-oriented filesystem paths — Python 3.12.2 documentation
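To make the two options concrete, here is a minimal sketch of both. The helper names and the file names in the throwaway temp directory are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_children(directory: Path) -> list[str]:
    """Names of every direct child, via Path.iterdir()."""
    return sorted(child.name for child in directory.iterdir())


def list_matching(directory: Path, pattern: str) -> list[str]:
    """Names of the direct children matching a glob pattern, via Path.glob()."""
    return sorted(child.name for child in directory.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical file names, only here to have something to list.
    for name in ("a.txt", "b.txt", "c.log"):
        (root / name).touch()

    all_names = list_children(root)
    txt_names = list_matching(root, "*.txt")
```

Both calls yield Path objects in arbitrary order, so the sketch sorts the names before comparing them.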

bittner · February 22, 2024, 6:33pm

Thanks for mentioning Path.iterdir. When I look at the implementation it turns out that it’s a generator of calls to os.listdir. Hence, it must be just as slow as its os-equivalent.

Note that the entire argument is about os.scandir being performant, while os.listdir, os.walk, and likely Path.walk are not. I’d like to avoid that I, and other people, incur a speed penalty only because we want to be purists and blindly switch to what pathlib offers.

For everyone’s convenience, the motivation and history of os.scandir is well explained at, e.g.

MegaIng · February 22, 2024, 6:47pm

Note that the performance benefit of os.scandir only exists depending on how you are using the resulting entries. This means that it is a bit pointless to ask for an alternative in isolation. The question is “what do you want to do with the children you find”?
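A rough sketch of what “depends on how you use the entries” can mean in practice. The function names are hypothetical. The first two functions need the file type of each child; the third only needs names, so DirEntry has nothing extra to offer there.

```python
import os
import tempfile
from pathlib import Path


def count_files_scandir(directory):
    """Count regular files with DirEntry.is_file().

    In most cases this needs no extra system call, because the file type
    comes along with the directory listing itself.
    """
    with os.scandir(directory) as entries:
        return sum(1 for entry in entries if entry.is_file())


def count_files_pathlib(directory):
    """Count regular files with Path.is_file(), which stats every child."""
    return sum(1 for child in Path(directory).iterdir() if child.is_file())


def names_only(directory):
    """Only the names are needed here, so plain os.listdir() is enough."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        Path(tmp, f"file{i}.txt").touch()
    Path(tmp, "subdir").mkdir()

    n_scandir = count_files_scandir(tmp)
    n_pathlib = count_files_pathlib(tmp)
    names = names_only(tmp)
```

Both counting functions return the same number. Whether the difference in system calls matters is something to measure for the directory sizes and file systems you actually care about.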

bittner · February 22, 2024, 7:03pm

That’s an excellent point, and I agree.

According to my understanding os.scandir is what you want when you want all the directory information now. That’s why Guido refused to let it return Path instances and have them hold cached file system information. If you expect to inspect Path objects later you’re better off using an os.scandir alternative.

If that’s the entire wisdom I’d like to have that captured in the Python pathlib docs. A hint similar to the “See also” box in the os.listdir documentation might do it. Any opinions on that?

JamesParrott · February 22, 2024, 7:24pm

Fair enough - good point. I just thought it and glob were both conspicuous by their absence in your summary.

I naively assumed that, given its name, Path.iterdir returns a pure iterator. Would a simple tweak to use os.scandir instead gain much in terms of performance for pathlib?

bittner · February 22, 2024, 9:20pm

That’s a good thought experiment! – I would guess, it certainly would. However, os.scandir would never return Path objects but os.DirEntry objects instead, which have file system information cached as opposed to Path objects. We gain nothing compared to using os.scandir directly.

In addition, the discussion about making os.scandir use pathlib’s Path interface has already taken place, as documented by the author in PEP 471. So, scandir will probably never be integrated with pathlib.

JamesParrott · February 23, 2024, 11:20am

Sure. I’m not suggesting changing os.scandir (or any breaking changes at all). The required helper methods are even already there (originally added for Path.walk). I haven’t tested this (it’s so straightforward, I suspect Barney and Brett et al considered it already and ruled it out), but the diff could be as simple as changing Path.iterdir to:

return (self._make_child_direntry(entry) for entry in self._scandir())

from

return (self._make_child_relpath(name) for name in os.listdir(self))

github.com

python/cpython/blob/e74cd0f9101d06045464ac3173ab73e0b78d175e/Lib/pathlib/__init__.py#L579

          def write_text(self, data, encoding=None, errors=None, newline=None):
              """
              Open the file in text mode, write to it, and close the file.
              """
              # Call io.text_encoding() here to ensure any warning is raised at an
              # appropriate stack level.
              encoding = io.text_encoding(encoding)
              return _abc.PathBase.write_text(self, data, encoding, errors, newline)
          
          def iterdir(self):
              """Yield path objects of the directory contents.
          
              The children are yielded in arbitrary order, and the
              special entries '.' and '..' are not included.
              """
              return (self._make_child_relpath(name) for name in os.listdir(self))
          
          def _scandir(self):
              return os.scandir(self)
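
For anyone who wants to try the idea without touching pathlib’s private helpers, here is a rough standalone sketch that uses only public APIs. The function name iterdir_via_scandir is made up, and this is not the proposed patch itself: it builds each child with the `/` operator rather than with _make_child_direntry.

```python
import os
import tempfile
from pathlib import Path


def iterdir_via_scandir(directory):
    """Yield Path objects for the children of `directory`, using os.scandir().

    Only entry.name is used; everything else the DirEntry knows is discarded.
    """
    directory = Path(directory)
    with os.scandir(directory) as entries:
        for entry in entries:
            yield directory / entry.name


with tempfile.TemporaryDirectory() as tmp:
    for name in ("one", "two", "three"):
        Path(tmp, name).touch()

    via_scandir = sorted(iterdir_via_scandir(tmp))
    via_iterdir = sorted(Path(tmp).iterdir())
```

The two listings are equal, so any difference between the approaches is purely one of speed and memory, which would have to be measured.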

MegaIng · February 23, 2024, 1:23pm

And does that have any performance benefit? We are throwing away the extra information contained in the DirEntry objects, so what do we gain? I would imagine that without the extra information being used os.listdir is faster than os.scandir. But I would suggest you measure that.

Changing Path objects to contain this extra information is probably not a good idea, because people don’t expect Path objects to “expire” like DirEntry will.

JamesParrott · February 23, 2024, 2:22pm

And does that have any performance benefit?

Indeed. I’m assuming os.scandir is a true iterator without a cache. The main advantage is making Path.iterdir return a cache-less iterator, instead of a trivial iterator based on a hidden cache from os.listdir.

If so, then it doesn’t store the whole directory contents in memory until needed, so I imagine it would be more efficient to iterate over. Certainly on directories containing large enough numbers of items. But I agree, this remains to be shown. This is currently for discussion only. I suggest that before serious work is started, we wait to hear from those who worked on pathlib, to make sure revamping Path.iterdir wasn’t already considered and ruled out when Path.walk was last worked on. Maybe they just thought users who want more performance can simply use Path.walk over Path.iterdir (and so Peter is right).

We are throwing away the extra information contained in the DirEntry objects, so what do we gain?

The ability to iterate over a directory’s contents only when needed, without storing them all in memory first.
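
A small sketch of the Python-level difference being described, with a hypothetical helper name. It only shows that os.listdir() hands back a complete list while os.scandir() hands back an iterator that can be abandoned early; it says nothing about what the OS or C library buffers internally.

```python
import itertools
import os
import tempfile
from pathlib import Path


def first_n_names(directory, n):
    """Return up to `n` entry names, stopping the iteration early."""
    with os.scandir(directory) as entries:
        return [entry.name for entry in itertools.islice(entries, n)]


with tempfile.TemporaryDirectory() as tmp:
    for i in range(100):
        Path(tmp, f"f{i}").touch()

    # os.listdir() builds the whole list of names before returning.
    listing = os.listdir(tmp)

    # os.scandir() returns an iterator object, not a list.
    with os.scandir(tmp) as entries:
        scandir_is_list = isinstance(entries, list)

    first_three = first_n_names(tmp, 3)
```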

Changing Path objects to contain this extra information is probably not a good idea, because people don’t expect Path objects to “expire” like DirEntry will.

I’m sorry, I know Guido intervened on this, but I haven’t really grasped why expiration of DirEntry objects is important. To create the yielded Path instances from DirEntry objects, the suggested new Path.iterdir does nothing that Path.walk does not do already (it reuses Path.walk’s helper methods).

Creating a Path doesn’t add any sort of lock at the OS level does it? So if something else outside the Python process deletes a file, the value returned by Path(files_path).is_file() will change. So don’t users of Path instances handle expiration already?

MegaIng · February 23, 2024, 2:26pm

Yes, but DirEntry.is_file() doesn’t change (at least not necessarily, system dependent), at least that is my reading of the documentation. That is why this class still exists. With “expire” I mean that the object will tell you wrong things, not that race conditions are impossible.
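
Here is a small sketch of that “expiry”, assuming a throwaway temp directory with a made-up file name. It calls entry.is_file() once before deleting the file, so the result is cached on the DirEntry, and then compares the answers afterwards.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "example.txt")
    target.touch()

    # Grab the single DirEntry for the file we just created.
    with os.scandir(tmp) as entries:
        entry = next(entries)

    seen_before = entry.is_file()   # result is now cached on the DirEntry

    target.unlink()                 # the file disappears from disk

    direntry_after = entry.is_file()  # answered from the cache: stale
    path_after = target.is_file()     # fresh stat call: file is gone
```

After the deletion the DirEntry still claims the file exists, while the Path object reports the current state of the file system.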

JamesParrott · February 23, 2024, 3:07pm

Thanks. I see.

So was the reason for making os.scandir not return Paths that there is legacy code out there which already checks, or allows for, DirEntry possibly returning the wrong thing (from an out-of-date cached value)? Such code would become less efficient if DirEntry were replaced by Path, as the same checks (sys calls?) would be done twice, even though a Path returns the correct value.

MegaIng · February 23, 2024, 3:14pm

Not legacy code. Code that cares about speed. Note that race conditions aren’t impossible with Path.is_file(). The point is that Path.is_file() will be slow in contrast to DirEntry.is_file(). And if you can reasonably expect that the file system doesn’t change at the point where you actually look at the DirEntry objects, then there is little danger. But in general there is the expectation that Path objects can be stored for a long time and their methods will still return correct values, whereas DirEntry objects should be consumed soon after generation. They serve different purposes.

I don’t think pathlib needs a replacement for os.scandir. If you need the speed gains from os.scandir, just call it.
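
One way to “just call it” while still handing Path objects to the rest of the code: consume the DirEntry objects immediately and convert only the results you keep. This is a sketch with a hypothetical function name and made-up file names.

```python
import os
import tempfile
from pathlib import Path


def largest_files(directory, limit=3):
    """Return Paths of the `limit` largest regular files directly inside `directory`.

    The DirEntry objects are used right away and then dropped, so nothing
    cached outlives this function.
    """
    with os.scandir(directory) as entries:
        # entry.stat() needs one system call per entry on Unix; on Windows the
        # data normally comes with the directory listing.
        sized = [(entry.stat().st_size, entry.path)
                 for entry in entries if entry.is_file()]
    sized.sort(reverse=True)
    return [Path(path) for _size, path in sized[:limit]]


with tempfile.TemporaryDirectory() as tmp:
    for name, size in (("a", 10), ("b", 200), ("c", 3000), ("d", 5)):
        Path(tmp, name).write_bytes(b"x" * size)

    top_two = [p.name for p in largest_files(tmp, limit=2)]
```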

JamesParrott · February 23, 2024, 3:38pm

It sounds like os.scandir is also creating a cache, just not in Python. The suggestion would just also avoid the sys calls of os.listdir in that case, not create a true iterator.

It should be checked of course, but from a brief read of the code, the Path objects returned by the suggested Path.iterdir, are not caching values from DirEntry. _make_child_direntry only uses DirEntry.name and DirEntry.path. If the user wants to know if Path.is_file(), that method will still make a sys call when called itself.

I don’t think pathlib needs a replacement for os.scandir.

Me neither. I don’t even need this myself! I just wondered if this might be a quick easy win, that could help someone else.

CAM-Gerlach · February 24, 2024, 7:09pm

James Parrott:

I haven’t tested this (it’s so straightforward, I suspect Barney and Brett et al considered it already and ruled it out), but the diff could be as simple as changing Path.iterdir to:
return (self._make_child_direntry(entry) for entry in self._scandir())
from

return (self._make_child_relpath(name) for name in os.listdir(self))

Let’s see what our friendly neighborhood pathlib superhero himself has to say, shall we — @barneygale ?

barneygale · February 24, 2024, 7:27pm

I think I did try that at one point, but I can’t remember why I didn’t pursue it. Maybe the _make_child_direntry(entry) method didn’t exist at that point, and calling _make_child_relpath(entry.name) wasn’t any faster. If it provides a performance improvement feel free to open an issue and a PR and I’ll review!

JamesParrott · February 24, 2024, 11:01pm

Thanks Barney - I really appreciate that

Unfortunately, @CAM-Gerlach, @MegaIng I’ve taken you all on a complete wild goose chase. I apologise. In summary, my suggestion could well make the performance anywhere between slightly and significantly worse (except on macOS, on which the performance is even worse regardless of my suggestion).

Cornelius - you were absolutely right, this needed to be measured. I have done so.
Ubuntu 22.04:

Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 2.142239902
Time using ScanDirPath.iterdir: 2.313271893999996
Time using os.listdir: 0.00019698799999900984
Time using os.scandir: 0.0001934010000042008

Windows Server 2022:

Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 1.841156299999966
Time using ScanDirPath.iterdir: 2.198301700000002
Time using os.listdir: 0.000457399999959307
Time using os.scandir: 0.00043940000000475266

macOS 13:

Testing reps=20 of listing a directory of: 50000 files
Time using Path.iterdir: 4.364073147000454
Time using ScanDirPath.iterdir: 4.452478466000684
Time using os.listdir: 0.00039130699951783754
Time using os.scandir: 0.00034591399980854476

There’s still the very real possibility I’ve done something silly, in particular something that means these tests are unfair. My choice of 50,000 files and 20 repetitions is influenced as much by my patience in waiting for tests to finish, as by my idea of a realistic usage scenario, in which any difference could be important. Be ever wary of isolated benchmarks, etc.
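
For anyone who wants to repeat this kind of measurement, here is a minimal benchmark sketch. It is not the script that produced the numbers above (that script is not shown in this thread), and the function names and default sizes are assumptions. Each candidate fully consumes its listing, so lazily created iterators are not credited for work they never did.

```python
import os
import tempfile
import timeit
from pathlib import Path


def consume_iterdir(directory):
    """Walk every Path yielded by Path.iterdir()."""
    return sum(1 for _ in Path(directory).iterdir())


def consume_listdir(directory):
    """os.listdir() already returns the complete list of names."""
    return len(os.listdir(directory))


def consume_scandir(directory):
    """Walk every DirEntry yielded by os.scandir()."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def run_benchmark(n_files=1000, reps=5):
    """Time each candidate on a temp directory of `n_files` empty files."""
    candidates = {
        "Path.iterdir": consume_iterdir,
        "os.listdir": consume_listdir,
        "os.scandir": consume_scandir,
    }
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(n_files):
            Path(tmp, f"file_{i}").touch()
        return {
            label: timeit.timeit(lambda func=func: func(tmp), number=reps)
            for label, func in candidates.items()
        }


results = run_benchmark()
for label, seconds in results.items():
    print(f"Time using {label}: {seconds:.6f}")
```

The absolute numbers will depend heavily on the OS, file system and cache state, so treat them as relative indications at best.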

If these tests are not flawed, then you were completely right too Cornelius about using the os module, where performance is needed.

If nothing else, I believe I now have a definitive answer to Peter’s original (rephrased) question:

“Is there a solution with pathlib that is as fast as os.scandir()?”

No. Not even one as fast as os.listdir.

Pathlib is superb, but its primary benefit is code readability (and writeability). Not raw performance on unmanageably extreme numbers of files.

MegaIng · February 24, 2024, 11:20pm

I am impressed that scandir is consistently faster than listdir. Doesn’t the former have to do strictly more work than the latter? Specifically, it has to allocate and construct all the DirEntry objects. There might be improvements to be made in the os.listdir implementation (or your testing is flawed for some non-obvious reason).

JamesParrott · February 24, 2024, 11:31pm

scandir is only faster for directories containing a certain number of files. For smaller numbers, listdir is faster. My understanding was yes it does do more work, but only more work within Python. Whereas listdir makes more sys calls. Or is the latter no longer the case?

Anyway, this was the result on my laptop in Python 3.12:

C:\...\py Path_iterdir_scandir_test.py 10000 100 
Time using Path.iterdir: 4.629436599996552
This test relies on implementation details of Python 3.13's pathlib, unavailable in earlier Pythons.
Time using os.listdir: 0.013278999998874497
Time using os.scandir: 0.015009400001872564

Even so, I was surprised the difference was so negligible for a single directory, after the anecdotes in the blog (but as that mentions, the difference is noticeable for other network file systems). I wondered when scandir was faster too, so tested a directory of a million files. The highest improvement from scandir over listdir was on Windows:

Testing reps=5 of listing a directory of: 1000000 files
Time using Path.iterdir: 9.390752900000052
Time using ScanDirPath.iterdir: 11.222257000000127
Time using os.listdir: 0.0002230000000054133
Time using os.scandir: 0.00013970000009067007