Currently iterdir materializes the ScandirIterator in a list:
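The pattern being described might look roughly like this (a hedged sketch, not CPython's actual source; `iterdir_materialized` is a hypothetical standalone name used for illustration):

```python
import os
from pathlib import Path

def iterdir_materialized(path):
    # Snapshot the whole directory first: the scandir file descriptor
    # is closed as soon as the with block exits.
    with os.scandir(path) as scandir_it:
        entries = list(scandir_it)
    # Only then are child paths yielded, from the in-memory snapshot.
    for entry in entries:
        yield Path(path, entry.name)
```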
I think it would be valuable to pull elements progressively as allowed by scandir.
Using a generator would be problematic because we want to obtain the iterator eagerly (so OS errors from scandir are raised immediately, rather than on the first iteration).
So I propose to use an iterator. Something like:
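A minimal sketch of such a wrapper (the class name and interface are assumptions, not a concrete API): the constructor calls `os.scandir` eagerly, so OS errors surface at construction time, while entries are pulled one at a time.

```python
import os

class DirIterator:
    """Hypothetical lazy wrapper around os.scandir."""

    def __init__(self, path):
        # os.scandir raises OSError here, not on the first next() call.
        self._scandir_it = os.scandir(path)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._scandir_it)
        except StopIteration:
            self._scandir_it.close()  # release the fd once exhausted
            raise

    # Mirror scandir's own cleanup interface so callers can close early.
    def close(self):
        self._scandir_it.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```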
I don’t know the original motivation for materializing the scandir iterator in a list, but whenever I’ve done something similar the purpose was to create an “atomic snapshot” of the directory.
The problem with directory iterators in general is that directories can mutate during iteration leading to garbage results.
Conceptually yes; however, if it is your own code that is making the modifications, then practically it becomes an atomic operation (there is a race condition, but if you’re the only one in the race you always win):
```python
for entry in list(os.scandir(path)):
    if predicate(entry):
        os.unlink(entry.path)
```
This isn’t much different from the idiom for mutating dictionaries while iterating over them:
```python
for key in list(somedict):
    if predicate(key):
        del somedict[key]
```
Again, I don’t know the actual reason the original author materialized the path list. You could do a git blame and then ask them. I just recognized a pattern that I’ve used before. Even if my use case doesn’t match the original intent, it is very likely that some code (mine, for example) relies on the current implementation.
Also, before accepting the proposal, the Chesterton’s Fence parable suggests trying to find out why the original author made a deliberate choice to materialize the list. Presumably, it wasn’t an accident.
I don’t know the actual reason the original author materialized the path list.
Before using os.scandir, iterdir used os.listdir, which returns a list, so avoiding materialization was probably not a goal of that change.
Also, the ScandirIterator needs to be closed, and there may be some subtleties there. For example, the DirIterator should probably have a close method and __enter__/__exit__ methods so that users can close it easily, just as with scandir.
I don’t get what you’re saying here. What you get from the context manager of os.scandir is an iterator. Any OSError would be produced by entering the context manager already. No need for a separate iterator wrapper class.
In changeset 08f8301b216, Barney Gale wrote os.listdir as a replacement for the “accessor” tooling. According to Antoine Pitrou, its original purpose was to create race-free access to the directory.
Are you sure that code won’t be broken by removing the materialization feature? Will the loss of atomicity not matter to anyone? How about the deferred resource finalization? Perhaps @pitrou can opine before you make this possibly breaking change.
It would only be atomic if scandir used an OS-level API that guaranteed atomicity. Making the list has nothing to do with that guarantee or the lack thereof. Can you show Windows API and Linux API docs claiming that the API is atomic?
This is intentional: it is needed to avoid leaking an open file descriptor. with guarantees that the scandir iterator will be closed and the corresponding file descriptor released. Without this, you would rely on the garbage collector. In non-reference-counting implementations (PyPy), object finalization can be delayed. Open file descriptors are a limited resource, so you can face a shortage of them when thousands of scandir iterators are waiting for garbage collection. Even in reference-counting implementations it is easy to prolong the life of an object unintentionally.
This is why every use of os.scandir() should use with, or try/finally with close() in the finally block.
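For illustration, both forms side by side (`count_entries` is a hypothetical helper, not anything from the stdlib):

```python
import os

# Preferred: the context manager closes the fd promptly, even on error.
def count_entries(path):
    with os.scandir(path) as it:
        return sum(1 for _ in it)

# Equivalent try/finally form, with close() in the finally block.
def count_entries_finally(path):
    it = os.scandir(path)
    try:
        return sum(1 for _ in it)
    finally:
        it.close()
```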
If we had a for statement combined with with, one that guaranteed closing the iterator after leaving the body of the for, this issue would not exist. Even then, we would perhaps need to keep the current code for backward compatibility.
That comment is pretty much obsolete. That said, I agree with your arguments regarding atomicity. If people want a non-materializing version of iterdir (is performance the concern? are there any benchmarks?), then perhaps an optional argument can be added for that.
What you get from the context manager of os.scandir is an iterator. Any OSError would be produced by entering the context manager already. No need for a separate iterator wrapper class.
Generators are evaluated lazily. E.g.:
```python
>>> def gen():
...     print("enter")
...     yield from range(10)
...
>>> g = gen()
>>> list(g)
enter
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
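By contrast, the eager-errors-but-lazy-iteration behavior discussed earlier can be obtained by performing the scandir call before handing back a generator. A sketch (the function name is hypothetical):

```python
import os

def iterdir_eager_errors(path):
    # The scandir call happens here, so OSError (e.g. FileNotFoundError)
    # is raised immediately, not on the first next() of the result.
    scandir_it = os.scandir(path)

    def _gen():
        # The with statement closes the iterator once it is exhausted
        # or the generator is finalized.
        with scandir_it:
            for entry in scandir_it:
                yield entry.name

    return _gen()
```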
Thanks for the insight. This is what I was imagining after looking more closely.
I guess managing the closing of the ScandirIterator cleanly would indeed mean introducing close and with semantics for iterdir. That’s a bit of a breaking change; I understand it was necessary in 3.6 to make scandir fully functional, but for iterdir there’s no such need.
Well, it’s not strictly atomic, because I doubt the list(scandir_iterator) call itself is atomic. But the window for potential inconsistencies is still much smaller than if it were converted to lazy materialization. More annoyingly, it would be a visible behavior change: code that mutates a directory while iterating over it could be broken by a switch to lazy materialization.
Hence my suggestion to make lazy behavior governed by a new option if there are real benefits to it.
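A rough sketch of what such an opt-in could look like, as a standalone function (`lazy` is not a real pathlib parameter, purely an illustration of the idea):

```python
import os
from pathlib import Path

def iterdir(path, *, lazy=False):
    """Hypothetical sketch of an opt-in lazy mode for iterdir."""
    scandir_it = os.scandir(path)  # OS errors raised eagerly in both modes
    if not lazy:
        # Default: keep today's snapshot semantics.
        with scandir_it:
            names = [e.name for e in scandir_it]  # fd closed right here
        return (Path(path, n) for n in names)

    def _lazy():
        # Opt-in: pull entries progressively; fd stays open until exhausted.
        with scandir_it:
            for e in scandir_it:
                yield Path(path, e.name)

    return _lazy()
```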
I doubt the list(scandir_iterator) call itself is atomic
Indeed, not really atomic.
There is also the issue of closing the scandir iterator.
So I withdraw my proposal: iterdir should not be changed.
And other methods can be introduced if needed (I’ve seen a discussion about a pathlib scandir).
Going off topic on the thread, but: this looks to me like a good idea on its own.
Should it be pursued? Maybe it could be achieved with an optional syntactic addition to for
It adds an additional indentation level. If you have several nested loops, the code becomes less readable.
It forces you to introduce a variable for the iterator. I usually use “it” (from “iterator”), but that is less obvious in the case of multiple loops.
It cannot be used in generator expressions. I thought about adding a with clause to generator expressions, but came to the conclusion that in most cases that need this, it is better to write a local generator function.
Another issue is that this does not work if the iterable does not support the context manager protocol. If the code should work for all types of iterables, then you need to write:
```python
with contextlib.closing(iter(iterable)) as iterator:
    for item in iterator:
        ...
```
or
```python
iterator = iter(iterable)
try:
    for item in iterator:
        ...
finally:
    iterator.close()
```
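Both of those fail when the iterator has no close() method (contextlib.closing would raise AttributeError on exit, and the try/finally form on the close() call itself). One way to cover all iterables with a single helper is to close only when close() actually exists; a sketch, with the helper name being an assumption:

```python
import contextlib

@contextlib.contextmanager
def closing_iter(iterable):
    """Close the underlying iterator on exit, but only if it has close()."""
    it = iter(iterable)
    try:
        yield it
    finally:
        close = getattr(it, "close", None)
        if close is not None:
            close()
```

This works both for a plain list (no close(), nothing to do) and for an os.scandir iterator (close() releases the file descriptor).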