Fair enough. And just to be clear, you’re discounting the idea of a PathBase.scandir()
method that returns os.DirEntry
-like objects because, from a user perspective, its only effect is to add a Path.scandir()
method that duplicates os.scandir()
?
Right. The only thing it can do is invert the call order, and/or make life harder for non-FS path implementations. Let it stay for filesystem paths only, and the easiest way to do that is to stick to calling the os
version when needed.
Thanks as ever for your input Steve, that’s very helpful. I think I’ll wait for this PR to land, and then put in another PR that renames PathBase.scandir()
to _scandir()
and removes the docs. I still believe I can motivate the addition of a scandir()
method (see below), but probably only in conjunction with making PathBase
public, which is still a ways away.
I think most non-toy implementations will want a scandir()
-like method to make use of cached directory children info, otherwise they’re leaving performance on the table or setting themselves up for a heap of work re-implementing globbing. If there’s no information available other than child names, a user implementation of scandir()
could return contextlib.nullcontext(self.iterdir())
instead. Perhaps this should be the default implementation of PathBase.scandir()
, with PathBase.iterdir()
remaining abstract for ease of entry.
Can’t we just define our own short-lived DirEntry
with cached attributes, allow you to write an implementation for non filesystem paths and make .path
a Path
instance?
Then we could return that for Path.scandir()
. Sure, it doesn’t cover all usecases, but it doesn’t have to, right?
If the type of os.DirEntry.path
were the only problem, I’d suggest we soft-deprecate dir_entry.path
in favour of os.fspath(dir_entry)
, which to my mind is more explicitly str
/ bytes
. But IIUC, Steve’s argument is that any scandir()
method isn’t compelling enough to add at the moment, even without that particular wart.
I thought the main problem was that os.scandir()
only returns file system paths:
So, if we allow you to overwrite the behaviour in your S3 implementation, that wouldn’t be an issue.
But even so, this might be overkill and simply not worth the effort.
FWIW, I do think there’s a bit of a divide between folks with a formal education in API design vs casual Python users in terms of how some pathlib
functionality is viewed.
Code like this might sends shivers up some folks’ backs (and I’m not saying they’re wrong):
But for a lot of beginner/intermediate Python users, being able to path.rmdir()
rather than os.rmdir(path)
is compelling, and this pattern might be the selling point of the entire module for some folk. Whether scandir()
deserves a spot in the pantheon is another matter, mind.
On a different (but possibly related) note. Are there some serious issues of not being able to inherit from DirEntry
/ initialise explicitly?
E.g. I have my own DirEntry
class defined to have consistent elements returned from some utilities.
Maybe if it was a bit more flexible/inheritable object, dealing with this issue would be a bit easier?
Yeah, I think it’s not worth the effort. Why should we make subclassers implement both iterdir
and scandir
? (FWIW, it’s probably even more efficient to implement an online storage glob with a search query and not directory iteration, so implementing scandir
isn’t always going to help anyway.)
I mean, I much prefer having all the essential functionality on my Path
directly, as it also means I can write algorithms around Path
-like objects of any kind without having to assume that they’re file system paths. But scandir()
doesn’t deserve that spot.
I’ve now reverted the addition of Path.scandir()
:
There’s some discussion here about how we might expose the cached file type in pathlib:
IIUC the current proposal is as follows: we add a Path.info
property, which exposes an object with is_file()
, is_dir()
and is_symlink()
methods. Accessing the property doesn’t perform FS access. The property type is either os.DirEntry
(for paths generated by Path.iterdir()
) or a new pathlib type (other paths). In both cases, stat()
results are cached.
I propose that p = Path(p)
is sufficient for clearing the p.info
cache, and so we don’t need to provide further methods/property deleters/etc.
If we add something like Path.info
, then:
- Users will be able to get at the
os.DirEntry
objects internally generated byPath.iterdir()
, which are otherwise thrown away - Users writing performance-sensitive code can rely on
path.info
for cached file status whether or not the path object was generated byiterdir()
- In the pathlib ABCs, we can replace
PathBase.stat()
withPathBase.info
. The former is too low-level and awkward to implement for most virtual filesystems.
Here’s a rough example:
>>> from pathlib import Path
>>> p = Path.cwd()
>>> p.info
<pathlib._local._FileInfo object at 0x7f6f357fcf00>
>>> p.info.is_dir()
True
>>> q = next(p.iterdir())
>>> q
PosixPath('Lib')
>>> q.info
<DirEntry 'Lib'>
Thoughts?
This sounds amazing!
@barneygale The Path.info
approach sounds good to me. I think it would nicely avoid the conceptual problems that confounded the first attempt that relied solely on DirEntry
.
Thinking about this further, I’m more and more convinced that there should be a get_info()
method, which could have arguments:
follow_symlinks
cached=True
. When you need fresh values,p.get_info(cached=False)
would communicate the intent more clearly thanPath(p).info
.
Consider not using os.DirEntry
directly. A new type could also have st_*
attributes from stat_result
(each only present if available for the filesystem), so you can use the same caching mechanism for the low-level values.
A new class could also be extendable in other ways. Brainstorming:
posix_permissions()
to get the standardized part ofst_mode
link_target()
(a path)get_key()
(forsamefile()
, defaulting to e.g.(FSFileInfo, st_dev, st_ino)
)access_time()
,mod_time()
,change_time()
,birth_time()
returning datetime objects- other implementations could be encouraged to use a project-specific prefix for custom fields (
zip_compression
,tar_offset
,git_oid
,s3_acl
).
Yeah, or at least not specifying that it’s used. I don’t see anything wrong with the object being defined as a protocol rather than a concrete object, and if it’s one that a DirEntry
can fulfill for stdlib needs, well that’s just convenient
I would personally still like to see a model where the cached objects look enough like the Path
instance that we can duck type the entire thing (IOW, a subclass that has pre-cached values, as I described on the issue). I’ve never hit my own case where the difference between the type of a path at the time I created/acquired[1] the Path
and when I tested the value.[2]
For whatever reason, every time I’ve held onto a path for a long time (i.e. between operations, where it’s reasonable for a user to have gone in and modified the files they told me to use) it’s been stored as a string. Maybe that’s just my scenarios, though? Storing a timestamp and invalidating the cache after ??? seconds would still be a win, perf-wise, if that’s a concern.
The advantage is that all users get performance improvements at very little risk, and users only have to learn a new concept if they’re doing the less common operation. I don’t like Python being known as a language where you have to learn lots of tricks to make it faster - it should just be faster, and the tricks are there to resolve complicated issues or mismatched expectations.
I wholly agree here, and I recall Guido previously saying something similar about adding new APIs vs fixing what’s already there.
But IMO making Path.is_dir()
and friends return cached results is a bridge too far. Users that rely on those methods retrieving up-to-date results would find their scripts break in hard-to-understand ways, and I can’t see a way to cushion the change with some kind of warning. Even if we restrict it to objects generated from iterdir()
, the significant behaviour change remains.
We could maybe do it by introducing a new superclass (not subclass) of Path
that’s allowed to cache results. That way we don’t break any Path
guarantees. If there’s a route toward doing this then I’m open-minded, but right now it’s not clear to me.
Maybe a keyword option for iterdir
that enables caching on generated Path
objects? That would at least make this an explicit opt in with very little hassle.
Even that is problematic:
# 3rd party library code
def delete(path):
assert isinstance(path, pathlib.Path)
os.unlink(path)
assert not path.exists() # should be up-to-date
# application code
for p in Path().iterdir(cache_results=True):
delete(p)
Suddenly delete()
needs to deal with stale results, with no opt-out.
It always needed to deal with them. Unfortunately in this case, it goes from an occasional issue where the file is locked, isn’t a real file, or gets recreated quickly[1] into one that always fails the assertion, but the possibility was already there. LBYL simply can’t be relied on for filesystem stuff.
The most obvious fix (for a caching change; not for the existing TOCTOU issue) would be to make it assert not os.path.exists(path)
, since they’ve already chosen to use os
methods. Or path.unlink()
which could, conceivably, update/invalidate its cache.
It makes the transition more complicated, but I’m not massively upset by mixing libraries leading to non-obvious behaviour.
In relative terms. For a mounted network file system, there could be plenty of real time. ↩︎
I’m gathering some historical background on caching/not caching.
Antoine’s initial version of PEP 428 posted to python-ideas included stat caching:
Guido warns about caching:
Alyssa explains why we might want different behavior in os.DirEntry
and pathlib.Path
:
Paul Moore laments the lack of caching in pathlib.Path
:
In-depth discussion of pathlib caching behaviour kicked off by Ram:
Guido works on adding a stat cache based on Ram’s suggestion:
Barney Gale’s suggestion is currently the design I’m most comfortable with: It allows caching in the new .info
property, but doesn’t change any existing Path
behaviour, while at the same time not being an overly convoluted change.
The documentation for the info property could also prominently include that the information is potentially cached and that Path.is_dir()
should be used instead of Path.info.is_dir()
(or similar) if up-to-date info is required.