Ergonomics of new pathlib.Path.scandir()

I think so. Since it’ll sometimes be cached, any exception would be inconsistent, not to mention being raised from unexpected places. Better to just return False consistently, and use EAFP for more specific errors.

4 Likes

I hadn’t been following, but the mention did the trick :slight_smile:

I think the “attribute with query methods” approach will provide more scope for the kind of opportunistic caching that @steve.dower suggests (and I also think is a good approach) than we would get the other way around. Attribute access triggering implicit IO is always a somewhat dubious idea (outside specific use cases like database ORMs where it’s a necessary part of the service they provide).

General ideas:

  • accessing .info would only create the PathInfo object, and populate the fields that can be populated without making any system calls (this may include info from a DirEntry object for paths created that way)
  • the method calls on info objects would differ from their Path counterparts in that they’d only query the filesystem if they didn’t have any cached info (but if they are forced to query the filesystem, they’ll prefer to issue queries that allow multiple cache fields to be populated at once)
  • we should offer a method like .reset_cache() to allow a PathInfo object to be told to start refreshing its filesystem state cache entries

For symlinks, I’d be inclined to offer a second symlink_info attribute, which would always be None for non-symlink paths (making the link-following version have the shorter name is based on pathlib APIs generally defaulting to follow_symlinks=True when they offer the option - Path.walk() is the only exception due to the potential for encountering symlink loops).

I wouldn’t provide a way to opt out of the caching - the non-caching API is to call the Path methods directly instead of their PathInfo counterparts.

Given a caching lookup design along those lines, it would be potentially interesting to revisit my old (long archived) walkdir module as something that could be based on pathlib rather than directly on os.path, since stat caching (or the lack thereof) was the fundamental unsolved problem that meant the stacked filter pipeline performance was abysmally worse than doing the same thing in an imperative loop: Reconsider the filesystem status querying design · Issue #18 · ncoghlan/walkdir · GitHub

3 Likes

Excellent, thanks v much!

This is the only point of difference with the implementation in GH-127730. Do you think we need reset_cache() straight away? I’m still not completely convinced, and I’d prefer to leave it for a later PR if no one minds. I don’t think deferring it will make it harder to add later.

2 Likes

When I wrote that, I actually forgot about your previous point that new_path = Path(old_path) is a sufficient cache reset mechanism. So consider the explicit reset suggestion withdrawn - we can just put the “make a new Path instance to reset the info cache” guidance in the docs.

2 Likes

I suppose we still have naming to talk about! :slight_smile:

“info” feels overly broad to me, and constructions like “compare two informations” are awkward because “information” is usually uncountable.

“status” is a better name IMHO, though it’s close to the existing stat() method. It’s also uncountable in some contexts (but often countable in computing contexts I reckon?)

“cache” might work, given its main benefit is that results are cached?

“entry” is another option, if we imagine that the object represents some underlying filesystem entry about the path.

Thoughts and suggestions very welcome!

metadata is probably most accurate, or system_metadata, though we’re definitely getting too long there.

I like info best of the short names. I don’t want to “compare two informations” anyway - I want to “compare two paths’ info” or “two paths’ metadata”.

1 Like

I think info_cache and meta_cache were also suggested along the way.

The fact “info” isn’t a standardised term feels like a feature here - there’s likely to be a strong element of “we cache this because we learn it as a side benefit of looking up something else may as well store it” in some of the cached values, and some of the things that can be queried might come from somewhere other than the file’s own filesystem metadata (such as the scandir info from the parent directory).

cache on its own feels a bit ambiguous (due to the prevalence of implicit filesystem caches in modern OS designs).

info_cache as the attribute and PathInfoCache as the type feel like they might hit the sweet spot of being sufficiently concise while still being sufficiently descriptive.

2 Likes

I am being indecisive :upside_down_face:. Please could folks vote for their preferred option(s) below?

As a brief recap, this object will have exists(), is_dir(), is_file() and is_symlink() methods initially, with scope to add methods like filesize(), permissions(), xattrs() etc later. Unlike anything else in pathlib.Path, this object caches results.

  • path.info
  • path.system_info
  • path.info_cache
  • path.meta
  • path.system_meta
  • path.meta_cache
  • path.metadata
  • path.system_metadata
  • path.metadata_cache
  • path.status
  • path.status_cache
0 voters

:sunny: pathlib.Path.info has landed! Thanks to the many folks who contributed to the design and helped review.

Earlier in this thread we discussed adding more metadata methods like posix_permissions(), access_time() etc. FWIW, I’m planning to add a few private methods along these lines shortly, in order to facilitate copying metadata between local files in pathlib.Path.copy(). Metadata is currently provided by a Path._copy_reader object, but it’s an ugly appendage that’s no longer needs to exist.

To support preserving the same metadata as shutil.copy2(), I’m planning to add these methods to Path.info:

  • _posix_mode() (st_mode)
  • _access_time_ns() (st_atime_ns)
  • _mod_time_ns() (st_mtime_ns)
  • _bsd_flags() (st_flags)
  • _xattrs() (os.listxattr(), os.getxattr())

Perhaps some of these could form the basis of public methods in future.

9 Likes

Following feedback in another thread, I’m increasingly worried I’ve made a mistake by exposing this as an attribute that caches by default, rather than a method that returns a cached object only when requested. e.g. we could instead have def metadata(self, *, cached=False). Petr tried to warn me but I got hung up on the follow_symlinks details.

It will be pretty flipping awkward, but we could still switch to a metadata() method in 3.15 and soft-deprecate info, if people thought that was desirable.

I still think the method must not include a follow_symlinks arguments for the reasons I gave earlier. It probably shouldn’t perform filesystem access either - that’s a job for Info methods like path.metadata().is_dir().

4 Likes

Not sure if you need more feedback, but when I read the initial proposal I never really understood why it was cached by default. It seemed to be a footgun waiting to happen. Considering you made many well thought out changes to pathlib over these years I kind of assumed that you and other experts knew what you were doing and never spoke out.

But consider me a +1 on the feedback and towards soft-deprecating. Having this cached as default is not what I would expect as default.

As further “arguments” for doing this change. At my day job, the engineers that reach for patlib and friends the most are our DevOps engineers writing scripts and tooling for developer experience. They switch programming languages regularly due to different systems/repositories having different requirements. I would not expect them to assume info is cached.

2 Likes

I’m thinking out loud here, but maybe there should be a whole class CachedPath(Path): instead of trying to add it to Path?

Thank you Daniël, I very much appreciate the feedback.

This feature started as a way to conveniently expose the information from os.DirEntry in pathlib, which I think is why I was focused on caching from the start. If we switched to a metadata() method, that use case would look like:

for child_path in path.iterdir():
    metadata = child_path.metadata(cached=True)  # or perhaps sync=False?
    if metadata.is_symlink():  # calls os.DirEntry.is_symlink() under-the-hood
        ...
    elif metadata.is_dir():
        ...
    else:
        ...

… which is only slightly less convenient (the metadata = line is new).

1 Like

We can’t make Path methods like is_dir() cache because it would break user code (see earlier example). That also constrains subclasses of Path by the Liskov substitution principle. But some other class relationship could work.

Because in some situations we have all the information for free on creation, and so if it starts cached then we don’t have to do an expensive operation to look it up (most notably Windows, but often networked or hybrid storage setups can have the same thing). At it’s simplest, a directory listing may include whether an item is a file or a directory, which means we can return objects that answer is_file and is_dir very quickly. (This is most of the reason behind listdir vs scandir as well.)

Making it uncached by default is status quo - we have no shortage of uncached ways to get this information, so there’s no need to add anything new at all. But then people are ~forced to bypass our features if they need the higher performance at the risk of “maybe this file was replaced by a directory with the same name in the last 100ms”. Not having a way to get an equally efficient pathlib object from scandir is the reason we’re here now.

4 Likes

If this does turn out to be a footgun for users, the other popular name choice instead of .info was .info_cached which would make that clear. We could deprecate .info (a very long time horizon deprecation - it’s mostly about the docs with rationale and the preferred API, no need to break code still supporting 3.14) and direct people to explicitly named properties or a method with a required cached= keyword argument instead - I see .metadata(*, cached: bool) proposed above.

2 Likes

Thanks both.

I’ve logged an issue here: Add `pathlib.Path.metadata()` and soft-deprecate `info` · Issue #140127 · python/cpython · GitHub

I’m really sorry for all this. I should have been more careful.

3 Likes