Ergonomics of new pathlib.Path.scandir()

barneygale · November 21, 2024, 9:23pm

Fair enough. And just to be clear, you’re discounting the idea of a PathBase.scandir() method that returns os.DirEntry-like objects because, from a user perspective, its only effect is to add a Path.scandir() method that duplicates os.scandir()?

steve.dower · November 21, 2024, 9:26pm

Right. The only thing it can do is invert the call order, and/or make life harder for non-FS path implementations. Let it stay for filesystem paths only, and the easiest way to do that is to stick to calling the os version when needed.

barneygale · November 21, 2024, 9:38pm

Thanks as ever for your input Steve, that’s very helpful. I think I’ll wait for this PR to land, and then put in another PR that renames PathBase.scandir() to _scandir() and removes the docs. I still believe I can motivate the addition of a scandir() method (see below), but probably only in conjunction with making PathBase public, which is still a ways away.

I think most non-toy implementations will want a scandir()-like method to make use of cached directory children info, otherwise they’re leaving performance on the table or setting themselves up for a heap of work re-implementing globbing. If there’s no information available other than child names, a user implementation of scandir() could return contextlib.nullcontext(self.iterdir()) instead. Perhaps this should be the default implementation of PathBase.scandir(), with PathBase.iterdir() remaining abstract for ease of entry.

Nineteendo · November 21, 2024, 9:49pm

Can’t we just define our own short-lived DirEntry with cached attributes, allow you to write an implementation for non filesystem paths and make .path a Path instance?

Then we could return that for Path.scandir(). Sure, it doesn’t cover all usecases, but it doesn’t have to, right?

barneygale · November 21, 2024, 9:59pm

If the type of os.DirEntry.path were the only problem, I’d suggest we soft-deprecate dir_entry.path in favour of os.fspath(dir_entry), which to my mind is more explicitly str / bytes. But IIUC, Steve’s argument is that any scandir() method isn’t compelling enough to add at the moment, even without that particular wart.

Nineteendo · November 21, 2024, 10:04pm

I thought the main problem was that os.scandir() only returns file system paths:

So, if we allow you to overwrite the behaviour in your S3 implementation, that wouldn’t be an issue.

But even so, this might be overkill and simply not worth the effort.

barneygale · November 21, 2024, 10:12pm

FWIW, I do think there’s a bit of a divide between folks with a formal education in API design vs casual Python users in terms of how some pathlib functionality is viewed.

Code like this might sends shivers up some folks’ backs (and I’m not saying they’re wrong):

github.com

python/cpython/blob/09c240f20c47db126ad7e162df41e5c2596962d4/Lib/pathlib/_local.py#L827-L831


      
          def rmdir(self):
              """
              Remove this directory.  The directory must be empty.
              """
              os.rmdir(self)

But for a lot of beginner/intermediate Python users, being able to path.rmdir() rather than os.rmdir(path) is compelling, and this pattern might be the selling point of the entire module for some folk. Whether scandir() deserves a spot in the pantheon is another matter, mind.

dg-pb · November 21, 2024, 10:28pm

On a different (but possibly related) note. Are there some serious issues of not being able to inherit from DirEntry / initialise explicitly?

E.g. I have my own DirEntry class defined to have consistent elements returned from some utilities.

Maybe if it was a bit more flexible/inheritable object, dealing with this issue would be a bit easier?

steve.dower · November 22, 2024, 12:05am

Yeah, I think it’s not worth the effort. Why should we make subclassers implement both iterdir and scandir? (FWIW, it’s probably even more efficient to implement an online storage glob with a search query and not directory iteration, so implementing scandir isn’t always going to help anyway.)

I mean, I much prefer having all the essential functionality on my Path directly, as it also means I can write algorithms around Path-like objects of any kind without having to assume that they’re file system paths. But scandir() doesn’t deserve that spot.

barneygale · December 6, 2024, 10:42pm

I’ve now reverted the addition of Path.scandir():

github.com/python/cpython

GH-125413: Revert addition of `pathlib.Path.scandir()` method

python:main ← barneygale:scandir-revert

opened 12:23AM - 29 Nov 24 UTC

barneygale

+22 -85

Remove documentation for `pathlib.Path.scandir()`, and rename the method to `_sc…andir()`. In the private pathlib ABCs, make `iterdir()` abstract and call it from `_scandir()`. It's not worthwhile to add this method at the moment - see discussion: https://discuss.python.org/t/ergonomics-of-new-pathlib-path-scandir/71721 * Issue: gh-125413 ---- 📚 Documentation preview 📚: https://cpython-previews--127377.org.readthedocs.build/

There’s some discussion here about how we might expose the cached file type in pathlib:

github.com/python/cpython

GH-127456: pathlib ABCs: add protocols for supporting types

python:main ← barneygale:gh-127456

opened 11:55PM - 01 Dec 24 UTC

barneygale

+79 -112

Change the default value of `PurePathBase.parser` from `ParserBase()` to `posixp…ath`. As a result, user subclasses of `PurePathBase` and `PathBase` use POSIX path syntax by default, which is very often desirable. Move `pathlib._abc.ParserBase` to `pathlib._types.Parser`, and convert it to a runtime-checkable protocol. Add `pathlib._types.DirEntry`, which is a protocol for directory entries. Objects of this type provide a subset of the `os.DirEntry` API, specifically those methods and attributes needed to implement `glob()` and `walk()`. Add `pathlib._types.StatResult`, which is a protocol for stat() results. Objects of this type provide a small subset of the `os.stat_result` API, specifically attributes for the file type, permissions and location/offset. No user-facing changes as the pathlib ABCs are still private. * Issue: gh-127456

IIUC the current proposal is as follows: we add a Path.info property, which exposes an object with is_file(), is_dir() and is_symlink() methods. Accessing the property doesn’t perform FS access. The property type is either os.DirEntry (for paths generated by Path.iterdir()) or a new pathlib type (other paths). In both cases, stat() results are cached.

I propose that p = Path(p) is sufficient for clearing the p.info cache, and so we don’t need to provide further methods/property deleters/etc.

If we add something like Path.info, then:

Users will be able to get at the os.DirEntry objects internally generated by Path.iterdir(), which are otherwise thrown away
Users writing performance-sensitive code can rely on path.info for cached file status whether or not the path object was generated by iterdir()
In the pathlib ABCs, we can replace PathBase.stat() with PathBase.info. The former is too low-level and awkward to implement for most virtual filesystems.

Here’s a rough example:

>>> from pathlib import Path
>>> p = Path.cwd()
>>> p.info
<pathlib._local._FileInfo object at 0x7f6f357fcf00>
>>> p.info.is_dir()
True
>>> q = next(p.iterdir())
>>> q
PosixPath('Lib')
>>> q.info
<DirEntry 'Lib'>

Thoughts?

Ferdi265 · December 6, 2024, 11:47pm

This sounds amazing!

ncoghlan · December 8, 2024, 6:01am

@barneygale The Path.info approach sounds good to me. I think it would nicely avoid the conceptual problems that confounded the first attempt that relied solely on DirEntry.

encukou · December 9, 2024, 9:22am

Thinking about this further, I’m more and more convinced that there should be a get_info() method, which could have arguments:

follow_symlinks
cached=True. When you need fresh values, p.get_info(cached=False) would communicate the intent more clearly than Path(p).info.

Consider not using os.DirEntry directly. A new type could also have st_* attributes from stat_result (each only present if available for the filesystem), so you can use the same caching mechanism for the low-level values.
A new class could also be extendable in other ways. Brainstorming:

posix_permissions() to get the standardized part of st_mode
link_target() (a path)
get_key() (for samefile(), defaulting to e.g. (FSFileInfo, st_dev, st_ino))
access_time(), mod_time(), change_time(), birth_time() returning datetime objects
other implementations could be encouraged to use a project-specific prefix for custom fields (zip_compression, tar_offset, git_oid, s3_acl).

steve.dower · December 9, 2024, 11:21am

Yeah, or at least not specifying that it’s used. I don’t see anything wrong with the object being defined as a protocol rather than a concrete object, and if it’s one that a DirEntry can fulfill for stdlib needs, well that’s just convenient

I would personally still like to see a model where the cached objects look enough like the Path instance that we can duck type the entire thing (IOW, a subclass that has pre-cached values, as I described on the issue). I’ve never hit my own case where the difference between the type of a path at the time I created/acquired^[1] the Path and when I tested the value.^[2]

For whatever reason, every time I’ve held onto a path for a long time (i.e. between operations, where it’s reasonable for a user to have gone in and modified the files they told me to use) it’s been stored as a string. Maybe that’s just my scenarios, though? Storing a timestamp and invalidating the cache after ??? seconds would still be a win, perf-wise, if that’s a concern.

The advantage is that all users get performance improvements at very little risk, and users only have to learn a new concept if they’re doing the less common operation. I don’t like Python being known as a language where you have to learn lots of tricks to make it faster - it should just be faster, and the tricks are there to resolve complicated issues or mismatched expectations.

I expect we’d only cache values when they’re known by the constructor. So Path(x) does not cache, but p.glob("*") on a system with readdir/FindNextFile would cache whatever it knows. ↩︎
There’s still a TOCTOU risk that gets bigger with caching, but doesn’t go away by clearing the cache. ↩︎

barneygale · December 9, 2024, 6:31pm

I wholly agree here, and I recall Guido previously saying something similar about adding new APIs vs fixing what’s already there.

But IMO making Path.is_dir() and friends return cached results is a bridge too far. Users that rely on those methods retrieving up-to-date results would find their scripts break in hard-to-understand ways, and I can’t see a way to cushion the change with some kind of warning. Even if we restrict it to objects generated from iterdir(), the significant behaviour change remains.

We could maybe do it by introducing a new superclass (not subclass) of Path that’s allowed to cache results. That way we don’t break any Path guarantees. If there’s a route toward doing this then I’m open-minded, but right now it’s not clear to me.

MegaIng · December 9, 2024, 6:42pm

Maybe a keyword option for iterdir that enables caching on generated Path objects? That would at least make this an explicit opt in with very little hassle.

barneygale · December 9, 2024, 7:22pm

Even that is problematic:

# 3rd party library code
def delete(path):
    assert isinstance(path, pathlib.Path)
    os.unlink(path)
    assert not path.exists()  # should be up-to-date

# application code
for p in Path().iterdir(cache_results=True):
    delete(p)

Suddenly delete() needs to deal with stale results, with no opt-out.

steve.dower · December 9, 2024, 7:58pm

It always needed to deal with them. Unfortunately in this case, it goes from an occasional issue where the file is locked, isn’t a real file, or gets recreated quickly^[1] into one that always fails the assertion, but the possibility was already there. LBYL simply can’t be relied on for filesystem stuff.

The most obvious fix (for a caching change; not for the existing TOCTOU issue) would be to make it assert not os.path.exists(path), since they’ve already chosen to use os methods. Or path.unlink() which could, conceivably, update/invalidate its cache.

It makes the transition more complicated, but I’m not massively upset by mixing libraries leading to non-obvious behaviour.

In relative terms. For a mounted network file system, there could be plenty of real time. ↩︎

barneygale · December 9, 2024, 9:54pm

I’m gathering some historical background on caching/not caching.

Antoine’s initial version of PEP 428 posted to python-ideas included stat caching:

https://mail.python.org/archives/list/python-ideas@python.org/message/2FYSJJZ5XIVY5MAFDIE6G6K2AXGLOGPB/

Guido warns about caching:

https://mail.python.org/archives/list/python-ideas@python.org/message/SSMN57AHKM6HBKDXQUDEGWMF2LNOGAUU/

Alyssa explains why we might want different behavior in os.DirEntry and pathlib.Path:

https://mail.python.org/archives/list/python-ideas@python.org/message/KI2BQL64IAUVK5ZMR2ZRGXLCKQAHMXRO/

Paul Moore laments the lack of caching in pathlib.Path:

https://mail.python.org/archives/list/python-ideas@python.org/thread/C7RDAJZN2DHNV5JQUKLHZZINVAFAHT3X/#C7RDAJZN2DHNV5JQUKLHZZINVAFAHT3X

In-depth discussion of pathlib caching behaviour kicked off by Ram:

https://mail.python.org/archives/list/python-ideas@python.org/thread/RFTCPWUR6BS6EDVYSV5YTY3IDK345HGA/#6XO2PWUQ44LLXXRD5MJZNEUZIKDHF6VE

Guido works on adding a stat cache based on Ram’s suggestion:

github.com/python/cpython

Add stat caching option to pathlib

opened 10:41PM - 06 Jan 16 UTC

closed 09:43PM - 04 Nov 20 UTC

gvanrossum

type-feature

BPO | [26031](https://bugs.python.org/issue26031) --- | :--- Nosy | @gvanrossum,… @pitrou Files | <li>[statcache.diff](https://bugs.python.org/file41521/statcache.diff "Uploaded as text/plain at 2016-01-06.22:41:28 by @gvanrossum")</li> <sup>*Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.*</sup> <details><summary>Show more details</summary><p> GitHub fields: ```python assignee = None closed_at = <Date 2020-11-04.21:43:41.496> created_at = <Date 2016-01-06.22:41:29.808> labels = ['type-feature'] title = 'Add stat caching option to pathlib' updated_at = <Date 2020-11-04.21:43:41.495> user = 'https://github.com/gvanrossum' ``` bugs.python.org fields: ```python activity = <Date 2020-11-04.21:43:41.495> actor = 'gvanrossum' assignee = 'none' closed = True closed_date = <Date 2020-11-04.21:43:41.496> closer = 'gvanrossum' components = [] creation = <Date 2016-01-06.22:41:29.808> creator = 'gvanrossum' dependencies = [] files = ['41521'] hgrepos = [] issue_num = 26031 keywords = ['patch'] message_count = 4.0 messages = ['257651', '257652', '257658', '380366'] nosy_count = 2.0 nosy_names = ['gvanrossum', 'pitrou'] pr_nums = [] priority = 'normal' resolution = 'out of date' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue26031' versions = ['Python 3.6'] ``` </p></details>

Ferdi265 · December 9, 2024, 11:01pm

Barney Gale’s suggestion is currently the design I’m most comfortable with: It allows caching in the new .info property, but doesn’t change any existing Path behaviour, while at the same time not being an overly convoluted change.

The documentation for the info property could also prominently include that the information is potentially cached and that Path.is_dir() should be used instead of Path.info.is_dir() (or similar) if up-to-date info is required.