Ergonomics of new pathlib.Path.scandir()

cmaloney · December 9, 2024, 11:07pm

io.FileIO caches a stat to optimize common cases, but code tries to be correct if the stat values are wrong (ex. if the size of a file changes after open, you may get extra read() system calls and/or extra memory allocation, but you’ll get all the bytes). bpo-21679, gh-90102 and gh-120754 are iterations on that. It would be useful if a fresh stat could be passed into the I/O stack for those uses, but definitely are lots of ways can go wrong… gh-90102 is the most logically complex case to me of those / it relies on during one python library call (builtin open()) the fd shouldn’t change to being a tty between calling OS open, stat, and when the now often skipped isatty call would have occurred.

A lot of code (ex. importlib, build tools) does a walk +stat() then an open (walk, check if file needs to be re-processed based on stat timestamp vs last processed timestamp, open/read the file). It would be nice if there was a consistent way to get the stat into from the outside into the _io machinery but options I’ve looked at generally aren’t sufficiently safe / things can go wrong in ways that are likely to show up in standard usage patterns.

encukou · December 10, 2024, 8:01am

This kind of race issues are so hard to solve in general that you often do the work in a private tmpdir (or another area that other processes can’t/aren’t supposed to touch), and move the result when done.
Barney’s example is perfectly valid in such a situation.

Hm, but isn’t that what Python is? Just like x + 1 handles cases where x might overflow a machine word, or might not be an int at all, Path methods spend some extra effort to give you the up-to-date result at the time of call.

Stale caches are also complicated issues where you need to learn extra tricks to detect them, let alone fix them. Knowing that you need to add a p = Path(p) is not trivial.
If we add a caching object that quacks like a Path, we’ll definitely run into mismatched expectations.

If you use NumPy, you learn to avoid for loops. If you do massive globs, you’ll learn to use info. It’ll be fine.

steve.dower · December 10, 2024, 12:47pm

I was thinking more like the secret optimization we did for str.__iadd__, rather than continuously going around shouting at people to just use str.join (though plenty of us do that, too).

I’d argue (and in fact, did argue) that mixing libraries is the problem. Regardless of the situation, when you mix two different libraries together, you need to be aware of the implications of mixing them.

But in any case, what’s done is done, and we’re unlikely to see any broad change to default paths. So my preference then leans towards preventing API creep, and adding Path.scandir() feels like API creep. It’s not “creep” if it’s justified, but the justifications haven’t convinced me.

The primary justification has been that it makes glob() (et al.) faster. My counter to this is that glob() can already be specialised for Path subclasses, and so if we want a faster glob() then we should just make it faster rather than adding a generic interface that might sometimes make it faster.^[1]

The secondary justification has been that it makes it easier for subclasses to implement an efficient glob() by only having to implement scandir(). My counter to this is entirely hypothetical, and partially disproven, but I would expect it to be even more efficient in most cases for glob() to be overridden anyway (e.g. a REST-based “file system” ought to offer server-side filtering and recursion, which scandir() cannot emulate, but glob() can use). Unfortunately, a few prominent examples don’t offer these APIs (GitHub and Artifactory), so having subclassers provide scandir is indeed simpler.

But ultimately it all boils down to pathlib paths not knowing whether they intend the path to be a file or a directory - or more generically, a leaf or a container. If that was part of the path, then we wouldn’t be here at all, because we’d rely on intention (“was this path meant to be recursed”) rather than actuality (“dear mr. stat, please make a syscall and tell me what this path means, I’ll wait”). This one has come up before, so perhaps we can find a way to introduce it safely?^[2]

Given a time machine, I’d make all the stat methods on Path cached by default (at some unspecified time between creation and access) and give them all an argument to recalculate. Since we’re past that, and since it really is only the directory vs. file distinction that could matter, maybe we can figure out something specific to that. Ideally which won’t lead to API creep (which implies to me that iterdir() is able to fill it in).

The proposed interface has improved over the discussion on GitHub, IMHO, but still doesn’t make me happy. ↩︎
Bearing in mind that our glob() implementation totally can have the ability to handle attempts to recurse into a file that used to be a directory, so it doesn’t have to be affected by TOCTOU issues. ↩︎

barneygale · December 10, 2024, 5:43pm

Same! I’d probably make it retain trailing slashes too…

You’ve convinced me that adding Path.scandir() is not the right way forward. My current thinking is that we add a Path.status attribute that explicitly caches. Here’s a preview of its documentation: pathlib — Object-oriented filesystem paths — Python 3.14.0a2 documentation (also see the new pathlib.types.Status protocol at the bottom).

I like this idea. Could we add some of these to os.DirEntry too?

Perhaps Path.status (and hence os.DirEntry in my current proposal) could support __eq__() rather than having a get_key() method? If I’m understanding the os.DirEntry.inode() documentation correctly, we should be able to return False if the inodes are different without making any system calls on POSIX. (If the inodes are the same we’ll need to stat() to fetch and compare device IDs)

Symlink targets are an odd case - are they data or metadata? We already have Path.readlink() and I think that’s sufficient.

encukou · December 10, 2024, 6:25pm

Or, Path could be an interoperability mechanism; a generic object to be passed between various libraries
That seems the direction of the push to putting the same API on virtual filesystems.

Oof, well, there’s an edge case here, and it’s another reason why DirEntry might not be OK to use for status directly.
scandir return inodes from the parent directory, on the same device as the parent.
If one of the entries is a mount point, you’ll get a different device & inode number from stat (because stat will query the filesystem root that replaces the directory entry).
Spot the mount points:

>>> for entry in os.scandir('/'):
...     print(entry.name, entry.inode(), entry.stat(follow_symlinks=False).st_ino)
...     
boot 257 2
dev 258 1
home 259 256
proc 260 1
[...]
mnt 3100 3100
sbin 6671223 6671223

barneygale · December 10, 2024, 6:42pm

:o very interesting! Scratch what I said about inode() then…

Would you mind (re-)summarising why you think os.DirEntry is inappropriate? To be clear, I’m currently proposing that Path.status implements a new Status protocol, which only includes is_dir(), is_file() and is_symlink(). The real type is either os.DirEntry (for paths generated from Path.iterdir()), or a private pathlib type with only those three methods. It happens that os.DirEntry exposes a few more methods and attributes than required by the Status protocol – are you saying that’s a problem? Or is there something else I’m missing? I apologise for being slow on the uptake here… Thank you

steve.dower · December 10, 2024, 7:51pm

In that case, os.unlink should learn to recognise Path objects and call their unlink method in order to stay within the generic library. Currently, it converts out of the generic object and into a different mechanism entirely, which is why you only get partial functionality by doing it this way. (I’m not actually proposing that - I’m quite okay with the os module only being aware of the __fspath__ protocol and not the entire Path protocol. But it does require recognising that the os module is not actually participating in pathlib interop.)

Data, I would say. But that’s mostly gut feel and I don’t have a solid definition for why, or at least a definition that’s going to be portable. I want to say that metadata is solely referenceable by the path, and ignore bypassing paths using inode/file IDs and hard links. Probably the best we can really do is choosing some platform-specific API calls (presumably POSIX) and fudge the rest.

encukou · December 11, 2024, 9:48am

Conceptually, DirEntry is information about a directory entry, not about the file (as illustrated in the edge case above).
Practically: there is other metadata that would be useful to cache: file size, creation time, etc. DirEntry doesn’t contain the info, and no support for partially filling it in (e.g. one part from iterdir, then another from stat).
DirEntry is useful (and conceptually relevant) on its own. If we add a “FileInfo”, we should not turn DirEntry into it or use DirEntry as FileInfo. But, I also think DirEntry is much more relevant to Path subclassers (as the backend for in iterdir/glob from the base class) than to users. If you give it to users, they’ll rightly complain that it isn’t what they want.

I’ve seen file metadata stored in a dict, and I think that’s inferior to a well-defined, typing-protocol-friendly class (even if the class is slightly different for each kind of filesystem).

I see calling os.unlink as equivalent to shelling out to rm. The latter can’t get a Path, and neither can invalidate all Paths representing the removed file.

Having os.unlink understand Path wouldn’t solve cache invalidation. It still needs to be done with care; not caching is the safe (but not performant) default.

(Perhaps the more fundamental issue is mixing representations (Path/str), rather than libraries? There’s inevitably a boundary somewhere along the stack between you and a syscall. os is on the other side and, sadly, all pre-3.4 code uses it.)

I’d treat them as metadata. IMO, filesystems store them as data mainly because they’re not stat-friendly fixed-size integers (and there’s a ready-to-use mechanism to store data, which symlinks don’t use for anything else). That’s of no concern to Python.
You never need buffered or streamed IO for this short string.

steve.dower · December 11, 2024, 4:38pm

In C++ I’d argue representations, because the conversion from Path to str would occur because the caller passed type A as an argument of type B, which is valid because type A can be implicitly converted to type B.^[1]

For Python I wouldn’t apply this, because Python’s type system doesn’t work like this. In Python, type A (Path here) gets passed as itself, and it’s up to the recipient (in this case, os) to use it properly.

Modifying the state of the object without updating the cache of the state where the requirement is that the cache is immediately updated is not using it “properly”. But the problem is that requirement, not the type handling or the libraries. It’s just not possible to have a multi-referenceable object (such as a file, referenced by a path string) have immediately consistent state stored in the reference. So it needs to be defined as potentially outdated (perf > correctness), or non-cached (correctness > perf).

This is all a side discussion about API/library design, though. It’s not going to change anything we do here

Nowhere defines it as a “short” string. I presume Windows limits it to 32KB like other paths, and you’re certainly allowed to store other information in a reparse point besides just a symlink target. Buffered IO is not impossible.

I’m not aware of any cases where a generic system API (i.e. readdir or stat) is also going to give you symlink targets for “free”, so even if we were to treat it as metadata, it would just be covering up a more expensive call in most cases.

The equivalent in Python would be to write os.unlink(str(path)), which completely absolves unlink from any responsibility to understand a Path object. ↩︎

encukou · December 11, 2024, 5:25pm

Ah, I see where we’re talking past each other. os.unlink serves double duty as the backend and as a __fspath__-aware generic interface.
Yeah, let’s drop this conversation. This ain’t getting redesigned.

Oh, right: if a metadata class needs to read all its info at once, then leave readlink() out.
But if we already need to combine e.g. is_dir() from DirEntry and mtime from stat, adding another piece that needs an extra call shouldn’t be an issue.

barneygale · December 11, 2024, 10:47pm

Petr Viktorin:

Conceptually, DirEntry is information about a directory entry, not about the file (as illustrated in the edge case above).
Practically: there is other metadata that would be useful to cache: file size, creation time, etc. DirEntry doesn’t contain the info, and no support for partially filling it in (e.g. one part from iterdir, then another from stat).
DirEntry is useful (and conceptually relevant) on its own. If we add a “FileInfo”, we should not turn DirEntry into it or use DirEntry as FileInfo. But, I also think DirEntry is much more relevant to Path subclassers (as the backend for in iterdir/glob from the base class) than to users. If you give it to users, they’ll rightly complain that it isn’t what they want.

I’ve seen file metadata stored in a dict, and I think that’s inferior to a well-defined, typing-protocol-friendly class (even if the class is slightly different for each kind of filesystem).

Thanks for the explanation, that’s very useful! I was hoping that documenting the three methods of the pathlib.types.Status protocol would be sufficient to steer new users away from subtler methods like os.DirEntry.inode(), but perhaps it’s not good enough.

I don’t think it’s possible to export just the cached information from an os.DirEntry object and transplant it into a FileInfo object at the moment, so I think FileInfo would need to optionally wrap a DirEntry object. I’ll play around with some code

barneygale · December 11, 2024, 11:41pm

I’m playing around with wrapping os.DirEntry objects rather than exposing them directly, and yeah, this is clearly way better. Thanks @encukou

For starters, it means I can add exists() to the protocol without needing to modify os.DirEntry. Apart from being a useful feature in itself, this will be important for removing the PathBase.stat() method (hence banishing os.stat_result-like stuff from the ABCs).

But then we can think about adding file_size(), permissions(), access_time() etc!! These come up as pathlib feature requests perennially, but we’ve always rejected them, because adding uncaching Path methods would wreck performance for users. With Path.status that would no longer apply, and we wouldn’t need to worry about modifying the os.DirEntry implementation either. Zomg this is pretty exciting.

barneygale · December 13, 2024, 5:16pm

Pondering upon this idea:

Are you using a method rather than an attribute because it might perform an initial os.[l]stat() in your plan? (But not if the path object came from iterdir(), right?). Would the Info object still have methods, or would they become attributes (like path.get_info().is_dir?). If they become attributes, wouldn’t that make it awkward to add methods in future? (e.g. if they needed to call os.listxattr() from Status.xattrs())

If follow_symlinks=True is the default, then would path.get_info().is_symlink() always return False, because symlinks have already been followed?

On naming Path.info: we could call it cached_info or cached_status to make the cache explicit, perhaps?

encukou · December 16, 2024, 1:47pm

It would be allowed to, I guess?
It’s more of brainstorming than a plan :‍)

I think it would be fine if the info attributes would be allowed to query the filesystem.

That is a very good question.
Assuming the info contains a link_target, as a Path, I see questions like:
Should we have a proxy that ensures self.is_symlink() true, and proxies everything else to link_target’s info? Probably not.
Should we have two distinct info objects for follow_symlinks=True and follow_symlinks=False? Probably yes. (That’s how stat works on DirEntry, right?)
Should we have distinct info objects for a path and its link_target? Probably yes.
Should link_target’s info be initialized from the base path’s DirEntry? Maybe?

Alas, I don’t think I can find time to analyze all the pros and cons this year.

Not if it’s a method with a cached=True argument :‍)

barneygale · December 17, 2024, 8:33am

Thanks very much I’m suppose I’m failing to see the advantages of a get_info() method vs an info attribute, though I see some disadvantages (below)

IMO it makes the interface a tiny bit more awkward if both these lines might perform FS access (and might raise OSError):

info = path.get_info()
xattrs = info.xattrs  # possible future addition

I’m really not a fan of this! It would be inconsistent with everything else in pathlib. It was discussed and rejected for os.DirEntry: PEP 471 – os.scandir() function – a better and faster directory iterator | peps.python.org

All these questions vanish if we use an attribute rather than a method. The is_symlink() method can internally use lstat() so it only ever operates on the symlink’s stat. There’s no need for multiple objects. If a lstat() result indicates a non-symlink, we can record it as both the lstat() and stat() result internally to make future stat()s “free”:

github.com

barneygale/cpython/blob/dc6edc818077582f3e1ea4208213b23cb6bdc1a6/Lib/pathlib/_local.py#L92-L96


      
          if follow_symlinks or S_ISLNK(mode):
              self._mode[idx] = mode
          else:
              # Not a symlink, so stat() will give the same result
              self._mode = [mode, mode]

Such data-sharing becomes a little more complicated if we spin up different Info objects for follow_symlinks=False, =True, and for the link target.

More generally I want to stick close to relevant parts of the pathlib.Path and os.DirEntry APIs. Both of these offer an is_dir(*, follow_symlinks=True) method, not an is_dir attribute. I’d feel like I’m going out on a limb by second-guessing outcomes of the PEP 471 discussions.

I don’t want to take up any more of your time, but I’m not sure what to do here. Perhaps I should hold off getting the PR reviewed until you might have more availability? Or perhaps we could appeal to other core devs for answers to the method-vs-attribute question? (I’d love @ncoghlan’s take, if she’s still following).

encukou · December 17, 2024, 9:53am

You’ve thought about this more than I have, and your plan makes sense.
If you thought about the alternatives, my brainstorming has served its role. Don’t let it block you!

barneygale · December 17, 2024, 9:56am

Super. I really appreciate all the feedback - I feel much more comfortable with a design that’s been subject to scrutiny. And there’s still plenty of time to revise this or back it out (again!) before beta 1.

steve.dower · December 17, 2024, 11:40am

The part of this that feels like a limitation to me is assuming that we’ll cache at the level of a complete stat object. I’d prefer to be able to cache the path’s attributes individually, so that it isn’t necessary to calculate one expensive attribute in order to cache some that I may get for free.

For a concrete example, Windows for a long time has had “practically free” access to about half a stat object. But because our API has always been a complete stat object, we’ve never been able to make use of that. This is why DirEntry arose. I also advocated pretty hard for the new Windows stat-like API to include everything we want in a stat object so that we can use it - they didn’t see any need for some info, and I think we only got it because I was annoying about it

To be clear, I’m not advocating for storing stat and lstat results separately, even though that’s the bit I quoted. I’m advocating for storing st_mtime, st_mode, st_size, etc. directly, and allowing each one to individually be cached or uncached (with transparent “go and retrieve it when asked” semantics). It doesn’t have to mirror a stat object precisely, I just use those names for convenience.

Caching is_dir and is_symlink is exactly the kind of thing I mean, whereas caching stat is bundling up too many independent variables behind a single value. If you can’t fill in every part of that value for free, you can’t fill in any of it for free, and we leave performance behind. If it’s split up, there’s a much higher chance we fill in a necessary value for free, and the perf hit only affects those who needed something specific.

encukou · December 17, 2024, 12:07pm

Right. Hopefully, using two cached stat results as the “backend” for the methods could be a PosixPath implementation detail.

barneygale · December 17, 2024, 12:12pm

Agreed! I’ll make sure the Windows implementation doesn’t rely on caching stat results.

Somewhat related question: should the Status.is_() methods suppress OSError, and return False instead? This is the behaviour of os.path.isdir() and pathlib.Path.is_dir(), but not os.DirEntry.is_dir() (suppresses FileNotFoundError but not other OSError).

I’m inclined to suppress all OSErrors. For one thing, it means I can use ntpath.isdir() and friends on Windows