Make pathlib extensible

Most of the functions in the Windows file API first normalize a path into native NT form before making a system call such as NtCreateFile() or NtOpenFile(). Among other things, path normalization replaces forward slashes with backslashes. There are exceptions.

Of course, normalization is intentionally skipped for “\\?\” device paths. For example, r"\\?\C:\Windows/System32" is an invalid path because NTFS treats forward slash as a reserved character that is not allowed in names. Like all code in the native NT API and system services, the NTFS filesystem only handles backslash as a path separator.

>>> os.stat(r'\\?\C:\Windows/System32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\\\\?\\C:\\Windows/System32'

None of the Path* API functions, such as PathCchSkipRoot(), handle forward slash as a path separator.

When creating a relative symbolic link, CreateSymbolicLinkW() (i.e. os.symlink()) does not replace forward slashes with backslashes in the target path. This creates a broken symlink since paths in the kernel only use backslash as a path separator.

>>> os.mkdir('spam')
>>> open('spam\\eggs', 'w').close()
>>> os.symlink('spam/eggs', 'eggslink')
>>> os.stat('eggslink')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'eggslink'

There are several other functions in the Windows API that take file paths and fail to normalize forward slashes as backslashes, such as NeedCurrentDirectoryForExePathW().

It’s really a laundry list of exceptions to the rule. Better to just use the native path separator than to worry about what does and does not support forward slashes.
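A minimal sketch of "just use the native separator", assuming the path comes from somewhere that may use forward slashes. The helper name to_native_windows() is made up for illustration; ntpath is used instead of os.path so the snippet behaves the same on any platform. Note that normpath() also collapses "." and ".." components, which may or may not be what you want.

```python
import ntpath
from pathlib import PureWindowsPath


def to_native_windows(path):
    """Hypothetical helper: return *path* using backslashes (Windows rules)."""
    return ntpath.normpath(path)


# Both approaches yield backslash-separated strings.
print(to_native_windows("spam/eggs"))
print(str(PureWindowsPath("C:/Windows/System32")))

# On Windows, the symlink example above could then be written as:
#     os.symlink(to_native_windows("spam/eggs"), "eggslink")
```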

Also, when paths are parsed as command-line arguments, applications may fail to handle paths that use forward slashes. Notably, the CMD shell has this problem. For example:

>>> os.system('dir C:/Windows')
Parameter format not correct - "Windows".

3 Likes

What I’m saying is that normalization is not required there.

The key thing is os.fspath() prevents you from accidentally calling str() on a non-path-like object like None.
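To illustrate the difference, here is a small sketch: str() happily converts None into the string "None", whereas os.fspath() accepts only str, bytes, or objects implementing __fspath__() and raises TypeError for anything else.

```python
import os
from pathlib import PurePosixPath

# str() accepts anything, so a stray None silently becomes a "path".
print(repr(str(None)))

# os.fspath() returns the string form of a path-like object...
print(os.fspath(PurePosixPath("spam/eggs")))

# ...but refuses objects that are not path-like.
try:
    os.fspath(None)
except TypeError as exc:
    print("rejected:", exc)
```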

The idea is that the string representation is like an encoding of a path just like some integer can be an encoding for a Unicode code point, and thus not something to directly think about if you’re using pathlib.

2 Likes

:sparkles: March 2023 progress report :sparkles:

Thank you to @AlexWaygood, @hauntsaninja and @steve.dower for reviewing and merging performance improvements to path construction. There’s one remaining PR to land on that issue, after which it can be resolved. I’ve logged an issue for optimizing PurePath.__fspath__() by returning an unnormalized path, and another for implementing os.path.splitroot() in C. I’m also looking at an issue with glob() performance.

Adding AbstractPath is a multi-year yak-shaving exercise, and with some of those performance improvements now in place, I can approach the yak with shears in hand:

That PR unifies and simplifies path construction, and opens the door to adding AbstractPath in short order. It’s something of a milestone for this project! I’m beginning to believe we could land AbstractPath in time for Python 3.13 :slight_smile:

Thanks as ever for reading, ta ra!

15 Likes

Oh, and lest we forget to mention, perhaps one of the most important updates is to congratulate @barneygale on his nomination to core developer (and pathlib maintainer) on the basis of his exceptionally diligent, thoughtful and tireless work on pathlib and beyond!

17 Likes

And now it’s official!

3 Likes

Congratulations, @barneygale . You really deserve it and it has been a pleasure working with you on pathlib so far!

3 Likes

:sparkles: April 2023 progress report :sparkles:

Big thanks to Steve Dower for reviewing GH-102789, which shaved the aforementioned yak. Path object construction now uses a single code path, so user subclasses can override __new__() and __init__() and expect that their methods will actually be called when new path objects are created.

For Python 3.12 beta 1, I’m hoping to get one more improvement in: GH-100481, which adds a new makepath() method. User subclasses can override this method to customize how path objects are created. Among other things, this allows users to share objects such as sockets or fileobjs between path objects. When this lands I will consider support for subclassing of pathlib classes complete :partying_face:

For Python 3.13, I’ll aim to add a tarfile.TarPath class (see GH-89812) utilizing something like pathlib._AbstractTraversable or pathlib._AbstractPath. This should bring any remaining shortcomings to the fore; once resolved, we can drop the _ prefix(es); that could happen in time for 3.13, but more likely it will be a 3.14 or 3.15 thing.

In mostly-unrelated pathlib news, I have two open PRs that implement glob()-related features (GH-101398, GH-102710), and two that slightly improve performance (GH-103526, GH-103549). Appreciative of any reviews!

That’s all for now, cheers!

12 Likes

Hey, I am not sure if this is an appropriate thread, but I have an idea as to how to make Pathlib even better.

When I write files to disk, I seem to repeat this pattern very often:

import itertools
from pathlib import Path

file_name, file_counter = Path("file_name.here"), itertools.count(1)
stem, suffix = file_name.stem, file_name.suffix  # suffix already includes the leading dot
while file_name.is_file():
    file_name = Path(f"{stem} ({next(file_counter)}){suffix}")

Would it be possible for Pathlib to include a built-in method that gives you the first name that doesn’t overwrite an already existing file? I would imagine that this is a very common problem to run into, not just for me.
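For reference, the pattern can be wrapped in a small helper today. This is only a sketch: the name create_unique() is made up, and it creates the file with exclusive mode "x" rather than checking is_file() first, so two processes cannot pick the same name.

```python
import itertools
import tempfile
from pathlib import Path


def create_unique(path):
    """Hypothetical helper: create and return the first free variant of *path*."""
    path = Path(path)
    numbered = (
        path.with_name(f"{path.stem} ({n}){path.suffix}")
        for n in itertools.count(1)
    )
    for candidate in itertools.chain([path], numbered):
        try:
            # Mode "x" fails with FileExistsError if the file already exists.
            candidate.open("x").close()
        except FileExistsError:
            continue
        return candidate


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    names = [create_unique(Path(tmp) / "file_name.here").name for _ in range(3)]
print(names)
```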

Would tempfile.mkstemp() work for that use case? It avoids a race condition.

import tempfile
import pathlib

fd, path = tempfile.mkstemp(prefix="file_name.")
path = pathlib.Path(path)

3 Likes

Very interesting solution here, and also a very good way of avoiding a race condition. Would it be possible to call something like this from a Path instance itself? e.g.

import pandas as pd
from pathlib import Path

path = Path(r"/path/to/csv/test_csv.csv")
df = pd.read_csv(path)

...

# Given right flags, writes to mkstemp-style path like "/path/to/csv/test_csv_x0u3hb3m.csv", thus keeping both the original and a modified copy of the CSV file
df.to_csv(path.mkstemp(...))  
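Path has no mkstemp() method, but something close can be sketched as a free function on top of tempfile.mkstemp(), which already accepts prefix, suffix and dir arguments. The helper name sibling_temp_path() is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def sibling_temp_path(path):
    """Hypothetical helper: create a uniquely named file next to *path*.

    The new name keeps the original stem and suffix, e.g.
    "test_csv.csv" -> "test_csv_<random>.csv".
    """
    path = Path(path)
    fd, name = tempfile.mkstemp(
        prefix=f"{path.stem}_",
        suffix=path.suffix,
        dir=os.fspath(path.parent),
    )
    os.close(fd)  # mkstemp() returns an open descriptor; we only need the name
    return Path(name)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "test_csv.csv"
    original.write_text("a,b\n1,2\n")
    copy_path = sibling_temp_path(original)
    copy_name = copy_path.name
    both_exist = original.exists() and copy_path.exists()
print(copy_name)
```

The returned path could then be passed to df.to_csv(), leaving the original file untouched.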

I’m not so keen on adding new methods to pathlib without a really good use case, and that one is probably a bit too uncommon.

It’s not quite your use case, but I wouldn’t mind if tempfile.NamedTemporaryFile and TemporaryDirectory were made to inherit Path in future. The snag is that they have a name attribute that’s incompatible with Path.name (full path vs base name).

2 Likes

:sparkles: May 2023 progress report :sparkles:

GH-100481 has landed! Path objects have a new with_segments() method that is called whenever a derived path is created, such as from path.parent or path.iterdir(). Thank you Alex Waygood and Éric for the reviews, and everyone who helped bikeshed the method.

What this means: from Python 3.12 you can subclass PurePath and Path, plus their Windows- and Posix-specific variants. You will not receive an AttributeError when you try to instantiate your subclass (issue); any custom initialiser you add will be called (issue, issue); and by overriding with_segments() you can pass information between path objects.
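A short sketch of what that looks like in practice, closely following the pattern in the pathlib documentation. SessionPath and session_id are illustrative names, and the usage part is guarded because it needs Python 3.12 or later.

```python
import sys
from pathlib import PurePosixPath


class SessionPath(PurePosixPath):
    """Illustrative subclass that carries a session_id alongside the path."""

    def __init__(self, *pathsegments, session_id):
        super().__init__(*pathsegments)
        self.session_id = session_id

    def with_segments(self, *pathsegments):
        # Called by pathlib (3.12+) whenever a derived path is created,
        # so this is where extra state is handed to the new object.
        return type(self)(*pathsegments, session_id=self.session_id)


if sys.version_info >= (3, 12):
    etc = SessionPath("/etc", session_id=42)
    hosts = etc / "hosts"
    print(hosts.session_id, hosts.parent.session_id)
```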

I’ve begun work on tarfile.TarPath, and I’m hoping to have a PR up within a few weeks (lots of tests to write and get passing!). I’m confident I can get this in for Python 3.13. It will utilize a new pathlib._AbstractPath class under the hood.

However, in order to add a public AbstractPath class, I’m pretty sure we’ll need to move three methods from PurePath to Path, which effectively removes them from the AbstractPath interface. They are:

  • as_uri() – this returns a file:// URI, which is only applicable to local filesystem paths. It also uses os.fsencode() to encode the path, which can vary by system. This doesn’t make much sense for subclasses of AbstractPath in general, which may have a different URI representation or none at all. Library authors shouldn’t be expected to remember to delete or re-implement as_uri() when subclassing AbstractPath.
  • __bytes__() – as with the above, this uses os.fsencode() and is unlikely to be applicable to, say, a path stored in an ISO 9660 disc image with Joliet extensions, which uses UTF-16BE under the hood.
  • __fspath__() – because it would be catastrophically awful if open(TarPath('README.md', ...)) opened a local file in the current working directory called README.md.
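The __fspath__() hazard can be shown without any tarfile machinery. In this sketch, ArchiveMemberPath is a made-up stand-in for something like TarPath; any object with __fspath__() is accepted by os.fspath() (and therefore by open()), and the archive it belongs to is silently ignored.

```python
import os


class ArchiveMemberPath:
    """Made-up stand-in for a path that lives inside an archive."""

    def __init__(self, member, archive):
        self.member = member
        self.archive = archive

    def __fspath__(self):
        # The hazard: only the member name survives, the archive is dropped.
        return self.member


p = ArchiveMemberPath("README.md", archive="release.tar")

# open(p) would resolve the path the same way, i.e. it would open
# ./README.md on the local filesystem rather than the archive member.
print(os.fspath(p))
print(isinstance(p, os.PathLike))
```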

These moves will require a deprecation period, and so I think we’re looking at Python 3.15 for the addition of pathlib.AbstractPath. There are ways to do it sooner (e.g. by having AbstractPath not subclass PurePath) but they have their own problems.

I suspect that I’ll be getting into the weeds of tarfile.TarPath development in future updates. Stay tuned, and thanks for following along!

12 Likes

What would be the resulting class hierarchy?

I think you may want an intermediate class PureLocalPath such that those three methods can simply be exposed on PurePath.

AbstractPath is inserted between PurePath and Path:

class PurePath:
   ...

class AbstractPath(PurePath):
   ...

class Path(AbstractPath):
   ...

Maybe! It would mean that AbstractPath isn’t a subclass of PurePath though, and I find that surprising. I’ll play around with it again.

Why not make PurePath inherit AbstractPath? I don’t think moving most methods to AbstractPath would break any existing users.

AbstractPath is roughly the abstract version of Path – it includes methods like stat(), open() and iterdir(). We don’t want those methods in PurePath, so I don’t think PurePath can inherit from AbstractPath.

2 Likes

IMO, the most logical structure, the one that makes it clearest where the various classes sit in the hierarchy, matches my intuitive expectations and preserves backward compatibility, would be to add an AbstractPurePath class that defines the PurePath methods that aren’t specific to a local filesystem path (i.e. everything except those three methods).

I.e.:

class AbstractPurePath:
    # All PurePath methods that apply to abstract paths (i.e. all except for the three above)
    ...

class PurePath(AbstractPurePath):
    # PurePath methods specific to local file paths (i.e. the above, currently)
    ...

class AbstractPath(AbstractPurePath):
    # The Path methods like `open()`, etc. that apply to any abstract path
    ...

class Path(PurePath, AbstractPath):
    # Other Path methods that are specific to local filesystem paths
    ...

Which would look like:

  AbstractPurePath
       ^   ^
      /     \
     /       \
PurePath  AbstractPath
     ^        ^
      \      /
       \    /
        Path

However, I notice you mentioned

Could you maybe explain a little more about this, particularly with respect to the alternative proposed here?

FWIW, it’s only one data point, but I would find that structure less obvious and intuitive than the one diagrammed here (which, if I understand it correctly, sounds similar to what I think @pitrou is intending to propose), particularly given the confusion over the relationship between AbstractPath and PurePath in the former.

2 Likes

I don’t find the diamond inheritance and extra class any more intuitive tbh. The three-part linear inheritance makes most sense to me:

  • PurePath interface: purely lexical operations on paths, invariant across systems
  • AbstractPath interface: adds abstract stat(), open() and iterdir() methods, and other methods that depend on them (exists(), read_text(), glob(), etc)
  • Path interface: add concrete stat(), open() and iterdir() methods that access the local filesystem
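A rough sketch of the middle layer, under loud assumptions: SketchAbstractPath and DictPath are invented names, only open() is treated as abstract, and the "filesystem" is a plain dict. It is not the real pathlib._AbstractPath, just an illustration of derived methods (exists(), read_text()) being built on an overridable primitive.

```python
import io
from pathlib import PurePosixPath


class SketchAbstractPath(PurePosixPath):
    """Invented middle layer: lexical operations plus abstract I/O."""

    def open(self):
        raise NotImplementedError("subclasses must provide open()")

    # Derived methods depend only on the abstract primitive above.
    def exists(self):
        try:
            self.open().close()
        except FileNotFoundError:
            return False
        return True

    def read_text(self):
        with self.open() as f:
            return f.read()


class DictPath(SketchAbstractPath):
    """Invented concrete layer backed by an in-memory dict, not the local disk."""

    files = {"/README.md": "hello"}

    def open(self):
        try:
            return io.StringIO(self.files[str(self)])
        except KeyError:
            raise FileNotFoundError(str(self)) from None


readme = DictPath("/") / "README.md"
print(readme.read_text(), readme.exists(), DictPath("/missing").exists())
```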

IMO the as_uri() and __bytes__() methods are already misplaced, as the use of os.fsencode() is impure and relates to the current system. And I’d argue that makedirs() should throw a TypeError here, rather than creating a file with a backslash in its name:

>>> from pathlib import PureWindowsPath
>>> import os
>>> os.name
'posix'
>>> os.makedirs(PureWindowsPath('foo', 'bar'))
>>> os.listdir()
['foo\\bar']
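The reason this happens can be seen on any platform: os.makedirs() only sees what os.fspath() returns, and for a PureWindowsPath that is a backslash-separated string, which POSIX treats as a single file name.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# The flavour of the pure path decides the separator, not the running OS.
print(repr(os.fspath(PureWindowsPath("foo", "bar"))))
print(repr(os.fspath(PurePosixPath("foo", "bar"))))
```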

“Pure” here doesn’t mean filesystem-agnostic, but simply that it doesn’t perform any I/O. pathlib was always meant to model the local filesystem and designed for that.

Regardless of where you want to lead pathlib to, it would be nice not to break compatibility.

3 Likes