Make pathlib extensible

jamestwebber · July 27, 2023, 4:09pm

Yeah I guess in the past I’ve just used a regular old Path and as long as you don’t try to access anything it works fine. It just feels like I’m doing something hacky.

pitrou · July 27, 2023, 4:12pm

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

So my vote is strongly on PathBase.from_uri().

EpicWink · July 29, 2023, 2:30am

While adding from_uri, is it worth adding a standard-library UnsupportedURI exception class? This would support the registration pattern: a single function supports many URI types, then iterates through registered AbstractPath (or whatever) subclasses until one of their from_uris doesn’t requires this exception

Conchylicultor · July 29, 2023, 8:55am

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

Maybe I’m not sure I understand the issue or which problem would from_uri be solving.

We’ve been using epath extensively in many projects (e.g. tensorflow_datasets and others) and never encountered any issues. On the contrary, epath allow to manipulate files without having to think about the underlying file system (local, GCS, Windows,…).

For us, for sure using .from_uri would make usage more complicated. Currently, all remote and local path system are supported the same way. This mean, that just by using the right pathlib API (e.g. epath, upath), our code is automatically compatible with remote file system !

For example, we use this very common file-pattern:

def load_img(path: os.PathLike | str):
  path = epath.Path(path)  # Normalize path
  ...

Without need for any special cases (URI vs local path), it allow our code to support all backends:

load_img('gs://local/file')
load_img('/local/file')
load_img('/<google-internal-file-system>/file')
load_img(pathlib.Path('/local/file'))
load_img(upath.Path('gs://file'))

Note the last example: Because epath / upath returns the URI on os.fspath, this allow cross-pathlib backend compatibilities for all URIs (even though upath was developed completely independently, it is natively compatible with epath, tf.io.gfile). For example:

assert os.fspath(upath.UPath('/aaa/bbb')) == '/aaa/bbb'
assert os.fspath(upath.UPath('gs://aaa/bbb')) == 'gs://aaa/bbb'

path = epath.Path(upath.Path('gs://path'))  # Works out of the box

path = upath.Path(epath.Path('gs://path'))  # Works out of the box

tf.io.gfile.exists(upath.Path('s3://path'))  # Works out of the box

This os.fspath behavior might be semantically incorrect but very convenient. This is a case where practicality beat purity.

For us, this really simplify our live because we don’t need to care whether the path is local or remote. Just using the right pathlib API and everything will magically work. And the cross-pathlib backend compatibility is bonus as you can just propagate pathlib-like objects from one module to another, even if the 2 modules are using different pathlib backend.

Note that personally, I don’t really have strong opinion whether a .from_uri is added to the API or not, but no matter what, epath will still continue to accept URI in __init__, just because it’s too convenient and the alternative would complexify everything for our code and our users. I hope I clarify my reasoning and motivation behind.

barneygale · July 29, 2023, 2:18pm

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous! If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception, because S3 paths are not local paths and do not have a local filesystem representation, and therefore are not os.PathLike.

Could you provide a etils.path_from_uri() function that checks the scheme and defers to the from_uri() method of an appropriate class? We could potentially add a PathBase.uri_scheme attribute to make this easier, but adding a registration system to pathlib itself feels like a can of worms!

Conchylicultor · July 29, 2023, 2:44pm

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

I still don’t see concretely how this is a problem in practice. If the user want a local URI, they can still call path.as_uri(). I don’t see which concrete problem users would encounter.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous

I agree os.fspath is sub-optimal, but for us, cross-compatibility with other API is required. Many users pass pathlib API to TensorFlow like tf.data.Dataset, tf.io.gfile,…:

path = epath.Path('g3://path')

tf.io.gfile.exists(path)  # Should work
ds = tf.data.TFRecordDataset(path)  # Should work

This should work out-of-the box for users. TF does not now anything about epath. The only standard interface that exists to pass path is os.PathLike | str. So epath has to return gs:// URI in os.fspath. This is the only standard way for TensorFlow to correctly infer the path.

If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception.

I agree, but I also feel like os.makedirs('s3://path') should raise an exception (independently of __fspath__). It’s not the case currently but maybe it should ?

Could you provide a etils.path_from_uri()

What would be the benefit vs having this in __init__ ?

This would only make the API more complicated (3 ways of creating a path vs 1 currently).
The current code is meant to be compatible with remote files system by default (users cannot get it wrong). By adding etils.path_from_uri(), it’s very easy for users to keep calling epath.Path(path) and this would introduce bugs in their code.

barneygale · July 29, 2023, 3:17pm

Why would it raise? It’s a perfectly valid path! The thing that should raise is os.fspath(). By instead returning a URI from __fspath__() you are handing users a downward-facing shotgun that is liable to go off whenever they use an API that calls os.fspath() (with the exception of TensorFlow, apparently).

mdrissi · July 31, 2023, 5:51am

Scanning over pandas handling of gcs path it looks to make a similar assumption and expects gcs path (or s3 path/etc) to be returned from fspath. dask is another library that assumes similar and will coerce Path with fspath and later expects gs prefix/uri to be given. So I don’t think tensorflow is special here and it’s common (not only choice) for fspath to be expected to be uri.

One library that is pick something closer to your idea of not returing uri is cloudpathlib. It has interesting choice that for remote paths like gcs/s3 instead of returns gs://path for __fspath__ it copies file locally and returns the local path instead. Which can work fine for reading, but unsure that makes much sense for any write api

barneygale · August 4, 2023, 6:13pm

Small update: I’ve logged an issue and PR for adding pathlib.Path.from_uri().

My main priority is still to add pathlib._PathBase. I’ve highlighted three bits of code in the PR that would most benefit from review: implementations of _PathBase.is_junction(), resolve() and _scandir(). Grateful for any reviews!

departure3560 · August 8, 2023, 9:10am

@barneygale Somewhat related: is it possible for PathBase to differentiate between PathBase("") and PathBase(".")?
Currently both are the same data with PurePath.
Please see the thread for details.

Thanks in advance!

barneygale · August 19, 2023, 8:47pm

August 2023 progress report

Not so much news over the last month. My main blocker remains GH-106337, which adds pathlib._PathBase. I’ll start begging for reviewers in the usual places soon

I’ve opened GH-107320 to make Path.iterdir() raise exceptions immediately, rather than on iteration. It turns out this is necessary to implement PathBase.glob() and walk() efficiently. I’ve also opened GH-107640, which adds a Path.from_uri() method. Finally, I’m doing some on-off work on making glob() and match() faster and simpler – big thank you @jaraco for reviewing GH-106703.

Also, I should note that I’m planning to publish a PyPI package that backports Python 3.13+ pathlib for earlier versions of Python, so pathlib.PathBase will be widely available soon after it lands in main . pathlib2 might be that package.

Thanks for reading, may your filesystems be forever virtual!

ajoino · August 19, 2023, 9:02pm

While I’m not too familiar with the pathlib internals, I’d gladly help reviewing docs.

Regarding the backport lib, the typing backport lib is called typing_extensions, maybe it would sense to name the pathlib backport lib in a similar fashion. Calling it pathlib2 makes it sound like a replacement for pathlib, which it isn’t afaict (at least not for the latest Python version). Just my 2 cents.

barneygale · August 19, 2023, 9:11pm

Thank you! There will definitely be a bunch of docs to write when we make pathlib.PathBase public. I’ve been sketching them out in my head for the last couple of years, but I’m sure I’ll need help them just right.

The ‘pathlib2’ package has been around for some time – it’s what many folks (including myself) used when maintaining Python 2/3 compatible codebases. Its trusted name is something of an asset, and the nature of the package (backport of latest pathlib) lines up. The name is imperfect but I reckon it’s a fine compromise.

barneygale · September 17, 2023, 5:48pm

September 2023 progress report

Massive thanks to @AA-Turner and @gst for their reviews of GH-106337, which adds pathlib._PathBase. I’m now pretty comfortable with the implementations of is_junction() and _scandir().

The last thing needing review is the _PathBase.resolve() method, which is a generic implementation of os.path.realpath() atop _PathBase.absolute(), stat() and readlink(). The implementation in the PR follows the POSIX path resolution algorithm, so if /link is a symlink to /home/barney, then /link/.. resolves to /home. I think this is sensible default behaviour, and at any rate it’s only relevant to users implementing symlink support in their PathBase subclass, which I expect to be uncommon.

For historical reasons, pathlib.Path.resolve() raises RuntimeError rather than OSError(ELOOP) when a symlink loop is encountered, and does this even when the strict keyword argument is set to False. os.path.realpath() does not suffer from this problem. I’ve logged an issue and opened a PR to fix it: GH-109192. The fix is a change of behaviour, but the alternative is that we double down on the RuntimeError behaviour and implement it in _PathBase.resolve() too; I’d like to avoid this if we can.

If anyone would like to help review either of the above PRs, I’d really appreciate it

When _PathBase lands, the todo list is:

Fix up _PathBase.__eq__() and friends
Add _PathBase.from_uri() classmethod (GH-107640)
Add glob.translate() and simplify some pathlib internals (GH-106703)
Backport everything into the pathlib2 PyPI package
Drop the underscore prefix, add documentation.

This won’t all land before 3.13 alpha 1, but it should still land in time for beta 1 (feature freeze)

Hope you’re not too miffed, bye-ee x

barneygale · September 23, 2023, 6:50pm

Q: What should pathlib.PathBase.pathmod be set to? This attribute controls basic path syntax:

Whether forward or backward slashes are used as separators
Whether drives are supported
Whether path comparisons are case-sensitive

Other classes have it set as follows:

Note:

os.path evaluates to posixpath or ntpath at runtime, depending on whether you’re on Windows
PurePath, PathBase and Path can’t be directly instantiated, but they can be subclassed.

In favour of os.path: it would match the superclass (PurePath) and subclass (Path), which seems easier to explain/justify.

In favour of posixpath: implementations of PathBase other than Path generally won’t change the path syntax based on the host OS. For example, an S3Path class would consistently use POSIX syntax (forward slashes, no support for drive letters, etc) even on Windows. Unwary devs may develop entirely on Mac or Linux, without even considering that their PathBase subclass will behave differently on Windows.

Thoughts?

TeamSpen210 · September 23, 2023, 8:53pm

Perhaps we could refuse to guess - define __init_subclass__() in PathBase, then examine the class hierarchy and where pathmod is defined. If the subclass does not inherit from Path, but would inherit the pathmod definition from PurePath raise an exception. Maybe even move this to PureParh, require it to be defined if you inherit there too - but that would be backwards incompatible so it’d need to be just a warning.

barneygale · October 6, 2023, 5:29pm

Just noting that I’ve started writing some documentation. There’s a preview here: pathlib — Object-oriented filesystem paths — Python 3.13.0a0 documentation

I’ve included documentation a class called PurePathBase that doesn’t exist yet. But I’m pretty sure we’ll need to add it to ensure PathBase doesn’t inherit duner methods like __hash__() and __eq__() from PurePath.

Those docs won’t land until everything else is in.

barneygale · October 6, 2023, 6:02pm

And in case anyone gets a laugh out of it, I was trying to come up with a simple example for the docs, and somehow wrote this monstrosity:

import pathlib

MAX = 100

class MultiplierPath(pathlib._PathBase):
    def iterdir(self):
        if len(self.parts) < 2:
            for i in range(MAX // 2):
                yield self / str(i)
        elif len(self.parts) == 2:
            a, b = self.parts
            yield self / str(int(a) * int(b))
        else:
            raise FileNotFoundError(str(self))

    def is_dir(self, *, follow_symlinks=True):
        return len(self.parts) <= 2


def is_prime(n):
    path = MultiplierPath()
    for result in path.glob(f'*/*/{n}'):
        if '1' not in result.parts:
            return False
    return True


for i in range(MAX):
    if is_prime(i):
        print(f"{i} is prime")

Has anyone globbed for primes before? I’m not going to use it in the docs!

barneygale · October 13, 2023, 6:35pm

On pathmod:

I think it’s possible to replace this class attribute, which can be set to os.path, posixpath or ntpath, and is a bit of a blunt instrument, with three attributes and a classmethod that would give users fine-grained control over path syntax:

sep and altsep – just like in os.path
is_case_sensitive or similar – boolean
split_root() – just like os.path.splitroot()

These methods, plus the existing is_absolute() and is_reserved() methods, are enough to specify all the differences between POSIX and Windows path syntax, and allow several variants.

The simplest default values of these in PurePathBase / PathBase might be:

class PurePathBase:
    sep = '/'
    altsep = None
    is_case_sensitive = True

    @classmethod
    def split_root(cls, path):
        return '', '', path

    def is_absolute(self):
        return False

    def is_reserved(self):
        return False


class PathBase(PurePathBase):
    ...

… which results in a simplified POSIX syntax by default: case sensitive, forward slashes as separators, and every path is relative because there’s no support for drives or roots.

Is this any better? I’m not sure. But I figured I should at least mention it.

barneygale · October 24, 2023, 5:43pm

October 2023 progress report

We now have pathlib._PathBase in the codebase! Thank you so much to @AA-Turner and @gst for the reviews of GH-106337.

I’ve logged a new issue covering making this class public:

github.com/python/cpython

User-defined virtual path objects

opened 03:16PM - 29 Sep 23 UTC

barneygale

type-feature topic-pathlib

## Proposal Add a public `pathlib.PathBase` class that can be subclassed by u…sers to implement "virtual paths", e.g. paths within `.zip` files or on FTP servers. ## Tasks - [x] Add private `_PathBase` class - #106337 - [ ] Merge `WalkTests` into `DummyPathTest` - #110308 - #110655 - [ ] Sort out `__hash__()`, `__eq__`, etc - we shouldn't inherit the `PurePath` implementation - #110670 - [ ] ... and `[is_]relative_to()`? Needs `__eq__()`. - Only needs a note in the docs - [ ] Figure out the default value for `pathmod`, if any - [ ] `stat()` return value? - [ ] Simplify/speed up path walking and globbing - #106703 - [x] Add `from_uri()` classmethod - #107640 - [x] Add `uri_scheme` attribute? - Deferred - will wait and see if a user requests this. - [ ] Deprecate `PurePath.as_uri()`? - #110312 - [ ] Backport everything into `pathlib2` - [ ] Add docs, drop underscore prefix - #110321 ## Previous Discussion https://discuss.python.org/t/make-pathlib-extensible/3428 ### Linked PRs * gh-110308 * gh-110312 * gh-110321 * gh-110412 * gh-110655 * gh-110670

Also, we have _PathBase.from_uri() and a corresponding concrete implementation in Path.from_uri(). Many thanks to Adam, @poehlmann and @edgarrmondragon for the reviews

Major remaining work:

Don’t inherit __hash__(), __eq__(), etc, from PurePath
- PR: GH-110670
Figure out the default value of pathmod, if any
- Fixing (1) will make the best solution clearer.
Simplify/speed up globbing
- PR: GH-106703
- This has been reviewed but is lacking a final sign-off. I might just merge it soon!
(Maybe) provide some sort of os.stat_result-like class that users can return from their stat() implementations

For Python 3.13 I’m aiming to get everything done except dropping the underscore prefix + adding docs. I suspect the classes may need to privately gestate for a further release before we make them public, but let’s see.

That’s all the news, cheers