Make pathlib extensible

I wouldn’t like that.
I found PurePosixPath pretty useful for generic path-like operations – URL fragments, archive contents, even nested dict access. The minimal spec (only / and \0 are special, leading / or // are more special) works great even outside Posix.
Spelling PurePosixPath as PurePath('foo', pathmod=posixpath) sounds like unnecessary delving into implementation details.
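To illustrate the kind of generic, non-filesystem use described above, here is a minimal sketch. The archive member name and the nested dict are made-up values, and `lookup` is a hypothetical helper, not part of pathlib.

```python
from pathlib import PurePosixPath

# A member name as it might appear inside an archive (made-up value).
member = PurePosixPath("pkg/data/config.json")

print(member.parts)                       # ('pkg', 'data', 'config.json')
print(member.parent)                      # pkg/data
print(member.suffix)                      # .json
print(member.with_name("defaults.json"))  # pkg/data/defaults.json


def lookup(tree: dict, path: PurePosixPath):
    """Hypothetical helper: walk a nested dict using path components as keys."""
    node = tree
    for key in path.parts:
        node = node[key]
    return node


tree = {"pkg": {"data": {"config.json": "{}"}}}
print(lookup(tree, member))  # {}
```

Nothing here touches a filesystem, so it behaves the same on Windows and POSIX hosts.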


Questions about churn aside, this is IMO equivalent in terms of “delving into implementation details”. It’s just more front-and-center in the name PurePosixPath. But as a beginner user, I’d certainly prefer not to have to think how my paths relate to posix (what’s that?), and minimize that exposure to a dim awareness (resp. short-and-sweet documentation) that there are different “path styles” between posix & windows.

Pulling on that thread a bit more, I don’t find pathmod= to be a good name for that kwarg, but PurePath('foo', style=posix|windows) looks like an appealing API to me.

Of course, the churn would be substantial, but if we are now able to envision / implement a way better long-term API, I don’t think we should forgo such improvements indefinitely (as long as we can provide users with an easy way to migrate).


I am unsure about the name VirtualPath.

In other contexts I have seen, vfs or virtual file systems are all about letting you type paths that look like regular filesystem paths, and some software does custom things to return metadata or file contents.
In the pathlib ecosystem, a TarPath class, or SshPath, or S3Path would be examples of virtual filesystems.
But their base class does not itself implement a virtual path.

Barney suggested PathBase when I commented this on the PR, which feels great and easy to understand to me!


Thank you for your feedback @encukou and @h-vetinari! The idea of adding a flavour argument to PurePath has been floating around in my head for years now, and I’m glad to finally be able to rule it out!

@merwok thanks, I’ve gone with PathBase in the PR! I think the docs (when we write them) could still mention virtual paths, right? e.g. “PathBase can be used to implement virtual path objects…”

July 2023 progress report

Thank you to all those who have been providing feedback on naming, hierarchies, etc. It’s so useful to bounce ideas off such talented and experienced devs!

As I mentioned in a previous post, I’ve put up a PR that adds a private _PathBase class:

That PR has been slimmed down: it originally added tarfile.TarPath too, but the expected behaviour of paths involving symlinks wasn’t clear, and so I’m going to work on TarPath in a PyPI package first.

When that PR lands, the remaining work is:

  • Add a public PurePath.pathmod class attribute (PR: GH-106533)
  • Figure out what to do with _PathBase.__hash__(), __eq__(), __lt__(), etc (any opinions?)
  • Make pathlib.PathBase public!

For the first time, I feel confident that this project will succeed. There are no architectural problems remaining in pathlib that would prevent it, nor any major decisions to be made (touch wood). It will be immediately useful upon release, and I think it could grow into one of Python’s best-loved features as third-party APIs begin to accept os.PathLike | pathlib.PathBase for path arguments. Eventually users should be able to do things like:

shutil.copytree(FTPPath(...), TarPath(...))
pandas.read_csv(S3Path(...))
image.save(TarPath(...))  # PIL

We’re doing for path objects what PEP 3116 and the io module did for file objects.

That’s it for now. Thanks again to everyone who has helped with this!


For my fellow visual learners, here’s a venn diagram showing os.PathLike and pathlib.PathBase:


The patch in review includes a PathBase.as_uri() method that raises UnsupportedOperation. I expect that some subclasses of PathBase will override that method, e.g. to return s3:// or ftp:// URIs.

Q: Should we add a symmetrical PathBase.from_uri() classmethod? This would provide an explicit means to construct a path object (and its backend) from a URI. For example, an FTPPath.from_uri() method could parse the host/port/user/passwd from the URI, construct an ftplib.FTP object, and then wrap it in an FTPPath object.
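A rough sketch of what such a classmethod might do, assuming a made-up `FTPLocation` class standing in for FTPPath. It only parses the URI with `urllib.parse.urlsplit`; a real implementation would presumably go on to build an ftplib.FTP object from these fields, which is left out here so that no connection is opened.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


@dataclass(frozen=True)
class FTPLocation:
    """Hypothetical stand-in for an FTPPath: connection details plus a path."""
    host: str
    port: int
    user: str
    passwd: str
    path: PurePosixPath

    @classmethod
    def from_uri(cls, uri: str) -> "FTPLocation":
        parts = urlsplit(uri)
        if parts.scheme != "ftp":
            raise ValueError(f"not an ftp: URI: {uri!r}")
        return cls(
            host=parts.hostname or "",
            port=parts.port or 21,  # 21 is the default FTP port
            user=unquote(parts.username or "anonymous"),
            passwd=unquote(parts.password or ""),
            path=PurePosixPath(unquote(parts.path) or "/"),
        )


loc = FTPLocation.from_uri("ftp://alice:s3cret@ftp.example.org:2121/pub/readme.txt")
print(loc.host, loc.port, loc.user, loc.path)
```

The point is that one URI fills several initialiser parameters at once, which is awkward to express through positional path segments.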

For Path.from_uri() (local paths), I have a local branch that handles RFC 8089 file: URIs, including the weird ones with 4 or 5 leading slashes, such as those produced by urllib.request.pathname2url().

I don’t think there should be a PathBase.from_uri(). Instead I feel the subclasses should support URIs directly in their __init__ function. This is already the case for some pathlib implementations:

from etils import epath
import upath

epath.Path('gs://xxx/yyy')
upath.UPath('s3://test_bucket')

An additional method would add redundancy/confusion and feel less natural I think. But that’s just my opinion.

Thanks! Assuming upath.UPath('s3:...') delegates to an S3Path class, how should users call its initialiser? So far I’ve been gunning for something like this:

client = boto3.client('s3')
path = S3Path('downloads', 'foo.tar.gz', client=client, bucket='foo')

The positional arguments are specified just like in PurePath and Path - a list of path segments to join. The keyword arguments add to the existing interface, rather than replacing it.
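As a sketch of that initialiser shape (positional segments plus keyword-only backend arguments), here is a toy class built by composition. `S3PathSketch` is purely illustrative and is not the API of pathlib or any S3 library; the `client` is a placeholder object rather than a real boto3 client.

```python
from pathlib import PurePosixPath


class S3PathSketch:
    """Hypothetical illustration only; not a real library class."""

    def __init__(self, *segments, client=None, bucket):
        self._path = PurePosixPath(*segments)  # segments join as in PurePath
        self.client = client                   # backend handle; unused here
        self.bucket = bucket

    def __truediv__(self, other):
        # Joining keeps the same client and bucket.
        return type(self)(self._path, other,
                          client=self.client, bucket=self.bucket)

    def __str__(self):
        return str(self._path)


client = object()  # placeholder standing in for boto3.client('s3')
path = S3PathSketch("downloads", "foo.tar.gz", client=client, bucket="foo")
child = S3PathSketch("downloads", client=client, bucket="foo") / "foo.tar.gz"
print(path, child, child.bucket)
```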

I’m not sure how to add URIs into this mix. In some cases you can distinguish URIs from file paths, but not always* and so positional arguments don’t seem right. A uri keyword argument might work, but it makes other arguments redundant and complicates the interface IMO. This is why I still lean towards a from_uri() classmethod, as the URI may be used to fill several initialiser parameters. What do you think?

(*) for example, file:/etc/hosts is both a valid file URI and a valid relative POSIX path
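The ambiguity can be checked with the standard library alone. This small sketch parses the same string once as a POSIX path and once as a URI:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

text = "file:/etc/hosts"

# Read as a POSIX path: relative, with a first component named "file:".
as_path = PurePosixPath(text)
print(as_path.is_absolute())  # False
print(as_path.parts)          # ('file:', 'etc', 'hosts')

# Read as a URI: scheme "file", path "/etc/hosts".
as_uri = urlsplit(text)
print(as_uri.scheme, as_uri.path)  # file /etc/hosts
```

Both readings are well-formed, so a constructor that accepts either has to guess.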

To be honest I didn’t know that was a thing, I almost always construct a Path with a single argument, and add on with / if needed.

I think it would be nice if the client wasn’t needed for s3 or gcs paths to work, since you don’t necessarily want to interact with your cloud storage when you’re working with paths. e.g. if you’re just formatting some metadata for documentation, or something.

Would PurePosixPath work? It’s designed not to perform any (virtual) filesystem access.

Yeah I guess in the past I’ve just used a regular old Path and as long as you don’t try to access anything it works fine. It just feels like I’m doing something hacky.

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

So my vote is strongly on PathBase.from_uri().


While adding from_uri, is it worth adding a standard-library UnsupportedURI exception class? This would support the registration pattern: a single function supports many URI types, then iterates through registered AbstractPath (or whatever) subclasses until one of their from_uris doesn’t raise this exception.
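A minimal sketch of that registration pattern, assuming everything in it: `UnsupportedURI` does not exist in the standard library, and the two path classes and the `first_supported` function are invented for illustration.

```python
class UnsupportedURI(ValueError):
    """Hypothetical exception; nothing like this exists in the stdlib."""


class _PrefixPath:
    """Toy base class: accepts URIs that start with a fixed prefix."""
    prefix = ""

    def __init__(self, uri):
        self.uri = uri

    @classmethod
    def from_uri(cls, uri):
        if not uri.startswith(cls.prefix):
            raise UnsupportedURI(uri)
        return cls(uri)


class FakeS3Path(_PrefixPath):
    prefix = "s3://"


class FakeFTPPath(_PrefixPath):
    prefix = "ftp://"


REGISTRY = [FakeS3Path, FakeFTPPath]


def first_supported(uri):
    """Try each registered class until one accepts the URI."""
    for cls in REGISTRY:
        try:
            return cls.from_uri(uri)
        except UnsupportedURI:
            continue
    raise UnsupportedURI(uri)


print(type(first_supported("ftp://example.org/pub")).__name__)  # FakeFTPPath
```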

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

Maybe I don’t understand the issue, or which problem from_uri would be solving.

We’ve been using epath extensively in many projects (e.g. tensorflow_datasets and others) and have never encountered any issues. On the contrary, epath allows us to manipulate files without having to think about the underlying file system (local, GCS, Windows, …).

For us, using .from_uri would certainly make usage more complicated. Currently, all remote and local file systems are supported the same way. This means that just by using the right pathlib API (e.g. epath, upath), our code is automatically compatible with remote file systems!

For example, we use this very common file-pattern:

def load_img(path: os.PathLike | str):
  path = epath.Path(path)  # Normalize path
  ...

Without any special-casing (URI vs local path), this lets our code support all backends:

load_img('gs://local/file')
load_img('/local/file')
load_img('/<google-internal-file-system>/file')
load_img(pathlib.Path('/local/file'))
load_img(upath.UPath('gs://file'))

Note the last example: because epath / upath return the URI from os.fspath, URIs are compatible across pathlib backends (even though upath was developed completely independently, it is natively compatible with epath and tf.io.gfile). For example:

assert os.fspath(upath.UPath('/aaa/bbb')) == '/aaa/bbb'
assert os.fspath(upath.UPath('gs://aaa/bbb')) == 'gs://aaa/bbb'

path = epath.Path(upath.UPath('gs://path'))  # Works out of the box

path = upath.UPath(epath.Path('gs://path'))  # Works out of the box

tf.io.gfile.exists(upath.UPath('s3://path'))  # Works out of the box

This os.fspath behavior might be semantically incorrect, but it is very convenient. This is a case where practicality beats purity.

For us, this really simplifies our lives because we don’t need to care whether the path is local or remote. Just use the right pathlib API and everything works. The cross-backend compatibility is a bonus, as you can pass pathlib-like objects from one module to another even if the two modules use different pathlib backends.

Note that personally, I don’t have a strong opinion on whether a .from_uri is added to the API or not, but no matter what, epath will continue to accept URIs in __init__, because it’s too convenient and the alternative would complicate everything for our code and our users. I hope this clarifies my reasoning and motivation.

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous! If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception, because S3 paths are not local paths and do not have a local filesystem representation, and therefore are not os.PathLike.
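The hazard is easy to reproduce. In this sketch, `LeakyS3Path` is a made-up class whose `__fspath__()` returns a URI; the demo runs inside a temporary directory so nothing is left behind. The result noted in the comment assumes a POSIX system (on Windows, a name containing a colon is handled differently).

```python
import os
import tempfile


class LeakyS3Path:
    """Hypothetical path class whose __fspath__() returns a URI."""

    def __init__(self, uri):
        self.uri = uri

    def __fspath__(self):
        return self.uri


def demo():
    """Run os.makedirs() in a scratch directory and list what appeared."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as scratch:
        os.chdir(scratch)
        try:
            os.makedirs(LeakyS3Path("s3://bucket/key"))
            return sorted(os.listdir(scratch))
        finally:
            os.chdir(old_cwd)  # restore cwd before the scratch dir is removed


print(demo())  # on Linux or macOS: ['s3:']
```

No exception is raised; the URI is silently treated as a relative local path.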

Could you provide a etils.path_from_uri() function that checks the scheme and defers to the from_uri() method of an appropriate class? We could potentially add a PathBase.uri_scheme attribute to make this easier, but adding a registration system to pathlib itself feels like a can of worms!
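Such a dispatch function could look roughly like this. All names are hypothetical: `uri_scheme` is only the attribute floated above, and the two classes are toy stand-ins for real backends.

```python
from urllib.parse import urlsplit


class SketchPath:
    uri_scheme = None  # hypothetical attribute, as floated above

    def __init__(self, uri):
        self.uri = uri

    @classmethod
    def from_uri(cls, uri):
        return cls(uri)


class S3Sketch(SketchPath):
    uri_scheme = "s3"


class GCSSketch(SketchPath):
    uri_scheme = "gs"


def path_from_uri(uri, classes=(S3Sketch, GCSSketch)):
    """Pick the class whose uri_scheme matches, then defer to its from_uri()."""
    scheme = urlsplit(uri).scheme
    for cls in classes:
        if cls.uri_scheme == scheme:
            return cls.from_uri(uri)
    raise ValueError(f"no path class handles {scheme!r} URIs")


print(type(path_from_uri("gs://bucket/data.csv")).__name__)  # GCSSketch
```

The list of candidate classes lives in the third-party function, so pathlib itself needs no registry.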

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

I still don’t see how this is a problem in practice. If the user wants a local URI, they can still call path.as_uri(). I don’t see which concrete problem users would encounter.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous

I agree os.fspath is sub-optimal, but for us, cross-compatibility with other APIs is required. Many users pass pathlib objects to TensorFlow APIs such as tf.data.Dataset, tf.io.gfile, …:

path = epath.Path('gs://path')

tf.io.gfile.exists(path)  # Should work
ds = tf.data.TFRecordDataset(path)  # Should work

This should work out of the box for users. TF does not know anything about epath. The only standard interface that exists for passing a path is os.PathLike | str. So epath has to return the gs:// URI from os.fspath. This is the only standard way for TensorFlow to correctly infer the path.

If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception.

I agree, but I also feel like os.makedirs('s3://path') should raise an exception (independently of __fspath__). It’s not the case currently, but maybe it should?

Could you provide a etils.path_from_uri()

What would be the benefit vs having this in __init__?

  • This would only make the API more complicated (3 ways of creating a path vs 1 currently).
  • The current code is meant to be compatible with remote file systems by default (users cannot get it wrong). If we added etils.path_from_uri(), it would be very easy for users to keep calling epath.Path(path), and this would introduce bugs in their code.

Why would it raise? It’s a perfectly valid path! The thing that should raise is os.fspath(). By instead returning a URI from __fspath__() you are handing users a downward-facing shotgun that is liable to go off whenever they use an API that calls os.fspath() (with the exception of TensorFlow, apparently).


Scanning over pandas’ handling of GCS paths, it looks to make a similar assumption and expects the GCS path (or S3 path, etc.) to be returned from fspath. dask is another library that assumes something similar: it coerces a Path with fspath and later expects the gs:// prefix/URI to be present. So I don’t think TensorFlow is special here; it’s common (though not the only choice) for fspath to be expected to return a URI.

One library that picks something closer to your idea of not returning a URI is cloudpathlib. It makes the interesting choice that, for remote paths like GCS/S3, instead of returning gs://path from __fspath__ it copies the file locally and returns the local path. That can work fine for reading, but I’m unsure it makes much sense for any write API.


Small update: I’ve logged an issue and PR for adding pathlib.Path.from_uri().

My main priority is still to add pathlib._PathBase. I’ve highlighted three bits of code in the PR that would most benefit from review: implementations of _PathBase.is_junction(), resolve() and _scandir(). Grateful for any reviews!
