Make pathlib extensible

Yeah I guess in the past I’ve just used a regular old Path and as long as you don’t try to access anything it works fine. It just feels like I’m doing something hacky.

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

So my vote is strongly on PathBase.from_uri().

8 Likes

While adding from_uri, is it worth adding a standard-library UnsupportedURI exception class? This would support the registration pattern: a single function supports many URI types, then iterates through registered AbstractPath (or whatever) subclasses until one of their from_uris doesn’t requires this exception

This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).

Maybe I’m not sure I understand the issue or which problem would from_uri be solving.

We’ve been using epath extensively in many projects (e.g. tensorflow_datasets and others) and never encountered any issues. On the contrary, epath allow to manipulate files without having to think about the underlying file system (local, GCS, Windows,…).

For us, for sure using .from_uri would make usage more complicated. Currently, all remote and local path system are supported the same way. This mean, that just by using the right pathlib API (e.g. epath, upath), our code is automatically compatible with remote file system !

For example, we use this very common file-pattern:

def load_img(path: os.PathLike | str):
  path = epath.Path(path)  # Normalize path
  ...

Without need for any special cases (URI vs local path), it allow our code to support all backends:

load_img('gs://local/file')
load_img('/local/file')
load_img('/<google-internal-file-system>/file')
load_img(pathlib.Path('/local/file'))
load_img(upath.Path('gs://file'))

Note the last example: Because epath / upath returns the URI on os.fspath, this allow cross-pathlib backend compatibilities for all URIs (even though upath was developed completely independently, it is natively compatible with epath, tf.io.gfile). For example:

assert os.fspath(upath.UPath('/aaa/bbb')) == '/aaa/bbb'
assert os.fspath(upath.UPath('gs://aaa/bbb')) == 'gs://aaa/bbb'

path = epath.Path(upath.Path('gs://path'))  # Works out of the box

path = upath.Path(epath.Path('gs://path'))  # Works out of the box

tf.io.gfile.exists(upath.Path('s3://path'))  # Works out of the box

This os.fspath behavior might be semantically incorrect but very convenient. This is a case where practicality beat purity.

For us, this really simplify our live because we don’t need to care whether the path is local or remote. Just using the right pathlib API and everything will magically work. And the cross-pathlib backend compatibility is bonus as you can just propagate pathlib-like objects from one module to another, even if the 2 modules are using different pathlib backend.

Note that personally, I don’t really have strong opinion whether a .from_uri is added to the API or not, but no matter what, epath will still continue to accept URI in __init__, just because it’s too convenient and the alternative would complexify everything for our code and our users. I hope I clarify my reasoning and motivation behind.

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous! If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception, because S3 paths are not local paths and do not have a local filesystem representation, and therefore are not os.PathLike.

Could you provide a etils.path_from_uri() function that checks the scheme and defers to the from_uri() method of an appropriate class? We could potentially add a PathBase.uri_scheme attribute to make this easier, but adding a registration system to pathlib itself feels like a can of worms!

At least for file: URIs, it’s not always possible to distinguish a URI from a file path. file:/foo is both a valid URI (representing /foo) and a valid path.

I still don’t see concretely how this is a problem in practice. If the user want a local URI, they can still call path.as_uri(). I don’t see which concrete problem users would encounter.

Returning a URI from __fspath__() doesn’t seem convenient to me - it sounds dangerous

I agree os.fspath is sub-optimal, but for us, cross-compatibility with other API is required. Many users pass pathlib API to TensorFlow like tf.data.Dataset, tf.io.gfile,…:

path = epath.Path('g3://path')

tf.io.gfile.exists(path)  # Should work
ds = tf.data.TFRecordDataset(path)  # Should work

This should work out-of-the box for users. TF does not now anything about epath. The only standard interface that exists to pass path is os.PathLike | str. So epath has to return gs:// URI in os.fspath. This is the only standard way for TensorFlow to correctly infer the path.

If a user runs os.makedirs(upath.Path('s3://path')) do they get an s3: folder in their working directory? They should get an exception.

I agree, but I also feel like os.makedirs('s3://path') should raise an exception (independently of __fspath__). It’s not the case currently but maybe it should ?

Could you provide a etils.path_from_uri()

What would be the benefit vs having this in __init__ ?

  • This would only make the API more complicated (3 ways of creating a path vs 1 currently).
  • The current code is meant to be compatible with remote files system by default (users cannot get it wrong). By adding etils.path_from_uri(), it’s very easy for users to keep calling epath.Path(path) and this would introduce bugs in their code.

Why would it raise? It’s a perfectly valid path! The thing that should raise is os.fspath(). By instead returning a URI from __fspath__() you are handing users a downward-facing shotgun that is liable to go off whenever they use an API that calls os.fspath() (with the exception of TensorFlow, apparently).

1 Like

Scanning over pandas handling of gcs path it looks to make a similar assumption and expects gcs path (or s3 path/etc) to be returned from fspath. dask is another library that assumes similar and will coerce Path with fspath and later expects gs prefix/uri to be given. So I don’t think tensorflow is special here and it’s common (not only choice) for fspath to be expected to be uri.

One library that is pick something closer to your idea of not returing uri is cloudpathlib. It has interesting choice that for remote paths like gcs/s3 instead of returns gs://path for __fspath__ it copies file locally and returns the local path instead. Which can work fine for reading, but unsure that makes much sense for any write api

2 Likes

Small update: I’ve logged an issue and PR for adding pathlib.Path.from_uri().

My main priority is still to add pathlib._PathBase. I’ve highlighted three bits of code in the PR that would most benefit from review: implementations of _PathBase.is_junction(), resolve() and _scandir(). Grateful for any reviews! :slight_smile:

2 Likes

@barneygale Somewhat related: is it possible for PathBase to differentiate between PathBase("") and PathBase(".")?
Currently both are the same data with PurePath.
Please see the thread for details.

Thanks in advance!

:sparkles: August 2023 progress report :sparkles:

Not so much news over the last month. My main blocker remains GH-106337, which adds pathlib._PathBase. I’ll start begging for reviewers in the usual places soon :slight_smile:

I’ve opened GH-107320 to make Path.iterdir() raise exceptions immediately, rather than on iteration. It turns out this is necessary to implement PathBase.glob() and walk() efficiently. I’ve also opened GH-107640, which adds a Path.from_uri() method. Finally, I’m doing some on-off work on making glob() and match() faster and simpler – big thank you @jaraco for reviewing GH-106703.

Also, I should note that I’m planning to publish a PyPI package that backports Python 3.13+ pathlib for earlier versions of Python, so pathlib.PathBase will be widely available soon after it lands in main :crossed_fingers:. pathlib2 might be that package.

Thanks for reading, may your filesystems be forever virtual!

9 Likes

While I’m not too familiar with the pathlib internals, I’d gladly help reviewing docs.

Regarding the backport lib, the typing backport lib is called typing_extensions, maybe it would sense to name the pathlib backport lib in a similar fashion. Calling it pathlib2 makes it sound like a replacement for pathlib, which it isn’t afaict (at least not for the latest Python version). Just my 2 cents.

1 Like

Thank you! There will definitely be a bunch of docs to write when we make pathlib.PathBase public. I’ve been sketching them out in my head for the last couple of years, but I’m sure I’ll need help them just right.

The ‘pathlib2’ package has been around for some time – it’s what many folks (including myself) used when maintaining Python 2/3 compatible codebases. Its trusted name is something of an asset, and the nature of the package (backport of latest pathlib) lines up. The name is imperfect but I reckon it’s a fine compromise. :sweat_smile:

2 Likes

:sparkles: September 2023 progress report :sparkles:

Massive thanks to @AA-Turner and @gst for their reviews of GH-106337, which adds pathlib._PathBase. I’m now pretty comfortable with the implementations of is_junction() and _scandir().

The last thing needing review is the _PathBase.resolve() method, which is a generic implementation of os.path.realpath() atop _PathBase.absolute(), stat() and readlink(). The implementation in the PR follows the POSIX path resolution algorithm, so if /link is a symlink to /home/barney, then /link/.. resolves to /home. I think this is sensible default behaviour, and at any rate it’s only relevant to users implementing symlink support in their PathBase subclass, which I expect to be uncommon.

For historical reasons, pathlib.Path.resolve() raises RuntimeError rather than OSError(ELOOP) when a symlink loop is encountered, and does this even when the strict keyword argument is set to False. os.path.realpath() does not suffer from this problem. I’ve logged an issue and opened a PR to fix it: GH-109192. The fix is a change of behaviour, but the alternative is that we double down on the RuntimeError behaviour and implement it in _PathBase.resolve() too; I’d like to avoid this if we can.

If anyone would like to help review either of the above PRs, I’d really appreciate it :pray:

When _PathBase lands, the todo list is:

  • Fix up _PathBase.__eq__() and friends
  • Add _PathBase.from_uri() classmethod (GH-107640)
  • Add glob.translate() and simplify some pathlib internals (GH-106703)
  • Backport everything into the pathlib2 PyPI package
  • Drop the underscore prefix, add documentation.

This won’t all land before 3.13 alpha 1, but it should still land in time for beta 1 (feature freeze)

Hope you’re not too miffed, bye-ee x

8 Likes

Q: What should pathlib.PathBase.pathmod be set to? This attribute controls basic path syntax:

  • Whether forward or backward slashes are used as separators
  • Whether drives are supported
  • Whether path comparisons are case-sensitive

Other classes have it set as follows:

Note:

  • os.path evaluates to posixpath or ntpath at runtime, depending on whether you’re on Windows
  • PurePath, PathBase and Path can’t be directly instantiated, but they can be subclassed.

In favour of os.path: it would match the superclass (PurePath) and subclass (Path), which seems easier to explain/justify.

In favour of posixpath: implementations of PathBase other than Path generally won’t change the path syntax based on the host OS. For example, an S3Path class would consistently use POSIX syntax (forward slashes, no support for drive letters, etc) even on Windows. Unwary devs may develop entirely on Mac or Linux, without even considering that their PathBase subclass will behave differently on Windows.

Thoughts?

Perhaps we could refuse to guess - define __init_subclass__() in PathBase, then examine the class hierarchy and where pathmod is defined. If the subclass does not inherit from Path, but would inherit the pathmod definition from PurePath raise an exception. Maybe even move this to PureParh, require it to be defined if you inherit there too - but that would be backwards incompatible so it’d need to be just a warning.

4 Likes