Yeah I guess in the past I’ve just used a regular old Path
and as long as you don’t try to access anything it works fine. It just feels like I’m doing something hacky.
This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).
So my vote is strongly on PathBase.from_uri()
.
While adding from_uri
, is it worth adding a standard-library UnsupportedURI
exception class? This would support the registration pattern: a single function supports many URI types, then iterates through registered AbstractPath
(or whatever) subclasses until one of their from_uri
s doesn’t requires this exception
This is a bad idea, since there is no reliable way to disambiguate between local path and URI. Every time I saw an API accept both paths and URIs it needed a lot of care to handle special cases (especially if you start thinking about alternate separators or Windows extended-length paths).
Maybe I’m not sure I understand the issue or which problem would from_uri
be solving.
We’ve been using epath
extensively in many projects (e.g. tensorflow_datasets
and others) and never encountered any issues. On the contrary, epath
allow to manipulate files without having to think about the underlying file system (local, GCS, Windows,…).
For us, for sure using .from_uri
would make usage more complicated. Currently, all remote and local path system are supported the same way. This mean, that just by using the right pathlib
API (e.g. epath
, upath
), our code is automatically compatible with remote file system !
For example, we use this very common file-pattern:
def load_img(path: os.PathLike | str):
path = epath.Path(path) # Normalize path
...
Without need for any special cases (URI vs local path), it allow our code to support all backends:
load_img('gs://local/file')
load_img('/local/file')
load_img('/<google-internal-file-system>/file')
load_img(pathlib.Path('/local/file'))
load_img(upath.Path('gs://file'))
Note the last example: Because epath
/ upath
returns the URI on os.fspath
, this allow cross-pathlib backend compatibilities for all URIs (even though upath
was developed completely independently, it is natively compatible with epath
, tf.io.gfile
). For example:
assert os.fspath(upath.UPath('/aaa/bbb')) == '/aaa/bbb'
assert os.fspath(upath.UPath('gs://aaa/bbb')) == 'gs://aaa/bbb'
path = epath.Path(upath.Path('gs://path')) # Works out of the box
path = upath.Path(epath.Path('gs://path')) # Works out of the box
tf.io.gfile.exists(upath.Path('s3://path')) # Works out of the box
This os.fspath
behavior might be semantically incorrect but very convenient. This is a case where practicality beat purity.
For us, this really simplify our live because we don’t need to care whether the path is local or remote. Just using the right pathlib API and everything will magically work. And the cross-pathlib backend compatibility is bonus as you can just propagate pathlib-like objects from one module to another, even if the 2 modules are using different pathlib backend.
Note that personally, I don’t really have strong opinion whether a .from_uri
is added to the API or not, but no matter what, epath
will still continue to accept URI in __init__
, just because it’s too convenient and the alternative would complexify everything for our code and our users. I hope I clarify my reasoning and motivation behind.
At least for file:
URIs, it’s not always possible to distinguish a URI from a file path. file:/foo
is both a valid URI (representing /foo
) and a valid path.
Returning a URI from __fspath__()
doesn’t seem convenient to me - it sounds dangerous! If a user runs os.makedirs(upath.Path('s3://path'))
do they get an s3:
folder in their working directory? They should get an exception, because S3 paths are not local paths and do not have a local filesystem representation, and therefore are not os.PathLike
.
Could you provide a etils.path_from_uri()
function that checks the scheme and defers to the from_uri()
method of an appropriate class? We could potentially add a PathBase.uri_scheme
attribute to make this easier, but adding a registration system to pathlib itself feels like a can of worms!
At least for
file:
URIs, it’s not always possible to distinguish a URI from a file path.file:/foo
is both a valid URI (representing/foo
) and a valid path.
I still don’t see concretely how this is a problem in practice. If the user want a local URI, they can still call path.as_uri()
. I don’t see which concrete problem users would encounter.
Returning a URI from
__fspath__()
doesn’t seem convenient to me - it sounds dangerous
I agree os.fspath
is sub-optimal, but for us, cross-compatibility with other API is required. Many users pass pathlib API to TensorFlow like tf.data.Dataset
, tf.io.gfile
,…:
path = epath.Path('g3://path')
tf.io.gfile.exists(path) # Should work
ds = tf.data.TFRecordDataset(path) # Should work
This should work out-of-the box for users. TF does not now anything about epath
. The only standard interface that exists to pass path is os.PathLike | str
. So epath
has to return gs://
URI in os.fspath
. This is the only standard way for TensorFlow to correctly infer the path.
If a user runs
os.makedirs(upath.Path('s3://path'))
do they get ans3:
folder in their working directory? They should get an exception.
I agree, but I also feel like os.makedirs('s3://path')
should raise an exception (independently of __fspath__
). It’s not the case currently but maybe it should ?
Could you provide a
etils.path_from_uri()
What would be the benefit vs having this in __init__
?
- This would only make the API more complicated (3 ways of creating a path vs 1 currently).
- The current code is meant to be compatible with remote files system by default (users cannot get it wrong). By adding
etils.path_from_uri()
, it’s very easy for users to keep callingepath.Path(path)
and this would introduce bugs in their code.
Why would it raise? It’s a perfectly valid path! The thing that should raise is os.fspath()
. By instead returning a URI from __fspath__()
you are handing users a downward-facing shotgun that is liable to go off whenever they use an API that calls os.fspath()
(with the exception of TensorFlow, apparently).
Scanning over pandas handling of gcs path it looks to make a similar assumption and expects gcs path (or s3 path/etc) to be returned from fspath. dask is another library that assumes similar and will coerce Path
with fspath and later expects gs prefix/uri to be given. So I don’t think tensorflow is special here and it’s common (not only choice) for fspath to be expected to be uri.
One library that is pick something closer to your idea of not returing uri is cloudpathlib. It has interesting choice that for remote paths like gcs/s3 instead of returns gs://path for __fspath__
it copies file locally and returns the local path instead. Which can work fine for reading, but unsure that makes much sense for any write api
Small update: I’ve logged an issue and PR for adding pathlib.Path.from_uri()
.
My main priority is still to add pathlib._PathBase
. I’ve highlighted three bits of code in the PR that would most benefit from review: implementations of _PathBase.is_junction()
, resolve()
and _scandir()
. Grateful for any reviews!
@barneygale Somewhat related: is it possible for PathBase
to differentiate between PathBase("")
and PathBase(".")
?
Currently both are the same data with PurePath
.
Please see the thread for details.
Thanks in advance!
August 2023 progress report
Not so much news over the last month. My main blocker remains GH-106337, which adds pathlib._PathBase
. I’ll start begging for reviewers in the usual places soon
I’ve opened GH-107320 to make Path.iterdir()
raise exceptions immediately, rather than on iteration. It turns out this is necessary to implement PathBase.glob()
and walk()
efficiently. I’ve also opened GH-107640, which adds a Path.from_uri()
method. Finally, I’m doing some on-off work on making glob()
and match()
faster and simpler – big thank you @jaraco for reviewing GH-106703.
Also, I should note that I’m planning to publish a PyPI package that backports Python 3.13+ pathlib for earlier versions of Python, so pathlib.PathBase
will be widely available soon after it lands in main
. pathlib2 might be that package.
Thanks for reading, may your filesystems be forever virtual!
While I’m not too familiar with the pathlib
internals, I’d gladly help reviewing docs.
Regarding the backport lib, the typing backport lib is called typing_extensions
, maybe it would sense to name the pathlib backport lib in a similar fashion. Calling it pathlib2
makes it sound like a replacement for pathlib
, which it isn’t afaict (at least not for the latest Python version). Just my 2 cents.
Thank you! There will definitely be a bunch of docs to write when we make pathlib.PathBase
public. I’ve been sketching them out in my head for the last couple of years, but I’m sure I’ll need help them just right.
The ‘pathlib2’ package has been around for some time – it’s what many folks (including myself) used when maintaining Python 2/3 compatible codebases. Its trusted name is something of an asset, and the nature of the package (backport of latest pathlib) lines up. The name is imperfect but I reckon it’s a fine compromise.
September 2023 progress report
Massive thanks to @AA-Turner and @gst for their reviews of GH-106337, which adds pathlib._PathBase
. I’m now pretty comfortable with the implementations of is_junction()
and _scandir()
.
The last thing needing review is the _PathBase.resolve()
method, which is a generic implementation of os.path.realpath()
atop _PathBase.absolute()
, stat()
and readlink()
. The implementation in the PR follows the POSIX path resolution algorithm, so if /link
is a symlink to /home/barney
, then /link/..
resolves to /home
. I think this is sensible default behaviour, and at any rate it’s only relevant to users implementing symlink support in their PathBase
subclass, which I expect to be uncommon.
For historical reasons, pathlib.Path.resolve()
raises RuntimeError
rather than OSError(ELOOP)
when a symlink loop is encountered, and does this even when the strict keyword argument is set to False
. os.path.realpath()
does not suffer from this problem. I’ve logged an issue and opened a PR to fix it: GH-109192. The fix is a change of behaviour, but the alternative is that we double down on the RuntimeError
behaviour and implement it in _PathBase.resolve()
too; I’d like to avoid this if we can.
If anyone would like to help review either of the above PRs, I’d really appreciate it
When _PathBase
lands, the todo list is:
- Fix up
_PathBase.__eq__()
and friends - Add
_PathBase.from_uri()
classmethod (GH-107640) - Add
glob.translate()
and simplify some pathlib internals (GH-106703) - Backport everything into the
pathlib2
PyPI package - Drop the underscore prefix, add documentation.
This won’t all land before 3.13 alpha 1, but it should still land in time for beta 1 (feature freeze)
Q: What should pathlib.PathBase.pathmod
be set to? This attribute controls basic path syntax:
- Whether forward or backward slashes are used as separators
- Whether drives are supported
- Whether path comparisons are case-sensitive
Other classes have it set as follows:
Note:
os.path
evaluates toposixpath
orntpath
at runtime, depending on whether you’re on WindowsPurePath
,PathBase
andPath
can’t be directly instantiated, but they can be subclassed.
In favour of os.path
: it would match the superclass (PurePath
) and subclass (Path
), which seems easier to explain/justify.
In favour of posixpath
: implementations of PathBase
other than Path
generally won’t change the path syntax based on the host OS. For example, an S3Path
class would consistently use POSIX syntax (forward slashes, no support for drive letters, etc) even on Windows. Unwary devs may develop entirely on Mac or Linux, without even considering that their PathBase
subclass will behave differently on Windows.
Thoughts?
Perhaps we could refuse to guess - define __init_subclass__()
in PathBase
, then examine the class hierarchy and where pathmod
is defined. If the subclass does not inherit from Path
, but would inherit the pathmod
definition from PurePath
raise an exception. Maybe even move this to PureParh
, require it to be defined if you inherit there too - but that would be backwards incompatible so it’d need to be just a warning.