Pathlib objects as glob template

Conchylicultor · September 30, 2024, 10:12am

Sometimes the user provides a glob template as a string, like --files=s3://home/*/documents/*.txt.

Currently, there’s no obvious way to expand such a template using the pathlib API (we cannot use glob.glob, as it does not support Google Cloud Storage or S3 buckets):

path = 's3://home/*/documents/*.txt'

path = s3_path.Path(path)
files = list(path.glob())  # Not working

Currently it requires an ugly hack, like:

Path(path.parts[0]).glob(os.fspath(Path(*path.parts[1:])))

It would be nice if the pathlib API had a standard way to use a path as a glob template.

MegaIng · September 30, 2024, 10:17am

I feel like the best solution is to add the ability for Path.glob to take in absolute patterns, locked behind a flag to prevent accidents.

petercordia · September 30, 2024, 10:17am

I’ve also had this situation, and ended up writing 3 different functions to process different versions of glob lookups all of which match this pattern.

In addition, in 80% of cases I actually want one of

sorted(Path(path.parts[0]).glob(os.fspath(Path(*path.parts[1:]))))
or
list(Path(path.parts[0]).glob(os.fspath(Path(*path.parts[1:]))))

barry-scott · September 30, 2024, 11:58am

pathlib does not know about S3, so why should its glob work for S3?

Conchylicultor · September 30, 2024, 12:16pm

Pathlib defines an API which has many implementations. The official “standard” one is the one in the standard library, but many other implementations exist which add s3, gcs, zipfile, git,…

To maintain consistency across those libraries, it’s important that the pathlib API supports the use cases of those implementations.

Also see: https://discuss.python.org/t/make-pathlib-extensible

chepner · September 30, 2024, 1:27pm

And the interface is fine. path = s3_path.Path(path) doesn’t (I assume) raise a problem, only the call to path.glob(). Whatever library you are using needs to be responsible for querying what objects are actually available in a particular bucket.

barneygale · September 30, 2024, 6:44pm

Could you retrieve a list of bucket keys (e.g. with ListObjectsV2) and then filter them through PurePosixPath.full_match()? (That method is new in 3.13.)
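A minimal sketch of that filtering step, for illustration only. The keys list is a hard-coded, hypothetical stand-in for whatever a bucket listing call would return, and key_matches and filter_keys are names invented here. On Python versions older than 3.13 the sketch falls back to a segment-by-segment fnmatch comparison, which is an assumption of this example and does not handle "**":

```python
import fnmatch
from pathlib import PurePosixPath


def key_matches(key, pattern):
    """Return True if the whole key matches the glob pattern."""
    path = PurePosixPath(key)
    if hasattr(path, "full_match"):
        # Python 3.13+: "*" never crosses a "/" separator.
        return path.full_match(pattern)
    # Fallback for older Pythons (assumption: no "**" in the pattern):
    # compare the key and the pattern one path segment at a time.
    key_parts = path.parts
    pattern_parts = PurePosixPath(pattern).parts
    return len(key_parts) == len(pattern_parts) and all(
        fnmatch.fnmatchcase(k, p) for k, p in zip(key_parts, pattern_parts)
    )


def filter_keys(keys, pattern):
    """Keep only the keys that match the pattern."""
    return [key for key in keys if key_matches(key, pattern)]


# Hypothetical bucket contents; a real application would list them remotely.
keys = [
    "home/alice/documents/notes.txt",
    "home/alice/documents/archive/old.txt",
    "home/bob/documents/todo.txt",
    "home/bob/pictures/cat.png",
]
matches = filter_keys(keys, "home/*/documents/*.txt")
```

Note that this approach lists every key before filtering, so it trades efficiency for simplicity.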

Conchylicultor · October 1, 2024, 8:34am

I’m not sure I understand how this solves the issue.

I simplified the use case, but in practice, the implementation should not depend on any specific backend (as the same code should work for s3, gcs, local paths,…)

I’m suggesting there should be an option to allow this:

path = epath.Path('gcs://home/*/documents/*.txt')  # Works with arbitrary paths
path.glob(is_template=True)

Currently, the way of implementing this is hacky and unnatural, as pointed out in my first message.

pf_moore · October 1, 2024, 9:49am

I don’t think it’s “hacky and unnatural”. To execute a glob search, you need to start from some root. Path(path.parts[0]).glob(os.fspath(Path(*path.parts[1:]))) does precisely that: it separates the root from the pattern and uses the pattern to glob against the root.

Certainly, helpers could be added which make the operation easier, but it’s not at all unnatural in my view. In actual fact, I find using a path object as a glob pattern in the first place to be the unnatural aspect - a path is a very different object than a glob pattern, and confusing the two feels like a bug.

Also, if we did allow Path.glob(), we’d have to come up with a meaning for Path("*://*/documents/*.txt"). And that’s a whole other problem.

barneygale · October 1, 2024, 3:28pm

Apologies, I misunderstood your message.

barneygale · October 1, 2024, 3:52pm

Does this argument apply also to Path("~/.ssh")? Arguably it’s another path-ish pattern that can be expanded into a “real” path with a specific method (Path.expanduser()).

A few more thoughts:

Supporting non-relative patterns means throwing away information from self. Folks already complain spiritedly when os.path.join('foo', '/bar') returns /bar, and we’d be doing the same thing with Path('foo').glob('/bar'). Perhaps we could allow non-relative patterns only when Path.parts is empty? I’m not so keen on the flag suggested by @MegaIng as it complicates the interface a little.

pathlib PurePath objects strip trailing slashes, which changes the meaning of patterns. I think this might be practically intractable.

Expanding self as a pattern would change long-standing behaviour, albeit rarely, as glob wildcards don’t often appear in real paths.

Users would be able to supply the pattern in two places (Path('*').glob() and Path().glob('*')) rather than one. We could deprecate the method argument, but it would be another change for existing users.
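The first two points can be checked against the standard library as it stands. This is only an illustrative sketch: it uses posixpath and PurePosixPath so the results do not depend on the platform, and the variable names are invented for the example.

```python
import posixpath
from pathlib import Path, PurePosixPath

# A later absolute segment silently replaces everything before it.
joined = posixpath.join("foo", "/bar")

# Joining pathlib objects discards the left-hand side in the same way.
pure_joined = PurePosixPath("foo") / "/bar"

# A trailing slash is not preserved once the string becomes a path object.
stripped = str(PurePosixPath("documents/"))

# Path.glob currently refuses anchored (non-relative) patterns.
try:
    list(Path("foo").glob("/bar"))
    absolute_pattern_rejected = False
except NotImplementedError:
    absolute_pattern_rejected = True
```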

jamestwebber · October 1, 2024, 3:59pm

I don’t think it applies there because ~ isn’t a wildcard, it’s a specific placeholder for the user directory. Whereas *:// reads as something like “search all(?) possible URIs” which I’m not sure is even defined.

barneygale · October 1, 2024, 4:07pm

Thanks. I don’t think my brain is working very well, as I’m still conflating different ideas in my previous post. Sorry for the noise.

pf_moore · October 1, 2024, 4:24pm

Yes, that was the point I was making.

I think the point here is that we’re talking about an application that accepts user input of a “file pattern” and wants to turn that into a list of files. Traditionally, this is done with glob.glob, which takes a string argument representing a pattern to match against the filesystem.

The pathlib.Path.glob method is different - it takes a string as a pattern in a similar way, but the pattern is required to be relative, and is matched against the contents of the path object in self.
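To make the contrast concrete, here is a small sketch on the local filesystem. The helper name compare_glob_styles and the file names are made up for the example; only glob.glob and Path.glob come from the standard library.

```python
import glob
import os
import tempfile
from pathlib import Path


def compare_glob_styles(directory):
    """Expand "*.txt" in a directory with glob.glob and with Path.glob."""
    # glob.glob: one string carries both the starting point and the wildcards.
    via_glob = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(directory, "*.txt"))
    )
    # Path.glob: the starting point lives in self, the pattern is relative.
    via_pathlib = sorted(p.name for p in Path(directory).glob("*.txt"))
    return via_glob, via_pathlib


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.log"):
        Path(tmp, name).write_text("")
    via_glob, via_pathlib = compare_glob_styles(tmp)
```

Both calls find the same files; the difference is only in where the root of the search is specified.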

The OP’s use case is the first situation, but they want the ability to match against non-filesystem patterns, which glob doesn’t support. The extensibility of pathlib is attractive here, but only if we can construct a suitable base pathlib subclass. The problem is that if we follow the logic of glob.glob, the pattern is general, and in theory the user could enter something that has a wildcard in the URI scheme part of the path. Clearly that makes no sense, and in particular it doesn’t allow us to know which pathlib subclass would apply. So in practice a fully general pattern string isn’t valid.

If we make the restriction that the “root” (however we choose to define that) of the pattern must be fixed (no wildcards) then it’s easy to implement the relevant search:

root, pattern = split_user_pattern(input_string)
list_of_paths = Path(root).glob(pattern)

Implementing split_user_pattern is left as an exercise for the reader. It depends heavily on the application and valid path provider types, but something like

import os
from pathlib import Path

def split_user_pattern(input_string):
    path = Path(input_string)
    # Unpack the remaining parts; Path() does not accept a tuple.
    return path.parts[0], os.fspath(Path(*path.parts[1:]))

isn’t an implausible implementation in the absence of any application-specific constraints.
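A runnable sketch of the same idea against the local filesystem, under some assumptions that are not part of the discussion above: the wildcard check on the root, the error messages, and the expand_user_pattern helper are all invented for illustration, and a non-local backend would substitute its own Path class.

```python
import os
import tempfile
from pathlib import Path

WILDCARD_CHARS = set("*?[")


def split_user_pattern(input_string):
    """Split a user pattern into a fixed root and a relative glob pattern."""
    root, *rest = Path(input_string).parts
    if WILDCARD_CHARS & set(root):
        # The root must be fixed, otherwise there is nowhere to start from.
        raise ValueError(f"wildcards are not allowed in the root: {root!r}")
    if not rest:
        raise ValueError("the pattern has no component after the root")
    return root, os.fspath(Path(*rest))


def expand_user_pattern(input_string):
    """Return the sorted paths matching a user-supplied pattern."""
    root, pattern = split_user_pattern(input_string)
    return sorted(Path(root).glob(pattern))


# Build a throwaway tree shaped like home/<user>/documents/<file>.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    for user in ("alice", "bob"):
        documents = base / "home" / user / "documents"
        documents.mkdir(parents=True)
        (documents / "notes.txt").write_text("")
        (documents / "image.png").write_text("")
    user_input = os.fspath(base / "home" / "*" / "documents" / "*.txt")
    found = expand_user_pattern(user_input)
    found_users = [p.parent.parent.name for p in found]
    found_names = [p.name for p in found]
```

Rejecting wildcards in the root reflects the restriction described above that the root of the pattern must be fixed.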

This is why I don’t see the one-line version of this as “hacky and unnatural”. The need to split and rejoin using Path.parts is annoying - a Path.without_anchor property would help a lot here - but that’s a very minor implementation detail, easily hidden in a helper function (as I did here).

jamestwebber · October 1, 2024, 4:38pm

edit: deleted because I misunderstood @barneygale earlier. My brain is also not working.