Enrich pathlib with a PureURLPath

Julien00859 · July 9, 2021, 9:40am

Hello,

I’ve been using the new pathlib library a lot and I would like to know whether it would be possible to use a similar API to manipulate URL paths, such as the one returned by urllib.parse.urlparse(...).path.

After a quick glance at the current PurePath API, there are several attributes and methods that would be useful when manipulating URLs: parent, parents, name, suffix, suffixes, stem, joinpath(), match(), with_name(), with_stem(), with_suffix(). In the application I’m coding right now, I would have a direct use for parts, suffixes and match().
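To illustrate, here is a minimal sketch of what can already be done today by wrapping the URL path in a PurePosixPath. The URL is a made-up example, and this is only an approximation of what a dedicated PureURLPath might offer:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical URL, used only for illustration.
url = "https://example.com/static/archive.tar.gz?download=1"

# urlparse() separates the path from the scheme, host and query string.
path = PurePosixPath(urlparse(url).path)

print(path.parts)              # ('/', 'static', 'archive.tar.gz')
print(path.suffixes)           # ['.tar', '.gz']
print(path.match("*.tar.gz"))  # True
print(path.parent)             # /static
```

Note that parts reports a root ('/') as its first element, which is a file-system concept rather than a URL one.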

Reference: RFC 3986, Uniform Resource Identifier (URI): Generic Syntax

Regards,
Julien

storchaka · July 9, 2021, 5:52pm

How would it differ from PurePosixPath?

domdfcoding · July 10, 2021, 8:45am

There are some third-party libraries that I think provide what you’re looking for:

In addition to the pathlib methods, these expose the other attributes of urlparse – the scheme, query, fragment etc.

[1] disclaimer: I am the maintainer

Julien00859 · July 10, 2021, 3:58pm

A URL has no drive, no root and no anchor. The RFC does not give successive slashes any special meaning (so they should not be modified), while POSIX does (and merges them). By the RFC, URL paths are basically just segments separated by slashes, with no additional meaning.

I see in the source code of apeye that it simply overrides PurePosixPath to raise NotImplementedError for the few attributes and methods a URL does not have (drive and friends). Using your library, the path “/foo//bar/” becomes “/foo/bar” while it should stay “/foo//bar/”. This might seem very specific, but I have a use case where I have double slashes in a URL path.
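The behaviour can be reproduced with the standard library alone. This is a small sketch using a made-up URL. It shows PurePosixPath normalising the path, next to a plain str.split("/"), which keeps every segment, including the empty ones:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical URL with a double slash and a trailing slash.
raw = urlparse("https://example.com/foo//bar/").path
print(raw)                       # /foo//bar/

# PurePosixPath merges the double slash and drops the trailing one.
print(str(PurePosixPath(raw)))   # /foo/bar

# Splitting on "/" keeps the empty segments, so the path round-trips.
segments = raw.split("/")
print(segments)                  # ['', 'foo', '', 'bar', '']
print("/".join(segments) == raw) # True
```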

Our internal policy makes it hard to add new third-party libraries as requirements of our application, which is why I’m looking for a standard solution.

apalala · July 11, 2021, 10:49am

A policy that requires changes to the Python standard library must be very difficult to put into practice.

Python committers tend to reject changes to stdlib if third-party libraries cover the functionality. Among the reasons:

more code to maintain
the new library gets bound to the Python release cycle, while it can have its own cycle if it remains third-party

pf_moore · July 11, 2021, 11:43am

You could vendor in whatever 3rd party library you choose (i.e., copy the library code into your application as a private module). That might help if the policy is to not have dependencies that need to be installed alongside your application. If the policy is to not use any code from the internet either by copying or as a dependency, then that policy is pretty draconian, and I doubt you have any real option apart from writing your own.

Adding stdlib functionality for this would mean you’d have to wait until Python 3.11 came out at the earliest (so that’s late 2022) and not support using your application with older versions of Python. I can’t say whether that’s a realistic option for your application, but from an outsider’s perspective, it seems unlikely.

Julien00859 · July 12, 2021, 12:58pm

I understand your point; installing an existing package from PyPI is indeed the best solution. I should have searched the index before opening this post. Thank you for your answers.

uranusjr · July 12, 2021, 9:36pm

Since it has not been mentioned, there is also yarl, which is maintained by the same people behind aiohttp etc.

apeye looks very nice though and I like how URL.path is also structured, not a plain str like other libs. I’m going to try it out in my own projects.

barneygale · July 16, 2021, 1:11am

If this goes into the standard library it would need to inherit from at least PurePath, to avoid any more things which mimic the (totally informal) pathlib interface.

And that can’t work at present. At its deepest level pathlib assumes that foo//bar and foo/./bar can be simplified to foo/bar, and that logic doesn’t apply for URL path segments.
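A quick check of that assumption, as a sketch against the current PurePosixPath (the variable names are only for illustration):

```python
from pathlib import PurePosixPath

plain = PurePosixPath("foo/bar")

# Empty segments and "." segments are dropped when the path is parsed.
double_slash_equal = PurePosixPath("foo//bar") == plain
dot_equal = PurePosixPath("foo/./bar") == plain
print(double_slash_equal, dot_equal)      # True True

# ".." segments, on the other hand, are kept as they are.
dotdot_parts = PurePosixPath("foo/../bar").parts
print(dotdot_parts)                       # ('foo', '..', 'bar')
```

For a URL, all three spellings may identify different resources, so a URL path class would need to keep them distinct.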

I have ideas on how to control this restriction in subclasses, but there’s a fair amount of intervening work needed, and most pathlib PRs take several months to land.

toonarmycaptain · July 16, 2021, 1:46pm

One possible advantage of this would be the potential for with open(url): to operate in the same way as with open(file): which would be nice for operating on documents/images from remote urls, just by loading them into memory.

Of course, there’s a lot of under the hood magic to happen for that to be possible, but places where paths are accepted also taking urls would be a first step.
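For comparison, here is roughly what loading a remote document into memory looks like today with urllib.request.urlopen(). This is a minimal sketch: fetch_into_memory is a hypothetical helper name, and the data: URL is only there so the example runs without network access.

```python
import io
from urllib.request import urlopen


def fetch_into_memory(url: str) -> io.BytesIO:
    """Read the whole resource behind `url` into an in-memory binary file."""
    # urlopen() returns a response object usable as a context manager.
    with urlopen(url) as response:
        return io.BytesIO(response.read())


# A data: URL is handled by urllib without any network request.
with fetch_into_memory("data:text/plain,hello") as document:
    content = document.read()

print(content)  # b'hello'
```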

Julien00859 · July 19, 2021, 7:51am

Hello David, the subject of re-using open() to download documents from the internet has already been brought up in another discussion (or PEP?) and was rejected. From what I remember, the main reason was that we don’t want a PHP-like open() function that works with any kind of resource, but rather to keep it dedicated to the file system. What would open('https://example.com', 'w') do?