A urllib.parse.URL class

apalala · May 31, 2019, 2:50pm

As it stands with urllib.parse, even a small change to a URL takes several lines of code that bear little resemblance to the change itself, because the parsing and unparsing dominate.
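
To illustrate the point, here is a minimal sketch of changing one query parameter with urllib.parse as it exists today. The helper name `set_query_param` and the example URL are made up for illustration; note that going through a dict drops repeated query keys.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url: str, key: str, value: str) -> str:
    """Return a copy of url with one query parameter set (hypothetical helper)."""
    parts = urlsplit(url)                    # parse into a SplitResult
    query = dict(parse_qsl(parts.query))     # decode the query string (repeated keys collapse)
    query[key] = value                       # the only line that expresses the actual change
    new_parts = parts._replace(query=urlencode(query))  # re-encode the query
    return urlunsplit(new_parts)             # unparse back into a string


print(set_query_param("https://example.com/search?q=python&page=1", "page", "2"))
```

Only one of the five lines in the function body says what is being changed; the rest is parsing and unparsing.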

furl is one approach to the design of a URL class that we could improve on and adapt for inclusion in the standard library.

mjpieters · May 31, 2019, 5:10pm

Unfortunately, furl objects are mutable, so personally I prefer yarl, which also actually names its type URL.

What kinds of improvements did you envision for furl?

apalala · May 31, 2019, 5:16pm

I didn’t mean to promote furl (nor yarl), but to take the ideas and include a URL class as part of urllib.

Note that furl (I don’t know about yarl) states that it supports base_url / path_component.

I should have given more arguments for the proposal for URL.

Basically, newcomers to Web programming will almost always do “string arithmetic” (split(), join(), +, etc.) on URLs, and that is always risky, and often wrong.
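
A short sketch of how “string arithmetic” goes wrong, compared with the existing urllib.parse helpers. The base URL and the `user_input` value are invented for the example.

```python
from urllib.parse import urlencode, urljoin

base = "https://example.com/api"

# Plain concatenation silently glues two path segments together.
naive_path = base + "users"

# Unescaped input can smuggle an extra query parameter into the URL.
user_input = "a&admin=1"
naive_query = base + "?q=" + user_input

# urljoin applies the relative-reference rules: without a trailing slash
# on the base, the last path segment is replaced rather than extended.
replaced = urljoin(base, "users")
appended = urljoin(base + "/", "users")

# urlencode percent-encodes the value, so it stays a single parameter.
safe_query = base + "?" + urlencode({"q": user_input})

for label, url in [
    ("naive path", naive_path),
    ("naive query", naive_query),
    ("urljoin, no slash", replaced),
    ("urljoin, slash", appended),
    ("urlencode", safe_query),
]:
    print(f"{label:18} {url}")
```

Even the correct helpers have surprises here (the trailing-slash rule in urljoin), which is part of why a dedicated type with explicit operations is attractive.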

A pathlib.Path-like class in urllib.URL would almost certainly make the above go away in time (and make the Python-run Web a safer place too).

apalala · May 31, 2019, 5:19pm

And I agree that URL should be immutable, like pathlib.Path.
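
As a rough sketch of what an immutable, pathlib.Path-like URL type could look like, assuming a frozen dataclass built on top of urlsplit/urlunsplit. The class, its `parse` and `with_query` methods, and the `/` behaviour are hypothetical; this is not the API of furl, yarl, or any stdlib module.

```python
from dataclasses import dataclass, replace
from urllib.parse import quote, urlsplit, urlunsplit


@dataclass(frozen=True)
class URL:
    """Hypothetical immutable URL value; every operation returns a new object."""

    scheme: str = ""
    netloc: str = ""
    path: str = ""
    query: str = ""
    fragment: str = ""

    @classmethod
    def parse(cls, text: str) -> "URL":
        # SplitResult is a 5-tuple in the same order as the fields above.
        return cls(*urlsplit(text))

    def __truediv__(self, segment: str) -> "URL":
        # Append one path segment, percent-encoding it (including any "/").
        new_path = self.path.rstrip("/") + "/" + quote(segment, safe="")
        return replace(self, path=new_path)

    def with_query(self, query: str) -> "URL":
        # Return a copy with a different (already encoded) query string.
        return replace(self, query=query)

    def __str__(self) -> str:
        return urlunsplit(
            (self.scheme, self.netloc, self.path, self.query, self.fragment)
        )


base = URL.parse("https://example.com/api")
print(base / "users" / "a b")        # builds a new URL each time
print(base.with_query("page=2"))
print(base)                          # the original is unchanged
```

Because the dataclass is frozen, assigning to `base.path` raises an error instead of mutating the object, which is the property that makes pathlib.Path values safe to share.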

njs · May 31, 2019, 5:31pm

“hyperlink” is another of these libraries that comes to mind.

My impression is that urllib.parse is one of those stdlib libraries that isn’t just harder to use than the alternatives, but also does a worse job. URL parsing is really complicated and urllib.parse has a lot of quirks. To get a sense of the issue see slide 25 here: https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf

This makes it somewhat unclear how much effort we should be putting into it.

We’ll even be shipping a better URL parser as part of the stdlib soon, albeit not where users can see it – pip vendors urllib3, and urllib3 recently started vendoring the rfc3986 package.

apalala · May 31, 2019, 6:22pm

Perhaps it can be urllib.url.URL instead of urllib.parse.URL, to be free from backwards compatibility and have URL handling that works right?

Also, if the URL class mimics what can be reused from pathlib.Path, it can be a path-like object for URLs with the file scheme.
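
One possible reading of “pathlike” is the os.PathLike protocol (`__fspath__`). Here is a minimal sketch under that assumption; the `FileURL` class is hypothetical, handles only POSIX-style paths, and ignores the host part and Windows drive letters.

```python
import os
from urllib.parse import unquote, urlsplit


class FileURL:
    """Hypothetical wrapper that makes a file: URL usable where a path is expected."""

    def __init__(self, text: str) -> None:
        parts = urlsplit(text)
        if parts.scheme != "file":
            raise ValueError(f"not a file URL: {text!r}")
        # Decode percent-escapes to get the filesystem path.
        self._path = unquote(parts.path)

    def __fspath__(self) -> str:
        # os.fspath(), open(), and the os.path functions call this method.
        return self._path


url = FileURL("file:///tmp/my%20file.txt")
print(os.fspath(url))
print(os.path.basename(url))
```

A URL with any other scheme has no meaningful filesystem path, so the sketch rejects it in the constructor rather than failing later inside `open()`.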

NOTE: I changed the proposed module to urllib.url, so that it covers only URL manipulation.