URLLib Join Behavior

OfficiallySomeGuy · March 29, 2025, 4:57pm

This proposes that urllib.parse.urljoin('https://example.com/thing', 'v1') should resolve to https://example.com/thing/v1 rather than the current behaviour, which resolves to https://example.com/v1 without warning.

I believe the relevant code is here.

This would also resolve some of the concerns in issue #96015.

notatallshaw · March 29, 2025, 5:13pm

This would be a backwards incompatible change breaking millions of Python scripts, libraries, and apps.

The value in doing this would need to be clearly greater than the cost of such a massive disruption.

nedbat · March 29, 2025, 5:17pm

The behavior is very unlikely to change, since it would break existing code and is designed to match browser semantics. Would you like to propose an improvement to the docs so that future users won’t be surprised by how it works?

picnixz · March 29, 2025, 5:20pm

In addition to breaking compatibility, I also think it would diverge from the RFC it’s based upon: RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax.

Note that https://example.com/thing does not end with a trailing slash, so its last path segment is replaced rather than extended; that’s why it behaves like this. However, urllib.parse.urljoin('https://example.com/thing/', 'v1') would do the job.
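
For anyone who wants to check the difference the trailing slash makes, here is a minimal sketch using only urllib.parse.urljoin from the standard library. The variable names are just for illustration.

```python
from urllib.parse import urljoin

# Base without a trailing slash: the last segment ("thing") is replaced.
without_slash = urljoin("https://example.com/thing", "v1")

# Base with a trailing slash: "v1" is resolved underneath "thing/".
with_slash = urljoin("https://example.com/thing/", "v1")

print(without_slash)  # https://example.com/v1
print(with_slash)     # https://example.com/thing/v1
```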

storchaka · March 29, 2025, 7:29pm

If on page https://example.com/a/b there is a reference <a href='c/d'>, it is resolved to https://example.com/a/c/d, not to https://example.com/a/b/c/d. But if it is on page https://example.com/a/b/, you will get https://example.com/a/b/c/d.
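
The same two cases can be reproduced with urljoin, which suggests it follows the link-resolution rule described here. In this sketch, resolve_link is a hypothetical wrapper name, not something from the standard library.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Hypothetical wrapper: resolve href as if it appeared on page_url."""
    return urljoin(page_url, href)


# Page URL without a trailing slash: "b" is treated as the current document.
print(resolve_link("https://example.com/a/b", "c/d"))   # https://example.com/a/c/d

# Page URL with a trailing slash: "b/" is treated as the current directory.
print(resolve_link("https://example.com/a/b/", "c/d"))  # https://example.com/a/b/c/d
```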

bwoodsend · March 29, 2025, 8:00pm

I’ve been stung by this too. Putting together parts of URLs for REST APIs without either missing or duplicate / separators [1] feels similar enough to the problem that os.path.join() solves [2] that I go looking for some kind of urljoin() function, most likely in the urllib.parse namespace. When I find exactly the name I was looking for in the first place I expected to find it, I think "ahh, perfect" until I try it and realise that it does something completely different.

I’ve since then discovered APIs that mandate all URLs have a trailing / and others that mandate the opposite which has shown me that there can never really be such a thing as a universal os.path.join()-like function for putting parts of a URL path together. But I do wish urljoin had been named something else.
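
For illustration, here is a rough sketch of the os.path.join()-like helper people tend to expect. The name join_url_path is hypothetical and nothing like it exists in urllib.parse. It also bakes in one assumption, that the result never ends with a trailing /, which is exactly the kind of policy some APIs would reject.

```python
from urllib.parse import urlsplit, urlunsplit


def join_url_path(base: str, *parts: str) -> str:
    """Hypothetical helper: append path segments to base with single "/" separators.

    Assumption: the result never ends with "/"; APIs that require a
    trailing slash would need a different policy.
    """
    scheme, netloc, path, query, fragment = urlsplit(base)
    # Drop surrounding slashes so no separator is missing or duplicated.
    segments = [path.rstrip("/")]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return urlunsplit((scheme, netloc, "/".join(segments), query, fragment))


print(join_url_path("https://example.com/thing", "v1"))
# https://example.com/thing/v1
print(join_url_path("https://example.com/thing/", "/v1/", "users"))
# https://example.com/thing/v1/users
```

Unlike urljoin, this treats every argument as a plain path segment, so it does not understand "..", absolute references, or full URLs passed as later arguments.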

[1] Some APIs do take issue over a double /.
[2] Minus the / vs \ business.