Urlparse() can sometimes raise an exception. Should it?

  1. The urlparse and urlsplit functions almost always return a ParseResult (or SplitResult), rather than raising an exception, even when given strange inputs:
from urllib.parse import urlparse

urlparse('')
# ParseResult(scheme='', netloc='', path='', params='', query='', fragment='')
urlparse('bare_word')
# ParseResult(scheme='', netloc='', path='bare_word', params='', query='', fragment='')
urlparse('http:almost_domain')
# ParseResult(scheme='http', netloc='', path='almost_domain', params='', query='', fragment='')
urlparse('http:/almost_domain2')
# ParseResult(scheme='http', netloc='', path='/almost_domain2', params='', query='', fragment='')
urlparse('http://domain')
# ParseResult(scheme='http', netloc='domain', path='', params='', query='', fragment='')

Such weird ParseResult outputs can still be reversed back to the original input using urlunparse.
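
For example (a quick check in the interpreter; outputs shown as comments):

from urllib.parse import urlparse, urlunparse

urlunparse(urlparse('http:almost_domain'))
# 'http:almost_domain'
urlunparse(urlparse('http:/almost_domain2'))
# 'http:/almost_domain2'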

  2. The documentation for urlparse also does not say what (if any) exceptions it can raise.

Based on the above two behaviors, a developer could reasonably assume that urlparse always returns a ParseResult for all inputs and in particular never raises an exception, writing application code that depends on that behavior.†

However, for certain inputs urlparse surprisingly raises an exception, even when it could have returned a ParseResult:

urlparse('//[oops')
# ValueError: Invalid IPv6 URL
# Could have returned: ParseResult(scheme='', netloc='[oops', path='', params='', query='', fragment='')

urlparse('//\uFF03恂')
# ValueError: netloc '＃恂' contains invalid characters under NFKC normalization
# Could have returned: ParseResult(scheme='', netloc='\uFF03恂', path='', params='', query='', fragment='')

I propose to either:

  1. Alter urlparse() to always return a ParseResult for all inputs, which can be reversed using urlunparse(), or
  2. Alter the documentation for urlparse() to explicitly say it can raise a ValueError for certain invalid inputs.

Thoughts?

† This is not a theoretical concern: I originally discovered this surprising behavior of urlparse() when a Python website downloader I was testing ran into a URL candidate looking like "//*[@id='" on a real web page and tried to parse it with urlparse().

1 Like

Unfortunately, these functions are often targeted by security researchers, who insist that we need to raise errors for certain "exploitable" patterns. As a result, a number of hard errors have been added over time. We've not been so good at dealing with these, and have probably been intimidated more often than we should've been.

Our current consensus is that we need a new "safe" URL parser that can be guaranteed to reject invalid URLs entirely, so that those who are trying to trust untrusted data can use it confidently, and then we can label the existing functions as "insecure on purpose" and make them happily round-trip everything without error. Right now, we don't have the new parser, and nobody is volunteering to write it, so we're in a weird not-quite-correct-not-quite-secure grey area here.

Altering the documentation to say that urlparse may raise for some invalid URLs is fine, and we should do that. But it's not the fix that would make us happy :wink:

Unfortunately, always returning a ParseResult isn't an option until we can "deprecate-for-security-uses" (as we've done with the entire http module, for example).

So I guess a relevant question for you here is: would you prefer to be using the trustworthy-but-raises-often parser, or the one that never raises?

3 Likes

I think the answer is that those "strange inputs" are all valid URIs as far as generic URI syntax - as specified in RFC 3986 - is concerned.

These parsing rules could perhaps be tightened by adding scheme-specific validation rules, such as outlined for HTTP in RFC 9110: HTTP Semantics.
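
For instance, RFC 9110 requires an "http" URI to have a non-empty host, which a scheme-aware layer on top of the generic parse could enforce. A purely illustrative sketch (the function name is made up):

from urllib.parse import urlsplit

def validate_http_url(url: str) -> None:
    # Scheme-specific rule layered over the permissive generic parser
    parts = urlsplit(url)
    if parts.scheme in ('http', 'https') and not parts.hostname:
        raise ValueError(f'{parts.scheme} URL has an empty host: {url!r}')

validate_http_url('http://domain')       # passes
validate_http_url('http:almost_domain')  # raises ValueError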

2 Likes

For my application I'd prefer to use the never-raises parser.

Reviewing my usage of urlparse(), I see the following patterns:

  • is_http_url = urlparse(url_path).scheme in ('http', 'https')
  • is_missing_scheme = urlparse(url_path).scheme == ''
  • is_referer_from_my_host = urlparse(referer_url).netloc == my_host
  • is_root_path = urlparse(url_path).path == '/'
  • url_without_fragment = urlunparse(urlparse(url_path)._replace(fragment=''))
  • url_parts = urlparse(url); new_url = «based on url_parts»
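
None of these patterns expects an exception. The first one, rewritten as a function for illustration, raises rather than returning False on the kind of URL candidate mentioned in my footnote above:

from urllib.parse import urlparse

def is_http_url(url_path: str) -> bool:
    return urlparse(url_path).scheme in ('http', 'https')

is_http_url("//*[@id='")
# ValueError: Invalid IPv6 URL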
1 Like

Agreed. IMO, checking for "potentially dangerous" URLs (whatever that means) should be handled by a separate method. So parsing according to the spec should be handled by urlparse (which would only raise an exception for a string that's malformed according to that spec) and checking that a URL is "safe" should be handled by the application after parsing. We can (and quite probably should) provide a suitable checking function, but we should be 100% clear that it's the application's responsibility to decide when and how to use it.
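
Roughly the split I have in mind, with an entirely made-up checking function just to show where the responsibility would sit:

from urllib.parse import urlparse, ParseResult

def check_url_is_acceptable(parts: ParseResult) -> bool:
    # Application policy (illustrative only), applied after spec-level parsing
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

parts = urlparse('http://example.com/page')  # parsing: spec conformance only
if not check_url_is_acceptable(parts):       # checking: the application's call
    raise ValueError('URL rejected by application policy')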

As things stand, the stdlib doesn't include a function that can answer the question "is this a valid URL?" That seems wrong to me.

PS I understand that backward compatibility means that we may never be able to achieve the above ideal. But that doesn't mean I agree with the current behaviour.

4 Likes

This is true. The majority of "concerns" that get raised are about where our behaviour differs from "popular browsers", not where we differ from the spec.

2 Likes

Is this documented somewhere, for people who try to contribute to urlparse()? What (if any) changes are being accepted for the current urlparse() method (I have an open PR for urlsplit/urlparse)?

For (2), we do document that urlparse raises a ValueError in some situations. I doubt we exhaustively list all such situations, and I do not think we should try to. Perhaps all you're asking is that this mention of an exception be pulled further up in the doc? I would not personally assume a Python API does not raise something like ValueError when given an unreasonable value, even if that is not explicitly documented. Something not willing to raise an exception at all should document itself as such and have related regression tests, if that is an important API trait.

We also state that urlunparse(urlparse(value)) is not guaranteed to return the original value. It is unreasonable to assume that it does. We specifically document that the result may not be identical to value.
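
For example, an empty query string does not survive the round trip:

from urllib.parse import urlparse, urlunparse

urlunparse(urlparse('http://example.com/path?'))
# 'http://example.com/path'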

We cannot do (1) and should not try. Loosening the API would hurt many existing applications that depend on what little non-guaranteed validation it already performs. ValueError makes sense for things that do not appear to be valid URLs. Never raising, and always returning a ParseResult even on nonsense or crafted malicious input, is fundamentally at odds with what I'd call secure API design best practices. Always returning would basically be telling each and every user: "you're on your own; you all need to anticipate everything malicious, and every possible way it might pass through our internal parsing implementation, intentional or not, and reinvent your own validation logic, repeating all of the security bugs in your applications". While we already advise people to check the results, going further down that path and doing less for them is not good for the world.

We're realistically stuck with these URL-related APIs. Not breaking existing users is our top priority, unless the existing use is an outright security failure in the common, widely used, actual use cases from people's existing application and library code. So we tighten things a bit when feasible, given that this code sits on a non-designed legacy 1990s URL parsing implementation and on public APIs that get both widely used and abused. It isn't even always feasible.

Some previous relevant discussions to be aware of and a huge pile of existing bugs.

If someone wants a URL parsing library with majorly different behavior or design, they're best off doing that on PyPI.

7 Likes

The code is a nightmarish mess of legacy behaviors, so no core devs really want to claim ownership. When making changes, we've often been burned by existing code depending in surprising or unreasonable manners on existing (mis)behavior, so there isn't a lot of reward for working on it.

Thanks for the ping, I'll get to your PR (which I see appears to add more relevant ValueErrors) eventually. At a glance it makes sense to me, but I'll need to dive into the logic further to be sure.

Thanks so much!

Ah ha. Somehow I missed the mention of ValueError on the first reading. So disregard my (2) suggestion.

We cannot do (1) [in the standard library] and should not try.

OK.

we've often been burned by existing code depending in surprising or unreasonable manners on existing (mis)behavior so there isn't a lot of reward for working on it.

Sounds like Hyrum's Law applies here :slight_smile:


For my own application, I've applied a monkeypatch to urlparse that does (1) using approximately the following logic:

import urllib.parse
from urllib.parse import ParseResult

def patch_urlparse_to_never_raise_valueerror():
    super_urlparse = urllib.parse.urlparse  # capture the original implementation
    def urlparse(url: str, *args, **kwargs) -> ParseResult:
        try:
            return super_urlparse(url, *args, **kwargs)
        except ValueError:
            # Fall back to treating the entire input as a path
            return ParseResult(scheme='', netloc='', path=url, params='', query='', fragment='')
    urllib.parse.urlparse = urlparse  # monkeypatch
1 Like