Urlparse() can sometimes raise an exception. Should it?

  1. The urlparse and urlsplit functions almost always return a ParseResult (or SplitResult), rather than raising an exception, even when given strange inputs:
from urllib.parse import urlparse

urlparse('')
# ParseResult(scheme='', netloc='', path='', params='', query='', fragment='')
urlparse('bare_word')
# ParseResult(scheme='', netloc='', path='bare_word', params='', query='', fragment='')
urlparse('http:almost_domain')
# ParseResult(scheme='http', netloc='', path='almost_domain', params='', query='', fragment='')
urlparse('http:/almost_domain2')
# ParseResult(scheme='http', netloc='', path='/almost_domain2', params='', query='', fragment='')
urlparse('http://domain')
# ParseResult(scheme='http', netloc='domain', path='', params='', query='', fragment='')

Such weird ParseResult outputs can still be reversed back to the original input using urlunparse.
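
For example (a quick check in the interpreter; outputs shown as comments):

from urllib.parse import urlparse, urlunparse

urlunparse(urlparse('http:almost_domain'))
# 'http:almost_domain'
urlunparse(urlparse('http:/almost_domain2'))
# 'http:/almost_domain2'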

  2. The documentation for urlparse also does not say what (if any) exceptions it can raise.

Based on the above two behaviors, a developer could reasonably assume that urlparse always returns a ParseResult for all inputs and in particular never raises an exception, writing application code that depends on that behavior.†

However, for certain inputs urlparse surprisingly raises an exception, even when it could have returned a ParseResult:

urlparse('//[oops')
# ValueError: Invalid IPv6 URL
# Could have returned: ParseResult(scheme='', netloc='[oops', path='', params='', query='', fragment='')

urlparse('//\uFF03恂')
# ValueError: netloc '＃恂' contains invalid characters under NFKC normalization
# Could have returned: ParseResult(scheme='', netloc='\uFF03恂', path='', params='', query='', fragment='')

I propose to either:

  1. Alter urlparse() to always return a ParseResult for all inputs, which can be reversed using urlunparse(), or
  2. Alter the documentation for urlparse() to explicitly say it can raise a ValueError for certain invalid inputs.

Thoughts?

† This is not a theoretical concern: I originally discovered this surprising behavior of urlparse() when a Python website downloader I was testing ran into a URL candidate looking like "//*[@id='" on a real web page and tried to parse it with urlparse().

1 Like

Unfortunately, these functions are often targeted by security researchers, who insist that we need to raise errors for certain "exploitable" patterns. As a result, a number of hard errors have been added over time. We've not been so good at dealing with these, and have probably been intimidated more often than we should've been.

Our current consensus is that we need a new "safe" URL parser that can be guaranteed to reject invalid URLs entirely, so that those who are trying to trust untrusted data can use it confidently, and then we can label the existing functions as "insecure on purpose" and make them happily round-trip everything without error. Right now, we don't have the new parser, and nobody is volunteering to write it, so we're in a weird not-quite-correct-not-quite-secure grey area here.

Altering the documentation to say that urlparse may raise for some invalid URLs is fine, and we should do that. But it's not the fix that would make us happy :wink:

Unfortunately, always returning a ParseResult isn't an option until we can "deprecate-for-security-uses" (as we've done with the entire http module, for example).

So I guess a relevant question for you here is: would you prefer to be using the trustworthy-but-raises-often parser, or the one that never raises?

3 Likes

I think the answer is that those "strange inputs" are all valid URIs as far as generic URI syntax - as specified in RFC 3986 - is concerned.

These parsing rules could perhaps be tightened by adding scheme-specific validation rules, such as outlined for HTTP in RFC 9110: HTTP Semantics.
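
For instance, RFC 9110 requires an "http" URI to have a non-empty host, which a scheme-aware layer on top of the generic parse could enforce. A purely illustrative sketch (the function name is made up):

from urllib.parse import urlsplit

def validate_http_url(url: str) -> None:
    # Scheme-specific rule layered over the permissive generic parser
    parts = urlsplit(url)
    if parts.scheme in ('http', 'https') and not parts.hostname:
        raise ValueError(f'{parts.scheme} URL has an empty host: {url!r}')

validate_http_url('http://domain')       # passes
validate_http_url('http:almost_domain')  # raises ValueError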

2 Likes

For my application I'd prefer to use the never-raises parser.

Reviewing my usage of urlparse(), I see the following patterns:

  • is_http_url = urlparse(url_path).scheme in ('http', 'https')
  • is_missing_scheme = urlparse(url_path).scheme == ''
  • is_referer_from_my_host = urlparse(referer_url).netloc == my_host
  • is_root_path = urlparse(url_path).path == '/'
  • url_without_fragment = urlunparse(urlparse(url_path)._replace(fragment=''))
  • url_parts = urlparse(url); new_url = «based on url_parts»
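
None of these patterns expects an exception. The first one, rewritten as a function for illustration, raises rather than returning False on the kind of URL candidate mentioned in my footnote above:

from urllib.parse import urlparse

def is_http_url(url_path: str) -> bool:
    return urlparse(url_path).scheme in ('http', 'https')

is_http_url("//*[@id='")
# ValueError: Invalid IPv6 URL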
1 Like

Agreed. IMO, checking for "potentially dangerous" URLs (whatever that means) should be handled by a separate method. So parsing according to the spec should be handled by urlparse (which would only raise an exception for a string that's malformed according to that spec) and checking that a URL is "safe" should be handled by the application after parsing. We can (and quite probably should) provide a suitable checking function, but we should be 100% clear that it's the application's responsibility to decide when and how to use it.
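
Roughly the split I have in mind, with an entirely made-up checking function just to show where the responsibility would sit:

from urllib.parse import urlparse, ParseResult

def check_url_is_acceptable(parts: ParseResult) -> bool:
    # Application policy (illustrative only), applied after spec-level parsing
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

parts = urlparse('http://example.com/page')  # parsing: spec conformance only
if not check_url_is_acceptable(parts):       # checking: the application's call
    raise ValueError('URL rejected by application policy')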

As things stand, the stdlib doesn't include a function that can answer the question "is this a valid URL?" That seems wrong to me.

PS I understand that backward compatibility means that we may never be able to achieve the above ideal. But that doesn't mean I agree with the current behaviour.

4 Likes

This is true. The majority of "concerns" that get raised are about where our behaviour differs from "popular browsers", not where we differ from the spec.

2 Likes

Is this documented somewhere, for people who try to contribute to urlparse()? What (if any) changes are being accepted for the current urlparse() method (I have an open PR for urlsplit/urlparse)?

For (2), we do document that urlparse raises a ValueError in some situations. I doubt we exhaustively list all such situations, and I do not think we should try to. Perhaps all you're asking is that this mention of an exception be pulled further up in the doc? I would not personally assume a Python API does not raise something like ValueError when given an unreasonable value, even if that is not explicitly documented. Something not willing to raise an exception at all should document itself as such and have related regression tests, if that is an important API trait.

We also state that urlunparse(urlparse(value)) is not guaranteed to return the original value. It is unreasonable to assume that it does. We specifically document that the result may not be identical to value.
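
For example, an empty query string does not survive the round trip:

from urllib.parse import urlparse, urlunparse

urlunparse(urlparse('http://example.com/path?'))
# 'http://example.com/path'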

We cannot do (1) and should not try. Loosening the API would hurt many existing applications that depend on what little non-guaranteed validation it already performs. ValueError makes sense for things that do not appear to be valid URLs. Never raising, and always returning a ParseResult even on nonsense or crafted malicious input, is fundamentally at odds with what I'd call secure API design best practices. Always returning would basically be telling each and every user: "you're on your own; you all need to anticipate everything malicious, and every possible way it might pass through our internal parsing implementation, intentional or not, and reinvent your own validation logic, repeating all of the security bugs in your applications". While we already advise people to check the results, going further down that path and doing less for them is not good for the world.

We're realistically stuck with these URL-related APIs. Not breaking existing users is our top priority, unless the existing use is an outright security failure in the common, widely used, actual use cases from people's existing application and library code. So we tighten things a bit when feasible, given that this code sits on a non-designed legacy 1990s URL parsing implementation and on public APIs that get both widely used and abused. It isn't even always feasible.

Some previous relevant discussions to be aware of and a huge pile of existing bugs.

If someone wants a URL parsing library with majorly different behavior or design, they're best off doing that on PyPI.

7 Likes

The code is a nightmarish mess of legacy behaviors, so no core devs really want to claim ownership. When making changes, we've often been burned by existing code depending in surprising or unreasonable manners on existing (mis)behavior, so there isn't a lot of reward for working on it.

Thanks for the ping, I'll get to your PR (which I see appears to add more relevant ValueErrors) eventually. At a glance it makes sense to me, but I'll need to dive into the logic further to be sure.

Thanks so much!

Ah ha. Somehow I missed the mention of ValueError on the first reading. So disregard my (2) suggestion.

We cannot do (1) [in the standard library] and should not try.

OK.

we've often been burned by existing code depending in surprising or unreasonable manners on existing (mis)behavior so there isn't a lot of reward for working on it.

Sounds like Hyrum's Law applies here :slight_smile:


For my own application, I've applied a monkeypatch to urlparse that does (1) using approximately the following logic:

import urllib.parse
from urllib.parse import ParseResult

def patch_urlparse_to_never_raise_valueerror():
    super_urlparse = urllib.parse.urlparse  # capture the original implementation
    def urlparse(url: str, *args, **kwargs) -> ParseResult:
        try:
            return super_urlparse(url, *args, **kwargs)
        except ValueError:
            # Fall back to treating the entire input as a path
            return ParseResult(scheme='', netloc='', path=url, params='', query='', fragment='')
    urllib.parse.urlparse = urlparse  # monkeypatch
1 Like