- The urlparse and urlsplit functions generally return a result (a ParseResult or SplitResult, respectively) even when given strange inputs, rather than raising an exception:
from urllib.parse import urlparse
urlparse('')
# ParseResult(scheme='', netloc='', path='', params='', query='', fragment='')
urlparse('bare_word')
# ParseResult(scheme='', netloc='', path='bare_word', params='', query='', fragment='')
urlparse('http:almost_domain')
# ParseResult(scheme='http', netloc='', path='almost_domain', params='', query='', fragment='')
urlparse('http:/almost_domain2')
# ParseResult(scheme='http', netloc='', path='/almost_domain2', params='', query='', fragment='')
urlparse('http://domain')
# ParseResult(scheme='http', netloc='domain', path='', params='', query='', fragment='')
Even these odd ParseResult outputs can be reversed back to the original input using urlunparse.
- The documentation for urlparse also does not say what (if any) exceptions it can raise.
Based on these two behaviors, a developer could reasonably assume that urlparse always returns a ParseResult for every input and, in particular, never raises an exception, and could write application code that depends on that behavior.†
However, for certain inputs urlparse surprisingly raises an exception, even though it could have returned a ParseResult:
urlparse('//[oops')
# ValueError: Invalid IPv6 URL
# Could have returned: ParseResult(scheme='', netloc='[oops', path='', params='', query='', fragment='')
urlparse('//\uFF03ć')
# ValueError: netloc 'ļ¼ć' contains invalid characters under NFKC normalization
# Could have returned: ParseResult(scheme='', netloc='\uFF03ć', path='', params='', query='', fragment='')
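Until one of the changes below lands, callers who want the "always returns a result" behavior have to guard the call themselves. A minimal sketch (the safe_urlparse name and the fall-back of stuffing the raw string into path are my own choices, not what urlparse "should" return):

```python
from urllib.parse import urlparse, ParseResult

def safe_urlparse(url: str) -> ParseResult:
    """Return urlparse(url), falling back to a crude all-in-path
    ParseResult for the inputs where urlparse raises ValueError."""
    try:
        return urlparse(url)
    except ValueError:
        return ParseResult(scheme='', netloc='', path=url,
                           params='', query='', fragment='')

safe_urlparse('//[oops')  # no exception, unlike urlparse('//[oops')
```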
I propose to either:
- Alter urlparse() to always return a ParseResult for all inputs, which can be reversed using urlunparse(), or
- Alter the documentation for urlparse() to explicitly say it can raise a ValueError for certain invalid inputs.
Thoughts?
† This is not a theoretical concern: I originally discovered this surprising behavior of urlparse() when a Python website downloader I was testing encountered a URL candidate that looked like "//*[@id='" on a real web page and tried to parse it with urlparse().
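That string (an XPath-like fragment scraped from a page) is enough to reproduce the crash, because the leading "//" makes urlparse treat "*[@id='" as a netloc containing an unclosed "[":

```python
from urllib.parse import urlparse

try:
    urlparse("//*[@id='")
except ValueError as e:
    print(e)  # Invalid IPv6 URL
```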