As part of this PR for Issue #102153, I’m pondering adding the text appended below, in diff form, to our urllib.parse docs. I’m a little concerned this wording could leave people going “well thanks, now what am I supposed to do!?”, whereupon they go off and write their own URL parser, badly, and invite potentially worse localized security problems into their code.
Does anyone have opinions on whether or not we should state this and how best to concisely word it?
I don’t currently like my existing warning text, because the #1 job of any web server is precisely to parse the URL given in a request and use the result for access control, and thus security. So the warning doesn’t feel accurate as worded. But I do want to convey that caution is needed, as the results may not be what you expect; i.e., code should perform further validation of each of the parsed values returned before blindly believing “they’re good”. That is a somewhat natural thing to do in web server implementations, but not in all other applications, such as one attempting to implement a blocklist filter.
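To make “further validation” concrete, here is a minimal sketch of the idea, not a recommended implementation. The helper name `is_allowed_url`, the allowed schemes and ports, and the example hosts are all hypothetical choices for illustration; a real application would pick its own policy.

```python
from urllib.parse import urlsplit

# Hypothetical policy for this sketch: only plain web URLs on default ports.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_PORTS = {None, 80, 443}


def is_allowed_url(url, allowed_hosts):
    """Return True only if every parsed piece passes an explicit check."""
    try:
        parts = urlsplit(url)
        # Accessing .port raises ValueError for an invalid port.
        port = parts.port
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    # Reject userinfo such as "https://example.com@evil.test/".
    if parts.username is not None or parts.password is not None:
        return False
    # .hostname is lowercased by urlsplit; require an exact allowlist match.
    if parts.hostname not in allowed_hosts:
        return False
    return port in ALLOWED_PORTS


print(is_allowed_url("https://example.com/path", {"example.com"}))
print(is_allowed_url("https://example.com@evil.test/", {"example.com"}))
```

The point is that each field is checked against an allowlist rather than trusted as returned; anything the checks do not explicitly accept is rejected.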
The reality is that we do not have a “world class” URL parsing library in the standard library and (please read the issue) we cannot simply change our existing APIs into one, because most of their odd, underspecified behaviors need to be maintained for compatibility’s sake, even though we’d never design a modern URL parsing library that way.
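For a sense of what such behaviors look like, here is a small sketch. The two input strings are made up for illustration; the point is only that neither raises an error, and the host-looking text ends up in `path` rather than `netloc`.

```python
from urllib.parse import urlsplit

# No scheme and no "//": the whole string is treated as a path.
bare = urlsplit("www.example.com/path")
print(repr(bare.netloc), repr(bare.path))

# Three slashes after the scheme: the netloc is empty and the
# host-looking text becomes part of the path.
triple = urlsplit("http:///example.com/")
print(repr(triple.netloc), repr(triple.path))
```

A blocklist filter that only inspects `netloc` or `hostname` would see an empty value in both cases.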
--- a/Doc/library/urllib.parse.rst
+++ b/Doc/library/urllib.parse.rst
@@ -159,6 +159,11 @@ or on combining URL components into a URL string.
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
+ .. warning::
+
+ We recommend that you do not use :func:`urlparse` as the primary basis
+ for access control or security checks on URLs. See
+ :ref:`URL parsing security <url-parsing-security>` for more details.
.. versionchanged:: 3.2
Added IPv6 URL parsing capabilities.
@@ -328,6 +333,12 @@ or on combining URL components into a URL string.
control control and space characters are stripped from the URL. ``\n``,
``\r`` and tab ``\t`` characters are removed from the URL at any position.
+ .. warning::
+
+ We recommend that you do not use :func:`urlsplit` as the primary basis
+ for access control or security checks on URLs. See
+ :ref:`URL parsing security <url-parsing-security>` for more details.
+
.. versionchanged:: 3.6
Out-of-range port numbers now raise :exc:`ValueError`, instead of
returning :const:`None`.
@@ -418,6 +429,29 @@ or on combining URL components into a URL string.
or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
without changes.
+.. _url-parsing-security:
+
+URL parsing security
+--------------------
+
+ The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation**
+ of input URLs. Given an unusually crafted URL, they may not raise an
+ error and instead return some pieces as ``""`` while other pieces
+ contain more than they probably should. You may even find that
+ :func:`urlunsplit` can reassemble a working URL from those parts.
+
+ This is, sadly, not always a bug. It is a historical behavior of the API.
+ Over the decades, some applications have come to rely on such behaviors so
+ we have been conservative when making changes. We make no attempt to define
+ all of these corner case behaviors. Bug fixes may alter them in the future.
+ **Please** consider surprising behaviors as undefined.
+
+ We recommend that you do not use the :func:`urlsplit` or :func:`urlparse`
+ APIs as the primary basis for access control or security checks on URLs,
+ including hostname and path validation. They are not guaranteed to parse a
+ URL the same way as other clients, servers, or applications will, nor in
+ the way that different standards say a URL should be parsed.
+
.. _parsing-ascii-encoded-bytes:
Parsing ASCII Encoded Bytes