How to word a warning about security uses in `urllib.parse` docs

As part of this PR for Issue #102153, I’m pondering adding the text appended below (in diff form) to our urllib.parse docs. I’m a little concerned this wording could leave people going “well thanks, now what am I supposed to do!?”, whereupon they go off and write their own URL parser, badly, and invite potentially worse localized security problems into their code.

Does anyone have opinions on whether or not we should state this and how best to concisely word it?

I don’t currently like my existing warning text, because the #1 job of any web server is specifically to parse the URL of every incoming request and use the result for access control, and thus security. So the warning doesn’t feel accurate as worded. But I do want to convey that caution is needed because the results may not be what you expect; i.e., code should perform further validation of each of the parsed values returned before blindly believing they’re good. That’s a somewhat natural thing to do in web server implementations, but not in all other applications, such as one attempting to implement a blocklist filter.

The reality is that we do not have a “world class” URL parsing library in the standard library and, please read the issue, we cannot simply change our existing APIs into one because most of the odd underspecified behaviors they have need to be maintained for compatibility’s sake. Even though we’d never design a modern URL parsing library that way.
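
To illustrate the sort of underspecified behavior I mean, here’s one made-up example (output from a recent CPython; exact results can differ across versions):

>>> from urllib.parse import urlsplit, urlunsplit
>>> parts = urlsplit("////evil.example/path")  # parses without raising
>>> parts
SplitResult(scheme='', netloc='', path='//evil.example/path', query='', fragment='')
>>> urlunsplit(parts)  # and the odd pieces round-trip back to the input
'////evil.example/path'

The parse “succeeds”, but netloc comes back empty and the would-be host hides inside path, so code that only inspects the parsed hostname sees nothing to object to.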

--- a/Doc/library/urllib.parse.rst
+++ b/Doc/library/urllib.parse.rst
@@ -159,6 +159,11 @@ or on combining URL components into a URL string.
       ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
                   params='', query='', fragment='')
 
+   .. warning::
+
+      We recommend that you do not use :func:`urlparse` as the primary basis
+      for access control or security checks on URLs. See
+      :ref:`URL parsing security <url-parsing-security>` for more details.
 
    .. versionchanged:: 3.2
       Added IPv6 URL parsing capabilities.
@@ -328,6 +333,12 @@ or on combining URL components into a URL string.
   control and space characters are stripped from the URL. ``\n``,
    ``\r`` and tab ``\t`` characters are removed from the URL at any position.
 
+   .. warning::
+
+      We recommend that you do not use :func:`urlsplit` as the primary basis
+      for access control or security checks on URLs. See
+      :ref:`URL parsing security <url-parsing-security>` for more details.
+
    .. versionchanged:: 3.6
       Out-of-range port numbers now raise :exc:`ValueError`, instead of
       returning :const:`None`.
@@ -418,6 +429,29 @@ or on combining URL components into a URL string.
    or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
    without changes.
 
+.. _url-parsing-security:
+
+URL parsing security
+--------------------
+
+   The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation**
+   of input URLs.  They may fail to parse unusually crafted URLs without
+   raising an error by instead returning some pieces as ``""`` while other
+   pieces contain more than they probably should.  You may even find that
+   :func:`urlunsplit` can reassemble a working URL from those parts.
+
+   This is, sadly, not always a bug.  It is a historical behavior of the API.
+   Over the decades, some applications have come to rely on such behaviors, so
+   we have been conservative when making changes.  We make no attempt to define
+   all of these corner case behaviors.  Bug fixes may alter them in the future.
+   **Please** consider surprising behaviors as undefined.
+
+   We recommend that you do not use the :func:`urlsplit` or :func:`urlparse`
+   APIs as the primary basis for access control or security checks on URLs,
+   including hostname and path validation.  They are not guaranteed to parse
+   a URL the same way other clients, servers, or applications do, nor in the
+   way the various standards say they should.
+
 .. _parsing-ascii-encoded-bytes:
 
 Parsing ASCII Encoded Bytes

That warning does leave me with a strong feeling of “well, what do you want me to do instead?”

Is there a third-party library that’s sufficiently well-established that we can point people to it? I’ve heard of yarl, but don’t know the space well.

If we can’t change the existing API into a secure one, can we at least point people to a secure one on PyPI?

edit: Looks like Jelle posted the same thing, heh

No amount of clever wording can fix the problem, if the problem is that there actually isn’t something they “are supposed to do” that is provided in the standard library. The course of action that seems natural to me is to create a new API, and then put the existing one through the usual deprecation process.

Isn’t this a bit self-contradictory?

We make no attempt to define all of these corner case behaviors.  Bug fixes may alter them in the future.   **Please** consider surprising behaviors as undefined.

… couldn’t we consider the historical weirdness to be bugs that warrant fixing? Theoretically we could fix them and hide the fix behind a safe param or similar that would default to the legacy behavior.
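
A rough sketch of the shape of that idea (purely hypothetical; no safe parameter exists in urllib.parse today):

import urllib.parse

def urlsplit(url, scheme="", allow_fragments=True, *, safe=False):
    # Default: exactly today's legacy behavior.
    parts = urllib.parse.urlsplit(url, scheme, allow_fragments)
    if safe:
        # Hypothetical opt-in strict mode rejecting historical corner
        # cases, e.g. a "successful" parse that found no hostname at all.
        if parts.hostname is None:
            raise ValueError(f"could not extract a hostname from {url!r}")
    return parts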

The current wording makes me think I should use a regex to parse a URL instead… which is probably a bad answer.

This seems like the right action to me, too (unfortunately).

Can we base the wording around “deprecated due to security risks, no plans to remove, no stdlib alternative available yet”? I wouldn’t add a noisy deprecation warning, but we might encourage linters to add their own.

I feel the “yet” is important to provide notice that we want a secure alternative, to help find contributors who are passionate about it. This feels like something we ought to have in the stdlib, possibly based on pathlib now that it’s becoming easier to generalise it.

This sounds like the random module before the secrets module was introduced in 3.6, which had the warning:

Warning: The pseudo-random generators of this module should not be used for security purposes.

That’s exactly the way forward I’ve proposed in this comment. I’m afraid that saying “this is not safe, don’t use it” won’t help us long-term. We’ll get more and more CVE reports and we’ll have to find a better way sooner or later.

For what it’s worth, I just deprecated all of Werkzeug’s old/modified copies of the Python 2/3 URL parsing/building functions, in favor of using urllib.parse. There were no meaningful test failures, and a huge speed increase. My reasoning was partly that it would be easier to rely on the larger maintainer/contributor group in CPython to improve things if something comes up. I’m fully on board with gradual improvement towards the WHATWG URL standard and HTTPWG specs, and away from the older RFCs.

It might look like WHATWG is a newer and better standard than the RFCs, and it’s definitely one of its authors’ goals to obsolete the RFCs, but some people disagree with that claim. It also seems that the main focus of WHATWG is browsers, but we might need to support more use cases. I’m still not sure which standard is better for the stdlib.

One root of that problem is that there is no concrete definition of “secure one” when it comes to a URL parsing API.

Yep. Our existing urllib.parse APIs evolved to serve multiple, sometimes conflicting, needs at once. They actually do what I’d call “a pretty good job” (famous last words) at this. But the needs of a web-only client UI; code that wants to access local files and resources from other protocols; a server responding to queries; something acting as an intermediary, processing pieces of data from within other protocols and data formats; something trying to prevent human users from shooting themselves in the foot; and probably others not in my head at the moment, are all slightly different.

Anyway, thanks for the thoughts on this thread so far. My existing proposed warning text seems overblown. I’m going to work on something simpler and less draconian-sounding.

The key point I think we should make is that users of these APIs should take a “trust but verify” approach to what they produce.

I.e., if you think this should be a scheme, check that it is one. If you think something should just be a path, check that it doesn’t contain things that a path should not, etc. Think that’s a hostname? Look closer to see if it contains non-host-like things, etc.

I at least want to make it clear that our parse and split APIs don’t guarantee validity of that kind of thing.
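
Something along these lines, say. This is a rough sketch of that “trust but verify” idea with made-up patterns and policy, not an official recipe:

import re
from urllib.parse import urlsplit

# Illustrative, deliberately strict rules; real applications should pick
# rules that match their needs (these ignore IPv6, IDNA, userinfo, ...).
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")  # RFC 3986 scheme shape
HOST_RE = re.compile(r"[a-z0-9](?:[a-z0-9.\-]*[a-z0-9])?")

def verified_split(url):
    parts = urlsplit(url)
    if not SCHEME_RE.fullmatch(parts.scheme):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    if parts.hostname is None or not HOST_RE.fullmatch(parts.hostname):
        raise ValueError(f"unexpected hostname: {parts.hostname!r}")
    if re.search(r"[\x00-\x20\\]", parts.path):  # controls, space, backslash
        raise ValueError(f"unexpected characters in path: {parts.path!r}")
    return parts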

IIUC, the current implementation supports more use cases than the WHATWG standard. Do you think the same is true for the RFCs? I mean, would implementing a new parser following the RFCs mean that we’d drop something important from urllib.parse?

I tried taking a closer look at this instead of just responding to the general form of the question.

To be clear: the threat model here is that urllib.parse.urlparse could fail to extract e.g. the domain name from a URI string (thus failing to match a domain blacklist) that is nevertheless usable by urllib.request.urlopen (thus allowing the user to specify a blacklisted domain and have it be accessed)?
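
In code, that failure mode would look something like this (the blocklist and hostnames here are made up):

from urllib.parse import urlparse

BLOCKED_HOSTS = {"evil.example"}

def allowed(url):
    # Naive check: trust whatever hostname urlparse extracts.
    return urlparse(url).hostname not in BLOCKED_HOSTS

# The parse doesn't fail, it simply finds no hostname, so the check passes:
print(urlparse("https:evil.example/"))  # netloc='' -> hostname is None
print(allowed("https:evil.example/"))   # True, although WHATWG-style parsers
                                        # treat this like https://evil.example/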

I have like, at least two gut reactions:

  1. Isn’t that avoided by just having the code re-create a URL string from the parsed result instead of accepting the original string (see the sketch after this list)? That limits functionality a little, but should be secure and seems easy to advise about. Am I missing something?

  2. Isn’t everyone using urllib3 and/or requests by now anyway?
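
For the first reaction, roughly this shape (hostnames made up; only a sketch of the idea, not a claim that it closes every hole):

from urllib.parse import urlsplit, urlunsplit

BLOCKED_HOSTS = {"evil.example"}

def checked_url(url):
    parts = urlsplit(url)
    if parts.hostname in BLOCKED_HOSTS:
        raise ValueError("blocked host")
    # Hand downstream code a URL rebuilt from the checked components,
    # not the original string, so the host that was checked is the one used.
    return urlunsplit(parts)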

I was wondering roughly the same. Since urllib.request already recommends Requests, perhaps it could be a good idea to mention urllib3 (parse_url and the Url class) in urllib.parse? IIUC that is what Requests uses underneath, and it has matching behaviour. This would provide the answer to the “what should I do?” question people would have reading the warning.
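
For illustration, urllib3’s parser looks like this (third-party API; worth double-checking against the urllib3 version you have):

from urllib3.util import parse_url

u = parse_url("https://example.com:8042/over/there?name=ferret#nose")
print(u.scheme, u.host, u.port, u.path, u.query, u.fragment)
# -> https example.com 8042 /over/there name=ferret nose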

As far as I can tell urllib3 doesn’t actually have a particularly robust API for URL inspection and construction; it’s geared towards making HTTP requests. For example, it doesn’t provide query string parsing, quote/unquote, etc. Its Url object also uses different attribute names than SplitResult. There are a few other libraries, like whatwg-url on PyPI, that might be appropriate, but I don’t think they provide the full API that urllib.parse does. Additionally, Werkzeug/Flask avoids pulling in other dependencies, for support reasons as well as not wanting to inflict popularity on other libraries.