Deprecating `urllib.parse.urlparse`

Not entirely clear whether this belongs here or in Ideas, but deprecation threads seem to have mostly been created here so…

urlparse implements URL semantics that have been deprecated since RFC 2396 (1998), which expanded the single parameters component (between the path and the query string) into per-segment parameters. That feature was then removed entirely from the base URI spec by RFC 3986 (2005), which is what urllib.parse is supposed to follow according to its own documentation (if not the WHATWG URL Standard, which the docs also mention in a few places).

RFC 2396 semantics is what led to the addition of urlsplit (BPO-478038 / #35466) back in 2001.

urlparse’s naming and prominence in the documentation (it sits at the very top of the “URL Parsing” section, while urlsplit is halfway down the page, below the parse_qs functions) make its use very likely, even though no system created in the last two decades should have any use for its semantics; it is thus a trap for the unwary. Anecdotally, several colleagues have expressed surprise at my remarks that they should almost always be using urlsplit and that urlparse almost certainly has semantics they’re not looking for.
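For readers who haven’t run into the difference, here is a minimal illustration (the URL is invented; the params field is the only thing urlparse adds over urlsplit):

```python
# A minimal sketch of the difference in question; the example URL is made up.
# Only urlparse() splits the ";v=2" suffix of the last path segment into a
# separate "params" field; urlsplit() leaves the path intact.
from urllib.parse import urlparse, urlsplit

url = "https://example.com/files/report;v=2?lang=en#top"

print(urlparse(url))
# ParseResult(scheme='https', netloc='example.com', path='/files/report',
#             params='v=2', query='lang=en', fragment='top')

print(urlsplit(url))
# SplitResult(scheme='https', netloc='example.com', path='/files/report;v=2',
#             query='lang=en', fragment='top')
```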

There should be no use of urlparse that cannot be replaced by urlsplit. As such, I think urlparse should be deprecated, with an eye to its eventual removal (possibly alongside all the undocumented utility functions deprecated since Python 3.8?).

5 Likes

Those two functions (and their resp. unparse/unsplit functions) implement two different ways of parsing URLs. Depending on the application, you may want one or the other, so it’s not clear why urlparse() should be deprecated in favor of urlsplit().

This paragraph in the docs makes this rather clear:

What constitutes a URL is not universally well defined. Different applications have different needs and desired constraints. For instance the living WHATWG spec describes what user facing web clients such as a web browser require. While RFC 3986 is more general. These functions incorporate some aspects of both, but cannot be claimed compliant with either. The APIs and existing user code with expectations on specific behaviors predate both standards leading us to be very cautious about making API behavior changes.

1 Like

Those two functions (and their resp. unparse/unsplit functions) implement two different ways of parsing URLs. Depending on the application, you may want one or the other, so it’s not clear why urlparse() should be deprecated in favor of urlsplit() .

Because the behaviour urlparse implements has been functionally useless for more than two decades and it is extremely unlikely that a new codebase has any use for its semantics: in my experience essentially nobody knows about (let alone wants) RFC 1808 params, so the only thing it adds to the module is confusion, along with unnecessary risks of inconsistent parsing.

Not to mention that in the extremely unlikely case where it is desired, it is easy to implement on top of urlsplit (just as RFC 2396 per-segment params are).
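To make that concrete, here is a rough sketch of recovering last-segment params from a urlsplit() result; the helper name is made up, and it only approximates what urlparse does internally:

```python
from urllib.parse import urlsplit

def split_last_segment_params(parts):
    """Split ';params' off the final path segment of a SplitResult.

    Illustrative only: roughly what urlparse() layers on top of urlsplit().
    """
    head, slash, last = parts.path.rpartition("/")
    segment, _, params = last.partition(";")
    return parts._replace(path=head + slash + segment), params

parts, params = split_last_segment_params(urlsplit("https://example.com/a/b;v=2?q=1"))
print(parts.path, params)  # -> /a/b v=2
```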

And yet, as I also noted, it is given primacy in the documentation by its upfront position (especially combined with its name), making its misuse common when a user almost always really wants urlsplit.

Are you saying that the params behavior is useless, or urlparse as a whole?

Because doing a quick code search I find lots of modern libraries using urlparse, which would appear to fit the definition of functionally useful to me; however, I don’t know how commonly params is used.

I did find that distlib, which is vendored by pip and therefore included in most installations of Python, checks if params exists: https://github.com/pypa/distlib/blob/0.3.7/distlib/index.py#L46, which again would appear to fit the definition of functionally useful.

I’m not disagreeing with your general thesis (I’m not well versed enough in the URL RFCs to comment on it), but maybe avoid the hyperbole.

Yes.

The params behaviour is the sole reason for urlparse’s existence.

That’s pretty much the point: lots of people use urlparse, and almost certainly not because they want to leverage RFC 1808 semantics.

I’m not convinced: all it does is immediately assert that there is no params, query string, or fragment, i.e. that a repository URL only has a scheme, netloc, and path. Does it actually care whether the path contains a ;, or is this assertion there because it’s a member not named path?

But there is no hyperbole? The only reason urlparse exists is to support a feature which was deprecated 25 years ago, and which I have never seen anyone care about or actually want to use, only unwittingly be affected by.

Nor am I aware of any modern URL-parsing library which implements RFC 1808 semantics (or even special support for RFC 2396 params for that matter).

And possibly even more relevant: I’m not aware of any modern web framework which uses or provides support for this. Neither does WSGI itself.

I think we should start with improving the documentation of urlparse() and whatever else to make it a lot clearer which function one actually wants - I’m pretty sure I’ve always used urlparse(), not even being aware that urlsplit() does the same thing but using a different (according to you, more relevant) spec.

I suspect this function is used a lot simply because people assume that the first in the docs and most logically named function for parsing the URL is the right one. They (same as me) probably don’t know that its behavior may be unsuitable for general use as people generally assume that the stdlib probably knows better how to do this than them.

5 Likes

This part is not quite correct. The RFCs have over the years expanded on the use of per path segment parameters. With RFC 1808, only the last path segment was allowed to have parameters (and these were applied to the object referenced by that last path segment). Starting with RFC 2396, each segment may have such parameters. RFC 3986 allows this as well and goes a step further by pushing the interpretation of the per segment parameters down to the used scheme definition.

As such, parameters are not deprecated and never have been.

All that said, per path segment parameters are really rare in the wild. I’ve only ever seen ones which were used on the last path segment, which urlparse() deals with just fine. urlsplit(), OTOH, would not parse out these parameters at all, so code using it will have to deal with those parameters separately.
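For illustration, a sketch of how code using urlsplit() might deal with per-segment params itself (the helper is made up, not part of urllib.parse):

```python
from urllib.parse import urlsplit

def split_segment_params(path):
    """Return (segment, [param, ...]) for each path segment. Illustrative only."""
    result = []
    for segment in path.split("/"):
        name, *params = segment.split(";")
        result.append((name, params))
    return result

parts = urlsplit("https://example.com/a;x=1/b;y=2;z")
print(split_segment_params(parts.path))
# [('', []), ('a', ['x=1']), ('b', ['y=2', 'z'])]
```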

Does this make urlsplit() better than urlparse() ? I don’t think so. Both have their use cases. And in fact, urlparse() will often be the better choice, since it allows ignoring last-segment path parameters easily.

3 Likes

We definitely want a better parse function, though. urlparse has attracted way too many security “bugs” recently because of how inconsistent its results are with other people’s parsers. (Guess I should add that it’s a security issue “because someone might parse and compare against a list to make security decisions for a user”.[1])

If a new, robust, well defined and behaved parser was contributed, the security team at least would take it. It’s mostly the behaviour on invalid URLs that is problematic, but it’s too late for us to start raising exceptions on every edge case from the existing functions.


  1. Yes, this is a pretty weak justification for a security bug, but what can we do? Chrome won the internet, so if you don’t parse the same as they do, you’re “wrong” :man_shrugging:

2 Likes

Could you elaborate on this a bit more? It sounds like you’re saying that some applications may be using the parse functions to determine whether or not they need to apply restrictions, but could be wrong.

Given that there are different “flavors” of URL parsing out there, it may make sense to add a flavor parameter to the function, so that users can choose the interpretation they would like to see (similar to what we have for the CSV parser).
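Purely as a hypothetical sketch (neither the parameter nor the wrapper exists in urllib.parse today), the thinnest version of that idea would just dispatch between the two existing behaviours:

```python
from urllib.parse import urlparse, urlsplit

def parse_url(url, flavor="rfc3986"):
    """Hypothetical convenience wrapper; 'flavor' is not a real urllib.parse API."""
    if flavor == "rfc3986":
        return urlsplit(url)   # no params field
    if flavor == "rfc1808":
        return urlparse(url)   # last-segment params split out
    raise ValueError(f"unknown flavor: {flavor!r}")
```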

Two convenient examples:

https://nvd.nist.gov/vuln/detail/CVE-2023-24329

https://nvd.nist.gov/vuln/detail/CVE-2022-0391

Hmm, I wonder why these are considered “security” issues. Input data sanitization such as correctly extracting strings representing URLs should really happen before calling urllib.parse functions.

IMO, adding an extra layer of protection doesn’t hurt, but it’s outside the main scope of these functions. In fact, it would probably be better and safer to raise an exception if illegal chars are present in the strings than to silently remove them… after all, either the original parser extracting the URLs is doing something wrong, or there is an actual attack going on.
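As a sketch of the “raise instead of silently stripping” idea (this is not an existing urllib.parse option, and the set of characters rejected here is only an assumption):

```python
import re
from urllib.parse import urlsplit

# Assumed definition of "illegal" characters: ASCII controls, whitespace, DEL.
_SUSPICIOUS = re.compile(r"[\x00-\x20\x7f]")

def strict_urlsplit(url):
    """Refuse to parse URLs containing control characters or whitespace."""
    if _SUSPICIOUS.search(url):
        raise ValueError("URL contains control characters or whitespace")
    return urlsplit(url)

strict_urlsplit("https://example.com/ok")         # parses normally
# strict_urlsplit("\thttps://example.com/evil")   # would raise ValueError
```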

Yes, that is pretty much the second half of the post. urlparse is hugely prominent in the module, and has a much more intuitive name than urlsplit (for historical reasons, namely that it was there first). Thus it is perfectly natural for users to go through it, even though it should not be the first function to reach for when trying to parse URLs (and, I argue, not a function to reach for at all). I’m certainly not blaming users here.

Except urlparse is not compatible with RFC 2396 (at least not without more work than would be required using urlsplit), and it’s similarly not compatible with the “whatever” approach of RFC 3986 either, as in both cases using it leads to a non-uniform, non-application-decided treatment of the path’s subcomponents. And it specifically breaks any URL which happens to contain a ; in its last path segment, even when that is not meant to have any semantic implications.

Something which is hardly difficult if you know you need it, and which is furthermore consistent with most other URL parsers.

I find this view shocking. Ignoring last-segment path parameters is not difficult in the first place if you need that behaviour, and when you don’t it’s a footgun.

And if it is your belief that built-in handling of params is critical, then surely it would be more useful for _splitparams to be public and documented, and splitattr and splitvalue un-deprecated and documented? Or a variant of parse_qs[l] for attrs could be added. That would make the behaviour both more explicit and more accessible, as all urlparse does is give you a params string and leave you to deal with it, a task for which the module provides little assistance.
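For instance (the name parse_psl is hypothetical; the stdlib has nothing like it today):

```python
from urllib.parse import unquote

def parse_psl(params, sep=";"):
    """Parse a params string like 'type=a;charset=utf-8;raw' into pairs.

    Hypothetical counterpart to parse_qsl() for ;-separated parameters.
    """
    pairs = []
    for field in params.split(sep):
        if not field:
            continue
        name, _, value = field.partition("=")
        pairs.append((unquote(name), unquote(value)))
    return pairs

print(parse_psl("type=a;charset=utf-8;raw"))
# [('type', 'a'), ('charset', 'utf-8'), ('raw', '')]
```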

TBF, parsing differentials are a common way to smuggle things through the Swiss cheese: because different parsers have different interpretations of the same datum, a piece of security software (e.g. WAF-type) can let through data it considers safe, which another parser with a different interpretation then handles unsafely. This was one of the major reasons for the standardisation of error handling and recovery in HTML5, for instance.

But that’s another issue entirely.