Refactor urllib.parse

I love urllib.parse, but it’s unclear to me why it’s written the way it is.

That said, over the last year, I have found bugs in nearly all of the popular URL parsing libraries on PyPI (rfc3986, urllib3, yarl, furl, hyperlink). This is because all of these libraries contain significant hand-written parsing logic, which is just really hard to get right.

I have begun work on a reimplementation of urllib.parse that aims to be more maintainable: kenballus/nurllib on GitHub (a rewrite of urllib.parse from the Python stdlib).

Are people open to merging this code into CPython? I’m very much open to code reviews, feature suggestions, and whatever else people have to say.

7 Likes

Pardon; a few clarifying questions.

Do none of these actually sit on top of urllib.parse? Are they trying to add more functionality, or just replace the “ball of crud” that you’ve observed? (I take it you have not identified bugs in urllib.parse itself?)

Lastly, I assume you have filed bug reports, yes?

Thank you for the questions.

> Do none of these actually sit on top of urllib.parse? Are they trying to add more functionality, or just replace the “ball of crud” that you’ve observed? (I take it you have not identified bugs in urllib.parse itself?)

Of them all, only furl sits on top of urllib.parse. The rest are trying to replace the ball of crud.

> (I take it you have not identified bugs in urllib.parse itself?)

I should have been clearer; I have also found and patched bugs in urllib.parse. If you search the issues for the GH usernames “kenballus” and “JohnJamesUtley” you’ll see three fixed bugs that we found by fuzzing urllib.parse (we work at the same university).
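
For a sense of what that fuzzing looks like, here is a minimal sketch of the idea (not our actual harness): generate random URL-ish strings and check a round-trip property; inputs for which the property fails become candidate bug reports.

```python
import random
import string
from urllib.parse import urlsplit, urlunsplit

# Illustrative harness, not the one we actually used.
# Property checked: urlunsplit(urlsplit(...)) should be idempotent.
ALPHABET = string.ascii_letters + string.digits + ":/?#[]@!$&'()*+,;=%.-"

for _ in range(100_000):
    s = "".join(random.choices(ALPHABET, k=random.randint(0, 30)))
    try:
        once = urlunsplit(urlsplit(s))
        twice = urlunsplit(urlsplit(once))
    except ValueError:
        continue  # input rejected outright; nothing to check
    if once != twice:
        print(f"not idempotent: {s!r} -> {once!r} -> {twice!r}")
```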

There are other bugs having to do with characters accepted in places where they shouldn’t be, but I haven’t filed them because I think a rewrite is a better solution than gluing more crud to the ball.

> Lastly, I assume you have filed bug reports, yes?

In the last year, I have submitted many bug reports and patches. These days, I think that many of these libraries would be better replaced by a modernized urllib.parse, which is part of why I still have some bug reports left to file.

For those of you who have not yet read this, it gives some good perspective on the general attitude toward urllib.parse and security issues within it.

1 Like

I would be against making changes to the behavior of urllib.parse at this point. Various things have been worked around over the years in other pieces of code that nonetheless rely on urllib.parse as a base layer.

I’d say come up with a new, equivalent stdlib module, maybe urllib.parse2. (Sort of like dup vs dup2.) We can mark urllib.parse as frozen in time with its current flaws (but not exactly deprecated).

1 Like

> I’d say come up with a new, equivalent stdlib module, maybe urllib.parse2. (Sort of like dup vs dup2.) We can mark urllib.parse as frozen in time with its current flaws (but not exactly deprecated).

I’m open to this option. I really just want the stdlib to provide a decent URL parser with predictable behavior that conforms to a standard.

1 Like

I’m not sure you’ll get a lot of support for that. The consensus seems to be that the way forward for supporting modern internet standards lies in third party libraries, and for the stdlib we need stability over anything else.

1 Like

> I’m not sure you’ll get a lot of support for that. The consensus seems to be that the way forward for supporting modern internet standards lies in third party libraries, and for the stdlib we need stability over anything else.

I agree that stability is one of the most important properties that we need in the stdlib. The problem is that over time, urllib.parse keeps getting tweaked and patched, and is therefore not very stable. This is both because it has accumulated a lot of baggage, and because it doesn’t comply with any one URL standard.

The new parser implements RFC 3986, which is 18 years old and widely adopted. Note that I am not advising that we implement the WHATWG URL standard, precisely because of stability concerns.

Since this new parser is little more than a direct translation of the standard’s ABNF into Python regular expressions, it should require close to zero maintenance. Moreover, once the new parser is released, we no longer need to tweak and patch the original urllib.parse; we can direct people to the new parser instead. The original parser can be left in place for those who need it.
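
To illustrate what that translation looks like, here is a fragment in the same spirit (illustrative only, not the actual nurllib source):

```python
import re

# A few RFC 3986 ABNF rules rendered as regex fragments:
#   scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
#   unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   pct-encoded = "%" HEXDIG HEXDIG
#   sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
#   pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
UNRESERVED = r"[A-Za-z0-9\-._~]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
SUB_DELIMS = r"[!$&'()*+,;=]"
PCHAR = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|[:@])"

SCHEME_RE = re.compile(rf"\A{SCHEME}\Z")
assert SCHEME_RE.fullmatch("https")
assert SCHEME_RE.match("1http") is None  # schemes must begin with a letter
assert re.fullmatch(rf"{PCHAR}*", "a%20b:@")  # a valid path segment
```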

3 Likes

Something to consider: the ada_url package has recently been published to PyPI.

I started the project because I was looking for an alternative to urllib.parse that was (a) faster, and (b) compliant with one standard rather than a mix of various ones.

It uses CFFI to expose the functionality of ada, one of the reference implementations for the WHATWG URL Spec. It supports IDNA encoding/decoding as well, something that has been languishing in the Python standard library.
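
For readers unfamiliar with it, usage looks roughly like this (a sketch based on my reading of the package’s README; the URL class follows the WHATWG URL API, so treat the attribute names and outputs here as assumptions):

```python
from ada_url import URL

url = URL("https://example.org:8080/path/../file.txt?q=1#frag")
print(url.hostname)  # expected: 'example.org'
print(url.pathname)  # expected: '/file.txt' (WHATWG parsing collapses dot segments)
```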

Given that URL handling evolves over time (IETF RFCs don’t come out very often, but browsers change behavior not-infrequently, and WHATWG’s spec is explicitly a “living standard”), I think it makes more sense to have PyPI packages with up-to-date functionality than a rewritten standard library package.

2 Likes

> Something to consider: the ada_url package has recently been published to PyPI.

The batteries-included approach is part of what I like so much about Python. Ada is a great library, but most uses of urllib.parse that I have encountered do not require such a heavyweight dependency, when a simple, RFC-compliant, regex-based parser would suffice.

> Given that URL handling evolves over time (IETF RFCs don’t come out very often, but browsers change behavior not-infrequently, and WHATWG’s spec is explicitly a “living standard”), I think it makes more sense to have PyPI packages with up-to-date functionality than a rewritten standard library package.

URL handling in browsers changes over time, but URL handling in other applications changes a lot more slowly. Consequently, the WHATWG standard is orders of magnitude more complex than RFC 3986. As much as the WHATWG doesn’t want to admit it, their standard is not the standard of record from many people’s (and businesses’) perspectives. I think a proper implementation of the RFC belongs in the stdlib because the RFC is stable over time and describes what people expect from a URL parser.

The current state of urllib.parse is somewhat sorry:

- It inconsistently distinguishes between undefined and empty URL components, even though this distinction is required by both standards.
- It features two separate, incompatible APIs (urlsplit and urljoin).
- It allows significant bending of the rules about which characters are permissible where (for example, colons are incorrectly permitted in the first segment of a path-noscheme).
- It doesn’t validate port numbers until they are accessed (for example, the following URL makes it through urlparse without error: "http://user@example.com:\x00/path?query#fragment"; see the sketch below).

Each of these is a real problem, and some might argue that there are security concerns to be raised. Patching over these issues is not a good solution because the code is so confusingly laid out that it’s almost guaranteed to be hiding further bugs. A rewrite really doesn’t take all that much effort; I think I could get my reimplementation to pass the urllib.parse test suite in about a day of work. It makes little sense to me to leave a broken parser in place, waiting to be misused, when a reimplementation could replace it while reducing both the number of bugs and the total lines of code in the stdlib.
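
To make the port issue concrete, here is what that last example does on a recent CPython 3.x (the exact error message wording varies by version):

```python
from urllib.parse import urlparse

# Parsing succeeds even though the "port" is a NUL byte...
parts = urlparse("http://user@example.com:\x00/path?query#fragment")
print(repr(parts.netloc))  # 'user@example.com:\x00'

# ...and the error only surfaces once the port attribute is accessed.
try:
    parts.port
except ValueError as exc:
    print(exc)  # e.g. "Port could not be cast to integer value as '\x00'"
```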

8 Likes

Renaming is slow and expensive, but better than nothing. Linters like bandit could start recommending the new module as soon as they decide that enough Python usage is on new-enough versions.

But is that even necessary? If the goal is to keep the same behavior on good inputs and fix bad behavior on bad inputs, wouldn’t the stability requirement be satisfied if the rewrite simply shipped in 3.13?

1 Like

I think the rewrite is starting to look pretty good. I would appreciate any feedback anyone here might have.

Note that you have not had a response from any core dev (I don’t count, since I am only trying to give feedback on process). That does not bode well for inclusion of your code into the stdlib.

1 Like

> Note that you have not had a response from any core dev (I don’t count, since I am only trying to give feedback on process). That does not bode well for inclusion of your code into the stdlib.

Acknowledged.

I’d be interested to hear if there’s any more detail that can be shared about core devs’ responses to this suggestion. @gpshead’s and @Jelle’s opinions would be of particular interest to me, because I’ve interacted with them in the past about bugs in urllib.parse.

If there are no further responses to this thread, I’ll consider it dead and leave it alone.

1 Like

I’d definitely prefer to see a library succeed on its own merits externally on PyPI with lots of community buy-in before seriously thinking about bringing it into the standard library.

We often quip semi-jokingly that the stdlib is “where good libraries go to die” because the reality is that once a library is in the stdlib, its maintenance story shifts to compatibility above all else (as Guido noted). It cannot easily be updated, and all bugs become behaviors someone relies on. So anything undergoing active development, or not yet well tested and with lurking bugs, is suspect from a stdlib perspective.

PyPI provides a great way for a library to prove itself, and it offers a much easier fix story for bugs and security issues, independent of “upgrade your entire Python runtime”.

The most important battery included with CPython these days is pip, connecting it to the PyPI ecosystem.

9 Likes

I was curious: do you have any plans to publish this on PyPI?

Yes. This is now published on PyPI as nurllib.

2 Likes

This is a really good point, but in my opinion, it shouldn’t mean the batteries Python already claims to include should be left broken.

The stdlib does provide a recommended way to parse URLs; it’s just that the parsing doesn’t adhere to any real standard (from what I understand of @kenballus’s comments).

If the stdlib can’t provide a consistent, predictable way to parse URLs, and expects users to use a PyPI package, then why not deprecate urllib.parse, or at least stop recommending its usage?

If the stdlib does want to include these batteries, why not include a version (e.g. urllib.parse2) whose behavior is consistent and correct with respect to a URL standard used outside of Python?

To me, both are valid options, but the current state seems indecisive.

3 Likes