Add URI normalization functions to the urllib.parse module

maggyero · March 31, 2020, 1:50pm

Recently I was desperately looking for a URI normalization library for Python that provides syntax-based normalization (case normalization, percent-encoding normalization and path segment normalization) and scheme-based normalization, as specified in RFC 3986.

I could only find this Gist by Mark Nottingham: https://gist.github.com/mnot/246089

It was great but outdated, so I updated it in this fork (Python 3, RFC 3986 compliance, unittest framework and a few corrections): https://gist.github.com/maggyero/9bc1382b74b0eaf67bb020669c01b234

I think it could be a nice addition to the Python standard library, so I contacted Mark and he is fine with that too. More precisely, we could add the normalizing functions defined in my Gist to the urllib.parse module:

normalizes: normalize a URI;
normalize: normalize URI components;
remove_dot_segments: remove the dot-segments in a URI path component.
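For readers unfamiliar with path segment normalization, the following is a minimal sketch of the algorithm in RFC 3986, section 5.2.4. It reuses the name `remove_dot_segments` from the list above, but it is written from the RFC text and is only an assumption about how such a function could look, not a copy of the code in the Gist.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a URI path (RFC 3986, section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]            # 2A: drop a leading "../"
        elif path.startswith("./"):
            path = path[2:]            # 2A: drop a leading "./"
        elif path.startswith("/./"):
            path = path[2:]            # 2B: replace "/./" with "/"
        elif path == "/.":
            path = "/"                 # 2B: replace a trailing "/." with "/"
        elif path.startswith("/../"):
            path = path[3:]            # 2C: replace "/../" with "/" ...
            if output:
                output.pop()           # ... and drop the last output segment
        elif path == "/..":
            path = "/"                 # 2C: same for a trailing "/.."
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""                  # 2D: a lone "." or ".." disappears
        else:
            # 2E: move the first segment (with its leading "/", if any) to output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))    # /a/g
print(remove_dot_segments("mid/content=5/../6"))  # mid/6
```

The two inputs printed above are the worked examples given in section 5.2.4 of the RFC.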

What is your opinion on this?

ofek · April 5, 2020, 8:08pm

For URI handling nowadays, I think everyone uses the code pulled out of Twisted: https://github.com/python-hyper/hyperlink

mahmoud · April 6, 2020, 5:04pm

I dunno about everyone, but hyperlink’s certainly one of the better options. I may be biased though; I added the normalize() method to hyperlink, the docs of which are here: https://hyperlink.readthedocs.io/en/latest/api.html#hyperlink.URL.normalize

If there’s real demand for this in the stdlib, I’d be happy to help. It can get a little contentious at times, especially around balancing the fundamental URL behaviors versus the creative interpretations browsers make.

ofek · November 5, 2024, 4:28pm

Update: people are moving to this now: ada-url/ada-python on GitHub (Python bindings for the Ada URL parser)