Add URI normalization functions to the urllib.parse module

Recently I was looking for a Python library for URI normalization that provides both syntax-based normalization (case normalization, percent-encoding normalization and path segment normalization) and scheme-based normalization, as specified in RFC 3986 (sections 6.2.2 and 6.2.3).
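
To make the first two kinds of syntax-based normalization concrete, here is a minimal sketch built on the existing urllib.parse functions. It is not the code from the Gist linked below; the name `normalize_case_and_percent_encoding` and the helpers are hypothetical, and the sketch assumes the authority is a plain `userinfo@host:port` string.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters from RFC 3986, section 2.3.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_triplet(match):
    """Decode a percent-encoded unreserved character, else uppercase the hex digits."""
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize_case_and_percent_encoding(uri):
    """Hypothetical helper: case and percent-encoding normalization only."""
    parts = urlsplit(uri)
    # Only the scheme and the host are case-insensitive; userinfo is left alone.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    lowered = urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )
    # Applied after lowercasing so that hex digits in the host end up uppercase.
    return _PCT_TRIPLET.sub(_normalize_triplet, lowered)


print(normalize_case_and_percent_encoding("HTTP://Example.COM/%7euser/a%2fb"))
# → http://example.com/~user/a%2Fb
```

Note that `%2f` stays encoded (as `%2F`) because `/` is a reserved character, while `%7e` is decoded because `~` is unreserved.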

I could only find this Gist by Mark Nottingham: https://gist.github.com/mnot/246089

It was great but outdated, so I updated it in this fork (Python 3, RFC 3986 compliance, the unittest framework, and a few corrections): https://gist.github.com/maggyero/9bc1382b74b0eaf67bb020669c01b234

I think it could be a nice addition to the Python standard library. I contacted Mark and he is fine with that too. More precisely, we could add the normalizing functions defined in my Gist to the urllib.parse module:

  • normalizes: normalize a URI;
  • normalize: normalize URI components;
  • remove_dot_segments: remove the dot-segments in a URI path component.
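
As an illustration of the last function, here is a minimal sketch of the dot-segment removal algorithm from RFC 3986, section 5.2.4. It follows the steps of the RFC rather than the Gist, so the actual implementation there may differ in its details.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a URI path (RFC 3986, section 5.2.4)."""
    output = []  # each entry is one segment, including its leading "/" if any
    while path:
        # Step A: drop a leading "../" or "./".
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: replace a leading "/./" or a final "/." with "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: replace a leading "/../" or a final "/.." with "/"
        # and drop the last segment already written to the output.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a path that is only "." or ".." is removed.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (with its leading "/") to the output.
        else:
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))   # → /a/g
print(remove_dot_segments("mid/content=5/../6"))  # → mid/6
```

The two inputs printed above are the examples given in the RFC itself.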

What is your opinion on this?

For URI handling nowadays, I think everyone uses the code pulled out of Twisted: https://github.com/python-hyper/hyperlink

I dunno about everyone, but hyperlink’s certainly one of the better options. I may be biased though; I added the normalize() method to hyperlink, the docs of which are here: https://hyperlink.readthedocs.io/en/latest/api.html#hyperlink.URL.normalize

If there’s real demand for this in the stdlib, I’d be happy to help. It can get a little contentious at times, especially around balancing the fundamental URL behaviors versus the creative interpretations browsers make.
