File:// URIs in Python

I’d like to tour the standard library’s existing support for file URIs and then make a proposal.

urllib

The urllib.request module has longstanding support for parsing and generating file URIs with pathname2url() and url2pathname(). The implementation depends on the current platform. To always use Windows semantics, one can import the same functions from the undocumented nturl2path module. For POSIX semantics regardless of platform, one can call urllib.parse.quote() and unquote() directly. Bugs and discussion:

  • #85168 - on POSIX, uses UTF-8 rather than local filesystem encoding
  • #90812 - on Windows, incorrectly produces/expects file URIs beginning file://// (four slashes), which is incompatible with pathlib’s implementation.
  • These functions expect you to add and remove the file:// prefix yourself. The Windows bug mentioned above misleads some folks into thinking they need to add/remove file: (no slashes).
  • The Windows variant isn’t documented.
  • The operations have much more to do with OS paths than URLs, so urllib is arguably the wrong place for them.
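To make the "add and remove the prefix yourself" point concrete, here is a minimal sketch of the POSIX route using only urllib.parse. The helper names posix_path_to_file_uri() and file_uri_to_posix_path() are made up for illustration and aren't part of the standard library. Note that quote() encodes as UTF-8 by default, which is the behaviour discussed in #85168.

```python
from urllib.parse import quote, unquote, urlsplit


def posix_path_to_file_uri(path):
    """Hypothetical helper: build a file URI from an absolute POSIX path."""
    if not path.startswith("/"):
        raise ValueError("only absolute paths can be expressed as file URIs")
    # quote() leaves "/" alone and percent-encodes the rest as UTF-8.
    # The caller has to supply the "file://" prefix by hand.
    return "file://" + quote(path)


def file_uri_to_posix_path(uri):
    """Hypothetical helper: recover a POSIX path from a file URI."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI: %r" % uri)
    if parts.netloc not in ("", "localhost"):
        raise ValueError("remote hosts are not handled in this sketch")
    # The prefix is stripped by urlsplit(); unquote() undoes the encoding.
    return unquote(parts.path)


uri = posix_path_to_file_uri("/home/user/my file.txt")
print(uri)                          # file:///home/user/my%20file.txt
print(file_uri_to_posix_path(uri))  # /home/user/my file.txt
```

This ignores edge cases such as paths that begin with two slashes, so treat it as a picture of the manual steps rather than a robust converter.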

pathlib

The pathlib.PurePath class provides an as_uri() method. Again, the implementation depends on the current platform. The Windows and POSIX variants can be found in PureWindowsPath and PurePosixPath. Bugs and discussion:

  • #91504 - there’s no way to convert a URI to a path
  • pathlib is 90% a high-level wrapper around os, ntpath and posixpath. The as_uri() method is one of a small handful of exceptions where pathlib implements low-level path manipulation logic itself. IMO pathlib is arguably the wrong place for its implementation.
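For comparison, here is a short example of what as_uri() produces for each flavour. It uses the pure path classes so it can be run on any platform; the example paths are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# POSIX flavour: the quoted path is simply appended to "file://".
posix_uri = PurePosixPath("/home/user/my file.txt").as_uri()

# Windows flavour, local drive: an extra slash precedes the drive,
# and the drive's colon is not percent-encoded.
drive_uri = PureWindowsPath(r"C:\Users\me\my file.txt").as_uri()

# Windows flavour, UNC path: the server name lands in the authority position.
unc_uri = PureWindowsPath(r"\\server\share\my file.txt").as_uri()

print(posix_uri)  # file:///home/user/my%20file.txt
print(drive_uri)  # file:///C:/Users/me/my%20file.txt
print(unc_uri)    # file://server/share/my%20file.txt
```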

os.path (proposal!)

I propose we add two new functions to os.path that parse and generate file:// URIs. I haven’t found good names for them yet, so here are their working names:

  • os.path.fileuri() - returns a file URI from the given path.
  • os.path.fileuriparse() - returns a path from the given file URI.

Their implementations would live in ntpath and posixpath, like most other os.path functionality.

We can then adjust the previously mentioned modules:

  • pathlib.PurePath.as_uri() - remove implementation, call through to fileuri()
  • pathlib.PurePath.from_uri() - add this new classmethod, call through to fileuriparse()
  • urllib.request - replace usages of url2pathname() with fileuriparse()
  • urllib.request - deprecate pathname2url() and url2pathname()
  • nturl2path - deprecate pathname2url() and url2pathname() (and the entire module?).

I believe this would have the following benefits:

  • Improve the experience for users who want to parse and generate file:// URIs, who usually end up on this SO post with 40k views or one of several others.
  • Reduce the scope for bugs and incompatibilities in urllib and pathlib by unifying their underlying file URI implementations.
  • Slightly simplify the urllib codebase, including letting us deprecate the nturl2path module.
  • Slightly simplify the pathlib codebase by more consistently delegating low-level tasks to posixpath and ntpath.

Thanks for reading. What do you think?

Aren’t file URLs a bad idea from a security POV?

They’re still well supported in web browsers and a few other applications (the GNOME and Windows shells use them, IIRC). The security considerations are the same as for other URLs, I think: if you’re taking untrusted user input, it’s better to allow-list permitted schemes like http:// than to block-list disallowed ones like file://, ftp://, etc.

That’s fair – I had assumed file: URLs were going out of style, but it seems that was premature.

But then, since file:... is a URL(*), why is it wrong to have the fundamental support be in urllib? I’d be amenable to your proposal if you chose to stick it there.

Regarding the naming, I recommend something symmetric, e.g. url_to_path() and path_to_url().


(*) Or a URI? There doesn’t seem to be agreement on what’s what – even standards bodies seem to disagree.

I suppose file URIs straddle the “URI” vs “file path” divide by their nature. For me, it falls more on the “file path” side of things because the rules vary by OS:

  • POSIX uses the local filesystem encoding, but Windows uses UTF-8
  • POSIX just prepends the path with file://, but Windows additionally removes two leading slashes (for UNC paths) or adds one (for local drive paths), and doesn’t percent-encode colons in drives.
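The Windows rule in the second bullet can be sketched roughly as follows. nt_path_to_file_uri() is a hypothetical name, not an existing or proposed API, and the sketch assumes UTF-8 percent-encoding as described in the first bullet. It leans on ntpath.splitdrive(), which works on any platform, and doesn't try to handle device paths such as \\?\ prefixes.

```python
import ntpath
from urllib.parse import quote


def nt_path_to_file_uri(path):
    """Hypothetical sketch: build a file URI from an absolute Windows path."""
    drive, tail = ntpath.splitdrive(path)
    tail = tail.replace("\\", "/")
    if not drive or not tail.startswith("/"):
        # Driveless ("\etc") and drive-relative ("C:foo") paths are not absolute.
        raise ValueError("only absolute paths can be expressed as file URIs")
    if len(drive) == 2 and drive[1] == ":":
        # Local drive: add one slash before the drive and leave its colon
        # unencoded, e.g. "C:/x" -> "file:///C:/x".
        return "file:///" + drive + quote(tail)
    # UNC path: drop the two leading slashes so the server name follows
    # "file://" directly, e.g. "//server/share/x" -> "file://server/share/x".
    server_share = drive.replace("\\", "/")[2:]
    return "file://" + quote(server_share) + quote(tail)


print(nt_path_to_file_uri(r"C:\Users\me\my file.txt"))
print(nt_path_to_file_uri(r"\\server\share\dir\report.txt"))
```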

urllib doesn’t otherwise do much per-OS stuff; this is the exception.

On URI vs URL: this w3c document says:

a URL is a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network “location”), rather than by some other attributes it may have.

Which makes some sense to me.

Regarding the naming, I recommend something symmetric, e.g. url_to_path() and path_to_url().

Thanks, that’s certainly an improvement :)

This is the bit that’s always confused me. Does this mean that it’s not possible to write a file URL in a cross-platform manner, even when it’s possible to write the equivalent path in a cross-platform way? (Paths are typically usable cross-platform as long as you only care about the current drive on Windows - even if semantically, something like /etc is clearly POSIX-specific).

I think your analysis is correct, because the trick you’re relying on (omitting the drive letter) makes the path non-absolute, and relative file URIs aren’t supported in RFC 8089.
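A quick way to see this with today's pathlib, using the pure path classes so it runs anywhere: the same driveless string is absolute under POSIX rules but not under Windows rules, so only the former can be turned into a file URI.

```python
from pathlib import PurePosixPath, PureWindowsPath

text = "/etc/hosts"

posix_path = PurePosixPath(text)
windows_path = PureWindowsPath(text)

# Under POSIX rules the path is absolute and converts cleanly.
print(posix_path.is_absolute())    # True
print(posix_path.as_uri())         # file:///etc/hosts

# Under Windows rules the path has a root but no drive, so it is not absolute.
print(windows_path.is_absolute())  # False
try:
    windows_path.as_uri()
except ValueError as exc:
    windows_error = exc
    print("cannot convert:", exc)
```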