Pathlib should forbid paths with invalid characters

I noticed that pathlib lets you create path objects containing characters that are invalid on the target OS. For example:

In [1]: from pathlib import PureWindowsPath

In [2]: PureWindowsPath('<')
Out[2]: PureWindowsPath('<')

In [3]: PureWindowsPath('file.txt ')
Out[3]: PureWindowsPath('file.txt ')

Both of these paths should be invalid (according to this SO answer).

As these paths can never be used, why can they be instantiated? (I would have demonstrated with a concrete WindowsPath but I’m on macOS.)

I would instead have expected that:

  • PureWindowsPath and WindowsPath raised a ValueError at construction time if any component of the path contained a character that is always forbidden in paths on Windows systems.
  • PurePosixPath and PosixPath similarly raised a ValueError at construction time if any component of the path contained a character that is always forbidden in paths on POSIX systems.

The benefit of this would be that you could use the various Path constructors to validate that a user-supplied path string is not malformed. It’s also better to raise at construction time than later on, for example to avoid calling .exists() on something that you could have known earlier could never possibly exist.
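
A rough sketch of the kind of construction-time check proposed above. It assumes the commonly cited set of characters forbidden in Windows file names (`< > : " / \ | ? *` plus ASCII control characters) and the rule against trailing spaces or dots. `validate_windows_path` is a hypothetical helper for illustration, not something pathlib provides:

```python
from pathlib import PureWindowsPath

# Assumption: characters commonly documented as forbidden in Windows file names.
_FORBIDDEN = set('<>:"/\\|?*') | {chr(code) for code in range(32)}


def validate_windows_path(raw: str) -> PureWindowsPath:
    """Hypothetical helper: parse `raw`, raising ValueError on forbidden names."""
    path = PureWindowsPath(raw)
    # Skip the anchor (drive and/or root, e.g. 'C:\\'), which legitimately
    # contains ':' and '\\'.
    parts = path.parts[1:] if path.anchor else path.parts
    for part in parts:
        bad = _FORBIDDEN.intersection(part)
        if bad:
            raise ValueError(f"{part!r} contains forbidden characters: {sorted(bad)}")
        if part not in (".", "..") and part[-1] in " .":
            raise ValueError(f"{part!r} ends with a space or dot")
    return path
```

With this helper, both `validate_windows_path('<')` and `validate_windows_path('file.txt ')` raise ValueError, while a path such as `C:\Users\me\file.txt` is returned as a PureWindowsPath.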

Originally noticed in the context of Stricter S3 bucket name string validation? · Issue #489 · drivendataorg/cloudpathlib · GitHub

(This is my first time posting on any python language forum so please let me know if there is a better place to raise this.)

Validating Windows paths is infeasible because Python versions aren’t tied to specific Windows versions. If a path becomes invalid in a future Windows version, we would still face the same issue of being unable to validate paths.

It is highly unlikely that Windows would ever add new invalid characters. How would that work with existing files created in the past 30ish years?

Making more characters valid is more plausible, but also quite unlikely. The set’s been unchanged since the 90s.

If anything, it would be on a new file system. Any existing files would be on existing file systems. Here’s Microsoft to say it, not me:

"Any other character that the target file system does not allow."

It’s annoying to have differences in what file names are supported, since it makes archiving and unarchiving somewhat frustrating. But we already have that across platforms (for example, I can create a .ZIP file that has a file called What is this?.txt and there are no issues, but Windows won’t allow that to be extracted under that name) and this would be no different.
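
The ZIP case can be reproduced with the standard zipfile module. This is only a small illustration that builds the archive in memory; the archive format itself accepts the name:

```python
import io
import zipfile

# Build an archive in memory so nothing touches the real file system.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("What is this?.txt", "hello")

# Reading the archive back shows the member name was stored unchanged.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()

print(names)
```

Whether that member can later be extracted under the same name depends on the target file system, which is the cross-platform mismatch described above.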

I think this has come up before, but I’m failing to dig out the GitHub discussion. It’s not a bad idea :)

I guess my main objection is that forbidding reserved characters in pathlib would have a performance impact, and I suspect that most users wouldn’t be willing to take the hit. From a more philosophical angle, most path-related functionality in the standard library takes an EAFP approach, e.g. we catch and handle encoding errors rather than checking for unencodable paths ahead of time.
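
For comparison, a minimal sketch of the EAFP style in user code: attempt the operation and handle the failure, rather than validating the path string up front. `read_text_or_none` is a hypothetical name used only for this example:

```python
from pathlib import Path
from typing import Optional


def read_text_or_none(raw: str) -> Optional[str]:
    """Hypothetical helper: let the OS decide whether the path is usable."""
    try:
        return Path(raw).read_text(encoding="utf-8")
    except (OSError, ValueError) as exc:
        # OSError covers missing files and names the OS rejects;
        # ValueError covers paths with an embedded null byte.
        print(f"cannot read {raw!r}: {exc}")
        return None
```

No character set is checked ahead of time here; an unusable path simply surfaces as an exception from the operation itself.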

It’s worth noting that Python 3.13 adds a Windows-only os.path.isreserved() function.
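
A guarded sketch of how that function might be called. On Windows, os.path is the ntpath module, so the example imports ntpath directly. The `getattr` fallback and the wrapper name `is_reserved_on_windows` are assumptions added so the snippet still runs on Python versions older than 3.13:

```python
import ntpath  # os.path is ntpath on Windows


def is_reserved_on_windows(raw: str):
    """Hypothetical wrapper: True/False on Python 3.13+, None if unavailable."""
    check = getattr(ntpath, "isreserved", None)
    if check is None:
        return None  # isreserved() was added in Python 3.13
    return check(raw)


for candidate in ("<", "file.txt ", "NUL", "file.txt"):
    print(candidate, is_reserved_on_windows(candidate))
```

This keeps the check opt-in: callers who want up-front validation can ask for it, and path construction itself stays cheap.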
