Add os.path.splitroot()

barneygale · December 28, 2022, 2:26am

The longstanding os.path.splitdrive() function splits a path into a (drive, tail) pair. But in some cases more detail is wanted, specifically a (drive, root, tail) triad.

The drive part has the same meaning as in splitdrive()
The root part is one of: the empty string, a forward slash, a backward slash (Windows only), or two forward slashes (POSIX only)
The tail part is everything following the root.

Similarly to splitdrive(), a splitroot() function would ensure that drive + root + tail is the same as the input path.
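
To make the invariant concrete, here is a minimal sketch of the POSIX side, assuming only the root rules listed above. The name `posix_splitroot_sketch` is hypothetical and this is not the proposed implementation; on POSIX the drive is always empty.

```python
def posix_splitroot_sketch(path):
    """Split a POSIX path string into (drive, root, tail).

    Hypothetical illustration only; the drive is always '' on POSIX.
    """
    if not path.startswith("/"):
        return "", "", path          # relative path: no root
    if path.startswith("//") and not path.startswith("///"):
        return "", "//", path[2:]    # exactly two leading slashes
    return "", "/", path[1:]         # one, or three or more, leading slashes


for p in ("foo/bar", "/foo/bar", "//foo/bar", "///foo/bar"):
    drive, root, tail = posix_splitroot_sketch(p)
    # Nothing is dropped or normalized: the parts rebuild the input.
    assert drive + root + tail == p
```

Note that with three or more leading slashes the extra slashes end up at the start of the tail, which is what keeps the split lossless.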

The extra level of detail reflects an extra step in the Windows ‘current path’ hierarchy – Windows has both a ‘current drive’, and a ‘current directory’ for one or more drives, which results in several kinds of non-absolute paths, e.g. ‘foo/bar’, ‘/foo/bar’, ‘X:foo/bar’

This three-part model is used successfully by pathlib, which exposes root as an attribute, and combines drive + root as an attribute called anchor. The anchor has useful properties, e.g. comparing two paths' anchors can tell us whether a relative_to() operation is possible.
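
For readers who haven't used these attributes, a short illustration with `PureWindowsPath` (which works on any OS) may help. The helper `same_anchor` is a hypothetical name, and it compares anchors as plain strings, so it is only a rough pre-check rather than what `relative_to()` does internally.

```python
from pathlib import PureWindowsPath

# The non-absolute path kinds mentioned above, plus a fully absolute one.
for text in ("foo/bar", "/foo/bar", "X:foo/bar", "X:/foo/bar"):
    p = PureWindowsPath(text)
    print(repr(text), "->", (p.drive, p.root, p.anchor))


def same_anchor(a, b):
    """Hypothetical helper: do two Windows paths share an anchor?

    Exact string comparison only; it ignores case differences such as
    'c:' versus 'C:'.
    """
    return PureWindowsPath(a).anchor == PureWindowsPath(b).anchor


print(same_anchor("C:/spam/eggs", "C:/spam"))  # same drive and root
print(same_anchor("C:/spam/eggs", "D:/spam"))  # different drives
```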

Pathlib has its own implementation of splitroot(), but its performance is hamstrung by its need for OS-agnosticism. By moving the implementation into ntpath and posixpath we can take advantage of OS-specific rules to improve performance.

Some previous discussion in a review thread: gh-68320, gh-88302 - Allow for `pathlib.Path` subclassing by barneygale (Pull Request #31691, python/cpython)

barneygale · December 28, 2022, 2:26am

@eryksun would you like to expand or correct the pitch?

eryksun · December 28, 2022, 5:30am

The _split_root() method in pathlib removes repeated slashes at the root – such as r"C:\\spam" → r"C:\spam". It also splits an implicit root in a UNC path – e.g. r"\\server\share" → (r"\\server\share", "\\", "")^[1]. Given the above quote, I assume that these non-conserving steps are intended to remain in _split_root(). Given the root is already split out, these steps won’t require calling normpath() and splitdrive(), so there will be just the single splitroot() call. Is that right?

WinAPI PathCchSkipRoot() is the basis for the builtin function nt._path_splitroot(). Offhand I recall that it has one restriction that’s over the top. It limits the use of the extended device prefix (i.e. "\\\\?\\") to just drive-letter names, volume GUID names, and the “UNC” device (e.g. r"\\?\C:\spam", r"\\?\Volume{12345678-1234-1234-1234-123456781234}\spam", and r"\\?\UNC\localhost\C$\spam"). It thus rejects a path such as r"\\?\BootPartition\spam" as an invalid parameter, even though it’s a valid path in the file API. This error case can be handled by substituting the normal device prefix (i.e. "\\\\.\\") in place of the extended prefix and retrying the PathCchSkipRoot() call.

_split_root() also mistakenly does this for device paths such as r"\\.\C:". This changes the meaning of the path from being a volume device to being the root directory of the filesystem that mounts the volume. ↩︎

barneygale · December 28, 2022, 3:56pm

That’s correct. I think the pathlib change is pretty simple:

diff --git a/Lib/pathlib.py b/Lib/pathlib.py
index b959e85d18..003d980d7a 100644
--- a/Lib/pathlib.py
+++ b/Lib/pathlib.py
@@ -293,7 +293,10 @@ def _parse_parts(cls, parts):
         path = cls._flavour.join(*parts)
         if altsep:
             path = path.replace(altsep, sep)
-        drv, root, rel = cls._split_root(path)
+        drv, root, rel = cls._flavour.splitroot(path)
+        if drv.startswith(sep):
+            # UNC paths always have a root.
+            root = sep
         unfiltered_parsed = [drv + root] + rel.split(sep)
         parsed = [sys.intern(x) for x in unfiltered_parsed if x and x != '.']
         return drv, root, parsed

Extraneous slashes in the root are placed at the beginning of the rel part and are stripped out by the parsed = ... line.
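
The filtering step can be tried in isolation. This is a sketch that copies the two lines from the diff into a hypothetical standalone function, `filter_parts`; the (drv, root, rel) triple passed in is an assumed splitroot() result for r"C:\\spam\.\eggs", with the extra slash left at the start of rel.

```python
import sys


def filter_parts(drv, root, rel, sep="\\"):
    """Hypothetical standalone copy of the filtering lines in the diff."""
    unfiltered_parsed = [drv + root] + rel.split(sep)
    # Empty strings (from repeated separators) and '.' are dropped here.
    return [sys.intern(x) for x in unfiltered_parsed if x and x != "."]


# Assumed split of r"C:\\spam\.\eggs": the second slash stays in rel.
print(filter_parts("C:", "\\", "\\spam\\.\\eggs"))
# A plain relative path has an empty anchor, which is filtered out too.
print(filter_parts("", "", "spam\\eggs"))
```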

eryksun · December 28, 2022, 6:03pm

This part should be something like the following:

        if not root and drv.startswith(sep) and (
                not drv.startswith(device_prefixes) or
                drv.startswith(unc_device_prefix)):
            # UNC file shares always have a root.
            root = sep

where device_prefixes is ("\\\\.\\", "\\\\?\\") and unc_device_prefix is "\\\\?\\UNC\\".

An implicit root is split for file shares such as r"\\server\share" and r"\\?\UNC\server\share", while base device paths such as r"\\.\C:" and r"\\.\NUL" have no root. The latter are still absolute, however. The is_absolute() method should return true for all UNC paths, since they’re never relative to a working directory.
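
As a quick check of that condition against the examples above, here is the same test lifted into a hypothetical helper, `implicit_root`. The prefix constants are the ones given above; the helper name and the standalone form are assumptions for illustration, and the prefix match is case sensitive as written.

```python
DEVICE_PREFIXES = ("\\\\.\\", "\\\\?\\")
UNC_DEVICE_PREFIX = "\\\\?\\UNC\\"


def implicit_root(drv, root, sep="\\"):
    """Return the root, adding an implicit one for UNC file shares.

    Hypothetical helper wrapping the condition quoted above.
    """
    if not root and drv.startswith(sep) and (
            not drv.startswith(DEVICE_PREFIXES) or
            drv.startswith(UNC_DEVICE_PREFIX)):
        # UNC file shares always have a root.
        return sep
    return root


for drv in (r"\\server\share", r"\\?\UNC\server\share", r"\\.\C:", r"\\.\NUL", "C:"):
    print(repr(drv), "->", repr(implicit_root(drv, "")))
```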

guido · December 28, 2022, 7:17pm

Is there somewhere that explains the FULL path syntax on Windows? All I remember is C: and maybe \blah but I cannot follow the discussion here…

barneygale · December 28, 2022, 8:22pm

Be forewarned: this page uses some terms (like “absolute path”, “relative path”) a little differently than you might expect!

The docs for pathlib.PurePath.drive, root and anchor may also be useful:

eryksun · December 28, 2022, 10:46pm

That’s one of its mistakes. For example, it says that “\directory” is an absolute path that doesn’t depend on the current directory. No, it does depend on the current directory, and it is not an absolute path. When opened, “\directory” is relative to the drive or UNC share of the current directory. As a symlink target, “\directory” is relative to the drive of the opened path of the symlink.

Here are the supported MS-DOS path types that date back to the 1980s:

relative: “spam\eggs”
relative rooted (no drive): “\spam\eggs”
relative drive (no root): “Z:spam\eggs”
absolute drive: “Z:\spam\eggs”
absolute UNC: “\\server\share\spam\eggs”
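
These five types can be told apart lexically. The following is a rough sketch built on `ntpath.splitdrive()`, which is importable on any OS; the function name `classify` and the label strings are hypothetical, and device paths are deliberately not handled.

```python
import ntpath


def classify(path):
    """Label a path with one of the five MS-DOS path types listed above.

    Hypothetical helper; purely lexical, and device paths are not handled.
    """
    drive, rest = ntpath.splitdrive(path)
    rooted = rest[:1] in ("\\", "/")
    if drive.startswith(("\\\\", "//")):
        return "absolute UNC"
    if drive:
        return "absolute drive" if rooted else "relative drive (no root)"
    return "relative rooted (no drive)" if rooted else "relative"


for p in ("spam\\eggs", "\\spam\\eggs", "Z:spam\\eggs",
          "Z:\\spam\\eggs", "\\\\server\\share\\spam\\eggs"):
    print(p, "->", classify(p))
```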

The current working directory can be either an absolute drive path or an absolute UNC path. If the current directory is a UNC path, the share is handled as the current drive, such as for resolving a relative rooted path.

There’s also an optional working directory on each drive-letter drive. It defaults to the root directory on the drive. It gets used to resolve relative drive paths such as “Z:spam\eggs”. The API doesn’t force this feature on applications, but Python’s os.chdir() and C _wchdir() both opt into it.

Windows supports an additional path type that wasn’t present in MS-DOS: device paths for canonical device names and mapped drives^[1]. These come in two flavors: normalized and extended (literal). The prefix for a normalized device path is “\\.\” (e.g. “\\.\C:”), and the prefix for an extended device path is “\\?\” (e.g. “\\?\UNC\server\share”). Opening a volume device requires the use of a device path. For example, “C:” gets resolved relative to the working directory on the drive, while “\\.\C:” is an absolute path for the volume.

Path normalization applies to all path types when opened, except for “\\?\” extended (literal) paths. Normalization replaces forward slashes with backslashes, removes repeated slashes, resolves “.” and “..” components, and removes trailing spaces and dots from the final component. Normalized paths may be limited to MAX_PATH (260) characters, or sometimes less. The native NT limit of about 32760 characters is possible if long normalized paths are enabled for both the system and the application. (Python 3.6+ enables long paths, but it still depends on the system setting.) Using an extended path allows reliable access to long paths up to about 32760 characters, but one has to be careful to first normalize the path via GetFullPathNameW() (i.e. os.path.abspath()).
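
For a feel of the lexical part of this, `ntpath.normpath()` is a loose approximation that can be run on any OS. It is not the system's normalization: it handles slashes, repeated separators, "." and "..", but as far as I know it leaves trailing dots and spaces alone and knows nothing about length limits.

```python
import ntpath

# Forward slashes, a repeated slash, and '.' / '..' components are resolved.
print(ntpath.normpath("C:/spam//eggs/./../ham"))

# Trailing dots and spaces on the final component are kept by normpath(),
# unlike the system normalization described above.
print(repr(ntpath.normpath("C:/spam/eggs. ")))
```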

The current directory is not documented to support device paths, even if they’re for a filesystem directory such as “\\?\C:\Windows”. It may seem to work, but the API isn’t tested to support it, and it has serious bugs that result in nonsense paths. ↩︎

guido · December 29, 2022, 12:46am

That’s an awesome summary, Eryk – I think I knew all of the MS-DOS flavors but the device paths are new to me (and what confused me in the discussion).

I guess there are also some additional wrinkles like case normalization, things like NUL (what’s the list of those?), and long vs. short (8+3 IIRC) paths. Also code pages, character sets, UTF-16.

merwok · December 29, 2022, 1:28am

https://github.com/python/cpython/blob/main/Lib/pathlib.py#L33-L39

eryksun · December 29, 2022, 8:32am

In the internal NT API, filenames are 16-bit Unicode strings. It’s not strictly UTF-16 because surrogate codes are not validated as surrogate pairs.
The API also does not normalize filenames to a particular Unicode normal form (e.g. “NFC” or “NFKC”).
If a filesystem directory is case insensitive, name comparisons first translate to upper case using a locale-invariant case table. One-to-many case conversions are not supported (e.g. “ß” maps to “ß”, not to “SS”).
Starting with Windows 10, NTFS supports case-sensitive directories.

For bytes paths, Python 3.6+ uses UTF-8 as the filesystem encoding. Bytes paths get decoded to wide-character strings before calling system functions. The error handler is “surrogatepass” due to the possibility of lone surrogate codes in filenames. This is sometimes called 8-bit Wobbly Transformation Format (WTF-8).
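
The effect of the “surrogatepass” handler can be seen without touching the filesystem. This sketch calls `str.encode()` and `bytes.decode()` directly rather than going through `os.fsencode()`, whose behaviour depends on the platform; the filename is made up.

```python
# A made-up filename containing a lone high surrogate (U+D800).
name = "spam\ud800.txt"

try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    # Strict UTF-8 refuses to encode lone surrogates.
    print("strict UTF-8 fails:", exc.reason)

# With 'surrogatepass' the surrogate is encoded like any other code point.
raw = name.encode("utf-8", "surrogatepass")
print(raw)

# The round trip back to str is lossless.
assert raw.decode("utf-8", "surrogatepass") == name
```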

Regarding short filenames^[1], they’re a legacy feature for compatibility with ancient applications.

ReFS and exFAT filesystems do not support short filenames.
NTFS allows disabling the automatic creation of short filenames, either for individual filesystems or system-wide, and they can be stripped from existing files. This can improve performance since NTFS stores short filenames as separate, specially-flagged entries in a directory.
FAT32 generates short filenames that can include non-ASCII OEM characters^[2], which violates the documented specification. It also uses a best-fit encoding that can be problematic. For example, given OEM is code page 850, “spĀm.txt” has the associated short name “SPAM.TXT”. In this case, most people will be surprised that opening or creating “spam.txt” actually opens or replaces “spĀm.txt”.

The list of reserved DOS device names includes “NUL”, “CON”, “CONIN$”, “CONOUT$”, “AUX”, “PRN”, “COM<1-9>” and “LPT<1-9>”. The names are case insensitive. These devices are virtually present in the unqualified current directory on all Windows versions, just like the drive-letter names “A:” through “Z:”. The device name can be followed by a colon and any number of dots and spaces. For example:

>>> stat.S_ISCHR(os.stat('CONIN$:. . . .').st_mode)
True

Unlike drive-letter names, the virtually present DOS device names cannot have a path, and the optional colon is not part of the real device name. For example:

>>> os.getcwd()
'C:\\Temp'
>>> nt._getfullpathname('CON:/spam')
'C:\\Temp\\CON:\\spam'
>>> nt._getfullpathname('CON:')
'\\\\.\\CON'

Prior to Windows 11, DOS device names are reserved in a wider range of cases than drive-letter names:

DOS device names can have an extension that gets ignored (e.g. “CON.txt”).
DOS device names are present in the explicitly referenced current directory (e.g. “.\CON”), as well as the parent directory of most opened paths (e.g. “C:\Temp\CON”), except never in UNC paths.

For some reason the latter behavior is still implemented for the “NUL” device on Windows 11. For example:

>>> nt._getfullpathname('./NUL')
'\\\\.\\NUL'
>>> nt._getfullpathname('Temp/NUL')
'\\\\.\\NUL'
>>> nt._getfullpathname('C:/Temp/NUL')
'\\\\.\\NUL'
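
The name rules described above (the reserved list, the optional colon followed only by dots and spaces, and the ignored extension before Windows 11) can be roughly sketched as a check on a single path component. The function name `looks_like_dos_device` is hypothetical, the treatment of trailing spaces is an assumption, and this only approximates the description in this post, not what Windows actually does in every case.

```python
RESERVED = (
    {"NUL", "CON", "CONIN$", "CONOUT$", "AUX", "PRN"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def looks_like_dos_device(component):
    """Rough lexical check of one path component (hypothetical helper)."""
    name, colon, rest = component.partition(":")
    if colon and rest.strip(". "):
        # Something other than dots and spaces follows the colon.
        return False
    if not colon:
        # Pre-Windows 11 rule: an extension is ignored (e.g. 'CON.txt').
        name = name.partition(".")[0]
    # Assumption: trailing spaces are ignored; names are case insensitive.
    return name.rstrip(" ").upper() in RESERVED


for c in ("CONIN$:. . . .", "CON.txt", "nul", "CON:spam", "console", "COM0"):
    print(repr(c), "->", looks_like_dos_device(c))
```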

DOS devices have never been virtually present in UNC share paths and device paths, in which case they’re just regular filenames, at least as far as the API is concerned. For example:

>>> nt._getfullpathname('//localhost/C$/Temp/NUL')
'\\\\localhost\\C$\\Temp\\NUL'  
>>> nt._getfullpathname('//./C:/Temp/NUL')
'\\\\.\\C:\\Temp\\NUL'

A filesystem or filesystem redirector (e.g. SMB) may disallow creating DOS device names, even in cases that the API doesn’t reserve. For example:

>>> open('//localhost/C$/Temp/NUL', 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
PermissionError: [Errno 13] Permission denied: '//localhost/C$/Temp/NUL'

Short filenames are specified in [MS-FSCC] 2.1.5.2.1. ↩︎
Currently on Windows 11 there’s an OEM decoding bug in the system runtime library, at least on the systems I’ve checked. The function RtlOemStringToUnicodeString() mistakenly uses the ANSI codepage instead of the OEM codepage. Thus short names that contain non-ASCII OEM characters are returned as mojibake. This also affects GetShortPathNameW(). ↩︎

barneygale · December 29, 2022, 3:56pm

Just in case anyone reading is struggling to relate the discussion of reserved names, normalization, etc, back to the original proposal: they’re important parts of the wider picture of how Windows paths work, and they influence how pathlib normalizes paths, but it’s perhaps worth noting that reserved names, normalization, 8+3, etc, don’t have a direct bearing on the proposed os.path.splitroot() function, because it’s designed to be conservative: input_path = drive + root + tail.

eryksun · December 29, 2022, 4:36pm

I know it’s an off-topic side discussion. I was trying to give Guido a summary response to his questions about Windows paths and figured I may as well answer on the public forum.

On the POSIX side of the splitroot() problem, the only gotcha I can think of is a path with two leading slashes, per the specification of “pathname” and pathname resolution:

Multiple successive <slash> characters are considered to be the same as one <slash>, except for the case of exactly two leading <slash> characters.
If a pathname begins with two successive <slash> characters, the first component following the leading <slash> characters may be interpreted in an implementation-defined manner, although more than two leading <slash> characters shall be treated as a single <slash> character.

I don’t think any of Python’s officially supported POSIX platforms has special handling for two leading slashes. Cygwin and MSYS2 reserve it for UNC paths.
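
Both `posixpath.normpath()` and pathlib already follow this rule, which can be checked on any OS since neither touches the filesystem:

```python
import posixpath
from pathlib import PurePosixPath

for p in ("/etc", "//etc", "///etc"):
    # Exactly two leading slashes are preserved; three or more collapse to one.
    print(p, "->", posixpath.normpath(p), "| root:", repr(PurePosixPath(p).root))
```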

barneygale · December 29, 2022, 5:01pm

Ack, I didn’t mean to come across as telling either you or Guido to stop talking about closely related topics. It’s fine by me and I seem to learn something new from every one of your posts! I was trying to put my proposal in context for anyone else who might be reading this thread. Sorry for not being clear.

barneygale · January 12, 2023, 11:23pm

I’ve logged a feature request and a PR. I found that several functions in ntpath and posixpath were already doing their own parsing of path roots that could be replaced by splitroot(); I think this helps demonstrate the usefulness of this function.