The longstanding os.path.splitdrive() function splits a path into a (drive, tail) pair. But in some cases more detail is wanted, specifically a (drive, root, tail) triad.
The drive part has the same meaning as in splitdrive().
The root part is one of: the empty string, a forward slash, a backward slash (Windows only), or two forward slashes (POSIX only).
The tail part is everything following the root.
Similarly to splitdrive(), a splitroot() function would ensure that drive + root + tail is the same as the input path.
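The round-trip property can be illustrated with a minimal sketch built on the existing ntpath.splitdrive(). The name nt_splitroot_sketch is made up for illustration. It is not the proposed implementation, and it only peels a single leading separator off the tail:

```python
import ntpath

def nt_splitroot_sketch(path):
    """Illustrative only: split a Windows-style path into (drive, root, tail)."""
    drive, rest = ntpath.splitdrive(path)
    # The root is at most one separator directly after the drive.
    if rest[:1] in ("\\", "/"):
        return drive, rest[:1], rest[1:]
    return drive, "", rest

for p in ["foo/bar", "/foo/bar", "X:foo/bar", "X:/foo/bar", r"\\server\share\spam"]:
    drive, root, tail = nt_splitroot_sketch(p)
    # Nothing is normalized or dropped, so the parts rejoin to the input.
    assert drive + root + tail == p
    print((drive, root, tail))
```

Because ntpath is importable on every platform, this runs the same way on POSIX and Windows.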
The extra level of detail reflects an extra step in the Windows "current path" hierarchy: Windows has both a "current drive" and a "current directory" for one or more drives, which results in several kinds of non-absolute paths, e.g. "foo/bar", "/foo/bar", "X:foo/bar".
This three-part model is used successfully by pathlib, which exposes root as an attribute, and combines drive + root as an attribute called anchor. The anchor has useful properties, e.g. comparing two paths' anchors can tell us whether a relative_to() operation is possible.
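As a small illustration of the anchor idea, here is a sketch using PureWindowsPath, which works on any platform. The helper same_anchor is a hypothetical name, and matching anchors is only a necessary precondition for relative_to(), not a guarantee that it succeeds:

```python
from pathlib import PureWindowsPath

def same_anchor(a, b):
    """Hypothetical helper: cheap precondition check before relative_to()."""
    return PureWindowsPath(a).anchor == PureWindowsPath(b).anchor

p = PureWindowsPath(r"C:\spam\eggs")
print(repr(p.drive), repr(p.root), repr(p.anchor))

# Same drive and root: relative_to() has a chance of succeeding.
print(same_anchor(r"C:\spam\eggs", r"C:\spam"))
# Different drive, or a drive-relative path with no root: it cannot succeed.
print(same_anchor(r"C:\spam", r"D:\spam"))
print(same_anchor(r"C:\spam", "C:spam"))
```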
Pathlib has its own implementation of splitroot(), but its performance is hamstrung by its need for OS-agnosticism. By moving the implementation into ntpath and posixpath we can take advantage of OS-specific rules to improve performance.
The _split_root() method in pathlib removes repeated slashes at the root, such as r"C:\\spam" → r"C:\spam". It also splits an implicit root in a UNC path, e.g. r"\\server\share" → (r"\\server\share", "\\", "")[1]. Given the above quote, I assume that these non-conserving steps are intended to remain in _split_root(). Given the root is already split out, these steps won't require calling normpath() and splitdrive(), so there will be just the single splitroot() call. Is that right?
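Both non-conserving steps are visible through pathlib's public API, without touching the private _split_root() method. A short sketch using PureWindowsPath:

```python
from pathlib import PureWindowsPath

# Repeated slashes after the drive collapse, so the input text is not preserved.
doubled = PureWindowsPath("C:\\\\spam")
print(str(doubled))

# A bare UNC share gains an implicit root.
share = PureWindowsPath(r"\\server\share")
print((share.drive, share.root))
print(str(share))
```

In both cases str() of the path object differs from the string that was passed in, which is what a conservative splitroot() would avoid.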
WinAPI PathCchSkipRoot() is the basis for the builtin function nt._path_splitroot(). Offhand I recall that it has one restriction that's over the top. It limits the use of the extended device prefix (i.e. "\\\\?\\") to just drive-letter names, volume GUID names, and the "UNC" device (e.g. r"\\?\C:\spam", r"\\?\Volume{12345678-1234-1234-1234-123456781234}\spam", and r"\\?\UNC\localhost\C$\spam"). It thus rejects a path such as r"\\?\BootPartition\spam" as an invalid parameter, even though it's a valid path in the file API. This error case can be handled by substituting the normal device prefix (i.e. "\\\\.\\") in place of the extended prefix and retrying the PathCchSkipRoot() call.
_split_root() also mistakenly does this for device paths such as r"\\.\C:". This changes the meaning of the path from being a volume device to being the root directory of the filesystem that mounts the volume.
    if not root and drv.startswith(sep) and (
            not drv.startswith(device_prefixes) or
            drv.startswith(unc_device_prefix)):
        # UNC file shares always have a root.
        root = sep
where device_prefixes is ("\\\\.\\", "\\\\?\\") and unc_device_prefix is "\\\\?\\UNC\\".
An implicit root is split for file shares such as r"\\server\share" and r"\\?\UNC\server\share", while base device paths such as r"\\.\C:" and r"\\.\NUL" have no root. The latter are still absolute, however. The is_absolute() method should return true for all UNC paths, since they're never relative to a working directory.
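To make the rule concrete, here is the same condition wrapped in a standalone function. The name implicit_root_sketch and the idea of returning the root (instead of assigning it in place) are assumptions for illustration; the prefixes are the ones given above, and matching is case-sensitive in this sketch:

```python
sep = "\\"
device_prefixes = ("\\\\.\\", "\\\\?\\")
unc_device_prefix = "\\\\?\\UNC\\"

def implicit_root_sketch(drv, root):
    """Return the root to use for an already-split (drv, root) pair."""
    if not root and drv.startswith(sep) and (
            not drv.startswith(device_prefixes) or
            drv.startswith(unc_device_prefix)):
        # UNC file shares always have a root.
        return sep
    return root

for drv in [r"\\server\share", r"\\?\UNC\server\share", r"\\.\C:", r"\\.\NUL", "C:"]:
    print(drv, repr(implicit_root_sketch(drv, "")))
```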
That's one of its mistakes. For example, it says that "\directory" is an absolute path that doesn't depend on the current directory. No, it does depend on the current directory, and it is not an absolute path. When opened, "\directory" is relative to the drive or UNC share of the current directory. As a symlink target, "\directory" is relative to the drive of the opened path of the symlink.
Here are the supported MS-DOS path types that date back to the 1980s:
relative: "spam\eggs"
relative rooted (no drive): "\spam\eggs"
relative drive (no root): "Z:spam\eggs"
absolute drive: "Z:\spam\eggs"
absolute UNC: "\\server\share\spam\eggs"
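The five types above can be told apart from the drive and the first character of the remainder. A minimal sketch, assuming only these classic forms (device paths such as "\\.\C:" are deliberately out of scope), with classify_sketch as a made-up name:

```python
import ntpath

def classify_sketch(path):
    """Label a path with one of the five classic MS-DOS path types."""
    drive, rest = ntpath.splitdrive(path)
    rooted = rest[:1] in ("\\", "/")
    if drive[1:2] == ":":
        # Drive-letter drive: absolute only if a root follows it.
        return "absolute drive" if rooted else "relative drive"
    if drive:
        # Anything else that splitdrive() treats as a drive is a UNC share.
        return "absolute UNC"
    return "relative rooted" if rooted else "relative"

for p in [r"spam\eggs", r"\spam\eggs", r"Z:spam\eggs",
          r"Z:\spam\eggs", r"\\server\share\spam\eggs"]:
    print(p, "->", classify_sketch(p))
```

Note that a bare share such as r"\\server\share" is still classified as absolute even though no root follows the drive.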
The current working directory can be either an absolute drive path or an absolute UNC path. If the current directory is a UNC path, the share is handled as the current drive, such as for resolving a relative rooted path.
There's also an optional working directory on each drive-letter drive. It defaults to the root directory on the drive. It gets used to resolve relative drive paths such as "Z:spam\eggs". The API doesn't force this feature on applications, but Python's os.chdir() and C _wchdir() both opt into it.
Windows supports an additional path type that wasn't present in MS-DOS: device paths for canonical device names and mapped drives[1]. These come in two flavors: normalized and extended (literal). The prefix for a normalized device path is "\\.\" (e.g. "\\.\C:"), and the prefix for an extended device path is "\\?\" (e.g. "\\?\UNC\server\share"). Opening a volume device requires the use of a device path. For example, "C:" gets resolved relative to the working directory on the drive, while "\\.\C:" is an absolute path for the volume.
Path normalization applies to all path types when opened, except for "\\?\" extended (literal) paths. Normalization replaces forward slashes with backslashes, removes repeated slashes, resolves "." and ".." components, and removes trailing spaces and dots from the final component. Normalized paths may be limited to MAX_PATH (260) characters, or sometimes less. The native NT limit of about 32760 characters is possible if long normalized paths are enabled for both the system and the application. (Python 3.6+ enables long paths, but it still depends on the system setting.) Using an extended path allows reliable access to long paths up to about 32760 characters, but one has to be careful to first normalize the path via GetFullPathNameW() (i.e. os.path.abspath()).
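A rough sketch of "normalize first, then add the extended prefix" follows. It uses ntpath.normpath() as a portable, purely lexical stand-in for GetFullPathNameW(): it does not consult the current directory and does not strip trailing dots and spaces, so it assumes the input is already a fully qualified drive or UNC path. The name to_extended_sketch is hypothetical:

```python
import ntpath

def to_extended_sketch(path):
    """Lexically normalize a fully qualified path and add the extended prefix."""
    if path.startswith(("\\\\?\\", "\\\\.\\")):
        return path  # already a device path; left untouched
    norm = ntpath.normpath(path)
    if norm.startswith("\\\\"):
        # UNC share: \\server\share\... becomes \\?\UNC\server\share\...
        return "\\\\?\\UNC\\" + norm[2:]
    return "\\\\?\\" + norm

print(to_extended_sketch("C:/spam//eggs/../ham"))
print(to_extended_sketch("//server/share/spam"))
```

On Windows itself, os.path.abspath() would be the better first step, since it goes through the system's own normalization.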
The current directory is not documented to support device paths, even if they're for a filesystem directory such as "\\?\C:\Windows". It may seem to work, but the API isn't tested to support it, and it has serious bugs that result in nonsense paths.
That's an awesome summary, Eryk. I think I knew all of the MS-DOS flavors, but the device paths are new to me (and are what confused me in the discussion).
I guess there are also some additional wrinkles like case normalization, things like NUL (what's the list of those?), and long vs. short (8+3 IIRC) paths. Also code pages, character sets, UTF-16.
In the internal NT API, filenames are 16-bit Unicode strings. It's not strictly UTF-16 because surrogate codes are not validated as surrogate pairs.
The API also does not normalize filenames to a particular Unicode normal form (e.g. "NFC" or "NFKC").
If a filesystem directory is case insensitive, name comparisons first translate to upper case using a locale-invariant case table. One-to-many case conversions are not supported (e.g. "ß" maps to "ß", not to "SS").
Starting with Windows 10, NTFS supports case-sensitive directories.
For bytes paths, Python 3.6+ uses UTF-8 as the filesystem encoding. Bytes paths get decoded to wide-character strings before calling system functions. The error handler is "surrogatepass" due to the possibility of lone surrogate codes in filenames. This is sometimes called 8-bit Wobbly Transformation Format (WTF-8).
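The effect of the error handler can be reproduced on any platform by passing it explicitly to str.encode(). This sketch does not use os.fsencode(), since that only applies "surrogatepass" on Windows; the filename is an invented example containing a lone low surrogate:

```python
# A filename with a lone low surrogate, which is not valid UTF-16.
lone = "spam\udc80.txt"

# Strict UTF-8 refuses to encode the lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict UTF-8 fails:", exc.reason)

# "surrogatepass" encodes it as a three-byte sequence and round-trips it.
encoded = lone.encode("utf-8", "surrogatepass")
print(encoded)
assert encoded.decode("utf-8", "surrogatepass") == lone
```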
Regarding short filenames[1], they're a legacy feature for compatibility with ancient applications.
ReFS and exFAT filesystems do not support short filenames.
NTFS allows disabling the automatic creation of short filenames, either for individual filesystems or system-wide, and they can be stripped from existing files. This can improve performance since NTFS stores short filenames as separate, specially-flagged entries in a directory.
FAT32 generates short filenames that can include non-ASCII OEM characters[2], which violates the documented specification. It also uses a best-fit encoding that can be problematic. For example, given OEM is code page 850, "spām.txt" has the associated short name "SPAM.TXT". In this case, most people will be surprised that opening or creating "spam.txt" actually opens or replaces "spām.txt".
The list of reserved DOS device names includes "NUL", "CON", "CONIN$", "CONOUT$", "AUX", "PRN", "COM<1-9>" and "LPT<1-9>". The names are case insensitive. These devices are virtually present in the unqualified current directory on all Windows versions, just like the drive-letter names "A:" through "Z:". The device name can be followed by a colon and any number of dots and spaces. For example:
Unlike drive-letter names, the virtually present DOS device names cannot have a path, and the optional colon is not part of the real device name. For example:
Prior to Windows 11, DOS device names are reserved in a wider range of cases than drive-letter names:
DOS device names can have an extension that gets ignored (e.g. "CON.txt").
DOS device names are present in the explicitly referenced current directory (e.g. ".\CON"), as well as the parent directory of most opened paths (e.g. "C:\Temp\CON"), except never in UNC paths.
For some reason the latter behavior is still implemented for the "NUL" device on Windows 11. For example:
DOS devices have never been virtually present in UNC share paths and device paths, in which case they're just regular filenames, at least as far as the API is concerned. For example:
Currently on Windows 11 there's an OEM decoding bug in the system runtime library, at least on the systems I've checked. The function RtlOemStringToUnicodeString() mistakenly uses the ANSI codepage instead of the OEM codepage. Thus short names that contain non-ASCII OEM characters are returned as mojibake. This also affects GetShortPathNameW().
Just in case anyone reading is struggling to relate the discussion of reserved names, normalization, etc. back to the original proposal: they're important parts of the wider picture of how Windows paths work, and they influence how pathlib normalizes paths, but it's perhaps worth noting that reserved names, normalization, 8+3, etc. don't have a direct bearing on the proposed os.path.splitroot() function, because it's designed to be conservative: input_path = drive + root + tail.
I know it's an off-topic side discussion. I was trying to give Guido a summary response to his questions about Windows paths and figured I may as well answer on the public forum.
On the POSIX side of the splitroot() problem, the only gotcha I can think of is a path with two leading slashes, per the specification of "pathname" and pathname resolution:
Multiple successive <slash> characters are considered to be the same as one <slash>, except for the case of exactly two leading <slash> characters.
If a pathname begins with two successive <slash> characters, the first component following the leading <slash> characters may be interpreted in an implementation-defined manner, although more than two leading <slash> characters shall be treated as a single <slash> character.
I don't think any of Python's officially supported POSIX platforms has special handling for two leading slashes. Cygwin and MSYS2 reserve it for UNC paths.
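The two-slash rule is small enough to sketch directly. The function name posix_splitroot_sketch is hypothetical and this is only one way to express the rule quoted above; the drive part is always empty on POSIX. posixpath.normpath() already follows the same rule, which gives a quick cross-check:

```python
import posixpath

def posix_splitroot_sketch(path):
    """Illustrative only: split a POSIX path into ('', root, tail)."""
    if path[:1] != "/":
        return "", "", path
    if path[1:2] == "/" and path[2:3] != "/":
        # Exactly two leading slashes: kept as an implementation-defined root.
        return "", "//", path[2:]
    # One slash, or three and more: the root is a single slash.
    return "", "/", path[1:]

for p in ["foo", "/foo", "//foo", "///foo"]:
    print(p, posix_splitroot_sketch(p), posixpath.normpath(p))
```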
Ack, I didn't mean to come across as telling either you or Guido to stop talking about closely related topics. It's fine by me and I seem to learn something new from every one of your posts! I was trying to put my proposal in context for anyone else who might be reading this thread. Sorry for not being clear.
I've logged a feature request and a PR. I found that several functions in ntpath and posixpath were already doing their own parsing of path roots that could be replaced by splitroot(); I think this helps demonstrate the usefulness of this function.