I’m not limiting filesystem to APFS, it was given as an example.
Disclaimer: most of the below is written from a UNIX/POSIX point of
view; the situation on Windows is more complex because it has multiple
APIs with open(2) type calls in them, which have different filename
rules. Some of that complication is historic as Windows evolved. Also,
those APIs (IIRC) take strings rather than bytes (UNIX). So, to UNIX…
The catch is that even for purely local filesystems (does that
include a plugged in external drive, which might have almost anything
on it?), this is a bit tricky.
You do need to know the rules for the specific filesystem in play for
the paths you’re using, and as mentioned (by Steven?) if your path
crosses a mountpoint you need to apply the appropriate rules on either
side of the mountpoint.
You can’t do that in a purely lexical fashion, unless by “lexical”
you’re prepared to allow “lexical string analysis, augmented by knowing
the mount points and associated filesystems and their rules”. Which
isn’t all that bad, because you can read the output of the mount(8)
command to get that list, then do purely lexical stuff from there on
with that knowledge. Um, and an os.getcwd() if you’ve got a relative
path, or just use os.path.abspath which does that for you.
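As a rough sketch of that "lexical plus mount table" idea: the table below is a made-up stand-in for whatever you parse out of mount(8), and filesystem_for is a hypothetical helper name, not anything from the standard library. It picks the longest mount point which prefixes the path. Note that normpath collapses ".." lexically, which can be wrong in the presence of symlinks.

```python
import posixpath

# Hypothetical mount table: mount point -> filesystem type.
# In real use this would be parsed from the output of mount(8).
MOUNTS = {
    "/": "ext4",
    "/Volumes/usb": "vfat",
}

def filesystem_for(path, mounts=MOUNTS, cwd="/"):
    """Return (mount_point, fstype) for the longest mount point prefixing path.

    Purely lexical: relative paths are joined onto cwd and normalised,
    no system calls are made. Assumes "/" is present in mounts.
    """
    path = posixpath.normpath(posixpath.join(cwd, path))
    best = "/"
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best):
                best = mount_point
    return best, mounts[best]
```

With this table, a path under /Volumes/usb would get the vfat rules and everything else the ext4 rules.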
The API I’m seeking needs to either answer or provide description sufficient to compute an answer for “whether two differently coded path strings point to the same file from the perspective of open (2)” on a generic (ideally) filesystem.
I presume you mean the OS open(2) system call above, to which Python’s
os.open should be a shim.
It appears to me that knowing normalization and case sensitivity (regardless of preservation) is sufficient to answer that:
For UNIX/POSIX, this is probably so. With the caveat about mount points
above. And some more constraints which I’ll get to below.
I think I would be inclined to use pathlib to get your platform’s
Path flavour, or os.path.split to do the same. Then work with each
path component according to the filesystem rules for that step in the
path.
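Something like the following is what I have in mind for walking a path one component at a time. It is only a sketch: prefixes is a name I made up, and I’ve used PurePosixPath so that it behaves the same everywhere; you would use your platform’s own flavour in practice.

```python
from pathlib import PurePosixPath

def prefixes(path):
    """Yield (directory_so_far, component) pairs for an absolute POSIX path.

    Each pair says: "component" is looked up inside "directory_so_far",
    so the rules of the filesystem holding that directory apply to it.
    """
    parts = PurePosixPath(path).parts   # e.g. ('/', 'usr', 'local', 'bin')
    current = PurePosixPath(parts[0])
    for part in parts[1:]:
        yield str(current), part
        current = current / part
```

For each pair you would look up the filesystem for the directory and apply its comparison rule to the component.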
- If I know that normalization is irrelevant then I can normalize both output of os.listdir and user supplied string to a form of my choosing
Strictly speaking, for UNIX you need to convert the string to bytes
because the open(2) system call takes bytes - it’s a C string, which
places some constraints on it, but they’re still bytes.
That requires a convention for encoding filename strings to bytes. For
macOS, that encoding should be UTF-8 in normal form D. For other less
formal UNIXen that encoding depends on the locale in play for the
particular process doing the work; this is because the filesystems do
not have an official encoding - they’re just bytes!
So really, your criterion is how the bytes compare.
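To make that concrete, here is the same visible name encoded as UTF-8 in two normal forms. The helper name encode_name is mine, and the choice of UTF-8 is an assumption; on a locale-dependent system you would use that locale’s encoding instead.

```python
import unicodedata

def encode_name(name, form="NFD", encoding="utf-8"):
    """Normalise a filename string and encode it to the bytes open(2) would see."""
    return unicodedata.normalize(form, name).encode(encoding)

nfc = encode_name("caf\u00e9", form="NFC")   # precomposed e-acute
nfd = encode_name("caf\u00e9", form="NFD")   # "e" plus a combining acute accent
print(nfc, nfd, nfc == nfd)
```

The two byte strings differ, so a filesystem which compares raw bytes treats them as two different names.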
A traditional pure UNIX filesystem does no normalisation beyond
coalescing adjacent '/' bytes (the path separator) - then you just
compare bytes.
Also keep in mind that most filesystems have limits on the length of both
the overall pathname and the individual filename components of the path.
For example, I grew up on UNIX V7, where filename components were a
maximum of 14 bytes long (a dirent was 16 bytes long with 2 bytes for
the inode number). So abcdef_ghijkl_01 and abcdef_ghijkl_02 would
be colliding filenames (you’d just get abcdef_ghijkl_ after you made
the file).
On modern POSIX systems I believe you typically get something like 255
bytes for the filename components and at least 1024 bytes for the full
pathname, and there are ways to query those limits for the local platform.
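One such way from Python is os.pathconf, which is only available on UNIX. This is a small sketch; path_limits is a hypothetical helper, and it returns None for a limit it cannot obtain rather than raising.

```python
import os

def path_limits(path="/"):
    """Return the filename and pathname length limits in effect at path.

    The limits can differ per filesystem, so ask about the directory
    you actually intend to use.
    """
    limits = {}
    for key in ("PC_NAME_MAX", "PC_PATH_MAX"):
        try:
            limits[key] = os.pathconf(path, key)
        except (AttributeError, ValueError, OSError):
            # No os.pathconf on this platform, unknown name, or a failed query.
            limits[key] = None
    return limits

print(path_limits("/"))
```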
A case insensitive filesystem will presumably downcase the bytes (by
interpreting the bytes as some kind of “text”, possibly mere ASCII or
better some Unicode encoding) before comparing bytes. You need to know
that rule, whatever it is. You can probably infer it from the mount
table from the filesystem type and options.
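A comparison along those lines might look like the sketch below. It works on decoded text rather than bytes, and it uses Python’s str.casefold as a stand-in for the filesystem’s own folding rule; both are assumptions on my part, since a real filesystem uses whatever tables it was built with.

```python
import unicodedata

def names_equal(a, b, case_sensitive=True, form=None):
    """Compare two filename components under a given filesystem rule.

    form is a Unicode normal form such as "NFC" or "NFD", or None if
    the filesystem does no normalisation.
    """
    def key(name):
        if form is not None:
            name = unicodedata.normalize(form, name)
        if not case_sensitive:
            name = name.casefold()
        return name
    return key(a) == key(b)
```

The two keyword arguments are exactly the two facts you would look up per filesystem.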
For added fun, the OS itself almost certainly does not know your
personal locale (i.e. the encoding used to convert str to bytes in the
system call) and instead will be using the filesystem’s mount options to
derive that, if that is an option at all.
- If I know that FS prefers one specific normalization then I normalize user-supplied string to that and compare it directly to the output of os.listdir
Hahaha! If only it were that easy!
os.listdir returns different things depending on whether you supply a
str or a bytes object for the directory pathname.
For a bytes directory path, you’ll get a list of the filenames in raw
bytes form. If you know the fs rule above, you can (a) convert your
source path to bytes correctly and (b) compare the bytes from
os.listdir using the fs’ comparison rule. That is probably the most
reliable approach.
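A sketch of steps (a) and (b): matching_entries is a made-up helper, key stands for whatever comparison rule you have worked out for the filesystem, and os.fsencode is only the right conversion if the filesystem encoding matches what the filesystem expects.

```python
import os

def matching_entries(dirpath, name, key=lambda raw: raw):
    """Return the raw bytes directory entries which match name under key.

    key maps a bytes filename to its comparison form; the default is
    plain byte equality, as on a traditional UNIX filesystem.
    """
    wanted = key(os.fsencode(name))
    return [
        entry
        for entry in os.listdir(os.fsencode(dirpath))  # bytes in, bytes out
        if key(entry) == wanted
    ]
```

For an ASCII-only case insensitive rule you might pass key=bytes.lower.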
If you use a str with os.listdir the raw bytes names get decoded
using sys.getfilesystemencoding(), using the surrogate escape
convention for bytes which don’t decode cleanly using that encoding.
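Here is what the surrogate escape convention does to a name which is not valid in the expected encoding. I’ve hardwired UTF-8 so the result does not depend on your locale; os.fsdecode and os.fsencode do the same thing using the filesystem encoding.

```python
# A filename in Latin-1 bytes; b"\xe9" is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler recovers the original bytes.
back = text.encode("utf-8", errors="surrogateescape")

print(ascii(text), back == raw)
```

So the str form round-trips, but comparing such strings against normalised user input is not going to do what you want.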
- If the API fails to provide this description, then I can proceed without normalization hoping that the user supplied the path string in the right form.
- Similarly for case-sensitivity.
Fingers crossed.
You can probably write some tests with example filenames which
should/should not collide and try making those names on various
filesystems to validate how well this approach works.
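A probe for that kind of test could be as simple as the sketch below. names_collide is a hypothetical helper: it makes one name in a scratch directory on the filesystem under test and asks whether the other name then exists.

```python
import os
import tempfile

def names_collide(dirpath, a, b):
    """Return True if names a and b refer to the same entry on dirpath's filesystem.

    Works in a throwaway subdirectory of dirpath, which is removed afterwards.
    """
    with tempfile.TemporaryDirectory(dir=dirpath) as scratch:
        with open(os.path.join(scratch, a), "w"):
            pass  # just create the file
        return os.path.lexists(os.path.join(scratch, b))
```

Run it with pairs like differently cased or differently normalised names on each filesystem you care about and compare against what your rules predict.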
The description of the FS is preferable over direct path comparison,
because I want to support fnmatch-like filters.
Ok.
This complexity is why some of us prefer to use samefile() when that
is feasible - it punts the whole thing to the OS which inherently does
whatever it does.
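For reference, a minimal wrapper around os.path.samefile. The catch, and the reason it is not always feasible, is that both paths must already exist; same_target is my own name and it simply reports False when the OS cannot stat one of them.

```python
import os

def same_target(a, b):
    """Return True if paths a and b name the same existing file, per the OS."""
    try:
        return os.path.samefile(a, b)
    except OSError:
        # Typically one of the paths does not exist.
        return False
```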
When that doesn’t work, then we might try to emulate what should
happen.
You can go some way towards accommodating collisions by using open()
modes which fail if the target path already exists, which may help you
avoid your problems, depending on your needs.
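In Python that is the "x" mode of open() (or os.O_EXCL with os.open). A small sketch, with create_exclusive being a made-up helper name:

```python
def create_exclusive(path, data=b""):
    """Create path with data, returning False if the name is already taken.

    The OS decides what "already taken" means, so this respects the
    filesystem's own case and normalisation rules.
    """
    try:
        with open(path, "xb") as f:
            f.write(data)
    except FileExistsError:
        return False
    return True
```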
Cheers,
Cameron Simpson cs@cskk.id.au