Check whether two strings will point to the same file on the local filesystem

Kentzo · December 15, 2022, 5:17am

This is a question regarding encoding and unicode normalization.

In a nutshell I have two problems to tackle:

1. Given two path strings, one supplied by a user (e.g. via the CLI) and another by a Python API (e.g. os.listdir), I need to answer whether they would point to the same file on a local disk without actually trying to open the file (it might not exist).
2. Same as [1], but now the user supplies an fnmatch pattern.

As an example consider APFS where “/tmp/Jalape\u00f1o” and “/tmp/Jalapen\u0303o” point to the same file.

Is there a builtin method to compare path strings with respect to the local filesystem? If not, is there a reliable library on PyPI?

cameron · December 15, 2022, 5:48am

This is a question regarding encoding and unicode normalization.

In a nutshell I have two problems to tackle:

Given two path strings, one supplied by a user (e.g. CLI) and another by Python API (e.g. os.listdir) I need to answer whether they would point to the same file on a local disk without actually trying to open the file (it might not exist)

So you’re trying to check whether the two paths are equivalent, even if
the target filesystem object does not exist?

Personally I’d see if os.path.samefile() returns true.

If that raises a FileNotFoundError exception, only then would I try to
evaluate the paths lexically. However, if a path does not exist the
rules are operating system dependent. For example, on MacOS paths are
usually case insensitive (though you can use case sensitive
filesystems). On other UNIX platforms paths are usually case
sensitive.

This can vary per mount point.

Same as [1] but now user supplies an fnmatch pattern

As an example consider APFS where “/tmp/Jalape\u00f1o” and “/tmp/Jalapen\u0303o” point to the same file.

Do they? Hmm.

 >>> p1="/tmp/Jalape\u00f1o"
 >>> p1
 '/tmp/Jalapeño'
 >>> p2="/tmp/Jalapen\u0303o"
 >>> p2
 '/tmp/Jalapeño'
 >>> p1 == p2
 False
 >>> with open(p1,"w"): pass
 ...
 >>> with open(p2,"r"): pass
 ...

I guess they do. I have in the back of my mind that MacOS filesystems
use Unicode normal form “D”. Let’s see if these are the same in that
form:

 >>> import unicodedata
 >>> unicodedata.normalize('NFD', p1) == unicodedata.normalize('NFD', p2)
 True

But note that this is a MacOS specific rule, and possibly mountpoint
specific into the bargain.

Is there a builtin method to compare path strings with respect to the
local filesystem? If not, is there a reliable library on PyPI?

os.path.samefile tests whether two existing paths resolve to the same
file. If they don’t exist you’d need to know the filesystem rules.

The short answer is that if it exists, the check is easy and can be
handed off to the OS via os.path.samefile. Otherwise you need special
knowledge. I do not know if there’s a PyPI package with that knowledge.

Cheers,
Cameron Simpson cs@cskk.id.au

Kentzo · December 15, 2022, 6:07am

I’m hoping that someone already did that. There are many combinations, but not that many.

I tried a couple of open source projects that deal with path strings, but they seem to hand responsibility to the user (e.g. rsync). Perhaps I should try open source file managers; maybe mc has something like that.

Rosuav · December 15, 2022, 6:37am

It’s worth noting that os.path.samefile() isn’t quite exactly what the OP requested, although it may well be a lot of what’s needed. As well as needing separate handling for the case where the file doesn’t exist (which you cover in the rest of the message), it has the limitation that it will return True for two different names for the same file:

rosuav@sikorsky:~/tmp$ touch spam
rosuav@sikorsky:~/tmp$ ln -s spam ham
rosuav@sikorsky:~/tmp$ ll spam ham
lrwxrwxrwx 1 rosuav rosuav 4 Dec 15 17:32 ham -> spam
-rw-r--r-- 1 rosuav rosuav 0 Dec 15 17:32 spam
rosuav@sikorsky:~/tmp$ python3 -c 'import os; print(os.path.samefile("spam", "ham"))'
True
rosuav@sikorsky:~/tmp$ mkdir spamdir
rosuav@sikorsky:~/tmp$ ln -s spamdir hamdir
rosuav@sikorsky:~/tmp$ touch hamdir/nom
rosuav@sikorsky:~/tmp$ python3 -c 'import os; print(os.path.samefile("spamdir/nom", "hamdir/nom"))'
True

Other potential ways to have the same file visible with two names: hardlink the file, hardlink the directory the file’s in (if you’re masochistic and willing to create some fun nightmares), bind mount the directory, and possibly even remote-mounting your own file system (eg sshfs to localhost).

Whether these are a problem or not is up to the OP - does this have to be lexical or should it be based on the actual FS?

Kentzo · December 15, 2022, 7:18am

Ideally the solution should be purely lexical, since os.path.samefile is unfeasible for [2]. I think being able to tell whether the underlying FS is sensitive to unicode normalization would almost solve the issue for me (leaving the part when it cannot represent UTF-8 charset without surrogates).

Rosuav · December 15, 2022, 7:37am

Unfortunately there won’t just be a single “underlying FS” if there are any network mounts, so it’s entirely possible that this cannot be done purely lexically. But if you’re prepared to accept some approximations, it should be possible to query the current directory and/or the root file system to find out what their rules are. Not sure whether it’s going to really help though.

steven.daprano · December 15, 2022, 8:59am

You can’t tell what file names will be considered identical unless you know what file system and OS they are on, and what file system flags are used, and the history of the storage device.

File systems use different rules and the combinations are seemingly endless. The only way to be sure whether two file names are the same is to ask the file system.

HFS+ normalises file names to Unicode normalisation form NFD, preserves case, but performs case insensitive comparisons. So the file name Cafe with an accent on the e will be normalised to UTF-8 b’Cafe\xcc\x81’ regardless of whether the user specified Cafe\u0301 or Caf\u00E9.

APFS does not normalise file names. The file system itself will happily treat Cafe\u0301 and Caf\u00E9 as different files even though to the user they look identical.

So Apple added a normalisation layer to macOS that does the normalisation before the file name reaches the file system, and that usually works fine, except when it doesn’t, and then you can get two seemingly identical file names differing only in a byte pattern that is invisible to the user.

Try this:

import unicodedata

a = 'Caf\u00E9'
b = 'Cafe\u0301'

assert unicodedata.normalize('NFD', a) == b
assert unicodedata.normalize('NFC', b) == a

with open(a, 'w') as f:
    f.write('NFC form')

with open(b, 'w') as f:
    f.write('NFD form')

On Posix file systems like ext4, you end up with two identical-looking filenames with different content. Other Posix file systems may normalise the filenames, and you may end up with just one file.

On HFS+ you will end up with one file containing ‘NFD form’.

I don’t know what you get on NTFS.

On APFS in theory you should get one file, but there are ways to get past the OS normalisation layer and write directly to the file system, in which case you can get two files.

Yes, this is a mess.

To answer your question, you will need to find an Apple file system expert and ask them, but I think the answer is that you need to:

1. normalise both pathname strings to ‘NFD’;
2. then normalise the pathnames using os.path.normpath;
3. then do a case-insensitive comparison.

Note that os.path.normcase may not be sufficient for that last step!

If you do those three steps, that will (hopefully!) do what you want, most of the time, except when it doesn’t.
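The three steps above can be sketched roughly as follows. This is only an illustration of the recipe, not a verified rule for any particular filesystem: the helper name probably_same_path is made up for this example, and using str.casefold() for the case-insensitive step is an assumption (os.path.normcase or str.lower() may be closer to what a given filesystem does).

```python
import os.path
import unicodedata


def probably_same_path(a: str, b: str) -> bool:
    """Lexical guess only; this never touches the filesystem."""

    def key(path: str) -> str:
        path = unicodedata.normalize("NFD", path)  # step 1: one Unicode form
        path = os.path.normpath(path)              # step 2: collapse '//', '.', '..'
        return path.casefold()                     # step 3: assumed case rule

    return key(a) == key(b)
```

With this sketch, probably_same_path("/tmp/Jalape\u00f1o", "/tmp/Jalapen\u0303o") is True even though the two strings compare unequal.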

You can test the case-folding rules by trying to create a file called lowercase ‘ss’, and another file called ‘\N{Latin small letter sharp s}’ and see if the OS will treat them as the same.

If they are treated as different files, then (probably) normcase will be sufficient. But if APFS treats them as the same file, you may need to use str.casefold() to do the comparison. (And you probably should report that as a bug as well.)
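A rough way to run that experiment from Python is sketched below. The function name folds_sharp_s is invented for this example, it needs write access to the directory being probed, and the answer only describes the filesystem that directory happens to live on.

```python
import os
import tempfile


def folds_sharp_s(directory: str) -> bool:
    """Return True if 'ss' and the sharp s name the same file under `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        first = os.path.join(scratch, "ss")
        second = os.path.join(scratch, "\N{LATIN SMALL LETTER SHARP S}")
        with open(first, "w"):
            pass  # create the first name only
        # If the second spelling already "exists", the filesystem folded
        # it onto the first one.
        return os.path.exists(second)
```

If this returns False, os.path.normcase is probably enough for that filesystem; if it returns True, str.casefold() is the closer match.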

If Apple supports filename internationalisation, then both normcase and casefold will give the wrong results on Turkish systems.

And I know absolutely nothing about the rules for other languages, especially non-Latin based languages like Korean, Japanese, Greek or Russian.

Honestly, the only way to be absolutely sure the file names point to the same file on disk is to ask the file system, but you might get 95% of the way with the above tricks.

And one last thing: I am not a Mac expert, and I don’t have a modern Mac to try it on, so everything I have said here might be wrong.

Kentzo · December 15, 2022, 6:51pm

They will point to the same file. IIRC APFS preserves user-supplied case and normalization but does not allow multiple normalization variants to co-exist (multiple case variants may co-exist depending on the FS configuration).

From the perspective of my application it’s as irrelevant as an attempt to counter cosmic rays. My only concern is whether open(2) would open the same file on disk.

I’m not convinced I would, though. This dance is done without checking the properties of the underlying filesystem that the path points to. In that regard Python’s os.path is misleading, although the documentation corrects that.

What I need, I believe, is an API that would return a tuple of (normalization, case) that is relevant from the perspective of open(2) (and related POSIX APIs), where normalization is a Unicode normalization identifier, the special keyword “irrelevant”, or None (when it cannot be determined), and case is True, False or None (when it cannot be determined).

Rosuav · December 15, 2022, 7:31pm

It would need to do that for every directory level independently. When different file systems are mounted - especially networked file systems - they can have vastly different behaviours.

Kentzo · December 15, 2022, 8:06pm

In my application only the local filesystem is relevant. Links, mounts and special devices are not followed and are treated as individual files.

steven.daprano · December 15, 2022, 9:21pm

I don’t understand your objection. I gave you a recipe that will tell you whether two filenames on APFS point to the same file. Isn’t that what you want? Have I misunderstood your requirements?

And I don’t understand this requirement at all.

You’ve said what the API returns, but not what it takes as argument, or how it determines the return values.
What do you plan to do with that information?

Kentzo · December 15, 2022, 9:55pm

I’m not limiting the filesystem to APFS; it was given as an example.

The API I’m seeking needs to either answer, or provide a description sufficient to compute an answer for, “whether two differently coded path strings point to the same file from the perspective of open(2)” on a generic (ideally) filesystem.

It appears to me that knowing normalization and case sensitivity (regardless of preservation) is sufficient to answer that:

If I know that normalization is irrelevant then I can normalize both output of os.listdir and user supplied string to a form of my choosing
If I know that FS prefers one specific normalization then I normalize user-supplied string to that and compare it directly to the output of os.listdir
If the API fails to provide this description, then I can proceed without normalization hoping that the user supplied the path string in the right form.
Similarly for case-sensitivity.

The description of the FS is preferable over direct path comparison, because I want to support fnmatch-like filters.
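To make the idea concrete, here is a minimal sketch of how such a description could drive both plain comparison and fnmatch-style filtering. Everything here is hypothetical: FsRules, canonical and matches are names invented for this example, and nothing in the standard library produces such a description. Using NFC as the "form of my choosing" and str.casefold() for case-insensitive filesystems are also assumptions.

```python
import fnmatch
import unicodedata
from typing import NamedTuple, Optional


class FsRules(NamedTuple):
    """Hypothetical filesystem description, as proposed above."""
    normalization: Optional[str]    # "NFC", "NFD", ..., "irrelevant", or None if unknown
    case_sensitive: Optional[bool]  # None if unknown


def canonical(name: str, rules: FsRules) -> str:
    if rules.normalization == "irrelevant":
        name = unicodedata.normalize("NFC", name)  # any single form will do
    elif rules.normalization is not None:
        name = unicodedata.normalize(rules.normalization, name)
    # When normalization is None we leave the string alone and hope.
    if rules.case_sensitive is False:
        name = name.casefold()
    return name


def matches(name: str, pattern: str, rules: FsRules) -> bool:
    # fnmatchcase does no case normalisation of its own.
    return fnmatch.fnmatchcase(canonical(name, rules), canonical(pattern, rules))
```

One caveat with this sketch: normalizing a pattern to a decomposed form can change the meaning of a character class such as [é], so patterns containing non-ASCII characters inside brackets would need more care.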

cameron · December 17, 2022, 12:41am

I’m not limiting filesystem to APFS, it was given as an example.

Disclaimer: most of the below is written from a UNIX/POSIX point of
view; the situation on Windows is more complex because it has multiple
APIs with open(2) type calls in them, which have different filename
rules. Some of that complication is historic as Windows evolved. Also,
those APIs (IIRC) take strings rather than bytes (UNIX). So, to UNIX…

The trouble is that even for purely local filesystems (does that
include a plugged in external drive, which might have almost anything
on it?), this is a bit tricky.

You do need to know the rules for the specific filesystem in play for
the paths you’re using, and as mentioned (by Steven?) if your path
crosses a mountpoint you need to apply the appropriate rules on either
side of the mountpoint.

You can’t do that in a purely lexical fashion, unless by “lexical”
you’re prepared to allow “lexical string analysis, augmented by knowing
the mount points and associated filesystems and their rules”. Which
isn’t all that bad, because you can read the output of the mount(8)
command to get that list, then do purely lexical stuff from there on
with that knowledge. Um, and an os.getcwd() if you’ve got a relative
path, or just use os.path.abspath which does that for you.
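As a rough illustration of the lexical part, the sketch below picks the mount point that governs an absolute POSIX path. It assumes you have already collected the list of mount points by some other means (for example by parsing mount(8) output, which is not shown), and owning_mount is a name made up for this example.

```python
import posixpath
from typing import Sequence


def owning_mount(path: str, mount_points: Sequence[str]) -> str:
    """Return the longest mount point that is a component-wise prefix of `path`."""
    path = posixpath.normpath(path)
    best = "/"
    for mount in mount_points:
        mount = posixpath.normpath(mount)
        # Compare whole components so that /Volumes/USB does not
        # claim /Volumes/USBstick.
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if len(mount) > len(best):
                best = mount
    return best
```

Note that this compares the strings exactly, so it inherits the very problem under discussion: the mount point names themselves are subject to the rules of their parent filesystem.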

The API I’m seeking needs to either answer or provide description sufficient to compute an answer for “whether two differently coded path strings point to the same file from the perspective of open (2)” on a generic (ideally) filesystem.

I presume you mean the OS open(2) system call above, to which Python’s
os.open should be a shim.

It appears to me that knowing normalization and case sensitivity (regardless of preservation) is sufficient to answer that:

For UNIX/POSIX, this is probably so. With the caveat about mount points
above. And some more constraints which I’ll get to below.

I think I would be inclined to use pathlib to get your platform’s
Path flavour, or os.path.split to do the same. Then work with each
path component according to the filesystem rules for that step in the
path.

If I know that normalization is irrelevant then I can normalize both output of os.listdir and user supplied string to a form of my choosing

Strictly speaking, for UNIX you need to convert the string to bytes
because the open(2) system call takes bytes - it’s a C string, which
places some constraints (really just that it cannot contain a NUL
byte), but they’re still bytes.
That requires a convention for encoding filename strings to bytes. For
MacOS, that encoding should be UTF-8 in normal form D. For other less
formal UNIXen that encoding depends on the locale in play for the
particular process doing the work; this is because the filesystems do
not have an official encoding - they’re just bytes!

So really, your criterion is how the bytes compare.

A traditional pure UNIX filesystem does no normalisation beyond
coalescing adjacent '/' bytes (the path separator) - then you just
compare bytes.
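For that traditional case the comparison could look something like the sketch below. The helper name posix_bytes_key is invented here; os.fsencode applies Python's filesystem encoding to a str and passes bytes through unchanged. The sketch ignores '.', '..' and symlinks, and it also collapses a leading '//', which POSIX allows implementations to treat specially.

```python
import os
import re
from typing import Union


def posix_bytes_key(path: Union[str, bytes]) -> bytes:
    """Bytes as open(2) would see them, with runs of '/' coalesced."""
    raw = os.fsencode(path)           # str -> bytes; bytes are returned as-is
    return re.sub(rb"/+", b"/", raw)  # the only normalisation a plain UNIX fs does
```

Under this rule the NFC and NFD spellings of the same name encode to different bytes, so they are simply different names.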

Also keep in mind that most filesystems have limits on the length of both
the overall pathname and the individual filename components of the path.

For example, I grew up on UNIX V7, where filename components were a
maximum of 14 bytes long (a dirent was 16 bytes long with 2 bytes for
the inode number). So abcdef_ghijkl_01 and abcdef_ghijkl_02 would
be colliding filenames (you’d just get abcdef_ghijkl_ after you made
the file).

On modern POSIX systems I believe you typically get something like 255
bytes for the filename components and at least 1024 bytes for the full
pathname, and there are ways to query those limits for the local platform.
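One way to query those limits from Python is os.pathconf, which is only available on UNIX-like platforms. The wrapper below is just a sketch: name_limits is a made-up name, and the configuration names used are the ones commonly present in os.pathconf_names on Linux and macOS.

```python
import os
from typing import Optional, Tuple


def name_limits(directory: str = "/") -> Optional[Tuple[int, int]]:
    """Return (max component bytes, max path bytes) for `directory`, if known."""
    if not hasattr(os, "pathconf"):
        return None  # e.g. on Windows
    name_max = os.pathconf(directory, "PC_NAME_MAX")
    path_max = os.pathconf(directory, "PC_PATH_MAX")
    return name_max, path_max
```

The values are per directory, so a path that crosses a mount point may be subject to different limits on each side.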

A case insensitive filesystem will presumably downcase the bytes (by
interpreting the bytes as some kind of “text”, possibly mere ASCII or
better some Unicode encoding) before comparing bytes. You need to know
that rule, whatever it is. You can probably infer it from the mount
table from the filesystem type and options.

For added fun, inside the OS it almost certainly does not know your
personal locale (i.e. the encoding used to convert str to bytes in the
system call) and instead will be using the filesystem’s mount options to
derive that, if that is an option at all.

If I know that FS prefers one specific normalization then I normalize user-supplied string to that and compare it directly to the output of os.listdir

Hahaha! If only it were that easy!

os.listdir returns different things depending on whether you supply a
str or a bytes object for the directory pathname.

For a bytes directory path, you’ll get a list of the filenames in raw
bytes form. If you know the fs rule above, you can (a) convert your
source path to bytes correctly and (b) compare the bytes from
os.listdir using the fs’ comparison rule. That is probably the most
reliable approach.

If you use a str with os.listdir the raw bytes names get decoded
using sys.getfilesystemencoding() with the surrogateescape
error handler for bytes which don’t decode cleanly using that encoding.
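A small sketch of the bytes-first approach: list the directory with a bytes path, and decode each name with os.fsdecode only when a str is needed for display. The function name names_with_raw_bytes is invented for this example.

```python
import os
from typing import List, Tuple


def names_with_raw_bytes(directory: str) -> List[Tuple[bytes, str]]:
    """Pair each raw filename with the str that os.listdir(str) would give."""
    raw_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out
    # os.fsdecode uses the filesystem encoding and its error handler, so
    # os.fsencode(text) gives back the original bytes.
    return [(raw, os.fsdecode(raw)) for raw in raw_names]
```

Comparing on the raw half of each pair avoids any dependence on the locale of the process doing the work.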

If the API fails to provide this description, then I can proceed without normalization hoping that the user supplied the path string in the right form.

Similarly for case-sensitivity.

Fingers crossed.

You can probably write some tests with example filenames which
should/should not collide and try making those names on various
filesystems to validate how well this approach works.

The description of the FS is preferable over direct path comparison,
because I want to support fnmatch-like filters.

Ok.

This complexity is why some of us prefer to use samefile() when that
is feasible - it punts the whole thing to the OS which inherently does
whatever it does.

When that doesn’t work, then we might try to emulate what should
happen.

You can go some way towards accommodating collisions by using open()
modes which fail if the target path already exists, which may help you
avoid your problems, depending on your needs.
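In Python that is the "x" (exclusive creation) mode of open(). A minimal sketch, with create_new as a made-up helper name:

```python
def create_new(path: str) -> bool:
    """Create `path` only if nothing already answers to that name."""
    try:
        with open(path, "x"):  # exclusive creation: fails if the name exists
            return True
    except FileExistsError:
        return False
```

Because the existence check is made by the filesystem itself, a differently normalised or differently cased spelling of an existing name should also be refused on filesystems that treat the spellings as equal.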

Cheers,
Cameron Simpson cs@cskk.id.au