A Code Glitch May Have Caused Errors In More Than 100 Published Studies

(https://science.slashdot.org/story/19/10/12/1926252/python-code-glitch-may-have-caused-errors-in-over-100-published-studies)

Original study is behind the paywall. Does anyone know what was the core confusion about?

I don’t think there was some lack in documentation, but I’m curious if we could improve it to try to help avoiding such situations.

I think the issue is that they had a directory with files like:

item001.metadata
item001.data
item002.metadata
item002.data
...

And then they did something like:

for metadata, data in zip(glob("*.metadata"), glob("*.data")):
    ...

If glob returns filenames in sorted order, then this works fine. If it doesn’t return filenames in sorted order, then you could end up with item001.metadata being paired up with item002.data and vice-versa.

And apparently on some systems glob does sort the filenames, and on other systems it doesn’t. And the scripts were originally tested on a system where it does sort, so the developers didn’t notice the problem.

(The actual code is a bit more complicated; you can see it in https://pubs.acs.org/doi/suppl/10.1021/acs.orglett.9b03216/suppl_file/ol9b03216_si_002.zip )

Apparently the docs were fixed back in 2016 to explicitly note that sorting is platform-dependent; before that they were somewhat ambiguous: https://nvd.nist.gov/vuln/detail/CVE-2019-17514

I do wonder if we should change glob to always sort the returned filenames. glob already loads the whole list in memory, and the cost should be trivial compared to the cost of walking the filesystem to generate the glob results. Having the warning in the docs is a good improvement, but it’s even nicer to have a consistent API that doesn’t need a warning in the docs. And apparently the shell globbing feature that glob is based on does do unconditional sorting.

1 Like

It looks like this previously bit a few other people:

https://bugs.python.org/issue21748
https://bugs.python.org/issue30461

At the time, sorting was rejected because of concerns that it would add substantial overhead, especially for iglob.

I think this is just wrong… sorting is cheap compared to globbing. And looking at the source, even iglob always loads individual directory lists fully into memory (even when using scandir for the actual directory scanning). Fortunately, sorting individual directories is sufficient to ensure that the overall output is sorted, so even iglob can easily support sorting. In fact doing it this way leads slightly better sorting, because it gives a tuple sort on pathname components, instead of an ascii sort on fully rendered path strings, so files in the same directory are guaranteed to sort next to each other. And it’s cheaper, because the sorting becomes O(total-files * log(files-per-dir)), instead of O(total-files log(total-files)).

The one downside is that it would block us from, in the future, optimizing iglob to incrementally scan individual directories. I think this is a pretty minor optimization: it would only make a difference if you have an individual directory that directly contains millions of files/subdirectories, and in 28 years no-one has cared enough about that to fix it. (Most filesystems cannot handle this anyway.) And if someone really does need to support that for some reason, we could always add an unsorted=True kwarg later for those with exotic needs, while keeping the predictable, less-error-prone default.

1 Like

Maybe we can add a sort=False flag to glob? It would make it more obvious from the documentation that glob is not sorted by default (otherwise why’d you need that flag), but avoid the unnecessary performance loss if sorting is not needed (which is the majority of cases IMO).

I did some quick benchmarks, and AFAICT the worst case overhead of sorting (fast filesystem, everything already in cache, etc.) is on the order of ~4%.

Given that lack of sorting in glob is known to have caused very severe problems (each of those 100+ invalidated papers is probably like a year of someone’s life, $50k+ of funding, etc.), and consistent output is generally a convenient feature (e.g. one of the bpo issues is complaining about glob inconsistency causing spurious cache misses in a build service; sorted lists are just easier to read, etc.), I find out hard to justify worrying about a 4% slowdown. Especially since I can’t even think of a situation where glob would be the bottleneck. (Like, you’re probably going to do something with each of those files, right?)

5 Likes

I’m happy to work on this patch, if you haven’t done it already!

1 Like

Fine with me. Note that it’s not guaranteed that it will ultimately be accepted; none of the folks who shot this down last time have chimed in yet :-). For quick benchmarking I just replaced two calls to list with calls to sorted. For a full patch, you’ll also need some non-trivial tests, making sure to exercise all the different paths that glob/iglob can take internally, and checking tricky cases like where alphabetical-by-path-segment is different from alphabetical-by-path-string (e.g. we want ["a/b", "a.b", "aa"], not ["a.b", "a/b", "aa"], even though the latter is sorted as a set of strings).

1 Like

Honestly this sounds like a brittle approach, even if we add sorting to glob(). What if the .data file is present but the .metadata file is missing? It would seem safer to recommend a pattern where you iterate over one set and then find the other with a simple path manipulation, e.g.

for metadata in glob("*.metadata"):
    data = os.path.splitext(metadata) + ".data"
    assert os.path.exists(data)
6 Likes

Reproducible builds are hard. It’s not only about glob() order, there are many other issues: https://reproducible-builds.org/

Python still has multiple open issues where it’s “not reproducible”. I suggest to first fix these issues:

Once I proposed to modify a function to sort a list, but it has been decided to instead fix the code which uses the list. It was a list of C files to build a C extension if I recall correctly. The problem was that the list could be modified in the meanwhile. Pseudo-code:

files = glob("*.c") # sorting here isn't enough
files.append("extra.c")
build(files) # correct fix: sort here

There is also a trend for more reproducible research papers (programs used to compute results of these papers). Search for “reproducible research”.

Oh, I found the fix: https://github.com/python/cpython/commit/0d30ae1a03102de07758650af9243fd31211325a

See https://bugs.python.org/issue30461 “glob returns results in undeterministic order” discussion (hello to your best friend, glob!). See also https://bugs.python.org/issue21748 and https://bugs.python.org/issue25615

=> https://bugs.python.org/issue36302 fixed distutils instead.

Pull request and new issue both open. There’s no glob expert, so any review is appreciated!

1 Like

I agree with Guido and Victor here, fixing glob is a plaster on a much bigger problem, and I think it could actually make matters a little worse.

By sorting the glob results you could end up reinforcing the mistaken assumptions about reproducibility. What if down the line a glob() is replaced by a os.path.listdir()? Or the codebase switches to using pathlib? The latter doesn’t build lists of names, it uses scandir then filters filenames in a generator, so fixing that API would have a higher cost than adding sort to the glob() module, but we are now creating the expectation that glob() is something that produces sorted output. (Actually, I see that the scandir entries are pulled into a list, so these could be sorted, mea culpa).

I rather just explicitly document that any file listing-based data can’t be expected to be sorted.

1 Like

I’ve closed the bug and PR as won’t fix.

1 Like

I wonder how difficult would it be to implement recognition of relying on order of glob in Prospector.

I wonder whether it is worth to add a forced randomization in glob() and listdir(). My concern is that it can has a non-zero cost, and if make it optional and off by default, the most vulnerable to such bugs users are those who do not use special options and third-party tools for checking and testing.

if coinflip(): files.reverse() is just as good as full randomization for this purpose, and extremely cheap; much more so than sorting, which we already saw had negligible cost. You’d need to do it without advancing the global random module state, but that should be doable.

IMO this would definitely be an improvement over the current state; the thing where it’s sometimes consistent and sometimes inconsistent is a completely predictable source of bugs.

I don’t know why the discussion of sorting got cut off though, so my opinion is probably not worth much.

1 Like

I think that’s going too far. In particular os.listdir() ought to return items in the order they’re given by the filesystem, it should be a thin interface to the OS behavior. (Maybe someone wants to replicate behavior they see in C, in order to understand e.g. how the filesystem sorts things.)

For dicts we’ve gone away from forced randomization, so I don’t think we should add it lightly in other places.

I also still don’t think this single incident is enough of a reason to change glob(). Surely bugs in research software happen occasionally. The next bug is going to be something completely different. Basically this reaction feels like closing the stable door after the horse has bolted.

2 Likes

http://reproducible-builds.org/ provides tools which adds random on purpose. For example, there is a FUSE file system which randomizes listdir() order. Such tool doesn’t have to be in Python.

Maybe a Python specific implementation can be written, but I don’t think that it should be part of the stdlib, or at least, not enabled by default.

What about add in documentation that glob, os and pathlib does not ensure the result is sorted in any way?

It was already added for glob().