Should glob.glob() output be stable?

My team recently found a bug while developing a new feature: we were passing the results of glob.glob() to a bioinformatics tool, and it turns out that the order of the files affects the results.

It doesn’t matter what the actual order is, but we need consistent ordering for reproducibility.

We found out that there were other glitches like this in papers assuming consistent ordering of results. See here: A Code Glitch May Have Caused Errors In More Than 100 Published Studies

I see that there’s already an issue reporting this inconsistency, and the outcome was a new piece of documentation: glob.glob should explicitly note that results aren't sorted · Issue #77456 · python/cpython · GitHub

We were using AI (Cursor) to generate this code and it also assumed consistent ordering.

My question here is: if so many people (and AI) expect this to be sorted, should we make it sorted by default?

Sorted in what order? Linux filenames are case-sensitive, but Windows filenames aren’t.
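
To make the ambiguity concrete, here is a minimal sketch using made-up file names (not taken from any real directory). It only shows that "sorted" already has two reasonable meanings once case is involved:

```python
# Hypothetical file names, chosen to differ only by case.
names = ["b.txt", "A.txt", "a.txt", "B.txt"]

# Plain sorted() compares code points, so uppercase letters come first.
case_sensitive = sorted(names)

# Folding case groups "A.txt" with "a.txt"; names with equal keys keep
# their input order because Python's sort is stable.
case_insensitive = sorted(names, key=str.casefold)

print(case_sensitive)    # ['A.txt', 'B.txt', 'a.txt', 'b.txt']
print(case_insensitive)  # ['A.txt', 'a.txt', 'b.txt', 'B.txt']
```

Neither result is wrong; they just answer different questions.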


I don’t have a really good answer for that, but my assumption would be the same ordering as the ls command.

There’s your mistake.

One thing AI does seem to be good for is checking whether something has already been suggested or discussed. I asked just now and it immediately found this (and others), which even talks about the same science case:


ls isn’t consistent either, since it uses the current locale.


Is it an example of “In the face of ambiguity, refuse the temptation to guess.”?


That was my mistake, I searched through the issues before posting here, but couldn’t find it

Though curiously the reasons given in the issue for making glob.glob() sorted still hold up. And the reasons for not changing anything amounted to “we can’t think of a way to do this that is unambiguously the best way to do it”. Also the arguments in favour are the same, down to linking to the same event in 2014 :eyes:

There’s also a performance issue for large directories. If we start sorting, everyone pays the cost, even if they don’t need sorted listings.


OK, so I guess the question is then, have you thought of a way to do this which is unambiguously the best way to do it?

I’m personally neutral on the question of whether we make this change, but I don’t think it’s reasonable to just dismiss the legitimate concerns expressed both here and in the linked issue. You’ve already said you don’t have a good answer to the question of case sensitivity, and “the same as ls” isn’t helpful, because (1) the ordering used by ls depends on what options you use, and as far as I know isn’t documented if you don’t use options, and (2) ls is a Unix utility, and so doesn’t help for Windows - and it has different implementations even on Unix (for example, GNU ls and BSD ls).


Many people have come and asked for glob to be stable/ordered. I wonder: if we apply some ordering by default, such as simply calling sorted() on the results, is anyone going to come and say that they prefer an unsorted version?

In that case we can always provide glob.glob(sorted=False).

I’m thinking as a user here, that I’d rather get stable results even if they’re not perfect than something unexpected


I think I would argue that it’s still worth doing even if you’re not 100% certain it’s the 100% best solution.

This linked issue was written in 2019, 6 years ago now. If we had sorted in a manner that’s sensitive to case, and 6 years later we discover that actually case-insensitive sorting would have been preferable to a significant fraction of users, I consider that to be not-a-big-deal.
Maybe I’m missing the harm that could be inflicted by a glob that is sorted in a manner that surprises some users.

We could have had stable glob ordering, and a sorting that is preferred by at least some people, for all of the last 6 years and all of the future.

The performance cost sounds almost trivial to me, if I may trust the authors of the GitHub issue: just 4% extra time needed for the globbing (which is usually a small part of any program).

In the[1] case that a significant number of people are adversely affected by making glob sorted, it would be possible to add a sort_method kwarg to glob, which is backwards-compatible and can give everyone the sort method they want (including no sorting). I don’t expect that would be worth the code complexity. But it does mean it is possible to recover from any “errors”.


  1. unlikely(?)

I don’t think it should be sorted, for various reasons. If you need determinism of order, you can always add it yourself with a call to a sort function.

+1 on more clearly documenting this given the known impact, though I’m disappointed that people have been relying on this having a stable order when it does not advertise that it provides this.


Yes! I know that’s probably a surprising answer, but it’s true (in addition to being a little bit intentionally provocative). I write a lot of unix-y scripts which do not care about ordering at all, and I would not like to pay a performance penalty.

I am open to the argument that I should use iglob if I care about this. But shouldn’t users with these other needs use list.sort()?

I’m not all that strongly against sorting the results, but in order to argue that the stdlib should change, I think a stronger rationale is needed. Users made a bad assumption. Bugs happen; they’re an inevitable part of creating software. Why not sort the list if you care about this?

I don’t think glob.glob should grow a sorted flag. Globbing and sorting are orthogonal, and the API surface should be kept clean. I think that principle holds even if it starts to sort internally (and, presumably, users like me are encouraged to switch to using iglob).
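
For scripts that do not care about order, the iglob route mentioned above might look like the sketch below. The helper name `count_matches` and the file names are made up for illustration. glob.iglob() yields the same values as glob.glob() without storing them all at once, and nothing here sorts.

```python
import glob
import os
import tempfile

def count_matches(pattern):
    """Count matching paths one at a time; order is irrelevant, so nothing is sorted."""
    return sum(1 for _ in glob.iglob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files, created only for this demonstration.
    for name in ("x.log", "y.log", "z.txt"):
        open(os.path.join(tmp, name), "w").close()

    # glob.escape() guards against special characters in the directory name.
    n_logs = count_matches(os.path.join(glob.escape(tmp), "*.log"))

print(n_logs)  # 2
```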


It’s a bad expectation and has never been true on any OS.

ls gets an unsorted list from the OS and, as a presentation option, sorts it lexicographically by default. There are ls options to stop it sorting and to sort by other properties of the files.

But the name sort order that ls does is not the same that you typically see in a GUI file manager. The GUIs tend to sort numbers in filenames in numerical order and not lexical order.

How do you sort these file names a1, a2, a10?

Like this, as GUIs often sort

a1.txt
a2.txt
a10.txt

or as ls will

a1.txt
a10.txt
a2.txt

???

This means that people who are used to the output of ls and dir will not agree on the sort order with people who are used to GUIs.
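
The two orderings above can be reproduced with a small sketch. `natural_key` is a hand-rolled helper written for this illustration, not a stdlib function and not what any particular file manager does; it only treats runs of digits as integers.

```python
import re

def natural_key(name):
    """Split a name into text and digit chunks so that 'a10' sorts after 'a2'."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd index holds a run of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

names = ["a10.txt", "a1.txt", "a2.txt"]

lexicographic = sorted(names)             # character by character
natural = sorted(names, key=natural_key)  # digit runs compared as numbers

print(lexicographic)  # ['a1.txt', 'a10.txt', 'a2.txt']
print(natural)        # ['a1.txt', 'a2.txt', 'a10.txt']
```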


I see your point. And I agree, I don’t think a specific “ordering” is better or more correct than the other since it depends on the use case.

I’ll change my question to: “Should glob.glob() output be stable?”

I think what’s important is that the output is the same given the same set of files.

import glob

def sorted_glob(*args, **kwargs):
    # Accepts the same arguments as glob.glob(); sorts the list in place.
    result = glob.glob(*args, **kwargs)
    result.sort()
    return result

IMO, this is a classic case of “not every 3 line function needs to be in the stdlib”. This version gives you full control for your use case:

  • ascending/descending
  • case (in)sensitive
  • locale aware
  • something like natsort if you want an “intuitive” sort order.

If you have any of those needs and glob is already sorted, you are paying twice. Adding all these options to glob would bloat the interface unnecessarily.
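
As a rough illustration of the first and third bullets, the sketch below sorts a made-up list standing in for a glob.glob() result. The locale-aware result depends on the machine's locale settings, so no expected output is given for it. Note that locale.setlocale() changes process-wide state.

```python
import locale

# Hypothetical names standing in for the list returned by glob.glob().
names = ["b.csv", "A.csv", "C.csv"]

# Descending order is just a keyword argument at the call site.
descending = sorted(names, reverse=True)

# Locale-aware collation: adopt the user's locale if it is available,
# otherwise keep whatever collation is already active.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass
locale_aware = sorted(names, key=locale.strxfrm)

print(descending)  # ['b.csv', 'C.csv', 'A.csv']
```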


I am against a change that imposes CPU costs to glob.glob().

When I care about an ordering I sort; when I care about performance I do not want unnecessary sorting forced on me.

I think it’s better to educate people that the output is not stable, even for the same set of files.


Thanks. That’s a much more reasonable question to ask. But the answer remains more or less the same: the OS doesn’t guarantee that you will get the results in a stable order (not least because you’re presumably asking for “stable across all systems”, if you want to avoid the sort of order-dependent bugs that you’re using to justify the request), and imposing a cost (no matter how small) on all users simply to protect users who failed to read the documentation is a bad trade-off. And having a sorted option to glob() is pointless, as it’s just as easy to say sorted(glob(...)) instead of glob(..., sorted=True) (and it’s a tiny bit shorter :slightly_smiling_face:).


What if the OS returned the files of a directory in the order that they were added to the directory, much like Python’s dict? There would be no guarantee that the same set of files would always be returned in the same order.

You’d then be back to Python imposing some order.
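
A small experiment along those lines, as a sketch only: the same set of (made-up) file names is created in two temporary directories in different orders. os.listdir() is documented as returning names in arbitrary order, so the raw listings may or may not match on a given filesystem. The sorted ones always do.

```python
import os
import tempfile

names = ["c.txt", "a.txt", "b.txt"]

with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    for name in names:            # created as c, a, b
        open(os.path.join(first, name), "w").close()
    for name in reversed(names):  # created as b, a, c
        open(os.path.join(second, name), "w").close()

    # Whatever order the OS happens to hand back for each directory.
    raw_first, raw_second = os.listdir(first), os.listdir(second)

    # Imposing an order in Python removes that dependence.
    stable_first, stable_second = sorted(raw_first), sorted(raw_second)

print(raw_first == raw_second)        # may be True or False
print(stable_first == stable_second)  # True
```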
