Should glob.glob() output be stable?

The Python glob module makes no such promise, despite the impression you gave. I don’t know whether the docs have changed recently, but they now address your concern precisely. Simply read to the end of the sentence:

The Python glob module promises: “The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.”
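
Given that wording, the defensive habit is to sort the result yourself whenever order matters. A minimal sketch, assuming a throwaway temporary directory and a hypothetical helper name `stable_glob` (neither comes from the thread or from the glob module):

```python
import glob
import os
import tempfile


def stable_glob(pattern):
    """Hypothetical helper: glob matches in a deterministic (sorted) order."""
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Create the files in a deliberately non-alphabetical order.
    for name in ("b.txt", "a.txt", "c.txt"):
        open(os.path.join(tmp, name), "w").close()

    # glob.escape() protects against special characters in the directory name.
    pattern = os.path.join(glob.escape(tmp), "*.txt")

    # glob.glob() may return these in any order; sorted() pins the order down.
    names = [os.path.basename(p) for p in stable_glob(pattern)]
    print(names)
```

The sort is plain string comparison on the full path, which is enough to make repeated runs agree with each other.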

The clarification about arbitrary order has been in the documentation for almost 10 years now: Issue #25615: Document unsorted behaviour of glob; patch by Dave Jones · python/cpython@9f3c094 · GitHub

Hi Paul, documentation needs to be available, but I know from the documentation of Electronic Design Automation (EDA) tools that documentation can cover the item you are looking for while you still lack the adjacent knowledge needed to understand what is written. EDA tool vendors employ engineers with that knowledge to train customers’ engineers, and it has become normal for me to find that previously impenetrable documentation makes sense after learning a bit more.

I know Python, but would not be surprised if the same thing happens to those new to a particular intricacy of Python - they can and probably do read the documentation, but do not have enough knowledge for details to register.
That is why I tried to explain a mindset that has helped me in some instances: trying to think of what my algorithm needs that I might be taking for granted.

Sort order in supplied data has tripped me up before (an EDA supplier shoehorned an extra feature into a subsequent release, and the implementer did a sub-sort for just that feature’s contribution to the data, when previously the documentation, and the implementation, provided a sort over the data from all features). Once bitten, and not liking the debug, I am more cautious/defensive in programming and reading docs. Hopefully I’ve reduced my error rate 🙂

And it’s the first sentence.

“Rules used by the Unix shell” is a bit outdated. Who uses Unix rather than Linux? And which shell? sh, csh, bash, dash, ksh, zsh …

I think documentation helps if you’re specifically looking to confirm whether the results are ordered, but you need to keep the question “can I rely on the order?” in the back of your mind for that. And that question, not necessarily only for glob.glob but for everything involving a set of results, starts bothering you when you gain more experience (or when you get bitten by it once). It seems unavoidable to me no matter how many disclaimers you add to the documentation. Programmers sometimes assume things that turn out to be wrong. And they learn from those mistakes and become better programmers.

The pull request is merged.

I was curious about the history of glob: what it is, why it is called that, and how it came to be.

It is a Unix tool, originally a separate command for global pattern matching. It used to live at /etc/glob and was called by the shell whenever it needed expansion: https://utcc.utoronto.ca/~cks/space/blog/unix/EtcGlobHistory

The earliest source code that I could find was from Unix V2 from 1972: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V2/cmd/glob.c

Even that version already states in the source code: “find all files in current directory which match the param, sort them, and use them”

Python chose to call the module glob, but it implemented just the pattern matching part. The full implementation includes sort.

I think users who reach for this module usually come from a Unix background, and Unix has taught them for more than 50 years to expect the output of expansion to be sorted.

Following the “Principle of least surprise”, glob output should be sorted.

My question is: could it be Python that is wrong, and not the users?

Interesting background - thanks.

However, I think it’s overly simplistic to think about this in terms of right and wrong, just as it is in terms of black and white.

Python made a different design choice.

It traded off a slightly reduced feature set, in favour of efficiency. Python has many easy ways of subsequently sorting the results. Does Unix?
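
To illustrate the “many easy ways” point, here is a small sketch in which the caller picks the ordering after the fact. The file names and sizes are made up for the example; only `glob.glob`, `glob.escape`, `sorted`, `str.lower` and `os.path.getsize` are real library calls.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files with different sizes (in bytes).
    for name, size in (("B.log", 30), ("a.log", 20), ("c.log", 10)):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"x" * size)

    paths = glob.glob(os.path.join(glob.escape(tmp), "*.log"))  # arbitrary order

    # Plain string order: upper-case letters sort before lower-case ones.
    by_name = [os.path.basename(p) for p in sorted(paths)]

    # Case-insensitive order.
    by_name_nocase = [os.path.basename(p) for p in sorted(paths, key=str.lower)]

    # Smallest file first.
    by_size = [os.path.basename(p) for p in sorted(paths, key=os.path.getsize)]

    print(by_name, by_name_nocase, by_size)
```

Because the ordering is chosen at the call site, code that does not care about order pays nothing for it.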

I’ll be happy to see support for glob.glob(sorted=True) added though.

I did a search and found this description of Arch Linux glob that explicitly states there is no sorting.

This command performs file name “globbing” in a fashion similar to the csh shell or bash shell. It returns a list of the files whose names match any of the pattern arguments. No particular order is guaranteed in the list, so if a sorted list is required the caller should use lsort.

If many Linux implementations don’t sort, then that hardly adds to the case for Python sorting.

But then I found GLOB_NOSORT for “Linux”, so it is confusing!?!

       GLOB_NOSORT
              Don't sort the returned pathnames.  The only reason to do
              this is to save processing time.  By default, the returned
              pathnames are sorted.

Interesting. So is the POSIX standard perfectly happy if glob results are unsorted?

Re: POSIX. No. Besides the GLOB_NOSORT opt-out, the sort order itself follows “the collating sequence in effect in the current locale”.
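
Python’s glob does no sorting at all, locale-aware or otherwise. A caller who wants something like the locale-dependent ordering described above could apply it afterwards. The sketch below is only an approximation of that idea: `collate_sorted` is a hypothetical helper name, and the resulting order depends on whichever locale is active on the machine.

```python
import locale


def collate_sorted(names):
    """Hypothetical helper: sort names using the current LC_COLLATE locale."""
    return sorted(names, key=locale.strxfrm)


try:
    # Adopt the locale from the user's environment, as a shell would.
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    # Fall back to the portable "C" locale if the environment's is unavailable.
    locale.setlocale(locale.LC_COLLATE, "C")

names = ["b.txt", "A.txt", "a.txt", "B.txt"]
print(sorted(names))          # code point order: upper case before lower case
print(collate_sorted(names))  # order depends on the active locale
```

This also shows why “sorted” is not a single well-defined behaviour: two machines with different locale settings can legitimately disagree on the second result.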

Python has its own history at this point. UNIX history is to Python history as the history of the Red Delicious apple is to the history of the Honeycrisp.[1]

There have also been 50 years of computing history, during which globbing has been implemented several times with different behaviors in several tools. And in that time, filesystems have changed pretty significantly as well.

It is incorrect to extrapolate from the fact that “things were sorted in UNIX in 1972” to the claim that “things have been sorted for 50 years across all *nix environments”.

Python has a 35-year history of not sorting glob.glob.

Is this really the argument you want to make anyway? Do the users who are expecting stable output have that expectation because they’ve used and relied on other tools with sorted output and never used any, excepting Python, which do not sort?
I think it’s a lot more likely that people make this mistake because they simply haven’t thought about it yet; they don’t even realize they are making an assumption.

Following the UNIX principle of “do one job well”, glob output should not be sorted.

Principles can be genuinely useful, but they are not an argument in their own right. They usually encode a number of assumptions.

You are assuming that sorted output is “less surprising”. I would find it surprising if glob.glob started spending time doing sorting that I do not expect or want.
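
A rough sketch of that trade-off, using made-up file names: `glob.iglob` returns an iterator, so a caller who only needs some match never has to collect the full result, whereas any sorted result requires gathering every match first.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.dat", "a.dat", "c.dat"):
        open(os.path.join(tmp, name), "w").close()
    pattern = os.path.join(glob.escape(tmp), "*.dat")

    # Caller A only needs some match; the iterator is never exhausted.
    any_match = next(glob.iglob(pattern), None)

    # Caller B needs a reproducible order and pays for the sort explicitly.
    ordered = [os.path.basename(p) for p in sorted(glob.iglob(pattern))]

    print(os.path.basename(any_match), ordered)
```

With sorting left to the caller, both usages stay cheap for what they actually need.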

I urge you not to say “right/wrong” simply because it results in yes/no answers.

Are the users right? No. The docs clearly stated that they are not. Their assumptions were wrong.

An unsympathetic reading ends there.

But, could Python do more to accommodate these users? Yes! We should be enthusiastic about finding a way to improve it if and when someone makes a good case for it.

Python, as a complete product, includes its documentation. And the docs have already gotten an adjustment in response to this thread. So we have improved Python already. (And kudos to the PR author and reviewers!)

Maybe there should be further refinement. But I’m not finding a convincing argument here. People make mistakes, and ideally they learn from them. Why is this mistake, which is addressed directly in the docs, special?


  1. Within the US, which grows a lot of apples, Red Delicious used to be ubiquitous. It is very red, but it’s not all that delicious. Honeycrisp, a breed which first started to appear in the 1990s, has, along with a few other varieties, almost completely displaced the Red Delicious.

An even less sympathetic read here is that we’re discussing making every use of glob.glob slower, because an AI coding tool can’t read docs in a useful way.

If people are going to insist on using these tools, they should find a way to accurately inform the tools of the properties of the functions being composed, in a form those tools actually use and understand. That should happen outside the language, because nobody else needs this in some programmatic fashion[1]. The human-readable docs are for human readers, and have to be structured to cater to human use.


  1. Some people actually do need to be able to prove determinism, but they already aren’t using Python, and Python is never going to be the right language for people who need this.

Thanks to everyone who contributed to the discussion. I’m happy to have read what you all think and how you feel about these changes.

Kudos to the Python team for updating the docs!
