Treating symlinks consistently in `pathlib.Path.glob()`

Per the pathlib.Path.glob() docs:

By default, or when the follow_symlinks keyword-only argument is set to None, this method follows symlinks except when expanding “**” wildcards. Set follow_symlinks to True to always follow symlinks, or False to treat all symlinks as files.

The default behaviour (follow_symlinks=None) dates back several years (gh-70200), whereas support for follow_symlinks=True and follow_symlinks=False is new in 3.13 (gh-77609).

For comparison’s sake, glob.glob() always follow symlinks. I’ve recently documented the discrepancy. A feature request exists to support never following symlinks in glob.glob() (gh-73661).

Question: should we be working towards making follow_symlinks=True the default, and if so, what’s the best way to achieve that without causing too much churn for users?

In favour of changing the default to True:

  • It would be more consistent with glob.glob().
  • It would speed up some recursive patterns by enabling use of a filtered recursive walk.
  • (Arguably) it better meets user expectations.

In favour of keeping the default as None:

  • It avoids breaking anything for existing users.
  • (Arguably) it’s a good middle ground between following symlinks and exposing users to the worst effects of symlink loops.

I haven’t been able to make my mind up about this for several months now. Really appreciate anyone’s perspective on the matter! Thanks.

1 Like

The easy route is to not change the default. So the more interesting discussion is around how to change it with as little disruption as possible, why it makes sense to do so. After that you can decide if the amount of work over a long time period is worth tracking & doing for this.

my version of how:

(this could easily be defeated by my why below… but the how is illustrative regardless)

  1. Having pathlib.Path.glob() raise a DeprecationWarning when follow_symlinks= is not explicitly specified is a good way to warn owners of existing code that we’d like them to always explicitly add follow_symlinks= parameter. Goal: Existing users add an explicit follow_symlinks= parameter.
  2. As code explicitly setting follow_symlinks=None reads awkward (most people are unfamiliar with the API at this level of detail and would read that wrong and think it’d be the same as =False)… this also means we should: (see TODO on value naming below…)
  3. Add an additional direct behavioral spelling for the parameter. follow_symlinks=NonRecursive or similarly bikesheded name.
  4. Once (3) has shipped in a few releases, start the (1) deprecation cycle, so the default can change two releases later.

This is a long process. Longer than our minimum two-release deprecation cycle as described in PEP 387 – Backwards Compatibility Policy | peps.python.org. Why? So that people writing code running across a wide range of Python versions do not need to add conditional logic to alter their API calls. ie: A distro that ships with 3.13 can be assumed to be in use for 5 years, and the simplest scripts using whatever random os-distro-stale /usr/bin/python3 are not the kind of place people want to think about API versions. So not changing the default until a Python that’ll wind up in the latest long term supported distro 5-6 years from now is helpful.

If the change were more urgent it could be pulled in and done faster, but this default API behavior change doesn’t have an urgent feel to it. It’s a nice consistent thing to have at most.

TODO: Today 3.13alpha has follow_symlinks=None… lets change that to NonRecursive or similarly bikeshedded usefully named/defined behavior constant other than None. The key reason being that someone reading code explicitly specifying =None can’t understand what it does. I left a note about this on the issue.

my attempt at “why?”

This is where I’m unclear and not helping you answer… symlinks blindly being followed are notorious for leading to mistakes and security problems. From that perspective, I like the existing behavior or possibly even =False behavior as a default. Putting the onus on symlink support requirements as needing to be explicitly specified by the API user.

Consistency with glob.glob() is at least nice… But for me, if designing an API new, I’d call glob.glob()'s always symlink-following behavior introduced with ** support in 3.5 the undesirable default. (it could be changed via the same mechanism, and if that is going to happen, it should gain an equivalently spelled follow_symlinks= feature)

But this differs from what people might want or expect as a default. What do they compare recursive ** glob behaviors to? You mentioned zsh having a *** concept to explicitly ask for recursive subdirectory following. zsh does not default to recursive symlink following for **. I’d take that as a sign that for consistency we might want to move to match it. Research into other commonly used things supporting ** could balance or change that thought.

Feature wise, it’d also be entirely reasonable for us to gain *** support. Though it seems awkward to support both follow_symlinks= and *** support (if they conflicted would that be an error?). I like the new keyword argument, it is easy for the reader to understand.

2 Likes

I like the security point of view, and the *** idea.
If we have users who don’t want/need to think about a feature that’s uncommon and possibly even unsafe (symlinks), then it should be opt-in: we need a safe & less functional default (not following symlinks) and require an explicit argument (or ***) to change that.

So, API migration might be moot. But, I’ll comment on it anyway:

I’ve done something similar this with the filter argument to TarFile.extract (PEP 706): raise DeprecationWarning if it’s not specified (in 3.12 and 3.13, until the default changes in 3.14).

The user experience is pretty bad. They’re forced to think about something that might be an obscure edge case, or entirely irrelevant, every time they call the function. It should have been a setting with a good default and a knob for power-users, but now everyone needs to think about it and specify it (until they can drop 3.13 support).
For a security fix, I still think it was justified. But it’s not pretty.

One idea to soften the blow was to emit DeprecationWarning only when the old & new behaviour would have different results (in pathlib: when a symlink is actually encountered). But that’s not ideal: if the library author didn’t think about symlinks and didn’t test them, the DeprecationWarning appears too late – when some user runs the code on a complex directory tree. IMO, that’s worse than warning up-front.
(FWIW, in general it might make sense to warn when defaults prevent a symlink from being followed. I just don’t think it’s good for API migration.)

5 Likes

Thank you both, that’s a big help.

I agree that follow_symlinks=NotRecursive is much better than follow_symlinks=None. At a minimum I’ll make that change.

That confirms my fears! So forcing users to pass follow_symlinks for several years isn’t worth a small eventual consistency improvement.

I’ll test a few shells and get back to you.

I think supporting *** would entail removing follow_symlinks, otherwise it becomes difficult for users to understand the interplay. There’s still time to make this change, but it seems a shame to diverge from glob.glob()'s pattern language I guess.

One more idea: we could add two keyword arguments:

Path.glob(pattern, *, follow_symlinks=True, recurse_symlinks=False)

Not sure if that has legs?

1 Like

Not for me. I prefer the singleton/tri-state argument, for the same reason that it becomes hard to understand the interplay.

Where would the constant NonRecursive be defined?

This is not the first time we’re having a tri-state argument with values None, False and True; but maybe it’s the first where we’re trying to change the default? The two-keyword version has the problem that it doesn’t work with Python 3.12, thereby extending the needed deprecation period – otherwise I’d like it.

pathlib.glob() does not yet have a follow_symlinks= parameter in 3.12.
That’s a new 3.13 feature, thus having time to change the tristate.

Other pathlib APIs do have a follow_symlinks= parameter. I’m not sure NonRecursive would make sense on those.

But to avoid mistakes, probably a good idea to add a __bool__ to the NonRecursive that raises a meaningful explanatory error to avoid confusion of it being used in a boolean context like the other APIs.

1 Like

Some notes on shell globbing

IIUC shells always follow symlinks when expanding non-recursive patterns, like foo/bar or package/*.py.

Support for ** and *** wildcards is enabled via a globstar shell option in some shells. Support is as follows:

** ***
sh, bash 3 non-recursive non-recursive
bash 4.0 w/ globstar follows links non-recursive
bash 4.3 w/ globstar doesn’t follow non-recursive
zsh doesn’t follow follows links
tcsh w/ globstar doesn’t follow follows links
fish follows links? non-recursive?

pathlib’s default behaviour is similar to bash 4.3 w/ globstar, whereas glob.glob() is like bash 4.0.

1 Like

Given that shells always follow symlinks when not recursing, I think I’ve overcomplicated things by worrying about the non-recursive case. Perhaps all we need is a boolean recurse_symlinks argument that controls the behaviour of ** specifically? Defaults preserving existing behaviour:

glob.glob(pathname, *, recurse_symlinks=True)
pathlib.Path.glob(pattern, *, recurse_symlinks=False)

(edit: I’ve put this up as a PR: gh-117311)

That sounds reasonable, although I wonder whether there was a particular reason that bash changed to not following?

It looks like it was changed in response to a user report on the ‘bug-bash’ gnu.org list.

I’m conscious that 3.13 beta 1 is right around the corner, and that the current API isn’t very good, and so I’ll merge gh-117311 (i.e. replace three-state follow_symlinks with two-state recurse_symlinks) in a couple of weeks’ few days’ time if no one has any objections. Just a heads up. Thanks everyone.