Symlink resolution in starting script path

What is set as sys.argv[0]?

On some Unixes, true and false referred to the same executable file. It looked at argv[0] and chose behavior accordingly. It seems that this is not possible with Python scripts and symlinks.


What is set as sys.argv[0]?

A little experiment suggests that it is the original first argument, before symlink expansion. So you can play the games you mentioned.
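For anyone who wants to repeat that kind of experiment, here is a minimal sketch. The directory and file names (`real`, `links`, `target.py`, `alias.py`) and the helper `run_via_symlink` are made up for illustration, and creating symlinks on Windows may need extra privileges.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_via_symlink():
    """Run a script through a symlink and report what the child process saw."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        target_dir = root / "real"    # hypothetical directory holding the real script
        link_dir = root / "links"     # hypothetical directory holding the symlink
        target_dir.mkdir()
        link_dir.mkdir()

        script = target_dir / "target.py"
        script.write_text("import sys\nprint(sys.argv[0])\nprint(sys.path[0])\n")

        link = link_dir / "alias.py"
        os.symlink(script, link)  # may need extra privileges on Windows

        lines = subprocess.run(
            [sys.executable, str(link)],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        return str(link), lines[0], lines[1]


if __name__ == "__main__":
    given, argv0, path0 = run_via_symlink()
    print("passed on command line:", given)
    print("sys.argv[0]           :", argv0)
    print("sys.path[0]           :", path0)
```

Comparing the first two printed lines shows whether `sys.argv[0]` was rewritten; the third line shows which directory ended up at the front of `sys.path` on the platform where it is run.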


While doing bug hunting I played a game with JavaScript:

process.argv[1] = original input script argument
import.meta.url = seems to be some sort of real path, but URI-fied

The import.meta belongs to the new ES modules and is context sensitive.

Maybe this helps a little bit inform a decision about Python.

Disclaimer: So far only tested on Windows with Node.js v20.7.0.

I’m leaning towards properly defining this and making it reliable cross-platform, specifically:

When determining the initial contents of sys.path, and the launched script file is a symlink to a file in another directory, the parent directory of the target is used as the default search path (sys.path[0]) for the script rather than the directory containing the symlink. Links in other parts of the path are not relevant for this check, and only one link is followed (that is, a link A to a link B to a file C will use the directory of B, not C). The contents of argv[0] are not affected, and will contain the path as provided by the user.
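To make the wording above concrete, here is a rough sketch of the rule in plain Python. It is not the getpath.py implementation, and the function name `default_search_dir` is hypothetical. It checks only the final path component and follows at most one link.

```python
import os


def default_search_dir(script_path):
    """Directory that the rule described above would place in sys.path[0].

    Only the final path component is checked, and at most one link is followed.
    """
    script_path = os.path.abspath(script_path)
    if os.path.islink(script_path):
        target = os.readlink(script_path)
        # A relative link target is interpreted relative to the link's directory;
        # os.path.join leaves an absolute target unchanged.
        target = os.path.join(os.path.dirname(script_path), target)
        return os.path.dirname(os.path.normpath(target))
    return os.path.dirname(script_path)
```

With a link A pointing to a link B pointing to a file C, calling this on A returns the directory of B, matching the example in the text. The use of `os.path.normpath` is a simplification: it collapses `..` textually, which can differ from what the OS does when a parent directory is itself a symlink.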

It’s only a single check at startup, so I’m not concerned about the perf implications. Security-wise it might be possible to abuse, but it’s likely less vulnerable to privilege escalation than the current (Windows) behaviour.[1] I don’t think there’s a need to backport, so it would be new in 3.13 (though obviously the POSIX behaviour is unchanged; it just has a definition now).

Might also be a good opportunity to move the implementation into this part of getpath.py.

Any other thoughts/concerns?


  1. If you create a link to a script you can’t access, only someone with access can actually launch it, and when they do it’ll have its original dependencies and not the attacker’s. Compared to today, where a symlink could also substitute modules at runtime…


Yes, please. I am a proponent of deprecating symlink resolution and PYTHONPATH magic for scripts. I’ve run into this behaviour a couple of times, and although it’s pretty easy to work around most of the time, it doesn’t really follow Unix conventions or the principle of least astonishment.

While I wouldn’t say that it is actively “harmful”, it is definitely non-POSIX-y. Symbolic links are supposed to be the “soft” counterpart to hard links. Just like a hard link, a symlink is supposed to behave “as-if” the file was just copied under most circumstances. If you want to interact with a symlink as a symlink, you always have to do something “extra” (use a different, special syscall to inspect the symlink itself, actively resolve the path to get the target file, etc).

Notably, symlinks are NOT shortcuts. There is no reason for python ./symlink_to_target.py to act any differently from python ./copy_of_target.py. I think it’s pretty clear that, if we (temporarily) ignore the backwards compatibility angle, there is no good reason to have this special case.


Now, regarding how we could deprecate this behaviour – I’ll admit that it’s not going to be seamless for anyone relying on this behaviour (like any deprecation). Luckily, it should be fairly easy to incrementally deprecate the old behaviour.

  1. For starters, keep the old behaviour, but emit a warning. Make the new behaviour opt-in with a __future__ and/or a CLI flag and/or an environment variable.

  2. During the transition period, all instances where symlinks are used should either opt-in to the __future__ (if the script didn’t actively rely on the old behaviour), replace the script with a wrapper script (similar to the entry point scripts) or modify the script to add its resolved path to sys.path before doing anything else.

  3. After the deprecation period passes, make the new behaviour the default.

Optionally, we could include a simple way to explicitly opt-out of the new behaviour with something along the lines of import sys; sys.add_script_dir_to_path(). This would further simplify points (2) and (3).


We can’t make the behaviour opt-in/out with anything in the Python file, because we haven’t looked at it by this stage (unless we’re going to do something really clever with importers… which we could, but I suspect is not worth it).

So we can add a warning in the case where the script is a link, with an environment variable to suppress the warning, and then later remove it entirely.

Code that wants to include its own directory post-symlink can already sys.path.insert(0, str(Path(sys.argv[0]).resolve().parent)), or with 2-3 lines can more precisely match the current behaviour. Without a bunch of people jumping up and down saying they rely on this functionality and can’t change, I wouldn’t want to promote it to a supported sys function.
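Spelled out as a small helper, that workaround might look like the sketch below. The name `add_resolved_script_dir` and its two optional parameters are hypothetical and exist only so the function can be exercised without touching the real `sys.argv` and `sys.path`. Note that `os.path.realpath` follows every link in the path, so this is not an exact match for a follow-one-link rule.

```python
import os
import sys


def add_resolved_script_dir(argv0=None, path=None):
    """Prepend the fully resolved directory of the launched script to the path list.

    argv0 defaults to sys.argv[0] and path defaults to sys.path.
    """
    argv0 = sys.argv[0] if argv0 is None else argv0
    path = sys.path if path is None else path
    # realpath resolves all symlinks, including ones in parent directories.
    script_dir = os.path.dirname(os.path.realpath(argv0))
    if script_dir not in path:
        path.insert(0, script_dir)
    return script_dir
```

A script that relies on sibling modules next to its real location would call `add_resolved_script_dir()` before its other imports.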

As this discussion didn’t seem to result in any resolution, I figured I would provide an opinion and a practical example as someone who uses a lot of symlinks on my system.

What would require the least amount of 4D thinking would be treating all files via their virtual paths rather than “real” paths, i.e.:

import os
import sys

# Append the symlinked (virtual) directory that contains foo.py.
sys.path.append(os.path.normpath("C:/path/to/symlink/containing/foo/py/file"))
import foo
print(foo.__file__)

for path in sys.path:
    if foo.__file__.startswith(path):
        print("found sys.path element that lets us find module foo")
        break

should, in my opinion, ideally be guaranteed to always reach the final print statement. Anything else is confusing to me and requires a lot of extra magic to make work. The current behavior in 3.14 on 64-bit Windows is that the __file__ attribute returns the real path rather than the virtual path that was appended at the beginning of the script, which in my example means the final print statement is never executed.

This became a problem for me today as I was writing a program that needs to find which sys.path directory a loaded module is located in. As of writing, I’m not even sure how to resolve my situation, as I do not control how the module is loaded with my current API design (and doing so would bloat my code a bit and thus increase the amount of boilerplate necessary).
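One possible way around this, sketched here under the assumption that comparing fully resolved paths on both sides is acceptable, is to resolve each sys.path entry before matching it against the module’s `__file__`. The function name `find_sys_path_entry` is hypothetical, and it returns the first matching entry, which is not necessarily the one the import system actually used.

```python
import os
import sys


def find_sys_path_entry(module_file, search_paths=None):
    """Return the search-path entry under which module_file lives, or None.

    Both sides go through os.path.realpath, so an entry reached through a
    symlink still matches a module whose __file__ was resolved.
    """
    search_paths = sys.path if search_paths is None else search_paths
    real_file = os.path.normcase(os.path.realpath(module_file))
    for entry in search_paths:
        # An empty entry in sys.path stands for the current directory.
        real_entry = os.path.normcase(os.path.realpath(entry or os.curdir))
        try:
            if os.path.commonpath([real_entry, real_file]) == real_entry:
                return entry
        except ValueError:
            # Raised on Windows when the two paths are on different drives.
            continue
    return None
```

Usage would be along the lines of `find_sys_path_entry(foo.__file__)`. Using `os.path.commonpath` instead of `str.startswith` avoids false matches such as `/opt/lib` against `/opt/library/foo.py`.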

On the other hand, I can’t think of any benefits of resolving the real path (if anything, maybe it can result in slightly faster file reads?). Since symlinks are intended to behave like real files and folders, I would also like to think that changing from the current behavior (real paths) to virtual paths would not have any practical implications for existing scripts. To me it seems rather obscure that any program would be written in such a way that it expects the __file__ attribute to always point to a real file.


Are we talking about some of those 75 symlinks I have in ~/.local/bin? Not everybody uses packages for everything. Some of these are just symlinks to the git checkouts where the actual script is maintained (and not all of them are Python, true). Some of these are really tiny, not worthy of a package on PyPI. I think I am not the only one who has stuff like that in ~/.bin (which itself is a symlink to ~/.local/bin).
