Strongly disagree with letting the symlinks point anywhere, or be created anywhere. At the very least, it should be restricted to the install root (i.e. site-packages or equivalent), but more likely ought to be restricted to files that exist within the package (i.e. are listed in RECORD).
(Aside: please specify that LINKS also has to be listed in RECORD, and that they are always created after all files have been extracted - that avoids the possibility of linking to /etc and then having files from the archive be extracted there under sudo pip install ....)
On Windows, I definitely do not want failure to be the default. I think it’s okay to specify that LINKS on some platforms may be implemented by hard links or copies, rather than actual symlinks, and users of them have to be prepared to live with that. AFAICT, that shouldn’t affect the primary use case for libs, and it allows installers the most flexibility to succeed in any context.
Thanks. Yes, that sounds like the sort of exploit that I’d missed. And presumably even if LINKS is processed last, you could link x to /etc, then link x/passwd to my_malicious_file. To fix that would require some pretty careful checking in the installer, and would not be something I’d want to rely on pip and uv getting right in all cases…
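To make that concrete, here is a sandboxed sketch of the chained-link escape (Unix-only; the temporary directory `outside` stands in for /etc, and all names are invented for illustration):

```python
import os
import tempfile

root = tempfile.mkdtemp()
outside = os.path.join(root, "outside")  # stands in for /etc
site = os.path.join(root, "site")        # stands in for the install root
os.mkdir(outside)
os.mkdir(site)

# Step 1: a LINKS entry creates site/x pointing outside the install root.
os.symlink(outside, os.path.join(site, "x"))

# Step 2: a LINKS entry for "x/passwd" resolves through that link, so the
# new symlink is actually created at outside/passwd -- the escape.
os.symlink("/dev/null", os.path.join(site, "x", "passwd"))

print(os.path.lexists(os.path.join(outside, "passwd")))  # prints True on POSIX
```

Note that a naive per-entry check that each link *path* stays under the install root would accept both entries here, which is why the careful checking mentioned above is hard to get right.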
This sounds reasonable, and the PEP should be clear that uses of the feature must be prepared to deal with cases when the symlink will be implemented with some other approach (a copy or hard link, or maybe a junction - we seem to have discounted junctions at this point, though). Even on Unix this might be the case (depending on filesystem). To put it another way, the LINKS file should be specified as a way to record that two names should be equivalent, and while the intended implementation is a symlink, this is not required by the PEP.
I’d be OK with installers being allowed to provide options which let the user choose how symlinks will be implemented. But I’d also suspect that installers may well decide that’s a case of YAGNI and not bother.
Oof, I missed that one. Creating links in order from longest path to shortest ought to handle it, but yeah, this is a huge can of worms. Requiring both link and target to exist within the install location is much easier.
Yeah, I very explicitly want to avoid the possibility of any chains of symlinks to be a path to malicious behavior.
I agree that LINKS should be listed in RECORD, and I already specify that links should be made after all other files are extracted, but if that is unclear perhaps I should clarify.
Also, I want to allow a small amount of flexibility in the extraction target, which is to allow extracting to anywhere in a namespace package if the source package is installing a namespace package. This would allow for limited but useful cross-wheel dependencies between libraries.
I’m asking the following because I want to make sure we’re precise in terminology here.
First of all, I’m not sure what you mean here by “extracting to”. We’re talking about LINKS so should I understand this as meaning allowing the source or the target of a link to be “anywhere in the namespace package”?
Also, can I just clarify “package” and “namespace package”? A wheel contains a distribution package, which is essentially a bunch of code that will get installed in site-packages. A namespace package is a type of import package, which is an importable thing visible on sys.path. Specifically, it’s a directory. Yes, that’s all, at this level - it’s an empty directory. A distribution package can contain one or many import packages (or zero, if you want pathological cases!)
With that in mind, suppose I have access to a wheel on PyPI. I modify it, to include the following:
An empty directory called requests
A file called malicious.py (see later…)
A LINKS file that creates a symlink from requests/certs.py to malicious.py
If that’s valid, I’ve now overwritten the module in the user’s copy of requests to replace the code that finds the CA certificate bundle used by requests.
An installer should detect the fact that I’m overwriting an existing file, but that’s just as true for the exploit I proposed above, where /etc/passwd was overwritten with a link to a malicious file. And in reality, I can right now just publish a package containing a file requests/certs.py. So this isn’t a new exploit, it’s mostly just demonstrating that limiting LINKS to namespace packages is no real improvement over allowing them to target anywhere in site-packages.
My instinct is[1] that requiring the source and target of links to be restricted to precisely the same locations as a wheel is allowed to extract files to, means we’re not introducing any new vulnerabilities. It may offer more subtle ways to hide such exploits, is all.
but last time I said this, Steve immediately posted a problem I’d missed, so bear that in mind! ↩︎
Sorry, I have a bit of PyCon exhaustion, what I meant to say is “I want to allow a small amount of flexibility, which is that the target of a symlink can point to anywhere under a package root of any package that it extracted into”
Yes, in this instance I mean a directory at the top level of site-packages. i.e. if I have namespacepkg.foo and bar as both packages that get installed from one wheel, I could have the target of the symlink point anywhere under either namespacepkg or bar.
Oh sure, people use symlinks for all sorts of things. But AFAIK they’re just a thing people use, not actually relevant to the specific things wheels need to do. For example:
On a Linux system, you’ll see lots of symlinks in /usr/lib, like libxml2.so.2 -> libxml2.so.2.9.13. What’s going on here is that the library name that users are expecting is libxml2.so.2 (the “SONAME”), because all 2.x.y releases are supposed to be compatible. But, the system assumes that you might want to have multiple compatible point releases of libxml2 installed at the same time. So each individual release installs its own libxml2.so.2.x.y, and then a separate process (ldconfig) comes along and looks at the set of installed versions and figures out which one of those you want to actually use (the one with the largest version number) and creates a symlink from the real name libxml2.so.2 to the one you want to use.
OpenBLAS can build for multiple different microarchs, so you again might have multiple files coexisting (libopenblas_skylakex_xxx.so, libopenblas_haswell_xxx.so, …), and then on a particular system you look at all the installed packages + the current CPU architecture and pick “the best one”, and use a symlink to make that one into libopenblas.so.
What these both have in common, though, is that the point of the symlink is to make a late decision about which library to use, looking at global context on the final system, that you can’t make when installing the .so files in the first place. But this PEP is a proposal for putting static symlinks inside individual wheels, so it doesn’t help with these situations — you have to pick which library gets to be libxml2.so.2 or libopenblas.so when you build the wheel, so it can’t depend on which other libxml2.so.2.x.y candidates you have installed or what your final cpu microarch is.
The other thing is that this is all ways of manipulating Linux’s runtime loader’s search-for-library-with-name logic, and when linking to .so files in wheels, you can’t use that logic anyway (or the equivalents on other OSes). There are two cases.
Case 1: you’re distributing the binary that uses the library + the library together in the same wheel. In this case you should be statically setting up your DT_RUNPATH (or equivalent) to point to the .so that you want, and you have full control over both sides of that connection and can set them up however you want, so adding an extra layer of indirection through a symlink doesn’t do anything for you – you could just point it to the right name to start with.
Case 2: you’re distributing the binary that uses the library in wheel A and the library in wheel B. In this case, you actually have no idea where the binary and .so will be located relative to each other in the final install; all you know is that they’ll both be on the python interpreter’s sys.path. There is no guarantee in the wheel spec that all packages will be installed into a single site-packages directory. And this is actually quite useful (e.g. my trick in posy where it assembles python environments on the fly on each invocation by unpacking each wheel to a separate directory + using a lock file to generate sys.path at startup). Or a more common situation would be editable installs, where you get random git checkouts placed directly on sys.path instead of having everything in site-packages together.
This is what my old pynativelib writeup is about. On Linux, this problem is actually pretty easy to solve correctly — e.g. if numpy.whl wants to use libopenblas from openblas.whl, you:
Give libopenblas.so a unique soname, maybe something like libopenblas-from-wheel-compatibility-api-v4.so
Put this soname into your numpy extension module binaries, and into the libopenblas.so you distribute (or all of them, if you have multiple and want to choose between them at runtime)
Have numpy do import openblas; openblas.setup() at the top of __init__.py, before it loads those extension modules
Have openblas.setup() do whatever runtime cleverness it wants to sniff the current microarchitecture, construct a path to an appropriate .so file, and dlopen("/path/where/openblas/wheel/is/unpacked/libopenblas_skylakex_xxxx.so")
The dlopen loads an entry into ld.so’s cache saying that requests for the soname libopenblas-from-wheel-compatibility-api-v4.so should be fulfilled by the library at /path/where/openblas/wheel/is/unpacked/libopenblas_skylakex_xxxx.so, then when you load the numpy extensions ld.so sees that they’re requesting libopenblas-from-wheel-compatibility-api-v4.so, sees that it already has an entry for that in its cache, and links them together, and away you go. I don’t think symlinks show up at any point in this process.
(This same strategy works on Windows; on macOS, the necessary tricks are much more complex. But on macOS AFAIK there aren’t any conventions for using symlinks for shared libraries so I don’t think it’s relevant to this thread in particular.)
I do totally get that this is confusing, the tooling isn’t there, and people don’t actually distribute libraries like this today. So instead people bake in the assumption that wheels will all be installed into a single site-packages directory, and set up their DT_RUNPATH to point to like, ../some-other-package/libs/, and then you have to make sure that the desired library has the correct soname, that its filename matches the soname, and that the file is located at ../some-other-package/libs, which all become part of some-other-package’s public interface contract. But (a) changing packaging standards through PEPs and deploying new standards across the ecosystem is very labor intensive so IMO if we’re going to do it we should spend that effort on full solutions instead of enshrining partial workarounds for the limitations of existing PEPs, (b) actually AFAIK nothing in what I just said relies on symlinks anyways, except to let you also expose the .so under a different name that’s not part of the public interface contract and that no-one uses?
Process note: I find these discussions end up less frustrating for everyone if the authors try to be overly-anal about adding all of these kinds of motivations to the PEP text, instead of only bringing them up in discussion where they tend to get lost after the conversation moves on, so please do add this :-).
Content: I’m not sure how to map this example into wheel-land… In wheels, executables aren’t represented as files at all, but as entry-points that generate tiny wrapper scripts to invoke the actual executable code. So how would your LINKS help here? And if pip install samurai wants to provide both samurai and ninja executables that do the same thing, can’t it do that already (and more portably) by having two entry points that point to the same place?
A junction is basically like a bind mountpoint on Unix systems. Here’s an example layout to demonstrate an important difference in path parsing between a junction and a symlink.
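A layout consistent with the evaluation results that follow (reconstructed from the paths discussed; file contents inferred):

```
C:\Path\To\A\
├── spam.txt         (contains "A spam")
├── B\
│   └── spam.txt     relative symlink -> ..\spam.txt
└── C\
    ├── spam.txt     (contains "C spam")
    ├── B_symlink    directory symlink -> ..\B
    └── B_junction   junction -> C:\Path\To\A\B
```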
Here’s the result of evaluating the relative symlink “spam.txt” that’s in directory “B”, depending on whether it’s accessed via “C\B_symlink” or “C\B_junction”.
>>> open('A/C/B_symlink/spam.txt').read()
'A spam\n'
>>> open('A/C/B_junction/spam.txt').read()
'C spam\n'
Evaluating “B_symlink” replaces the opened path in the kernel’s path parsing. Thus the relative symlink “B\spam.txt” gets evaluated as “C:\Path\To\A\B\..\spam.txt” → “C:\Path\To\A\spam.txt”[1]. OTOH, evaluating “B_junction” does not replace the opened path in the kernel’s path parsing, so the relative symlink gets evaluated as “C:\Path\To\A\C\B_junction\..\spam.txt” → “C:\Path\To\A\C\spam.txt”. This is the same behavior that one would observe on Linux if a bind mountpoint were used in place of “B_junction”.
Another difference is that junctions, as they are mountpoints, are required to target a directory in a local filesystem. Thus when a junction is traversed remotely, such as via SMB, the remote server can reliably evaluate the target directory of the junction on one of the server’s local devices (e.g. a directory on the server’s “Q:” drive) and leave the junction itself in the opened path, as is normally done for junctions.
A remote symlink, on the other hand, is always evaluated on the client side. A remote symlink might target a relative path within the share (e.g. “..\spam\eggs”), a local path on the client (e.g. “C:\spam\eggs”), or a remote path on another share (e.g. “\\server\share\spam\eggs”). The server sends the client the target path of the symlink and the offset of the remaining unparsed path after the symlink. The parsed path up to the symlink may contain junctions (which were evaluated by the server), but it should not include any symlinks. It’s up to the client to resolve and open the target path. If it’s a local absolute path on the client (i.e. remote-to-local symlink) or a remote absolute path on another share (i.e. remote-to-remote symlink), the client’s symlink policy may forbid the open. Generally speaking, remote-to-local (R2L) symlinks are dangerous, and clients should always forbid opening them.
Note that Windows 11 has changed the behavior when evaluating a relative symlink. The new behavior doesn’t affect the given example, but it does in other cases. The path parsing routine of the I/O manager in the kernel has been changed to logically normalize “..” components in the target of the symlink, similar to how the runtime library in user mode behaves. Previously it resolved them physically, like POSIX. It’s now the case that, for example, both “junction\..\spam” and “symlink\..\spam” get logically normalized to just “spam”. The parsing routine no longer evaluates “symlink” to replace the opened path before resolving the “..” component. ↩︎
The dangerous thing here, and in general the tricky cases due to “multiple redirection” tricks, seems to be symlinking of directories rather than files. I only thought about the latter; the PEP doesn’t mention directories. Would symlinking a directory be allowed at all, and if so is there a use case for that?
That perspective assumes that symlinks aren’t needed as long as anything they would be used for can be done in a wheel some other way / with a workaround. Which is one way of looking at it - but can be pretty painful in practice. If invoking a project’s build system produces a set of build artifacts including shared libraries and symlinks, and that’s how it is used by the project authors as well as by distro packagers, then having to make a bunch of tweaks just to fit into the constraints of wheels is labor-intensive and fragile. Sure, you can write some script to post-process stuff, but it can be a lot of work even when you’re very familiar with Python packaging. If you’re not, you’re probably just not going to bother / not get it right.
All agreed - and scipy-openblas32 · PyPI actually does contain pretty much exactly the logic for steps 1-4 you describe.
By default they do, since the build system produces them. All you’re saying I think is “you can do the work to remove them, and things are still possible today”.
It’s starting to be done, as the link to scipy-openblas32 I posted above shows (PyArrow and some other project are also doing this, or in the process of doing this).
That would indeed be quite convenient - but I haven’t seen it done that way yet.
I think you are also missing a Case 3, which is the MKL example I gave. That’s a non-Python project that is simply bundled into a wheel, and the goal in the CI job I mentioned is to use it at build time. Just to use it on that machine, not to build new wheels that depend on the MKL wheel. The latter would need your step 1-4 logic, but the former does not. However, it does require the symlinks to not go missing.
+10 for having a good motivation section with a couple of clear examples. Note that I’m not an author though. And you can’t expect @emmatyping to read my mind about which examples are going to be brought up in this thread:)
Sure, yes - you do need a [console_scripts] entrypoint [1] and that can be your second executable. The point is again that if the build system already produces a symlink and marks it for installation, the wheel build is going to choke on that. When you run python -m build, there isn’t even a good point to stop and insert a rm path/to/executable. So what you then have to do is add a build flag like --dont-install-symlinks to the project (which you may or may not have control over), which is problematic. [2]
IIRC you can also install an executable directly into prefix/bin/ by using scripts, but let’s not delve into that one since it’s not recommended. ↩︎
The way it works with meson-python is that Meson (the build system containing executable(...) build commands) produces a metadata file with everything to be installed, which will contain {bindir}/executable-name, and then meson-python translates that to a path inside the wheel. Which works, unless executable-name is a symlink. ↩︎
The one clear use case is editable installs, but this is not the intent of this PEP, if I am not wrong.
That said, directory symlinks in wheels would massively simplify the “dramas” of editable installs in environments/platforms where they are available[1].
Example of “dramas”:
1. PEP 660 suggests creating a “link farm” in a build directory and using a .pth to point to it => The problem with this approach is that users can accidentally remove the directory and break the installation… The other problem with this approach is that pip is not able to “garbage collect” directories that are not listed in the RECORD on the occasion of uninstalls.
2. Lack of support of static analysis tools for MetaPathFinders/PathEntryFinders ↩︎
That’s not true; you can have files in the scripts scheme, which do end up being arbitrary executable files. IIUC, that example would have literally the file to write in the wheel, albeit at a different path.
I think I agree with @njs that symlinks are not actually needed for shared libraries on UNIX. And if that’s true, I think it would be better not to use this functionality, so that wheels are installable by existing tools that only handle wheel 1.0. I’m worried that if we require a new version of install tools, this will be a repeat of the time that people were still running manylinux-unaware versions of pip and were needlessly building things from source and not realizing they could and should upgrade pip. (To be clear, I still think this PEP is worthwhile even if it’s not used for shared libraries, see below.)
There’s two cases where symlinks show up for shared libraries on UNIX. The first one is for runtime use, e.g., the libxml2.so.2 -> libxml2.so.2.9.1. I also am not aware of anything actually using the full name, except for making it human-informative which version you have. So I think I agree that packaging tools for Python wheels should just get rid of the libxml2.so.2.9.1 name and ship libxml2.so.2 as a real file. Which, I agree, people shouldn’t have to do by hand:
But we already ask people to run auditwheel on wheels that contain dynamic libraries to make them work right, where “work right” is a little complicated to explain and very complicated to do by hand, but the users of auditwheel don’t have to care exactly what that process involves. It would be straightforward to extend auditwheel to do the replacement of symlinks, and it can be done in a deterministic programmatic way without asking people to do anything fragile. (Actually, what does auditwheel do currently? Surely this has come up before.)
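The symlink-flattening step itself would be mechanical. A hedged sketch (the function name is invented; auditwheel has no such function today) of replacing a libxml2.so.2 -> libxml2.so.2.9.1 symlink with a single real file under the SONAME:

```python
import os

def drop_redundant_name(link: str) -> str:
    """Replace a symlink such as libxml2.so.2 -> libxml2.so.2.9.1 with the
    real file under the SONAME, discarding the fully-versioned name."""
    target = os.path.realpath(link)   # e.g. .../libxml2.so.2.9.1
    os.unlink(link)                   # remove the symlink itself
    os.rename(target, link)          # ship libxml2.so.2 as a real file
    return link
```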
The other case is the development symlink, in this case libxml2.so -> libxml2.so.2 so that cc -lxml2 works right. But there are ways to do this without a symlink (and without a copy). For the GCC ecosystem, this can be done with a linker script. Create libxml2.so as a plain text file containing INPUT(libxml2.so.2), and the linker will process that. (At a former job I actually had to do this for a distribution system that, like wheel, didn’t support symlinks, so I can attest that this does work in production.)
Example of using a linker script instead of a symlink
$ export LD_LIBRARY_PATH=$PWD LIBRARY_PATH=$PWD
$ cat lib.c
int x(void) {
return 42;
}
$ cat main.c
#include <stdio.h>
int x(void);
int main() {
printf("x = %d\n", x());
}
$ cc -fPIC -shared -o libx.so.1 lib.c
$ cc -o main main.c -lx
/usr/bin/ld: cannot find -lx
collect2: error: ld returned 1 exit status
$ ln -s libx.so.1 libx.so
$ cc -o main main.c -lx
$ ./main
x = 42
$ rm libx.so
$ echo 'INPUT(libx.so.1)' > libx.so
$ cc -o main main.c -lx
$ ./main
x = 42
On current macOS you can do this with tapi stubify, which creates a small plaintext .tbd file that has the same effect. (I imagine that most platforms are going to have some equivalent. Linux and macOS are the only UNIX targets that pypi.org currently allows uploads for, but I’m happy to dig into how to do this on some other platform that anyone feels is important.) And again this should all be wrapped up in a tool so packagers don’t have to think about how any of this is implemented; the experience should be that they start with some existing UNIX-conventional directory layout (including an unpacked MKL, CUDA, etc. binary archive) and end up with a wheel that does the right thing.
It’s also worth noting that the development symlink case is only needed for when you’re using a wheel as a build dependency for native code, which is a little bit of an unusual case. (In particular your development package will want to ship include files, debug symbols, etc. that most of your users won’t need.) From within Python code, one of the modes of cffi would make use of this, but probably you should be precompiling your cffi code anyway.
But there were a handful more use cases identified in this discussion thread:
Executables that can be invoked by multiple names, e.g. the pkg-config -> pkgconf example above, or ex -> vim. These are used where people expect to call the binary by the other name, and sometimes when the different names have different behaviors. An important special case of this is python3 -> python3.12 for the Python interpreter itself, which is one of two reasons the pybi format needs to deal with symlinks. Note this is only really needed for things that aren’t Python code (C, shell, etc.); if it’s a real entry point you can just create another entry point with a different name.
macOS frameworks. This is the other use case identified in the pybi spec (PEP 711): “symlinks are required to store macOS framework builds in .pybi files. So […] we absolutely have to support symlinks in .pybi files for them to be useful at all.”
Providing compatibility for Python code itself, the namespace/foo -> bar/ example. For which there’s an argument that this is perhaps better off explicitly disallowed, and I think I agree.
Representing editable installs as normal wheels, which is currently out of scope for the PEP except to keep it in mind as a future goal.
(Did I miss any?)
Use case 1 looks very common and is clearly sensible cross-platform. But it’s also a use case that can be specified in a very narrow and precise way: we define something akin to entry points that says, when you install this wheel, make this symlink in the scripts directory (which is bin/ on UNIX) to this other file that is also in the scripts directory. By narrowing the problem to symlinks to files in the same directory, we avoid a host of security questions and we also get a precise cross-platform implementation. On Windows, this can be hard links; on any platform, this can be a simple file copy, at the cost of some disk space. So this is something that can have a straightforward cross-platform spec.
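As a sketch of how small that cross-platform implementation could be, here is a hedged fallback chain for creating a same-directory alias (the function name and return values are invented for illustration, not proposed spec text):

```python
import os
import shutil

def create_script_alias(target: str, alias: str) -> str:
    """Make `alias` invoke `target`; both live in the scripts directory."""
    # Preferred: a relative symlink, so the pair survives relocation.
    try:
        os.symlink(os.path.basename(target), alias)
        return "symlink"
    except OSError:
        pass
    # Fallback for platforms/filesystems without symlinks: a hard link.
    try:
        os.link(target, alias)
        return "hardlink"
    except OSError:
        pass
    # Last resort: a plain copy, at the cost of some disk space.
    shutil.copy2(target, alias)
    return "copy"
```

Because source and alias are constrained to the same directory, every rung of this ladder produces an equivalent result, which is what makes the narrow spec portable.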
Use case 2 is Mac-specific. I think it would be great to encode something into the wheel format (in wheel 2.0 if needed) so that pybi is not a separate format. But we don’t have to answer cross-platform compatibility questions there, because these wheels will not be installed on other systems. And there are also some fairly tight constraints around what frameworks need, e.g., they only point within their own directory. (I also suspect that the symlinks can be flattened out the way that the libxml2.so.2 case can be flattened out, but I haven’t tried it.)
Use case 3 is questionable and probably worth discouraging, and use case 4 is inherently pointing to things outside the Python environment and is unsuitable for wheels that anyone is distributing. Maybe one answer is to spec them out technically, probably with the same means as use case 2, but ban them on PyPI for now and permit installers to not implement this functionality.
tl;dr, I suggest that we split this into three streams:
Get auditwheel and friends to fix up UNIX libraries without needing symlinks and quietly Do The Right thing for wheel 1.0.
Add a thing analogous to entry points to allow aliases for binaries, which can be implemented by symlinks, hard links, or copies, whichever is most convenient.
Continue to spec out symlinks in wheels to unify the pybi spec with wheels. Have PyPI accept them only for macOS platform tags and only where needed for frameworks, and only require pip etc. to implement the unpacking on macOS where symlinks are guaranteed.
On that last note I do like the approach and rationale in PEP 711 for handling symlinks. In short, starting from the fact that normal files are listed both in the zip file and in RECORD, it stores symlinks in the zip file so that UNIX unzip tools work, and it also stores them in RECORD so that non-UNIX systems have a convenient option.
Ok, I am back home from PyCon and ready to get back to responding.
This is a very good point. It also conflicts with how a lot of libraries are distributed today, as you say.
I think that the pynativelib approach is quite appealing, and I might even try adopting it. However, it does not handle the “user wants to link against bundled library” problem.
I should probably explain in more detail why this is important (both here and in the PEP). The main reason is that some libraries may not have stable ABIs, so it is required that if you wish to link to that project, you must link to the exact version that you wish to depend on for runtime use. For example, I know some teams use custom builds of PyTorch, and they must link against exactly the version they depend on at runtime to not break things.
Wouldn’t this require having the wheel provide information for N build tools? It seems much more appealing to me to add a single symlink to match convention that pretty much all Linux build tools understand, rather than build out a tool to package extra build scripts that some number of tools might understand.
It’s unclear to me how this would work, I feel like I may be missing something from your explanation of how using linker scripts would allow for tool-portable linking against libraries in wheels.
I don’t think this will work because getting the standard library’s zipfile to support symlinks seems difficult due to security concerns.
Many thanks Eryk! I appreciate the explanation. Based on my new understanding I will probably completely remove discussions of junctions and suggest using either hardlinks or symlinks on Windows, depending on what is available.
EDIT: Oh and I’ll try to draft a new version of the PEP tonight, there are substantial changes to be made.
Junctions can be useful as a replacement for directory symlinks if the target directory doesn’t contain relative symlinks with target paths that traverse it, like my “..\spam.txt” example. They’re especially useful in SMB shares because they can be used to access a directory in any local filesystem on the server without having to (1) export the directory as a share and (2) enable the policy to allow remote-to-remote (R2R) symlinks on client machines, which is disabled by default.
Hm, I’m not sure that we can maintain that invariant since namespace packages can span mount points right?
Hm, then should I still list them as a possibility? Perhaps I should re-frame the PEP text to say that it is up to the installer how to handle LINKS, but list symlinks, hardlinks, and junctions as common options.
I think a relative symlink that intentionally traverses a parent mountpoint is dysfunctional, but it’s allowed. OTOH, if you’re bind-mounting an existing directory, the onus is on you to ensure that it doesn’t result in a nonsense evaluation of a relative symlink. Usually that entails mounting a base directory, which you can reasonably assume won’t be traversed by a relative symlink.
On Windows, evaluating a symlink that traverses a parent mountpoint might fail as an invalid reparse point if it tries to traverse above the root path of the device in the opened path. That’s disallowed by the I/O manager in the kernel. For example, a relative symlink “E:\spam.txt” → “..\spam.txt” is invalid when accessed via the drive mountpoint “E:\”, but it works fine with a junction mountpoint such as “C:\Mount\Volume_E”[1], for which it resolves to “C:\Mount\spam.txt”.
It seems that allowing symlinks to directories is a matter of debate, and presumably the arguments would apply to allowing junctions on Windows. But I wouldn’t want a symlink to a directory to degrade to a junction if creating the symlink fails, due to the different path parsing semantics. I think it’s reasonable to allow relative symlinks and hardlinks to files, for which a symlink degrades to a hardlink if creating the symlink fails, and a hardlink degrades to a copy if creating the hardlink fails.
When a volume device comes online, any filesystem that it contains gets automatically mounted on the root path of the volume device. For example, when volume “\??\E:” comes online, the root directory of its filesystem gets mounted on the device root path “\??\E:\”. The name “\??\E:” is the persistent DOS drive name of the volume device. There’s also a volume GUID name of the form “\??\Volume{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}”. Not all volumes are assigned a DOS drive, but they should all have a GUID name. The real device name is dynamic, usually of the form “\Device\HarddiskVolume[N]”, and the default mountpoint is its root path “\Device\HarddiskVolume[N]\”. Note the trailing backslash. On POSIX, that would be like automatically mounting a filesystem on “/dev/sd[a][n]/”, which of course is not a thing on POSIX. Volume mountpoints within a filesystem are implemented as junctions that are registered with the system mountpoint manager, which target the root path on a volume device using its persistent GUID name, e.g. “\??\Volume{12345678-1234-1234-1234-123456781234}\”. Junctions in general can target any directory in a local filesystem and aren’t necessarily registered with the mountpoint manager. ↩︎
This isn’t a strong argument in itself imho, since it applies to a limited number of fairly advanced use cases & packages - it’s not like there will be problems at scale if this PEP gets accepted. In addition, it sounds like there are more proposals in the works for a wheel 2.0, and at that point the argument is moot anyway. I’m waiting curiously for proposals for those other changes.
Since linker scripts aren’t portable, I don’t think they’re ideal. E.g. Apple’s ld64 doesn’t support them, and on FreeBSD they only work in combination with -shared. I’m not sure about other platforms (does AIX support binutils-style linker scripts?) It’s also again a “let’s do more work to do wheel-specific things”, which is long-term worse than having symlinks be regularly supported.
You run auditwheel on a wheel already produced by a build backend, so that wheel must already contain a symlink - auditwheel cannot be used to avoid such support. In addition, auditwheel applies only to distributing on PyPI (and other index servers), but wheels have more use cases. E.g., if you do python -m build or pip install . while building packages for a Linux distro, auditwheel doesn’t come into the picture at all. So it’s not the right place to do anything here - and pip has to have support, since it is used in the cases where auditwheel does not apply.
I had thought it might make sense to allow links inside a wheel, remapping them so that a link to -.data/category/file inside the wheel would translate -.data/category/ to whatever the install location was.
I would be a fan of using ZIP format symlinks but there would have to be a layer on top of zipfile to make that work.
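On the reading side that layer is thin: Unix-created ZIPs record the file mode (including the symlink bit) in the high 16 bits of ZipInfo.external_attr and store the link target as the member’s data. A sketch, with none of the target validation a real installer would need:

```python
import os
import stat
import zipfile

def extract_with_symlinks(zf: zipfile.ZipFile, dest: str) -> None:
    """Extract a ZIP, recreating Unix symlink members as real symlinks."""
    for info in zf.infolist():
        mode = info.external_attr >> 16  # Unix st_mode, if the archiver set it
        if stat.S_ISLNK(mode):
            path = os.path.join(dest, info.filename)
            parent = os.path.dirname(path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            # The member's contents are the link target. A real installer
            # must validate that this stays inside dest before creating it!
            os.symlink(zf.read(info).decode(), path)
        else:
            zf.extract(info, dest)
```

The stdlib’s own extract() would instead write the symlink out as a regular file whose contents are the target path, which is exactly the behavior the layer has to paper over.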
Symlinks are useful because the system supports them and we should be able to package arbitrary software. Windows users may as well complain about Linux binaries in wheels or case-sensitive collisions; it is possible to write wheels that are not useful on Windows and vice-versa.
For wheel 2 I would like to include a nested zstandard compressed archive (admittedly this is also controversial) as an alternative to the .data directory, see much older thread on this forum, and clarify that the linux tag is preferred over the manylinux tag since linux is more specialized to the current machine than the more generic manylinux. PEP 491 – The Wheel Binary Package Format 1.9 | peps.python.org also contains never-implemented thoughts on the format.
I have felt that the argument against symlinks “they will unnecessarily create cross-platform incompatibilities on Windows” was based on an incorrect exploitative model of open source, “I want to be able to conveniently use your software even if you don’t want to provide it to me”. On the other hand, the archetypal lone open source developer “I am so highly motivated to provide free software to you that I’m programming during the evenings instead of spending time with my family” will avoid unnecessary symlinks. Not because of a philosopher’s stone specification that tries to restrict the development of platform-specific software, but because they want to spread their idea as widely as possible. For this reason we should strive to make the specifications as flexible as possible so that we can benefit from the broadest diversity of software published under those specifications, instead of trying to enable only software that we imagine might be useful to us personally.