Symbolic links in wheels

Symlinks in zips are standard, but they aren’t implemented in zipfile.py yet.

As Chris mentioned in the linked GitHub issue, the ZIP spec says nothing about symlinks; they just happen to work in certain implementations, on certain platforms. You could consider the zlib implementation the standard (whether that is valid is another question), but even that does not work on all platforms (IIUC zlib extracts symlinks as zero-size files on Windows).

Even in the scenario that zipfile.py (or pip’s own zip implementation) adds symlink support, including one in a wheel automatically makes the wheel non-portable. That might be okay for some (most?) people, but never all. I guess I am personally okay if symlinks are allowed in specifically-picked situations (say platform-specific wheels like manylinux and macosx), but would be quite unhappy if they are allowed for all wheels. This would be another potential trap for cross-platform package maintainers, and for users wasting time figuring out why a tool fails on them.

This is indeed the use case that is talked about here. See first post in discussion…

See https://pypi.org/project/zipfile2/ by the venerable David Cournapeau.

If you’re very worried, just put symlinks behind a flag on the build side. Then cross-platform wheels that have otherwise avoided Unix API calls or other incompatibilities will not accidentally include a symlink as well.

Right now we have simply reverted to copying all libraries (which means larger wheels) rather than symlinking them.

Hijacking this thread after the discussion of one implementation for PEP 660 (wheel-based editable installs) where symbolic-links are used (first mention).

In this case, the wheel is guaranteed to be intended for local (and therefore platform-specific) use, so as long as project build back-ends don’t try to put symlinks into a wheel when they’re unsupported/disabled on Windows, everything should be fine?

Some resources I’ve discovered:

For this to be useful for editable installation, I’d imagine that for all versions of Python which don’t support symlinks in zips, both wheel installers and build back-ends (that support editable wheels?) will need to subclass (or patch) zipfile.ZipFile to support symlinks. This sounds like an interoperability concern, where back-ends can only start using symlinks when all (supported?) wheel installers support symlink extraction. Back-ends can’t provide symlink support to wheel installers as they’re in a separate environment.

As an alternative, it would be possible to add a new metadata file to wheels that (in effect) says “file XXX should be installed as a symlink to file YYY” and require installers to create the link at install time. We don’t need the zipfile module to support symlinks directly.

The problem isn’t recording symlinks in the wheel, it’s things like:

  • Are installers required to fail the whole installation if they can’t create symlinks?
  • Are wheels required to work properly even if the included symlinks aren’t installed?
  • Should symlinks to files that don’t exist at install time be allowed or not?
  • Should there be any restrictions on what can be used as a symlink target?
  • How do we make the user experience acceptable for wheels containing features like symlinks that may not be available on the target system?

If we just look at the “happy path” of a wheel built in an environment that uses symlinks, being installed in an environment that can replicate those symlinks with no issues, the problem isn’t difficult. It’s getting the edge cases and failure modes correct, in a way that doesn’t expose an unsuspecting user to issues they don’t have the information/ability to resolve, that’s the difficulty with putting together an actual proposal.

The problem with this, and all other alternatives I can think of (eg metadata external to the zip file, using tar instead, a special file name, adding to the existing wheel metadata) is that this is a spec change which will need to be supported by both wheel installers and back-ends.

Ideally, if back-ends want to use symlinks to implement editable wheels, nothing should have to change to support this (eg if back-ends weren’t run isolated, they could patch the standard library’s ZipFile so that when the wheel installer goes to extract, symlinks are already supported).

For my back-end, the “happy path” is the only instance when symlinks would be used (in my case falling back to .pth), and only for editable wheels. I agree with your concerns, which is why I would expect back-ends to pre-emptively decide not to use symlinks in wheels in cases where problems may arise (eg by testing for symlink support, and only using it for ephemeral editable wheels)
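
As a rough illustration of "testing for symlink support", a back-end could probe the filesystem before deciding between symlinks and a .pth fallback. This is only a sketch: the function name `can_create_symlinks` is made up for this example, and it probes a temporary directory, which may not sit on the same filesystem (or have the same permissions) as the real install target.

```python
import os
import tempfile


def can_create_symlinks() -> bool:
    """Return True if this process can create a symlink in a temp directory.

    Hypothetical helper: a real back-end would probably want to probe the
    actual target directory rather than the system temp directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("x")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError, AttributeError):
            # Typical on Windows without the symlink privilege or Developer Mode.
            return False
        return os.path.islink(link)
```

A back-end could then choose the symlink strategy only when this returns True and fall back to .pth otherwise.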

But frontends need to change to (a) use a zipfile implementation that supports symlinks, and (b) handle any exceptions when the install environment doesn’t support symlinks. So expecting frontends not to have to change is impractical.

So for your use case, “installs must fail if symlinks cannot be created or the target doesn’t exist, symlink targets must be absolute paths, and front ends should/must error if they find symlinks in wheels that did not get created by a build_editable call” is an acceptable rule. Fair enough, but I suspect other tools/users may have different opinions (for example, as a frontend, I don’t like hard-coding a difference between the editable and non-editable path, and the use case for Unix .so files that originally prompted this discussion wouldn’t work with that set of restrictions).

Someone needs to collect use cases and define rules that work acceptably for all use cases.

Are symlinks better than .pth files for some reason? I’m curious what the differences are.

Anyway, here’s a thing I’ve been pootling around with privately, not sure where it’s going, but it does have some detailed thoughts about symlinks in wheel-like formats here: posy/README.md at main · njsmith/posy · GitHub

I don’t think .pth files help when you want to symlink .so files (for example to ship both a versioned .so and an unversioned one, which would be our primary use case for symlinks in PyArrow).

A .pth file adds the directory containing a file to the search path, so if a directory contains multiple Python modules it’s all-or-nothing. Symlinks are easier to work with and have fewer gotchas. Another big advantage is that you can symlink something under a different name, which is useful for extension modules.

Regarding the implementation, IMO the |= 0xA0000000 approach makes the most sense for wheels. It is not standard to treat such a file as a symlink, but setting that attribute flag is standard-compliant, so we can amend the wheel spec to say something like “if a file in a wheel has this attribute and contains one single line representing a relative path inside the same wheel, the installer should create a symbolic link at the location to the target file”. This won’t need additions to zipfile (which would raise standards-compliance issues) and should be entirely doable within Python packaging specs and projects.
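
To make the 0xA0000000 idea concrete, here is a minimal sketch of writing and detecting such an entry with the standard zipfile module. The helper names `add_symlink_entry` and `is_symlink_entry` are invented for this example; the only real APIs used are `zipfile.ZipInfo`, its `external_attr` and `create_system` fields, and the `stat` constants. Note that 0xA0000000 is just `stat.S_IFLNK << 16`, i.e. the Unix "symlink" file type stored in the upper 16 bits.

```python
import stat
import zipfile

SYMLINK_ATTR = stat.S_IFLNK << 16  # equals 0xA0000000


def add_symlink_entry(zf: zipfile.ZipFile, name: str, target: str) -> None:
    """Store `target` as the body of an entry flagged as a symlink."""
    info = zipfile.ZipInfo(name)
    info.create_system = 3  # 3 = Unix, so the upper 16 bits are read as a mode
    info.external_attr |= SYMLINK_ATTR
    zf.writestr(info, target)


def is_symlink_entry(info: zipfile.ZipInfo) -> bool:
    """Return True if the entry's Unix mode bits say it is a symlink."""
    return stat.S_ISLNK(info.external_attr >> 16)
```

An installer that does not know about the flag would simply extract a small text file containing the target path, which is the "still useful-ish when extracted directly" property.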

Implementation-wise, Windows now has symlinks and you can enable them pretty easily if you have admin control of the machine, so that’s much less of a problem now than five years ago. And if the user does not (cannot) have it enabled, the installer can choose to either copy the file or raise an error telling the user to talk to their admins. For pip at least, there are already things that may need admin intervention for installation to work on Windows (path length limit), so there’s precedent already.

The cited blog makes it seem novel that “[s]tarting with Windows 10 Insiders build 14972, symlinks can be created without needing to elevate the console as administrator”. This has been possible another way since Windows Vista. UAC only filters the symlink privilege from the set of privileges that are provided by the Administrators group, since the group is disabled (deny-only). The symlink privilege can still be granted directly to the user or one of the user’s groups such as “Authenticated Users”. Anyone with admin access can modify this to allow creating symlinks without elevating.

So I thought a bit about this, and it seems we’d need a wheel version bump for this? Because tools that work with the current wheel version are not guaranteed to be able to handle symlinks, so we need to signify what is compatible and what is not.

If that’s the case, a PEP should specify a new wheel version (1.1) that

  • Does everything that wheel version 1.0 does.
  • If a file in the wheel has external_attr bit 0xA0000000 set, the file MUST contain only one single line that contains a path relative to the directory containing the file. The path MUST point to another entry in the same wheel.
  • On installation, the install tool SHOULD create a symbolic link in place of the file with external_attr bit 0xA0000000 set, with the target being the wheel entry specified by the path in the file.
  • If an install tool is unable to create such a symbolic link, it MAY copy the target instead if the target is a regular file. Otherwise, the install tool SHOULD signify this failure.
  • Build tools are advised to use wheel format 1.1 only if they need to include a symbolic link in the wheel, to maintain best compatibility with existing install tools.
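
The validation part of the rules above could look roughly like this on the installer side. This is a sketch of the proposed (not accepted) rules only; `resolve_link_target` is a hypothetical name, and the check simply enforces "one line, relative to the entry’s directory, pointing at another entry in the same wheel".

```python
import posixpath
import stat
import zipfile


def resolve_link_target(zf: zipfile.ZipFile, info: zipfile.ZipInfo) -> str:
    """Return the archive name a symlink entry points to, or raise ValueError."""
    if not stat.S_ISLNK(info.external_attr >> 16):
        raise ValueError(f"{info.filename} is not flagged as a symlink")
    lines = zf.read(info).decode("utf-8").splitlines()
    if len(lines) != 1:
        raise ValueError("symlink entry must contain exactly one line")
    rel = lines[0]
    if posixpath.isabs(rel):
        raise ValueError("symlink target must be a relative path")
    # Resolve relative to the directory that contains the entry itself.
    base = posixpath.dirname(info.filename)
    resolved = posixpath.normpath(posixpath.join(base, rel))
    if resolved == ".." or resolved.startswith("../"):
        raise ValueError("symlink target escapes the wheel")
    if resolved == info.filename or resolved not in zf.namelist():
        raise ValueError("symlink target is not another entry in the wheel")
    return resolved
```

An installer would call this before creating the link, and could reuse the resolved name to copy the target when linking is unavailable.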

I’ll begin drafting some text for the PEP if the above makes sense.

This constraint precludes using this feature for editable wheels, as per PEP 660.

As far as I am aware, there are two use cases for this feature, Unix .so file symlinks, and editable wheels (if there are others, people should speak up!). Do we want to only support one of those? I assume the benefit is that there is less risk (in terms of both security and general breakage from mistaken assumptions) if the feature is limited to symlinks internal to the wheel? If we do, the PEP should definitely include a section explaining why we only allow internal symlinks, and the implications of doing so.

Some other points:

  1. Existing tools probably don’t check the wheel version, and so will likely create a regular file containing the text in the link (the target filename). We can’t do much about this, but the PEP should probably note the risk. Also note that the wheel 1.0 spec says that installers must warn but proceed if the major version is the same but the minor version greater than the expected version. So the best we can expect older installers to do is warn.
  2. The broken behaviour above (writing a regular file) is valid according to the new PEP, as the chain of SHOULD/MUST requirements only says that installers SHOULD link, copy or fail. Maybe we should be stricter and say that installers MUST fail if they can’t link or copy?

I would want to re-state the path rewriting rules e.g. package-1.0.data/platlib/x.py installs to $VIRTUAL_ENV/site-packages/; rewrite the symbolic link target with the same mapping.

I would want to add a column to RECORD to indicate the executable or link bits so that the link or executable status of individual files affects the hash of RECORD.

I have a draft updated PEP although it was focused on trying to add the ‘greater compression’ extension where a nested archive could provide files not in the .dist-info directory, but the document needs work to succeed in being more clear than the original.

I think it would be version 2.0? If an old tool tries to unpack it and loses all the symlinks, that will completely break the wheel in most cases, surely?

External symlinks are definitely tricky from a security perspective, because if anyone can be tricked into writing through them, they can redirect writes to arbitrary filesystem locations. For example, consider a symlink from __pycache__/something.pyc pointing to a critical system file.

Another nasty case that you have to harden against is where the zip file first contains a symlink to a system directory: something -> ../../../../../../../etc, and then later on in the zipfile it contains a regular entry that falls inside the symlinked directory: something/passwd → "some text". If the archive tool is unpacking entries sequentially, it can easily end up writing through the symlink to overwrite /etc/passwd with "some text". (For this one, I think the usual solution is to wait to create all symlinks until the other files have been unpacked, in a second pass.)
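
A minimal sketch of that two-pass idea, assuming symlinks are marked with the Unix symlink mode in external_attr and that the entry body holds the target path. The function name `extract_two_pass` is invented here, and the extra containment check on targets is one possible policy, not something any spec currently requires.

```python
import os
import stat
import zipfile


def extract_two_pass(zf: zipfile.ZipFile, dest: str) -> None:
    """Extract regular entries first, then create symlinks in a second pass."""
    dest = os.path.realpath(dest)
    deferred = []

    # Pass 1: no symlinks exist yet, so nothing can be written "through" one.
    # ZipFile.extract() already strips leading slashes and ".." components.
    for info in zf.infolist():
        if stat.S_ISLNK(info.external_attr >> 16):
            deferred.append(info)
        else:
            zf.extract(info, dest)

    # Pass 2: create the links, refusing anything that leaves `dest`.
    for info in deferred:
        target = zf.read(info).decode("utf-8")
        link_path = os.path.normpath(os.path.join(dest, info.filename))
        resolved = os.path.realpath(
            os.path.join(os.path.dirname(link_path), target)
        )
        for path in (link_path, resolved):
            if os.path.commonpath([dest, path]) != dest:
                raise ValueError(f"{info.filename!r} points outside {dest!r}")
        os.makedirs(os.path.dirname(link_path), exist_ok=True)
        os.symlink(target, link_path)
```

With the something / something/passwd archive described above, pass 1 creates an ordinary something/ directory under the destination, and pass 2 then refuses the link, so nothing outside the destination is touched.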

TBH I’m still not clear on what the .so file symlinks use case is. Like I said upthread a few years ago, symlinks like libarrow.so -> libarrow.so.14.0.0 only make sense for libraries on the ld.so search path, and I don’t see how that ever applies to wheels.

For the editable wheels case, Tzu-Ping points out that:

Real issue! But – maybe there’s another way to fix it? Since .pth files can contain arbitrary code, and importlib has a ton of flexibility in how it searches for modules, maybe we can stick a bit of code in a .pth that only adds a single Python module/package to the search path? We’d have to check with the importlib experts – I always get lost when I try to read the docs :-). But if that works, it might be both backwards compatible and 100% reliable across platforms, neither of which symlinks are.
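
For what it’s worth, a rough sketch of what such a hook might look like: a meta path finder that maps individual module names to source files. The class and function names (`SingleModuleFinder`, `install`) are made up, this has not been checked with the importlib experts, and it has only been thought through for single-file modules. Lines in a .pth file that start with "import" are executed by the site module, so the .pth could contain something like `import _my_editable_hook; _my_editable_hook.install({...})`, where `_my_editable_hook` is a hypothetical module holding the code below.

```python
import importlib.abc
import importlib.util
import sys


class SingleModuleFinder(importlib.abc.MetaPathFinder):
    """Expose specific modules by name without adding a directory to sys.path."""

    def __init__(self, mapping):
        # mapping: module name -> absolute path of its source file
        self._mapping = dict(mapping)

    def find_spec(self, fullname, path=None, target=None):
        location = self._mapping.get(fullname)
        if location is None:
            return None  # let the remaining finders handle everything else
        return importlib.util.spec_from_file_location(fullname, location)


def install(mapping):
    """Register the finder; appended last so normal imports take priority."""
    sys.meta_path.append(SingleModuleFinder(mapping))
```

Only the names listed in the mapping become importable, which avoids the all-or-nothing behaviour of a plain directory entry in a .pth file.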

I guess no-one followed my link earlier… the way I’m doing it is with a special pseudo-hash function:

name/of/symlink,symlink=path/to/symlink/target,

See here for rationale.
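
As an illustration of how a tool might read such entries, here is a small sketch that pulls symlink declarations out of RECORD-style text. It assumes the usual three-column CSV layout (path, hash, size) with "symlink=" used in the hash column as shown above; `parse_record_symlinks` is a hypothetical name, not part of any existing tool.

```python
import csv


def parse_record_symlinks(record_text: str) -> dict:
    """Map symlink path -> target for rows using the 'symlink=' pseudo-hash."""
    links = {}
    for row in csv.reader(record_text.splitlines()):
        if len(row) < 2:
            continue
        path, hash_field = row[0], row[1]
        algorithm, sep, value = hash_field.partition("=")
        if sep and algorithm == "symlink":
            links[path] = value
    return links
```

Rows with a real hash algorithm are ignored, so the same pass can run over an ordinary RECORD file.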

If it’s not clear that it’s a valid use case, and neither is the editable wheel one (see below), then it seems a bit premature for anyone to be thinking about writing a PEP…

The discussions around PEP 660 were, to say the least, extensive. I personally agree with you (see my library which supports .pth files and import hooks), but there was a lot of discussion around symlink-based solutions. In the end, I approved PEP 660 on the basis that if symlinks are important to people, we can address the questions as part of adding symlink support to wheels.

I’m happy, personally, to not worry about the editable wheel case, but I felt that I should point out that if the proposal doesn’t support that use case, we will be closing the door to the option of implementing editable installs via symlinks.

I don’t mind the pseudo-hash symlinks. We agree that the rules about what to do with the symlinks are more important than how to embed them in the .zip archive. If we added symlinks to wheels it shouldn’t be difficult to have a different security rule for editable wheels.

The largest wheels on PyPI duplicate .so files; the use case is popular. It may not matter whether or not it is absolutely necessary.

An installer understanding the new format automatically understands a 1.0 wheel, which means backwards compatible (?) so I chose 1.1. But honestly I never really understand how format versioning works, so if 2.0 is the current version, this should be 2.0.

Security is the main issue here to me, and to me the editable use case shows how it’s problematic to use wheel both as a data interchange format between the frontend and the backend, and a distribution format for transmission between multiple environments. Security is a big issue for the latter, but some of those are non-issues for the former (since the frontend has all the information needed to decide what is valid and what is not). So if we want to treat wheel as a distribution format and keep its promise of not writing to a location outside of the target environment’s prefix, editables can’t be supported. If we take PEP 660’s idea and define wheel as the frontend-backend interchange format, however, linking something to /etc should be allowed in a wheel, and the security implication should be solved at another layer in distribution—for example, make PyPI validate uploaded wheels to not contain any out-of-environment symlinks, like how wheel allows the local version segment, but such a wheel cannot be published to PyPI.

I think this can be said of basically all symlink use cases (and actually some existing wheel features), because theoretically any usage of it can be replaced by some kind of runtime magic. But the use case still comes up often enough that I think it is worthwhile to allow people to use them, since otherwise people resort to solutions like creating duplicated binary blobs that make wheel sizes balloon, which IMO is not good for the ecosystem. We can yell all day that they don’t need that in the first place, but it’s pretty clear people are not going to try the “good” alternatives we suggest, so IMO it is better to provide a solution more obvious to them (symlinks in wheels) as long as it does not compromise the rest of the format.

Semi-related, I also considered allowing only regular files to be symlinked (not directories), which can help with some of the issues (and still covers all known symlink use cases, I believe). But of course that doesn’t help if a wheel contains something → ../../../../../../../etc/passwd, so arguably not very meaningful?

I don’t disagree with the pseudo-hash approach, but IMO this should be done in addition to the 0xA0000000 thing. Hash functions are for validation, and it does not feel right to me to overload them to describe file attributes. Also, wheels currently have a nice “feature” of being able to be extracted directly and still be useful-ish, which the 0xA0000000 approach happens to preserve, which I like. That’s minor, though.
