Symbolic links in wheels

Hi,

Is there any plan to support symbolic links in wheels? Context: we package rather large .so files and those .so files are also versioned (e.g. libarrow.so symlinks to libarrow.so.14.0.0). Currently, those symlinks become copies, which doubles the file size (ending up at 50+MB). Ideally we would like to keep the symlinks inside those wheels.

IIRC there have been multiple discussions scattered around multiple venues (GitHub, mailing lists, and maybe here?). My understanding of this issue is:

  • Direct support is unlikely since Windows has poor symlink support.
  • The most significant technical blocker comes not from the packaging ecosystem but from Python’s zipfile module, which cannot create symlinks. So unless pip and other package installers implement or vendor their own zip implementation, wheels with symlinks cannot be installed correctly anyway.

AFAICT it’s pretty straightforward: os.stat(filename, follow_symlinks=False).st_mode goes into the upper 16 bits of ZipInfo.external_attr to store the “is a symlink” flag. The link target just goes where the file contents would otherwise go.
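A minimal sketch of that idea using only the standard zipfile module. The helper names add_symlink and is_symlink are made up for illustration, and the file names are placeholders; whether a given unzip tool honours the entry is a separate question.

```python
import io
import stat
import zipfile


def add_symlink(zf, link_name, target):
    """Store `link_name` in the archive as a symlink entry pointing at `target`."""
    info = zipfile.ZipInfo(link_name)
    info.create_system = 3  # 3 = Unix, so readers interpret the mode bits
    # The Unix st_mode lives in the upper 16 bits of external_attr.
    info.external_attr = (stat.S_IFLNK | 0o777) << 16
    # The link target is stored where the file contents would otherwise go.
    zf.writestr(info, target)


def is_symlink(info):
    """Return True if the entry's stored Unix mode marks it as a symlink."""
    return stat.S_ISLNK(info.external_attr >> 16)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("lib/libarrow.so.14.0.0", b"placeholder for the real library")
    add_symlink(zf, "lib/libarrow.so", "libarrow.so.14.0.0")

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, is_symlink(info))
```

Reading the archive back with plain zipfile gives the link target as the entry's data, so an installer that does not know about the flag would write a small text file rather than a link.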

I had happily subclassed ZipFile in wheel to check / generate file hashes… zipfile is a very nice module.

I notice the zip on the machine I’m using also stores mtime/atime/ctime and uid/gid in the “extra” field but zipfile.py doesn’t currently interpret those.

See also

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
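For the “extra” field, here is a rough sketch of how those values could be pulled out of ZipInfo.extra by hand. It assumes the Info-ZIP conventions as I understand them: header ID 0x5455 for the extended timestamp and 0x7875 for the newer Unix uid/gid field. The function names are invented for this example, and the field layouts should be checked against the Info-ZIP documentation before relying on them.

```python
import struct


def iter_extra_fields(extra):
    """Yield (header_id, payload) pairs from a raw ZIP "extra" field."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each sub-field starts with a 2-byte ID and a 2-byte payload size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        yield header_id, extra[pos + 4 : pos + 4 + size]
        pos += 4 + size


def unix_mtime(extra):
    """Return the mtime from an extended-timestamp field (0x5455), or None."""
    for header_id, payload in iter_extra_fields(extra):
        # Byte 0 holds flags; bit 0 set means a 4-byte mtime follows.
        if header_id == 0x5455 and len(payload) >= 5 and payload[0] & 1:
            return struct.unpack_from("<i", payload, 1)[0]
    return None


def unix_owner(extra):
    """Return (uid, gid) from a "new Unix" field (0x7875), or None."""
    for header_id, payload in iter_extra_fields(extra):
        if header_id != 0x7875 or len(payload) < 3:
            continue
        # Layout assumed: version, uid size, uid, gid size, gid.
        uid_size = payload[1]
        if len(payload) < 3 + uid_size:
            continue
        uid = int.from_bytes(payload[2 : 2 + uid_size], "little")
        gid_size = payload[2 + uid_size]
        gid_start = 3 + uid_size
        gid = int.from_bytes(payload[gid_start : gid_start + gid_size], "little")
        return uid, gid
    return None
```

With a real archive you would pass info.extra for each entry returned by ZipFile.infolist().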

If there’s a serious need for this I have been contemplating writing a sans-I/O zip reader for stuff like this to make it easier to share base-line zip code such as zipfile, use in importlib for zipimport, etc. (which would also mean I would design it to have as few dependencies as possible since it would need to be frozen with importlib).

Here is the most recent discussion of this issue in the pip tracker for those who are interested (issue #5919: “Symlink (and other) handling of archives”): https://github.com/pypa/pip/issues/5919

The whole scheme of symlinks like libarrow.so → libarrow.so.14.0.0 is tightly coupled to how the Linux system linker searches for libraries. Wheels have a different and incompatible way of handling library searching. Including libraries like this inside a wheel is a dubious thing to do… I can see how it might solve some problems in the short term but in the long term I think you’ll hit unsolvable problems. For example, if the user also has a system copy of libarrow.so, and you’re relying on the linker recognizing well-known names like this, then your package might end up using either the system’s copy or the wheel’s copy basically at random, which sounds like a recipe for obscure segfaults.

IMO if you want to ship shared libraries in a wheel, then you should take the search problem seriously, and not rely on the linker’s naming scheme for system libraries. Auditwheel gives each vendored library a unique mangled name, which works well for that use case. From your post, I assume you also want the library to be usable by other packages. In that case, the best approach I’ve been able to come up with on Linux is:

  • Give your library a unique name that designates a specific ABI as shipping inside a Python wheel, like libarrow-wheel-14.so or similar.
  • Provide a Python API that lets other packages query for build time configuration (linker flags, include dir, etc.), as well as the resulting wheel dependency (maybe if the third-party package is built against pyarrow 14.2.3, then that means its Install-Requires should include pyarrow >= 14.2)
  • Provide a Python API that lets other packages request the library be available at runtime, doing whatever linker finagling is necessary to make that work. (On linux, the simplest thing is to just dlopen("path/to/libarrow-wheel-14.so"); then any future requests for that shared library will be automatically satisfied without going through the normal library search.)
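A rough sketch of what such an API could look like. None of this exists in pyarrow: the file name libarrow-wheel-14.so comes from the example above, and library_path, get_build_config and ensure_loaded are hypothetical names. It assumes Linux and uses ctypes as the dlopen wrapper.

```python
import ctypes
import os

# Hypothetical mangled name, following the naming idea above.
LIB_NAME = "libarrow-wheel-14.so"
_handle = None


def library_path(package_dir):
    """Return where the bundled library would live inside the package."""
    return os.path.join(package_dir, LIB_NAME)


def get_build_config(package_dir):
    """Return build-time settings a dependent package could query."""
    return {
        "include_dir": os.path.join(package_dir, "include"),
        "library_dir": package_dir,
        "library_file": LIB_NAME,
    }


def ensure_loaded(package_dir):
    """Load the bundled library once so later loads can reuse it."""
    global _handle
    if _handle is None:
        # ctypes.CDLL calls dlopen() on Linux; RTLD_GLOBAL exposes the
        # library's symbols to extension modules loaded afterwards.
        _handle = ctypes.CDLL(library_path(package_dir), mode=ctypes.RTLD_GLOBAL)
    return _handle
```

A dependent extension package would call ensure_loaded(...) from its __init__.py before importing its own compiled modules. If the file is missing, ctypes raises OSError, which is probably the failure you want to surface early.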

It wouldn’t be that difficult. The regular one isn’t too welded to the filesystem, it works fine with any seekable file-like, you could get it fetching range requests over the network to only load the parts of the zip extracted without too much trouble. How would a sans-I/O module differ?
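To illustrate the point about seekable file-likes, here is a small sketch that hands zipfile an in-memory object and counts how many bytes it actually reads. CountingBytesIO is a throwaway class written for this example; a network-backed version would issue a range request wherever this one calls read().

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that records how many bytes zipfile reads from it."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        data = super().read(size)
        self.bytes_read += len(data)
        return data


# Build an archive with one large member and one small one.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("big.bin", bytes(1_000_000))
    zf.writestr("small.txt", "hello")
archive = raw.getvalue()

# Read only the small member through the counting wrapper.
fp = CountingBytesIO(archive)
with zipfile.ZipFile(fp) as zf:
    content = zf.read("small.txt")

print(content, fp.bytes_read, "of", len(archive), "bytes read")
```

zipfile seeks to the central directory at the end of the file and then to the requested member, so the large member is never touched.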

From my POV no dependencies, but that’s more implementation detail.

I would be very disappointed if wheels became incompatible with standard zip tools.

At most, I’d want to see it implemented as metadata, along with a requirement for a direct email contact that frontends can display to users when packages use non-portable features like this :smiling_imp:

In reality, I’d rather the shared native dependency problem be solved properly, though I think as long as we rely on the package developers to do their own wheels and don’t develop a culture where it’s okay for organizations to provide their own (compatible) builds on an alternate index (e.g. a per-OS index maintained by people who do nothing but maintain the builds for that OS) then we’re going to be constantly searching for the next workaround like this one.

Symlinks in zips are standard, but they aren’t implemented in zipfile.py yet.

As Chris mentioned in the linked GitHub issue, the ZIP spec says nothing about symlinks; they just happen to work in certain implementations, on certain platforms. You could consider the zlib implementation the standard (whether that is valid is another question), but even that does not work on all platforms (IIUC zlib extracts symlinks as zero-size files on Windows).

Even in the scenario that zipfile.py (or pip’s own zip implementation) adds symlink support, including one in a wheel automatically makes the wheel non-portable. That might be okay for some (most?) people, but never all. I guess I am personally okay if symlinks are allowed in specifically-picked situations (say platform-specific wheels like manylinux and macosx), but would be quite unhappy if they are allowed for all wheels. This would be another potential trap for cross-platform package maintainers, and for users wasting time figuring out why a tool fails on them.

This is indeed the use case that is talked about here. See first post in discussion…

See https://pypi.org/project/zipfile2/ by the venerable David Cournapeau.

If you’re very worried, just put symlinks behind a flag on the build side. Then cross-platform wheels that have otherwise avoided Unix API calls or other incompatibilities will not accidentally include a symlink.

Right now we simply reverted to copy all libraries (which means larger wheels) rather than symlink them.

Hijacking this thread after the discussion of one implementation for PEP 660 (wheel-based editable installs) where symbolic links are used (first mention).

In this case, the wheel is guaranteed to be intended for local (and therefore platform-specific) use, so as long as project build back-ends don’t try to put symlinks into a wheel when they’re unsupported/disabled on Windows, everything should be fine?

Some resources I’ve discovered:

For this to be useful for editable installation, I’d imagine that for all versions of Python which don’t support symlinks in zips, both wheel installers and build back-ends (that support editable wheels?) will need to subclass (or patch) zipfile.ZipFile to support symlinks. This sounds like an interoperability concern, where back-ends can only start using symlinks when all (supported?) wheel installers support symlink extraction. Back-ends can’t provide symlink support to wheel installers as they’re in a separate environment.
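On the installer side, the extraction half could be sketched without touching zipfile internals. The following is an illustration only, not what pip or any other installer does: extract_with_symlinks and demo_archive are invented names, it assumes a platform where os.symlink works, and it deliberately does not validate the link target.

```python
import io
import os
import stat
import tempfile
import zipfile


def extract_with_symlinks(zf, dest):
    """Extract `zf` into `dest`, recreating symlink entries as real symlinks."""
    dest = os.path.realpath(dest)
    for info in zf.infolist():
        if stat.S_ISLNK(info.external_attr >> 16):
            link_path = os.path.normpath(os.path.join(dest, info.filename))
            # Refuse entries whose own path would land outside `dest`.
            if os.path.commonpath([dest, link_path]) != dest:
                raise ValueError(f"unsafe link path: {info.filename!r}")
            os.makedirs(os.path.dirname(link_path), exist_ok=True)
            # The entry's data is the link target. It is NOT validated here.
            os.symlink(zf.read(info).decode("utf-8"), link_path)
        else:
            zf.extract(info, dest)


def demo_archive():
    """Build a tiny in-memory archive containing one file and one symlink."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("lib/libarrow.so.14.0.0", b"placeholder")
        link = zipfile.ZipInfo("lib/libarrow.so")
        link.create_system = 3  # Unix
        link.external_attr = (stat.S_IFLNK | 0o777) << 16
        zf.writestr(link, "libarrow.so.14.0.0")
    buf.seek(0)
    return buf


with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(demo_archive()) as zf:
        extract_with_symlinks(zf, tmp)
    print(os.readlink(os.path.join(tmp, "lib", "libarrow.so")))
```

Because this is a plain function, an installer could vendor it independently of the Python version it runs on, but every installer would still need to ship something like it before back-ends could rely on it.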

As an alternative, it would be possible to add a new metadata file to wheels that (in effect) says “file XXX should be installed as a symlink to file YYY” and require installers to create the link at install time. We don’t need the zipfile module to support symlinks directly.
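To make that concrete, here is a sketch of what an installer might do with such a file. The format is entirely hypothetical (a CSV with one "link,target" row per symlink, loosely modelled on RECORD), and read_symlink_manifest and install_symlinks are names invented for this example; nothing like this exists in any wheel spec today.

```python
import csv
import io
import os


def read_symlink_manifest(text):
    """Parse hypothetical "link,target" rows into a list of pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            link, target = row
            pairs.append((link, target))
    return pairs


def install_symlinks(root, manifest_text):
    """Create each listed link under `root`; return the paths created."""
    created = []
    for link, target in read_symlink_manifest(manifest_text):
        link_path = os.path.join(root, link)
        os.makedirs(os.path.dirname(link_path), exist_ok=True)
        # Raises OSError where the platform or user cannot create symlinks,
        # which in this sketch would abort the installation.
        os.symlink(target, link_path)
        created.append(link_path)
    return created
```

The wheel itself stays an ordinary zip that any tool can open, since the links only come into existence at install time.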

The problem isn’t recording symlinks in the wheel, it’s things like:

  • Are installers required to fail the whole installation if they can’t create symlinks?
  • Are wheels required to work properly even if the included symlinks aren’t installed?
  • Should symlinks to files that don’t exist at install time be allowed or not?
  • Should there be any restrictions on what can be used as a symlink target?
  • How do we make the user experience acceptable for wheels containing features like symlinks that may not be available on the target system?

If we just look at the “happy path” of a wheel built in an environment that uses symlinks, being installed in an environment that can replicate those symlinks with no issues, the problem isn’t difficult. It’s getting the edge cases and failure modes correct, in a way that doesn’t expose an unsuspecting user to issues they don’t have the information/ability to resolve, that’s the difficulty with putting together an actual proposal.

The problem with this, and all other alternatives I can think of (eg metadata external to the zip file, using tar instead, a special file name, adding to the existing wheel metadata) is that this is a spec change which will need to be supported by both wheel installers and back-ends.

Ideally, if back-ends want to use symlinks to implement editable wheels, nothing should have to change to support this (eg if back-ends weren’t run isolated, they could patch the standard library’s ZipFile so that symlinks are already supported when the wheel installer goes to extract).

For my back-end, the “happy path” is the only instance when symlinks would be used (in my case falling back to .pth), and only for editable wheels. I agree with your concerns, which is why I would expect back-ends to pre-emptively decide not to use symlinks in wheels in cases where problems may arise (eg by testing for symlink support, and only using it for ephemeral editable wheels).
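The “testing for symlink support” part could be as simple as trying to create one. A sketch, with symlinks_supported and choose_editable_strategy being hypothetical names for this example; a real back-end would probe the directory it actually cares about.

```python
import os
import tempfile


def symlinks_supported(directory=None):
    """Return True if a symlink can be created under `directory`."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target")
        link = os.path.join(tmp, "link")
        open(target, "w").close()
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError, AttributeError):
            # e.g. Windows without the required privilege, or a filesystem
            # that does not support links.
            return False
        return os.path.islink(link)


def choose_editable_strategy(directory=None):
    """Pick symlinks when available, otherwise fall back to a .pth file."""
    return "symlink" if symlinks_supported(directory) else "pth"


print(choose_editable_strategy())
```

This only tells the back-end about its own build environment. It says nothing about whether the installer on the other side can extract the link.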

But frontends need to change to (a) use a zipfile implementation that supports symlinks, and (b) handle any exceptions when the install environment doesn’t support symlinks. So expecting frontends to not have to change is impractical in reality.

So for your use case, “installs must fail if symlinks cannot be created or the target doesn’t exist, symlink targets must be absolute paths, and front ends should/must error if they find symlinks in wheels that did not get created by a build_editable call” is an acceptable rule. Fair enough, but I suspect other tools/users may have different opinions (for example, as a frontend, I don’t like hard-coding a difference between the editable and non-editable path, and the use case for Unix .so files that originally prompted this discussion wouldn’t work with that set of restrictions).

Someone needs to collect use cases and define rules that work acceptably for all use cases.

Are symlinks better than .pth files for some reason? I’m curious what the differences are.

Anyway, here’s a thing I’ve been pootling around with privately, not sure where it’s going, but it does have some detailed thoughts about symlinks in wheel-like formats here: posy/README.md at main · njsmith/posy · GitHub

I don’t think .pth files help when you want to symlink .so files (for example to ship both a versioned .so and an unversioned one, which would be our primary use case for symlinks in PyArrow).