Proposal - sharing distribution installations in general

I’m picking up on the “[Distutils] [proposal] shared distribution installations” thread.

Hi everyone,

For a while now, various details of installing Python packages into
virtualenvs have caused me grief:

a) typically each tox folder in a project is massive and full of duplicate files; recreating, managing and iterating on them takes quite a while
b) for nicely separated deployments, each virtualenv for an application takes a few hundred megabytes, which can quickly saturate disk space even if a reasonable amount was reserved
c) installation and recreation of virtualenvs with the same set of packages takes quite a while (even with pip caches this is slow, and there is no good reason why it couldn’t be practically instantaneous)

In order to alleviate those issues I would like to propose a new
installation layout, where instead of storing each distribution in every
Python environment, all distributions would share one store, and each
individual environment would only hold references to the packages that
were “installed/activated” for it.

This would massively reduce both the time required to create the contents of
the environments and the space they take up.

Since blindly expanding sys.path would lead to performance issues similar
to those seen with setuptools/buildout multi-version installs, this
mechanism would also need an element on sys.meta_path that handles
inexpensive dispatch to the top-levels and metadata files of each package
(offhand I would assume that linearly walking hundreds of entries simply
isn’t that effective).

However, some experimentation would be needed to see what tradeoff is
sensible there.
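A minimal sketch of what such a dispatching element could look like; the SharedStoreFinder name, the mapping format and the activation step are hypothetical, and namespace packages, single-module distributions and metadata lookup are ignored:

```python
import importlib.util
from importlib.abc import MetaPathFinder


class SharedStoreFinder(MetaPathFinder):
    """Hypothetical finder: resolves activated top-level packages by dict
    lookup instead of scanning hundreds of sys.path entries."""

    def __init__(self, toplevel_map):
        # e.g. {"requests": "/store/requests-2.28.1-<hash>/requests", ...}
        self.toplevel_map = toplevel_map

    def find_spec(self, fullname, path=None, target=None):
        top = fullname.partition(".")[0]
        pkg_dir = self.toplevel_map.get(top)
        if pkg_dir is None or fullname != top:
            # Not an activated distribution, or a submodule (those are found
            # through the package's own __path__ once the top level is loaded).
            return None
        return importlib.util.spec_from_file_location(
            fullname,
            f"{pkg_dir}/__init__.py",
            submodule_search_locations=[pkg_dir],
        )


# Activation would then be a single sys.meta_path entry built from the
# environment's config file, e.g.:
#   sys.meta_path.insert(0, SharedStoreFinder(load_activated_packages(cfg)))
```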

I hope this mail will spark enough discussion to enable the creation of
a PEP and a prototype.

Best, Ronny

Back then, some discussion about existing solutions and possible implementations happened (of course with all the issues around binaries still unsolved).

This time around I’d like to get to a kind of specification where there is something that handles import locations for distributions (including binaries), sorts out installation/uninstallation of the associated scripts, and is implementable by pip.

– Ronny


So are you after a way to share what ends up in site-packages, the Python environment itself, or both?

site-packages

The basic idea is to have some kind of shared location plus a way to set up virtualenvs/importing so that each virtualenv/Python installation has a config file a few dozen kB in size instead of hundreds of megabytes of duplicate packages.
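To make that concrete, such a per-environment config file might contain nothing but references into the shared store; the format, key names and store layout below are purely illustrative, not a proposal:

```python
import json
from pathlib import Path

# Purely illustrative: the environment records *references* into a shared
# store instead of carrying copies of the files.
env_config = {
    "store": "/opt/python-store",
    "distributions": [
        {"name": "requests", "version": "2.28.1", "wheel_hash": "sha256:<...>"},
        {"name": "pytest", "version": "7.1.2", "wheel_hash": "sha256:<...>"},
    ],
}

# Even with a few hundred entries this serializes to kilobytes, versus the
# hundreds of megabytes a full copy of the packages would take.
Path("env-config.json").write_text(json.dumps(env_config, indent=2))
```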

This would also reduce the setup/creation time for larger virtualenvs from minutes to seconds.
(I recall deployment situations where each version of an application would be deployed into a new virtualenv, and we quickly had to do cleanup because each virtualenv was hundreds of megabytes of the same Python packages all over again.)

The same holds true to some extent for having many projects with many tox envs - I’m losing tens of gigabytes of disk space to duplicate Python package installs.

In theory you can achieve this with the current system via symlinks/junctions for root-level directories, and hard links for files against some master copies of the libraries. The only difference then would be that you would want to rewrite the RECORD file so that the installer can remove only these root-level links. You would want to make all master files read-only to avoid corruption. This would just require a slightly modified pip. I’ve played around with getting something similar with pip/setuptools bootstrapping in the new virtualenv PoC.
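A rough sketch of that hard-link part, assuming a read-only master copy per distribution (the paths and the link_distribution helper are made up for illustration):

```python
import os
import stat
from pathlib import Path


def link_distribution(master_dir: Path, site_packages: Path) -> list[Path]:
    """Hard-link every file of a read-only master copy into site-packages and
    return the created paths (what a rewritten RECORD would need to list)."""
    created = []
    for src in master_dir.rglob("*"):
        if src.is_dir():
            continue
        # Hard links share the inode, so making the master file read-only
        # also protects every environment that links to it.
        src.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
        dest = site_packages / src.relative_to(master_dir)
        dest.parent.mkdir(parents=True, exist_ok=True)
        os.link(src, dest)  # hard link instead of a copy
        created.append(dest)
    return created
```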

I’d like to strictly avoid symlinks, else it’s too easy for people to think “let’s edit my virtualenv to try to debug something” and end up with a global edit.

The packages to be shared need to live somewhere that’s read-only, and their integration should happen via config, not by abusing filesystem features to get things across.

But if the target is read-only they won’t be able to do it. The config part can be what drives this, so I see no contradiction with what you’ve recommended.

Conda does this using hardlinks (at least on Unix, not sure what they do on Windows). So I guess they’re probably pretty familiar with the real-world benefits/problems of that approach. Asking them or browsing their issue tracker might give you some useful insights.


That people won’t try that is an interesting hope ^^

I think it’s very helpful to avoid mirroring the filesystem just to make the packages importable.

From my POV conda uses a hack that “works”, but it still mirrors the filesystem and it still needs its own system to manage who owns the data.

I’d like a system that avoids those details.

In that case I must say I don’t understand what you’re proposing. Your two posts seem to directly conflict with each other:

In the first you’re proposing that Pythons would share storage, implying that they are no longer responsible for ownership/management of the package files (which leads me to conclude that someone else must be); while in the second you seem to be against any other person/system owning it. Someone must manage these golden instances of the packages, no?

I’m also not clear what’s being proposed here. In general terms, a system that stored various versions of packages “somewhere” (in a filesystem cache, or whatever) and then had a list of precisely what project/version combinations should be exposed in a given Python interpreter is easy enough to write, using importlib and custom finders/loaders. That should¹ give a solution without the well-known performance hit of having huge numbers of directories on sys.path.

But managing that data structure would be a pretty manual job.

¹ In the sense that it’s “only” a matter of writing the code :wink:

This seems to imply that you want to have the new package structure managed by pip (or to keep the discussion generic, by “standard packaging tools”). That sounds like a big ask - pip (and the packaging ecosystem in general) is very closely tied to the standard site-packages structure and the sysconfig installation path mechanism. Moving away from that would be a major change (although it’s precisely the sort of flexibility we had in mind back when we first wrote PEP 302, so in a lot of ways I’d be very much in favour of the idea).

Maybe the best approach would be to develop a proof of concept. Use importlib to put together the runtime side of a suitable “package store” and hack up a tool that unpacks wheels into your store format, to build the store from a bunch of downloaded packages. If that works out well, we can look at having installers natively support that store (I’d strongly recommend only trying to handle unpacking wheels into that store, and rely on installers doing source builds to wheels as the route for supporting source installs).
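The “hack up a tool” part could start out as small as unpacking a wheel into a content-addressed directory; the store layout here is an assumption for illustration, not an agreed format (and it glosses over data/scripts directories and bytecode compilation):

```python
import hashlib
import zipfile
from pathlib import Path


def unpack_into_store(wheel_path: Path, store_root: Path) -> Path:
    """Unpack a wheel into a store path keyed by the wheel's hash."""
    digest = hashlib.sha256(wheel_path.read_bytes()).hexdigest()[:16]
    store_path = store_root / f"{digest}-{wheel_path.stem}"
    if not store_path.exists():  # already unpacked: sharing means nothing to do
        with zipfile.ZipFile(wheel_path) as wheel:
            wheel.extractall(store_path)
    return store_path
```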

This approach would have the possible added benefit of decoupling builds and installation a little, making “unpack wheel into a package store” into a separate operation that we could start to specify as an independent standard, and work towards decoupling from pip. So pip would use whatever “wheel unpacker” the user wanted, initially having a “wheel to site-packages” unpacker that worked like the current code, but longer term we’d have a more pluggable approach that allowed for user-specified unpackers. I’d love to see something like that, from the perspective of refactoring and simplifying pip (and the general concept of an “installer”), but I think it needs to be driven by the creation of a usable alternative “store”, rather than being designed on a purely theoretical basis.

tl;dr: Start by building the store you’d like, and once that’s demonstrated the usefulness of having alternatives to site-packages, we can look at standardising an installer interface.

Longer term, I could easily see install schemes like “system site-packages”, “user install”, “--target based installs” being separate unpacker backends.

But as I say, this feels like a pretty long term and complex goal - quite likely even more work than PEP 517/518 were.

PS If I’ve completely misunderstood what the original posting was asking for then I apologise. In that case treat the above as a bit of random musing on long term options for packaging standards :slightly_smiling_face:


There is a bit of a misunderstanding:

I don’t want to solve building wheels.
I don’t want to solve “installing wheels”.

I want to solve having a standardized shared location for wheels, so we can stop wasting so much disk space.

So a PoC is implementable in terms of pip and distlib,
but full pip support to enable the structures and manage the data structures/installed scripts properly is an end goal.

With the proposal of symlinks/hardlinks, ownership would be confusing in terms of filesystem structure,
so in order to get ownership managed nicely, a different way is needed.

Would this imply wheels don’t need to be extracted? Where would you put the compiled files (pyc)? Who would manage this shared location? (That is already a form of ownership; I find very little difference between managing wheels and the extracted wheels we have today.)

I would like to apologize for not wording this carefully enough.

I want to manage unpacked wheels that are importable, in a way that they are not duplicated across dozens of environments.

I’m still not clear what you’re asking for. On a per-application basis, this should be manageable by using pip install --target and either sys.path manipulation or a custom import hook. If you want something common to multiple applications and/or natively supported by packaging tools and the Python import mechanism, then it’s still doable, but you’ll have to propose something a bit more concrete. There are a lot of possibilities here and no clear “one size fits all” best solution. (As a database guy, this cries out to me for a sqlite database holding the various packages and their metadata, but that’s purely in the abstract).

I can’t see tools like pip and pkg_resources supporting a new package storage layout until it’s proved its usefulness in real world applications, so I’d start by developing a storage layout that works for apps like tox that need to manage multiple environments the way you’re suggesting. You’d need to use ad-hoc tools to put packages into that store, at least in the initial stage. If and when the layout proves popular/useful, that’s when I’d suggest proposing it as a new standard that installers and package discovery tools should support natively.

If you’re looking for advice and use cases on developing such a generally useful layout, you probably want to canvass developers of tools like tox, pipx, nox, etc. I’m not sure how many of them hang out here, but it would be worth making that clear. Otherwise you’ll probably just get confused packaging tool developers like me adding distracting comments :slightly_smiling_face:

Edit: This topic is interesting enough to me personally, that I might well play with prototyping some sort of sqlite-backed import hook. But only as a toy “proof of concept” - I’ve no feel for what would be practical for real world apps.
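Purely to illustrate that abstract idea (nothing here is a proposed format), the metadata side of such a store could be a couple of sqlite tables, with the module files themselves still living on the filesystem:

```python
import sqlite3

# Hypothetical schema: which distributions exist in the store, and which
# environment exposes which of them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS distributions (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    version    TEXT NOT NULL,
    store_path TEXT NOT NULL,
    UNIQUE (name, version)
);
CREATE TABLE IF NOT EXISTS environments (
    env_name        TEXT NOT NULL,
    distribution_id INTEGER NOT NULL REFERENCES distributions(id),
    PRIMARY KEY (env_name, distribution_id)
);
"""

with sqlite3.connect("store.db") as conn:
    conn.executescript(SCHEMA)
```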

There is an existing way to refer to an out-of-site-packages distribution: egg-link. It is under-specced and based on an old format, but it is a big part of the existing editable install workflow that needs to be modernised at some point. Would it make sense to work in that direction? (I am not very sure what that would look like though, TBH.)


I believe that egg links have no future, as they break certain capabilities indispensable for modern packages (for example, you can’t map a src directory to a top-level package name without breaking editable installs).

I want to create something that will be a standard used by multiple Python installations and virtualenvs:

  • having a few projects with a few dozen tox envs shouldn’t take gigabytes of space - it should take kilobytes
  • creating a tox env shouldn’t take minutes - it should be so cheap that it can always be done and takes less than a second

I’m happy to implement a prototype - but I want to flesh out some semantics first.
BTW, a SQL database is a really bad match IMHO, as it can’t support importing compiled extension modules natively.

I think the following is desired:

  • a store where each store path contains the unpacked output of a wheel. The store path could be prefixed by the hash of the wheel to make it unique
  • a way to compose environments out of those store paths

This follows quite closely how Nix works, except it ignores the whole building aspect, which is fine because pip already does sandboxing nowadays anyway. If one has many virtual environments, often containing the same wheels, then this could have a significant effect. Of course, it is more complex than separate directories.

What I would like to see is indeed also a way to compose environments out of such a store structure.

We do this currently in Nixpkgs, and it’s a bit hacky because

  • using PYTHONHOME means users can’t use it themselves anymore, and environment variables leak as well
  • PYTHONPATH can’t be used for the same reasons - and what about programs that use the interpreter location? There are cases where that causes trouble
  • we now have a sitecustomize.py that handles a NIX_PYTHONPATH variable and uses site.addsitedir to process it. We also set the executable path and the prefixes. Not ideal, but it works for 99.9% of cases (see the sketch below)
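A simplified sketch of that sitecustomize.py idea (not the actual Nixpkgs code, which also adjusts sys.executable and the prefixes):

```python
# sitecustomize.py - simplified sketch, not the real Nixpkgs implementation
import os
import site

# Use our own variable instead of PYTHONPATH so user environments are not
# polluted; site.addsitedir also processes .pth files in each directory.
for directory in os.environ.get("NIX_PYTHONPATH", "").split(os.pathsep):
    if directory:
        site.addsitedir(directory)
```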

Aside from composing the Python modules into a working environment there is one more major issue: shebangs. In Nixpkgs we now wrap all executables.
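“Wrapping” here essentially means replacing each entry-point script with a small stub that sets up the environment and then hands control to the real script; a simplified illustration (all paths are placeholders, and this is not the actual Nix wrapper):

```python
#!/usr/bin/env python3
# Simplified illustration of a wrapper stub, not the actual Nix wrapper.
import os
import sys

# Placeholder paths: point at the composed environment and the real script.
os.environ["NIX_PYTHONPATH"] = "/path/to/composed/site-dirs"
REAL_SCRIPT = "/path/to/store/bin/.tool-wrapped"

os.execv(sys.executable, [sys.executable, REAL_SCRIPT, *sys.argv[1:]])
```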
