Proposal - sharing distribution installations in general

Could you explain that sentence in clearer terms? I have trouble parsing it. For example, how is hard-linking files a ‘hack that “works”’?

I’ll note that conda packages more things than just Python packages, so even if Python had a dedicated solution to this, conda would still benefit from its own generic solution.

IIRC from seeing @RonnyPfannschmidt talk about this in the past, what he’s effectively looking for is to change how environments work in Python.

Currently, environments work by each of them getting a dedicated site-packages directory, and any time you install something it gets a full copy of all of its files inside that site-packages directory. This leads to duplication of files any time you install the same version of a thing more than once, and consumes extra disk space to store the same data multiple times.

In the hypothetical solution, there would instead be some mechanism to simply list out which versions of which things should be available inside a particular environment, and then the environment would do “something” to make them all available.

I don’t know that @RonnyPfannschmidt has a specific plan for what that “something” is; it appears that he’s hoping for a discussion about what that could be.

From my POV, there’s effectively a trade-off to make here: the current situation duplicates data on disk, but makes the system as a whole easier to reason about in several ways:

  • Environments are self-contained; once you delete them they’re gone and there’s no extra clean-up task needed.
  • You can just pop open an editor and live modify code (this is useful for debugging etc) and then once you throw away the environment there is no longer any concern about that change “leaking” out to other environments.
  • There’s no problem to deal with in terms of ensuring that permissions sync up (imagine for instance if the wheels live in /foo and you create an environment in /bar, and someone has read/execute access for /bar/ but not for /foo).

The main downsides to this @RonnyPfannschmidt already touched on, but they’re primarily optimizations:

  • Speed (if you already have foo-1.0 “installed”, then installing it into a new environment is just updating the record that says “x has been installed in this environment”).
  • Disk space (1 copy of each thing instead of N copies).

The good news is, this is something that can be experimented with without any real buy-in from anyone else: a virtual-environment-alike can be created that installs a .pth file which sets up the meta path to create the importer that does the magic, and a hacked-up pip (or even another project entirely) can be created that manages such installs. If it works well then we can explore making it more of a default option, or we can decide we don’t like the trade-offs and keep the current setup as the default (but tools like tox etc. could of course decide to move to it if they like).
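To make the “.pth file that sets up the meta path” idea concrete, here’s a rough, purely illustrative sketch (the module name, directories, and finder behaviour are all made up, not an existing project):

```python
# _shared_dists.py -- hypothetical module; a one-line .pth file containing
# "import _shared_dists" would load it at interpreter startup
import sys
from importlib.machinery import PathFinder

# made-up locations of shared, versioned installs for this environment
SHARED_DIRS = [
    "/opt/shared-dists/requests-2.28.1",
    "/opt/shared-dists/urllib3-1.26.12",
]


class SharedDistFinder(PathFinder):
    """Meta path finder that resolves imports against SHARED_DIRS."""

    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        # top-level imports arrive with path=None; point those at the shared
        # directories, submodule imports keep the parent package's __path__
        search = path if path is not None else SHARED_DIRS
        return super().find_spec(fullname, search, target)


sys.meta_path.append(SharedDistFinder)
```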

I don’t think it’s even that complicated. If the wheels are all extracted in their own (versioned) directories, then a single .pth file can just list all the directories you want. Make sure that file is on sys.path (so alongside your main script or in an otherwise empty site-packages) and your imports are there.
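For concreteness, a minimal sketch of what such a file could contain (the directory names are made up; blank lines and lines starting with # are skipped, every other line is appended to sys.path):

```
# shared-dists.pth
/opt/shared-dists/requests-2.28.1
/opt/shared-dists/urllib3-1.26.12
/opt/shared-dists/idna-3.4
```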


I suggested something similar a while ago at https://github.com/pypa/packaging-problems/issues/328#issuecomment-590080515

The idea should be fairly straightforward IMO. Most of the complexity would be in managing the pth[1] files correctly, and in selecting the correct wheel installation (similar to the wheel selection logic in pip install).

[1]: We probably need another format if this is going to be standardised, pth is too tightly tied to its past and has a bad reputation.

Getting imports to work is easy; getting executables to also pick up the new environment is much harder. I think entry points would need to use #!/usr/bin/env python3 as the shebang instead of an absolute shebang. This, however, will require always starting a subshell with $PATH pointing to the python3 of our new virtual env.
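For illustration, a generated console script is roughly the sketch below; the only change the above implies is the shebang line (the mytool names are hypothetical):

```python
#!/usr/bin/env python3
# entry-point wrapper using the env-based shebang, rather than a hard-coded
# path such as #!/home/user/.venvs/myenv/bin/python3
import sys

from mytool.cli import main  # hypothetical entry point "mytool.cli:main"

if __name__ == "__main__":
    sys.exit(main())
```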


Let’s redeem it then, rather than reinvent it. Provided you avoid completely arbitrary code execution, there’s nothing wrong with extending your search path this way. It’s considerably better than using environment variables, anyway.

Adding more reliance on the shell being correctly configured feels like a step backwards. Scripts that are tied to a specific environment should be tied to that environment, not using /env and hoping that the user got everything right.

And that means you will always end up installing into your run-time environment, which will prevent you from composing an environment.

The idea that an installation is tied to an installation environment is something we need to get rid of; as long as all dependencies are fulfilled by a provided environment, then that environment is adequate. It does not matter whether that was the original environment used at installation time or not.

Using this shebang basically means we’re not tying down the exact interpreter yet at installation time. Think of installation as a function; instead of applying all arguments we partially apply it. Yes, we decide which version we use, but we do not say where it is located so that at run-time we can provide a different one to perform composition.
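The analogy in (purely illustrative) code:

```python
from functools import partial

def install(distribution, version, interpreter_path):
    ...  # hypothetical installation routine

# today: everything, including the interpreter, is bound at installation time
install("requests", "2.28.1", "/home/user/.venvs/app/bin/python3")

# proposed: bind the version now, supply the interpreter at run time
activate = partial(install, "requests", "2.28.1")
activate("/usr/bin/python3.11")
```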


Or alternatively, that the only reliable and complete form of environment is one that includes the Python runtime as well.

For the Tox scenario, the environments are temporary, so it doesn’t really matter. I expect pip will eventually get to a similar approach for build environments, which are also temporary.

But for persistent environments, or those that you’re planning to redistribute, you risk a lot by not locking down the runtime as tightly as you can. I know this is hard on Linux, but only because it requires working against established conventions of a system install (on Windows I made all the required runtime fixes to make it easy and nobody even noticed :wink: ).

We got to this point in the discussion because of scripts that you want to launch without specifying which Python install to use. Our experience has been that this causes more confusion than anything else.

Reusing package installs across environments is totally fine, and personally I’d love to have “fat” wheels/installs with binaries for a range of platforms/versions to make this even easier, but the “env python” shebang pattern is difficult to get right in any general sense. Anything dependent on running in “your” Python should be run with -m; anything independent of Python version should make its own decision.


Note that extending the search path (using e.g. .pth files) used to make imports significantly more expensive. I’m not sure that’s still the case (we implemented extended caching in importlib years ago), but I wouldn’t be surprised if there was still a variable per-search path entry cost.

Yes, IMO there are two ways that .pth files have a bad reputation. One is the “arbitrary code execution” feature (that some projects rely on, so removing this requires a transition process) and the other is “huge search paths are slow” (which as you say might be fixed now, so it may just be a PR issue at this point).
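(For anyone unfamiliar with the first point: site.py executes any line in a .pth file that begins with import, which is what makes a, purely illustrative, line like this possible:)

```
# surprise.pth -- once this is in a site directory, the import line below is
# executed at every interpreter startup
import sys; sys.stderr.write("this ran before your script did\n")
```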

Also, .pth files can be fiddly to use because they can only be added to site directories (pip, for example, needs to use addsitedir to enable .pth file support in build environments).
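A minimal illustration of that restriction (the directory path is hypothetical):

```python
import site
import sys

# Merely putting a directory on sys.path does NOT make its .pth files take
# effect -- any shared-dists.pth in here is ignored:
sys.path.append("/tmp/extra-packages")

# site.addsitedir() both adds the directory and processes the .pth files it
# contains, which is roughly what pip does for its build environments:
site.addsitedir("/tmp/extra-packages")
```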

A replacement for .pth files that keeps the benefits while removing the downsides might be easier to implement cleanly, rather than fixing .pth files “in place”. We can then deprecate the old-style .pth files once the new form is established.

https://bugs.python.org/issue33944

This has suddenly seen quite a bit of unexpected input; I’d like to clarify some details:

  • I don’t consider the console scripts to be part of a “distribution/wheel”;
    they get generated at “install/activation” time, so their creation should
    be part of the installation/activation process

  • It’s perhaps “wrong” to consider the shared cached distributions “wheels”; if one thinks in terms of “ready to activate” distributions instead, then the model becomes more capable

    For example, Linux distributions could ship multiple versions
    of a binary Python package as activatable in a distro-managed path,
    and have each Python application they ship use its own environment that just activates the correct versions
    -> version conflicts would no longer need to be solved between applications

  • pip could cache activatable distributions in the user’s home folder, just like it caches wheels, and gain a command to fix an environment if elements are missing (like when a distro update or a user request removed part of the distribution caches)

  • Wanting to edit a package in place should require installing it as editable (imagine if all editable distributions were stored in a distinct path, so tracebacks would automatically indicate it)