Environments with a Shared Package Installation Directory ('Julia-like' Packaging)

RobertRosca · December 1, 2022, 3:24pm

One-sentence summary: “The aim is to have multiple package versions installed side-by-side, and select which ones are importable at run time, based on a lock file.” (thanks @takluyver!)

This writeup got a bit too long so TL;DR at the top: as a proof of concept, I implemented a basic wheel installer which installs packages to .../multi-site-packages/{package_name}/{version} to allow for multiple package versions to be stored in a single environment; also made a basic importlib PathFinder which reads a requirements specification file (in this case a Poetry lock file) to decide which package versions to import.

With this you no longer need to create a virtual environment to have isolated dependencies for projects, all that is needed is a poetry lock file in a directory to specify which version of a package should be loaded from multi-site-packages/{package_name}/{version}.

IMO this potentially has a lot of benefits, the main one being that it removes the need for multiple venvs with duplicated packages installed inside them, but before spending more time on this I’d be curious to get some feedback from the community.

The repository for this proof of concept can be found here: GitHub - RobertRosca/pipm at feat/initial-dev (note the branch is feat/initial-dev)

My aim was to achieve this with as few changes as possible and in as simple a way as possible. I’m aware that there are a lot of caveats and issues with the implementation; it is just a proof of concept for storing multiple versions of packages in a central location and deciding which package to load from a lock file.

For anybody interested here’s some more details:

This is a proof of concept implementation of a Julia-like approach to packaging in Python. For those not familiar, here is a summarised version of the background to the Julia package manager (I recommend reading the page fully for those interested):

- Pkg is designed around “environments”: independent sets of packages that can be local to an individual project or shared and selected by name
- The exact set of packages and versions in an environment is captured in a manifest (lock) file
- Since environments are managed and updated independently from each other, “dependency hell” is significantly alleviated in Pkg
- The location of each package version is canonical
- When environments use the same versions of packages, they can share installations, avoiding unnecessary duplication of the package

To illustrate this better, here are some comparisons between current behaviour and a theoretical Julia-like implementation:

Installing a package without an environment (e.g. pip install --user):

Currently for python:
- pip install command without a virtual environment activated
- the package is installed under ~/.local/lib/python3.10/site-packages/{package_name}
- python looks through site-packages for imports
For Julia-like package system:
- install command adds the package to a pyproject file, creates/updates a lock file
- pip install --user equivalent command would add the package to a ‘user level’ pyproject file and update the lock file (e.g. Poetry, Pipenv, etc…), these files are stored in a user-level directory, e.g. under ~/.local/state/python3.10/envs/default/{pyproject,lockfile}
- package(s) installed under ~/.local/lib/python3.10/multi-site-packages/{package_name}/{package_version}
- python reads the pyproject/lock files and uses that information to decide which package versions to import

Now, when installing a package with an environment:

Currently for python:
- python3 -m venv .venv to create a new virtual environment
- source .venv/bin/activate to activate it
- pip install to install packages into the environment
- the venv is (ignoring --system-site-packages) completely isolated and all packages are installed in it independently of whatever else is on the system
For the Julia-like approach:
- environments are only defined by a pyproject file and lock file existing in a directory or parent directory, so ‘creating’ one just means having those files there
  
  details on local/global environments
  
  In Julia you have the concept of local environments which are what I describe above, where the project/lock files are in a directory, but you can also have environments stored in a central location which make activating environments anywhere you want possible, in a similar way to how conda works
- pip install equivalent command would add the package to the pyproject file and update the lock file (e.g. Poetry, Pipenv, etc…)
- package(s) installed under ~/.local/lib/python3.10/multi-site-packages/{package_name}/{package_version}
- python reads the pyproject/lock files and uses that information to decide which package versions to import

In both cases the key difference is that packages would continue to get installed under ~/.local/lib/ instead of into a virtual environment directory, with which package to use being specified in the pyproject.toml file.
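As a rough illustration of the layout described above, the install location for a given package version could be computed as below. The function name `versioned_install_dir` and the example base directory are placeholders chosen for this sketch, not names taken from the PoC.

```python
from pathlib import Path


def versioned_install_dir(base: Path, package_name: str, version: str) -> Path:
    """Return base/multi-site-packages/{package_name}/{version}."""
    return base / "multi-site-packages" / package_name / version


# Two versions of the same package get two separate directories,
# so installing one never overwrites the other.
base = Path.home() / ".local" / "lib" / "python3.10"
print(versioned_install_dir(base, "click", "8.1.3"))
print(versioned_install_dir(base, "click", "8.0.4"))
```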

There are a lot of benefits to this approach, but IMO the main ones are:

- Always have a file that specifies what your current environment is, even when just using user installs
- Avoids unnecessary duplication of package installs, lowering the space used on user devices and the time taken for installs
- No overwriting of packages during updates

As a proof of concept I implemented this in the most basic way I could think of. It’s pretty hacky, but it works well enough to demonstrate the idea. The PoC works by:

- Using installer to implement a basic wheel installer that installs packages to multi-site-packages/{package_name}/{package_version}
- Using Poetry to manage the pyproject.toml and poetry.lock files
- Adding a custom importlib finder which reads a lockfile and inserts the path to the requested version of the package into sys.path before importing
- Importing this finder and prepending it to sys.meta_path in the sitecustomize.py file
- Adding a very crappy pipm (meaning ‘pip multi’, I am not creative) CLI call which just runs pip download . -d ./tmp-wheelhouse in a Poetry-managed project, then runs the custom wheel installer on all files in the wheelhouse, this is what actually installs dependencies to multi-site-packages

I tend to use Poetry for all of my projects, so to test this out I ran pipm in a few different repos to populate the multi-site-packages directory and played around a bit. Surprisingly enough, this basic approach sort of works:

~/.../pipm ❯ python3 -c 'import click; print(click.__file__)'
/home/roscar/.cache/pypoetry/virtualenvs/pipm-p7aS5F8W-py3.10/lib/python3.10/multi-site-packages/click/8.1.3/click/__init__.py

~/.../beanie ❯ python3 -c 'import click; print(click.__file__)'
/home/roscar/.cache/pypoetry/virtualenvs/pipm-p7aS5F8W-py3.10/lib/python3.10/multi-site-packages/click/8.0.4/click/__init__.py

~/.../starlite ❯ python3 -c 'import click; print(click.__file__)'
/home/roscar/.cache/pypoetry/virtualenvs/pipm-p7aS5F8W-py3.10/lib/python3.10/multi-site-packages/click/8.1.3/click/__init__.py

~/.../starlite ❯ python3 -c 'import httpx; print(httpx.__file__)'
/home/roscar/.cache/pypoetry/virtualenvs/pipm-p7aS5F8W-py3.10/lib/python3.10/multi-site-packages/httpx/0.23.1/httpx/__init__.py

~/.../beanie ❯ python3 -c 'import httpx; print(httpx.__file__)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'httpx'

Above you can see:

- In pipm directory, click is version 8.1.3
- In beanie directory, click is version 8.0.4
- In starlite directory, click is version 8.1.3
- In starlite directory, httpx is version 0.23.1
- In beanie directory, httpx is ‘not installed’

Which is the desired behaviour.

I’d be interested in hearing feedback for this approach, both in the context of a standalone tool and on the potential of something vaguely like this being included in Python.

Caveats

I have found some similar discussions on having multiple packages installed at the same time (Installing multiple versions of a package - #13 by PythonCHB, Allowing Multiple Versions of Same Python Package in PYTHONPATH), however they were mostly centred around the idea of being able to import incompatible versions of packages in the same environment, which this does not attempt to do or enable. You would still have one and only one version of a package available.

There has been a lot of discussion on this topic in the past and there is ongoing discussion on topics like PEP 582, but I didn’t find anything too similar to this apart from a few suggestions like “optimize package installation for space and speed by using copy-on-write file clones ("reflinks") and storing wheel cache unpacked” (Issue #11092 · pypa/pip · GitHub).

Also, I am aware that the proof of concept implementation has a great many flaws, but my goal with it was to keep it simple and minimal, not complete.

pf_moore · December 1, 2022, 3:34pm

Personally, I’d be unlikely to use something like this, as it seems like it could be too complicated and have too many rough edges to be useful. For example, VS Code support - the editor thinking that my script’s dependencies aren’t installed.

But as a general concept, I’m glad to see people innovating and trying out ideas like this. If it turns out to work well and be popular, I may well be glad to “jump on the bandwagon”

But I don’t think I’ll be an early adopter of this one.

brettcannon · December 1, 2022, 11:59pm

How do you resolve the case where you just happen to have suitable packages installed, but not the latest versions that would have been installed had you created a fresh pyproject.toml? Are you relying on there always being a Poetry lock file to help with this case? Virtual environments, by the very nature of being empty upon creation, avoid this issue because you have to do a fresh install just to get going. I can see a similar situation here if you got an ImportError when no lock file is present, telling you that you need to generate the list of projects to pull in.

Another way to do this is to set up a parallel sys.path just for the packages you want to pull in and create your own MetaPathFinder which utilizes that list of directories. That way you aren’t having to use a meta path finder to effectively race the rest of the import system to insert your directories into sys.path. The other way to do this is with a sentinel value in sys.path and your own PathEntryFinder which understands that sentinel value and then does the appropriate thing, once again without trying to race the import system.
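A minimal sketch of the first suggestion, a finder that owns its own list of directories and never touches sys.path, might look like this. `ParallelPathFinder` and the `parallel_demo_mod` module are hypothetical names for illustration; the only real API relied on is `importlib.machinery.PathFinder.find_spec`, which accepts an explicit list of path entries.

```python
import importlib
import importlib.abc
import importlib.machinery
import sys
import tempfile
from pathlib import Path


class ParallelPathFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that searches its own directory list, never sys.path."""

    def __init__(self, search_path: list[str]) -> None:
        self.search_path = search_path

    def find_spec(self, fullname, path=None, target=None):
        if path is not None:
            # Submodules are located through the parent package's __path__.
            return None
        # Reuse the standard path-based machinery, but on the private list.
        return importlib.machinery.PathFinder.find_spec(fullname, self.search_path)


# Demo with a made-up module stored outside sys.path.
versioned_dir = Path(tempfile.mkdtemp()) / "parallel_demo_mod" / "0.1.0"
versioned_dir.mkdir(parents=True)
(versioned_dir / "parallel_demo_mod.py").write_text('VALUE = "from the parallel path"\n')

sys.meta_path.insert(0, ParallelPathFinder([str(versioned_dir)]))
mod = importlib.import_module("parallel_demo_mod")
print(mod.VALUE)
print(str(versioned_dir) in sys.path)
```

Where the finder sits in sys.meta_path decides precedence: at the front, as here, its directories win over the standard library and the current directory, while appending it would make it a fallback. That ordering is one of the import-semantics questions this approach would have to settle.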

The hurdles that I can think of off the top of my head that would need to be worked out (ignoring timelines):

- What’s the path where everything is going to be stored and how do you make it so you can use a different directory if needed (probably environment variable)?
- Is normalized project name and normalized version enough for the directory structure?
- How would we want to implement this in Python core for it to work appropriately?
- Would any traditional import semantics need to be changed to make this work reasonably (i.e. doing this as a meta path finder makes the most sense, but that changes when the current directory would come into play, hence the special string for sys.path idea above)?
- Would this be left out if you ran with -I?
- How does one signal they want this form of environment?
- Does this require us getting a lock file standard?
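For the first two questions, a sketch of what an environment-variable override and a normalized directory name could look like is below. The name normalization is the rule from PEP 503; the variable name `PYTHON_MULTI_SITE_PACKAGES` and the helper names are invented for this example, and version normalization is left out.

```python
import os
import re
from pathlib import Path


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def store_root(default: Path) -> Path:
    """Return the package store, honouring a (hypothetical) environment override."""
    override = os.environ.get("PYTHON_MULTI_SITE_PACKAGES")  # made-up variable name
    return Path(override) if override else default


def package_dir(default_root: Path, name: str, version: str) -> Path:
    """Directory for one package version: {root}/{normalized name}/{version}."""
    return store_root(default_root) / normalize_name(name) / version


print(package_dir(Path("/opt/multi-site-packages"), "Typing_Extensions", "4.4.0"))
```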

brettcannon · December 2, 2022, 5:14pm

- How do you handle the case for different interpreters? E.g. how do you make sure you get the best/fastest wheel on CPython and on PyPy when they come from different wheels?
- How do you make it super-cheap to calculate where to import from? If you’re expecting to be reading and parsing some file to calculate what to import and where to import it from you want it to be as cheap as possible (this is why the import system now caches so much; file system operations can be surprisingly expensive under some OSs).
- What about wheels that span interpreter versions? Think py3 wheels or abi3 wheels like cp36-abi3 that also work for CPython 3.10. Is the disk saving critical or best-effort, and thus it’s okay to install the same wheel multiple times to make this case simpler?
- How do you handle editable installs?
- Would tools generating the lock file or calculating what to “install”, by default, prefer to still go out and calculate what the best match is, or do they prefer what’s on disk first?

PythonCHB · December 3, 2022, 8:22am

Is this any better than using hard links, as conda does? IIUC, Windows doesn’t support that, so maybe that’s reason enough.
(I have no idea if the other virtual environment systems use hard links)

But the main issue I’m trying to wrap my head around is how this would work with multiple “applications” that you want to use together.

I can see how this idea could work for running a particular monolithic application, say a web server, but maybe not so much for other workflows.

I often find myself setting up a conda environment, and then using that to run any number of scripts, some Python and some not. For a given workflow, I need them all to run, and all be using the same versions of various packages. And my current working dir is going to be nowhere near any of the code – it’s where the data I’m working with is. So where would the “lock file” go? how would all the various scripts I’m running, from various packages, find it, and know to use it?

How might all this interact with Jupyter Notebooks and iPython (and various IDEs…)

And to echo one of Brett’s points: “Editable” installs are a critical feature!

I don’t mean to seem negative, and I’m not saying this approach couldn’t work – but these are use cases that are important to many.

One other note: big systems can have a LOT of packages needed with given versions – on order of hundreds – could this be made reasonably performant for those? adding a huge pile of dirs to sys.path doesn’t seem very efficient.

CAM-Gerlach · December 3, 2022, 8:40am

Hard links are supported on Windows; it’s symlinks that require manually enabling the permission, though Conda may not use them on Windows for other reasons.

At least venv, and I believe virtualenv, doesn’t, since they are self-contained and there is no centralized place they could easily link to.

Yeah, that’s been the fundamental blocker to many of these proposals, so long as things are interoperating at the Python layer, or exchanging data using Python objects or things that get serialized/deserialized to them (pickles, etc). Less coupled forms of IPC could work for simpler cases, that aren’t tightly coupled and feature little interaction between processes and relatively compatible versions on both ends, but the more complex the applications, the more substantial the interaction and data exchange, the more dependencies being used and the more they diverge, the greater chance something in one process might rely on a detail of a dependency in another process, and behave unexpectedly if different versions are present. On a broader sense (e.g. with conda), this is an issue with binary deps in general.

brettcannon · December 5, 2022, 11:06pm

Does what environment you use with that data change, or are you totally fluid with your environments and what you use them with? If you consistently use the same environment/packages every time you work with that data then I suspect the assumption is you will copy/symlink the lock file to your data directory. Otherwise some activation mechanism would be needed for that workflow.

RobertRosca · December 5, 2022, 11:25pm

Took a nice holiday to Prague for the last few days so I haven’t had time to reply, but I’ll catch up on the discussion over the next day or two.

A lock file existing (or being generated if missing) is a hard requirement.

Sounds like the issue you’re getting to is having some version of a package already installed which then gets re-used with new environments, and is never updated, correct? The approach I was thinking of for this is not to just add a wildcard requirement to pyproject.toml, but to have a version range specified.

For example in all of Poetry/Pipenv/PDM when a package is first added to a new pyproject.toml file it is given a version range, typically around the current release.

Thanks for the tip! That sounds much better than what currently happens.

What’s the path where everything is going to be stored and how do you make it so you can use a different directory if needed (probably environment variable)?

I arbitrarily picked multi-site-packages (adjacent to site-packages) for this, but it of course could be set to anything and I’m open to suggestions.

Is normalized project name and normalized version enough for the directory structure?

Perhaps, I’ll have to read through the packaging-related PEPs but I believe that there are some situations where packages can write stuff outside of platstdlib, not sure off the top of my head though.

How does one signal they want this form of environment?

Would any traditional import semantics need to be changed to make this work reasonably (i.e. doing this as a meta path finder makes the most sense, but that changes when the current directory would come into play, hence the special string for sys.path idea above)?

I do not think so, imports should remain the same. The auto-activating environment based on CWD thing is a bit of a red herring which, in hindsight, I should not have mentioned this early.

This approach would not get rid of the ability to just directly activate an environment, analogously to sourcing the activate script.

What would definitely be useful, and now possible, would be to activate an environment programmatically within Python - e.g. import venv; venv.activate("path/to/environment").

As for signalling you wish to use it there are a lot of options and I’m not sure which to go with. For an implementation within core Python then perhaps a flag when creating a venv would work, something similar to --system-site-packages, like --versioned-site-packages? Or this could be its own command, or it could be a configuration in pyproject.toml, there are a lot of ways this could work.

Would this be left out if you ran with -I?

Sure, part of the reason for using a separate multi-site-packages directory was to be able to isolate it from the existing site-packages.

How would we want to implement this in Python core for it to work appropriately?

Does this require us getting a lock file standard?

No easy answer for these yet. Something like this would provide the most benefit to people who already use venv-based package/project managers, if this approach actually works well and gets adopted across multiple parts of the community then IMO it would be worth integrating core parts of it into python, similarly to how virtualenv was partially integrated in as venv.

How do you handle editable installs?

The main aim of this is to enable the re-use of packages that are common between venvs, so editable installs would continue to work as they currently do. I guess it would be possible to have editable versions in multi-site-packages, allowing you to share an editable package between different environments, but that is probably a bit of a niche case.

How do you handle the case for different interpreters? E.g. how do you make sure you get the best/fastest wheel on CPython and on PyPy when they come from different wheels?

How do you make it super-cheap to calculate where to import from? If you’re expecting to be reading and parsing some file to calculate what to import and where to import it from you want it to be as cheap as possible (this is why the import system now caches so much; file system operations can be surprisingly expensive under some OSs).

What about wheels that span interpreter versions? Think py3 wheels or abi3 wheels like cp36-abi3 that also work for CPython 3.10. Is the disk saving critical or best-effort, and thus it’s okay to install the same wheel multiple times to make this case simpler?

Would tools generating the lock file or calculating what to “install”, by default, prefer to still go out and calculate what the best match is, or do they prefer what’s on disk first?

All good questions, I have not thought about these too much yet but they are potentially large issues that would need to be figured out early on, since they could potentially prevent something like this from working in a reasonable way. I will add a section to the readme in the proof of concept repository with these as ‘open questions’, along with the other ones you’ve asked.

Thanks a lot for all of the feedback and questions! They were very useful, I’ll investigate the potentially breaking issues to see what approaches there could be to deal with them.

RobertRosca · December 5, 2022, 11:49pm

IMO yes for a few reasons:

- As you mentioned, OS/FS compatibility is a potential issue
- Another is that hard links cannot go between file systems
- A recurring issue at my workplace (for good technical reasons which are probably not very common) is hitting inode limits, which would still be a problem with symlinks

Yeah, I did shoot myself in the foot a bit with the cwd thing; pretend I never mentioned that part. The core aim is avoiding duplication of package installs across venvs, and a method to explicitly activate an environment should definitely still be present.

No worries, I came here for this kind of feedback

Good point, I was also worried about that. With my current implementation sys.path itself is not modified for Python, it is only modified during the process of loading a specific module if the module is present in a lock file, so it doesn’t end up exploding into a permanent huge list of all paths of all packages.

My feeling is that this custom module finder approach should not have huge performance costs like modifications to sys.path may, but a gut feeling isn’t really the best evidence. Brett had some suggestions on improving how this works which I will implement, then I will set up an environment with hundreds of packages and test this in a few situations, both with my hardware as it is and with some artificial restrictions to simulate a scenario with a slower computer.

RobertRosca · December 5, 2022, 11:51pm

Thanks for the feedback everybody, I will post updates on this thread as I look into the comments, questions, and suggestions you’ve all left

RobertRosca · December 6, 2022, 12:31pm

I recently received a message on Mastodon from konstin (konsti) · GitHub (not sure if they have an account here) saying:

Hi! i’ve seen your post on pipm and you might want to look into GitHub - konstin/poc-monotrail: Proof Of Concept for python package management without virtualenvs where i’ve implemented something similar, including installing wheels only once and a custom PathFinder. Specifically instead of ~/.local/lib/python3.10/multi-site-packages/{package_name}/{package_version} i do ~/.cache/monotrail/installed/{distribution}/{version}/{wheel-tag}, and when loading something i check if i have a matching wheel tag (or install a compatible one otherwise)

This is very similar to what’s being discussed here, but it is way further ahead than my very basic proof of concept:

This proof of concept shows how to use python packages without virtualenvs. It will install both python itself and your dependencies, given a requirement.txt or a pyproject.toml/poetry.lock in the directory. […] Every dependency is installed only once globally and hooked to your code. No venv directory, no explicit installation, no activate, no pyenv. […] monotrail means to show you can use python without the traditional “installing packages in an environment”.

The readme has some examples of usage, within Jupyter/interactive environments imports are done via:

import monotrail

monotrail.interactive(
    numpy="^1.21",
    pandas="^1"
)

Which is an interesting approach, I assume there is also some kind of activate("path/to/environment") command as well but have not checked yet.

There are also a number of benchmarks in the readme.

PythonCHB · December 7, 2022, 6:56am

Hmm – maybe I’m too stuck in my current workflow, but I see the python environment and the data I’m working with as pretty orthogonal. I guess that’s because while I have some major projects, where creating/using a lock file for that project would make sense, I also do a lot of small one-off analysis / work on a few bits of data.

My current workflow makes this easy: I conda activate [*] an appropriate environment, and then do my thing. Works great.

[*] Actually I have a little shell alias, so all I need to do is workon envname to make it even easier.

brettcannon · December 7, 2022, 8:57pm

So that’s a bit of a shift then for packaging overall.

You can use a .pth file for that.

But they work because they get installed into the environment. And that works because environments don’t need to track what version they have used. And since you are explicitly trying to avoid creating a virtual environment that means you need to have some mechanism to handle the unversioned editable install. So I would argue that they won’t “continue to work as they currently do” and will require some consideration.

The import system caches everything, so it’s a one-time cost and is entirely based on your file system on how expensive that cost is.

And I’m the complete opposite. For me, code has dependencies, and so it is not interchangeable with simply any environment (so not orthogonal to me). Now I’m not saying you can’t be careful and make some code work with multiple environments or one environment work with multiple projects, but I consider that an optimization, not something inherent in packaging and a general concept I at least want to push for the community to adopt.

I also think that reuse workflow is a bit unique to (data) scientists who use the same key packages constantly in all of their work. I also wonder whether having faster packaging tools would make people want to continue with that practice or be more willing to make their work repeatable by recording their dependencies more often (which I suspect this proposal would help with).