Setting up some guidelines around discovering/finding/naming virtual environments

In various comments around PEP 704 - Require virtual environments by default for package installers, people seem to have liked the part about standardizing (or making it a guideline; choose your preferred term) the name for a virtual environment in your workspace directory.

Unfortunately not everyone likes having their virtual environment in their workspace (see PEP 704 - Require virtual environments by default for package installers - #41 by tacaswell as an example). I also wrote a whole blog post, Classifying Python virtual environment workflows, on the various ways people manage virtual environments, and they definitely vary, with no clear winner in terms of practices.

The problem space

The key axes along which people seem to differ in their virtual environment management are:

  1. How many (1 or multiple)
  2. Where they are stored (locally in the workspace or some shared, central location)
  3. What makes the virtual environment(s) special (i.e. if there are multiple virtual environments, how do they differ from each other?)

The one-environment-stored-locally case is covered by PEP 704 by recommending .venv. That seems to be what various tools are already doing, so I don’t think it’s really a contentious suggestion.

Things become tricky when the virtual environment is stored elsewhere on the file system. How are you to know where that location is? Each tool has its own default, so there’s no real way to discover any of this. A good number of tools that store environments in a central location respect the WORKON_HOME environment variable that comes from virtualenvwrapper, but that’s only helpful if the user happened to define it.

Then there’s the question of which virtual environment to use when there are multiple options. From a tooling perspective (and I’m speaking specifically from experience with VS Code and the Python Launcher for Unix), there’s currently no way to tie a workspace back to any of its virtual environments, or to know which environment is preferred when there’s more than one. This matters because you want some way to automatically select the right virtual environment: it lets you skip environment activation, which is a sticking point for beginners as well as for tooling in general. Activation is even an issue for advanced users, who may forget they activated a different environment earlier :sweat_smile:; I’m willing to admit I’ve done that, thanks to the side effect that activation follows you around your terminal by default with most tools, unless the tool provides a run command or you’ve set up some integration in your shell (which is not typically portable across OSs, let alone shells).

Proposal

To help deal with the “where are the centrally stored environments” and “which of my multiple environments” problems, I’m proposing that .venv can also be a file which points to a virtual environment. Think of it like a hacky symlink just for virtual environments. Any tool that has an activate command can simply write that file, and other tools can pick up on it. It’s also flexible enough to let tools keep as many virtual environments as they want, wherever they want, w/o having to coordinate on other details. And the perk of using .venv as the file name is that it’s already in a bunch of .gitignore files.

If people want concrete proposals for how the semantics would work, Support a `.venv` file · Discussion #165 · brettcannon/python-launcher · GitHub from @uranusjr suggests only supporting relative paths anchored to the file’s location, so no one can be accidentally/sneakily pointed at a virtual environment by a .venv file they were simply handed. I’m not sure if WORKON_HOME is widely used enough to just assume it’s a “standard” and also support it, but that’s also an option.

If we can also agree on where virtual environments can be stored globally/centrally, that’s also great! My guess is there’s some place in the XDG specification or something similar that suggests the directory we should use.

Add in a naming scheme to tie virtual environments to workspaces (when that makes sense), and that takes care of all the major issues I have seen come up from storing virtual environments outside of the workspace. The naming scheme would probably entail something like the tail of the path plus a hash of the directory, forming a subdirectory to contain all virtual environments for the workspace, and then potentially some naming scheme for the virtual environments themselves (e.g. purpose/label, Python implementation, and Python version).
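As a strawman (all names here are made up, not a proposal), the “tail of the path plus a hash of the directory” part could look something like:

```python
import hashlib
import pathlib


def workspace_env_dir(central_root: pathlib.Path,
                      workspace: pathlib.Path) -> pathlib.Path:
    """Compute a per-workspace subdirectory under a central environments root."""
    # Hash the full path so two workspaces both named e.g. `proj` don't
    # collide, while keeping the directory tail for human readability.
    digest = hashlib.sha256(str(workspace.resolve()).encode("utf-8")).hexdigest()[:8]
    return central_root / f"{workspace.name}-{digest}"
```

Individual environments would then live inside that directory under names encoding purpose, implementation, and version (e.g. `docs-cpython-3.11`).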

I do have a more elaborate, flexible solution outlined at Support a way for other tools to assist in environment/interpreter discovery · Discussion #168 · brettcannon/python-launcher · GitHub that could replace the location standardization (the file proposal would still be useful). But I’m not sure if it needs to be the way to handle this, and instead have the more elaborate solution act as an escape hatch for more elaborate scenarios (e.g. adding conda support to the mix for tools that need to support that scenario).

3 Likes

I mentioned it on the PEP 704 thread, but something I think would be useful is for whatever direction we go here to support multiple environments for the same project, even if one of them is the default.

Maybe that means instead of .venv we want .venvs/, with some default named environment. This is kind of akin to what Cargo does with its build directories, and I think it works pretty well.

I don’t have a strong preference for the specific idea of having it be a file that points to another location. It seems fine to me if people really want it, but honestly, I can also see us narrowing down the problems of in-tree environments to the point where we might be able to just specify in-tree.

To me, that would be the real value of a PEP704-alike proposal. I think in-tree is a bad idea for many reasons laid out already, but principally for the following: having several projects use the same environment (of which there might still be several based on a grouping that’s natural) is something that happens a lot in the data science space, not least because full duplication would be really wasteful when several related repos need identical functionality.

This just goes to show my biases, but this was so obvious to me that I didn’t pick up on PEP704 requiring an in-tree solution. Still, it seems eminently possible to agree on a scheme that does something like:

$USER_HOME
|- .python
   |- env
      |- env_name_a
      |  |- ...
      |- env_name_b
      |  |- ...
      |- env_name_c
         |- ...

It’s possible to bikeshed over what the right root is (this would likely be OS-dependent, but that’s tractable), and what to call resp. how to nest .python and env, but this would be a huge step forward IMO, because it would give all those venvs a standard place where other tools could hook in as well. This would also make PEP 668 a trivial matter by leaving a single file in the respective env identifying the tool that created it.

From a UX point of view, I also think it would be possible to auto-generate env_name_a when someone creates a project, and later show them how to (re)name the environment if they wish to do so, resp. how to reuse it elsewhere.

If there is one thing I learned with my various workflow tooling involvements, it’s that there are firm believers in either in-project and central location camps and it is impossible to convince either camp to adopt the other’s approach. You either need to invent a scheme that works both ways, or arbitrarily pick one and enrage half of the community (which IMO is not a terrible solution actually).

5 Likes

I have two main use cases where in-tree is at best awkward.

The first is something like

src/
   lib1/
       .git/
       pyproject.toml
       lib1/
            __init__.py
            ...
   lib2/
       .git/
       pyproject.toml
       lib2/
            __init__.py
            ...
analysis/
   .git/
   script.py
   notebook.ipynb
   notes.org
   ...

So three separate git repositories in maybe different places in the file system and maybe with a mix of other projects mixed in as siblings at any given level.

A typical work session would look something like

. ~/.virtualenvs/an_env
cd ~/src/lib1
pip install -ve .
cd ~/src/lib2
pip install -v .
pip install --upgrade some stuff
cd ~/analysis
python script.py
# hack at lib1
python script.py
jupyter notebook &
# hack at lib2, ...
pip install -v ~/src/lib2
# re-run notebooks 

In this setup, which of the 3 checkouts should get the .venv? The most logical place would be analysis. However, then you have to understand enough about venvs to sort out how to get the two libraries installed into it (I assume tooling run from analysis could sort that out), etc. So if the story were just about running code from analysis with some (maybe source-installed) libraries, it might be OK. However, if you now want to also run the tests or some helper scripts from the libraries, you either have to sort out how to activate the venv in the analysis repo (in which case we are back to the global case, just in a very obfuscated way) or install more in-tree venvs and remember to keep their versions synced with the other one(s).

Maybe this could be solved by grouping analysis, lib1, and lib2 under a common root and putting the venv above all of the git repos, but that breaks if you add analysis2, which uses lib1 and lib3. Maybe that could be solved with enough symlinks, or by accepting two checkouts of lib1. In any case it seems backwards to me to organize your files / go through contortions to conform to a tool when just using a shared env (stored in a “neutral” location) avoids all of the complexity.


The second case is when you have some application / analysis software that has a (maybe complex) set of dependencies, maybe some source installs, maybe some normal installs from local checkouts of an un-published, un-tagged repo. However you made it, you know that the current venv works. If there is a single privileged in-tree venv, any attempt to update is inherently destructive. If your envs are always external, then you can make a new one, try it, and if it does not work (and you do not want to spend the time to debug it now) switch back to the old one simply by switching which one you activate. Having access to both simultaneously can also be extremely useful for spot-checking that behavior is consistent.

At my job we actually do this with conda environments (I disagree with @brettcannon that conda envs and virtual envs should be thought of as different (despite the activate-scripts feature), as they are both self-contained user-space environments). We support ~25 beamlines, which can (very roughly) be thought of as unique applications built on the same base libraries. A few times a quarter we build and test an environment and then make it available (read-only) on all of the machines. To upgrade to the new libraries the beamline staff simply activate the newer environment. If things go wrong (typically at 3am) they can trivially go back to what worked before, and we can sort out what is wrong under less time pressure[1]. I do not think we could reliably run our facility using in-tree environments.


I completely agree and understand that there are some use cases where in-tree makes lots of sense (and see why for the two projects Brett is working on in-tree is the best case!). However my point is that there are valid use cases that are either much simpler to conceptualize as an external env or greatly benefit from multiple envs being available simultaneously.

If the PyPA had the same relationship to core CPython that the conda ecosystem (or any of the Linux package managers, or brew, or …) has, then alienating half the community would probably be OK.


  1. I am eliding a bunch of site-specific details about “cycles” vs quarters, our use of conda-pack, and a bunch of helper scripts, but this is the gist of the story. ↩︎

1 Like

I have a lot of sympathy for this use case, as it is similar to one I have encountered myself. For me, the biggest problem with shared venvs is discoverability - coming back to a project after a long time, how do I find out what the appropriate venv is? And as a related point, when I archive or delete a project, how do I ensure that any related venvs get deleted, so I don’t end up with obsolete environments hanging around? And worst case (which happens to me a lot!), when I do inevitably delete a project and forget to delete the environment, how do I identify “orphaned” environments?

The .venv file approach solves the first two of these (although it’s too easy to run del -rec without checking for a .venv file first, increasing the risk of orphans), but as far as I can see the orphan-environment issue remains unsolved.
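For what it’s worth, if tools recorded the owning workspace inside each centrally stored environment (the `workspace-path` metadata file below is hypothetical, not something any tool writes today), an orphan sweep becomes a small script:

```python
import pathlib


def orphaned_envs(central_root: pathlib.Path) -> list[pathlib.Path]:
    """List environments whose recorded workspace no longer exists."""
    orphans = []
    for env in sorted(central_root.iterdir()):
        record = env / "workspace-path"  # hypothetical metadata file
        if record.is_file():
            workspace = pathlib.Path(record.read_text(encoding="utf-8").strip())
            if not workspace.exists():
                orphans.append(env)
    return orphans
```

This is conservative by design: an environment with no record, or whose workspace still exists, is never flagged, so a renamed-but-live project is at worst missed rather than deleted.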

1 Like

This discussion is now unfortunately happening in two places, but @jack1142 brought up a good point in the other thread, which is that you could have .venv symlink to an environment in a (user-)global location.

If we always did that, this would solve both @tacaswell’s point (because the env is shared), and the discoverability point of @pf_moore.

The wrench in the works is of course that symlinks are not enabled on windows by default (but can be enabled if you have admin access). For such cases, we could presumably replace .venv with a short log/script that says: “actually, your venv is here”.

To make it fully bidirectional (know which environments are used where, resp. which are obsolete), I guess we’d have to keep some metadata in that user-global env folder, but this is unlikely to ever be perfect (e.g. if a user moves/renames a project folder and never reactivates the project, how should any tool know that the environment is still being used, short of a full disk scan? Taking a conservative stance on this is equivalent to not deleting dangling environments. The only reasonable approach IMO is to give users some utilities to inspect which environments are still around, appear unused, etc., and let them manually trigger deletion).

The ~/.python should be in either .local or .cache in the modern Unix XDG world.

1 Like

Another use of non-local environments is for contexts such as Docker, where one really wants to isolate the system dependencies from the application, yet the application environment is not confined to a particular part of the filesystem. I know that some people are happy to use the -u feature for this, but there are legitimate reasons to use a virtualenv in Docker containers, such as multi-stage builds.

It seems to me that it’s a fairly uncontroversial statement to point out that both points can be true:

  1. “.venv” is highly convenient[1]
  2. “.venv” does not cover the span of legitimate uses

Something that I am struggling with is “what problem are we trying to solve?”. I think the answer is twofold:

  1. pip should probably default to using isolated environments for installs.
  2. We need to lower the friction to getting started with a Python environment.

So I think we’d be setting out a standard here so that both pip and python would discover and use the appropriate environment. Is this a reasonable conclusion, @brettcannon?

If so, there are two parts to this:

  1. Environment discovery
  2. Environment creation

@dstufft makes a good point that we want to avoid lock-in to a single environment. I really like @brettcannon’s discussions on both .venv and _py_launcher_. I wonder whether they would have benefited from greater visibility, e.g. here on Discourse?[2] The idea of letting this be implemented by binaries on $PATH that facilitate environment discovery seems highly promising. We’ve mentioned alternative platforms as one motivational use case, and I am not confident that there wouldn’t be other beneficiaries of this flexibility.

For example, using just a discoverer mechanism, the bundled discoverer might look like:

import os
import pathlib

if __name__ == "__main__":
    # Report an explicitly activated environment first.
    if "VIRTUAL_ENV" in os.environ:
        interpreter_path = (pathlib.Path(os.environ["VIRTUAL_ENV"]) / "bin" / "python").resolve()
        print(interpreter_path)

    # Also report a `.venv` in the current working directory, if present.
    local_interpreter_path = pathlib.Path.cwd() / ".venv" / "bin" / "python"
    if local_interpreter_path.exists():
        print(local_interpreter_path.resolve())

This is probably too primitive; I’d perhaps want to include names in this interface as well.

Conda could then bundle their own discoverer. @dstufft’s example of multiple-platform environments would then either be a new discoverer that users need to install, or a modification to the example I give above. I am not worried about the details here, as you might imagine. Now, pycharm et al. can use this mechanism to identify the environments available for a particular project.[3] Crucially, unlike a physical in-tree mapping, this system allows anyone controlling $PATH to add new environment discoverers, which feels like a much more flexible, extensible approach.
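To illustrate the $PATH idea (the `py-discover-` executable-name prefix here is invented for this sketch, not anything proposed), a front-end could aggregate every discoverer it finds, where each discoverer prints one interpreter path per line:

```python
import os
import pathlib
import subprocess


def discovered_interpreters(prefix: str = "py-discover-") -> list[str]:
    """Run every `prefix`-named executable on $PATH and collect printed paths."""
    interpreters = []
    for entry in os.environ.get("PATH", "").split(os.pathsep):
        directory = pathlib.Path(entry)
        if not directory.is_dir():
            continue
        for exe in sorted(directory.iterdir()):
            if exe.name.startswith(prefix) and exe.is_file() and os.access(exe, os.X_OK):
                result = subprocess.run([str(exe)], capture_output=True, text=True)
                interpreters.extend(line for line in result.stdout.splitlines() if line)
    return interpreters
```

Installing a new discoverer is then just dropping an executable on $PATH, which is what makes the approach extensible without coordination.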

Environment creation is more tricky. This might be where python has a default that, in the absence of any discovered environments, it creates a .venv.

I think this should work with tools like Hatch that support their own environment management. I could see Hatch defining a discoverer that exposes all of its environments, with priority given to the first (default) env.

Let me finish on this note: I’ve not been hugely involved in these conversations, and I might be repeating old lines of discussion or missing some obvious points. If so, let me know!


  1. I am a Physicist, and I use .venv in all of my projects (except a Docker environment that runs the “core” part of my analysis package, as @tacaswell describes) ↩︎

  2. I suppose that they were narrower scoped conversations at the time of writing, but have since broadened given the substantial overlap of these new discussions. ↩︎

  3. Of course, this means that these discoverers need to be on the system $PATH, which might be a pain if one wanted to use a discoverer from PyPI … but that’s already a chicken-and-egg problem anyway. ↩︎

2 Likes

From what I’ve heard, I’m feeling really left out by the fact that @brettcannon’s launcher is for Unix and the Windows launcher shipped with Python is missing all these neat features :slightly_smiling_face:

When I need to get something done quickly, I always start by creating a virtual environment at .venv. I even wrote a small tool to do just that. So it would be the perfect default location for me. Clear winner, no debate.

Anytime I do actual regular work it is in tox-managed virtual environments. So I follow tox naming rules, no surprise. And I guess that if I were to use a “dev workflow tool” such as Poetry, Hatch, or PDM, then I would not need to know where the virtual environments are because I would always use their run or shell sub-commands.

If I need to do anything that is a bit more involved (maybe working on 2 libraries at the same time to debug something a bit tricky, or deploying something on production machine, or anything else a bit out of the ordinary) where a single .venv does not cut it then I will create virtual environments by hand with names and locations that I will pick on the spot, depending on the actual task, which can be anything. I do not think there is any rule or logic that can be predicted here, and in my opinion trying to make up rules here seems like it would be a waste of time and energy.

If there was something like a .venv file (or any kind of pointer to the actual environment), then this pointer would need to be kept up-to-date, which means we would most likely have a tool to manage its content and this tool should probably offer run and shell sub-commands.

2 Likes

I imagine many of us do something very similar. For posterity, I end up doing something like

cd $(mktemp -d)
echo "layout python3" >> .envrc && direnv allow .

probably 2+ times per day. So I can see a need for this to be immediate, e.g. a venv being used by default when no environments could be discovered.

I think the default structure of what Hatch does is ideal:

from base64 import urlsafe_b64encode
from hashlib import sha256

hashed_root = sha256(str(project_root).encode('utf-8')).digest()
checksum = urlsafe_b64encode(hashed_root).decode('utf-8')[:8]
virtual_env_path = data_directory / normalized_project_name / checksum / venv_name

The data directory in this standardized approach would be platformdirs.user_data_dir('.python', appauthor=False) / 'env'.

Note that it is necessary to incorporate the path to the project because the same name might be used elsewhere, perhaps for testing. IDEs like VS Code necessarily have that information, so they would be able to resolve the path to the virtual environment.

Is Hatch able to detect (and possibly garbage collect) orphaned virtual environments?

I wonder if there are cases where I would want to run a 3rd party tool outside of hatch run or hatch shell (so that this tool needs to know hatch’s naming logic for virtual environments). And if I understood correctly the venv_name part is a user-defined variable (that can not be inferred by a 3rd party tool), right?

Orphaned as in the project directory no longer exists? Not yet.

That is a good point but not exclusive to Hatch as the name of the environment would need to be known in the case of all tools. I think the solution is still Brett’s Python launcher idea where there’s some communication mechanism that each tool exposes.

This thread is just about what the path should be so I thought I would chime in since I think I came up with the most appropriate way to isolate them :slightly_smiling_face:

1 Like

Ok, that was just out of curiosity. Seems like everything is in place to make it possible anyway. Maybe it is an idea for a plugin. :slight_smile:

I understood it as the point is that the path to the environment should be inferred without external input. So in the case of hatch, it’s all good up until the venv_name. I think Poetry has (or had, last time I looked into it years ago) only 1 environment per project (and per Python interpreter) so it can be inferred (there is also some kind of a hash of the project’s path). I do not know how PDM does it, except in the __pypackages__ case.

One environment per project will not work for standardization.

2 Likes

Right, I re-read the proposal and the discussion. I understand better now.

And enough people feel that way that I don’t think we can ignore that use case overall. This is actually why I started this conversation about whether we can come up with some guidelines for tool creators and integrators to follow beyond just .venv, so we can support the multiple-environment scheme (although it honestly seems to mostly be in some global directory instead of being local).

This is why I’m proposing the .venv file idea as that works around the symlink issue.

How important is this to people? I assume this is mostly for automated cleanup of orphaned environments? We could suggest tools record the workspace the environments are meant for in some text file or something.

So the plan was to always bring that discussion here, but I have been waiting on some critical feedback from …

conda. :slightly_smiling_face: Review the proposal to develop a JSON schema and approach to facilitating environment/interpreter discovery · Issue #11283 · conda/conda · GitHub (I have gotten some tacit confirmation that conda likes the idea).

:joy: I am hoping to use the code in the Python Launcher to handle environment discovery in VS Code, which would mean some form of Windows support. So I assume it will happen eventually.

Key questions

To try and refocus this conversation, my questions for everyone are:

  1. What do you think of the .venv file idea as a cheap, simple way to tie a workspace to a virtual environment stored elsewhere?
  2. Is there a directory where people would install virtual environments that we can recommend to tools to use?
  3. Is there some naming/structure scheme within that global directory that we can recommend to tools for having multiple environments for an associated workspace (like what @ofek suggested in Setting up some guidelines around discovering/finding/naming virtual environments - #13 by ofek)?

I get it if the answer to the above questions is “don’t need it”, but I will say this is not a theoretical issue; we have constant problems trying to find people’s environments properly in VS Code, and right now it’s a jumble of custom code per environment management tool we choose to support (which is a similar problem for the Python Launcher). My planned solution is Support a way for other tools to assist in environment/interpreter discovery · Discussion #168 · brettcannon/python-launcher · GitHub (which I will discuss properly here when I’m ready to start implementing it), but I’m not sure whether that’s a bit too heavy-handed for common cases (although it will totally meet my needs and everything I’m asking about). But if all we can agree on is what’s in PEP 704 for the situation where one only needs a single virtual environment and chooses to store it locally, then so be it.

3 Likes

  1. I think the .venv file idea is great. I’ve been using it for a couple years and it’s been really nice and flexible.

I’ve shared this setup with n=2 novice Python users and they’ve found it really easy to use, in combination with a shell plugin that automatically a) activates and deactivates venvs based on .venv in cwd, b) suggests creating venvs when entering directories with pyproject.toml or setup.py (and without a venv active).

Orphaned virtualenvs haven’t been a concern for me; it’s been easy to clean them up if I feel the need to. I think an individual tool could easily keep track of which virtualenvs it’s created and detect orphaned ones.

  2. platformdirs.PlatformDirs("virtualenvs").user_data_dir seems like a solid choice (although I currently just use ~/.virtualenvs)

  3. ofek’s suggestion is good, although I’d hash sys.base_prefix or maybe sys.implementation.cache_tag or something in there as well. And maybe use hexdigest instead of base64 for simplicity. Could also be worth hashing in the tool name.