How should virtualenv behave with respect to seed packages (e.g. pip) by default?

@dstufft has raised an issue on the virtualenv tracker about virtualenv 20’s changed behaviour with respect to which version of the seed packages gets installed. The historical rundown is:

  • virtualenv < 14 always installed the embedded pip (the pip that was packaged with virtualenv),
  • 14 <= virtualenv <= 16 always downloaded the latest version from PyPI,
  • virtualenv >= 20 tries to strike a middle ground: install the latest version available offline (by default this is the embedded one; however, if someone pulls in a newer version, e.g. via a manual run with the --download flag, subsequent creations use that newer version).

As I understand it, @dstufft champions download-by-default so that users always get the latest pip, which means:

  • pip will not display the annoying “there’s a newer version available” message in a newly created virtual environment,
  • easier roll-out of pip changes, as everyone gets them immediately (faster iteration on packaging ecosystem development).

Historically setuptools/wheel also benefited from these updates; however, with PEP 518 they’re no longer anywhere near as relevant (the build frontend now handles their provisioning, so what the virtual environment starts out with matters less).

Anthony and I facilitated the behaviour change in virtualenv 20, driven by our industry experience that most engineers’ priorities go in the order of reproducibility, then speed, then ease of use. Always downloading the latest version from PyPI helps with ease of use (as a side effect we can improve the ecosystem by rolling out changes quicker), but it directly hurts both speed and reproducibility.

Anthony detailed war stories of how Yelp patched virtualenv to revert to the no-download behaviour due to the amount of breakage it caused in their CIs. I can second this myself with experience from Bloomberg: a recent pip release broke two of our CIs (by switching to in-place builds, a change since agreed to be reverted). Furthermore, a big plus of virtualenv 20 is how quickly it builds a virtual environment; switching to always-download would hurt this aspect too. Note that virtualenv does a new release shortly after pip/setuptools/wheel do, so if someone upgrades virtualenv they will get newer versions of those packages too.

I proposed a middle-ground solution that would go along the lines of:

  • always install the embedded version,
  • check for an upgrade of the embedded wheel once every two weeks, and upgrade only if there have been no releases in the last 10 days (the 10-day period is there to ensure the release is stable and does not contain major breaking changes).
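For concreteness, the proposed policy could be sketched roughly as below. This is pure illustration: `should_upgrade`, `CHECK_PERIOD`, `GRACE_PERIOD` and the `releases` mapping are made-up names, not virtualenv’s actual internals.

```python
from datetime import datetime, timedelta

CHECK_PERIOD = timedelta(days=14)  # only look for upgrades every two weeks
GRACE_PERIOD = timedelta(days=10)  # the newest release must be this old

def should_upgrade(last_check, releases, now):
    """Return the version to upgrade to, or None to keep the current wheel.

    releases maps version string -> release datetime, as reported by the index.
    """
    if now - last_check < CHECK_PERIOD:
        return None  # checked recently enough, do nothing
    newest = max(releases, key=releases.get)  # latest by release date
    if now - releases[newest] < GRACE_PERIOD:
        return None  # a release landed within 10 days; wait for it to settle
    return newest
```

The key property is that a freshly published release postpones any upgrade, so virtualenv never hands out a wheel that hasn’t survived at least 10 days in the wild.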

I’m reaching out to the community to ask what you think. Which of the solutions do you prefer? Please vote in the poll and explain in the responses below. Thanks!

  • Speed and reproducibility all the way - always install the embedded wheel
  • Force the latest release always - let’s roll out changes as quickly as possible even if that breaks things
  • The middle ground - always install embedded but check for upgrades every two weeks (and upgrade if no release in 10 days)


I voted, but I’ll add a comment: ease of use by default, with a flag for folks who are worried about reproducibility. (Download will hit the cache, so I’m not convinced that speed is really an issue.)


It is when you’re on a slow network or VPN. At the very least you’re paying for ping time, so it adds at least 50ms; and when the entire operation takes 200ms, 50ms is a lot. Then again, I’m more worried about reproducibility than speed here. Download will only hit the cache if the remote index supports and allows it, and a lot of company indexes don’t.

Yeah, absolutely, so I’d be 100% for an option. But for the default I’d argue for ease of use.

(Clearly that’s opinion so there’s room for rational disagreement by weighting factors differently. :slightly_smiling_face:)

Agreed, this is basically looking for opinions. But on a factual note, some people (including me!) do find that speed is an issue, even given the existence of caches. Also, in one of the environments I work in, downloads fail in some circumstances due to proxy silliness, so the cycle would go “virtualenv x; wait a couple of minutes to discover that I get a proxy error; fix proxy settings; virtualenv x”.

The vote is for people’s opinions, but factually, for some people, download behaviour is significantly slower.

FWIW, pipx’s approach to pip management is similar to the middle ground option. It downloads a shared pip on installation, and the cache is refreshed (with network access) every once in a while.

Also, I just changed my vote from “the middle ground” to “speed and reproducibility” because I don’t want to “split” the vote here. I’d personally far prefer installing from the embedded wheel, but I’m happy with the middle ground to cater for both camps. If the “download” camp starts to look more popular, though, I want my vote counting towards “embedded”.

I’m not sure the vote as structured reflects this, hence posting my reasons here.

People often use the virtualenv version that comes with Linux distros. In that scenario, I find it good that they get modern pip/setuptools/wheel by default when creating new virtualenvs, because the distro-bundled versions are quickly outdated. Exposing new pip/setuptools/wheel versions close to their release date helps move the ecosystem forward. Pinning is always an option away for those who need it.


The problem, @sbidoul, is that all CI environments/docker images use the default. Auto-updating by default often leads to broken CIs en masse, and these are the angry mob showing up at issue trackers asking “hey, why did you break me again?” - and as a result the packaging ecosystem gets a bad reputation. I know tens of people in my company who feel like this :pensive: and Anthony had the same experience at Yelp/Lyft.

I’d argue that the installer should be in charge of when to upgrade and how often. E.g. in the case of pip this is the user; in the case of Linux distributions, the distribution (e.g. they can patch the default to download-on, given they already heavily patch it :man_shrugging:).


And in the case of pip, we have this argument every time we get problems with a pip release - we argue that people should take more care and control when they release a new version of pip to production “just like any other software”. And yet auto updating by default gives precisely the behaviour that we argue against in that context.


At least now we can predict when the angry mob will show up :slight_smile:

Does the majority of CI flows pin a specific version of virtualenv? Or do they just rely on their base image to provide it? If it’s the latter and virtualenv installs bundled pip/setuptools/wheel then it’s upgrading the base image that triggers the angry mob.

In the same way that pip install something gives you the latest and greatest something, which you need to pin for production, I find it appealing that virtualenv gives you the latest and greatest packaging ecosystem. Maybe some education is needed to make it clearer that this needs pinning just like the rest? Perhaps virtualenv could print a welcome banner saying “To create the same virtualenv in production/CI, use ‘virtualenv --pip X.Y --setuptools A.B’”?
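For what it’s worth, virtualenv 20 already exposes seeder flags that support this kind of pinning; an invocation might look something like the following (the version numbers here are placeholders, not recommendations):

```shell
# Pin the seed packages explicitly to reproduce an environment:
virtualenv --pip 20.0.31 --setuptools 49.6.0 .venv

# The same options can also be supplied via environment variables:
VIRTUALENV_PIP=20.0.31 VIRTUALENV_SETUPTOOLS=49.6.0 virtualenv .venv
```

Either form would make the “pin for production” advice actionable without any change to the defaults.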


Why not minimize the chance of having an angry mob? People who want the latest can turn it on, effectively opting in to being the first to test new releases. These people might detect breaking changes early-ish, and then everyone else can upgrade at their own pace (or a few weeks later in the case of the middle-ground option, which strikes a balance between the convenience of latest/greatest and stability - most engineers IMHO care a lot more about stability than about having the latest, most of the time).

Yes. Again absolutely.

My take is that the default should be to optimise for the beginner use case. I think automatically using the latest version does that. And that we can allow for folks with other needs who know what they’re doing.

I understand your point too though.

Thanks.


Since we’re looking for opinions, here’s mine!

As one of the main people who’ve dealt with the “angry mob” (I really don’t like this characterization, but that’s not worth debating right now), I want us to have “get latest w/ caching” as the default and continue providing the “use an embedded/seeded pip” logic as an opt-in, ideally via a flag or a configuration file.

This is similar to how pip install flask isn’t meant to be reproducible, but for users who care about that, there’s a separate workflow that makes things work. I do think “defaults that work for a broad audience, with escape hatches for users with specific needs/wants” is a good approach to take here (and in general). Most notably, the defaults shouldn’t lock us out of specific changes.

I think this is a nice middle ground, where users who care about reproducible + fast setup can pass that flag, and beginner + Linux distro users get the newest when they make the environment.

Again, I do like the fast, snappy nature of things, but locking beginners and Linux distros [who will patch both pip and virtualenv and break things subtly (: ] to specific versions of pip isn’t great IMO.

I’d much rather push newer pip versions to most users by default, instead of waiting for them to get a newer virtualenv (either via pip install -U virtualenv in the “base environment” or via their distro’s updates).


Most dev things do need an internet connection. virtualenv is the one thing where I can be sure it doesn’t depend on the internet being available. Network availability can’t be assured at all times even in well-developed areas, let alone less developed ones.

You can always update your system pip if you want to. It doesn’t need to be Apple-ish or Microsoft-ish.


I would note that catering to what reduces the “angry mob” is not always the correct outcome. If you never ever release another version then you’re never going to have an angry mob after you, because you’ve never changed anything for them.

You’re also only going to see the people unhappy with the pre 20.0 behavior coming to you, because nobody is going to go “oh this feature didn’t break me, let me go tell people about it”. People generally only make noise when something is hurting them, not when something is causing them no problems.

This sort of thing is similar in a way to a security feature. When it’s doing what you expect it just sits in the background not really noticeable so you never really think about it, when it’s not doing what you expect, that’s when you actually take notice of it.

My own personal experience both in industry and as part of the wider packaging ecosystem suggests that if we leave virtualenv 20 as it is, we’re going to see a large number of people using a 5+ year old version of pip, not because they choose to do that, but because that’s what the tooling did by default without them knowing about it. The vast majority of developers will not in any way know what they need to do to upgrade to the latest pip, and they’re not going to proactively go out and look for the latest version because it won’t even occur to them to do so.

Some people might ask what the harm is in people using an old version of pip. This basically comes down to network effects. In the general case, people packaging software for Python have no control over what version of pip is used to install their software, so they have to take a conservative approach and only target features that the pip in wide circulation supports. The same used to be true for setuptools, but thanks to PEP 518 we’ve moved to a situation where projects can depend on newer (or older) versions and know that the features they need will be there. However, that’s not the case for pip: there is no such control, so they simply can’t do anything about it except not rely on anything newer.

As an example, python-requires handling was a major feature that a lot of projects relied on to ease the transition from 2.x to 3.x; without it, people shipping packages that drop support for an older version of Python have to choose between breaking users and never dropping support. If virtualenv had not been installing the latest version of pip for all of those users, a far larger share of users would not have been on a version of pip that supported it.

A more recent example: we’re talking about compression algorithms for wheels. Even if we implemented that right now, with virtualenv 20 choosing not to install the latest version, we’re likely 5-8 years away from being able to recommend people publish wheels using the newer compression format (which, if we get it done, will hopefully do a lot to speed up downloads as well).

On top of all of that, using the virtualenv version as a proxy for pinning the version of pip you want installed is a poor substitute for actually pinning. Practically none of the tooling that allows you to pin software has a mechanism to pin the version of virtualenv (tox is the only one I could find or know of), which means it’s a pin whose version you have to invent some other mechanism to control. It’s a pin that will get a different version of something installed on different OSs or even different machines (or the same machine with different configurations). You’ll get a different version when you develop locally than when you use GitHub Actions, than when you use Azure Pipelines, than when you use Travis.

It’s basically the worst possible way to “pin” software, because it doesn’t actually do anything to control the version installed. It just says “eh, whatever you have is fine, but don’t ask PyPI for it, that would break me!”. If somebody wants to ensure reproducibility, they need to actually pin the pip version; relying on implicit details like “the version of pip that happened to be bundled with whatever version of virtualenv is on this machine” is complete nonsense.


I would not consider this a complete success story myself, as I found out with virtualenv itself. For python-requires to work you actually need both a new enough pip (still not true of LTS systems’ package managers - think CentOS OS packages, i.e. stuff not provided by virtualenv) and an index server that can provide this metadata (notably Ansible index servers don’t). So we still got quite a few users broken by relying on python-requires, and we had to tell them to upgrade their OS pip or use an index server that forwards this content. It might have helped people using virtualenv and the PyPI index server, though. So IMHO this might decrease breakage, but it does not remove it.

IMHO installers will need to handle both the old and the new compression for a long time, similar to how distutils is still supported today even though we haven’t recommended it for years. We’re looking at a few years anyway because, before we can recommend this, we don’t just need to implement it on PyPI; vendors also need to do so for mirror servers (Artifactory, devpi, bandersnatch, the Ansible mirror and so on). And even after that, I’d expect old pip to still work (for LTS systems, where virtualenv doesn’t control the pip version). So no matter what, we’re looking at a long transition phase. The only thing you might be able to control with virtualenv always downloading is the rate of adoption, but IMHO not the length of the transition.

IMHO the question is less about what version you’re getting and more about reproducibility within a system: if it worked today, let it work tomorrow too, given the same system. Though agreed, tox allowing pinning of both the seeded pip version and virtualenv is a more useful approach in general.
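To illustrate the tox route (the version pins below are placeholders, and the option names reflect my reading of tox’s configuration, so treat this as a sketch rather than a canonical recommendation):

```ini
[tox]
requires =
    virtualenv == 20.0.31   # pin which virtualenv tox provisions itself with

[testenv]
download = false            # seed from the wheels embedded in that virtualenv
```

Because each virtualenv release embeds one specific pip/setuptools/wheel, the single `requires` pin indirectly fixes the seed versions as well.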

Looking at the polling numbers, it seems to me we can agree on one thing: we don’t want to always download. The question turns into: should we upgrade periodically, or never? The numbers between those are close enough that maybe we do want a periodic upgrade, but perhaps with a bigger time window than initially suggested (to be on the safer side). Perhaps a release needs to be 21 days old to trigger an upgrade (instead of 10 - this would accommodate most bug-fix releases I can see looking at pip’s release dates).

Right. I don’t personally care about virtualenv getting the latest version immediately upon release, just that folks are getting it within a timely manner. 10 days versus 21 days or whatever is not a super meaningful difference to me.

The one worry I have is that a blanket 21-day rule might leave users in a worse position than always downloading. Roughly, the scenario I’m worried about is: say pip X.Y.0 is released with some show-stopping bug, but it’s edge-case enough that it doesn’t really affect anyone until the 21-day mark hits; pip then fixes it and releases, and those people stay broken because the bugfix hasn’t been out for 21 days yet. This case gets worse if the implementation is such that virtualenvs in the wild continue to start pulling in X.Y.0 as a brand-new version while waiting for X.Y.1 to “age”.

So my recommendation would be to tweak the middle ground a bit:

  • Check for new versions once a day (this can happen in a background thread so it doesn’t block other work, since it is only checking for new versions), and record when a new release series (X.Y.0) is first seen.
  • Once a release series has been seen, start the 21-day countdown for that series.
  • Download a newer version IF it’s a bugfix to our current series OR it’s a new series whose 21-day timer has elapsed.

You can tweak this slightly to start the countdown for a “new” series either at X.Y.0 or at whatever the latest is in X.Y.*; the key idea is that bugfixes to a version that has already been downloaded don’t get gated behind a 21-day timer. This should work out fine with pip, because we generally only issue bugfix releases for major show-stopping bugs and let general fixes roll into the next release.
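A minimal sketch of that tweaked policy (purely illustrative: `pick_version` and `first_seen` are invented names, and a real implementation would compare versions with proper version parsing rather than first-seen order):

```python
from datetime import datetime, timedelta

AGE_GATE = timedelta(days=21)  # a brand-new release series must age this long

def series(version):
    """The X.Y release series of a version string, e.g. '20.2.1' -> '20.2'."""
    return ".".join(version.split(".")[:2])

def pick_version(current, first_seen, now):
    """first_seen maps version -> datetime the daily check first saw it."""
    best = current
    for version, seen in sorted(first_seen.items(), key=lambda kv: kv[1]):
        if series(version) == series(best):
            best = version  # bugfix to the series we already run: take it now
        elif now - seen >= AGE_GATE:
            best = version  # new series that has survived the 21-day gate
    return best
```

Note how a bugfix in the already-adopted series skips the gate entirely, which is exactly the behaviour that avoids the “stay broken waiting for X.Y.1 to age” scenario above.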

We can’t realistically check age like this because there’s no guarantee virtualenv will run every day. Not sure if there’s any other way to check the age of projects here :thinking:

You also can’t rely on any state not being reset daily.

CI systems aren’t going to persist any state at all, so after however many days you decide, they’ll all spend extra time updating on every single use.

In contrast, if it stays pinned within virtualenv but we advise CI maintainers to frequently update virtualenv in their images (and include some recommended validation steps for them), we can help ensure all their users get a fast, working version of both tools. Currently, many users will force a pip update as part of their CI run, which adds unnecessary time and risk.