Packaging and Python 2

As an FYI, today I posted a simple PR on pip’s tracker (#6273) using an approach suggested by @njs. The PR adds to pip’s User Agent string whether it looks like pip is running under CI. This addresses pip’s #5499 (“Differentiating organic vs automated installations”). Issue #5499 wasn’t in @ncoghlan’s list of three items in his post.
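As a rough illustration of the idea (not the code in the PR), detection like this can be sketched as a check for environment variables that CI services commonly set. The names `looks_like_ci`, `user_agent_data`, and `CI_VARIABLES`, and the variable list itself, are assumptions made for this sketch; the PR is the authority on what pip actually checks.

```python
import os

# Assumed, illustrative list: "CI" is set by several hosted CI services,
# "BUILD_ID" by Jenkins, "BUILD_BUILDID" by Azure Pipelines.
# This is not claimed to be the exact list pip uses.
CI_VARIABLES = ("CI", "BUILD_ID", "BUILD_BUILDID")


def looks_like_ci(environ=None):
    """Return True if any of the assumed CI variables is present."""
    env = os.environ if environ is None else environ
    return any(name in env for name in CI_VARIABLES)


def user_agent_data(environ=None):
    """Build a small dict that could be serialized into a User-Agent string.

    None (rather than False) is used when no CI variable is found, because
    the absence of these variables does not prove the install is "organic".
    """
    return {"ci": True if looks_like_ci(environ) else None}
```

A heuristic like this can only say "probably CI"; it cannot recognize provisioning tools or private build systems that set none of these variables.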

I expect most downloads come from provisioning tools (e.g. cloud-init, Ansible), not only from CI.
In particular, awscli and its dependencies are downloaded about 1M times per day, roughly 3x as often as pytest.
So I don’t think excluding only CI downloads would make the download stats usable.
(I was sad when I heard Amazon Linux 2 is based on Python 2.7.)

Ranking based only on downloads from macOS and Windows is worth considering, although that is still far from real Python usage.

To bring it into actionable tasks for pip, here’s what I think we have:

  • We should update the message we’re printing on Python 2.7 currently, to add a link to pip’s documentation, in the next release. That can include a proper description of our final decision here.
  • We will drop support for Python 2.7 from pip, when usage falls below a threshold we are comfortable dropping support at. (we can discuss that separately)

Regarding the “big” question of who will maintain Python 2 support, as far as I can tell, there have been 2 proposed approaches:

  • PyPA members do the maintenance work: the Python 2 CI is kept green, and maintainers fall back to “external support” for resolving any future Python 2-only issues.
  • Python 2 support on mainline is maintained not by PyPA members, but by a vendor’s team who will make sure that pip’s master is kept in a working state for 2.7, and take responsibility/credit for the same.

Something interesting that was brought up is that urllib3 & requests are more or less holding onto Python 2 support until pip doesn’t need it any more. Should we just hang in waiting until pip reaches the usage threshold to end support? Can/should we drop support ahead of pip (since pip vendors us anyway)?

I think in general pip’s POV has been that we hope our dependencies will keep around support for the things we support, but that it’s not mandatory. If need be, pip will either cope somehow (patching, bundling two versions, sticking with the last known version, something) or it will force the issue.

Okay then. Here goes.

This question is still unresolved. I don’t feel any maintainer here prefers the answer to this being “the volunteers currently maintaining these packages”. (Do correct me if I’m wrong!)

What are our alternatives and how do we move forward on this?

I’m not sure if “prefers” is the word I would use, but I do think it’s the only workable solution. Anything else I could come up with either doesn’t actually solve the problem for why we would want to keep supporting 2.7, or it’s so cumbersome that it’s really just dropping support for 2.7 without coming out and saying it.

I’m also of the opinion that keeping 2.7 support isn’t much additional effort and, last I looked, 2.7 still represents the majority of our users. That might have changed, though; I’m not at my computer to pull up the numbers.

last I looked still represents the majority of our users.

According to pypistats.org this might soon change (we are almost at 50/50%) :slightly_smiling_face:

Always the same problem: do those numbers represent actual humans or CI systems?

One interesting data point: those numbers also put Linux usage at 90%. While Linux is probably more popular among developers than among the general population, that number still seems high. However, Linux is by far the most accessible platform on public CI services…

That much is certainly true. It’s also true that Linux is the most typical deployment platform, so deploys are likely influencing those numbers a fair amount too. However, I think it can be a mistake to start discounting numbers because we think they might come from a class of use. We can only guess at whether they do or not, and even if they do, for our purposes people using us in CI are still people using us.

I generally agree with this, but I think what we’re trying to figure out by looking at statistics is “how many people will this impact?” If a large fraction of the download stats are coming from a relatively small number of real people, or some automated configuration that can be changed en masse, it’s worth taking that into account.

That said, I can’t say I know what those numbers represent. If it’s mostly CI systems of projects that test against Python 2 because it’s just another line in their matrix, and the number of Python 2 end users is small, we might see a dramatic decline in that number as projects start dropping Python 2. Or we may see the numbers hold steady-ish up until the moment we drop Python 2 support, and then a dramatic decline. Or some completely different thing :man_shrugging:

Here’s the general trend since 2016. Like pypistats.org, also from BigQuery:

(https://dev.to/hugovk/python-version-share-over-time-4-26hm)

I don’t think much has changed since my post above. Specifically:

I don’t think anyone’s really arguing any more that we suddenly on 1st January remove 2.7 from our CI, etc. We may do so at some point in the future, but I’ve no idea when anyone will have the time or energy to bother doing so. So (1) is something of a moot point.

That leaves (2).

But to be honest, I’m not at all clear what the question even is here. What exactly is “Python 2.7 support”? And no, that’s not a joke - in the context of a completely volunteer driven organisation, what are we promising? There’s no guarantee that any bug report gets looked at, whether it’s about Python 2.7 or not.

At an individual level, I don’t work on any tickets that are macOS-specific, and I only work on the more superficial Linux ones. Not because I “don’t provide support”, just because I don’t have knowledge of, or access to, those systems. From 2020 onwards, I’ll almost certainly just treat Python 2.7 the same. If every one of the pip developers does that, will that mean we don’t support Python 2.7 any more? If some of us are still willing to work on Python 2.7, but they are too busy with other work to do so, does that mean we no longer support Python 2.7?

That’s honestly the only answer I can give.

Very interesting, but it covers 2016 up to March, so it is now over 2 months old.

I find the other stats in the same blog post, for dev tools and the scientific packages, to be very telling. Those show a much better picture of current use, I’d say, and 2.7 is losing ground very fast there!

Pytest is the most dramatic:

I can’t prove it, but personally my guess is that our Python 2 download metrics these days are heavily influenced by large companies installing on large clusters.

Mostly this is because every time I look at the stats, the Python 2 downloads are super spiky. For example, I just checked numpy’s recent downloads:

https://pypistats.org/packages/numpy

And sure enough, last Monday (06-03), it had 598k Python 2 downloads. Compared to recent history, this is highly anomalous: the previous few Mondays had 258k, 236k, 305k. Also, if you look at the per-OS stats, the spike is clearly all or almost all on Linux.

Now, the law of large numbers tells us that if you have lots of independent random events – like say, a few hundred thousand different people with no connection to each other, each independently deciding whether to download numpy – then the noise tends to average out pretty quickly, and you shouldn’t see giant spikes like this. It would be extremely weird for 300k people to all independently say “hey, Monday June 3rd, I like the sound of that, that’s a great day for python 2 linux users like me to upgrade numpy”. But if it’s like, one person rolling out a huge cluster, then it makes more sense.
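To put a rough number on that intuition, here is a back-of-the-envelope sketch. It assumes (my assumption, not something the stats can confirm) that truly independent downloads behave like a Poisson process, so the standard deviation of a daily count is about the square root of its mean. The Monday figures are the ones quoted above.

```python
import math
import statistics

# Monday Python 2 download counts for numpy, as quoted in the post above.
previous_mondays = [258_000, 236_000, 305_000]
spike = 598_000

baseline = statistics.mean(previous_mondays)

# Assumption: independent downloads are modeled as a Poisson process,
# so the standard deviation of a daily count is sqrt(mean).
expected_sd = math.sqrt(baseline)

# How large the day-to-day noise would be relative to the baseline,
# and how far the spike sits from the baseline in units of that noise.
relative_noise = expected_sd / baseline
z_score = (spike - baseline) / expected_sd

print(f"baseline: {baseline:,.0f} downloads")
print(f"expected noise if independent: +/- {expected_sd:,.0f} ({relative_noise:.2%})")
print(f"spike: {z_score:,.0f} standard deviations above baseline")
```

Under that model the expected noise is only a few hundred downloads per day, well under 1% of the baseline. Even the "normal" Mondays differ from each other by tens of thousands, which is already far more than independent users would produce.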

And I feel like I see these kinds of weird spikes like, practically every time I look at some random project – there’ll be like 1 day or 3 days or something where some specific python version + OS will go wild, and then settle back down to the baseline.

Of course I haven’t done any systematic study of it, and there’s no way to actually check this hypothesis :frowning:

I guess in theory it might be possible to come up with some kind of clever statistical analysis that tries to infer what distribution of latent users could produce the kinds of spiky patterns we see in the data. Anyone know any stats PhD students that might be nerd-snipable?
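A much simpler first diagnostic than the analysis imagined here is the variance-to-mean ratio (dispersion index) of daily counts. The toy simulation below is only a sketch: every parameter (5000 users, 5 clusters of 1000 machines, 10% daily probability, 60 days) is made up for illustration, and the function names are hypothetical. It compares many independent users against a few clusters that reinstall on all machines at once, with the same expected daily volume.

```python
import random
import statistics


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for Poisson-like independent events."""
    return statistics.pvariance(counts) / statistics.mean(counts)


def simulate_independent(rng, days, users, p):
    """Each user decides independently, each day, whether to download."""
    return [sum(rng.random() < p for _ in range(users)) for _ in range(days)]


def simulate_clusters(rng, days, clusters, machines, p):
    """Each cluster reinstalls on all of its machines at once, with probability p per day."""
    return [
        machines * sum(rng.random() < p for _ in range(clusters))
        for _ in range(days)
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy parameters; both setups average 500 downloads per day.
independent = simulate_independent(rng, days=60, users=5000, p=0.1)
clustered = simulate_clusters(rng, days=60, clusters=5, machines=1000, p=0.1)

print("independent users:", round(dispersion_index(independent), 2))
print("few big clusters: ", round(dispersion_index(clustered), 2))
```

The clustered series has a far larger dispersion index even though the average volume is identical. That does not recover the distribution of latent users, but a ratio far above 1 in the real data would at least be consistent with the "few large installers" guess.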

At my university, there might be someone. There’s a professor there who’s highly into statistical signal processing, and is writing a book on Robust Statistical Signal Processing. You might try shooting him an email: https://www.spg.tu-darmstadt.de/spg/staff_1/currentstaffmembers/zoubir.en.jsp

I’ve updated most of them at hugovk/pypi-tools.

Thanks!

So the scientific community is roughly at 60% Python 3 (numpy, matplotlib, tensorflow), and CI at 70-75% (flake8 and pylint at the 70% end, with coverage and pytest trending to 80%). Pillow and Pylast fit in with the scientific crowd.

From this I’d say that the broad developer / hacker / learner user base sits somewhere around that 60% / 40% point too, while CI tooling keeps up to date, runs everything on the latest 3.x releases, and pulls along the developers who run linting on their dev machines.

From the point of view of a ‘broad developer / hacker / learner’ and Ubuntu user: (1) On Ubuntu 18.04 there is still the urge to just type ‘apt-get install python’ (as written in many tutorials or installation notes), which installs Python 2.7, and many servers might still be on 14.04 or 16.04. (2) During the conversion from Python 2.7 to Python 3, a lot of online help focused on making code compatible with both 2.7 and 3, so because of (1), and because ‘python xyz.py’ is easier to type, the code mostly runs on 2.7. (3) Only when a package runs only on Python 3 do I feel the urge to change the run.sh.

Probably as long as ‘apt install python’ is still 2.7 and packages are compatible, users that don’t need the latest scientific packages don’t upgrade automatically. I notice this with Python users around me also.

For the statistics and percentages, there is overlap: Python 3 and Python 2.7 can be installed on the same machines, and projects are tested on both.

Does anyone have any theories about how pip’s downloads could be so incredibly py2-heavy compared to everything else, even PyPI as a whole?

Just intuitively, I feel like people who download pip also download other stuff, right? Because why else would you download pip. Yet somehow those overwhelming numbers of python 2 pip users aren’t showing up in the rest of the stats. Do python 3 users just use a lot more packages than python 2 users, or what?