Packaging and Python 2

pitrou · June 8, 2019, 10:50pm

Always the same problem: do those numbers represent actual humans or CI systems?

One interesting data point: those numbers also put Linux usage at 90%. While Linux is probably more popular among developers than among the general population, that number still seems high. However, Linux is by far the most accessible platform on public CI services…

dstufft · June 8, 2019, 11:54pm

That much is certainly true. It’s also true that Linux is the most typical deployment platform, so deploys are likely influencing those numbers a fair amount too. However, I think it can be a mistake to start discounting numbers because we think they might come from a particular class of use. We can only guess at whether they do or not, and even if they do, for our purposes people using us in CI are still people using us.

pganssle · June 9, 2019, 12:02am

I generally agree with this, but I think what we’re trying to figure out by looking at statistics is “how many people will this impact?” If a large fraction of the download stats are coming from a relatively small number of real people, or some automated configuration that can be changed en masse, it’s worth taking that into account.

That said, I can’t say I know what those numbers represent. If it’s mostly CI systems of projects that are testing against Python 2 because it’s just another line in their matrix, and the number of Python 2 end users is small, we might start to see a dramatic decline in that number as projects start dropping Python 2. Or we may see the numbers hold steady-ish up until the moment we drop Python 2 support, and then a dramatic decline, or some completely different thing.

hugovk · June 9, 2019, 7:59pm

Here’s the general trend since 2016. Like pypistats.org, also from BigQuery:

(https://dev.to/hugovk/python-version-share-over-time-4-26hm)

pf_moore · June 9, 2019, 10:45pm

I don’t think much has changed since my post above. Specifically:

I don’t think anyone’s really arguing any more that we suddenly on 1st January remove 2.7 from our CI, etc. We may do so at some point in the future, but I’ve no idea when anyone will have the time or energy to bother doing so. So (1) is something of a moot point.

That leaves (2).

But to be honest, I’m not at all clear what the question even is here. What exactly is “Python 2.7 support”? And no, that’s not a joke - in the context of a completely volunteer driven organisation, what are we promising? There’s no guarantee that any bug report gets looked at, whether it’s about Python 2.7 or not.

At an individual level, I don’t work on any tickets that are MacOS specific, and I only work on the more superficial Linux ones. Not because I “don’t provide support”, just because I don’t have knowledge of, or access to, those systems. From 2020 onwards, I’ll almost certainly just treat Python 2.7 the same. If every one of the pip developers does that, will that mean we don’t support Python 2.7 any more? If some of us are still willing to work on Python 2.7, but they are too busy with other work to do so, does that mean we no longer support Python 2.7?

That’s honestly the only answer I can give.

mjpieters · June 10, 2019, 8:48am

Very interesting, but from 2016 up to March, so now over 2 months old.

I find the other stats on the same blog post, for dev tools and the scientific packages, to be very telling. Those show a much better picture of current use, I’d say, and 2.7 is losing ground very fast there!

Pytest is the most dramatic:

njs · June 10, 2019, 10:00am

I can’t prove it, but personally my guess is that our Python 2 download metrics these days are heavily influenced by large companies installing on large clusters.

Mostly this is because every time I look at the stats, the python 2 downloads are super spiky. For example, I just checked numpy’s recent downloads:

https://pypistats.org/packages/numpy

And sure enough, last monday (06-03), it had 598k python 2 downloads. Compared to recent history, this is highly anomalous: the previous few Mondays had 258k, 236k, 305k. Also, if you look at the per-OS stats, the spike is clearly all or almost-all on Linux.

Now, the law of large numbers tells us that if you have lots of independent random events – like say, a few hundred thousand different people with no connection to each other, each independently deciding whether to download numpy – then the noise tends to average out pretty quickly, and you shouldn’t see giant spikes like this. It would be extremely weird for 300k people to all independently say “hey, Monday June 3rd, I like the sound of that, that’s a great day for python 2 linux users like me to upgrade numpy”. But if it’s like, one person rolling out a huge cluster, then it makes more sense.
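To put a rough number on that argument, here is a minimal sketch using only the Monday counts quoted above. It assumes that fully independent downloads would behave like a Poisson process, so the day-to-day standard deviation would be about the square root of the mean. That model and the three-Monday baseline are illustrative assumptions, not a real analysis of the PyPI data.

```python
import math

# Python 2 download counts for numpy on Mondays, as quoted in the post above.
previous_mondays = [258000, 236000, 305000]
spike = 598000  # Monday 06-03

baseline = sum(previous_mondays) / len(previous_mondays)

# Assumption: if every download were an independent decision, the daily
# count would be roughly Poisson, so its standard deviation is sqrt(mean).
independent_sd = math.sqrt(baseline)

# How large is the spike relative to the baseline, and how many
# "independent-user" standard deviations away from it does it sit?
ratio = spike / baseline
z_score = (spike - baseline) / independent_sd

print(f"baseline: {baseline:.0f} downloads per Monday")
print(f"noise expected from independent users: about {independent_sd:.0f} downloads")
print(f"spike is {ratio:.2f}x the baseline, {z_score:.0f} standard deviations out")
```

Under those assumptions the expected noise is only a few hundred downloads, while the spike is hundreds of thousands above the baseline. Even the ordinary week-to-week variation between the earlier Mondays is far larger than independent users alone would produce, which points the same way: the downloads are strongly correlated.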

And I feel like I see these kinds of weird spikes like, practically every time I look at some random project – there’ll be like 1 day or 3 days or something where some specific python version + OS will go wild, and then settle back down to the baseline.

Of course I haven’t done any systematic study of it, and there’s no way to actually check this hypothesis.

I guess in theory it might be possible to come up with some kind of clever statistical analysis that tries to infer what distribution of latent users could produce the kinds of spiky patterns we see in the data. Anyone know any stats PhD students that might be nerd-snipable?

hameerabbasi · June 10, 2019, 10:40am

At my university, there might be someone. There’s a professor there who’s highly into statistical signal processing, and is writing a book on Robust Statistical Signal Processing. You might try shooting him an email: https://www.spg.tu-darmstadt.de/spg/staff_1/currentstaffmembers/zoubir.en.jsp

hugovk · June 10, 2019, 1:39pm

I’ve updated most of them at hugovk/pypi-tools.

mjpieters · June 10, 2019, 1:50pm

Thanks!

So the scientific community is roughly at 60% Python 3 (numpy, matplotlib, tensorflow), CI at 70-75% (flake8 and pylint at the 70% end, and coverage and pytest trending to 80%). Pillow and Pylast fit in with the scientific crowd.

From this I’d say that the broad developer / hacker / learner user base sits somewhere around that 60% / 40% point too, while CI tooling keeps up to date, runs everything on the latest 3.x releases, and pulls along the developers who run linting on their dev machines.

aldwinaldwin · June 11, 2019, 4:34am

From the ‘broad developer / hacker / learner’ point of view, and as an Ubuntu user: (1) in Ubuntu 18.04 there is still the urge to just type ‘apt-get install python’ (as written in many tutorials or installation notes), which installs Python 2.7, and many servers might still be on 14.04 or 16.04. (2) During the conversion from Python 2.7 to Python 3, a lot of online help focused on making code compatible with both 2.7 and 3, so because of (1), and because it is easier to type ‘python xyz.py’, that code mostly runs on 2.7. (3) Only when a package runs only on Python 3 do I feel the urge to change the run.sh.

Probably as long as ‘apt install python’ is still 2.7 and packages are compatible, users that don’t need the latest scientific packages don’t upgrade automatically. I notice this with Python users around me also.

For the statistics and percentages, there is overlap: Python 3 and Python 2.7 are installed on the same machines, and projects are tested on both.

njs · June 11, 2019, 5:18am

Does anyone have any theories about how pip's downloads could be so incredibly py2-heavy compared to everything else, even PyPI as a whole?

Just intuitively, I feel like people who download pip also download other stuff, right? Because why else would you download pip. Yet somehow those overwhelming numbers of python 2 pip users aren’t showing up in the rest of the stats. Do python 3 users just use a lot more packages than python 2 users, or what?

mjpieters · June 11, 2019, 5:26am

Ubuntu 16.04 and 18.04 both come with python3, versions 3.5 and 3.6 respectively. 14.04 is no longer supported, so rapidly becoming a distant exception.

aldwinaldwin · June 11, 2019, 5:48am

Yes, Python 3 comes as standard … but running ‘sudo apt install python-pip’ or just ‘sudo apt install python’ will install Python 2.7. And 14.04 is no longer supported, but it is still installed on a lot of servers. Many small companies have a ‘don’t touch it, it’s working’ attitude. Just mentioning: even when things are not supported anymore, the situation in the field might be different. I find it difficult to get totally away from Python 2.7, and other people are surprised when I tell them Python 2.7 is almost EOL, in 2020. That could explain those numbers of 40%-50%-60% still on Python 2.7.

dstufft · June 11, 2019, 2:06pm

pf_moore:

But to be honest, I’m not at all clear what the question even is here. What exactly is “Python 2.7 support”? And no, that’s not a joke - in the context of a completely volunteer driven organisation, what are we promising? There’s no guarantee that any bug report gets looked at, whether it’s about Python 2.7 or not.

At an individual level, I don’t work on any tickets that are MacOS specific, and I only work on the more superficial Linux ones. Not because I “don’t provide support”, just because I don’t have knowledge of, or access to, those systems. From 2020 onwards, I’ll almost certainly just treat Python 2.7 the same. If every one of the pip developers does that, will that mean we don’t support Python 2.7 any more? If some of us are still willing to work on Python 2.7, but they are too busy with other work to do so, does that mean we no longer support Python 2.7?

I think supporting 2.7 doesn’t mean anything different from what we’ve been doing. We keep the CI working and ensure that new PRs don’t regress Python 2.7 and that they work on 2.7 as well as Python 3. In my (admittedly very rudimentary) tests, even the oldest pip available on PyPI still works fine on Python 2.7 after you adjust the default index url, which suggests ongoing maintenance isn’t something we need to worry about. What works today is likely going to continue to work tomorrow except when we change the code through PRs, so it’s really just about ensuring any new changes don’t break 2.7.

Personally I think it would be fine to officially put 2.7 in some kind of “limited support” or “community support” or whatever we want to term it, which just basically means that we’ll keep pip running on 2.7, but that any non-show stopping bugs that only exist on 2.7 will be tagged as community support and ignored by the pip developers, and if someone wants them fixed they’ll be responsible for writing the PR to do so. This is basically the scenario you described here, just being explicit about it rather than implicit. The non-show stopping qualifier is a bit fuzzy and purposely so, but I think the OS parallel is pretty good here. There have been cases where Windows broke, and broke badly enough during a release that I attempted to fix it because there was no-one else around, but I generally don’t touch any Windows specific issues because I don’t have a Windows computer lying around, nor do I use Windows enough to know it well enough to feel comfortable working on it. I don’t think we’d need to do anything differently than that, where how important any specific bug is, is kind of up to us.

Because I had just seen it come up in a pip ticket, I don’t think that means we need 1:1 feature parity between them. Features exist on a spectrum of importance, and I think it’s fine to limit less “important” features to Python 3+ (or really not even specific to Python 3+, I think it’s also fine to do for Python 3.6+ or something too). Obviously that should be used sparingly because version specific features are generally confusing to end users, but it’s a useful thing to do in cases where the feature is low enough importance and getting it to work on older Pythons is significant enough work.

The tl;dr here is I think the main benefit is ensuring that our major new packaging features are able to get into the hands of people who are still using 2.7, because those people represent a large number of users for us and packaging has network effects where people will not use new features if it locks them out of supporting a version of Python that they want to continue to support. I don’t think the value is in dedicating time fixing minor bugs or ensuring every single minor feature works on 2.7.

zehauser · June 11, 2019, 2:43pm

Total speculation, but inside corporate-land I’ve regularly seen incantations like this for building/testing Python code in a CI system:

...
pip install --upgrade pip
pip install -r requirements.txt -i https://some.private.repository/simple
...

Which would download pip from PyPI but other packages from an internal mirror. I doubt there was any particular reason for this.

pf_moore · June 11, 2019, 3:01pm

For reference, this is the relevant comment. And I think “available on platforms that support it (specifically not on 2.7)” sends a good message that we’re not going to support 2.7 indefinitely.

My worry here is that if, when writing new feature code, we hit Python 2.7 issues, who will work around them? Hopefully, this won’t be a common issue, but I think I did hit things like that on the PEP 517 work. And my point is that I wouldn’t want to put more than minimal effort into that myself (much like I’d say “weird breakages on MacOS, can anyone help fix these?”). Are we comfortable blocking new feature development on increasingly rare “developer willing to work on 2.7” resource?

There’s also a social aspect to this, of course - and this is one reason I’m being so explicit about my preferences - that there’s a pressure on people developing new features to not dump their half-working code on others, which puts pressure on them to continue working on 2.7 when they don’t want to. And because “working on 2.7” is a choice (where “I don’t have a Mac/Windows/Linux” is more of a simple reality) it’s harder to stick to (and consequently easier to feel pressured).

Overall, I think we’ve probably spent more time debating this than we’ll actually use supporting 2.7 in a year. So I’m OK with just keeping the status quo, if others prefer to do that. But at some point, we will have to drop 2.7 support, and we’ll have to give our users some notice of that, so we can’t keep postponing the decision indefinitely.

pradyunsg · June 13, 2019, 4:53am

Seems like our consensus here is to stick with the maintainers continuing to volunteer their time to keep Py 2.7 afloat.

Personally, @dstufft’s suggested approach makes sense for me here.

If there’s no major concerns with this, let’s do this. That would wrap up the “who maintains it and how” part of this discussion.

As of right now, we decided to inform this with metrics but I’m not sure where we’d draw the line yet.

I’m thinking ~15% usage is a good number but I don’t know what metric to use for that. This is really the only thing left in this discussion IMO. What metrics do we use to decide when to drop support?

Does anyone have any theories about how pip's downloads could be so incredibly py2-heavy compared to everything else, even PyPI as a whole?

We should look at setuptools too for this.

methane · June 13, 2019, 11:45am

The answer is: Install, Update, and Uninstall the AWS CLI version 1 on Linux - AWS Command Line Interface

This document refers to Python 3 now, but it did not mention Python 3 at all for a long time.
Many users install awscli on python2 automatically.
For example, Amazon Linux cloud-init script · GitHub

Top 10 packages are pip, setuptools, awscli, and its dependencies: PyPI Download Stats

njs · June 14, 2019, 12:24am

Whoa. You’re right.

In the last 28 days, pip had 44146025 py2 downloads and 11052346 py3 downloads, so 80% py2.

Over the same period, awscli had 24469288 py2 downloads and 5409278 py3 downloads, so 82% py2.

For a back of the envelope calculation, let’s assume that pretty much all of the people installing awscli on py2 are also downloading pip. (Especially since py2 doesn’t ship with pip.)

That suggests that if we split off the people using awscli on py2, then among the rest of the world pip’s downloads were ~64% py2. And if everyone using awscli switched to py3, then pip’s downloads would be ~65% py3.
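The back-of-the-envelope arithmetic above can be reproduced with a short sketch. It uses only the 28-day counts quoted in this post and the stated assumption that every awscli py2 install also downloads pip once; the variable names are just for illustration.

```python
# 28-day download counts quoted above.
pip_py2, pip_py3 = 44146025, 11052346
awscli_py2, awscli_py3 = 24469288, 5409278


def share(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


pip_py2_share = share(pip_py2, pip_py2 + pip_py3)
awscli_py2_share = share(awscli_py2, awscli_py2 + awscli_py3)

# Scenario A: split off the awscli-on-py2 users, assuming each awscli py2
# download corresponds to exactly one pip py2 download.
rest_py2 = pip_py2 - awscli_py2
rest_py2_share = share(rest_py2, rest_py2 + pip_py3)

# Scenario B: the same awscli users stay, but move to py3 instead.
switched_py3 = pip_py3 + awscli_py2
switched_py3_share = share(switched_py3, pip_py2 + pip_py3)

print(f"pip py2 share today:              {pip_py2_share:.1f}%")
print(f"awscli py2 share today:           {awscli_py2_share:.1f}%")
print(f"pip py2 share without awscli py2: {rest_py2_share:.1f}%")
print(f"pip py3 share if awscli switched: {switched_py3_share:.1f}%")
```

With these inputs the two scenarios come out at about 64% py2 and a little over 64% py3 respectively, which is in line with the rough figures quoted above.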

@dstufft So uh… anything you can share about how AWS is going to get people off py2? It’s kind of single-handedly holding volunteers hostage for supporting py2. And looking at the awscli downloads timeline, the py3 usage appears to be completely flat over the last six months, no trend away from py2 at all right now.