Packaging and Python 2

These two points make me think that the PyPI download stats could be very misleading. Consider that CI builds tend to be done from a clean VM, so the pip cache won’t apply, whereas manual installs will almost always be on a machine with a filled cache. As a result, the cache essentially biases the results in favour of CI builds over manual installs.

Add to this the fact that CI builds (where the various Python versions run in the same CI instance - which may be the minority, as a lot of projects seem to use a separate CI run per Python version) likely run in version order. In that case the Python 2 job runs first and does the cache-cold download, while the later Python 3 jobs hit the shared cache, so universal wheels and sdists will be counted against Python 2 more often.

I’m not saying the download stats are useless, but these points make me much more inclined to discount them when trying to measure actual Python 2 usage. Of course, I’m on record as wanting to desupport Python 2, so there’s bound to be a high level of bias in my interpretation; take the above analysis with a large pinch of salt.

Possibly it should be in a separate thread, but I think the packaging/install chain should definitely lag behind the main EOL date, because we should be responsive to the needs of major libraries that want to drop support for Python 2 without leaving their Python 2 users in the lurch.

We are only now starting to get a decent test of heavy reliance on python_requires, and I frequently run into issues with it when packages I use drop support for Python 3.3 - and I’m not even actively using Python 3.3. I suspect that we will only start getting a real sense of the ways that python_requires fails after some tipping point where major libraries are relying on it as their “2/3” support mechanism. As much as I’m the first person to recommend abandoning Python 2, I think PyPI probably needs to keep fighting the “rearguard” action.
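
For anyone not familiar with the mechanism in question: python_requires is declared in a project’s setup.py, and pip 9+ reads the resulting metadata to skip releases that don’t support the running interpreter. A minimal sketch (the project name and version below are placeholders):

```python
from setuptools import setup

setup(
    name="example-project",  # hypothetical name, for illustration only
    version="2.0.0",
    # pip 9+ will refuse to select this release on interpreters older
    # than 3.4; pips older than 9 ignore the field entirely, which is
    # one of the ways python_requires can fail in practice.
    python_requires=">=3.4",
)
```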

Of course, this level of support could be in the form of some kind of “LTS” branch that is focused almost entirely on migration path issues.

I agree with Donald that pip and setuptools will probably end up at the tail end of packages dropping support for py2. What I wonder, though, is whether there’s anything we can do ahead of that to avoid simply being held hostage by enterprise users.

Do we think we could work out some kind of coordination with the enterprise distros to share the load? Or, in my fantasy world, we’d announce that community support for py2 will continue so long as users collectively donate ($200,000 * ({year} - 2020))/year to making it happen (though that runs into the legal delicacies around the PSF’s non-profit status).

I think the PyPA client tools are a case where a “long term maintenance branch” model would be appropriate, and we’d look for ways to kick the task of actually maintaining those branches in the direction of the Python maintenance team at Red Hat (cc @encukou). Backporting new work to old stable versions is a good part of what Red Hat does in general, and with RHEL 7 & 8 both supporting Python 2.7 out to 2024, it’s in Red Hat’s interest to ensure there’s a PyPI client that aligns with that.

So I think PyPA client projects will be able to follow the example of the Scientific Python ecosystem: use Python-Requires to restrict Py2 users to still-compatible maintenance branches, and then open up their main dev branches to Py3-only contributions.

As far as timelines go, I think it would be reassuring to users if that branch point for pip was in 2020 sometime, rather than being in 2019.
That would also allow the final Python 2.7 release to include a fully up-to-date copy of pip for bootstrapping with ensurepip.

For wheel and setuptools, I don’t think it matters as much exactly when they branch, as long as they start setting Python-Requires appropriately first.

This is a good point. We need @benjamin to comment, but I believe the likely plan was for the final Python 2.7 release to only include critical fixes from the previous one. Updating the bundled packaging tools at that time seems appropriate, and then there’s no need to rev the major version for 2.7 again (absent an application of cash/effort from an interested organisation).

It would potentially be interesting to make a change in tox so that it runs Py2 CI after other Python versions, and see if that makes an observable difference in the PyPI Python version download metrics for sdists and universal wheels.
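
As a concrete strawman of that tox change (this tox.ini is hypothetical; tox runs environments in the order they appear in envlist, so listing py27 last means the Python 3 jobs would do the cache-cold downloads first):

```ini
[tox]
envlist = py37, py36, py35, py27

[testenv]
deps = pytest
commands = pytest
```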

There are also a few open issues against linehaul and pip regarding improving the user agent metrics in BigQuery to better distinguish different download scenarios.

There may also be value in teaching pip to identify well-known CI execution contexts, and report that info as an extra string field (e.g. set “travis-ci” based on seeing “TRAVIS_BUILD” in Travis CI’s documented environment variables, or “jenkins” based on seeing “JENKINS_URL” in a Jenkins build environment).
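
A minimal sketch of what that detection could look like (this is not existing pip behaviour - the mapping, the helper name, and the reporting field are all assumptions; the variable names are just the ones mentioned above):

```python
import os

# Hypothetical mapping from well-known CI marker variables to a short
# label that pip could report as an extra user agent field.
_CI_MARKERS = {
    "TRAVIS_BUILD": "travis-ci",  # per the Travis CI docs referenced above
    "JENKINS_URL": "jenkins",     # per the Jenkins docs referenced above
}

def detect_ci_context():
    """Return a CI label if a known marker variable is set, else None."""
    for var, label in _CI_MARKERS.items():
        if var in os.environ:
            return label
    return None
```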

There will likely be a pip release in Jan 2020. I agree that it would be a good idea to branch at that point, as well as to include that release in the final Python 2.7 release.

Yeah, this is my main concern with having an LTS release: how long we’d have to support it, and who would do the work to support it.

I personally have little interest in maintaining Python 2 compatible code beyond late 2020/early 2021, since I’ve basically not used Python 2 in my personal projects for a fair amount of time now.

Aside: let’s split out the discussion on the metrics to a different thread? I’m pretty sure we collectively have a lot of thoughts there, and it’ll be nice to keep this thread focused on what our plans/options are.

As long as the packaging ecosystem is maintained by volunteers, I think the last word should go to whoever maintains each tool or package.

Going forward, maintaining Python 2 compatibility should become a vendor-supported activity. Python 2 users should look towards companies such as Anaconda, Red Hat, and Microsoft if they want continued compatibility.

I don’t think that vendor support is going to work for packaging tools, because of the large network effects at play here.

An open source project can just use an unsupported version of Python 2.7 to test their package with, and can use backports to get a large chunk of new Python features even while keeping 2.7 compatibility. Unfortunately the same isn’t true for the packaging tool chain: you can’t extend pip 19 by installing additional libraries from PyPI, so you’re stuck with whatever features existed in pip 19 if you want to continue to support 2.7.
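
For contrast, this is the kind of backport pattern that’s available to ordinary projects but not to pip’s own feature set (backports.functools_lru_cache is a real PyPI backport, used here purely as an example):

```python
try:
    # Python 3.2+ standard library
    from functools import lru_cache
except ImportError:
    # Python 2.7: install the backport from PyPI instead
    from backports.functools_lru_cache import lru_cache

@lru_cache(maxsize=128)
def fib(n):
    """A toy memoised function that works identically on 2.7 and 3.x."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```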

Thus I don’t think being aggressive in dropping support for Python 2.7 actually saves us any work. At best it just shifts the work around, making it harder to design new standards and features so that they can co-exist with whatever existed in pip 19, and possibly forcing us to compromise future designs.

Ultimately, I think it will bifurcate the user base, and will cause Python packaging improvements to stall and remain stalled for years.

I don’t know what Red Hat will fund (despite working there), but it wouldn’t surprise me if we keep things working with 2.7, with any patches freely available :)
We’ll be happy to coordinate, and I’ll be happy to set something up if there’s interest. However, Red Hat can’t do the testing and care needed for Ubuntu/Windows/Mac. Is anyone interested in doing that, either in the PyPA space or in some LTS fork?

Based on some of the replies, I realised I should probably clarify what I had in mind when I mentioned the idea of an LTS branch for pip (since I somewhat agree with @dstufft regarding the risks of delaying adoption of new packaging features if there turn out to be a non-trivial number of projects still publishing hybrid packages to PyPI, but also think there’s a significant cost in demanding that all potential new contributors learn how to write hybrid Python 2/3 code, rather than just letting them write for Python 3).

Specifically, I am envisioning something akin to what we’ve had for the last several years with the CPython 2.7 branch: the branch is still open to contributions, but the majority of CPython contributors simply don’t need to care that it exists. Instead, the only folks that need to care are those that are either working on Python 2.7 specific bug reports, or else are working on backporting fixes from the main Python 3 development branch.

So in a “pip LTS branch” model, I’d see there being an LTS branch in the main repo that kept Python 2.7 in its CI matrix, while the main line of development became Python 3.x only.

If there were a change where it was absolutely critical at an ecosystem level that Py2.7 users be able to benefit from the enhancement in order for the change to be effective at all, then that change would be made to the LTS branch in addition to the main line of development (if that means that 2020 nominally ends up having more than 12 months in it, at least as far as the LTS version numbers are concerned, so be it).

Personally, I expect such changes to be relatively rare once the PEP 517/518 implementations settle down - the main cases where I could see us really wanting a Py2.7-compatible implementation are if funding is forthcoming for an actual implementation of package signing in the PEP 458/480 model, or if enhancements get made to the platform compatibility tagging model.

I’m not a fan of the LTS branch model, largely because I think it is either pointless or significantly more work. Either the delta between master and the LTS branch is small, in which case the branch isn’t really buying you much besides some minor niceties OR the delta becomes large in which case backporting changes becomes a nightmare (I think it took weeks to backport the SSL improvements from 3.x to 2.x for instance, most of which was reconciling differences between the two branches). I think the latter problem will be a lot worse for pip than it was for CPython, largely because while CPython has a lot of modules which are largely independent other than using the public API of each other, pip’s code base weaves around itself and doesn’t have nearly the same clear boundaries that Python itself has between modules (although maybe other tools have a better story here?).

One of the typical goals of dropping 2.x support (being able to do a bunch of compat code cleanup and start utilizing 3.x only features) actively makes the nightmare case far more likely.

The main benefit of an LTS branch model is that it lets you continue new feature work that doesn’t get backported without having to deal with 2.x support in that branch. However, I don’t feel that is a very big benefit given the cost in the one case (or the pointlessness in the other).

Ultimately I think that an LTS branch is likely going to be just dropping support for 2.x, but in a way where we pretend we haven’t. I think we need to either be upfront and drop support, or keep support for installing/packaging in the “primary” tools (I don’t care if someone makes a new build backend or something that is py3-only) in some fashion, as I described above.

Having worked on major projects that are Py3-only and major projects that straddle the line between the two, I honestly don’t feel that straddling the line is particularly difficult. There’s nothing major in Python 3 that is going to benefit pip (and likely other projects) other than the ability to write async code, which would let us handle concurrency more cleanly. Everything else I can think of in Python 3 is either backported (via PyPI) or isn’t really that big of a deal.

As far as new contributors go, I suspect that the hairiness of the code bases, and the debt we’re still paying down from the era of gridlock, are a far larger barrier to entry than needing to deal with the syntax to straddle the boundaries.

All that being said, I don’t think we should keep 2.x support for as long as anybody is using it, anywhere, ever. All I think is that we shouldn’t look at the dates at all (except that we shouldn’t drop support before Python itself does); rather, we should let usage inform our timeline. That doesn’t mean we need to target < 5% usage like pip has historically done; maybe < 20% is the right answer (as a random number pulled out of my ass). I think that the current 60% number (as per the PyPI statistics), and a projected 50% number in 2020, is a bad place to draw the line.

It might be useful to explore this avenue - extending pip so that it can install for an interpreter it is not running on. This enables the codebase to be Python 3-only while providing a way for Python 2 users to stay on the newer releases.

I remember that someone had posted a PoC for this somewhere in pip’s tracker once (I can’t seem to find it right now; on mobile)

This would also enable installing into a venv that doesn’t have its own pip in it, which would significantly improve venv’s creation time (I recently benchmarked venv creation on Windows at about 0.5s with --without-pip and 40s otherwise, so I’m very much in favour of this for a number of reasons).

Yeah, like @steve.dower said, there are a lot of benefits to doing this (allowing people to have only one pip on their machine is a big one).

I think for pip specifically it wouldn’t be super hard. Going off memory, I think we inspect the current running Python for:

  • Basic computer/Python traits for determining compatibility tags, user agent, etc.
    • Should be pretty easy to do in a subprocess and return to a host process.
  • Byte-compiling the .py files to .pyc.
    • Again, easy to do in a subprocess, Python already has a command to do it (python -m compileall).
  • Determining paths to install files to.
    • There’s a lot of legacy code here where we’re monkeypatching (IIRC) and using distutils to determine the paths instead of just using sysconfig and computing them ourselves. It would be useful to figure out why, and whether we can switch to computing them ourselves from sysconfig values (as sketched below). If we can, then this is pretty trivial, but it gets more complex if we have to keep the logic that we currently have.
  • Determining what versions of a project are already installed.
    • I suspect this is the hardest bit of code, because we’re currently using pkg_resources, and that doesn’t have a “target” mode. Probably the best route here is to get something into packaging that can enumerate the installed packages and other information we need (including any standards we might need to draft) and make that a targeted API instead of a “only the current environment” API.

I think that’s everything?
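
As a rough illustration of the first and third bullets above (a sketch, not pip’s actual code - the helper name and the JSON shape are invented for this example), the information can be gathered by running a small script in the target interpreter:

```python
import json
import subprocess
import sys

# One-liner executed *in the target interpreter* to report the data the
# host process needs; sysconfig exists on both 2.7 and 3.x.
_QUERY = (
    "import json, sys, sysconfig;"
    "print(json.dumps({"
    "'version': list(sys.version_info[:3]),"
    "'platform': sysconfig.get_platform(),"
    "'paths': sysconfig.get_paths()}))"
)

def inspect_interpreter(python_exe):
    """Run the query in the given interpreter and parse its output."""
    out = subprocess.check_output([python_exe, "-c", _QUERY])
    return json.loads(out.decode("utf-8"))

if __name__ == "__main__":
    # Self-inspection here, but python_exe could equally be a Python 2.7
    # interpreter that pip itself no longer runs under.
    info = inspect_interpreter(sys.executable)
    print(info["version"], info["paths"]["purelib"])
```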

Of course that would solve the problem for pip, but not for build backends - particularly setuptools, since that’s the backend that the bulk of (all? I forget if flit supports 2.7 or not) 2.7-using projects are going to be using.

(all? I forget if flit supports 2.7 or not)

From the Flit docs:

Flit requires Python 3 and therefore needs to be installed using the Python 3 version of pip.
Python 2 modules can be distributed using Flit, but need to be importable on Python 3 without errors.

This would be an immensely useful feature in many ways. It’s something I’ve considered for a long time, but never had the free time to work on. I’m a strong +1 on this, both as an independent feature and as an approach to maintaining Python 2 support without needing to ensure pip can run under Python 2. And yes, I know, “install Python 3 to allow you to install packages in Python 2” is suboptimal for users.

Let’s get specific here. Who, precisely, will support pip on Python 2 going forward? There’s a lot of talk about the amount of work needed to support Python 2, but it’s not at all clear to me who will do that work. I’m on record as saying that I won’t, for example.

As things stand, the work of Python 2 support is a hidden overhead on all support activities. You can’t avoid it - if you work on a PR and the 2.7 CI fails, it needs fixing, even if 3.x is all working fine. The significant advantage of a LTS support branch is, as @ncoghlan pointed out, that it separates the concerns. People interested in maintaining Python 2 can work on the LTS branch, people not interested can ignore it and work just on master. If it’s a lot of effort to maintain the LTS branch, that means that Python 2 support is a lot of effort. If it’s trivial, Python 2 support is trivial. No problem - either way, it should be the people willing to provide Python 2 support that pay that cost.

The massive disadvantage here is that this arrangement splits the (already small) pip maintainer base. It adds overheads that we don’t have the resources to support. But in effect that’s just exposing a problem that we already have - we wouldn’t be talking about dropping Python 2 support if it wasn’t a drain on resources that we didn’t want to carry.

I’d like to see some commitment from the Python 2 using community to provide maintenance resource for pip under Python 2. That doesn’t need to wait till 2020, it could happen now. PRs for Python 2 issues, help in looking at how we set up and maintain an LTS branch, etc. Here, of course “the Python 2 using community” probably means “the enterprise distributions” - because getting commitment from grass roots users isn’t really sustainable or practical (although if one or two individuals come along and say they’ll specifically cover Python 2 support for a few years, that might work).

So there are a few parts of this.

One is that the code base, as it stands, works on Python 2.7. So no real additional effort is required to keep what already works there working, particularly since no new releases of CPython 2.7 will be coming out past 2020 - we’re effectively targeting a fixed point, so it won’t be changing out from underneath us.

Another part is new incoming bug reports that only affect 2.7 code bases. This is, IMO, the bulk of the effort of supporting a particular runtime where the code is already working. As I mentioned earlier, I’m perfectly happy declaring 2.7 in “community support” or something similar, meaning that if a 2.7-only bug report lands, we either close it or tag it as community support, and we basically just ignore it until the community opens up a PR or we finally fully drop support for 2.7, in which case we close it.

The last part, which you mentioned, is that any new PRs will have to continue to work on 2.7. For this, I don’t think there is any way around the fact that the people writing those PRs are going to have to make sure they run on Python 2.7. Typically that isn’t very hard, and largely means either locating backport libraries on PyPI for Python 3 features or avoiding a few language-level changes.

This is not really true, because it’s not necessarily Python 2 that’s causing the pain, but simply the decisions made on the Python 3 branch.

For instance, right now we can’t use asyncio or trio. The fact that we have to keep maintenance going prevents us from making decisions that would make it hard or impossible to continue Python 2 support. But in the hypothetical asyncio/trio world, if the master branch switched to using them, effectively all future patches written for pip’s master branch would have to be rewritten from scratch to support the non-asyncio/trio Python 2 code base.

Now maybe you’ll say “well, we’ll be careful not to introduce anything that can’t be backported to Python 2 if someone puts in the effort”. First of all, that sounds wholly impossible to me, because until we’re running the test suites on Python 2 we don’t really know what Python 3-isms we’re adding and how they’re going to affect Python 2. But beyond even that, just the fact that you have two disparate sources means that backporting becomes harder the longer the two sources are split. CPython managed to get away with it largely because (A) they weren’t backporting features (which we would need to enable), so the backported patches were generally pretty small, and (B) CPython doesn’t allow broad cleanup pull requests, whereas doing exactly that kind of cleanup is one of the stated goals of dropping 2.x support. The more of a delta the two branches have just from code churn, the more difficult it is to maintain the backport branch - not because supporting Python 2.7 is hard, but because of the nature of split branches.

I do not think it is meaningfully possible for us to maintain an LTS branch that gets anything more than select bug fixes pulled back to it. Anything else is effectively a fork, IMO, with divergent development streams that need patches effectively rewritten to move between them.

In either case, an LTS branch does not mean that “if Python 2 support is trivial, then the LTS branch is trivial”, because IMO the bulk of the effort of maintaining Python 2 in that model comes directly from the model itself, not from continuing to support Python 2.

I basically don’t care about this at all, because I don’t think the LTS branch model is in any way workable for the actual problems we would have dropping support for 2.7. If we keep 2.7 support in mainline, then there’s no significant porting effort needed, and I don’t think we need to care about fixing 2.7-only bugs in master. The only real additional work in my proposed model comes from ensuring that any new changes to the code base don’t break 2.7 support, and there’s not much that an external contributor can do to help there. It’s not reasonable to hold up every PR and say it can’t merge until Red Hat or someone comes along and adds 2.7 support to that PR.

Ultimately, I don’t think dropping support even saves us real effort; it just shifts effort around and makes other activities a lot harder. Keeping new code changes working on 2.7 is not very hard, but the knock-on effects of dropping support for 2.7 while it’s still the largest source of traffic will likely be far more work.

To be clear here, I don’t care about making it optimal for users; they’re on a legacy runtime at that point. As long as it’s “reasonably possible” without hamstringing our ability to continue to improve packaging while most of PyPI is still supporting 2.7, I think that’s absolutely fine.