Python core development dynamics

Another observation is that while we’re clearly on the decline, the # commits/day is still higher than it was during the 2005 dip. So it’s not catastrophic. It may correspond to a relatively tranquil, but still active, era for the project.

Pablo, those graphs are great. But they lie a bit. You are missing the most important Python 2 version in them. My intuition is that most of the activity in 3.1–3.3 is actually Python 2.7.

The improvements made in 2.7 were also made in either 3.1 or 3.2 at the time (in addition to a minority of 3.x-only changes).

I like seeing the number of commits per version chart. Does this also count backports? I.e. if I fix a nit in master and backport it to 3.7 and 3.6, does it get counted in each bar?


No, these are the commits reachable from master, and the versions are taken from the tags (i.e. 3.7 is the tag v3.7.0).
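For reference, each bar roughly corresponds to counting the commits reachable from one tag but not from the previous one, e.g.:

git rev-list --count v3.6.0..v3.7.0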

I will correct the graphs to include 2.7. :)

Including 2.7 is a bit challenging, as the CPython repo has a weird graph topology. This is the minimal topological representation of the graph, using a subset of the tags available in the repository:

[image: minimal topological graph of the CPython tags]

The number on each edge is the number of commits between the two nodes. The blue nodes are tags, while the yellow/orange ones are common ancestor commits. Notice how 3.1, 3.2 and 3.3 branch away from 2.7; this complicates the analysis around those releases. I am using https://github.com/ChristianStroyer/BranchMaster to do the analysis.
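(Those common ancestor nodes can be recovered with git merge-base, which prints the closest common ancestor of two commits, e.g., give or take the exact tag names:

git merge-base v2.7 v3.2
)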

If you wonder what the complete structure looks like, here you go:

[image: full structure of the CPython tag graph]

Because the tags do not lie on a single line, constructing the statistics properly is complicated.


These are very cool, thanks @pablogsal. I wonder to what extent changes in our workflow affect the results. For example, during the Mercurial era (ending after the 3.6.0 release), our workflow was to commit bug fixes to the oldest applicable branch first, then merge them as necessary up through the newer branches to default (now master). We also tried to keep only default open, so at times null merges were needed to do that. (I’m already fuzzy about how that all worked!) I would guess it would not have a significant impact, but I believe our previous workflow used more commits than the current one does. I’m not sure how easy it would be to identify and exclude things like those null merges.
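If anyone wants to try: my understanding is that a null merge leaves the tree identical to its first parent’s tree, so something like this should list the candidates (a rough sketch, and possibly slow on the full history):

# list merge commits whose tree equals their first parent's tree
git rev-list --merges master |
while read c; do
  [ "$(git rev-parse "$c^{tree}")" = "$(git rev-parse "$c^1^{tree}")" ] && echo "$c"
done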

I see a decreasing trend starting in 2012…

As interesting as they are to look at, I don’t see an inherent problem with the trends. There was also a decrease from 2.2 through 2.5, before the idea of py3k picked up steam, spurring us to simultaneously toss stuff into 2.6 or 2.7 (knowing they were an end point) while working on 3.0, which still needed a lot of work up until ~3.3 to even make a transition to 3 plausible.

People value a stable language; we seem to be giving them that again in 3.x now, which naturally means fewer changes?


When did we start sending new modules to PyPI rather than the stdlib?


People value a stable language; we seem to be giving them that again in 3.x now, which naturally means fewer changes?

I think that’s part of it. Also, as has been discussed elsewhere, there isn’t a ton of low-hanging fruit (easy bugs) left to fix. Many issues on the tracker are difficult, have long threads, or have no clear solution. So I think that’s discouraging new developers too.

I don’t think it’s even necessary to explicitly “send” people to PyPI. The fact that Python packaging is much better now a) makes people go there first and b) competes for people’s time.

I can see it in myself: when I started in 2011, I didn’t have any projects of my own to speak of and could direct all my energy toward core. Nowadays I’m completely inundated by my own FOSS projects.


What I meant is: when people propose a new module, when did we start telling them to put it on PyPI first, with possible inclusion in the stdlib later? Most never came back, so I assume that has somewhat slowed the rate of commits and new development on CPython.

(I’m totally fine with this, as it happens. I’d love to move most of the current stdlib to PyPI too, which would help us get by with fewer active developers more focused on the core runtime. Just wondering about the impact.)


So I have an idea that could possibly lead to more insight (though I don’t have time to do it myself).

You could randomly select 10-20 commits from the year with the most commits, and do the same for the most recent period. Then look in depth at each selected commit and classify it. The bins will be obvious once you see the commits, but I imagine you could rank them by complexity, number of lines (e.g. diffstat), number of files, or the nature of the change (e.g. docs, bugfix, new feature; or C code vs. Python code, etc.).

Comparing this for the two time periods should give a lot of real insight.
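The sampling step itself is cheap; something like this (assuming GNU shuf, and using 2012 as a stand-in for the busiest year) pulls random commits together with a one-line diffstat as a first rough complexity measure:

# sample 20 random commits from 2012 with subject + shortstat
git rev-list --since=2012-01-01 --until=2013-01-01 master | shuf -n 20 |
while read c; do
  git log -1 --oneline --shortstat "$c"
done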


Years ago. I don’t remember exactly when, but for some time now we have told people to put their project on PyPI first; after a year’s worth of experience they can come back and talk to us about inclusion (basically to stop the “I wrote this thing this weekend and I think it should be in the stdlib” suggestions).

For a long time the mantra has been “the stdlib is where packages go to die”, where “die” really means “don’t evolve very quickly”. There’s still a tension between moving quickly and batteries included, and I wonder whether a serious push to move development of most core libraries out of the CPython repo would encourage more alternative implementations, or at least help those which could then move more quickly because they could more easily contribute to the stdlib.


I started looking into this, but I bumped into a problem: in the 3.2 era, we backported changes to 2.7 as independent commits. Those appear as “duplicate” commits in that they have exactly the same commit message as the corresponding 3.x commit, but carry no independent indication that they are backports. If I randomly sample commits from this era, I cannot tell from an individual commit whether it is a backport. Yet it would be important to classify commits as backports rather than “primary” work. Does anyone have an idea?

(PS: 10-20 is probably too small a sample size, I was going for 50)
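As a crude first pass for the detection itself, since backports reuse the message verbatim, duplicated subjects should flag the candidate pairs (modulo generic messages that repeat anyway):

git log --format='%s' v2.7 v3.2 | sort | uniq -d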

Maybe you could use git-cherry to check if the commit is also on the other branch and then mark it as “backport” or “duplicate”.

AFAICT, there are no separate branches from git’s point of view.

You can limit it to a single commit, to check whether that commit is a copy of one reachable from another tip:

git cherry UPSTREAM_SHA the_commit_sha the_commit_sha^

There is also the full syntax, which checks whether anything in SHA_LOW..SHA_MIDDLE has an equivalent among the commits reachable from SHA_HIGH:

git cherry SHA_HIGH SHA_MIDDLE SHA_LOW

Although this may not be exactly what you need.
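For what it’s worth, in the output a leading - means an equivalent change exists upstream and a leading + means it does not, so a backport should show up as something like:

$ git cherry v2.7 the_commit_sha the_commit_sha^
- the_commit_sha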