CPython codebase plots!

The year is almost over and plenty of internet services are generating the “2020 in review” summary so I hope everyone is in the mood for some extra plots :slight_smile: I have generated some “CPython lifetime in review” plots that I hope you enjoy. Generating these plots takes a lot of time because the git blame for every single commit needs to be analyzed and aggregated (and that is a very filesystem-intensive O(n^2) process), but there are some interesting insights and statistics that can be drawn from the results.
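For anyone curious about what "analyzing and aggregating the git blame" can look like, here is a minimal sketch and not the actual tool's code. It assumes the blame was produced with `git blame --line-porcelain`, which repeats an `author-time <unix timestamp>` header for every line, and it groups lines by the year of that timestamp. The function name `lines_per_year` and the choice of author time (rather than committer time) are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def lines_per_year(porcelain_text):
    """Count blamed lines per authoring year.

    Expects the text output of `git blame --line-porcelain <file>`.
    Source lines in that format start with a TAB, so they can never be
    mistaken for an `author-time` header.
    """
    counts = Counter()
    for line in porcelain_text.split("\n"):
        if line.startswith("author-time "):
            timestamp = int(line.split()[1])
            year = datetime.fromtimestamp(timestamp, tz=timezone.utc).year
            counts[year] += 1
    return counts


# Hand-written sample in the porcelain layout (hashes shortened, hypothetical).
sample = (
    "abc 1 1 2\nauthor A\nauthor-time 860000000\n\tfirst line\n"
    "abc 2 2\nauthor A\nauthor-time 860000000\n\tauthor-time 1600000000\n"
    "def 3 3 1\nauthor B\nauthor-time 1600000000\n\tthird line\n"
)
cohorts = lines_per_year(sample)
```

Running this once per file and per commit is what makes the whole process so expensive; summing the counters across files gives one column of the stacked plot.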


This plot shows the total number of lines in the codebase broken down into cohorts by the year the code was added. Looking at the different colours, you can observe how the code added in a particular year survives over time. (“other” in this plot refers to everything before 1997).

Same idea but for extensions of the files in the codebase. Here you can see how the file extensions evolve over time.

The same idea again but broken down by authors. It is impossible to plot every author, so the ones displayed here are the ones that have added/deleted/modified the most lines cumulatively. Almost everyone is really in the “other” group.

This curve shows the percentage of lines in a commit that are still present after x years. It aggregates over all commits, no matter what point in time they were made. So for x=0 it includes all commits, whereas for x>0 not all commits are counted (because we would have to look into the future for some of them). The survival curves are estimated using Kaplan-Meier.
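To make the Kaplan-Meier part a bit more concrete, here is a small pure-Python sketch of the estimator. It is not the code used for the plots. It assumes each line of code is described by how long it was observed (`durations`) and whether it was actually seen to disappear (`observed`); lines that are still alive at the end of the history are "censored", which is exactly the "we would have to look into the future" case. The function and variable names are made up for this example.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] for each time t at which at least one line died.

    durations: how long each line was followed (e.g. in years).
    observed:  True if the line was removed at that time,
               False if it was still present when observation ended.
    """
    deaths = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # deaths and censored lines both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk lines that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Five lines: three were removed, two were still alive when observation stopped.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored lines still count in the denominator up to the moment they drop out of view, which is why the estimate stays usable even though recent commits have only a short history.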

These plots have been generated using a modified version of this tool.


Such amazing work, thanks for sharing this Pablo!


Apparently the years 1998, 1999 and 2009 didn’t leave any trace in the CPython codebase. Clearly those were anni horribiles for the project.

Also, for those wondering why there’s a huge jump in my output in the middle of 2010: I essentially rewrote the entire C codebase around that time… for some meaning of “rewrote” :wink: .


I find the abrupt drops and disappearances hard to interpret. What did Fred Drake do wrong, or Jack Jansen, that their lines should be snuffed out like that? :fearful:

I suspect relevant changes were the conversion of the TeX docs to reST by Georg and the removal of the old Carbon wrappers.


That means that someone else modified or deleted lines that were last touched by them, so the git blame now points to the person who did these edits.

They did nothing wrong! It is quite normal that lines written by person A are edited/modified/deleted by person B. That’s the nature of an evolving code base.

Some of the tectonic faults in the graphs may be related to cvs -> svn -> hg -> git migration artifacts. IIRC we converted the trunk (aka main) into Python 2 legacy branch and py3k branch into trunk in 2010.


Yeah that left some scars in the topological structure of the repo:

Since ‘other’ in the top plot is pre-1997, it would make more sense and make the plot easier to read if other were kept on the bottom. I suspect naming ‘other’ as ‘1992-6’ would do that.



Here is also a version with all the years unfolded:

Although the readability is atrocious :slight_smile:

I have modified that plot with your suggestions. It reads much better, thanks!

In this plot with all years, the total lines of pre-1997 code, gray and below, only decreases, as it should, and is nearly all gone now, which I sort of expect. In the revised graph up above, with one pre-1997, the lines go down and back up, sometimes considerably, and remain fairly large. Perhaps ‘other’ included more than dated pre-1997 lines.


Presumably it also includes 1998, 1999, 2005, and 2010 (which the spikes seem to confirm).

Indeed, that was the problem. I have corrected the plot to include all years and now “pre-1997” is really everything before 1997. Thank you both for pointing that out!