PEP proposal: Automatically Formatting the CPython Code

taleinat · October 29, 2020, 5:11pm

Hi all,

In the dev sprint last week the issue of auto-formatting came up again, despite nobody having planned to discuss or work on the topic in advance. Several participants started a discussion, where we reconsidered this from a fresh perspective. After getting positive feedback on the concept from several core devs, @ammaraskar, @isidentical, @corona10 and I have worked up a draft for a PEP proposing automatically applying and enforcing formatting of the CPython code (see below).

This is still in an early stage; it’s not an actual PEP yet. We’d like to get feedback on the idea and high-level approach, as well as hear about additional issues or problems that we may have missed, before proceeding to make a PEP for this.

With that in mind, please take a look at the following draft!

Abstract

The CPython codebase consists of three major languages: Python, C and reStructuredText [1]. For each of these languages there is a set of written code style conventions that are followed by contributors and core developers. Learning, applying, and enforcing these conventions are time-consuming tasks which impede our workflow.

In recent years many software projects have adopted workflows where code formatting is entirely automated. This PEP outlines the usage of such tools in the CPython development process.

Motivation

The aim of this proposal is to completely automate the applying and enforcing of the existing code style guidelines. In a wider context, this is one of several initiatives aimed at addressing the large backlog (see below) of the CPython project. It would achieve this both directly, by making code reviews more efficient, and indirectly, by making contribution easier and friendlier, thus attracting more developers to work on the project.

At the time of this PEP’s writing, there are over 1,300 open pull requests (“PRs”) on the CPython GitHub repository [2], and over 3,000 open bug tracker issues with a patch or a PR [5]. This large “backlog” has been acknowledged as a serious problem, as can be seen, for example, by the Python Steering Council discussing this in November 2019 [13]. The most significant reason for this is a lack of core developer time to review and handle these patches and PRs, as noted in the Python Developer’s Guide [6] and many times on the python-dev mailing list [7] and discuss.python.org [8].

In recent years many aspects of the CPython development workflow have been improved via automation, such as automatically running tests for PRs and automating the backporting of PRs to maintenance branches. Like testing and backporting, applying and enforcing code formatting styles is a time-consuming process that is natural to automate.

At this point in time (late 2020) the authors believe that this aspect of the workflow is prime for automation, thanks to the growing popularity and maturity of automatic formatting tools. Some prominent examples of such tools are clang-format, rustfmt, gofmt, prettier and standardjs. In the Python ecosystem, the most popular options are black and yapf. In contrast to earlier tools which would highlight or print warnings on style violations, this new generation of tools automatically formats code to conform to a specific style. This is the foremost reason we are reconsidering this idea, which was previously discussed and rejected in 2016 [20].

Besides making the PR process more efficient, adopting auto-formatting would “lower the bar” for new contributors. The current workflow requires contributors to read PEP 7, PEP 8 and the documentation style guide [14], and to understand the subtleties of when to apply them or conform to the style of existing code. The status quo also often results in reviewers asking that PR authors fix code style issues, which introduces additional delays into the PR review process and sometimes causes frustration [12, 17, 18].

Finally, constantly thinking about code style is a distraction from more significant aspects of code, such as correctness and readability. Automatic formatting allows everyone working on a project to almost never think about how to style their code.

Other Notable Projects Using Automatic Formatting

In addition to the evolution of tooling and benefits outlined above, there are also a wide variety of large open source projects that have adopted some form of automation for their formatting. Some prominent projects are:

The Rust programming language uses rustfmt to keep their standard library and compiler written in Rust formatted [4].
The Linux kernel uses clang-format to keep their C/C++ code automatically formatted [3].
NodeJS uses clang-format to format their C/C++ code. https://github.com/nodejs/node/blob/master/.clang-format
The LLVM compiler uses clang-format. https://github.com/llvm/llvm-project/blob/master/.clang-format
The Go programming language uses gofmt. https://github.com/golang/go/wiki/CodeReviewComments#gofmt
Django formats its code with Black. https://github.com/django/deps/blob/master/accepted/0008-black.rst

Roadmap

For each language (C, Python and ReST), choose an automated code style checker and formatter. Configure and adapt the chosen tools to our needs, such as the desired styles and support for re-formatting only new and changed lines.
Require all new/changed code from a certain point in time to be checked and formatted using these tools. From that point in time, enforce this with CI checks.
Announce well in advance (just after this PEP is accepted) when this will take place.
Just before “flipping the switch” making this a requirement, reformat our entire codebase with the chosen auto-formatters. This will be done exactly once, in a single commit, on each of the active branches (e.g. master, 3.9 and 3.8).
From this point on, auto-formatting will be applied only to new and changed lines of code, as reported by git. We’ll need to ensure the chosen formatters all support this.
Supply tools to make local application of formatting simple and painless. Document how to use these tools in common workflows and environments.
Supply tools and instructions to simplify merging patches and PRs from before the reformatting. Likewise for updating down-stream patches.
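
To illustrate the "only new and changed lines, as reported by git" step, here is a minimal sketch of how a wrapper script might extract changed line ranges from a unified diff. It assumes the diff for a single file was produced with zero lines of context (for example via `git diff -U0`), so each hunk covers only changed lines. The function name `changed_line_ranges` is hypothetical and not part of any chosen formatter.

```python
import re

# Matches a unified-diff hunk header such as "@@ -10,2 +12,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_line_ranges(diff_text):
    """Return (first, last) line ranges that are new or changed in the new file.

    Assumes a single-file unified diff with zero context lines.
    """
    ranges = []
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match is None:
            continue
        start = int(match.group(1))
        # A missing count in the header means the hunk covers exactly one line.
        count = int(match.group(2)) if match.group(2) is not None else 1
        if count > 0:  # a count of 0 means the hunk only deletes lines
            ranges.append((start, start + count - 1))
    return ranges


sample_diff = """\
--- a/example.py
+++ b/example.py
@@ -3 +3,2 @@
-old
+new
+newer
@@ -10,2 +11,0 @@
-gone
-gone too
@@ -20 +19 @@
-x
+y
"""

print(changed_line_ranges(sample_diff))  # [(3, 4), (19, 19)]
```

The resulting ranges could then be handed to whichever formatter is chosen, provided it supports formatting a line range.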

Potential Problems and their Solutions

This section outlines some pitfalls that can arise from the usage of auto-formatting tools, and the solutions we propose to avoid or overcome these problems.

Language Syntax Bootstrapping

Problem: When the Python language acquires new syntax, the formatting tool will need to be updated to be able to format Python code using this new syntax. Code using this syntax, such as for tests, will need to be added to the codebase before the formatting tool can be updated.

Solution: Code files, blocks and/or lines will be able to be marked for exclusion from auto-formatting. For new syntax, these exclusions will be marked as temporary with specific comments to make finding them and removing them easy when support for the new syntax is added to the formatter.
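
As a rough sketch of how temporary exclusions could be found again later, the snippet below scans source text for a marker comment. The marker text `# autoformat: off (temporary: ...)` is purely an assumption made for illustration; the draft does not specify what the actual comment would look like, and the chosen formatter may use its own syntax.

```python
import re

# Hypothetical marker comment; the real marker text is not defined by the draft.
TEMPORARY_EXCLUSION = re.compile(
    r"#\s*autoformat:\s*off\s*\(temporary:\s*(?P<reason>[^)]*)\)"
)


def find_temporary_exclusions(source):
    """Return (line_number, reason) for each temporary exclusion marker."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TEMPORARY_EXCLUSION.search(line)
        if match:
            found.append((lineno, match.group("reason").strip()))
    return found


sample_source = (
    "x = 1\n"
    "y = 2  # autoformat: off (temporary: new-syntax)\n"
    "z = 3\n"
)
print(find_temporary_exclusions(sample_source))  # [(2, 'new-syntax')]
```

A script like this could be run once the formatter gains support for the new syntax, to list every exclusion that should be removed.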

(Rejected Solution: Keep the formatter’s source code in the codebase, and be forced to always update it alongside any language syntax change.)

Existing Patches and PRs

Problem: We have a large set of patches on bugs.python.org and pending PRs on GitHub made against the old, unformatted code. These pull requests will have merge conflicts once the code has been reformatted.

Solution: These can be fixed (semi-)automatically by: (1) merging the patch with the commit just before the codebase-wide-reformatting; (2) applying the new formatting; (3) merging with the head of the relevant branch. We will supply scripts for this purpose.
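
The three steps above might be scripted roughly as follows. This is only a sketch: it builds the sequence of commands rather than running them, the function name `reformat_merge_plan` and the commit message are made up for illustration, and the formatter command is a parameter because no formatter has been chosen yet. Any conflicts left after the final merge would still need manual resolution, which is why the process is described as (semi-)automatic.

```python
import shlex


def reformat_merge_plan(pr_branch, pre_format_commit, target_branch, format_command):
    """Build the command sequence for updating a pre-reformat PR branch.

    format_command is the (yet to be chosen) formatter invocation,
    given as a list of arguments.
    """
    return [
        ["git", "checkout", pr_branch],
        # (1) merge with the commit just before the codebase-wide reformatting
        ["git", "merge", pre_format_commit],
        # (2) apply the new formatting and record the result
        list(format_command),
        ["git", "commit", "--all", "--message", "Apply automatic formatting"],
        # (3) merge with the head of the relevant branch
        ["git", "merge", target_branch],
    ]


plan = reformat_merge_plan(
    pr_branch="my-feature",                        # hypothetical branch name
    pre_format_commit="PRE_FORMAT_COMMIT",         # placeholder, not a real hash
    target_branch="master",
    format_command=["FORMATTER", "path/to/file"],  # placeholder command
)
for command in plan:
    print(shlex.join(command))
```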

Downstream Maintainers

Problem: Aside from our own pending patches and pull requests, many downstream maintainers of Python such as Linux packaging folks have their own set of patches they apply against CPython [15][19].

Solution: These can be updated using the same process and tools as for existing patches and PRs (see above). The tools can be distributed in the CPython repo for downstream maintainers to use.

Backporting

Problem: Once this PEP is applied, it will make it harder to apply backport patches.

Solution: Apply the new formatting to all active branches simultaneously. Thus, manual fixing should only be needed when applied to branches accepting only security fixes.

Auto Generated Files and Vendored Files

Problem: Automatically generated code files, such as those generated by argument clinic and codec generators, should not be checked or automatically formatted. Vendored files, such as those for libmpdec, libffi_osx, _sha3, and _blake2, should be ignored as well.

Solution: A list of excluded paths. A configuration file will be made that manages the exclusion list for the formatters.
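
A minimal sketch of how such an exclusion list might be consulted, using glob-style patterns from the standard library's `fnmatch` module. The patterns below are illustrative guesses at what the list could contain, not a proposed configuration, and the name `is_excluded` is hypothetical.

```python
from fnmatch import fnmatchcase

# Illustrative patterns only; the real list would live in a configuration file.
EXCLUDED_PATTERNS = [
    "Modules/_decimal/libmpdec/*",  # vendored code
    "*/clinic/*.h",                 # generated Argument Clinic output
]


def is_excluded(path, patterns=EXCLUDED_PATTERNS):
    """Return True if a repository-relative POSIX path matches any pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


for candidate in [
    "Modules/_decimal/libmpdec/basearith.c",
    "Modules/clinic/posixmodule.c.h",
    "Lib/os.py",
]:
    print(candidate, is_excluded(candidate))
```

Note that `*` in `fnmatch` patterns also matches path separators, so a single pattern can cover nested directories.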

Negatively affecting ‘git blame’ and similar features

Problem: Codebase-wide changes touching on many lines of code get in the way of inspecting the history of specific lines or blocks of code, such as when using git blame.

Solution: git has gained the ability to ignore certain commits when performing such operations. This is done through the use of a .git-blame-ignore-revs file that many other large open source projects have adopted [9, 10, 11]. Initially, this solution will not resolve this problem when using the “blame” feature on GitHub; for that, we will need to bring this up with GitHub and hope that they provide a solution in the future [16].
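
For illustration, a small sketch of how tooling might validate a `.git-blame-ignore-revs` file and build the corresponding `git blame` invocation. It assumes the file lists one full 40-character commit hash per line, with `#` comments and blank lines ignored; the helper names are hypothetical and the hash in the sample is a placeholder.

```python
import re

FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def parse_ignore_revs(text):
    """Return the commit hashes listed in an ignore-revs file."""
    revs = []
    for raw_line in text.splitlines():
        # Drop comment text and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if not FULL_SHA1.fullmatch(line):
            raise ValueError(f"not a full commit hash: {line!r}")
        revs.append(line)
    return revs


def blame_command(path, ignore_file=".git-blame-ignore-revs"):
    """Build a `git blame` command that skips the listed commits."""
    return ["git", "blame", "--ignore-revs-file", ignore_file, path]


sample_file = (
    "# Codebase-wide reformatting (placeholder hash)\n"
    + "a" * 40 + "\n"
    "\n"
)
print(parse_ignore_revs(sample_file))
print(blame_command("Lib/os.py"))
```

Alternatively, each developer can set the `blame.ignoreRevsFile` git config option once, so that a plain `git blame` picks up the file automatically.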

No reStructuredText Auto-Formatter

Problem: reStructuredText is not a widely used document format, and to our knowledge there are no existing auto-formatters for it.

Solution: Write a ReST formatter! Ammar Askar, an author of this PEP, has declared his willingness to do so.

Requiring More Developer Tooling

Problem: Requiring auto-formatting for three different languages will require most contributors to set up three more tools in their local environment and update them occasionally. Worse, for work on maintenance branches, different versions of the Python formatter (and possibly the ReST formatter) may be required.

Solution: A GitHub action which not only checks formatting, but also makes formatting fix suggestions directly on the PR, which could be easily applied. Also, make local installation and updating simple, with clear instructions for different platforms. Finally, with automerging and backporting mostly done by Miss Islington, updating her to apply formatting automatically with the correct versions of tools should mostly eliminate the need to keep different versions installed locally for non-core devs.

References

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

uranusjr · October 29, 2020, 6:07pm

Rust asked for contribution experience feedback a while ago, and the rustfmt enforcement was explicitly brought up as an obstacle to contributing. I don’t really have an opinion toward this topic, but we may want to take notice this is far from a universally accepted practice, even in communities that are already doing it.

terrdavis · October 29, 2020, 6:15pm

I’d like to add IPython to the list of projects using the blame-ignore feature.

This PR also includes a script to configure git’s blame.ignoreRevsFile locally, since the feature isn’t active by default nor is there a default name for the file with the commit list. This only needs to be run while setting up a development environment.

Also worth noting are the related git config options blame.markIgnoredLines, and blame.markUnblamables.

ammaraskar · October 29, 2020, 6:31pm

Thanks for linking this. Reading through the thread it seems to be split into two concerns:

The workflow process around making sure your commits are formatted is a pain.

There is a lot of Rust-specific stuff here, such as the fact that their CI doesn’t fail fast on formatting failures and they need a nightly (latest/unstable in Rust lingo) version of rustfmt. The solutions proposed at the end of the thread like:

Another good suggestion I’ve seen in this thread was to apply rustfmt by bors on merge. This way master would always be rustfmt-formatted, but individual PRs wouldn’t have to worry about it.

are pretty much exactly what’s proposed in the PEP in the form of a bot adding a commit to fix formatting or miss-islington performing the formatting on automerge. These are fairly easy to address with the right tooling.

rustfmt formats way too much and rejects what some authors consider good code.

This is much more of a philosophical problem. The Rust thread devolved into bikeshedding about what a formatter should or shouldn’t do because at the end of the day no one will be truly satisfied with the final output.

Also, the Rust thread links to a discussion on the creation of gofmt, which is a pretty good read: https://groups.google.com/g/golang-nuts/c/HC2sDhrZW5Y/m/7iuKxdbLExkJ?pli=1

stoneleaf · October 29, 2020, 7:28pm

Readability and style go hand-in-hand. I’ve run my code through both yapf and black and was unhappy with the results from each.

For me to support auto-formatting we would need to use a formatter that allows customization so I could (re)format the code I’m looking at and working with on my machine to a style that fits my brain.

pf_moore · October 29, 2020, 7:40pm

One thing I have found is an obstacle on other projects that require autoformatters is the process of setting up the necessary tools. Projects tend to write instructions that assume you’re OK with a particular workflow/process (often "just install black using pip install --user black" or “add black to the virtualenv you set up for working on this project”, neither of which work for me). Sure, I can work out my own installation process that suits my workflow, but that’s what I mean by it being an obstacle - particularly on projects where I contribute infrequently, so I discard or lose the setup between contributions.

With Rust and Go, the formatters are (effectively) shipped with the language, so they have a much lower bar to clear here.

(Some of this basically comes down to the fact that I feel we need a better way of deploying standalone Python applications, but that’s a whole different topic).

brettcannon · October 29, 2020, 7:47pm

I’m not sure if this requires that we need to use a formatter that allows customization, more that you find a formatter that you are happy with. Since it’s just for you personally to make things format a way that you find pleasing to your eyes then we don’t need to be involved.

But I think a point of this is going to be finding a formatter that we can make format relatively closely to how we write code in the stdlib already, so if that’s the general format you like then that would be what code should mostly look like.

BTW there’s also autopep8 if you’re looking for formatters used in the community (it’s actually the default in VS Code due to history).

Yep, which is why this is a question of how much of a barrier this is for people wanting to contribute compared against how much time and effort it will save people doing code reviews. It’s a balance, but it’s something we have to view from a group perspective in both directions; otherwise it would be easy to view this through the lens of personal impact rather than team impact, since we all know how to contribute already and so don’t really need this to make a PR.

stoneleaf · October 29, 2020, 7:55pm

Heh, good point. It would be nice, though, to have a single formatter.

brandtbucher · October 29, 2020, 7:55pm

One thing I have found is an obstacle on other projects that require autoformatters is the process of setting up the necessary tools.

Well, between automated bot commits/suggestions and the possibility of using make patchcheck (which we already suggest that contributors run) to script the whole workflow, I doubt that this will be much of an issue in practice.

taleinat · October 29, 2020, 7:56pm

Could you elaborate on what you were unhappy with in those tools’ outputs?

Did you try (the poorly named) autopep8?

I can see the point of arguments in favor of tools like gofmt which do allow some choice, e.g. allowing splitting statements that could fit on a single line. This, compared to tools like rustfmt which leave absolutely no choice. (See discussion about this on Rust’s forums, linked above.) I’ve had some frustration of this sort lately using prettier on JavaScript and TypeScript projects, and I did feel that it got in the way of readability in some cases, but in a rather minor way.

pf_moore · October 29, 2020, 7:58pm

This I very strongly agree with. Formatters are useful for bypassing pointless or mechanical formatting issues like “how should I line up this dict declaration” or tidying up spacing typos. In particular, if we mandate a code layout that I don’t normally use, an autoformatter fixes my inevitable reversions to my “preferred” style. And a mandated formatter avoids long debates over questions like this which are little more than individual opinion.

But there’s a lot of style/format choices that aren’t that simple. Expressing a complex condition, or a big comprehension, for example, is a matter of clearly conveying intent, and that’s very much about choosing a format that emphasises the key factors in the code, which involves human judgement. No automatic tool can do that. The result is an increase in consistency of style, but overall a slight loss in readability and maintainability.

Yes, tools typically have a flag that says “don’t modify this block”. But projects that adopt autoformatters in my experience buy into the ideology, and push back hard on contributions that override the formatter, without looking at readability. So contributors give up and submit suboptimally-formatted code rather than get into a fight.

taleinat · October 29, 2020, 8:08pm

The major thing we’re thinking of is to have our PR check not only check formatting, but also suggest formatting fixes in a way that makes applying them trivial. It was also suggested that it could directly push commits into the PR with formatting fixes. Both options would remove the requirement for “drive-by” contributors to install these tools locally.

Additionally, we could consider using something like pre-commit, which takes care of all of the details of installing the different tools on different platforms in venvs etc., making it all “just work”. (It does require installing pre-commit itself, running the setup command once, and likely doing updates once in a while. We could have our Makefile handle these automatically too, except on Windows, for which we could write a batch script.)

That is also an option that had been brought up. For the C formatter, which we don’t expect to change often, we may bundle a binary for Windows users. And we may write the ReST formatter ourselves, in which case its source could very well be in the repo. But we’ve initially been wary of suggesting adding a Python auto-formatter to the codebase due to the additional maintenance burden that would entail.

stoneleaf · October 29, 2020, 8:30pm

yapf didn’t seem to be deterministic – I would make changes to its configuration file and nothing would change in the output file.

black’s hanging indents in wrapped function and class headers were disturbing; other hanging indents didn’t flow well for me.

Not yet, I’ll give it a shot later.

One thing I find irritating is alignment of rows/columns: there are times when, for example, having the values of a dict-literal in a nice column is very helpful; likewise, when I’m passing several arguments to a function, often I don’t need to have one argument per line as related groups can go together on one line and reduce the clutter:

result = self.calculate(
        cr, uid,             # in this case, the "relation" is "boiler-plate"
        density,
        pressure,
        )

taleinat · October 29, 2020, 8:35pm

Thanks, those are helpful, now I understand what kinds of issues you had. We’ll take these into account leading up to choosing which formatters to use and with what configurations.

At this point I’d like to avoid getting into further details, but it’s good to know that you would be okay with this in general assuming that such concerns are addressed.

ammaraskar · October 29, 2020, 9:38pm

Out of curiosity, what is your current workflow for using the blurb tool? Do you tend to use the web version, blurb-it?

pf_moore · October 29, 2020, 10:44pm

As I said, I don’t create PRs frequently. I’d create a temporary virtualenv as needed and install it (I don’t have it installed at the moment).

EpicWink · October 29, 2020, 11:11pm

darker runs black on only updated lines

adamchainz · October 31, 2020, 10:10am

FYI Django isn’t yet formatted with Black - we’re waiting for it to leave beta.

mburszley · October 31, 2020, 11:20pm

Looks like they’re finally looking to go that direction

sfdye · November 1, 2020, 2:46pm

As the person who proposed reference 16 (“Support ignore-revs-file in Github’s blame view”), I would really like to see GitHub implement this feature, as more and more big open source projects are undergoing formatting changes like this.
I have worked (just this year actually) on proposing the same automatic code formatting ADR within my organization for all our Python projects, getting it approved, and implementing it across many of our production-running projects already. Let me know if my experience would help with the Python project.