Proposal: Retaining Inline Comments (#) in AST via Module.comments

Abstract:
Currently, the standard ast module discards all comments (#) during parsing. The only exception is `# type: ignore` comments, which are recorded in Module.type_ignores when parsing with type_comments=True. This is a major limitation for code formatters, linters, and refactoring tools, which are forced to build heavy custom CST (Concrete Syntax Tree) parsers. We propose adding a comments attribute to ast.Module to store comments as lightweight metadata objects.

Specification:
We propose a new AST node specifically for inline comments. Since multiline docstrings (""") are already handled perfectly as ast.Constant string literals, this new node targets only single-line comments starting with #.

To maximize memory efficiency and eliminate redundant allocation, the node omits end_lineno (since a single-line comment inherently terminates at the newline) and focuses purely on inline horizontal boundaries.

In C-like pseudocode, the node would look like this:

Comment(
    const char *comment,  // The literal string content of the comment
    int lineno,           // Line number where the comment resides
    int col_offset,       // Starting column byte offset
    int end_col_offset    // Ending column byte offset
)

Rationale:

  1. Memory Efficiency: Unlike other full-scope AST nodes, an inline comment always spans exactly one line. Storing end_lineno is completely redundant. Omitting it prevents unnecessary memory bloat when parsing large codebases with millions of comments.
  2. True Literal Nature: Comments are essentially non-mutating string constants. Storing the text as a constant pointer (const char *) fits seamlessly into CPython’s literal management.
  3. Exact Tooling Boundaries: By keeping col_offset and end_col_offset, modern IDEs and formatters can easily calculate the exact visual range of the comment for syntax highlighting, auto-formatting, and precise user selection.
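
To make the offsets concrete, here is a rough Python-side sketch of the node. It is only an illustration: the `Comment` dataclass and its `slice_from` helper are hypothetical names, and it assumes the stored text includes the leading `#` and that the column offsets are UTF-8 byte offsets, as they are for existing ast nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    """Hypothetical Python-side view of the proposed node (not a real ast class)."""
    comment: str          # literal text of the comment (assumed to include '#')
    lineno: int           # 1-based line number
    col_offset: int       # start, as a UTF-8 byte offset into the line
    end_col_offset: int   # end (exclusive), as a UTF-8 byte offset into the line

    def slice_from(self, line: str) -> str:
        """Recover the comment text from its source line using the byte offsets."""
        raw = line.encode("utf-8")
        return raw[self.col_offset:self.end_col_offset].decode("utf-8")


line = "π = 3  # pi"
# 'π' occupies two bytes in UTF-8, so '#' sits at byte 8 although it is character 7.
node = Comment(comment="# pi", lineno=1, col_offset=8, end_col_offset=12)
assert node.slice_from(line) == "# pi"
```

A tool that works with character columns would need to convert these byte offsets itself, the same way it already has to for other ast nodes.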

I think the few bytes you save per int (4, or 8 if it’s 64-bit) do not really help when the comments themselves are most likely tens of bytes long. Sure, the comment can omit the #, but even the whitespace following it would be data that could be of value.

I think using a flag when parsing, and exposing that flag through the ast module, would make more sense. Emitting the node by default could also break some AST visitors / transformers, as the generic visit might break on comments (which it shouldn’t, but you can never be too sure). The flag should be opt-in, perhaps some PyCF_ALL_COMMENTS from the _ast module, which could be used with ast.parse(..., comments=True).
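
Roughly what I have in mind, emulated in pure Python today. This is only a sketch: `parse_source` is a made-up wrapper, not an existing or proposed function, and it stores the comments as a plain attribute rather than a real Module field, which is an assumption on my part.

```python
import ast
import io
import tokenize


def parse_source(source, *, comments=False):
    """Hypothetical wrapper: ast.parse plus an opt-in `comments` attribute."""
    tree = ast.parse(source)
    if comments:
        readline = io.StringIO(source).readline
        # Stored as a plain attribute, not in Module._fields, so ast.walk and
        # NodeVisitor.generic_visit never encounter it.
        tree.comments = [
            (tok.start[0], tok.start[1], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT
        ]
    return tree


source = "x = 1  # hi\n"
plain = parse_source(source)                 # default: tree is unchanged
rich = parse_source(source, comments=True)   # opt-in: comments attached
assert not hasattr(plain, "comments")
assert rich.comments == [(1, 7, "# hi")]
assert len(list(ast.walk(plain))) == len(list(ast.walk(rich)))
```

Existing callers that never pass the flag see exactly the tree they get now, which is the point of making it opt-in.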


You could get the comments with tokenize.generate_tokens.
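
A minimal sketch of that approach; the helper name `extract_comments` is just for illustration. Note that tokenize reports columns as character offsets, whereas ast nodes use UTF-8 byte offsets.

```python
import io
import tokenize


def extract_comments(source):
    """Return (lineno, col, text) for each '#' comment in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.start[0], tok.start[1], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    ]


source = 'x = 1  # first\n# second\ns = "# not a comment"\n'
for lineno, col, text in extract_comments(source):
    print(lineno, col, text)
```

The `#` inside the string literal is not reported, since the tokenizer sees it as part of a STRING token.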
