Proposal: Documenting CPython Reference Counting Semantics via Automated Analysis

Hello everyone,

I’ve been working on a third-party project to systematically document CPython’s reference counting semantics—including internal APIs—through automated analysis. So far, I’ve collected 1,534 entries covering a variety of functions. The analysis is largely automated, with an estimated accuracy of around 90%, though full manual verification is still ongoing.

Your feedback matters: whether it’s a 👍, a ❤️, a comment, or a suggestion, any form of engagement from the community would mean a lot to me and help improve this work.

Current Design & Format
The data is structured in JSON to facilitate processing, integration, and further tooling. Each entry includes the function name and its reference semantics (e.g., “return new reference,” “stealing reference,” “return borrowed reference,” etc.).

Example structure:

[
    {
        "function": "mocked_funcA",
        "semantics": [
            { "semantic": "return new reference" },
            { "semantic": "stealing reference", "stealing param": 0 },
            { "semantic": "stealing reference", "stealing param": 1 },
            { "semantic": "stealing reference", "stealing param": 2 }
        ]
    },
    {
        "function": "mocked_funcB",
        "semantics": [
            { "semantic": "return borrowed reference" }
        ]
    },
    {
        "function": "mocked_funcC",
        "semantics": [
            { "semantic": "return immortal reference" }
        ]
    }
]
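As a rough illustration of how a tool might consume this structure, here is a minimal Python sketch that loads entries and looks up which parameters a function steals. The helper names (`function_name`, `stolen_params`) are hypothetical and not part of the project. The sketch accepts either `"function"` or `"name"` as the key, since both spellings appear in this post.

```python
import json

# Two entries copied from the example structure above.
SAMPLE = """
[
    {
        "function": "mocked_funcA",
        "semantics": [
            { "semantic": "return new reference" },
            { "semantic": "stealing reference", "stealing param": 0 },
            { "semantic": "stealing reference", "stealing param": 1 },
            { "semantic": "stealing reference", "stealing param": 2 }
        ]
    },
    {
        "function": "mocked_funcB",
        "semantics": [
            { "semantic": "return borrowed reference" }
        ]
    }
]
"""

def function_name(entry):
    """Return the entry's function name (hypothetical helper).

    Assumption: an entry uses either "function" or "name" as its key.
    """
    return entry.get("function") or entry.get("name")

def stolen_params(entry):
    """Return the sorted indices of the parameters the function steals."""
    return sorted(
        s["stealing param"]
        for s in entry["semantics"]
        if s["semantic"] == "stealing reference"
    )

by_name = {function_name(e): e for e in json.loads(SAMPLE)}
print(stolen_params(by_name["mocked_funcA"]))  # [0, 1, 2]
print(stolen_params(by_name["mocked_funcB"]))  # []
```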

Sample from the current dataset:

    {
        "name": "_PyDict_GetItemRef_KnownHash_LockHeld",
        "semantics": [
            {
                "semantic": "return a new reference via an output pointer parameter",
                "new ptr param": 3
            }
        ]
    },

Purpose & Hope for Collaboration
This dataset aims to serve as a machine-readable reference for developers working with CPython’s C API, aiding in debugging, static analysis, and tooling development.

I would love for the community to:

  1. Review and discuss the approach and structure.
  2. Help validate entries, especially for edge cases or internal APIs.
  3. Consider whether something like this could be useful as a supplemental resource or possibly integrated into CPython’s documentation ecosystem in the future.

The full JSON file is available here:
CPython_PyAPI_FUNC_RF_Semantics

Looking forward to your thoughts, feedback, and hopefully a lively discussion!


Currently documented reference counting semantics:

  • No Reference Semantic – The function does not involve reference counting operations.
  • Return new reference – The function returns a new reference (caller is responsible for decref).
  • Return borrowed reference – The function returns a borrowed reference (caller should not decref).
  • Return immortal reference – The function returns an immortal reference (never needs decref).
  • Stealing reference – The function steals a reference from one or more parameters.
  • Return a new reference via an output pointer parameter – The function provides a new reference through an output pointer parameter.
  • Return a borrowed reference via an output pointer parameter – The function provides a borrowed reference through an output pointer parameter.
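To make the practical difference between these categories concrete, the sketch below models them as a Python enum and answers one question: does the caller end up owning a reference it must release? The enum and function names are hypothetical, and the lowercase label for the "no reference" case is an assumption, since only the capitalized form appears in the list above.

```python
from enum import Enum

class RefSemantic(Enum):
    """Labels from the list above, lowercased as in the JSON examples."""
    NONE = "no reference semantic"  # assumed spelling in the JSON data
    RETURN_NEW = "return new reference"
    RETURN_BORROWED = "return borrowed reference"
    RETURN_IMMORTAL = "return immortal reference"
    STEAL = "stealing reference"
    OUT_NEW = "return a new reference via an output pointer parameter"
    OUT_BORROWED = "return a borrowed reference via an output pointer parameter"

# Semantics after which the caller owns a reference and must release it.
CALLER_OWNS = {RefSemantic.RETURN_NEW, RefSemantic.OUT_NEW}

def caller_must_decref(label: str) -> bool:
    """True if the caller ends up owning a reference it must release."""
    return RefSemantic(label.strip().lower()) in CALLER_OWNS

print(caller_must_decref("return new reference"))       # True
print(caller_must_decref("Return borrowed reference"))  # False
```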

This might be interesting in PyO3 / rust-for-cpython where we could automatically generate thin wrappers around C API functions which use richer types than raw pointers to encode the refcounting semantics. cc @emmatyping


Thank you for working on this!

This format is rather verbose at first sight.
I’d recommend using short identifiers (like new_ref, borrow, etc.) – you already have a bullet list of longer explanations.
The format also can’t express the weirdness of PyModule_AddObject, which only steals on success. But maybe such outliers should be special cases (it’s less work to remove calls to PyModule_AddObject than to teach tooling to handle it).
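A minimal sketch of what the short identifiers could look like when applied to the existing flat format. Only `new_ref` and `borrow` come from the suggestion above; the other short names, the `SPECIAL_CASES` table, and the `note` key are hypothetical placeholders for illustration.

```python
# Long label -> short identifier. "new_ref" and "borrow" are from the
# suggestion above; the other two names are hypothetical.
SHORT_IDS = {
    "return new reference": "new_ref",
    "return borrowed reference": "borrow",
    "return immortal reference": "immortal",
    "stealing reference": "steal",
}

# Outliers the format cannot express, kept as free-text notes (hypothetical).
SPECIAL_CASES = {
    "PyModule_AddObject": "steals the value reference only on success",
}

def shorten(entry):
    """Return a copy of a flat-format entry that uses short identifiers."""
    semantics = []
    for item in entry["semantics"]:
        item = dict(item)  # copy so the input entry is not modified
        item["semantic"] = SHORT_IDS.get(item["semantic"], item["semantic"])
        semantics.append(item)
    result = {**entry, "semantics": semantics}
    note = SPECIAL_CASES.get(entry.get("function"))
    if note:
        result["note"] = note
    return result

entry = {
    "function": "mocked_funcB",
    "semantics": [{"semantic": "return borrowed reference"}],
}
print(shorten(entry))
```

Labels without a short name pass through unchanged, so the mapping can be extended gradually.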

Consider structuring this to separately describe the return value and each argument.
PyErr_Fetch “returns” three new references; PyDict_Next two borrowed ones. (Those functions are problematic, but a hypothetical “good” C API for dict.popitem would return two new refs. We might add something like that in the future.)

Also consider being able to encode “unknown” states. For example, in your example, mocked_funcA has { "semantic": "stealing reference", "stealing param": N } for N=0,1,2, but for argument 3, it’s not possible to tell if it’s not stolen or if that information is unknown.
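One way to picture both suggestions together (a separate description for the return value and each argument, plus an explicit "unknown" state) is a small converter from the flat format. This is only a sketch: `to_per_argument` and the `"unknown"` marker are hypothetical, and the argument count has to be supplied by the caller because the flat format does not record it.

```python
def to_per_argument(entry, arg_count):
    """Convert a flat entry into a return/arguments layout.

    Anything the flat entry does not mention stays "unknown" instead of
    being silently treated as "not stolen".
    """
    ret = "unknown"
    args = [{"index": i, "semantics": "unknown"} for i in range(arg_count)]
    for s in entry["semantics"]:
        label = s["semantic"]
        if label == "stealing reference":
            args[s["stealing param"]]["semantics"] = "steal"
        elif "new ptr param" in s:
            # New reference delivered through an output pointer parameter.
            args[s["new ptr param"]].update(output=True, semantics="new_ref")
        elif label.startswith("return "):
            ret = label
    return {"return": {"semantics": ret}, "arguments": args}

func_a = {
    "function": "mocked_funcA",
    "semantics": [
        {"semantic": "return new reference"},
        {"semantic": "stealing reference", "stealing param": 0},
        {"semantic": "stealing reference", "stealing param": 1},
        {"semantic": "stealing reference", "stealing param": 2},
    ],
}

converted = to_per_argument(func_a, arg_count=4)
print([a["semantics"] for a in converted["arguments"]])
# ['steal', 'steal', 'steal', 'unknown']
```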

Consider structuring the format to allow these extensions in the future:

  • argument names (for error messages or code generation)
  • can it be NULL? When?
  • does the return value use one of the 2 standard ways to signal errors (NULL+exception set or non-NULL; -1 with exception set or >=0)?
  • what is a reference borrowed from? (for example, PyTuple_GetItem borrows from its argument)
  • is an argument an array?
  • is an argument a size of an array argument? (e.g. PyTuple_FromArray)

Maybe something like:

"_PyDict_GetItemRef_KnownHash_LockHeld": {
    "return": {"semantics": "tristate"},
    "arguments": [
        {"name": "op", "semantics": "borrow"},
        {"name": "key", "semantics": "borrow"},
        {"name": "hash"},
        {"name": "result", "output": true, "semantics": "new_ref", "null": "on_nonpositive_result"}
    ]
},
"PyTuple_FromArray": {
    "return": {"semantics": "new_ref", "null": "error"},
    "arguments": [
        {"name": "array", "semantics": "borrow", "array": true},
        {"name": "size", "size_for": "array"}
    ]
}
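A layout like this lends itself to simple consistency checks. The sketch below is a hypothetical validator, not existing tooling: it wraps the PyTuple_FromArray entry in an outer object so it parses as JSON, then checks that every `size_for` names an argument that exists and is marked as an array.

```python
import json

# The PyTuple_FromArray entry from above, wrapped in an outer object.
PROPOSED = """
{
    "PyTuple_FromArray": {
        "return": {"semantics": "new_ref", "null": "error"},
        "arguments": [
            {"name": "array", "semantics": "borrow", "array": true},
            {"name": "size", "size_for": "array"}
        ]
    }
}
"""

def check_entry(name, entry):
    """Return a list of problems found in one per-argument entry."""
    problems = []
    arguments = entry.get("arguments", [])
    args = {a["name"]: a for a in arguments}
    if len(args) != len(arguments):
        problems.append(f"{name}: duplicate argument names")
    for arg in args.values():
        target = arg.get("size_for")
        if target is None:
            continue
        if target not in args:
            problems.append(f"{name}: size_for names unknown argument {target!r}")
        elif not args[target].get("array"):
            problems.append(f"{name}: {target!r} is not marked as an array")
    return problems

data = json.loads(PROPOSED)
for name, entry in data.items():
    print(name, check_entry(name, entry) or "ok")
```

Similar checks could cover the other proposed fields, for example that `"output": true` only appears on arguments that also carry a semantics value.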

I imagine any extra data would be useful for public API. This could replace things like the current Doc/data/refcounts.dat file, for example.
To be useful, the file needs to be used/checked by some (public) tooling.
For internal APIs, it might be better to keep the metadata auto-generated.


Thank you for the suggestions. Your points are especially enlightening. The format you propose is not only more concise, but also allows for the inclusion of more valuable information (semantics).