Systematically finding bugs in Python C extensions (575+ confirmed so far)

Greetings all,

Python’s C extension ecosystem contains a large class of subtle correctness bugs that are rarely caught by testing or linters.

Over the past two weeks, I’ve been running a systematic analysis across 44 extensions (~960K LOC), resulting in 575+ confirmed bugs (~10-15% false positive rate after review, ~140 reproduced from Python) and fixes already merged in 14 projects including Pillow, h5py, Cython, lxml, bottleneck, greenlet, bitarray, guppy3, pycurl, igraph, enaml, APSW, regex, and simplejson. These range from hard crashes and memory corruption to correctness issues and spec violations.

The goal is to provide maintainers with high-signal, reproducible bug reports that would be extremely difficult to find manually. This work suggests that a large class of non-trivial bugs in Python C extensions can be systematically discovered and fixed with high signal.

I’d like feedback on how to make this more useful and scalable for maintainers. If you’d like your own extension analyzed, just tell me and I’ll do it.

What it does

I built a Claude Code plugin called cext-review-toolkit. The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow and validates findings with targeted reproducers. The analysis is performed by 13 specialized agents working over the C extension source in parallel, each targeting a different bug class.

The agents use Tree-sitter for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members.

Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix.
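As a minimal illustration of what such a scanner script might look like (this is not the toolkit's actual code; the real scanners use Tree-sitter rather than string matching), here is a Python sketch that flags `PyErr_Clear()` calls with no nearby `PyErr_ExceptionMatches` guard:

```python
def find_unguarded_clears(c_source: str, window: int = 3):
    """Flag PyErr_Clear() calls lacking a PyErr_ExceptionMatches guard
    within the preceding `window` lines. Returns (line_number, line) pairs."""
    lines = c_source.splitlines()
    findings = []
    for i, line in enumerate(lines):
        if "PyErr_Clear(" not in line:
            continue
        context = "\n".join(lines[max(0, i - window):i + 1])
        if "PyErr_ExceptionMatches" not in context:
            findings.append((i + 1, line.strip()))  # 1-based line numbers
    return findings

sample = """\
obj = PyObject_GetAttrString(o, "name");
if (obj == NULL) {
    PyErr_Clear();              /* unguarded: swallows MemoryError too */
}
if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
    PyErr_Clear();              /* guarded: only clears AttributeError */
}
"""
```

On the `sample` above, only the first, unguarded call is flagged; candidates like this then go to an agent for qualitative review.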

What it found

Across 43 extensions (~950K lines of C/C++ reviewed), the toolkit found ~560 FIX-level bugs. These results are not noise-free: after review, false positives are ~10-15%, and some findings cannot yet be reproduced from Python. Here are some representative findings:

In Cython’s runtime: 3 fix PRs were created by @da-woods after reviewing the report: a cyfunction ref-counting error (#7594), a reference leak in Generator_Replace_StopIteration (#7597, also backported to 3.2), and unguarded PyErr_Clear calls (#7600). Since these are in the Cython runtime, the fixes affect every Cython-generated extension.

In h5py: 16 issues filed, 9 fix PRs created by @neutrinoceros and @takluyver, all merged within 3 days. Findings included HDF5 type handle leaks, unchecked malloc calls, and resource leaks on error paths. (umbrella issue)

In bottleneck: 15 issues filed, 9 fix PRs created by @neutrinoceros, 7 merged. The move_median function had a MEMORY_ERR macro that set an exception but didn’t return, causing a segfault under OOM. The INIT/INIT_ONE macros had unchecked PyArray_EMPTY affecting ~46 functions. (umbrella issue)

In guppy3: The maintainer reviewed every finding line by line, fixed 24 of 30 issues, and found additional bugs the tool missed. Their feedback was invaluable: they identified false positives that helped improve the scanner (types inheriting slots from base types, immutable container borrowed-ref safety, defensive API patterns). They wrote: “The tool has been surprisingly decent. It made me re-read the code and find issues it didn’t find.”

In bitarray: After reviewing the report, Ilan Schnell fixed all critical findings and released bitarray 3.8.1. (issue, PR). Ilan said: “Bitarray has an extensive test suite with almost 600 unittests. Nevertheless, cext-review-toolkit discovered several important edge cases that I had overlooked, and probably never would have uncovered myself”.

In regex: mrabarnett applied fixes within hours of receiving the report, addressing a KeyboardInterrupt swallowing bug, a string_search race condition, and NULL check issues across 6 commits and 3 releases.

In Pillow: 13 fix PRs were created by @hugovk and team after receiving the report, covering refcount leaks, NULL dereference checks, error handling, dealloc fixes, and migration to PyModule_AddObjectRef. Three PRs explicitly credit the analysis.

In lxml: Stefan Behnel committed 16+ fixes directly within a day of receiving the report, released as lxml 6.0.3. The changelog notes: “Several out of memory error cases now raise MemoryError that were not handled before.”

In greenlet: @jamadden created PR #502 with 16 commits, noting: “User @devdanzin provided the results of running various coding agents over the code base. […] I independently verified each finding and either made changes or left comments as to why the behaviour was correct.” The PR fixes potential crashers, reference leaks, and free-threading issues.

In pycurl: @swt2c merged PR #961 “Fix bugs noted by C Extension Review Toolkit” — fixing missing PyErr_Occurred checks, a BytesIO leak, a use-after-free, and a deadlock.

In pyerfa: The toolkit found a copy_from_double33 function that writes all 9 elements of a 3×3 rotation matrix to the same memory address (using p instead of p1 in the inner loop). This affects 48 ufunc loops covering 72 ERFA functions. The agents confirmed the bug with a reproducer (Fortran-ordered output arrays produce wrong results), then conducted a directed investigation that concluded the bug is dormant: standard astropy usage never triggers the non-contiguous output path. Also found: a template typo (dt_eraLBODY instead of dt_eraLDBODY) causing memory leaks in 3 ufunc loops.
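In pure-Python terms (the real code is C; this is only a sketch of the bug class, not the actual pyerfa code), the p-vs-p1 mistake is a copy loop with a fixed destination index:

```python
def buggy_copy(src):
    """src: 9-element flat list (3x3 row-major).
    Bug: every write lands in slot 0, mirroring the p-vs-p1 typo."""
    dst = [0.0] * 9
    for i in range(9):
        dst[0] = src[i]   # bug: should be dst[i]
    return dst

def fixed_copy(src):
    dst = [0.0] * 9
    for i in range(9):
        dst[i] = src[i]   # each element copied to its own slot
    return dst
```

The buggy version leaves the destination holding only the last source element (in slot 0) with everything else untouched, which is exactly the kind of silent corruption that only shows up on the non-contiguous output path.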

In simplejson: 6 PRs filed (4 merged, 2 open) covering a use-after-free in encoder ident handling, reference leaks in the dict encoder, iterable_as_array swallowing MemoryError/KeyboardInterrupt, NULL dereferences on OOM, error-as-truthy bugs in maybe_quote_bigint, and member table bugs. simplejson is also where the multi-run approach was first validated: a second naive pass found 4 additional bugs missed by the first, and an informed third pass found 5 more. (umbrella issue)
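The error-as-truthy class deserves a note, since it recurs across projects. In C, comparison helpers often return -1 on error, and using the result directly in a truth test silently treats failure as success. A hypothetical Python rendition of the shape (none of these names come from simplejson):

```python
def compare(a, b):
    """Returns 1 if equal, 0 if not, -1 on error (C-style convention)."""
    if not isinstance(a, type(b)):
        return -1               # error sentinel
    return 1 if a == b else 0

def buggy_use(a, b):
    if compare(a, b):           # bug: -1 (error) is truthy, reads as "equal"
        return "equal"
    return "different"

def fixed_use(a, b):
    rc = compare(a, b)
    if rc < 0:                  # check the error sentinel explicitly
        raise TypeError("comparison failed")
    return "equal" if rc else "different"
```

The buggy version reports two incomparable values as "equal" instead of raising, which is why a systematic scan for this pattern across call sites pays off.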

I also submitted PRs directly to lz4 (6 PRs), memray (2 PRs), and scipy (2 PRs).

How it works

The typical workflow for each extension:

  1. Run all agents.

  2. Review and synthesize agent findings into a report.

  3. Try to reproduce every finding from pure Python, using techniques like OOM injection (_testcapi.set_nomemory), evil subclasses (__hash__ that raises), mischievous file-like objects, and more (documented in the reproducer techniques guide).

  4. Write a reproducer appendix with confirmed bugs and their evidence.

  5. Share the report with the maintainer and file issues/PRs.
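As one concrete illustration of step 3's "evil subclass" technique (a generic sketch, not taken from any specific report): an object whose `__hash__` misbehaves after the first call exercises error paths inside hash-based container code that normal inputs never reach.

```python
class EvilKey:
    """A key whose __hash__ succeeds once, then raises."""
    def __init__(self):
        self.calls = 0
    def __hash__(self):
        self.calls += 1
        if self.calls > 1:
            raise RuntimeError("hash failed mid-operation")
        return 42

k = EvilKey()
d = {k: "value"}        # first __hash__ call succeeds
try:
    d[k]                # second call raises inside the C-level dict lookup
except RuntimeError as e:
    print("error path exercised:", e)
```

The same idea extends to `__eq__`, `__index__`, `__len__`, and file-like objects whose `read()` returns the wrong type.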

Reports are shared as secret GitHub gists. Communication happens through whatever channel works: email, Mastodon DM, Discord, Discourse. Of 40+ maintainers contacted, about 60% responded positively. I'd love it if you could share this post so it reaches the maintainers who didn't respond, and helps find more maintainers interested in having their extensions analyzed.

How it’s different

Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). The analyses cext-review-toolkit performs target those invariants specifically. Beyond that, the tool uses LLM-guided semantic analysis to answer questions like "was that bugfix complete, and do similar bugs still lurk in the codebase?" that other tools cannot cover.

The agents cover a rich set of bug classes:

  • Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.

  • Error handling: missing NULL checks, return without exception, exception clobbering.

  • NULL safety: unchecked allocations, dereference-before-check.

  • GIL discipline: API calls without GIL, blocking with GIL held.

  • Type slots: dealloc bugs, missing traverse/clear, __new__-without-__init__ safety.

  • PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).

  • Module state: single-phase init, global PyObject* state.

  • Version compatibility: deprecated APIs, dead version guards.

  • Git history: fix completeness (same bug fixed in one place but not another).

  • Plus: stable ABI compliance, resource lifecycle, complexity analysis.

I’m still learning best practices with this tool. I recently discovered that running multiple analysis passes on the same codebase finds significantly more bugs. For example, a second naive pass on one extension found 7 additional bugs the first pass missed, and a third informed pass (where agents are told what was already found and directed to unexplored areas) found 5 more, including a systematic error-as-truthy pattern across 14+ call sites that no naive pass caught.

The pure-Python reproduction step is not inherent to the tool, but the rich findings it produces make that step much easier.

Working with maintainers

Reports like these can be time- and energy-intensive for maintainers to investigate. Historically, automated bug-finding tools have produced far more false positives than useful information, and AI can make those false positives look incredibly convincing.

These findings are rarely obvious: if they were, compilers, standard linters, or users would have caught them long ago. Because maintainers are already overworked, dumping low-quality, automated “AI slop” into their trackers only makes the problem worse.

To respect their time, I take great care to ensure these reports are of the highest possible quality. I focus on presenting pure-Python reproducers wherever I can. The multi-agent review step and the reproducer efforts act as a filter against noise. When a maintainer points out a false positive, I immediately update the agents’ prompts so that specific pattern is avoided in the future.

Beyond polishing the tools, I try to communicate in a non-invasive, helpful manner. The maintainer always holds the reins: I ask them how they prefer to receive the information (an umbrella issue? individual issues? direct PRs? or do nothing at all) and let them decide exactly what to do with the findings.

A deep dive

To give a concrete sense of what the toolkit finds, here’s a walkthrough of one bug in bottleneck, a library of fast NumPy array functions.

The move_median function allocates a helper struct for the median computation. If the allocation fails, it calls a macro to set a MemoryError:


```c
mm_handle *mm = mm_new_nan(window, min_count);
// ...
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
}

BN_BEGIN_ALLOW_THREADS  // releases the GIL
WHILE {
    // ... dereferences mm ...
}
```

The MEMORY_ERR macro is defined as:


```c
#define MEMORY_ERR(text) PyErr_SetString(PyExc_MemoryError, text)
```

It sets the Python exception but does not return. Execution falls through to BN_BEGIN_ALLOW_THREADS, which releases the GIL while a MemoryError is pending (a Python/C API violation), then enters the computation loop, which dereferences the NULL mm. The result is a segfault inside GIL-released code, with no meaningful error message.

The fix is to add return NULL; after the macro call:


```c
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
    return NULL;  // <-- missing
}
```

Four agents flagged it independently (error-path-analyzer, GIL-checker, git-history-analyzer, NULL-scanner), and I confirmed it triggers a segfault via OOM injection:


```python
import _testcapi
import numpy as np
import bottleneck as bn

_testcapi.set_nomemory(1, 0)  # force a memory allocation failure (OOM)
try:
    bn.move_median(np.random.rand(100), window=10)
except MemoryError:
    pass
# Segmentation fault: mm is NULL, dereferenced inside GIL-released code
```

(Issue #519. The related INIT/INIT_ONE macro issue, affecting ~46 functions, was fixed in PR #543.)

What didn’t work

False positives exist. The guppy3 maintainer identified ~8 false positives in a 30-finding report. These led to concrete scanner improvements (PR #26). The most common false positive categories: types inheriting slots from base types, borrowed refs from immutable containers (safe because tuples can’t be mutated), and APIs that handle NULL arguments defensively.

Frameworks that generate C/C++ code, like mypyc, pybind11, and nanobind, present unique challenges. These tools produce optimized, battle-tested code that is the foundation of many extensions, meaning any changes to their generation logic must be treated with extreme caution. A single flagged pattern might be replicated across every generated module, so the reports can be overwhelming to triage, and overwhelmed maintainers aren’t always able to respond. However, the potential payoff is huge: when a framework is able to incorporate a fix, as Cython recently did, the entire community of downstream projects immediately benefits.

Some reports are noisy. Extensions with generated code (Cython, mypyc) or very large codebases produce many low-confidence findings. I’m working on better filtering.

OOM reproduction is hit-or-miss. _testcapi.set_nomemory hooks PYMEM_DOMAIN_RAW but not PYMEM_DOMAIN_OBJ, so many OOM paths in extensions can’t be triggered from Python. About 25% of FIX findings are “code-confirmed but not reproducible from pure Python”.

Free-threading analysis

With free-threading becoming a priority for Python, I built ft-review-toolkit for analyzing free-threading readiness. It uses ThreadSanitizer integration (via labeille for building TSan-enabled Python and running tests), combined with static analysis for shared state, lock discipline, and unsafe API usage.

So far it has analyzed 11 extensions, finding real data races in multidict (233 TSan warnings, zero synchronization on the hashtable), bitarray (70 warnings, 6 SIGABRTs), zope.interface (LookupBase cache race causing SIGABRT), kiwisolver (non-atomic SharedDataPtr refcount), and others. It also produces migration plans for maintainers who want to adopt free-threading.

I plan to write a post about this work when more extensions are reviewed for their free-threading status. As always, please let me know if you’d like your extension analyzed.

The numbers

| Metric | Count |
|---|---|
| Extensions analyzed (correctness) | 44 |
| Extensions analyzed (free-threading) | 12 |
| FIX-level bugs found | ~575 |
| Bugs reproduced from Python | 155+ |
| GitHub issues filed | 90 |
| PRs (mine + maintainers’) | 62 |
| Fixes merged | 49 PRs + ~22 direct commits |
| Extensions with landed fixes | 14 |
| Maintainers contacted | 40+ |
| Positive responses | ~60% |

Questions for the community

I’d appreciate feedback on any of these:

  • Is this useful? Do maintainers find these reports helpful, or are they noise? The best feedback I’ve gotten was from maintainers who engaged deeply (guppy3, h5py, bottleneck, Pillow, greenlet, APSW), but I don’t hear from the ones who didn’t respond. Given that:

    • Would you want this run on your project?

    • What would make you trust (or ignore) a report like this?

    • At what false positive rate would this stop being worth your time?

  • Report format: I currently share reports as gists and file grouped issues (one issue per bug family, not one per line change), or umbrella issues. Is there a better format?

  • False positives: The false positive rate after agent review is roughly 10-15%. Is that acceptable, or should I only report findings that can definitely be reproduced?

  • AI disclosure: Every PR I submit includes a note that it was “authored and submitted by Claude Code (Anthropic), reviewed by a human before submission.” Is this the right level of transparency?

  • Prioritization: I’ve been targeting popular extensions with hand-written (non-generated) C code, and extensions maintained by people I know would be receptive. Should I focus differently?

  • Free-threading: Several maintainers (guppy3, APSW) have asked for help migrating to free-threading. Is there interest in a more structured program for this?

All the tools are open source:

I’m happy to run the toolkit on any extension if a maintainer is interested, just let me know.

Thanks to all the maintainers and contributors who have been helping in this effort, with special mention to Clément Robert for welcoming it in multiple projects, APSW maintainer Roger Binns for insightful suggestions that led to significant new capabilities, and YiFei Zhu who not only gave precious feedback on false positives but also spurred the development of ft-review-toolkit.

Thank you for reading this far! As a reward, if you are interested in exploring this work, here are two more tools and one umbrella issue in CPython itself:

Daniel


Can you run the numbers and share how many of these bugs using Rust instead of C would have prevented? :wink:

Anyway, this is impressive stuff - great work Daniel. You’re using AI to help open source in the right way. By keeping the human in the loop, putting work in, communicating with maintainers, and being available.

For me personally though, even though you have clearly found lots of real bugs (not just potential bugs, code quality suggestions, and false positives), I don’t want to have to read a huge machine-generated report and work out what’s what.

Perhaps those lists make more sense to other maintainers with intimate knowledge of their code bases. But personally I think it would be even better if a test was included for each bug in the report, especially if you run that test in GitHub Actions on your fork and link to the result. If you went one step further, including steps to reproduce the bug and actually carrying out those steps yourself to prove it is a real bug, either in CI where I can see it and play with it too, or sometimes just locally in a venv or Docker container, then you’d be producing the perfect bug reports for me, and making super useful contributions.


Thank you!

Makes total sense.

I’d like to tailor the reports to what maintainers need: some like having reproducers and suggested fixes, others would prefer just a short description and code locations. I’ve already gotten feedback from some maintainers that will allow customizing reports, and I should make asking for this feedback standard procedure. Verbosity is also something we can surely adapt.

The reports are meant to separate confirmed bugs from code quality suggestions, and to keep false positives to a minimum.

Not all bugs the tools find are testable, but we try to reproduce them all and explain why some aren’t possible: simplejson C Extension Analysis Report · GitHub. When maintainers ask me to open PRs, I add tests where possible. Making the reproducers themselves tests (in the correct style for the project) is better; I’ll do that, thanks!

That’d be very informative and I think simple to implement, I’ll give it a try.

I do that locally and include the steps in the report, but doing it in CI seems like the next logical step.

Thank you very much for your feedback!


Thanks – as a Pillow maintainer, this was one of the better sets of reports that we’ve gotten about potential security/correctness issues. Sorry more of the PRs weren’t attributed with thanks; we do appreciate the effort. I’d second your mention that the coverage isn’t complete – I definitely found unmentioned similar bugs in related functions in the response PR that I did, but by inspection it was obvious that they were the same issue.

The issues raised were mostly difficult to test, especially when one would need to have a specific malloc fail without earlier ones failing.

It would be interesting as a test run to have a fuzzer that used coverage guidance to fail mallocs (or C API Python methods) to test the error handling in those cases. It would need to run under Valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if (ptr == NULL) { free everything allocated in the function } C-level error handling.
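For what it's worth, the core loop of such a fault-injection harness can be sketched in pure Python (all names here are hypothetical; a real version would hook the actual allocators and add coverage guidance):

```python
class FaultInjector:
    """Fails the Nth 'allocation'; a stand-in for hooking a real allocator."""
    def __init__(self, fail_at):
        self.count = 0
        self.fail_at = fail_at
    def alloc(self, size):
        self.count += 1
        if self.count == self.fail_at:
            raise MemoryError(f"injected failure at allocation #{self.count}")
        return bytearray(size)

def target(allocator):
    # Stand-in for extension code: several allocations, cleanup on error.
    bufs = []
    try:
        for size in (16, 32, 64):
            bufs.append(allocator.alloc(size))
    except MemoryError:
        bufs.clear()            # the error path must release everything
        raise
    return bufs

def scan(fn, max_allocs=10):
    """Run fn once per allocation index, failing that allocation each time."""
    outcomes = []
    for n in range(1, max_allocs + 1):
        try:
            fn(FaultInjector(n))
            outcomes.append((n, "ok"))
        except MemoryError:
            outcomes.append((n, "clean MemoryError"))
    return outcomes
```

A run that ends in a crash or hang instead of a clean MemoryError points at a broken error path, which is exactly what this technique is meant to surface; the test you'd then write asserts each failure index produces a clean exception.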
