Systematically finding bugs in Python C extensions (575+ confirmed so far)

Greetings all,

Python’s C extension ecosystem contains a large class of subtle correctness bugs that are rarely caught by testing or linters.

Over the past two weeks, I’ve been running a systematic analysis across 44 extensions (~960K LOC), resulting in 575+ confirmed bugs (~10-15% false positive rate after review, ~140 reproduced from Python) and fixes already merged in 14 projects including Pillow, h5py, Cython, lxml, bottleneck, greenlet, bitarray, guppy3, pycurl, igraph, enaml, APSW, regex, and simplejson. These range from hard crashes and memory corruption to correctness issues and spec violations.

The goal is to provide maintainers with high-signal, reproducible bug reports that would be extremely difficult to produce manually. The results so far suggest that a large class of non-trivial bugs in Python C extensions can be systematically discovered and fixed.

I’d like feedback on how to make this more useful and scalable for maintainers. If you’d like your own extension analyzed, just tell me and I’ll do it.

What it does

I built a Claude Code plugin called cext-review-toolkit. The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow, and validates findings with targeted reproducers. The analysis is performed by 13 specialized agents working over the C extension source code in parallel, each targeting a different bug class.

The agents use Tree-sitter for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members.

Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix.
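To give a flavor of what a scanner script does before the agents review its output, here's an illustrative toy of my own (far simpler than the toolkit's actual Tree-sitter-based scripts): it flags `PyErr_SetString` calls that are not quickly followed by a `return`. Crude textual checks like this are exactly why the raw scripts have a 20-40% false positive rate and need a review pass behind them.

```python
# Toy candidate scanner: flag PyErr_SetString calls with no `return`
# in the next two lines ("set exception but fall through" candidates).
C_SOURCE = """
if (mm == NULL) {
    PyErr_SetString(PyExc_MemoryError, "oom");
}
if (buf == NULL) {
    PyErr_SetString(PyExc_MemoryError, "oom");
    return NULL;
}
"""

def candidates(source):
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "PyErr_SetString" in line and "return" not in "".join(lines[i + 1:i + 3]):
            hits.append(i + 1)  # 1-based line number of the call site
    return hits

print(candidates(C_SOURCE))  # flags only the first call site (line 3)
```

The real scripts work on a parse tree rather than raw text, so they can follow control flow instead of just peeking at the next couple of lines.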

What it found

Across 43 extensions (~950K lines of C/C++ reviewed), the toolkit found ~560 FIX-level bugs. These results are not noise-free: after review, false positives are ~10-15%, and some findings cannot yet be reproduced from Python. Here are some representative findings:

In Cython’s runtime: 3 fix PRs were created by @da-woods after reviewing the report: a cyfunction ref-counting error (#7594), a reference leak in Generator_Replace_StopIteration (#7597, also backported to 3.2), and unguarded PyErr_Clear calls (#7600). Since these are in the Cython runtime, the fixes affect every Cython-generated extension.

In h5py: 16 issues filed, 9 fix PRs created by @neutrinoceros and @takluyver, all merged within 3 days. Findings included HDF5 type handle leaks, unchecked malloc calls, and resource leaks on error paths. (umbrella issue)

In bottleneck: 15 issues filed, 9 fix PRs created by @neutrinoceros, 7 merged. The move_median function had a MEMORY_ERR macro that set an exception but didn’t return, causing a segfault under OOM. The INIT/INIT_ONE macros had unchecked PyArray_EMPTY affecting ~46 functions. (umbrella issue)

In guppy3: The maintainer reviewed every finding line by line, fixed 24 of 30 issues, and found additional bugs the tool missed. Their feedback was invaluable: they identified false positives that helped improve the scanner (types inheriting slots from base types, immutable container borrowed-ref safety, defensive API patterns). They wrote: “The tool has been surprisingly decent. It made me re-read the code and find issues it didn’t find.”

In bitarray: After reviewing the report, Ilan Schnell fixed all critical findings and released bitarray 3.8.1. (issue, PR). Ilan said: “Bitarray has an extensive test suite with almost 600 unittests. Nevertheless, cext-review-toolkit discovered several important edge cases that I had overlooked, and probably never would have uncovered myself”.

In regex: mrabarnett applied fixes within hours of receiving the report, addressing a KeyboardInterrupt swallowing bug, a string_search race condition, and NULL check issues across 6 commits and 3 releases.

In Pillow: 13 fix PRs were created by @hugovk and team after receiving the report, covering refcount leaks, NULL dereference checks, error handling, dealloc fixes, and migration to PyModule_AddObjectRef. Three PRs explicitly credit the analysis.

In lxml: Stefan Behnel committed 16+ fixes directly within a day of receiving the report, released as lxml 6.0.3. The changelog notes: “Several out of memory error cases now raise MemoryError that were not handled before.”

In greenlet: @jamadden created PR #502 with 16 commits, noting: “User @devdanzin provided the results of running various coding agents over the code base. […] I independently verified each finding and either made changes or left comments as to why the behaviour was correct.” The PR fixes potential crashers, reference leaks, and free-threading issues.

In pycurl: @swt2c merged PR #961 “Fix bugs noted by C Extension Review Toolkit” — fixing missing PyErr_Occurred checks, a BytesIO leak, a use-after-free, and a deadlock.

In pyerfa: The toolkit found a copy_from_double33 function that writes all 9 elements of a 3×3 rotation matrix to the same memory address (using p instead of p1 in the inner loop). This affects 48 ufunc loops covering 72 ERFA functions. The agents confirmed the bug with a reproducer (Fortran-ordered output arrays produce wrong results), then conducted a directed investigation that concluded the bug is dormant, so standard astropy usage never triggers the non-contiguous output path. Also found: a template typo (dt_eraLBODY instead of dt_eraLDBODY) causing memory leaks in 3 ufunc loops.
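The pyerfa finding is an instance of a classic cursor bug. Here's a hypothetical sketch of the bug class in pure Python (my own illustration, not pyerfa's actual code): the inner loop advances cursor `p1` but writes through the stale cursor `p`, so every element lands at the same output index.

```python
# Hypothetical cursor-bug sketch: writing through `p` instead of the
# advancing `p1` sends all nine elements to the same output slot.
def copy_rows_buggy(src, out):
    p = 0
    for row in src:
        p1 = p
        for value in row:
            out[p] = value       # BUG: should be out[p1]
            p1 += 1
    return out

def copy_rows_fixed(src, out):
    p = 0
    for row in src:
        p1 = p
        for value in row:
            out[p1] = value      # write through the advancing cursor
            p1 += 1
        p = p1                   # move to the next row's start
    return out

src = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(copy_rows_buggy(src, [0] * 9))  # [9, 0, 0, 0, 0, 0, 0, 0, 0]
print(copy_rows_fixed(src, [0] * 9))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```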

In simplejson: 6 PRs filed (4 merged, 2 open) covering a use-after-free in encoder ident handling, reference leaks in the dict encoder, iterable_as_array swallowing MemoryError/KeyboardInterrupt, NULL dereferences on OOM, error-as-truthy bugs in maybe_quote_bigint, and member table bugs. The multi-run approach on simplejson is where the technique of running the analysis multiple times was first validated: a second naive pass found 4 additional bugs missed by the first, and an informed third pass found 5 more. (umbrella issue)

I also submitted PRs directly to lz4 (6 PRs), memray (2 PRs), and scipy (2 PRs).

How it works

The typical workflow for each extension:

  1. Run all agents.

  2. Review and synthesize agent findings into a report.

  3. Try to reproduce every finding from pure Python, using techniques like OOM injection (_testcapi.set_nomemory), evil subclasses (__hash__ that raises), mischievous file-like objects, and more (documented in the reproducer techniques guide).

  4. Write a reproducer appendix with confirmed bugs and their evidence.

  5. Share the report with the maintainer and file issues/PRs.
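As an example of the "evil subclass" technique from step 3, here's a hedged sketch of my own (not one of the toolkit's actual reproducers): a dict key whose `__hash__` succeeds once and then raises, so the failure happens inside CPython's C-level dict lookup rather than in visible Python code.

```python
# "Evil subclass" sketch: the second __hash__ call raises inside the
# C-level dict lookup, exercising an error path from pure Python.
class EvilKey:
    def __init__(self, explode_after):
        self.calls = 0
        self.explode_after = explode_after

    def __hash__(self):
        self.calls += 1
        if self.calls > self.explode_after:
            raise RuntimeError("boom from __hash__")
        return 42

key = EvilKey(explode_after=1)
d = {key: "value"}     # first __hash__ call succeeds
try:
    d[key]             # second call raises inside the C lookup
except RuntimeError as exc:
    print("caught:", exc)
```

The same idea extends to `__eq__`, `__index__`, `__len__`, and file-like objects whose `read` misbehaves: any C code path that calls back into Python must cope with an exception appearing mid-operation.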

Reports are shared as secret GitHub gists. Communication happens through whatever channel works: email, Mastodon DM, Discord, Discourse. Of 40+ maintainers contacted, about 60% responded positively. I’d love it if you could share this post so that it can reach the maintainers who didn’t respond, as well as find more maintainers interested in having their extensions analyzed.

How it’s different

Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). The analyses cext-review-toolkit performs target those invariants specifically. Besides that, the tool uses guided semantic analysis (LLM-assisted) to analyze aspects like “was that bugfix complete, and do similar bugs still lurk in the codebase?” that other tools cannot cover.

The agents cover a rich set of bug classes:

  • Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.

  • Error handling: missing NULL checks, return without exception, exception clobbering.

  • NULL safety: unchecked allocations, dereference-before-check.

  • GIL discipline: API calls without GIL, blocking with GIL held.

  • Type slots: dealloc bugs, missing traverse/clear, __new__-without-__init__ safety.

  • PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).

  • Module state: single-phase init, global PyObject* state.

  • Version compatibility: deprecated APIs, dead version guards.

  • Git history: fix completeness (same bug fixed in one place but not another).

  • Plus: stable ABI compliance, resource lifecycle, complexity analysis.
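The PyErr_Clear class above has a direct pure-Python analogue, which may make it clearer why unguarded clearing is dangerous. This is my own illustrative sketch, not code from any of the analyzed projects:

```python
# Pure-Python analogue of an unguarded PyErr_Clear: catching
# BaseException silently swallows KeyboardInterrupt (and MemoryError)
# along with the exceptions you actually meant to handle.
class Hostile:
    def __int__(self):
        raise KeyboardInterrupt  # e.g. Ctrl-C arriving mid-conversion

def fragile_int(value):
    try:
        return int(value)
    except BaseException:        # analogue of an unguarded PyErr_Clear
        return 0

def careful_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):  # clear only what we expect
        return 0

print(fragile_int(Hostile()))    # 0 -- the interrupt is silently lost
try:
    careful_int(Hostile())
except KeyboardInterrupt:
    print("interrupt propagated")
```

In C the equivalent guard is checking `PyErr_ExceptionMatches` for the expected exception type before calling `PyErr_Clear`.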

I’m still learning best practices with this tool. I recently discovered that running multiple analysis passes on the same codebase finds significantly more bugs. For example, a second naive pass on one extension found 7 additional bugs the first pass missed, and a third informed pass (where agents are told what was already found and directed to unexplored areas) found 5 more, including a systematic error-as-truthy pattern across 14+ call sites that no naive pass caught.
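The error-as-truthy pattern mentioned above is easy to illustrate in pure Python. In this hedged sketch, `is_true_c_style` is a hypothetical stand-in that mimics the return contract of `PyObject_IsTrue` (1 = true, 0 = false, -1 = error):

```python
# A C caller that writes `if (result)` treats the error value -1 as true.
def is_true_c_style(obj):
    """Hypothetical stand-in mimicking PyObject_IsTrue's return contract."""
    try:
        return 1 if obj else 0
    except Exception:
        return -1

class Broken:
    def __bool__(self):
        raise ValueError("cannot decide truthiness")

result = is_true_c_style(Broken())
buggy_branch = bool(result)      # BUG: -1 is truthy, error treated as true
correct_branch = result > 0      # correct: distinguish error from true
print(result, buggy_branch, correct_branch)  # -1 True False
```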

The pure-Python reproduction step is not inherent to the tool, but the rich findings the agents produce make it much easier.

Working with maintainers

Reports like these can be time and energy-intensive for maintainers to investigate. Historically, automated bug-finding tools have produced far more false positives than useful information, and AI can make those false positives look incredibly convincing.

These findings are rarely obvious: if they were, compilers, standard linters, or users would have caught them long ago. Because maintainers are already overworked, dumping low-quality, automated “AI slop” into their trackers only makes the problem worse.

To respect their time, I take great care to ensure these reports are of the highest possible quality. I focus on presenting pure-Python reproducers wherever I can. The multi-agent review step and the reproducer efforts act as a filter against noise. When a maintainer points out a false positive, I immediately update the agents’ prompts so that specific pattern is avoided in the future.

Beyond polishing the tools, I try to communicate in a non-invasive, helpful manner. The maintainer always holds the reins: I ask them how they prefer to receive the information (an umbrella issue? individual issues? direct PRs? or do nothing at all) and let them decide exactly what to do with the findings.

A deep dive

To give a concrete sense of what the toolkit finds, here’s a walkthrough of one bug in bottleneck, a library of fast NumPy array functions.

The move_median function allocates a helper struct for the median computation. If the allocation fails, it calls a macro to set a MemoryError:


```c
mm_handle *mm = mm_new_nan(window, min_count);
// ...
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
}
BN_BEGIN_ALLOW_THREADS  // releases the GIL
WHILE {
    // ... dereferences mm ...
}
```

The MEMORY_ERR macro is defined as:


```c
#define MEMORY_ERR(text) PyErr_SetString(PyExc_MemoryError, text)
```

It sets the Python exception but does not return. Execution falls through to BN_BEGIN_ALLOW_THREADS, which releases the GIL while a MemoryError is pending (a Python/C API violation), then enters the computation loop which dereferences mm, which is now NULL. The result is a segfault inside GIL-released code, with no meaningful error message.

The fix is to add return NULL; after the macro call:


```c
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
    return NULL;  // <-- missing
}
```

Four agents flagged it independently (error-path-analyzer, GIL-checker, git-history-analyzer, NULL-scanner), and I confirmed it triggers a segfault via OOM injection:


```python
import _testcapi
import numpy as np
import bottleneck as bn

_testcapi.set_nomemory(1, 0)  # This forces a memory allocation failure (OOM)
try:
    bn.move_median(np.random.rand(100), window=10)
except MemoryError:
    pass
# Segmentation fault: mm is NULL, dereferenced inside GIL-released code
```

(Issue #519. The related INIT/INIT_ONE macro issue, affecting ~46 functions, was fixed in PR #543.)

What didn’t work

False positives exist. The guppy3 maintainer identified ~8 false positives in a 30-finding report. These led to concrete scanner improvements (PR #26). The most common false positive categories: types inheriting slots from base types, borrowed refs from immutable containers (safe because tuples can’t be mutated), and APIs that handle NULL arguments defensively.

Frameworks that generate C/C++ code, like mypyc, pybind11, and nanobind, present unique challenges. These tools produce optimized, battle-tested code that is the foundation of many extensions, meaning any changes to their generation logic must be treated with extreme caution. A single flagged pattern might be replicated across every generated module, so the reports can be overwhelming to triage, and overwhelmed maintainers aren’t always able to respond. However, the potential payoff is huge: when a framework is able to incorporate a fix, as Cython recently did, the entire community of downstream projects immediately benefits.

Some reports are noisy. Extensions with generated code (Cython, mypyc) or very large codebases produce many low-confidence findings. I’m working on better filtering.

OOM reproduction is hit-or-miss. _testcapi.set_nomemory hooks PYMEM_DOMAIN_RAW but not PYMEM_DOMAIN_OBJ, so many OOM paths in extensions can’t be triggered from Python. About 25% of FIX findings are “code-confirmed but not reproducible from pure Python”.

Free-threading analysis

With free-threading becoming a priority for Python, I built ft-review-toolkit for analyzing free-threading readiness. It uses ThreadSanitizer integration (via labeille for building TSan-enabled Python and running tests), combined with static analysis for shared state, lock discipline, and unsafe API usage.

So far it has analyzed 11 extensions, finding real data races in multidict (233 TSan warnings, zero synchronization on the hashtable), bitarray (70 warnings, 6 SIGABRTs), zope.interface (LookupBase cache race causing SIGABRT), kiwisolver (non-atomic SharedDataPtr refcount), and others. It also produces migration plans for maintainers who want to adopt free-threading.

I plan to write a post about this work when more extensions are reviewed for their free-threading status. As always, please let me know if you’d like your extension analyzed.

The numbers

| Metric | Count |
| --- | --- |
| Extensions analyzed (correctness) | 44 |
| Extensions analyzed (free-threading) | 12 |
| FIX-level bugs found | ~575 |
| Bugs reproduced from Python | 155+ |
| GitHub issues filed | 90 |
| PRs (mine + maintainers’) | 62 |
| Fixes merged | 49 PRs + ~22 direct commits |
| Extensions with landed fixes | 14 |
| Maintainers contacted | 40+ |
| Positive responses | ~60% |

Questions for the community

I’d appreciate feedback on any of these:

  • Is this useful? Do maintainers find these reports helpful, or are they noise? The best feedback I’ve gotten was from maintainers who engaged deeply (guppy3, h5py, bottleneck, Pillow, greenlet, APSW), but I don’t hear from the ones who didn’t respond. Given that:

    • Would you want this run on your project?

    • What would make you trust (or ignore) a report like this?

    • At what false positive rate would this stop being worth your time?

  • Report format: I currently share reports as gists and file grouped issues (one issue per bug family, not one per line change), or umbrella issues. Is there a better format?

  • False positives: The false positive rate after agent review is roughly 10-15%. Is that acceptable, or should I only report findings that can definitely be reproduced?

  • AI disclosure: Every PR I submit includes a note that it was “authored and submitted by Claude Code (Anthropic), reviewed by a human before submission.” Is this the right level of transparency?

  • Prioritization: I’ve been targeting popular extensions with hand-written (non-generated) C code, and extensions maintained by people I know would be receptive. Should I focus differently?

  • Free-threading: Several maintainers (guppy3, APSW) have asked for help migrating to free-threading. Is there interest in a more structured program for this?

All the tools are open source:

I’m happy to run the toolkit on any extension if a maintainer is interested, just let me know.

Thanks to all the maintainers and contributors who have been helping in this effort, with special mention to Clément Robert for welcoming it in multiple projects, APSW maintainer Roger Binns for insightful suggestions that led to significant new capabilities, and YiFei Zhu who not only gave precious feedback on false positives but also spurred the development of ft-review-toolkit.

Thank you for reading this far! As a reward, if you are interested in exploring this work, here are two more tools and one umbrella issue in CPython itself:

Daniel

27 Likes

Can you run the numbers and share how many of these bugs would have been prevented by using Rust instead of C :wink:?

Anyway, this is impressive stuff - great work Daniel. You’re using AI to help open source in the right way. By keeping the human in the loop, putting work in, communicating with maintainers, and being available.

For me personally though, even though you have clearly found lots of real bugs (not just potential bugs, code quality suggestions, and false positives), I don’t want to have to read a huge machine-generated report and work out what’s what.

Perhaps those lists make more sense to other maintainers with intimate knowledge of their code bases. But personally I think it would be even better if a test was included for each bug in the report, especially if you ran that test in GitHub Actions on your fork and linked to the result. If you went one step further, including steps to reproduce the bug and actually carrying out those steps yourself to prove it is a real bug (either in CI where I can see it and play with it too, or sometimes just locally in a venv or Docker container), then you’d be producing the perfect bug reports for me, and making super useful contributions.

3 Likes

Thank you!

Makes total sense.

I’d like to tailor the reports to what maintainers need: some like having reproducers and suggested fixes, others would prefer just a short description and code locations. I’ve already gotten feedback from maintainers that will allow customizing some reports, and I should make asking for this feedback the standard operating procedure. Verbosity is something we can surely adapt too.

The reports are meant to separate confirmed bugs from code quality suggestions, and to keep false positives as low as possible.

Not all bugs the tools find are testable, but we try to reproduce them all and explain why some aren’t possible: simplejson C Extension Analysis Report · GitHub. When maintainers ask me to open PRs, I add tests where possible. Turning the reproducers themselves into tests (in the correct style for the project) is even better; I’ll do that, thanks!

That’d be very informative and I think simple to implement, I’ll give it a try.

I do that locally and include the steps in the report, but doing it in CI seems like the next logical step.

Thank you very much for your feedback!

1 Like

Thanks – As a Pillow maintainer, this was one of the better sets of reports that we’ve gotten about potential security/correctness issues. Sorry more of the PRs weren’t attributed with thanks; we do appreciate the effort. I’d second your mention that the coverage isn’t complete – I definitely found unmentioned similar bugs in related functions in the response PR that I did, but by inspection it was obvious that the same issue was present.

The issues raised were mostly difficult to test, especially when one would need to have a specific malloc fail without earlier ones failing.

It would be interesting, as a test run, to have a fuzzer that used coverage guidance to fail mallocs (or C API Python methods) to test the error handling in those cases. It would need to run under Valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if (ptr == NULL) { free everything allocated in the function } C-level error handling.

4 Likes

Thank you! And no worries, attribution is a detail, the important part is getting the fixes landed :smile:.

I have a plan for something kinda sorta like that: adapt a non-coverage guided fuzzer (which is what I have) to randomly fail mallocs, running with ASan enabled.

Making C API Python methods fail hadn’t occurred to me; it sounds very interesting. Maybe we could LD_PRELOAD something that would do that?

1 Like

To try to answer what is probably a non-serious question:

Probably some of the reference counting bugs, since you can encode ownership in a wrapper around PyObject*. C++ would also do this nicely. Ultimately, though, you are wrapping a C interface, so you need to deal with raw pointers eventually.

A good chunk of the issues in Cython were to do with not handling BaseException (e.g. MemoryError, maybe KeyboardInterrupt) but instead clearing them as if they were a regular exception. I don’t believe Rust would have helped with that.

2 Likes

Well, here’s an estimate from the Claude that runs the plugin:

About 20-35%, the memory safety bugs. The majority of what we find are Python/C API logic errors (wrong exception handling, wrong refcount protocol, wrong GIL discipline) that Rust doesn’t prevent. PyO3 helps more than Rust itself, by automating the refcount boilerplate that’s the source of most bugs.

Rust would NOT prevent (~60-70% of our findings):

  • Reference counting errors (leaked refs, borrowed-ref-across-call) — these are Python/C API semantics, not memory safety. PyO3 helps but doesn’t eliminate them.
  • PyErr_Clear swallowing MemoryError/KeyboardInterrupt — exception handling logic errors
  • Error-as-truthy (PyObject_IsTrue returning -1 treated as true) — logic errors
  • Missing NULL checks after failable API calls — the C API contract, not memory safety
  • GIL discipline (blocking with GIL held, missing GIL release) — concurrency design
  • Module state issues (single-phase init, global state) — Python/C API architecture
  • Version compatibility (deprecated APIs, dead version guards) — API evolution
  • new-without-init crashes — Python object lifecycle design
  • Re-init safety (calling init twice leaks resources) — API design
  • Exception clobbering — logic errors
  • Free-threading races (missing critical sections) — concurrency design, Rust doesn’t auto solve this

Rust WOULD prevent (~15-25%):

  • Use-after-free / double-free (the SetItem double-free pattern — 62 sites!)
  • Buffer overflows (rare in our findings, but when present)
  • Py_DECREF(NULL) — null pointer dereference
  • std::bad_alloc through C boundary (Rust panics are at least catchable)
  • Some heap-type dealloc issues (Rust’s ownership model would enforce cleanup order)

Partially prevented (~10-15%):

  • Heap type missing Py_DECREF(Py_TYPE(self)) — PyO3 handles this automatically, but it’s a PyO3 feature, not a Rust language feature
  • Resource leaks on error paths — Rust’s RAII helps but you can still leak via mem::forget

Given LLMs’ troubles with numbers and estimates, I wouldn’t trust the percentages too much (I didn’t actually “run the numbers”, just passed your question along). But the bug classes per category seem correct to me.

I’d contest several items in the “Rust would not prevent” category. GIL discipline, refcount errors, error-as-truthy, NULL checks, object lifecycle, … all are non-issues in Rust — if the Rust API is designed safely (in the Rust sense) instead of literally following the C API.

2 Likes

I can’t argue with that, TIL.

Do you think adapting these tools to check Rust extensions would be worth it, or is it too easy to design safely and hence not likely to result in interesting findings?

I’m thinking about having a separate tool to check non-extension C/C++ code, but checking Rust extensions might benefit the Python ecosystem more.

I think it might be difficult for you to parse the results and drive the LLM if you don’t already have an understanding of Rust.

That said, as a PyO3 maintainer, we always like getting soundness reports. Writing unsafe Rust is hard and PyO3 is a big ball of unsafe Rust code (calling into any C api is unsafe).

I think ecosystem extensions are probably less likely to have issues, if only because they rely on safety guarantees from PyO3 and Rust itself. Of course any unsafe code in extensions outside the PyO3 implementation might also be problematic.

1 Like

That would be so useful. I’ve got unchecked malloc()s that I can’t even bring myself to fix because the only way I’d have to verify their proper cleanup is to edit the source code and temporarily replace each call to malloc() with null.

1 Like

You are spot on with this belief, we even have an open issue about this: `BaseException`(`SystemExit`/`KeyboardInterrupt`) is swallowed and converted to a `TypeError` when extracting struct fields · Issue #5457 · PyO3/pyo3 · GitHub

I think it’s possible PyO3 could design an alternative error handling API which encourages users to distinguish BaseException from other exceptions, however that would probably come at a cost of increased complexity at the type level.

2 Likes

What I was thinking of would be a deterministic coverage guided fuzzer.

It would require a couple of abilities:

  • The ability to selectively fail a call to specified functions.

  • Following code execution coverage. If a change doesn’t result in new code coverage, we don’t trigger that one anymore, recursively. However, I think a lot of the failures are going to be quickly followed by a correct error exit or a crash of some form.

  • Tracing whether calls occurred within a specific directory/tree (e.g. pillow/src). Don’t fail calls that come from other source trees (like core Python).

  • ASAN coverage.

What I’d see it doing is run the test harness once for a baseline of code coverage, and then systematically fail one new call per run, in code execution order, looking for code coverage changes. Once you’ve failed one (new) call, run to completion. There would potentially be issues with calls in a tight loop, but if it failed per code location rather than per call, that would avoid needing one test run per iteration, assuming code coverage didn’t change.

That should exhaustively cover error cases. Probably slowly, but it would definitely help getting much higher branch test coverage.
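The fail-one-call-per-run idea can be sketched in pure Python. This is my own toy (with a hypothetical stand-in harness and no real coverage tracking or ASAN): wrap the injected function so that its Nth invocation raises, and sweep N across runs until a run completes without hitting the injection point.

```python
# Deterministic "fail the Nth call" sweep: one injected failure per run.
class FailNth:
    """Wrap a function so that its fail_at-th invocation raises."""
    def __init__(self, func, fail_at):
        self.func = func
        self.fail_at = fail_at
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        if self.count == self.fail_at:
            raise MemoryError(f"injected failure at call #{self.fail_at}")
        return self.func(*args, **kwargs)

def harness(alloc):
    # Stand-in for the code under test: three allocation sites.
    bufs = [alloc(8) for _ in range(3)]
    return len(bufs)

results = []
for n in range(1, 5):
    alloc = FailNth(lambda size: bytearray(size), fail_at=n)
    try:
        harness(alloc)
        results.append((n, "ok"))
    except MemoryError:
        results.append((n, "injected-oom"))
print(results)  # calls 1-3 get an injected failure; call 4 never happens
```

The real version would additionally record coverage per run (to skip injections that open no new paths) and restrict injection to calls originating from the target source tree.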