Systematically finding bugs in Python C extensions (575+ confirmed so far)

Greetings all,

Python’s C extension ecosystem contains a large class of subtle correctness bugs that are rarely caught by testing or linters.

Over the past two weeks, I’ve been running a systematic analysis across 44 extensions (~960K LOC), resulting in 575+ confirmed bugs (~10-15% false positive rate after review, ~140 reproduced from Python) and fixes already merged in 14 projects including Pillow, h5py, Cython, lxml, bottleneck, greenlet, bitarray, guppy3, pycurl, igraph, enaml, APSW, regex, and simplejson. These range from hard crashes and memory corruption to correctness issues and spec violations.

The goal is to provide maintainers with high-signal, reproducible bug reports that would be extremely difficult to produce manually. The results so far suggest that a large class of non-trivial bugs in Python C extensions can be systematically discovered and fixed.

I’d like feedback on how to make this more useful and scalable for maintainers. If you’d like your own extension analyzed, just tell me and I’ll do it.

What it does

I built a Claude Code plugin called cext-review-toolkit. The key difference from traditional static analysis is that this system tracks Python-specific invariants (refcounts, GIL discipline, exception state) across control flow, and validates findings with targeted reproducers. The analysis is performed by 13 specialized agents working over the C extension source code in parallel, each targeting a different bug class.

The agents use Tree-sitter for C/C++ parsing, which enables analysis that pattern matching can’t do, like tracking borrowed reference lifetimes across function calls, or cross-referencing type slot definitions with struct members.

Each agent can run a scanner script to find candidates, then performs qualitative review of each candidate to confirm or dismiss it. The scripts have a ~20-40% false positive rate and the agents are there to bring that down. After the agents finish, I try to reproduce every finding from pure Python and write a reproducer appendix.
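To give a flavor of what a scanner script does before the agents review its output, here's an illustrative toy of my own (far simpler than the toolkit's actual Tree-sitter-based scripts): it flags `PyErr_SetString` calls that are not quickly followed by a `return`. Crude textual checks like this are exactly why the raw scripts have a 20-40% false positive rate and need a review pass behind them.

```python
# Toy candidate scanner: flag PyErr_SetString calls with no `return`
# in the next two lines ("set exception but fall through" candidates).
C_SOURCE = """
if (mm == NULL) {
    PyErr_SetString(PyExc_MemoryError, "oom");
}
if (buf == NULL) {
    PyErr_SetString(PyExc_MemoryError, "oom");
    return NULL;
}
"""

def candidates(source):
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "PyErr_SetString" in line and "return" not in "".join(lines[i + 1:i + 3]):
            hits.append(i + 1)  # 1-based line number of the call site
    return hits

print(candidates(C_SOURCE))  # flags only the first call site (line 3)
```

The real scripts work on a parse tree rather than raw text, so they can follow control flow instead of just peeking at the next couple of lines.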

What it found

Across 43 extensions (~950K lines of C/C++ reviewed), the toolkit found ~560 FIX-level bugs. These results are not noise-free: after review, false positives are ~10-15%, and some findings cannot yet be reproduced from Python. Here are some representative findings:

In Cython’s runtime: 3 fix PRs were created by @da-woods after reviewing the report: a cyfunction ref-counting error (#7594), a reference leak in Generator_Replace_StopIteration (#7597, also backported to 3.2), and unguarded PyErr_Clear calls (#7600). Since these are in the Cython runtime, the fixes affect every Cython-generated extension.

In h5py: 16 issues filed, 9 fix PRs created by @neutrinoceros and @takluyver, all merged within 3 days. Findings included HDF5 type handle leaks, unchecked malloc calls, and resource leaks on error paths. (umbrella issue)

In bottleneck: 15 issues filed, 9 fix PRs created by @neutrinoceros, 7 merged. The move_median function had a MEMORY_ERR macro that set an exception but didn’t return, causing a segfault under OOM. The INIT/INIT_ONE macros had unchecked PyArray_EMPTY affecting ~46 functions. (umbrella issue)

In guppy3: The maintainer reviewed every finding line by line, fixed 24 of 30 issues, and found additional bugs the tool missed. Their feedback was invaluable: they identified false positives that helped improve the scanner (types inheriting slots from base types, immutable container borrowed-ref safety, defensive API patterns). They wrote: “The tool has been surprisingly decent. It made me re-read the code and find issues it didn’t find.”

In bitarray: After reviewing the report, Ilan Schnell fixed all critical findings and released bitarray 3.8.1. (issue, PR). Ilan said: “Bitarray has an extensive test suite with almost 600 unittests. Nevertheless, cext-review-toolkit discovered several important edge cases that I had overlooked, and probably never would have uncovered myself”.

In regex: mrabarnett applied fixes within hours of receiving the report, addressing a KeyboardInterrupt swallowing bug, a string_search race condition, and NULL check issues across 6 commits and 3 releases.

In Pillow: 13 fix PRs were created by @hugovk and team after receiving the report, covering refcount leaks, NULL dereference checks, error handling, dealloc fixes, and migration to PyModule_AddObjectRef. Three PRs explicitly credit the analysis.

In lxml: Stefan Behnel committed 16+ fixes directly within a day of receiving the report, released as lxml 6.0.3. The changelog notes: “Several out of memory error cases now raise MemoryError that were not handled before.”

In greenlet: @jamadden created PR #502 with 16 commits, noting: “User @devdanzin provided the results of running various coding agents over the code base. […] I independently verified each finding and either made changes or left comments as to why the behaviour was correct.” The PR fixes potential crashers, reference leaks, and free-threading issues.

In pycurl: @swt2c merged PR #961 “Fix bugs noted by C Extension Review Toolkit” — fixing missing PyErr_Occurred checks, a BytesIO leak, a use-after-free, and a deadlock.

In pyerfa: The toolkit found a copy_from_double33 function that writes all 9 elements of a 3×3 rotation matrix to the same memory address (using p instead of p1 in the inner loop). This affects 48 ufunc loops covering 72 ERFA functions. The agents confirmed the bug with a reproducer (Fortran-ordered output arrays produce wrong results), then conducted a directed investigation that concluded the bug is dormant, so standard astropy usage never triggers the non-contiguous output path. Also found: a template typo (dt_eraLBODY instead of dt_eraLDBODY) causing memory leaks in 3 ufunc loops.
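The pyerfa finding is an instance of a classic cursor bug. Here's a hypothetical sketch of the bug class in pure Python (my own illustration, not pyerfa's actual code): the inner loop advances cursor `p1` but writes through the stale cursor `p`, so every element lands at the same output index.

```python
# Hypothetical cursor-bug sketch: writing through `p` instead of the
# advancing `p1` sends all nine elements to the same output slot.
def copy_rows_buggy(src, out):
    p = 0
    for row in src:
        p1 = p
        for value in row:
            out[p] = value       # BUG: should be out[p1]
            p1 += 1
    return out

def copy_rows_fixed(src, out):
    p = 0
    for row in src:
        p1 = p
        for value in row:
            out[p1] = value      # write through the advancing cursor
            p1 += 1
        p = p1                   # move to the next row's start
    return out

src = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(copy_rows_buggy(src, [0] * 9))  # [9, 0, 0, 0, 0, 0, 0, 0, 0]
print(copy_rows_fixed(src, [0] * 9))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```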

In simplejson: 6 PRs filed (4 merged, 2 open) covering a use-after-free in encoder ident handling, reference leaks in the dict encoder, iterable_as_array swallowing MemoryError/KeyboardInterrupt, NULL dereferences on OOM, error-as-truthy bugs in maybe_quote_bigint, and member table bugs. The multi-run approach on simplejson is where the technique of running the analysis multiple times was first validated: a second naive pass found 4 additional bugs missed by the first, and an informed third pass found 5 more. (umbrella issue)

I also submitted PRs directly to lz4 (6 PRs), memray (2 PRs), and scipy (2 PRs).

How it works

The typical workflow for each extension:

  1. Run all agents.

  2. Review and synthesize agent findings into a report.

  3. Try to reproduce every finding from pure Python, using techniques like OOM injection (_testcapi.set_nomemory), evil subclasses (__hash__ that raises), mischievous file-like objects, and more (documented in the reproducer techniques guide).

  4. Write a reproducer appendix with confirmed bugs and their evidence.

  5. Share the report with the maintainer and file issues/PRs.
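As an example of the "evil subclass" technique from step 3, here's a hedged sketch of my own (not one of the toolkit's actual reproducers): a dict key whose `__hash__` succeeds once and then raises, so the failure happens inside CPython's C-level dict lookup rather than in visible Python code.

```python
# "Evil subclass" sketch: the second __hash__ call raises inside the
# C-level dict lookup, exercising an error path from pure Python.
class EvilKey:
    def __init__(self, explode_after):
        self.calls = 0
        self.explode_after = explode_after

    def __hash__(self):
        self.calls += 1
        if self.calls > self.explode_after:
            raise RuntimeError("boom from __hash__")
        return 42

key = EvilKey(explode_after=1)
d = {key: "value"}     # first __hash__ call succeeds
try:
    d[key]             # second call raises inside the C lookup
except RuntimeError as exc:
    print("caught:", exc)
```

The same idea extends to `__eq__`, `__index__`, `__len__`, and file-like objects whose `read` misbehaves: any C code path that calls back into Python must cope with an exception appearing mid-operation.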

Reports are shared as secret GitHub gists. Communication happens through whatever channel works: email, Mastodon DM, Discord, Discourse. Of 40+ maintainers contacted, about 60% responded positively. I’d love it if you could share this post so that it can reach the maintainers who didn’t respond, as well as find more maintainers interested in having their extensions analyzed.

How it’s different

Traditional tools like clang-tidy, Coverity, and sanitizers struggle with Python C API semantics (reference ownership, exception state, GIL constraints). The analyses cext-review-toolkit performs target those invariants specifically. Besides that, the tool uses guided semantic analysis (LLM-assisted) to analyze aspects like “was that bugfix complete, and do similar bugs still lurk in the codebase?” that other tools cannot cover.

The agents cover a rich set of bug classes:

  • Reference counting: leaked refs, borrowed-ref-across-callback, stolen-ref misuse.

  • Error handling: missing NULL checks, return without exception, exception clobbering.

  • NULL safety: unchecked allocations, dereference-before-check.

  • GIL discipline: API calls without GIL, blocking with GIL held.

  • Type slots: dealloc bugs, missing traverse/clear, __new__-without-__init__ safety.

  • PyErr_Clear: unguarded exception swallowing (MemoryError, KeyboardInterrupt).

  • Module state: single-phase init, global PyObject* state.

  • Version compatibility: deprecated APIs, dead version guards.

  • Git history: fix completeness (same bug fixed in one place but not another).

  • Plus: stable ABI compliance, resource lifecycle, complexity analysis.
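The PyErr_Clear class above has a direct pure-Python analogue, which may make it clearer why unguarded clearing is dangerous. This is my own illustrative sketch, not code from any of the analyzed projects:

```python
# Pure-Python analogue of an unguarded PyErr_Clear: catching
# BaseException silently swallows KeyboardInterrupt (and MemoryError)
# along with the exceptions you actually meant to handle.
class Hostile:
    def __int__(self):
        raise KeyboardInterrupt  # e.g. Ctrl-C arriving mid-conversion

def fragile_int(value):
    try:
        return int(value)
    except BaseException:        # analogue of an unguarded PyErr_Clear
        return 0

def careful_int(value):
    try:
        return int(value)
    except (TypeError, ValueError):  # clear only what we expect
        return 0

print(fragile_int(Hostile()))    # 0 -- the interrupt is silently lost
try:
    careful_int(Hostile())
except KeyboardInterrupt:
    print("interrupt propagated")
```

In C the equivalent guard is checking `PyErr_ExceptionMatches` for the expected exception type before calling `PyErr_Clear`.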

I’m still learning best practices with this tool. I recently discovered that running multiple analysis passes on the same codebase finds significantly more bugs. For example, a second naive pass on one extension found 7 additional bugs the first pass missed, and a third informed pass (where agents are told what was already found and directed to unexplored areas) found 5 more, including a systematic error-as-truthy pattern across 14+ call sites that no naive pass caught.
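The error-as-truthy pattern mentioned above is easy to illustrate in pure Python. In this hedged sketch, `is_true_c_style` is a hypothetical stand-in that mimics the return contract of `PyObject_IsTrue` (1 = true, 0 = false, -1 = error):

```python
# A C caller that writes `if (result)` treats the error value -1 as true.
def is_true_c_style(obj):
    """Hypothetical stand-in mimicking PyObject_IsTrue's return contract."""
    try:
        return 1 if obj else 0
    except Exception:
        return -1

class Broken:
    def __bool__(self):
        raise ValueError("cannot decide truthiness")

result = is_true_c_style(Broken())
buggy_branch = bool(result)      # BUG: -1 is truthy, error treated as true
correct_branch = result > 0      # correct: distinguish error from true
print(result, buggy_branch, correct_branch)  # -1 True False
```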

The pure-Python reproduction step is not inherent to the tool, but the rich findings the agents produce make it much easier.

Working with maintainers

Reports like these can be time and energy-intensive for maintainers to investigate. Historically, automated bug-finding tools have produced far more false positives than useful information, and AI can make those false positives look incredibly convincing.

These findings are rarely obvious: if they were, compilers, standard linters, or users would have caught them long ago. Because maintainers are already overworked, dumping low-quality, automated “AI slop” into their trackers only makes the problem worse.

To respect their time, I take great care to ensure these reports are of the highest possible quality. I focus on presenting pure-Python reproducers wherever I can. The multi-agent review step and the reproducer efforts act as a filter against noise. When a maintainer points out a false positive, I immediately update the agents’ prompts so that specific pattern is avoided in the future.

Beyond polishing the tools, I try to communicate in a non-invasive, helpful manner. The maintainer always holds the reins: I ask them how they prefer to receive the information (an umbrella issue? individual issues? direct PRs? or do nothing at all) and let them decide exactly what to do with the findings.

A deep dive

To give a concrete sense of what the toolkit finds, here’s a walkthrough of one bug in bottleneck, a library of fast NumPy array functions.

The move_median function allocates a helper struct for the median computation. If the allocation fails, it calls a macro to set a MemoryError:


```c
mm_handle *mm = mm_new_nan(window, min_count);
// ...
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
}
BN_BEGIN_ALLOW_THREADS  // releases the GIL
WHILE {
    // ... dereferences mm ...
}
```

The MEMORY_ERR macro is defined as:


```c
#define MEMORY_ERR(text) PyErr_SetString(PyExc_MemoryError, text)
```

It sets the Python exception but does not return. Execution falls through to BN_BEGIN_ALLOW_THREADS, which releases the GIL while a MemoryError is pending (a Python/C API violation), then enters the computation loop which dereferences mm, which is now NULL. The result is a segfault inside GIL-released code, with no meaningful error message.

The fix is to add return NULL; after the macro call:


```c
if (mm == NULL) {
    MEMORY_ERR("Could not allocate memory for move_median");
    return NULL;  // <-- missing
}
```

Four agents flagged it independently (error-path-analyzer, GIL-checker, git-history-analyzer, NULL-scanner), and I confirmed it triggers a segfault via OOM injection:


```python
import _testcapi
import numpy as np
import bottleneck as bn

_testcapi.set_nomemory(1, 0)  # This forces a memory allocation failure (OOM)
try:
    bn.move_median(np.random.rand(100), window=10)
except MemoryError:
    pass
# Segmentation fault: mm is NULL, dereferenced inside GIL-released code
```

(Issue #519. The related INIT/INIT_ONE macro issue, affecting ~46 functions, was fixed in PR #543.)

What didn’t work

False positives exist. The guppy3 maintainer identified ~8 false positives in a 30-finding report. These led to concrete scanner improvements (PR #26). The most common false positive categories: types inheriting slots from base types, borrowed refs from immutable containers (safe because tuples can’t be mutated), and APIs that handle NULL arguments defensively.

Frameworks that generate C/C++ code, like mypyc, pybind11, and nanobind, present unique challenges. These tools produce optimized, battle-tested code that is the foundation of many extensions, meaning any changes to their generation logic must be treated with extreme caution. A single flagged pattern might be replicated across every generated module, so the reports can be overwhelming to triage, and overwhelmed maintainers aren’t always able to respond. However, the potential payoff is huge: when a framework is able to incorporate a fix, as Cython recently did, the entire community of downstream projects immediately benefits.

Some reports are noisy. Extensions with generated code (Cython, mypyc) or very large codebases produce many low-confidence findings. I’m working on better filtering.

OOM reproduction is hit-or-miss. _testcapi.set_nomemory hooks PYMEM_DOMAIN_RAW but not PYMEM_DOMAIN_OBJ, so many OOM paths in extensions can’t be triggered from Python. About 25% of FIX findings are “code-confirmed but not reproducible from pure Python”.

Free-threading analysis

With free-threading becoming a priority for Python, I built ft-review-toolkit for analyzing free-threading readiness. It uses ThreadSanitizer integration (via labeille for building TSan-enabled Python and running tests), combined with static analysis for shared state, lock discipline, and unsafe API usage.

So far it has analyzed 11 extensions, finding real data races in multidict (233 TSan warnings, zero synchronization on the hashtable), bitarray (70 warnings, 6 SIGABRTs), zope.interface (LookupBase cache race causing SIGABRT), kiwisolver (non-atomic SharedDataPtr refcount), and others. It also produces migration plans for maintainers who want to adopt free-threading.

I plan to write a post about this work when more extensions are reviewed for their free-threading status. As always, please let me know if you’d like your extension analyzed.

The numbers

| Metric | Count |
| --- | --- |
| Extensions analyzed (correctness) | 44 |
| Extensions analyzed (free-threading) | 12 |
| FIX-level bugs found | ~575 |
| Bugs reproduced from Python | 155+ |
| GitHub issues filed | 90 |
| PRs (mine + maintainers’) | 62 |
| Fixes merged | 49 PRs + ~22 direct commits |
| Extensions with landed fixes | 14 |
| Maintainers contacted | 40+ |
| Positive responses | ~60% |

Questions for the community

I’d appreciate feedback on any of these:

  • Is this useful? Do maintainers find these reports helpful, or are they noise? The best feedback I’ve gotten was from maintainers who engaged deeply (guppy3, h5py, bottleneck, Pillow, greenlet, APSW), but I don’t hear from the ones who didn’t respond. Given that:

    • Would you want this run on your project?

    • What would make you trust (or ignore) a report like this?

    • At what false positive rate would this stop being worth your time?

  • Report format: I currently share reports as gists and file grouped issues (one issue per bug family, not one per line change), or umbrella issues. Is there a better format?

  • False positives: The false positive rate after agent review is roughly 10-15%. Is that acceptable, or should I only report findings that can definitely be reproduced?

  • AI disclosure: Every PR I submit includes a note that it was “authored and submitted by Claude Code (Anthropic), reviewed by a human before submission.” Is this the right level of transparency?

  • Prioritization: I’ve been targeting popular extensions with hand-written (non-generated) C code, and extensions maintained by people I know would be receptive. Should I focus differently?

  • Free-threading: Several maintainers (guppy3, APSW) have asked for help migrating to free-threading. Is there interest in a more structured program for this?

All the tools are open source:

I’m happy to run the toolkit on any extension if a maintainer is interested, just let me know.

Thanks to all the maintainers and contributors who have been helping in this effort, with special mention to Clément Robert for welcoming it in multiple projects, APSW maintainer Roger Binns for insightful suggestions that led to significant new capabilities, and YiFei Zhu who not only gave precious feedback on false positives but also spurred the development of ft-review-toolkit.

Thank you for reading this far! As a reward, if you are interested in exploring this work, here are two more tools and one umbrella issue in CPython itself:

Daniel

27 Likes

Can you run the numbers and share how many of these bugs would have been prevented by using Rust instead of C :wink:?

Anyway, this is impressive stuff - great work Daniel. You’re using AI to help open source in the right way. By keeping the human in the loop, putting work in, communicating with maintainers, and being available.

For me personally though, even though you have clearly found lots of real bugs (not just potential bugs, code quality suggestions, and false positives), I don’t want to have to read a huge machine-generated report and work out what’s what.

Perhaps those lists make more sense to other maintainers with intimate knowledge of their code bases. But personally I think it would be even better if a test was included for each bug in the report, especially if you ran that test in GitHub Actions on your fork and linked to the result. If you went one step further, including steps to reproduce the bug and actually carrying out those steps yourself to prove it is a real bug (either in CI where I can see it and play with it too, or sometimes just locally in a venv or Docker container), then you’d be producing the perfect bug reports for me, and making super useful contributions.

3 Likes

Thank you!

Makes total sense.

I’d like to tailor the reports to what maintainers need: some like having reproducers and suggested fixes, others would prefer just a short description and code locations. I’ve already gotten feedback from maintainers that will allow customizing some reports, and I should make asking for this feedback the standard operating procedure. Verbosity is something we can surely adapt too.

The reports are meant to separate confirmed bugs from code quality suggestions, and to keep false positives as low as possible.

Not all bugs the tools find are testable, but we try to reproduce them all and explain why some aren’t possible: simplejson C Extension Analysis Report · GitHub. When maintainers ask me to open PRs, I add tests where possible. Turning the reproducers themselves into tests (in the correct style for the project) is even better; I’ll do that, thanks!

That’d be very informative and I think simple to implement, I’ll give it a try.

I do that locally and include the steps in the report, but doing it in CI seems like the next logical step.

Thank you very much for your feedback!

1 Like

Thanks – As a Pillow maintainer, this was one of the better sets of reports that we’ve gotten about potential security/correctness issues. Sorry more of the PRs weren’t attributed with thanks; we do appreciate the effort. I’d second your mention that the coverage isn’t complete – I definitely found unmentioned similar bugs in related functions in the response PR that I did, but by inspection it was obvious that the same issue was present.

The issues raised were mostly difficult to test, especially when one would need to have a specific malloc fail without earlier ones failing.

It would be interesting, as a test run, to have a fuzzer that used coverage guidance to fail mallocs (or C API Python methods) to test the error handling in those cases. It would need to run under Valgrind to catch memory leaks or invalid accesses. This could give better code coverage for the repetitive if (ptr == NULL) { free everything allocated in the function } C-level error handling.

4 Likes

Thank you! And no worries, attribution is a detail, the important part is getting the fixes landed :smile:.

I have a plan for something kinda sorta like that: adapt a non-coverage guided fuzzer (which is what I have) to randomly fail mallocs, running with ASan enabled.

Making C API Python methods fail hadn’t occurred to me; it sounds very interesting. Maybe we could LD_PRELOAD something that would do that?

1 Like

To try to answer what is probably a non-serious question:

Probably some of the reference counting bugs, since you can encode ownership in a wrapper around PyObject*. C++ would also do this nicely. Ultimately, though, you are wrapping a C interface, so you need to deal with raw pointers eventually.

A good chunk of the issues in Cython were to do with not handling BaseException (e.g. MemoryError, maybe KeyboardInterrupt) but instead clearing them as if they were a regular exception. I don’t believe Rust would have helped with that.

2 Likes

Well, here’s an estimate from the Claude that runs the plugin:

About 20-35%, the memory safety bugs. The majority of what we find are Python/C API logic errors (wrong exception handling, wrong refcount protocol, wrong GIL discipline) that Rust doesn’t prevent. PyO3 helps more than Rust itself, by automating the refcount boilerplate that’s the source of most bugs.

Rust would NOT prevent (~60-70% of our findings):

  • Reference counting errors (leaked refs, borrowed-ref-across-call) — these are Python/C API semantics, not memory safety. PyO3 helps but doesn’t eliminate them.
  • PyErr_Clear swallowing MemoryError/KeyboardInterrupt — exception handling logic errors
  • Error-as-truthy (PyObject_IsTrue returning -1 treated as true) — logic errors
  • Missing NULL checks after failable API calls — the C API contract, not memory safety
  • GIL discipline (blocking with GIL held, missing GIL release) — concurrency design
  • Module state issues (single-phase init, global state) — Python/C API architecture
  • Version compatibility (deprecated APIs, dead version guards) — API evolution
  • new-without-init crashes — Python object lifecycle design
  • Re-init safety (calling init twice leaks resources) — API design
  • Exception clobbering — logic errors
  • Free-threading races (missing critical sections) — concurrency design, Rust doesn’t auto solve this

Rust WOULD prevent (~15-25%):

  • Use-after-free / double-free (the SetItem double-free pattern — 62 sites!)
  • Buffer overflows (rare in our findings, but when present)
  • Py_DECREF(NULL) — null pointer dereference
  • std::bad_alloc through C boundary (Rust panics are at least catchable)
  • Some heap-type dealloc issues (Rust’s ownership model would enforce cleanup order)

Partially prevented (~10-15%):

  • Heap type missing Py_DECREF(Py_TYPE(self)) — PyO3 handles this automatically, but it’s a PyO3 feature, not a Rust language feature
  • Resource leaks on error paths — Rust’s RAII helps but you can still leak via mem::forget

Given LLMs’ troubles with numbers and estimates, I wouldn’t trust the percentages too much (I didn’t actually “run the numbers”, just passed your question along). But the bug classes per category seem correct to me.

I’d contest several items in the “Rust would not prevent” category. GIL discipline, refcount errors, error-as-truthy, NULL checks, object lifecycle, … all are non-issues in Rust — if the Rust API is designed safely (in the Rust sense) instead of literally following the C API.

2 Likes

I can’t argue with that, TIL.

Do you think adapting these tools to check Rust extensions would be worth it, or is it too easy to design safely and hence not likely to result in interesting findings?

I’m thinking about having a separate tool to check non-extension C/C++ code, but checking Rust extensions might benefit the Python ecosystem more.

I think it might be difficult for you to parse the results and drive the LLM if you don’t already have an understanding of Rust.

That said, as a PyO3 maintainer, we always like getting soundness reports. Writing unsafe Rust is hard and PyO3 is a big ball of unsafe Rust code (calling into any C api is unsafe).

I think ecosystem extensions are probably less likely to have issues, if only because they rely on safety guarantees from PyO3 and Rust itself. Of course any unsafe code in extensions outside the PyO3 implementation might also be problematic.

1 Like

That would be so useful. I’ve got unchecked malloc()s that I can’t even bring myself to fix because the only way I’d have to verify their proper cleanup is to edit the source code and temporarily replace each call to malloc() with null.

1 Like

You are spot on with this belief, we even have an open issue about this: `BaseException`(`SystemExit`/`KeyboardInterrupt`) is swallowed and converted to a `TypeError` when extracting struct fields · Issue #5457 · PyO3/pyo3 · GitHub

I think it’s possible PyO3 could design an alternative error handling API which encourages users to distinguish BaseException from other exceptions, however that would probably come at a cost of increased complexity at the type level.

2 Likes

What I was thinking of would be a deterministic coverage guided fuzzer.

It would require a couple of abilities:

  • The ability to selectively fail a call to specified functions.

  • Following code execution coverage. If a change doesn’t result in new code coverage, we don’t trigger that one anymore, recursively. However, I think a lot of the failures are going to be quickly followed by a correct error exit or a crash of some form.

  • Tracing whether calls occurred within a specific directory/tree (e.g. pillow/src). Don’t fail calls that come from other source trees (like core Python).

  • ASAN coverage.

What I’d see it doing is run the test harness once for a baseline of code coverage, and then systematically fail one new call per run, in code execution order, looking for code coverage changes. Once you’ve failed one (new) call, run to completion. There would potentially be issues with calls in a tight loop, but if it failed per code location rather than per call, that would avoid needing one test run per iteration, assuming code coverage didn’t change.

That should exhaustively cover error cases. Probably slowly, but it would definitely help getting much higher branch test coverage.
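The fail-one-call-per-run idea can be sketched in pure Python. This is my own toy (with a hypothetical stand-in harness and no real coverage tracking or ASAN): wrap the injected function so that its Nth invocation raises, and sweep N across runs until a run completes without hitting the injection point.

```python
# Deterministic "fail the Nth call" sweep: one injected failure per run.
class FailNth:
    """Wrap a function so that its fail_at-th invocation raises."""
    def __init__(self, func, fail_at):
        self.func = func
        self.fail_at = fail_at
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1
        if self.count == self.fail_at:
            raise MemoryError(f"injected failure at call #{self.fail_at}")
        return self.func(*args, **kwargs)

def harness(alloc):
    # Stand-in for the code under test: three allocation sites.
    bufs = [alloc(8) for _ in range(3)]
    return len(bufs)

results = []
for n in range(1, 5):
    alloc = FailNth(lambda size: bytearray(size), fail_at=n)
    try:
        harness(alloc)
        results.append((n, "ok"))
    except MemoryError:
        results.append((n, "injected-oom"))
print(results)  # calls 1-3 get an injected failure; call 4 never happens
```

The real version would additionally record coverage per run (to skip injections that open no new paths) and restrict injection to calls originating from the target source tree.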