Having trouble debugging a memory leak from CPython extension

juliannguyen · April 2, 2025, 12:11am

Hi all, I was redirected from the python/cpython GitHub repository to this forum for help with debugging a memory leak in my CPython extension. Here is my original question: "Having trouble debugging a memory leak from CPython extension" (Issue #131997, python/cpython on GitHub)

Thank you so much!

davidism · April 2, 2025, 12:52am

Please ask your question in full here, rather than linking to where you were redirected from.

juliannguyen · April 2, 2025, 1:26am

I’m a new user and the website only lets me post at most two links.

ZeroIntensity · April 2, 2025, 1:39am

Basically, whenever you see a CPython function in your backtrace, like method_vectorcall_VARARGS_KEYWORDS, it’s almost always a reference leak (i.e., you’re missing a Py_DECREF). Those aren’t very fun to debug. You might want to look into a profiler like Memray to track down which object is leaking, and that might give you a better idea of where to look.

There’s not much we can do to help you other than dive through your code and look for the missing Py_DECREF.
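To make the idea of a reference leak concrete, here is a minimal Python sketch. It is not the extension from this thread: `leaky_call` is a hypothetical stand-in that uses `ctypes.pythonapi.Py_IncRef` to imitate a C function that takes a reference and never releases it, and `refcount_delta` is an illustrative helper name.

```python
import ctypes
import sys

# Py_IncRef is the exported function form of the Py_INCREF macro.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None


def leaky_call(obj):
    """Hypothetical stand-in for a C function that forgets a Py_DECREF."""
    _incref(obj)


def clean_call(obj):
    """Control: touches the object but leaves its reference count alone."""
    return obj is None


def refcount_delta(func, obj, iterations=100):
    """Return how much obj's reference count changed after repeated calls."""
    before = sys.getrefcount(obj)
    for _ in range(iterations):
        func(obj)
    return sys.getrefcount(obj) - before


payload = object()
leak_delta = refcount_delta(leaky_call, payload)
clean_delta = refcount_delta(clean_call, payload)
print("leaky:", leak_delta, "clean:", clean_delta)
```

A reference count that grows in step with the number of calls is the signature to look for. This check only works when you already suspect a specific object, which is why a profiler is the better first step.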

ngoldbaum · April 2, 2025, 2:08am

In the past I’ve used a trick to fix reference leaks when I have a very simple script that triggers the leak. In a C debugger with a debug build of Python, put a hardware watchpoint on the variable in CPython that stores the total reference count. That effectively gives you a breakpoint that is triggered whenever any reference count changes. You then run a loop that triggers the memory leak on each iteration and, by looking at the stack trace each time the watchpoint is hit, manually check that every incref is matched by a corresponding decref.

Of course that only works if the total number of reference count changes per loop iteration is manageable.
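A coarser variant of the same idea can be tried from Python before reaching for a debugger. The sketch below assumes a debug build, where `sys.gettotalrefcount()` is available; on a regular build the helper returns `None`. The function name `total_refcount_growth` and the workload passed to it are illustrative only.

```python
import sys


def total_refcount_growth(func, warmup=5, iterations=100):
    """Return the change in the interpreter-wide reference count.

    Only meaningful on a debug build of CPython; returns None otherwise.
    """
    get_total = getattr(sys, "gettotalrefcount", None)
    if get_total is None:
        return None
    # Warm up so caches and lazily created objects do not skew the result.
    for _ in range(warmup):
        func()
    before = get_total()
    for _ in range(iterations):
        func()
    return get_total() - before


# Placeholder workload; replace with the call suspected of leaking.
growth = total_refcount_growth(lambda: [i for i in range(10)])
print("total refcount growth:", growth)
```

Growth that scales with `iterations` suggests a leak somewhere in the call. Unlike the watchpoint approach, this says nothing about which incref is unmatched.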

juliannguyen · April 2, 2025, 1:36pm

What’s confusing to me is the number of Python objects doesn’t seem to change throughout the lifetime of the script. I ran objgraph.show_growth() at the very start of the script and during each iteration after I call the API function, and this is what I see:

(.venv) ➜  CLIENT-3382 python3 aes_batch_write_mem_leak_batch_write.py
# First show_growth() call
function                       3215     +3215
tuple                          1901     +1901
dict                           1607     +1607
wrapper_descriptor             1140     +1140
ReferenceType                   994      +994
method_descriptor               933      +933
builtin_function_or_method      868      +868
type                            598      +598
getset_descriptor               545      +545
list                            438      +438
start check
Starting batch write
# Second show_growth() call
list              444        +6
dict             1612        +5
Write               5        +5
function         3217        +2
BatchRecords        1        +1
done with check

After the second iteration, objgraph.show_growth() doesn’t print anything, so I assume the number of objects doesn’t grow after the second iteration. The script runs for 1000 iterations.

juliannguyen · April 2, 2025, 1:40pm

If I replace objgraph.show_growth() with objgraph.get_leaking_objects(), I see similar results; the number of leaking objects doesn’t change as the script continues to run.
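One possible explanation, not confirmed in this thread, is that both objgraph calls build their results from `gc.get_objects()`, which only lists objects tracked by the garbage collector. Objects such as `bytes`, `str`, and `int` are not tracked, so leaking them leaves the object counts flat. The sketch below simulates such a leak with `ctypes.pythonapi.Py_IncRef` and compares the tracked-object count with memory measured by `tracemalloc`; `leak_untracked` and `measure` are hypothetical names.

```python
import ctypes
import gc
import tracemalloc

_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None


def leak_untracked(size=10_000):
    """Simulate an extension leaking a bytes object (not GC-tracked)."""
    _incref(bytes(size))


def measure(iterations=50):
    """Return (growth in GC-tracked objects, growth in traced bytes)."""
    gc.collect()
    tracemalloc.start()
    tracked_before = len(gc.get_objects())
    mem_before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        leak_untracked()
    gc.collect()
    tracked_after = len(gc.get_objects())
    mem_after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return tracked_after - tracked_before, mem_after - mem_before


tracked_growth, mem_growth = measure()
print("tracked objects:", tracked_growth, "bytes:", mem_growth)
```

If the tracked-object count stays flat while traced memory keeps climbing, the leaked objects are probably of a type the garbage collector does not track, and an allocation-level tool is more likely to find them than objgraph.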

juliannguyen · April 7, 2025, 1:39pm

I was able to find the leak, but it was from a different Python object than the one that Valgrind was reporting.

ZeroIntensity · April 8, 2025, 12:25am

Probably because the leaking object held a reference to the one that Valgrind was detecting.
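A small sketch of that effect, using made-up names (`Inner`, `make_outer`) and `ctypes.pythonapi.Py_IncRef` to stand in for the missing Py_DECREF: leaking only the outer list is enough to keep the inner object alive, so an allocation report can point at the inner object even though the bug is on the outer one.

```python
import ctypes
import gc
import weakref

_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None


class Inner:
    """Hypothetical object that a leak report might point at."""


def make_outer(leak):
    """Create a list holding an Inner; optionally leak the list itself."""
    inner = Inner()
    outer = [inner]
    if leak:
        _incref(outer)  # simulated missing Py_DECREF on the container
    # A weak reference lets us observe whether inner survives.
    return weakref.ref(inner)


leaked_ref = make_outer(leak=True)
clean_ref = make_outer(leak=False)
gc.collect()

print("inner alive after leaking outer:", leaked_ref() is not None)
print("inner alive without leak:", clean_ref() is not None)
```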