Debugging a possible memory leak

I think there might be a memory leak when using PyDict_SetItemString versus PyDict_SetItem (specifically this line, which has a dubious comment next to it).

My reason for thinking it’s a leak is from using the Memory Profiler package, which I was using to investigate a horrendous C extension I was writing.

I have created a minimal example which tests this out, and if I comment out the above line, the apparent leak does go away. However this isn’t a particularly robust approach, and isn’t helpful for writing tests to confirm a fix does indeed solve the issue. All tests currently pass with and without that line.

I tried using Valgrind but it didn’t seem to show any differences in the various cases I tried, not that I know what I’m doing on that front.

I would appreciate any tips on how to debug this, confirm it is/isn’t a memory leak, create some tests, and also check there are no side-effects to removing that line. Thanks!

I would run the code under a debugger and check that the refcounts make sense.

I don’t think there is one. Looking at that code, there is only a single object created via PyUnicode_FromString(), and I don’t see a code path where that function exits without Py_DECREF being called on that object.

Now, the reason I bet the memory profiler you’re using thinks there is a leak is that interning a string basically makes it live forever. So with a ton of keys in dicts you can end up with a lot of strings being kept around. Whether that is best or not is an open question, hence the comment.

  1. Remove the line
  2. Recompile
  3. Run the test suite
  4. Profit! :wink:

Basically there’s not going to be a better way to verify there aren’t any adverse effects. But you will probably want to run https://pyperformance.readthedocs.io/ before and after to see how it affects things.

Also, take into account that using memory profilers with pymalloc activated will yield, at the very least, confusing results. Python by default does not return memory to the OS until some of the least granular internal structures that it uses for managing memory (arenas) are completely free. This means that technically you may see the allocation but only see the deallocation much later (or never).

Interned strings are only deleted at Python exit by _PyUnicode_ClearInterned() (since Python 3.10; previously they were never deleted).

You can intern a string manually using sys.intern(). Once all strings are interned, the memory usage should remain stable. But while you are “interning” strings, you can see the memory growing and not going down, which is intentional.
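A small sketch of that behaviour: two equal strings built at runtime are distinct objects until interned, after which they collapse into one canonical copy held by the interned-strings table:

```python
import sys

# Build two equal strings dynamically so the compiler cannot merge them.
a = "".join(["py", "dict", " ", "key"])
b = "".join(["py", "dict", " ", "key"])
assert a == b and a is not b   # equal values, two separate objects

ia = sys.intern(a)
ib = sys.intern(b)
assert ia is ib                # both now refer to one canonical object
```

The canonical copy stays alive in the interned-strings table, which is exactly the growth a memory profiler will flag.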

PyDict_SetItemString() calls PyUnicode_InternInPlace() to make dict lookups faster. A dict lookup with an interned key is a simple pointer comparison (O(1) complexity), which is more efficient than a character-by-character string comparison (O(n) complexity).

If you call a function 10 times and the Python memory usage grows by 100 bytes, each call leaks 10 bytes on average. If you then call the function 100 times and the growth is around 1000 bytes, you’re right: it’s a leak. But if the growth is still 100 bytes or less, it’s not a leak. A leak is when every call allocates more memory.
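That heuristic can be turned into a rough automated check. A sketch using tracemalloc, where suspect() is just a placeholder for the code path you want to test:

```python
import tracemalloc

def suspect():                      # placeholder for the function under test
    d = {}
    for i in range(100):
        d[str(i)] = i               # allocates, but everything is freed on return

def growth_after(calls):
    """Return how much traced memory grew across `calls` invocations."""
    tracemalloc.start()
    suspect()                       # warm-up: let caches and interned strings settle
    before, _ = tracemalloc.get_traced_memory()
    for _ in range(calls):
        suspect()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

g10, g100 = growth_after(10), growth_after(100)
# A real leak grows roughly in proportion to the number of calls;
# a one-off allocation (like interning) settles after the warm-up.
assert g100 < 10 * max(g10, 1) + 1024
```

The warm-up call matters here: it absorbs one-time costs such as string interning, so only per-call growth remains visible.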

I suggest using tracemalloc.get_traced_memory() to get the exact Python memory usage.

I also suggest using tracemalloc to see which lines of your code leak memory. Sometimes it can be really hard to understand the cause, as with the strange interned-strings dictionary beast.
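A sketch of locating the allocating lines with a snapshot:

```python
import tracemalloc

tracemalloc.start()

leaky = []
for i in range(1000):
    leaky.append("x" * 100)             # this line keeps allocating

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")     # aggregate allocations per source line

for stat in top[:3]:                    # biggest allocators first
    print(stat)

tracemalloc.stop()
```

The line accumulating into `leaky` will dominate the statistics, which is how you can pin a suspected leak to a specific line, whether in Python code or in allocations made on behalf of a C extension.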