Can we make `_PyTraceMalloc_NewReference` support custom hooks?

As discussed in "Expose tracemalloc hook into _Py_NewReference for other tracers", I want to run some custom checks whenever _PyTraceMalloc_NewReference is called. My use case is as follows:

In our C++ layer, we are adding some Python interfaces for scripting purposes. However, there is always a chance of accidentally missing Py_DECREF, which can lead to memory leaks. For example:

PyObject* name = PyObject_GetAttrString(obj, "obj_name");
Py_RETURN_NONE;

If I forget to release name, the result is a memory leak. Once I discover a leak, I can track it down with take_snapshot and compare_to from tracemalloc. However, by the time we notice the leak, a lot of code has usually run, which makes it hard to pinpoint the exact location. And keeping tracemalloc running continuously, with its PyFrame parsing, costs too much performance.
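For reference, the snapshot workflow mentioned above looks roughly like this from the Python side. This is only a minimal sketch: find_growth, leaky_workload, and the _leaked list are hypothetical stand-ins for the scripted code and the unreleased reference; only tracemalloc.start, take_snapshot, compare_to, and stop are real API.

```python
import tracemalloc


def find_growth(workload, top=5, nframes=10):
    """Run workload between two snapshots and return the largest changes."""
    tracemalloc.start(nframes)
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        # Stop tracing so the overhead is only paid around the workload.
        tracemalloc.stop()
    # compare_to returns StatisticDiff entries, largest absolute change first.
    return after.compare_to(before, "lineno")[:top]


_leaked = []  # hypothetical: stands in for references that are never released


def leaky_workload():
    # Roughly 1 MB that stays reachable after the call returns.
    _leaked.append([bytes(1000) for _ in range(1000)])


for stat in find_growth(leaky_workload):
    print(stat)
```

The first entries point at the line that allocated the surviving memory, which is why the approach works well once the suspicious region is already narrowed down.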

So I’m wondering whether it is possible to register a PyObject when _PyTraceMalloc_NewReference is called and unregister it when tracemalloc_free is called. That way, when my program is about to exit, I will know which objects are still alive. Once I have found them, I can enable tracemalloc at the corresponding locations.

To achieve this, there are two issues that need to be resolved:

  1. How to associate the PyObject passed to _PyTraceMalloc_NewReference with the ptr passed to tracemalloc_free. From reading the code, this seems feasible:
    PyTypeObject *type = Py_TYPE(op);
    const size_t presize = _PyType_PreHeaderSize(type);
    uintptr_t ptr = (uintptr_t)((char *)op - presize);
  2. Currently there is no easy way to register my own function pointer for _PyTraceMalloc_NewReference, as there is for tracemalloc_free. Can we do something like this?
Include/internal/pycore_tracemalloc.h

+    /* trace a new reference */
+    void (*new_reference) (PyObject *op);
+
     struct tracemalloc_traceback empty_traceback;

Objects/object.c
-    if (_PyRuntime.tracemalloc.config.tracing) {
-        _PyTraceMalloc_NewReference(op);
+    if (_PyRuntime.tracemalloc.new_reference) {
+        _PyRuntime.tracemalloc.new_reference(op);

Python/tracemalloc.c
+    /* set reference trace */
+    _PyRuntime.tracemalloc.new_reference = _PyTraceMalloc_NewReference;
     /* everything is ready: start tracing Python memory allocations */
     tracemalloc_config.tracing = 1;

This does not directly answer your question, but why not use a library such as pybind11 or nanobind (https://nanobind.readthedocs.io/en/latest/)? These help automate reference counting and avoid bugs like the one you describe.

Indeed, we have tried libraries such as Boost.Python, pybind11, and others. They remove a lot of the hassle of reference counting when returning objects from C/C++ to Python. However, we currently use the native CPython C API because it is more efficient and easier to integrate, especially in cases like using CPython with Unreal Engine.

Even with these libraries, not every problem goes away. As long as we hold a PyObject in C/C++, the code is prone to reference counting bugs, which are difficult to avoid…

Higher efficiency for which kind of operations? You should be able to use the parts of nanobind that you need and combine them with hand-written CPython C API calls if you want to.

Well, you should not hold a raw PyObject, but instead a RAII wrapper such as nanobind’s object class (you can of course write your own wrapper if you prefer).

Well, as game programmers, we have a lot of game logic written in Python and do the low-level work in C/C++. This requires passing flexible data structures between the two languages while keeping efficiency high. After comparing different bindings, we settled on the native CPython interface.

I understand what you mean. We do use RAII to manage certain objects. However, if we want to access properties from these objects using interfaces like PyObject_GetAttrString, it can become problematic. Unless these properties are completely passed as parameters to C++, it becomes less flexible…

I think there is value in adding two generic hooks, one for registering a new reference and one for an object being destroyed, that tracemalloc can use but that the user can override. Independently of the OP's problem, other debuggers and profilers may use these hooks for a variety of profiling and debugging tasks.

Adding callbacks adds overhead, but that overhead is paid only when they are active (as is already the case with the current tracemalloc APIs). The change is also not very invasive, because the code for the hooks is already there for tracemalloc: we just need to add C-level APIs to set and get the callbacks and move tracemalloc over to them.

Yes, that’s what I wanted to say. If there are two generic C APIs, the performance impact will not be significant, but it will bring good extensibility to the application layer.

I think the ideal places for calling these two C APIs are _Py_NewReference and _Py_ForgetReference. However, the current implementation may not be very elegant:

  • _Py_ForgetReference only takes effect when the Py_TRACE_REFS macro is defined, while _Py_NewReference is always called. In _Py_NewReference, Py_TRACE_REFS only determines whether _Py_AddToAllObjects is called.
  • The call to tracemalloc is hardcoded in _Py_NewReference.

A clearer way to improve this could be:

  • _Py_ForgetReference and _Py_NewReference are both called unconditionally.
  • Inside these two functions, Py_TRACE_REFS controls whether the object is added to _PyRefChain.
  • These two functions expose two generic APIs, so that various profiling tools can be integrated easily. tracemalloc can use these APIs as well.

I have done some testing on my branch:

This modification seems feasible and valuable. I can register my two functions with CPython and receive callbacks, which lets me perform checks and collect statistics:

start_trace:
    PyTrace_NewReference(_trace_new_reference);
    PyTrace_ForgetReference(_trace_forget_reference);

_trace_new_reference: register the object in an unordered_map
_trace_forget_reference: unregister the object from the unordered_map

dump_stats: output the information stored in the unordered_map

The test result is:

obj = test()
trace_ref.start_trace()
trace_ref.test_leak(obj)
del obj
trace_ref.stop_trace()
trace_ref.dump_stats()

output:
	PyObj count: 1
	PyObj id: 7f72ad7494f0, type: str
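The register/unregister bookkeeping behind start_trace and dump_stats can be modeled in pure Python. This is only a toy sketch, not the C hook itself: RefRegistry and Node are hypothetical names, and weakref.finalize stands in for the forget-reference callback (so it only works for types that support weak references, which str does not).

```python
import weakref


class RefRegistry:
    """Toy model of the new-reference / forget-reference bookkeeping."""

    def __init__(self):
        self._live = {}  # id(obj) -> type name, like the unordered_map

    def new_reference(self, obj):
        key = id(obj)
        self._live[key] = type(obj).__name__
        # Plays the role of the forget-reference hook: when obj is destroyed,
        # its entry is dropped. Only a weak reference to obj is kept.
        weakref.finalize(obj, self._live.pop, key, None)

    def dump_stats(self):
        # Whatever is still registered has not been released yet.
        return sorted(self._live.items())


class Node:  # hypothetical type that supports weak references
    pass


registry = RefRegistry()
released = Node()
leaked = Node()
registry.new_reference(released)
registry.new_reference(leaked)
del released  # in CPython the refcount hits zero and the entry is removed
for obj_id, type_name in registry.dump_stats():
    print(f"PyObj id: {obj_id:x}, type: {type_name}")
```

The immediate removal on del relies on CPython's reference counting; on other interpreters the entry would disappear only after a garbage collection pass.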

The next step is to apply some filters to identify the types I want to capture and output more information.

While making these changes, I noticed that tracemalloc.h seems to be missing the following guard:

#ifdef __cplusplus
extern "C" {
#endif

As a result, even though functions such as
PyAPI_FUNC(int) PyTraceMalloc_Track
are declared, third-party C++ libraries that link against CPython and use the functions in tracemalloc.h may still hit unresolved-symbol errors, because without the guard the declarations get C++ name mangling.