Non-frame object returned when walking the frame stack in Python 3.11

Hi there. I would like to ask for some help with an issue related to frame stack walking with Python 3.11 that was reported to us by PyPI. The following traceback

AttributeError: 'code' object has no attribute 'f_code'
  File "ddtrace/internal/periodic.py", line 73, in run
    self._target()
  File "ddtrace/profiling/collector/__init__.py", line 42, in periodic
    for events in self.collect():
  File "ddtrace/profiling/collector/stack.pyx", line 492, in ddtrace.profiling.collector.stack.StackCollector.collect
  File "ddtrace/profiling/collector/stack.pyx", line 353, in ddtrace.profiling.collector.stack.stack_collect
  File "ddtrace/profiling/collector/_traceback.pyx", line 44, in ddtrace.profiling.collector._traceback.pyframe_to_frames
  File "ddtrace/profiling/collector/_traceback.pyx", line 55, in ddtrace.profiling.collector._traceback.pyframe_to_frames

is being generated by the Datadog Python profiler when attempting to sample the current frame stack of each running thread. The stack_collect method uses the following helper to collect the running threads and create a mapping between thread ID and top frame (among other details).

cdef collect_threads(thread_id_ignore_list, thread_time, thread_span_links) with gil:
    cdef dict current_exceptions = {}

    IF UNAME_SYSNAME != "Windows" and PY_MAJOR_VERSION >= 3 and PY_MINOR_VERSION >= 7:
        cdef PyInterpreterState* interp
        cdef PyThreadState* tstate
        cdef _PyErr_StackItem* exc_info
        cdef PyThread_type_lock lmutex = _PyRuntime.interpreters.mutex
        cdef PyObject* exc_type
        cdef PyObject* exc_tb
        cdef dict running_threads = {}

        # This is an internal lock but we do need it.
        # See https://bugs.python.org/issue1021318
        if PyThread_acquire_lock(lmutex, WAIT_LOCK) == PY_LOCK_ACQUIRED:
            # Do not try to do anything fancy here:
            # Even calling print() will deadlock the program has it will try
            # to lock the GIL and somehow touching this mutex.
            try:
                interp = PyInterpreterState_Head()

                while interp:
                    tstate = PyInterpreterState_ThreadHead(interp)
                    while tstate:
                        # The frame can be NULL
                        # Python 3.11 moved PyFrameObject to internal C API and cannot be directly accessed from tstate
                        IF PY_MINOR_VERSION >= 11:
                            frame = PyThreadState_GetFrame(tstate)
                            if frame:
                                running_threads[tstate.thread_id] = <object>frame
                            exc_info = _PyErr_GetTopmostException(tstate)
                            if exc_info and exc_info.exc_value and <object> exc_info.exc_value is not None:
                                # Python 3.11 removed exc_type, exc_traceback from exception representations,
                                # can instead derive exc_type and exc_traceback from remaining exc_value field
                                exc_type = Py_TYPE(exc_info.exc_value)
                                exc_tb = PyException_GetTraceback(exc_info.exc_value)
                                if exc_tb:
                                    current_exceptions[tstate.thread_id] = (<object>exc_type, <object>exc_tb)
                                Py_XDECREF(exc_tb)
                            Py_XDECREF(frame)
                        ELSE:
                            frame = tstate.frame
                            if frame:
                                running_threads[tstate.thread_id] = <object>frame
                            exc_info = _PyErr_GetTopmostException(tstate)
                            if exc_info and exc_info.exc_type and exc_info.exc_traceback:
                                current_exceptions[tstate.thread_id] = (<object>exc_info.exc_type, <object>exc_info.exc_traceback)
                        tstate = PyThreadState_Next(tstate)

                    interp = PyInterpreterState_Next(interp)
            finally:
                PyThread_release_lock(lmutex)
    ELSE:
        cdef dict running_threads = <dict>_PyThread_CurrentFrames()

        # Now that we own the ref via <dict> casting, we can safely decrease the default refcount
        # so we don't leak the object
        Py_DECREF(running_threads)

    cdef dict cpu_times = thread_time(running_threads.keys())

    return tuple(
        (
            pthread_id,
            native_thread_id,
            _threading.get_thread_name(pthread_id),
            running_threads[pthread_id],
            current_exceptions.get(pthread_id),
            thread_span_links.get_active_span_from_thread_id(pthread_id) if thread_span_links else None,
            cpu_time,
        )
        for (pthread_id, native_thread_id), cpu_time in cpu_times.items()
        if pthread_id not in thread_id_ignore_list
    )

The top frames thus collected are passed to the pyframe_to_frames helper, which performs the stack unwinding from each top frame passed to it

cpdef pyframe_to_frames(frame, max_nframes):
    """Convert a Python frame to a list of frames.

    :param frame: The frame object to serialize.
    :param max_nframes: The maximum number of frames to return.
    :return: The serialized frames and the number of frames present in the original traceback."""
    frames = []
    nframes = 0
    while frame is not None:
        nframes += 1
        if len(frames) < max_nframes:
            code = frame.f_code
            lineno = 0 if frame.f_lineno is None else frame.f_lineno
            frames.append((code.co_filename, lineno, code.co_name, _extract_class_name(frame)))
        frame = frame.f_back
    return frames, nframes

The line code = frame.f_code is where we seem to get a Python object that is not a FrameType object on some occasion, which is then responsible for the traceback shown above.

The full sources are available from GitHub - DataDog/dd-trace-py: Datadog Python APM Client (the two relevant Cython sources featured in the traceback are dd-trace-py/stack.pyx at 1.x · DataDog/dd-trace-py · GitHub and dd-trace-py/_traceback.pyx at 1.x · DataDog/dd-trace-py · GitHub)

We have double-checked all our reference counting and decremented them whenever the C API we used returned a strong reference. We haven’t been able to spot any other potential issues with our coding. In particular, we cannot see how the frame variable in the pyframe_to_frames helper could ever be something other than a FrameType (or None). Note that the traceback above is one instance of similar tracebacks, where instead of 'code' we would have some other random Python objects (e.g. 'Project' from SQLAlchemy etc…).

Assuming that we have not overlooked anything in our code, this has left us with essentially two possible conclusion:

  • this is an issue with Cython support for CPython 3.11
  • this is an issue with CPython 3.11 in the way that frame objects are created lazily.

We’re trying to reproduce this issue locally to narrow down which C API call might be returning the random object that we see in the traceback, but so far we haven’t been able to. I’m hoping that perhaps a core dev can think of something that could potentially cause this issue to appear when using certain C API functions around frame stack unwinding, while we try to gather more insight into it.