Python bindings for libpython to access VM state

sliedes · April 7, 2024, 7:08pm

To advance a slightly crazy project of mine, which is meant both to help me understand CPython internals and to verify that understanding, I’m looking for a way to dump, in a structured format, the entire state of a CPython VM that is running Python-only code (no files, threading, or sockets).

I currently have a GDB Python script, built on a heavily modified python-gdb.py, that does some of this at a specific point in VM instruction execution (I repurposed lltrace), but it is getting unwieldy, so I’m thinking about alternative approaches. I could modify CPython to call a function with the _PyInterpreterFrame at suitable points. This call would transfer control to another Python interpreter running in the same process, where I could use Python with bindings for CPython internals (which I define as everything that affects the future output of the Python program) to crawl the state of the first interpreter and output it in a structured format. Basically: all reachable objects, the call stack, functions, code objects, the current instruction, the VM stack, etc.
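To make the kind of output concrete, here is a minimal sketch of a call-stack dump done from inside a single interpreter with `sys._getframe()`. It is not the two-interpreter design described above, and it only sees what the public frame attributes expose (no VM value stack). The names `snapshot_stack`, `demo_outer` and `demo_inner` are made up for illustration.

```python
import dis
import json
import sys


def snapshot_stack(frame):
    """Return a JSON-friendly list describing `frame` and all of its callers."""
    frames = []
    while frame is not None:
        code = frame.f_code
        lasti = frame.f_lasti  # byte offset of the last instruction, or -1
        opname = dis.opname[code.co_code[lasti]] if lasti >= 0 else None
        frames.append({
            "function": code.co_name,
            "filename": code.co_filename,
            "line": frame.f_lineno,
            "lasti": lasti,
            "opname": opname,
            # Record locals by address so that identity survives serialization.
            "locals": {name: id(value) for name, value in frame.f_locals.items()},
        })
        frame = frame.f_back
    return frames


def demo_outer():
    marker = object()
    return demo_inner(marker)


def demo_inner(arg):
    return snapshot_stack(sys._getframe())


if __name__ == "__main__":
    print(json.dumps(demo_outer(), indent=2))
```

In the dump, the `arg` local of `demo_inner` and the `marker` local of `demo_outer` carry the same address, which is the sort of sharing a structured dump would need to preserve.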

First of all, can anyone think of a better approach, or of existing Python bindings for libpython? (Yes, I could also just modify CPython to do the printing, but I’d prefer to work in Python, especially until I understand the internals really well.) I’m fine with it being locked to a very specific version of CPython, although the more generic, the better. Also, performance is not a concern at all.

I’ve used ctypes a lot before, but it’s probably not enough for this. I know many other FFI frameworks exist. My gut tells me to look into SWIG for wrapping the CPython internals. Would you do something different instead? CFFI? Something more modern?
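For reference, this is roughly what the plain-ctypes route looks like for the simplest case, reading `ob_type` out of an object header. It is a sketch that assumes CPython (where `id()` is the object's address) and assumes that `ob_type` is the last field of `PyObject`, so that its offset can be derived from `object.__basicsize__`. The helper name `read_type_pointer` is made up. Doing the same for the frame and code structs means hand-maintaining every struct layout per CPython version, which is where ctypes starts to hurt.

```python
import ctypes

POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)

# Assumption: ob_type is the last field of PyObject, so its offset is
# sizeof(PyObject) minus one pointer. object.__basicsize__ is sizeof(PyObject).
OB_TYPE_OFFSET = object.__basicsize__ - POINTER_SIZE


def read_type_pointer(obj):
    """Read the raw ob_type pointer from the header of `obj` (CPython only)."""
    address = id(obj)  # In CPython, id() returns the object's memory address.
    return ctypes.c_void_p.from_address(address + OB_TYPE_OFFSET).value


if __name__ == "__main__":
    sample = [1, 2, 3]
    print(hex(id(sample)), hex(read_type_pointer(sample)), hex(id(list)))
```

The value read from memory should match `id(type(obj))`, which gives a cheap sanity check on the assumed layout before going after deeper structs.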

Or is there some much easier way to do what I want that I have overlooked, essentially to serialize the Python VM state?

sliedes · April 9, 2024, 3:32pm

After some more googling, I decided that Cython makes reading the structs relatively nice. I’m still keeping pointers as more or less real addresses in the Python mirror of the structs, since object identity does matter.
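To illustrate why the addresses are worth keeping, here is a pure-Python sketch (not the Cython code) that crawls an object graph with `gc.get_referents()` and keys every node by its address. The function name `crawl` and the output shape are made up for this example, and it only follows references that the objects' traverse functions report.

```python
import gc


def crawl(root, limit=10_000):
    """Map address -> {"type", "refs"} for objects reachable from `root`."""
    graph = {}
    pending = [root]
    # Everything visited stays alive through `root`, so addresses are stable.
    while pending and len(graph) < limit:
        obj = pending.pop()
        if id(obj) in graph:
            continue  # already recorded; this is where identity matters
        referents = gc.get_referents(obj)
        graph[id(obj)] = {
            "type": type(obj).__name__,
            "refs": [id(r) for r in referents],
        }
        pending.extend(referents)
    return graph


if __name__ == "__main__":
    shared = {"k": "v"}
    for address, node in crawl([shared, shared, 1]).items():
        print(hex(address), node["type"], [hex(r) for r in node["refs"]])
```

Because nodes are keyed by address, the dict that the list references twice shows up as a single node, whereas a dump that copied values would lose the fact that both slots point at the same object.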