PEP 768 – Safe external debugger interface for CPython

Hi everyone :wave:,

We are very excited to share with you PEP 768, which proposes adding a safe external debugger interface to CPython. We think this is a really exciting development that would allow debuggers and profilers to safely attach to running Python processes without stopping or restarting them.

The key highlight is that it would enable tools like pdb to attach to live processes by PID (similar to gdb -p), letting developers inspect and debug Python applications in real time. This capability can also be leveraged by other tools such as memory profilers, performance profilers, and other state-inspection tools.

The proposal has already been successfully implemented in PyPy (Thanks @cfbolz :heart: ). The target version is Python 3.14.

You can read the full PEP here: PEP 768 – Safe external debugger interface for CPython | peps.python.org

29 Likes

As can probably be seen from the fact that I've implemented this in PyPy already, I'm a strong +1 on this. I have wanted such a feature in PyPy for a long time, but never found a way to implement it that I was happy with, until @pablogsal described this idea to me a few weeks ago.

6 Likes

Great approach, I like it.

There’s some effort going towards making PyInterpreterState and PyThreadState structures lighter[1], so I feel like adding an unconditional 4KB to every instance might be a bit much. Any reason we can’t just make this a pointer (and perhaps allocate a static scratch buffer in PyRuntime)?

I’d also appreciate a conditional compilation option to exclude the functionality entirely from a build (one of our $work requirements for production is to disable all debug interfaces), and it should definitely raise an audit event before executing any arbitrary code (ideally including the arbitrary code itself), though that doesn’t need to be captured in the PEP.


  1. Up against the efforts to make them bigger, often by the same people :wink: ↩︎

1 Like

We want to allow the user to select which Python thread executes which code, so technically the debugger may need to orchestrate different threads executing different code, and for that we need one scratch buffer per thread.

I think 4k is probably too much, so we can certainly try to make it smaller, but unless we decide not to let different threads run different code (which I think is quite important to allow), we will need as many scratch pads as threads anyway.

One pointer per thread is fine, and assuming there’s only going to be one debugger then it’s up to them to divide up the scratch space. If they can allocate their own within the process (I’m fairly sure this is possible?) then the scratch space isn’t important at all.

Makes sense. In any case, note that there is not a lot of extra security to be gained here, because if an attacker can write to the process memory you have already lost. We can look into that for sure. Maybe a runtime option would also be useful.

Yeah, this is an excellent point. I actually think I want to include it in the PEP.

But the pointer gains you nothing, because the memory needs to be allocated already: the debugger cannot allocate memory remotely, so it needs to write somewhere that already exists. At that point there is no advantage to a pointer over an embedded scratch buffer in the structure. Indeed, it's worse, because you need an extra malloc/free call at creation time.

1 Like

Technically, yes, but in practice, not necessarily. An exploit that lets you write one block of data into a running process could let you reliably gain full execution with this feature, but I suspect you’d struggle without it. It’s not really defense-in-depth, but it is a case of not making things easier for an attacker.

Hence the global scratch space that can be written to and then used as the pointer. But we would only have one 4KB space per process, rather than 4KB per thread.

This is great!

Set _PY_EVAL_PLEASE_STOP_BIT in the eval_breaker

The _PY_EVAL_PLEASE_STOP_BIT is currently used for stop-the-world requests in the free threading build. I think it’s fine to also use it for external debuggers and disambiguate the request in _Py_HandlePending.

However, I think we’ll need a full byte (8 bits) for _PY_EVAL_PLEASE_STOP_BIT. Modifications to the eval breaker currently require atomic compare-exchanges, even with the GIL. I don’t think it’s robust against some bits being accidentally cleared, and I think process_vm_writev() can only write at the granularity of bytes. If _PY_EVAL_PLEASE_STOP_BIT has its own byte in eval_breaker, then I think it’ll be safe to overwrite it.
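To make the byte-granularity concern concrete, here is a toy illustration (not CPython code — the flag layout is invented) of why the stop bit would need its own byte: a remote writer like process_vm_writev() can only overwrite whole bytes, so touching a byte shared with other eval-breaker flags clobbers them.

```python
def write_byte(word: bytearray, index: int, value: int) -> None:
    """Simulate a byte-granularity remote write, like process_vm_writev()."""
    word[index] = value

# 32-bit eval breaker; pretend bit 1 of byte 0 is some unrelated pending flag.
breaker = bytearray([0b10, 0, 0, 0])

# If the stop bit shares byte 0, the remote write wipes the unrelated flag:
write_byte(breaker, 0, 0b01)
assert breaker[0] & 0b10 == 0  # unrelated flag was accidentally cleared

# If the stop "bit" owns byte 1, the remote write cannot disturb other flags:
breaker = bytearray([0b10, 0, 0, 0])
write_byte(breaker, 1, 1)
assert breaker[0] & 0b10 == 0b10  # unrelated flag intact
assert breaker[1] == 1
```

The same reasoning is why in-process code uses atomic compare-exchanges: they modify a single bit without ever storing a stale value over its neighbours.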

  1. Use the offsets to locate the desired thread state

How are you planning to access the thread state in a safe way? Will you acquire HEAD_LOCK() out of process?

2 Likes

You actually need 2 writes: one for the code, and one to activate the other, i.e. to ask the interpreter to read it and run it. And this happens in a very specific area of memory (so technically the requirement would be “write at least two blocks of memory to a very specific fixed area that is randomised per process”). I think this is virtually equivalent to the status quo, but I am happy to add the compilation option or the runtime option in any case :slight_smile:
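The two-write protocol described above can be sketched as follows. This is a hypothetical illustration, not the PEP's actual layout: a local bytearray stands in for the target's memory (a real debugger would use process_vm_writev() or /proc/&lt;pid&gt;/mem), and the offsets and names are invented.

```python
SCRATCH_OFFSET = 0x100   # assumed location of the pre-allocated scratch buffer
PENDING_OFFSET = 0x0     # assumed location of the "debugger pending" flag

def remote_attach(memory: bytearray, code: bytes) -> None:
    # Write 1: copy the payload into the scratch buffer that already exists
    # in the target (the debugger cannot allocate remote memory).
    memory[SCRATCH_OFFSET:SCRATCH_OFFSET + len(code)] = code
    # Write 2: flip the activation flag; the interpreter notices it at the
    # next eval-breaker check and runs the payload at a safe point.
    memory[PENDING_OFFSET] = 1

target = bytearray(0x1000)       # stand-in for the remote process's memory
remote_attach(target, b"import pdb; pdb.set_trace()")
assert target[PENDING_OFFSET] == 1
assert bytes(target[SCRATCH_OFFSET:SCRATCH_OFFSET + 6]) == b"import"
```

The ordering matters: the payload must be fully in place before the flag flips, otherwise the interpreter could race ahead and execute a half-written buffer.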

If you want, we can have an API (+ env var + whatever) that marks the memory as not writable so writing to it even if you have write permission will fail. That would need the application itself to call mprotect to remove the protections so you cannot do it from the outside.
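A minimal sketch of that hardening idea, using mprotect(2) through ctypes. This is an illustration only — an anonymous page stands in for the debugger-interface memory, and the PROT_* values are the usual Linux ones. The point is that once the page is read-only, a remote writer fails even with write permission, and only in-process code can lift the protection again.

```python
import ctypes
import mmap

PROT_READ, PROT_WRITE = 0x1, 0x2            # Linux mprotect(2) flags
libc = ctypes.CDLL(None, use_errno=True)

page = mmap.mmap(-1, mmap.PAGESIZE)         # stand-in for the interface page
buf = (ctypes.c_char * mmap.PAGESIZE).from_buffer(page)
addr = ctypes.addressof(buf)                # mmap'd memory is page-aligned

# Application opts out of external debugging: drop write permission.
assert libc.mprotect(ctypes.c_void_p(addr), mmap.PAGESIZE, PROT_READ) == 0

# Later, only the application itself can re-enable the interface:
assert libc.mprotect(ctypes.c_void_p(addr), mmap.PAGESIZE,
                     PROT_READ | PROT_WRITE) == 0
del buf
page.close()
```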

I personally would prefer to still allow one scratch pad per thread, because I think the cost makes sense (and we can make it much smaller than 4k if we need to) and it also makes attaching easier because there are fewer pointers to follow, but if we think that is too costly I am happy with the scratchpad in PyRuntime instead.

If the debugger wants to attach without races, it should first send SIGSTOP to the process or use ptrace directly (as normal debuggers do), which requires similar permissions as process_vm_readv and friends.
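The "freeze first" step is just ordinary process control. A sketch, using a `sleep` child process as a stand-in for the Python target:

```python
import os
import signal
import subprocess

def pause_for_attach(pid: int) -> None:
    """Freeze the target so remote reads/writes see a consistent snapshot.
    Requires the same kind of permissions as process_vm_readv and friends."""
    os.kill(pid, signal.SIGSTOP)

def resume(pid: int) -> None:
    os.kill(pid, signal.SIGCONT)

target = subprocess.Popen(["sleep", "30"])
pause_for_attach(target.pid)
# WUNTRACED reports the stop without reaping the child.
_, status = os.waitpid(target.pid, os.WUNTRACED)
assert os.WIFSTOPPED(status)       # target is now frozen
resume(target.pid)
target.kill()
target.wait()
```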

Will you acquire HEAD_LOCK() out of process?

Notice that in general we cannot execute code remotely (and the whole proposal is precisely to overcome this limitation) so we cannot call HEAD_LOCK or any other thing. Tools can only read and write to memory.

1 Like

We could give it its own separate field if setting the whole byte non-atomically is a problem, but in general debuggers that attach need to stop the process first via the regular methods, so the atomic write problem is not a problem in that case.

If the debugger wants to attach without races, it should first send SIGSTOP to the process or use ptrace directly (as normal debuggers do)…

Ok, that simplifies things. It might be worth specifying that in Attachment Protocol.

If the assumption is that the process is stopped, then I don’t think _PY_EVAL_PLEASE_STOP_BIT needs its own byte or field. (You don’t need atomic compare-exchanges if nobody else is writing the field.)

Re the HEAD_LOCK: it’s just a byte (_PyRuntime.interpreters.mutex) - you can read it out of process and you don’t have to execute any code. If the mutex is locked by CPython (the least-significant-bit is one), then the linked list may be in an inconsistent state. For example, some thread might be stopped in the middle of add_thread_state or tstate_delete_common.
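Since the mutex state is a single byte with the locked flag in the least-significant bit, the out-of-process check is just a bit test. A toy sketch — the remote read itself (e.g. via process_vm_readv at the right offset) is elided here:

```python
def mutex_is_locked(mutex_byte: int) -> bool:
    """LSB set means CPython holds the lock, so the thread-state linked list
    may be mid-mutation (e.g. inside add_thread_state or
    tstate_delete_common) and should not be walked yet."""
    return bool(mutex_byte & 1)

assert mutex_is_locked(0b01)
assert mutex_is_locked(0b11)           # locked, with other state bits set
assert not mutex_is_locked(0b10)       # other bits set, but not locked
```

A debugger would read the byte, and if the bit is set, back off and retry (or stop the process) before walking the interpreter/thread-state lists.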

Absolutely! Great point :+1:

Oh, then this is a very interesting point, because we could allow an even better way to attach to the process, without the need to send SIGSTOP to it. We will investigate this idea in case we can leverage it. Thanks for pointing this out!

I don’t see how. We can see if the lock is held from out of process, but we can’t lock it ourselves, as far as I can see. Without the ability to do an atomic compare-and-swap or something, there’s a TOCTOU issue there that I don’t see any way around: we check if it’s locked, we see that it isn’t, and we go to lock it ourselves, but whoopsie, someone else beat us to it and locked it from a different thread of execution and now both the debugger and a thread in the application think they own the lock. CAS lets us say “take this lock as long as no one else holds it”, but process_vm_writev doesn’t.
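The TOCTOU window described above can be modelled in a few lines. This is a toy single-threaded simulation of the interleaving, not real synchronisation code: a naive "check, then write" lets both the debugger and an application thread believe they acquired the lock, while a compare-and-swap makes exactly one of them succeed (and process_vm_writev can do the former but not the latter).

```python
lock = [0]  # 0 = unlocked, 1 = locked

# Debugger checks and sees "unlocked"...
debugger_saw_unlocked = (lock[0] == 0)
# ...but before it writes, an application thread takes the lock:
lock[0] = 1
# The debugger's blind write "succeeds" anyway, so both now think they own it:
if debugger_saw_unlocked:
    lock[0] = 1
both_think_they_own_it = debugger_saw_unlocked
assert both_think_they_own_it          # the race was lost silently

def compare_and_swap(cell: list, expected: int, new: int) -> bool:
    """Models an atomic CAS: check and write happen as one indivisible step."""
    if cell[0] != expected:
        return False
    cell[0] = new
    return True

lock = [0]
lock[0] = 1                            # application thread wins the lock first
assert compare_and_swap(lock, 0, 1) is False   # debugger's acquire fails safely
```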

1 Like

I just want to say thanks. I maintain debugpy at the moment and this would eliminate a lot of headaches we have with attaching. We wouldn’t even need any C code anymore = no more shipping exes/dlls/so files.

11 Likes

I just remembered some prior work that could maybe be mentioned in the PEP: Pyrasite is using GDB to do pretty much the same. The docs have some cool example payloads.

1 Like

I am happy to mention it for sure, but note that the approach is fundamentally different. Indeed, the strategy that pyrasite uses has the same problems as the other tools mentioned in the PEP, and those problems are covered in the text as well.

The gist is that the approach is fundamentally unsafe and can crash the process. The reason is that you cannot just inject code at arbitrary points of execution, because the process can be in an inconsistent state. Technically you can only execute async-signal-safe code, and not even that is fully safe. One motivator of this work is that I have seen too many segfaults when these tools attach :wink:

Also, pyrasite is particularly unsafe because it just grabs the GIL immediately.

Memray and other tools try to stop at “safe enough” points and then spawn a thread that takes the GIL, but this can also crash the process.

2 Likes

If it’s not prior work, then a cautionary tale :stuck_out_tongue:. I mainly liked the usage examples, honestly.

3 Likes

I will still mention it somewhere in the PEP. Maybe we can also mention the examples as a way to showcase what’s possible :wink:

1 Like