Could we give `pdb` a better awaitable story?

Motivation

async def f():
  x = g()
  y = await h()
  return x + y

def g():
  breakpoint() # B1
  do_effect()
  return 1

async def h():
  breakpoint() # B2
  y = await j(3)
  return y

async def j(x):
  return x + 1

In the above code, breakpoints work, of course. But once you’re in a pdb shell, there’s some trickiness. Even though I’m testing async-heavy code, I can’t simply type something like await j(2) into pdb.

But, of course, I would like to be able to await a value inside the debug shell.

If I’m outside of the event loop, I can get by with a call to asyncio.run. But if I’m already in the event loop, I can’t call back into it, because event loops are not re-entrant!
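To make the non-re-entrancy concrete, here is a small self-contained demonstration (reusing j from the example above) of asyncio.run refusing to be called from inside a running loop:

```python
import asyncio

async def j(x):
    return x + 1

async def inside_loop():
    coro = j(2)
    try:
        asyncio.run(coro)  # re-entering: a loop is already running here
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        return str(exc)

print(asyncio.run(j(2)))           # outside a loop: fine, prints 3
print(asyncio.run(inside_loop()))  # inside a loop: RuntimeError
```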

So this leads to a very awkward scenario where breakpoint() does give you inspection capabilities with async work, but you likely won’t be able to explore async results very well within a pdb REPL!

I use REPLs in running systems all the time to try and work through issues, and as I work more and more with async I’m finding myself not getting the full Python experience.

So, how could we get full support for working with awaitables in a pdb shell?

Difficulties

(I’m focusing on CPython for implementation discussions here)

Let’s imagine I run:

asyncio.run(h())

I end up at the breakpoint B2. How could I do something like await j(2)?

I’m in an event loop, so I can’t call into the event loop again. The top frame is the coroutine (so, effectively for CPython, a generator). await j(2) is, effectively, yield from j(2).
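That equivalence is visible from the outside: a coroutine object can be driven by hand with send(), exactly the way yield from (and the event loop) drives it:

```python
async def j(x):
    return x + 1

coro = j(2)
try:
    # Drive the coroutine by hand, the way 'yield from' or the
    # event loop would. A coroutine with no pending await
    # finishes on the first send().
    coro.send(None)
except StopIteration as exc:
    print(exc.value)  # the coroutine's return value: 3
```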

So maybe all that is needed is a way to point at a frame and say “OK, you now are yielding from this value”.

Debuggers are already able to outright change the line number in a frame by setting f_lineno, so “direct frame manipulation” is an established debugging utility. But implementing yield from j(2) would be a bit tricky.

  • One way to do it would involve outright having a special value on a frame, like “before moving forward, just yield from this”. This feels very weird. A variant on this idea would be to lean on f_trace, or to have a similar trace-like function that could pre-empt the actual bytecode execution on a frame. This is more interesting to me because f_trace is already doing something like this. An f_patch property that could replace the frame’s single stepping while it is set?

  • Another idea that I don’t think really goes anywhere would involve being able to patch code objects to, effectively, insert some bytecode right after the breakpoint to “do the yield”. There’d be a lot of things you could do with that kind of capability but it also would be super error prone.
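For reference, the f_lineno manipulation mentioned above is real and works today, though a frame’s f_lineno is only writable from inside a trace function (this is how pdb’s jump command is built). A minimal sketch:

```python
import sys

def target():
    steps = []
    steps.append("one")
    steps.append("skipped")  # the tracer jumps over this line
    steps.append("three")
    return steps

FIRST = target.__code__.co_firstlineno

def local_trace(frame, event, arg):
    # f_lineno is only assignable during a 'line' trace event.
    if event == "line" and frame.f_lineno == FIRST + 3:
        frame.f_lineno = FIRST + 4  # skip the 'skipped' append
    return local_trace

def global_trace(frame, event, arg):
    if frame.f_code is target.__code__:
        return local_trace
    return None

sys.settrace(global_trace)
result = target()
sys.settrace(None)
print(result)  # the "skipped" append never ran
```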

But all of what I said assumes that my breakpoint was in a coroutine to begin with. In that case the caller of the coroutine expects a coroutine, and can consume values yielded by it.

A harder thing:

asyncio.run(f())

with the above call, we enter f(), and in there, call g() and hit a breakpoint at B1. Now we have a problem. The top frame is not a coroutine, it’s just a function! We can’t yield values, and the caller wouldn’t understand values being yielded anyways!

In that scenario, how could we support await j(2)? We really would want to use the event loop to execute j(2), but at the same time we can’t get back to the event loop without finishing the execution of g().

So either we go through a return (executing do_effect() in the process, or … jumping over do_effect()? But that’s no longer a real debugging session), or we raise an exception (tearing down a bunch of frames in the process). Obviously these are both immensely unsatisfying.

How can I have my cake and eat it too? How could I yield back to the event loop (hopefully to give it a chance to deal with a future I care about), without losing my valuable context?

A Proposal

Here is an idea for CPython to make this debugging story better. Please forgive the naming.

  • Introduce a new opcode, CALL_SUSPEND. It works like CALL, but before entering the callee’s frame, it first pushes True onto the caller’s stack (for reasons that will become clear).

  • Introduce another new opcode, SUSPEND. This opcode, when executed, does the following:

    • Traverses back through the stack frames, looking for a frame currently paused in a CALL_SUSPEND. If no frame exists in that state, an exception is raised.
    • Pops the True currently on top of the CALL_SUSPEND frame’s stack, then pushes False, then the current frame, onto that stack.
    • Sets the current frame to the CALL_SUSPEND frame.
  • Introduce a final opcode, RESUME_SUSPEND. It expects a frame to be on the stack. Like CALL_SUSPEND, it will first push True onto the stack. It then sets the frame taken from the stack to be the current frame.

In practice, what does that mean?

  • CALL_SUSPENDing a function that executes normally will end up with [True, retval] on the stack. The first value indicates that the call completed. The second is the return value.
  • CALL_SUSPENDing a function that SUSPENDs leaves you with [False, frame] on the stack. The first value indicates that the call did not complete/was suspended. The second value is the frame that was suspended.
  • RESUME_SUSPEND just sets the frame back in place. But SUSPEND could be called again, so we should still prep the stack in the same way, to be able to tell if suspensions happened.

On top of these opcodes, asyncio could provide a function, suspend, that just does SUSPEND:

asyncio.suspend() # frame gets suspended

We would then have call_with_suspension, as an interface to CALL_SUSPEND:

result = asyncio.call_with_suspension(func, *args, **kwargs)

result could either be Completed(done=True, result=retval), or Suspended(done=False, frame=frame).

And from there, we could also have resume_suspension(suspension):

result = asyncio.resume_suspension(suspension)

(Passing in the Suspended object instead of the frame directly feels useful for “disincentivising weird programs” reasons)
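To make the shape of that API concrete, here is a sketch of what the (entirely hypothetical) Completed/Suspended result types could look like; nothing here exists in asyncio today:

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical result types for the proposed
# asyncio.call_with_suspension; these are illustrative only.
@dataclass
class Completed:
    result: Any
    done: bool = True

@dataclass
class Suspended:
    frame: Any  # the suspended frame object
    done: bool = False

def handle(outcome):
    # Caller-side dispatch on the two possible outcomes.
    if outcome.done:
        return f"finished with {outcome.result!r}"
    return "suspended; resume later via resume_suspension"

print(handle(Completed(result=3)))
print(handle(Suspended(frame=None)))
```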

If an event loop slightly modifies its callback handling to use call_with_suspension/resume_suspension (which, a bit awkwardly, kind of looks like generators…), and a debugger smartly uses create_task and friends, then pdb could support the following kind of flow:

  • you hit a breakpoint
  • you await j(2) (even in a sync frame!), which would create the coroutine from j(2), create a task from it, hold onto a reference to the task, then SUSPEND
  • the event loop gets control, probably schedules your in-progress task to later, handles other tasks (including j(2))
  • if RESUME_SUSPEND is called, the debugging frame is brought back into existence. the trace function kicks in, and the debugger could then check the state of j(2). If it’s not done yet, SUSPEND again! Since you’re a debugger you could probably be a bit smart with timeouts or the like.
  • If j(2) is complete, of course then the value could be inspected as you would expect

So concretely, your debugger code could look like the following:

def do_await_command(expr):
  """
  (pdb) await j(2) 
  -> do_await_command("j(2)")
  """
  coro = eval_in_frame(expr, current_frame)
  task = asyncio.create_task(coro)
  while not task.done():
    # maybe put some timeout logic here
    asyncio.suspend()
  # do what you want to do with the task
  print(task.result())

Of course the above is a very simplified version, but I think it holds the core.
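Since the protocol “kind of looks like generators”, the call_with_suspension/resume_suspension contract can be simulated today in pure Python, with a generator’s yield standing in for SUSPEND. This is purely illustrative; the point of the real proposal is that it works on arbitrary frames, which generators cannot do:

```python
def call_with_suspension(gen_func, *args, **kwargs):
    # Simulated CALL_SUSPEND: run until the callee suspends or returns.
    gen = gen_func(*args, **kwargs)
    return _step(gen)

def resume_suspension(gen):
    # Simulated RESUME_SUSPEND: re-enter the suspended "frame".
    return _step(gen)

def _step(gen):
    try:
        next(gen)
        return (False, gen)       # [False, frame]: suspended
    except StopIteration as exc:
        return (True, exc.value)  # [True, retval]: completed

def worker(x):
    yield  # stands in for asyncio.suspend()
    yield  # suspending more than once is fine too
    return x + 1

done, state = call_with_suspension(worker, 2)
while not done:
    done, state = resume_suspension(state)
print(state)  # → 3
```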

Objections

There are many objections to this idea, of course.

  • This gives continuations!

This is obviously super powerful for doing really weird control flow. Python’s generator story (and coroutine story!) has a static quality to it: a function body determines whether it can be suspended (via async def or the presence of yield). This proposal provides an arbitrary suspension point that totally breaks the idea that calling f() runs f to completion.
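That static quality is visible through introspection: whether a function can suspend is decided by its body at compile time, not by anything at the call site:

```python
import inspect

def plain():
    return 1

def gen():
    yield 1

async def coro():
    return 1

# Suspendability is baked into the code object, not decided at runtime:
print(inspect.isgeneratorfunction(plain))  # False
print(inspect.isgeneratorfunction(gen))    # True
print(inspect.iscoroutinefunction(coro))   # True
```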

I think it would be good to present this as a debugging utility (in the same way that f_lineno exists but, to my knowledge, isn’t used to write weird libraries). But people will try weird stuff.

  • This is too many extra parts for one thing!

This is 3 opcodes to be able to call await in pdb. I think it’s super important to be able to call await in pdb (or, more concretely, to be able to wait for a task to complete). And I think this sort of functionality would likely unblock other ideas in debug tooling.

  • This is not at all detailed enough to provide a good impression of the work involved! I need a POC!

I have spent a while looking over parts of CPython to try and figure this out, and I hope this is a detailed enough sketch to make people at least have an opinion. But I have no idea how this would play with things like the JIT work.

Conclusion

I want to have a better debugging story for seeing async task results. I think if we have a way to suspend a frame (including sync frames!), then debuggers could use that infrastructure to provide a good debugging story.

But I do not hold a strong preference for how we get better debugging. I just think better debugging is very important, and I did not manage to find prior art.


While I agree with trying to make pdb play more nicely with async applications, I don’t think this is the right way. It should be possible to get a better starting point by enabling top-level await in pdb and seeing if that’s sufficient, and in what ways it isn’t.

For a direct comparison, you can see the effects of this with the asyncio REPL (python -m asyncio)

This would allow “just use await coro_func() directly”. There might need to be some decisions around this in terms of the expected behavior of that and the underlying event loop from pdb as well, but these decisions shouldn’t need new opcodes (it might mean python -m pdb --with-async or that pdb knows to check for a running event loop the first time it sees “await …” and inform you if there isn’t one running)

This is a good thing in working code, but I do understand how in this very specific case, it appears not to be.


Thanks for the reply. There could be a first step here of simply having pdb support top level await outside of an event loop run. Though if I’m perfectly honest, this would not be usable for any of my pdb runs in recent history (usually with a Django request context or other in-event-loop run).

One thing I am considering on that front is the idea of saying await f() in pdb would call loop.call_soon in a way to try really hard to still have the current frame accessible when that call actually resolved (even if in the meantime all of your state has changed). Just feels like it would be really magical behavior and unobvious UX to type await f() and have the result of that be “we are going to continue executing as-is, and then the event loop will pop you back into pdb after we’ve gotten back to it”.


I do still posit that without re-entrant event loops it’s not possible for set_trace-hooked tooling to wait for a future in a way that doesn’t move you forward in the traced function an uncomfortable amount, but maybe I lack imagination here!

I guess a minimal way of posing the question is:

async def secret(key):
  if key != "MY_KEY":
    raise ValueError("Key mismatch")
  return "MY_SECRET"

async def get_key():
  return "MY_KEY"

async def decrypt_files(secret):
  if secret != "MY_SECRET":
    raise ValueError("Incorrect Secret")
  return "file contents"

def get_secret(key):
  ...

async def do_decryption():
  key = await get_key()
  secret = get_secret(key)
  return await decrypt_files(secret)

def run_decryption():
  ...  # do something to "run" `do_decryption`

Is there an implementation of run_decryption and get_secret such that I can run do_decryption to completion? If so, then there is likely also a way for pdb to get nice await behavior (with cooperation from the event loop runner). If not… I don’t know if pdb.set_trace could handle most usages of await clearly (because if pdb could handle await, then you could resolve this problem by calling pdb.set_trace in get_secret).
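For what it’s worth, one implementation that runs to completion does exist today: have get_secret block its thread while the coroutine runs on a second event loop in another thread. It satisfies the letter of the question, but it stalls the outer loop while blocked (no other tasks advance), which is exactly the compromise under discussion. A sketch, assuming that blocking is acceptable:

```python
import asyncio
import concurrent.futures

async def secret(key):
    if key != "MY_KEY":
        raise ValueError("Key mismatch")
    return "MY_SECRET"

async def get_key():
    return "MY_KEY"

async def decrypt_files(secret):
    if secret != "MY_SECRET":
        raise ValueError("Incorrect Secret")
    return "file contents"

def get_secret(key):
    # Block this sync frame by running the coroutine to completion
    # on a *different* thread's event loop. The outer loop's thread
    # is stalled meanwhile -- no other tasks advance.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, secret(key)).result()

async def do_decryption():
    key = await get_key()
    secret_value = get_secret(key)
    return await decrypt_files(secret_value)

def run_decryption():
    return asyncio.run(do_decryption())

print(run_decryption())  # "file contents"
```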

yeah, this falls under the part where I wasn’t sure if this should be a default.

Technically, in a scenario where you have a running event loop already, this should work:

$future = asyncio.create_task(coro())
$future.add_done_callback(lambda f: pdb.set_trace())
continue

and then once you get popped back to it, you’ll be able to use up (at least if I remember correctly, and it’s possible I’m not)

The problem with anything short of that is coroutines implemented with the expectation of running in a specific event loop may not work properly if you were to try and manually step through while preventing other execution within the event loop. Some cases would entirely hang while waiting on something not allowed to advance.


The problem with anything short of that is coroutines implemented with the expectation of running in a specific event loop may not work properly if you were to try and manually step through while preventing other execution within the event loop.

To be clear about what I think is reasonable, I think that, in my magical ideal, while my traced task shouldn’t move forward while I’m await-ing some future within pdb, other tasks in the event loop moving forward does not bug me at all.

If you were limiting yourself to tracing in async functions, and didn’t want to step forward, then if you did:

async def foo():
    do_thing_1()
    await pdb_session()
    do_thing_2()
    do_thing_3()

You could, inside of a session, await a bunch (though I don’t think you would be able to step to do_thing_2()/do_thing_3() and then await stuff). You could imagine this being “I can await, but once I step forward I lose await capabilities”. Weird, but honestly it would maybe be enough for me, and it requires no real event loop shenanigans (I think).
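A toy version of that “the session is itself a coroutine, so awaiting is legal” idea, with the interactive input loop replaced by a list of expressions (eval-based and purely illustrative):

```python
import asyncio

async def mini_session(commands, namespace):
    # Stand-in for an awaitable debug shell: because this runs in an
    # async frame, awaiting user expressions is perfectly legal.
    results = []
    for src in commands:
        value = eval(src, namespace)
        if asyncio.iscoroutine(value):
            value = await value
        results.append(value)
    return results

async def j(x):
    return x + 1

async def foo():
    # do_thing_1() ...
    out = await mini_session(["j(2)", "40 + 2"], {"j": j})
    # do_thing_2() ...
    return out

print(asyncio.run(foo()))  # [3, 42]
```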

I might at least experiment with that for now.


Another thing this is making me consider… just writing a re-entrant event loop for this. I already am doing some work on a custom event loop for debugging some other issues, and of course in that universe I can just subclass pdb and do what I need to do.

Looking at BaseEventLoop there is a bit of a feeling that whoever wrote the class did leave just enough abstraction in place to let somebody show up with such an event loop… :smirk:

In any case I’ll explore this space some more, and I will at least look a bit into the amount of work involved in at least getting top-level await to work outside of the event loop.

Thanks for looking over what I’ve written, it’s helpful to think about this.
