Add asyncio monitoring hooks

Similar previous discussion

As an APM vendor (Datadog), I’d love to be able to provide deeper insights/metrics about the asyncio event loop for our customers.

Some of the metrics I have been thinking about so far are:

  • Over the lifetime of the process
    • total number of tasks created/finished
    • total number of context switches
    • min/avg/p95/p99/max (etc.) blocking time per task run
  • Per-task
    • total number of context switches
    • total blocking vs idle time

I’d also like to be able to associate the per-task metrics with any active APM tracing spans (“span” = duration + metadata measuring a function or arbitrary code block). This way I can show when a traced function is responsible for blocking the event loop for a certain period of time (a similar concept to the existing debug logging in asyncio).

I wasn’t able to figure out a great way to get these metrics from the existing implementation/API (if I missed something, please let me know!).

I made an attempt at a reference implementation to validate one potential approach.

This approach adds sys.set_async_monitoring_hooks/sys.get_async_monitoring_hooks, which allow users to register hooks for the following events:

  • task_enter(task) - Called from Modules/_asynciomodule.c:enter_task, when a task is being context-switched to.
  • task_leave(task) - Called from Modules/_asynciomodule.c:leave_task, when a task is being context-switched away from.
  • task_register(task) - Called from Modules/_asynciomodule.c:register_task.
  • task_unregister(task) - Called from Modules/_asynciomodule.c:unregister_task.

task_register and task_unregister don’t appear to be called during “normal” usage of standard-library-only asyncio, so I am not convinced of their value here, but I added them to experiment with.

Example usage

This is a very rudimentary example to show the high level API usage.

import asyncio
import sys


def task_enter(task: asyncio.Task):
    print("Entering task:", task.get_name())


def task_leave(task: asyncio.Task):
    if task.done():
        print("Task completed:", task.get_name())
    else:
        print("Leaving task:", task.get_name())


sys.set_async_monitoring_hooks(
    task_register=None,
    task_enter=task_enter,
    task_leave=task_leave,
    task_unregister=None,
)


async def task():
    pass


async def main():
    await asyncio.gather(*[task() for _ in range(100)])


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
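To make the intended use more concrete, here is a rough sketch (not part of the reference implementation) of a collector that could sit behind these hooks and accumulate some of the per-task metrics listed above. Only the task_enter/task_leave hook names come from the proposal; the TaskMetrics class and everything inside it are illustrative.

```python
import time
from collections import defaultdict


class TaskMetrics:
    """Illustrative collector for the proposed task_enter/task_leave hooks."""

    def __init__(self):
        self.switches = defaultdict(int)    # per-task context switch count
        self.running = defaultdict(float)   # per-task time spent on the loop
        self._entered = {}                  # task -> monotonic enter timestamp

    def task_enter(self, task):
        # A switch *to* this task: count it and note when it started running.
        self.switches[task] += 1
        self._entered[task] = time.monotonic()

    def task_leave(self, task):
        # A switch *away from* this task: accumulate how long it held the loop.
        started = self._entered.pop(task, None)
        if started is not None:
            self.running[task] += time.monotonic() - started


# With the proposed API this would be wired up roughly as:
#   metrics = TaskMetrics()
#   sys.set_async_monitoring_hooks(
#       task_enter=metrics.task_enter,
#       task_leave=metrics.task_leave,
#       task_register=None,
#       task_unregister=None,
#   )
```

From metrics like these, the process-lifetime aggregates (totals, percentiles of run durations) fall out by summarizing across tasks.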

Why event hooks vs built-in metrics?

An obvious alternative here is to have asyncio maintain a set of metrics and expose them via an API that can be polled when the data is needed.
I think this is a good idea in the long run, but adding the hooks first could be a useful intermediate step: it lets the community experiment with generating their own metrics, which we can then use to validate exactly which ones are valuable enough to build in.

As well, for some of my own specific use cases, built-in metrics alone would not be enough. For an APM tracing product, I’d like to be able to compute/access these metrics on a per-task basis while the task is still active, so I can associate them with any currently active spans (I basically need to access contextvars from the currently active task).
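For illustration, here is a rough sketch of that span association, assuming the tracer stores its active span in a ContextVar. SpanBlockingRecorder and current_span are hypothetical names; a real hook would also need a defined way to reach the task’s context (e.g. Task.get_context(), available since Python 3.12), which I simulate below by running the "task" inside a copied context.

```python
import contextvars
import time

# Hypothetical: the tracer sets this when a span starts inside a task.
current_span = contextvars.ContextVar("current_span", default=None)


class SpanBlockingRecorder:
    """Sketch: attribute each task run's duration to the active span."""

    def __init__(self):
        self._entered_at = {}

    def task_enter(self, task):
        self._entered_at[task] = time.monotonic()

    def task_leave(self, task):
        blocked = time.monotonic() - self._entered_at.pop(task)
        # Assumes the hook fires inside the task's context; a real
        # implementation could read task.get_context() (3.12+) instead.
        span = current_span.get()
        if span is not None:
            span["blocked_s"] = span.get("blocked_s", 0.0) + blocked
```

This is exactly the piece that polling a built-in metrics API could not give me: the hook runs while the task is current, so the task’s contextvars (and therefore the active span) are still reachable.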

Prior Art

Greenlet offers a settrace hook that allows you to register a function that gets called every time a context switch occurs.

Example:

import greenlet

def callback(event, args):
    if event in ('switch', 'throw'):
        origin, target = args
        print("Transfer from %s to %s with %s"
              % (origin, target, event))
              
greenlet.settrace(callback)

I may have missed something, but I had trouble consistently identifying both the current and the next task, which would be needed to offer a similar single-hook API (e.g. callback(from_task, to_task)).
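One workaround, sketched below under the assumption that task_leave always fires before the next task_enter, would be to pair the two proposed hooks and synthesize a greenlet-style callback(origin, target) in user code; SwitchTracer is a hypothetical name.

```python
class SwitchTracer:
    """Sketch: build callback(origin, target) from task_leave/task_enter."""

    def __init__(self, callback):
        self._callback = callback
        self._last_left = None  # task most recently switched away from

    def task_leave(self, task):
        self._last_left = task

    def task_enter(self, task):
        # origin is None when we enter from outside any task
        # (e.g. directly from the event loop).
        self._callback(self._last_left, task)
        self._last_left = None
```

Whether task_leave is in fact guaranteed to precede the next task_enter is exactly the kind of ordering the hooks’ documentation would need to pin down.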


Hey, @orf, @dimaqq, @steve.dower, I hope you don’t mind me pinging you on this, since you all proposed/discussed a similar topic before I’d be curious to hear your feedback!

Yes, please, I’d like to see async/await instrumentation possible 🙂

The approach you highlighted makes sense to me; it’s reasonably lightweight, yet potentially provides just enough information.

Debuggers would need other form of instrumentation, but that’s OK.

I think, perhaps, some restrictions may be necessary on the callbacks. For example, is it OK for the callback to cancel some task synchronously? What if it’s the task that’s being switched to? I’d rather ban task modifications in the callbacks entirely 🤔

Hey @dimaqq what type of instrumentation do you have in mind?

Do you have any examples of how you would utilize these hooks?

This is a good call out. I am trying to think of use cases where you need to modify the tasks (outside of reading/writing contextvars data), but I don’t have anything concrete.

My little attempt at instrumentation is now dead, due to changes in how the pc (program counter) is updated in the stack frame between Python 3.9 and 3.10.

It’s archived here: GitHub - dimaqq/awaitwhat: Await, What? (shameless plug).

The idea was to take a snapshot of all tasks and coroutines and show what is waiting for what.

It was originally inspired by trying to debug a large async/await program that would be stuck or semi-stuck in production only (much traffic, long time to lockup).

A debugger may be interested in the very same – suspend the user program, and show the equivalent of all stack traces in a multithreaded program, except for asyncio, for all tasks.


Wrt. your proposal, all I can think of right now is, in fact, telemetry.