After running a few asyncio applications in production, I think there is an opportunity to improve how we monitor them, specifically around gathering statistics about how the event loop is performing.
A key indicator of trouble with your asyncio application might be slow callbacks, which are currently logged as a warning only if the event loop has debugging enabled. Other useful signals might be the number of tasks currently registered with the loop, or the total time the loop took to run an iteration.
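For reference, this is roughly all you can get today: turn on debug mode and lower the loop's `slow_callback_duration` threshold, and asyncio will log a warning for each slow callback. A small sketch that captures those warnings (the function name and blocking sleep are just for illustration):

```python
import asyncio
import logging
import time

def run_with_slow_callback_logging():
    """Run a loop in debug mode and capture the slow-callback
    warnings asyncio already emits today."""
    messages = []

    class Capture(logging.Handler):
        def emit(self, record):
            messages.append(record.getMessage())

    logging.getLogger("asyncio").addHandler(Capture())

    async def main():
        loop = asyncio.get_running_loop()
        # The default threshold is 100 ms; lower it so the
        # deliberate blocking below trips the warning.
        loop.slow_callback_duration = 0.05
        time.sleep(0.1)  # block the loop on purpose

    asyncio.run(main(), debug=True)
    return messages
```

The limitation is clear: the only output is a log line ("Executing ... took 0.100 seconds"), which you then have to scrape if you want it as a metric.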
Right now it’s impossible to get these metrics without doing something similar to what is suggested here. I don’t hate the ideas in that post but I feel that there must be a better and more generic way of measuring the performance of an event loop, and I think that this kind of thing should be built into asyncio itself.
A lot of companies have custom monitoring systems that need to be integrated with asyncio - for example, if you use statsd you might want to increment a counter every time the event loop stalls, or even log a message (but that might backfire horribly). These metrics would help with diagnosing production asyncio issues - say callers are seeing increased latency from a service, and you can see that loop callbacks started to exceed the threshold around the same time, or that the number of tasks in the loop went through the roof.
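Absent built-in hooks, one stopgap is a heartbeat task that measures how late the loop wakes it up - if the loop is busy or blocked, `sleep()` returns later than requested. The `report` callable stands in for whatever metrics call you'd actually use (e.g. a statsd timing), since the client is an assumption here:

```python
import asyncio
import time

async def monitor_loop_lag(interval=0.25, report=print):
    """Periodically measure how far behind schedule the loop is.

    `report` is a placeholder for a real metrics call such as
    statsd.timing("service.loop.lag", lag).
    """
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything the loop spent blocked beyond the requested
        # interval shows up as extra elapsed time here.
        lag = time.perf_counter() - start - interval
        report(max(lag, 0.0))
```

You would start this once at application startup with `asyncio.ensure_future(monitor_loop_lag())`. It only samples, though - it can't attribute a stall to the specific callback that caused it.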
So I’m proposing adding a couple of callbacks to the event loop to help with this:
- a callback that is triggered when a loop callback takes more than some threshold to run
- a callback that is triggered at some point in the event loop lifecycle - maybe every iteration, or when a task is added/removed?
I think the first one would be simple to implement and I’m imagining an API something like this:
```python
def stall_callback(callback, time_taken):
    statsd.increment("service.loop.stalls", time_taken)

def tick_callback(loop):
    statsd.gauge("service.loop.total_tasks", len(loop.something?))

loop = asyncio.get_event_loop()
loop.register_stall(stall_callback)
loop.register_tick(tick_callback)
```
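Pending anything built in, the stall half of this can be approximated today by wrapping the loop's private `_run_once` - a CPython implementation detail, so this is a fragile sketch rather than something I'd ship, and it times whole loop iterations rather than individual callbacks:

```python
import asyncio
import time

def instrument_loop(loop, stall_callback, threshold=0.1):
    """Time each loop iteration by wrapping the (private,
    CPython-specific) _run_once method; call stall_callback
    with the duration whenever it exceeds the threshold."""
    original = loop._run_once

    def timed_run_once():
        t0 = time.monotonic()
        original()
        dt = time.monotonic() - t0
        if dt >= threshold:
            stall_callback(dt)

    loop._run_once = timed_run_once
    return loop
```

That this works at all only by reaching into private internals is, I think, the argument for making it a supported hook.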
The contract would state that the callbacks need to be fast (in the example above the
statsd library might push onto an unbounded queue that’s drained by another thread to prevent blocking).
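That hand-off could look something like this sketch - the loop-side call is just a queue put, and a daemon thread does the potentially slow I/O (the `emit` callable stands in for a real statsd client method):

```python
import queue
import threading

class AsyncStatsSink:
    """Hand metrics off to a background thread so that a loop
    callback only ever pays for a queue.put()."""

    def __init__(self, emit):
        # `emit` is the real (possibly blocking) metrics call,
        # e.g. a statsd client's increment method.
        self._queue = queue.SimpleQueue()  # unbounded, never blocks
        self._emit = emit
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, name, value):
        # Safe to call from event loop callbacks: non-blocking.
        self._queue.put((name, value))

    def _drain(self):
        while True:
            name, value = self._queue.get()
            self._emit(name, value)
```

The unbounded queue trades memory for latency: if the metrics backend falls behind, the queue grows instead of the event loop stalling, which seems like the right trade for a monitoring path.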
Does anyone have any comments about this? How have people monitored things like this in production already?