After running a few asyncio applications in production, I think there is an opportunity to improve how we monitor them, specifically around gathering statistics about how the event loop is performing.
A key indicator of trouble with your asyncio application might be slow callbacks, which are currently logged as a warning only if the event loop has debugging enabled. Other useful signals might be the number of tasks currently registered with the loop, or the total time the loop took to run an iteration.
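For reference, this is roughly all you can get today: turn on debug mode and lower the loop's `slow_callback_duration` threshold, and asyncio will log a warning for each slow callback. A small sketch that captures those warnings (the function name and blocking sleep are just for illustration):

```python
import asyncio
import logging
import time

def run_with_slow_callback_logging():
    """Run a loop in debug mode and capture the slow-callback
    warnings asyncio already emits today."""
    messages = []

    class Capture(logging.Handler):
        def emit(self, record):
            messages.append(record.getMessage())

    logging.getLogger("asyncio").addHandler(Capture())

    async def main():
        loop = asyncio.get_running_loop()
        # The default threshold is 100 ms; lower it so the
        # deliberate blocking below trips the warning.
        loop.slow_callback_duration = 0.05
        time.sleep(0.1)  # block the loop on purpose

    asyncio.run(main(), debug=True)
    return messages
```

The limitation is clear: the only output is a log line ("Executing ... took 0.100 seconds"), which you then have to scrape if you want it as a metric.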
Right now it’s impossible to get these metrics without doing something similar to what is suggested here. I don’t hate the ideas in that post but I feel that there must be a better and more generic way of measuring the performance of an event loop, and I think that this kind of thing should be built into asyncio itself.
A lot of companies have custom monitoring systems that need to be integrated with asyncio - for example, if you use statsd you might want to increment a counter every time the event loop stalls, or even log a message (but that might backfire horribly). These metrics would help with diagnosing production asyncio issues - say callers are seeing increased latency from a service, and you can see that loop callbacks started to exceed the threshold around the same time, or that the number of tasks in the loop went through the roof.
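Absent built-in hooks, one stopgap is a heartbeat task that measures how late the loop wakes it up - if the loop is busy or blocked, `sleep()` returns later than requested. The `report` callable stands in for whatever metrics call you'd actually use (e.g. a statsd timing), since the client is an assumption here:

```python
import asyncio
import time

async def monitor_loop_lag(interval=0.25, report=print):
    """Periodically measure how far behind schedule the loop is.

    `report` is a placeholder for a real metrics call such as
    statsd.timing("service.loop.lag", lag).
    """
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything the loop spent blocked beyond the requested
        # interval shows up as extra elapsed time here.
        lag = time.perf_counter() - start - interval
        report(max(lag, 0.0))
```

You would start this once at application startup with `asyncio.ensure_future(monitor_loop_lag())`. It only samples, though - it can't attribute a stall to the specific callback that caused it.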
So I’m proposing adding a couple of callbacks to the event loop to help with this:
- a callback that is triggered when a loop callback takes more than some threshold to run
- a callback that is triggered at some point in the event loop lifecycle - maybe every iteration, or when a task is added/removed?
I think the first one would be simple to implement and I’m imagining an API something like this:
```python
def stall_callback(callback, time_taken):
    statsd.increment("service.loop.stalls", time_taken)

def tick_callback(loop):
    statsd.gauge("service.loop.total_tasks", len(loop.something?))

loop = asyncio.get_event_loop()
loop.register_stall(stall_callback)
loop.register_tick(tick_callback)
```
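Pending anything built in, the stall half of this can be approximated today by wrapping the loop's private `_run_once` - a CPython implementation detail, so this is a fragile sketch rather than something I'd ship, and it times whole loop iterations rather than individual callbacks:

```python
import asyncio
import time

def instrument_loop(loop, stall_callback, threshold=0.1):
    """Time each loop iteration by wrapping the (private,
    CPython-specific) _run_once method; call stall_callback
    with the duration whenever it exceeds the threshold."""
    original = loop._run_once

    def timed_run_once():
        t0 = time.monotonic()
        original()
        dt = time.monotonic() - t0
        if dt >= threshold:
            stall_callback(dt)

    loop._run_once = timed_run_once
    return loop
```

That this works at all only by reaching into private internals is, I think, the argument for making it a supported hook.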
The contract would state that the callbacks need to be fast (in the example above the
statsd library might push onto an unbounded queue that’s drained by another thread to prevent blocking).
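That hand-off could look something like this sketch - the loop-side call is just a queue put, and a daemon thread does the potentially slow I/O (the `emit` callable stands in for a real statsd client method):

```python
import queue
import threading

class AsyncStatsSink:
    """Hand metrics off to a background thread so that a loop
    callback only ever pays for a queue.put()."""

    def __init__(self, emit):
        # `emit` is the real (possibly blocking) metrics call,
        # e.g. a statsd client's increment method.
        self._queue = queue.SimpleQueue()  # unbounded, never blocks
        self._emit = emit
        threading.Thread(target=self._drain, daemon=True).start()

    def record(self, name, value):
        # Safe to call from event loop callbacks: non-blocking.
        self._queue.put((name, value))

    def _drain(self):
        while True:
            name, value = self._queue.get()
            self._emit(name, value)
```

The unbounded queue trades memory for latency: if the metrics backend falls behind, the queue grows instead of the event loop stalling, which seems like the right trade for a monitoring path.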
Does anyone have any comments about this? How have people monitored things like this in production already?