Evaluating the state of the art of Python asyncio with the "Cancellation in Systems" paper

This thread compares asyncio in more detail with the paper I mentioned in a prior thread. I’m not proposing that anything be added to asyncio right now on the basis of this paper; I just want to share where we are doing well and where there is room to improve in the future.

The paper does not mention Python asyncio at all; it presents case analyses of cancellation-related bugs in other languages (C#, Java, Go) and applications (Cassandra, Hadoop, etc.). Its value is that it provides a structured basis for discussing which kinds of issues to consider when improving asyncio.

TL;DR: Thanks to the core contributors and the community, Python asyncio has built a fairly good baseline for writing complex concurrent applications. I think it is insightful to look back at asyncio’s achievements using the paper’s arguments as a reference point.

| Reference in the paper | Short description | Status of asyncio |
|---|---|---|
| §5.1.2, §5.1.3 | Broken trigger checks, excess cancel (tasks unexpectedly overlap because previously launched ones are not cancelled, which may happen with async timers) | :warning: No intrinsic utility for timers or periodic tasks[1] (though including timer APIs in asyncio may be controversial). / :white_check_mark: Scoped cancellation in complex task trees is partly covered by TaskGroup. |
| §5.2.1 | Untimely delivery - cancel race | :warning: asyncio.shield() can be used to circumvent duplicate cancel requests and protect ongoing cancellations[2], but the event loop still unexpectedly shuts down all ongoing tasks regardless of whether they are shielded. |
| §5.2.1 | Untimely delivery - late polling | :white_check_mark: asyncio is based on inherently “cooperative” coroutines, so it should be fine as long as all dependencies are async-aware (i.e., no hidden blocking calls). / :thinking: The programmer’s caution is still the main means of preventing this issue, which is one reason why writing asyncio apps is difficult. |
| §5.3.1, §5.3.2 | Cancel not checked/carried out | :white_check_mark: The situation has improved a lot since asyncio.CancelledError was made a subclass of BaseException (since Python 3.8). / :warning: A structured way of ensuring clean-up of long-running tasks is still missing (particularly with TaskGroup)[3] |
| §5.4 | Cancel mechanisms - cancellable task interface | :white_check_mark: asyncio.Task and asyncio.Future |
| §5.4 | Cancel mechanisms - uninterruptible interface | :white_check_mark: asyncio.shield() partly corresponds to this, in that it blocks the cancellation signal. |
| §5.4, §5.3.3 | Cancel mechanisms - task dependency tracking | :white_check_mark: asyncio.TaskGroup :tada: |

  1. aiotools’ timer module has TimerDelayPolicy to express “cancel the prior task if it has not completed by the time the next wait interval finishes”.

  2. I once had to shield all database transactions in web request handlers, because cancellation in the middle of __aexit__() of aiopg’s transaction blocks had caused unreleased database connections and exhaustion of the connection pool.

  3. This is why I’ve suggested PersistentTaskGroup, to cover the use case of asyncio.gather(..., return_exceptions=True).
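The shielding described in footnote 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: `commit_and_release` and `handler` are hypothetical stand-ins and do not use aiopg's actual API. It shows that cancelling a handler interrupts an unshielded clean-up, while a clean-up wrapped in asyncio.shield() keeps running to completion.

```python
import asyncio

released = []  # records which hypothetical "connections" went back to the pool


async def commit_and_release(conn_id: str) -> None:
    # Hypothetical stand-in for a transaction block's __aexit__():
    # commit first, then release the connection.
    await asyncio.sleep(0.2)
    released.append(conn_id)


async def handler(conn_id: str, protect: bool) -> None:
    if protect:
        # shield() runs the clean-up in its own task; cancelling the handler
        # cancels only the outer wait, not the clean-up itself.
        await asyncio.shield(commit_and_release(conn_id))
    else:
        await commit_and_release(conn_id)


async def main() -> None:
    tasks = [
        asyncio.create_task(handler("unshielded", protect=False)),
        asyncio.create_task(handler("shielded", protect=True)),
    ]
    await asyncio.sleep(0.01)  # let both handlers reach their clean-up
    for task in tasks:
        task.cancel()  # e.g. the client disconnected
    await asyncio.gather(*tasks, return_exceptions=True)
    await asyncio.sleep(0.4)  # keep the loop alive so the shielded clean-up can finish


asyncio.run(main())
print(released)
```

Note that the final sleep matters here: the shielded clean-up only completes because the loop stays alive long enough for it.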

What is “sota” (a word you used in the thread title)?

Looking at the paper a bit more it seems heavily focused on cancellation in thread-based systems, where race conditions are more common than in asyncio.

You claim that asyncio doesn’t have a utility for timers. I wonder if you aren’t aware of call_later or if you have some reason to discount it? It would seem to be simple enough to create a periodic task from this.
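For illustration, one possible way to build a periodic callback on top of call_later is sketched below. The helper name `call_periodically` and its `remaining` counter are assumptions made for this example, not part of asyncio; only `loop.call_later()` and `asyncio.get_running_loop()` are real asyncio calls.

```python
import asyncio


def call_periodically(loop, interval, callback, remaining):
    """Invoke callback now, then re-arm the timer until `remaining` runs out."""
    if remaining <= 0:
        return
    callback()
    # call_later() returns a TimerHandle; storing it would allow the series
    # to be stopped early with handle.cancel().
    loop.call_later(
        interval, call_periodically, loop, interval, callback, remaining - 1
    )


async def main():
    ticks = []
    loop = asyncio.get_running_loop()
    # First tick fires after one interval; three ticks in total.
    loop.call_later(
        0.01, call_periodically, loop, 0.01, lambda: ticks.append(len(ticks)), 3
    )
    await asyncio.sleep(0.5)  # keep the loop running while the timer fires
    return ticks


ticks = asyncio.run(main())
print(ticks)
```

The callback here is a plain function. If it instead launched a coroutine with `loop.create_task()`, nothing in this sketch would stop a slow task from overlapping with the one started on the next tick.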

Regarding your own timer code, it seems a bit over-complicated: if the delay policy is CANCEL, fired_tasks will have 0 or 1 element, while if the delay policy is DEFAULT, fired_tasks is never needed. (Maybe this is a remnant of an earlier version of your code?) In any case there seem to be different ways to time a cat. :slight_smile:

Regarding the issue you mention about shield() not protecting against loop shutdown, maybe you can just file an issue about that? (With a small example that reproduces the problem, please.)
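A small reproducer along these lines might look like the sketch below. It is an assumption about what the reported problem looks like, not the original poster's code. It relies on the fact that asyncio.run() cancels all remaining tasks once the main coroutine returns: the shield protects `critical()` (a hypothetical name) from the timeout, but not from that shutdown step.

```python
import asyncio

events = []


async def critical():
    try:
        await asyncio.sleep(1)
        events.append("finished")
    except asyncio.CancelledError:
        events.append("cancelled")
        raise


async def main():
    inner = asyncio.ensure_future(critical())
    try:
        # The timeout cancels only the shield's outer future;
        # `inner` keeps running.
        await asyncio.wait_for(asyncio.shield(inner), timeout=0.01)
    except asyncio.TimeoutError:
        events.append("outer timed out")
    # main() returns here while `inner` is still pending, so asyncio.run()
    # cancels it during shutdown.


asyncio.run(main())
print(events)
```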

He seems to have meant it as an abbreviation of “state of the art”.

As @corona10 clarified, I meant SOTA (state-of-the-art). Sorry for the confusion.

Thanks for having a look at it. That code is very old and has hardly been updated since. :see_no_evil:
I had a production issue where tasks accumulated without bound and made the server unresponsive, because each task created on a timer tick took longer than the timer’s interval. That’s why I wrote this rudimentary abstraction. I just want to share my experience with asyncio in production systems.[1]
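The accumulation problem can be sketched as follows. This is a simplified illustration, not aiotools' TimerDelayPolicy implementation; `slow_job`, `run_timer`, and the `cancel_previous` flag are hypothetical names. It counts how many jobs run at the same time when each job outlives the tick interval, with and without cancelling the previous job first.

```python
import asyncio

active = 0  # number of slow_job() invocations currently running
peak = 0    # highest value `active` has reached in the current run


async def slow_job():
    global active, peak
    active += 1
    peak = max(peak, active)
    try:
        await asyncio.sleep(1.0)  # far longer than the tick interval
    finally:
        active -= 1


async def run_timer(interval, ticks, cancel_previous):
    global active, peak
    active = peak = 0
    tasks = []
    for _ in range(ticks):
        if cancel_previous and tasks and not tasks[-1].done():
            # Cancel the previous job and wait until it has actually stopped.
            tasks[-1].cancel()
            await asyncio.gather(tasks[-1], return_exceptions=True)
        tasks.append(asyncio.create_task(slow_job()))
        await asyncio.sleep(interval)
    # Clean up whatever is still running when the timer stops.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return peak


peak_default = asyncio.run(run_timer(0.01, 4, cancel_previous=False))
peak_cancel = asyncio.run(run_timer(0.01, 4, cancel_previous=True))
print(peak_default, peak_cancel)
```

Without cancellation, every tick adds one more concurrent job; with it, at most one job runs at a time, at the cost of never finishing a job that is slower than the interval.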

I’m not claiming that “we need to add some new abstraction for timers right now”. I only want to provoke some ideas by comparing what we have in asyncio with a survey paper that highlights the importance of careful handling of cancellations. An academic paper serves as a good reference point.

Yes, that’s true. I also wanted to see more discussion in the paper of coroutine-based systems such as C#, Kotlin, Python, and JavaScript. Still, I think the cancellation issues discussed in the paper, such as what kinds of abstractions and support are required from the language, runtime, and libraries, reflect the same concerns as task cancellation in asyncio. The paper is about cancellation rather than race conditions.


  1. My main project, Backend.AI, is now 7 years old and earns real money from enterprise customers in multiple countries, and it relies heavily on the asyncio ecosystem.

Okay, I’ve updated the title yet again (I haven’t seen that acronym ever before) but also, I think I’ve given you all the discussion I have to give. I’ll leave it to others to continue the discussion.
