Cancelable tasks cannot safely use semaphores

This will be a bit of a longer post. After what I wrote yesterday, I realized that calling the entire ecosystem wrong about cancellation wasn’t explained in enough detail to justify such a drastic statement. What’s wrong is documentation claiming that try/finally is enough, or that cancellation is an easy problem; the existence of task cancellation is not itself a problem.

I do have some issues with the concept of uncancel as well as with TaskGroups, but I’m going to put a pin in those for a moment and just clarify the problems of cancellation for now, and why it isn’t enough, or even desirable, to try to make all code handle cancellation itself.


First, a thought experiment

How would you write an async websocket library that doesn’t know whether the user wants cancellation?

To test how your design does, consider the following two applications:

  • Imagine that one application using your library makes non-mutating websocket requests, and that the expected response is unchanging data to show in a GUI. When this application closes, the websocket connection is meaningless, and the most important thing is that the application doesn’t feel sluggish to respond or close.

  • Imagine another application that wants to use your websocket library to log application errors and crash reports to a crash/log analytics server. These should be sent even as the application is closing, and it should be reasonably possible for the application to catch an unhandled error, log it, close its remaining resources, then inform the user that there was an unexpected error the application cannot recover from.

In this lower-level scenario (further down the application’s call stack), you don’t have the information to decide whether cancellation is the better outcome for the application, so you have two options: (1) provide two separate APIs, a configuration flag, or some other way to pick between implementations that do and don’t shield ongoing work, or (2) trust that the caller has the information to decide whether they would rather structure their application around waiting on your library or canceling your library at any given point.

While it’s true that any asyncio task can be canceled at any unshielded yield point (and I would not argue to change this), the important distinction here is that determining whether cancellation is the correct (and safe, in the context of the application) option is the responsibility of the code doing the canceling, not the task being canceled.
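To make the second option concrete, here’s a minimal sketch (all names hypothetical, with a plain coroutine standing in for a real websocket call) of two callers imposing their own cancellation policy on the same library function: one cancels freely, the other uses asyncio.shield so that canceling the caller would not abort the in-flight send.

```python
import asyncio

async def send(payload: str) -> str:
    # Hypothetical stand-in for a websocket library's send/receive call.
    await asyncio.sleep(0.05)
    return f"sent:{payload}"

async def main() -> list[str]:
    results = []

    # Caller 1 (the GUI app): cancels freely, responsiveness matters most.
    task = asyncio.create_task(send("gui-refresh"))
    await asyncio.sleep(0)  # let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        results.append("canceled")

    # Caller 2 (the crash reporter): shields the in-flight request so a
    # cancellation of *this* coroutine would not abort the send itself.
    inner = asyncio.ensure_future(send("crash-report"))
    results.append(await asyncio.shield(inner))
    return results

print(asyncio.run(main()))  # ['canceled', 'sent:crash-report']
```

The library code is identical in both cases; the policy lives entirely in the caller.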


On to the ability to break synchronization

Where this becomes a problem is with async context managers, async generators that hold resources (see PEP 533), and OS interrupts raised as exceptions. The last of those is a problem even without asyncio, and it surprises people on occasion; a recent example of someone else discovering this is quoted here:

The quoted post has some other interesting observations about the design of context managers and the difference between managers implemented in extension code and in Python, but for more details, the fact that critical sections can be broken this way is documented as part of the signal module’s documentation here
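A deterministic sketch of the hazard (the `Resource` class is hypothetical, and the interrupt is simulated by raising KeyboardInterrupt directly rather than arriving asynchronously from the OS): if the interrupt lands inside `__enter__`, after the resource is acquired but before the with block’s protection is in place, `__exit__` never runs.

```python
class Resource:
    def __init__(self):
        self.open = False
        self.closed = False

    def __enter__(self):
        self.open = True
        # Simulate Ctrl+C arriving mid-__enter__: the resource is already
        # acquired, but the with block hasn't registered __exit__ yet.
        raise KeyboardInterrupt
        return self

    def __exit__(self, *exc):
        self.closed = True
        return False

res = Resource()
try:
    with res:
        pass
except KeyboardInterrupt:
    pass

print(res.open, res.closed)  # True False -- acquired but never released
```

A real interrupt can land between any two bytecodes, so the same leak can happen without any cooperation from the resource class; this just pins the timing down for demonstration.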


So, how can people handle this and even leverage this?

Once you get past knowing all of the above, there’s a lot of freedom in Python’s concurrency, and the high-level tools provided start working for you. Handling OS signals isn’t actually difficult in a highly concurrent system, but it is something you may need to do more often than you expect if you’re used to just letting Python raise an exception anywhere, and are now designing something where you want to intercept the signal and handle it without raising an exception.
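As a sketch of what intercepting a signal without an exception can look like (the handler name is hypothetical, and `signal.raise_signal` is used to simulate Ctrl+C in-process):

```python
import signal

shutdown_requested = False

def request_shutdown(signum, frame):
    # Record the request instead of letting Python raise
    # KeyboardInterrupt at an arbitrary point in the program.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGINT, request_shutdown)

# Simulate Ctrl+C; in a real program this arrives from the OS.
signal.raise_signal(signal.SIGINT)

while not shutdown_requested:
    pass  # the handler runs between bytecodes, so this loop exits

print("shutdown requested:", shutdown_requested)
```

Nothing here ever raises, so no critical section can be torn open by the signal; the main loop decides when and how to act on the flag.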

If you want a graceful shutdown that results in finishing the work you started, but no more, it typically means structuring the application into multiple phases:

  • setup
  • {check for new work and shutdown signals, do work that’s been enqueued already}
  • {stop receiving new work, finish started work}
  • choose an appropriate action for work that is held by your process now, but had not been started at the time of receiving a signal[1]
  • shutdown

This gives you pretty strong guarantees and even a bounded job failure rate due to worker shutdown (bounded by the number of concurrent jobs handled by a worker and the number of shutdowns).

A typical description of reliability:

“If the worker terminated with a normal exit code, all jobs it took are finished; if it exited with an abnormal exit code, at worst the most recent N jobs taken are possibly unfinished.”

Paired with good observability, this can even allow either automated or manual recovery of failed jobs, but that gets into larger system design questions.


Sure, but which part of this would you like an example for? KeyboardInterrupt breaking context managers and other critical sections is covered above, but I don’t mind putting an example together for any of the other parts later.


  1. Put it back in the central job queue? Log it? Raise a user-facing error and tell them to try again, assuming a new worker will be responsible for it? Do the work before shutdown? Only your application’s design can decide what’s right here. ↩︎


Ah, sorry, this was more directed to @bcmills for the original issue of tasks not releasing their semaphores correctly when cancelled.
