Tracking the source of cancellation in tasks

This is a split-out thread from a prior thread.

From the perspective of a task, cancellation is observed as asyncio.CancelledError raised from an await statement or async structs such as async for or async with.

I had some hard-to-debug errors with long-running tasks silently cancelled in production systems. While debugging, I wanted to keep track of the source of cancellation – e.g., is it from mis-triggered shutdown signals or mis-propagation of internal cancellations in the task tree?
Another case is to distinguish whether the task is cancelled by its own reason or by a taskgroup when a sibling task has propagated an unhandled exception to the taskgroup.

Usually when I write a “cancellable” task, I often use the following pattern:

async def mytask():
  try:
    async with some_resource():
      ...
  except Exception:
    log.exception("unexpected error in mytask")
  except asyncio.CancelledError:
    pass

This makes it hard to debug the siutations mentioned above, because the task “silently” does nothing and terminates upon any cancellation error from the inside.

I’d like to change it something like:

async def mytask():
  try:
    async with some_resource():
      ...
  except Exception:
    log.exception("unexpected error in mytask")
  except asyncio.CancelledError as e:
    log.debug("mytask is cancelled by %s", e.source)

By inspecting the exception object and traceback, we can determine which await or async-struct statement in the body has raised the cancellation (i.e., where it is cancelled), but not who ultimately triggered the cancellation. The information “where it is cancelled” is less important as long as all the resources are properly released upon cancellation, and Python asyncio provides good means to this: context managers. When debugging, the information “who triggered cancellation” is often much more important.

My initial idea is to add a source attribute to asyncio.CancelledError which represent the name of module/function path which called Task.cancel(), e.g., "async_timeout.timeout", "myapp.server.shutdown", etc. Of course, I think the core devs would have better realization ideas.
We may need to consider stacking the source of cancellations like exception’s __cause__ attribute, when cancellation triggers another cancellation.

What about your thoughts?

In most other exceptions in Python, usually the location of where the exception is raised is the most important information for debugging.

asyncio’s cancellation is a special case here: the cancellation error is often “injected” by something outside of the current task’s stack, and the exact location where it is raised inside the cancelled task is a floating information and less important for debugging.

The same argument may extend to the generic coroutine protocol in Python: the exceptions raised by coro.throw() should include the information who threw the exception because the reference to the coroutine may be passed around in many places and yield from will expose it to the upper stack of the caller.

I suppose if you could convince us that this is an important feature we could just stop deprecating the msg argument to Task.cancel() an Future.cancel(), and change all .cancel() calls in asyncio to include the caller’s function name. There are about 60 such calls. Some probably don’t need this (because the cancellation is purely internal) but it might require a lot of analysis to decide which calls don’t need it.

I have to admit that the reason to deprecate it was not very strong – it was felt that it causes needless complexity in the API and its implementation, and there are ambiguities if a task is cancelled multiple times. I also believe that we found places in the code where the msg value was dropped – I think the pattern was that some code would catch CancelledError and then pass the cancellation on to some other task by calling its .cancel() method without passing the msg value. (Presumably this was code written before 3.9, when the msg argument was added, and not updated. Though it could also be written later by someone who wasn’t aware of the msg convention.)

You can read more about this in gh-90985.

I do still have doubts about the usefulness of your proposal. There are plenty of other situations where the origin of an asynchronous interruption is not revealed – for example, KeyboardInterrupt, and signal handlers. In the discussion, Serhiy also brought up other examples where the origin of some data that causes an exception is lost.

You bring up gen.throw() – I am not sure the situation is similar there, the exception argument should already record from where it is being thrown. If you want to argue about this more please start a new thread in a different topic, it has nothing to do with async.

PS. Possibly @cjerdonek is interested in reviving this discussion.

2 Likes

If possible, it would be better to automatically catch the reference name of .cancel() caller (maybe inspecting the stack frames?) instead of manually fixing all existing .cancel() calls. Even though we fix all stdlib usage, 3rd party libraries are not covered. So I’d like to say, I don’t mind the deprecation of msg argument as long as we have some other means to keep track of cancellation trigger sources.

I agree with the decision for its deprecation in that it may be abused and requires an extra care by 3rd-party library authors to add it to custom task wrappers or task implementations. My suggestion is to add an auto-populated source attribute to the CancelledError class, when it is injected by Task.cancel().

KeyboardInterrupt is a global showstopper and we just don’t care of who raised it – usually it’s the user hitting Ctrl+C – and we just need to clean up everything as soon as possible. :wink: GeneratorExit, yet another BaseException, is related to the generator function itself and has clear intentions and semantics about where it comes from and what it affects (raising up automatically changes it to RuntimeError).

However, CancelledError requires more attention because it is a scoped showstopper which affects the stack of a task, not the entire application (process or thread). Since tasks may cancel other tasks in various contexts, I think we should be able to inspect the full interactions between them.

Please forget my argument about coro.throw() for now. Let’s focus on task cancellation.

BTW, I have an alternative suggestion if you feel it’s too much to modify the CancelledError interface: How about adding sys.audit() calls (PEP-578) to asyncio in general?
It’s because my goal is to make “who triggered cancellation” visible to the debuggers and sys.audit() is what we already have for this purpose. Potential audit targets would include asyncio.create_task() / TaskGroup.create_task(), Task.cancel(), upon successful termination of a task, loop.stop(), and so on.

I say if we’re going to keep some form of cancellation message around we should build on the existing API, not replace it, if we can. In particular we should keep it so that the first argument to CancelledError is a human-readable string. We should probably also keep it so that calling fut.cancel(msg) sets that string to msg. But we could make it so that if fut.cancel() is called without argument it produces the default argument as you suggested.

However you haven’t provided a particularly strong argument for the added complexity of a cancel message in the first place. You’ve just said that you think it would have helped in debugging. And I haven’t heard from anyone else who confirmed that they have been having trouble debugging “mystery cancellations”. So I am still skeptical that this is worth it. And I am pretty sure that there are still plenty of bugs lurking in the existing asyncio code base where the cancellation message is accidentally dropped.

Regarding usining sys.audit() instead: AFAIK that is a security interface and I don’t think that the security audit logs need an entry for every task creation or cancellation.

Please forget about the cancel message but focus on the title of this thread: “Tracking the source of cancellation in tasks”. That is what I’d like to have support in asyncio.

It may be implemented using the cancel message, a new default argument/attribute of CancelledError, or sys.audit() calls, or whatever else that would be more appropriate.
I’m asking for the core devs and community’s opinion about what could be the best option.

The motivation of this request is as follows:

  • We need to increase debuggability and visibility of how coroutines and tasks work when debugging complex systems written in asyncio.
  • In thread/process-based concurrency, the location of stack is the most important information of an exception.
  • In coroutine-based concurrency (asyncio), both the location of stack where the exception is raised AND another stack where the exception is injected are important.
  • I’d like to have the full visibility of interactions between coroutines that involves logic-flow control of a task: currently we have the only one: cancellation.
  • When writing my own PersistentTaskGroup and its test cases, using the cancel message parameter for this purpose helped me a lot to debug whether the task is cancelled from outside or sibling or itself.

Example: (with a monkey-patch to async_timeout as well)

diff --git a/src/aiotools/taskgroup/persistent_compat.py b/src/aiotools/taskgroup/persistent_compat.py
index 357580f..1d56c1d 100644
--- a/src/aiotools/taskgroup/persistent_compat.py
+++ b/src/aiotools/taskgroup/persistent_compat.py
@@ -145,7 +145,7 @@ class PersistentTaskGroup:
         self._aborting = True
         for t in self._tasks:
             if not t.done():
-                t.cancel()
+                t.cancel(msg='explicit shutdown')

     async def shutdown(self) -> None:
         self._trigger_shutdown()
@@ -166,7 +166,7 @@ class PersistentTaskGroup:
             return ret
         except asyncio.CancelledError:
             if fut is not None:
-                fut.cancel()
+                fut.cancel(msg='cancel taskgroup scope')
             raise
         except Exception as e:
             # Swallow unhandled exceptions by our own and
diff --git a/async_timeout/__init__.py b/async_timeout/__init__.py
index 4188a98..a5866fc 100644
--- a/async_timeout/__init__.py
+++ b/async_timeout/__init__.py
@@ -216,7 +216,7 @@ class Timeout:
         return None

     def _on_timeout(self, task: "asyncio.Task[None]") -> None:
-        task.cancel()
+        task.cancel(msg="timeout")
         self._state = _State.TIMEOUT
         # drop the reference early
         self._timeout_handler = None

I’d like to have the same effect without manually patching all existing .cancel() calls in the asyncio stdlib and 3rd-party libraries by fixing one place: the cancel() method itself. Moreover, a human-provided message string does not guarantee any kind of structure, so if we could have a structured data (such as the caller’s module:function path string) that may be used by codes, programmatically, it would be much more useful.

I have asked for others’ opinions in a more popular topic (Can I get a second opinion on an asyncio issue?).

2 Likes