Server-oriented task scope design

achimnol · May 21, 2024, 5:42pm

As a part of the PyCon US 2024 sprints, I’d like to open a discussion thread on task scope design improvements, which I’ve been suggesting since last year.

Reference

Revisiting PersistentTaskGroup with Kotlin's SupervisorScope
- Suggestion of a non-cancelling “persistent” task group variant (like SupervisorScope)
Add `task_factory` to asyncio.start_server and friends
- Question about setting a specific task factory for all tasks spawned inside a task group

Last year, @guido suggested that I write and submit a PR to extend the existing TaskGroup API, but I could not finalize my own thoughts about the design of the exception handling interface. To continue the discussion and catch up with what is new in Python 3.13, I’d like to open a new thread.

There are already people who understand and agree with the motivation of this issue, but I’m re-phrasing it for others in the community. Ideally I would implement all of the suggested ideas and request reviews myself, but that would take a long time, as I have not been able to dedicate enough time to in-depth programming in recent years (I’m running a company as CTO). I’d like to strongly encourage other people to participate in the development and discussion.

My experiments in 2023

aiotools.TaskScope and aiotools.TaskContext
- refactor: Introduce `TaskScope` and `TaskContext` with `cancel_and_wait()` by achimnol · Pull Request #58 · achimnol/aiotools · GitHub
- TaskContext is a simple weakset container of a set of tasks, which can be cancelled together explicitly. It is the base class of TaskScope and TaskGroup.
- TaskScope is a TaskGroup variant that keeps running even when individual subtasks are cancelled or fail.
aiotools.as_completed_safe()
- It is guarded by a task scope, and generates a series of results from tasks.
- Python 3.13 now has a new, almost identical, async-generator-based asyncio.as_completed().
aiotools.race()
- A happy-eyeballs primitive with two modes of exception handling, written using as_completed_safe():
  - Store and defer errors; continue until we have the first successful result
  - Stop and raise upon the first error
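To make the two modes concrete, here is a minimal sketch built only on the stdlib `asyncio.as_completed()`. It is not the actual `aiotools.race()` implementation; the name `race_sketch` and the `defer_errors` flag are assumptions made for illustration, and `ExceptionGroup` requires Python 3.11 or later.

```python
import asyncio


async def race_sketch(aws, *, defer_errors=True):
    """Return the first successful result among the awaitables (hypothetical helper)."""
    tasks = [asyncio.ensure_future(aw) for aw in aws]
    if not tasks:
        raise ValueError("race_sketch() needs at least one awaitable")
    errors = []
    try:
        for next_done in asyncio.as_completed(tasks):
            try:
                return await next_done
            except Exception as exc:
                if not defer_errors:
                    raise  # mode 2: stop and raise upon the first error
                errors.append(exc)  # mode 1: store it and keep waiting
        # Every candidate failed: surface all stored errors at once.
        raise ExceptionGroup("all candidates failed", errors)
    finally:
        # Cancel the losers and wait until they have really finished.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    async def fails_fast():
        await asyncio.sleep(0.01)
        raise ConnectionError("first address unreachable")

    async def succeeds_later():
        await asyncio.sleep(0.05)
        return "connected"

    return await race_sketch([fails_fast(), succeeds_later()])
```

With `defer_errors=False`, the same `demo()` inputs would instead propagate the `ConnectionError` from the faster candidate.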

The overall theme here is improving support for long-running server applications.
asyncio.TaskGroup does its job well on a short-lived group of “run-to-completion” tasks.

Abstracting server applications as a tree of nested task “scopes”

An example web server:

The root task scope
- Application “A” task scope serving /myapp-a
  - Handler task scope: API handler coroutines for /myapp-a/*
  - Timer task scope: periodically executed tasks (timers)
  - Background task scope: long-running async operations
- Application “B” task scope serving /myapp-b
  - Handler task scope: API handler coroutines for /myapp-b/*
  - Timer task scope: periodically executed tasks (timers)
  - Background task scope: long-running async operations
- …

In this design, the lifecycle of a database connection pool may be bound to application task scopes or the root task scope. We could insert more task scope levels to make the shutdown ordering explicit.

When shutting down, we need to shut down the task scopes in order, from the innermost ones to the outermost ones. It would be nice to be able to signal the root task scope and perform this in a single call. We also need to set timeouts on shutting down individual task scopes and the entire tree, before moving to a “forced” shutdown.
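A rough sketch of that shutdown order, assuming a hypothetical `ScopeNode` class (none of these names exist in asyncio or aiotools): each node shuts down its children first, then cancels its own tasks and waits for them with a per-scope timeout. The "forced" phase is only a placeholder here, since what forcing should mean is part of the open design question.

```python
import asyncio


class ScopeNode:
    """Hypothetical tree node owning some tasks and some child scopes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.tasks = set()
        if parent is not None:
            parent.children.append(self)

    def spawn(self, coro):
        task = asyncio.create_task(coro)
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)
        return task

    async def shutdown(self, timeout, order):
        # Innermost first: sibling child scopes shut down concurrently,
        # and this scope only proceeds once all of them are done.
        await asyncio.gather(*(c.shutdown(timeout, order) for c in self.children))
        pending = set(self.tasks)
        for task in pending:
            task.cancel()
        if pending:
            _, stuck = await asyncio.wait(pending, timeout=timeout)
            if stuck:
                # Placeholder for the "forced" phase: here we only stop waiting.
                order.append(f"{self.name} (gave up on {len(stuck)} task(s))")
                return
        order.append(self.name)


async def demo():
    root = ScopeNode("root")
    app_a = ScopeNode("app-a", parent=root)
    handlers = ScopeNode("handlers", parent=app_a)
    timers = ScopeNode("timers", parent=app_a)
    for scope in (root, app_a, handlers, timers):
        scope.spawn(asyncio.sleep(3600))
    await asyncio.sleep(0)  # let the tasks start
    order = []
    # An overall deadline for the entire tree on top of the per-scope one.
    await asyncio.wait_for(root.shutdown(1.0, order), timeout=5.0)
    return order
```

A single call on the root is enough to walk the whole tree; the `order` list is only there to make the resulting sequence observable.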

aiohttp partially implements a concept like this by separating shutdown / cleanup signal handlers and context managers. (ref: Web Server Advanced — aiohttp 3.13.2 documentation) I believe this should be expanded and generalized to structure all asyncio server apps.

A handling interface for continuous stream of results and errors

Currently, we only have the global fallback exception handler set for the entire event loop: loop.set_exception_handler()

For long-running task scopes, we need to be able to set up a different exception handler for each scope.
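As a sketch of what a per-scope handler could look like, the hypothetical `HandledScope` below routes failures of its own tasks to a handler given at construction time. The context dict mimics the shape passed to the loop-level handler, which is just one possible choice and not a settled interface.

```python
import asyncio


class HandledScope:
    """Hypothetical scope that reports task failures to its own handler."""

    def __init__(self, name, handler):
        self.name = name
        self._handler = handler
        self._tasks = set()

    def spawn(self, coro):
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._on_done)
        return task

    def _on_done(self, task):
        self._tasks.discard(task)
        if task.cancelled():
            return
        exc = task.exception()  # also marks the exception as retrieved
        if exc is not None:
            # Similar keys to the context of loop.set_exception_handler().
            self._handler({
                "message": f"task failed in scope {self.name!r}",
                "exception": exc,
                "task": task,
            })

    async def join(self):
        while self._tasks:
            await asyncio.wait(set(self._tasks))


async def demo():
    seen = []
    scope = HandledScope("timers", seen.append)

    async def broken_timer():
        raise ValueError("bad schedule")

    scope.spawn(broken_timer())
    scope.spawn(asyncio.sleep(0))  # a sibling that finishes normally
    await scope.join()
    return seen
```

The failing task does not affect its sibling, and the loop-wide fallback handler is never involved because the exception is retrieved inside the scope.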

Tasks in task scopes for server apps usually do not return results, as their job is to reply to the clients or complete background work. However, there may be other use cases that need to retrieve results and exceptions in a structured manner, such as spawning a group of long-running run-to-completion jobs together.

A generator-based as_completed()-style API could help here, but I’d like to hear more opinions from language experts about the abstraction/design of such streamed results and failures.

TODO: check if the 3.13’s asyncio.as_completed() supports continuation upon exceptions.
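One quick way to probe this: with the plain `for` form of `asyncio.as_completed()`, an exception surfaces when the yielded item is awaited, not from the iterator itself, so a `try`/`except` inside the loop body should let the iteration continue. The helper name `collect_all` is made up for this sketch, and it does not exercise the 3.13-only `async for` form.

```python
import asyncio


async def collect_all(aws):
    """Drain as_completed(), recording failures instead of stopping."""
    tasks = [asyncio.ensure_future(aw) for aw in aws]
    results, errors = [], []
    for next_done in asyncio.as_completed(tasks):
        try:
            results.append(await next_done)
        except Exception as exc:
            errors.append(exc)  # the loop simply moves on to the next item
    return results, errors


async def demo():
    async def job(value, fail=False):
        await asyncio.sleep(0.01 * value)
        if fail:
            raise RuntimeError(f"job {value} failed")
        return value

    return await collect_all([job(1), job(2, fail=True), job(3)])
```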

Ensuring all directly/indirectly spawned tasks inside a task scope belong to their parent task scope

As the other thread mentioned above reveals, to control the tasks indirectly spawned in 3rd-party libraries with nested task scopes, we should be able to customize the task factory (low-level), have task-scope-wide context variables, or let all indirectly spawned tasks belong to the “current” task scope.

This feature is best implemented in the stdlib asyncio so that it covers all 3rd-party libraries, but I think it may be possible to implement it as a task factory. The problem is that it is difficult to register and compose multiple different task factories in a single asyncio loop (e.g., when combining with aiomonitor). This leads to another discussion like Request: Can we get a c-api hook into PyContext_Enter and PyContext_Exit - #5 by brettlangdon, or to having finer-grained hooks on the task lifecycle.
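For illustration, a task factory along these lines might look like the sketch below. `loop.get_task_factory()` and `loop.set_task_factory()` are real asyncio APIs; `current_scope` and `install_scope_tracking` are hypothetical names. The factory wraps whatever factory was installed before, which only composes if every library is equally polite, and that is exactly the fragility described above.

```python
import asyncio
import contextvars

# Hypothetical: the set of tasks owned by the "current" scope, if any.
current_scope = contextvars.ContextVar("current_scope", default=None)


def install_scope_tracking(loop):
    """Wrap the existing task factory instead of replacing it."""
    previous = loop.get_task_factory()

    def factory(loop, coro, **kwargs):
        if previous is not None:
            task = previous(loop, coro, **kwargs)
        else:
            task = asyncio.Task(coro, loop=loop, **kwargs)
        # The factory runs in the context of the code calling create_task().
        scope = current_scope.get()
        if scope is not None:
            scope.add(task)
            task.add_done_callback(scope.discard)
        return task

    loop.set_task_factory(factory)


async def third_party_helper():
    # Stands in for library code that spawns a task we never see directly.
    return asyncio.create_task(asyncio.sleep(0.01))


async def demo():
    install_scope_tracking(asyncio.get_running_loop())
    tracked = set()
    token = current_scope.set(tracked)
    try:
        hidden = await third_party_helper()
        captured = hidden in tracked
        await hidden
    finally:
        current_scope.reset(token)
    return captured
```

Because new tasks copy the creating context, tasks spawned further down by `hidden` itself would see the same `current_scope` and register in the same set.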

This will enable us to keep track of the tasks in a more structured way and provide more useful statistics and telemetry of asyncio applications by categorizing tasks. The current asyncio.all_tasks() is too “inclusive”.

I’d like to hear other async library authors’ opinions about making this a mandatory default or an optional flag, and about the potential breakage that could happen.

achimnol · May 21, 2024, 8:57pm

So, I think we need to break down the issue into a set of more approachable tasks.

There are two big directions:

Implementing the intrinsic “task scope” entirely in the stdlib asyncio (e.g., an option to TaskGroup as some suggested, refactoring TaskGroup into TaskScope and TaskContext, etc.)
Splitting out the “task scope” implementation as a 3rd-party library after adding minimal supporting facilities in the stdlib asyncio

Practically, the second approach sounds better and would be more useful in terms of reusability in aiomonitor and other libraries. Using hooks, it will be possible to make existing 3rd-party libraries work with task scopes seamlessly.

Things to do in the stdlib asyncio

Add a hooking interface to task creation and termination.
- We could keep the vanilla task factory, while adding callbacks to the task lifecycle events.
  - Task factories should be the responsibility of event loop implementations.
  - Task hooks should be the main interface for libraries to implement task tracking.
- This will allow multiple libraries to add their own custom task trackers.
Add fine-grained tracing.
- e.g., hooks/callbacks for _step()
- This is not strictly required for “task scope”, but will be useful for observability libraries.

Things to do in a 3rd-party “task scope” implementation

Attach a task creation hook:
- Query the “current task scope” contextvar.
- Add the reference of the created task to it.
Attach a task termination hook:
- Query the “current task scope” contextvar.
- Remove the reference of the terminated task from it.
- (This may be replaced by using a WeakSet in the task scope, but it would be better to be explicit and keep the possibility of adding other actions here.)
The “current task scope” contextvar could be a tree of instances where we can access the parent and child scopes when needed.
Implement the task scope concept
- Concurrently shut down the child task scopes when explicitly closed.
- Do this recursively until the target task scope is shut down.

Things to think about

What is the “natural” representation of task scopes in the code?
- async with blocks
- A class with tree-manipulating methods (imagine the DOM API)
- Both?
What should the exception callback for each task scope look like?
- Just copy the event loop exception handler interface?

Still, it would be nice to put task scopes in the stdlib, so that observability libraries could expect them to always be there (e.g., asyncio.task_scope_tree or asyncio.root). Maybe this can be done after the initial experimentation in a 3rd-party library.

A potential concern is that the hooking interface may incur some extra performance overhead, but I think it is worth the trade-off.

What are your thoughts?

mikeshardmind · December 22, 2024, 5:12pm

Following over from the other thread you linked this to, as I think the feedback is more appropriate here than in that thread.

I don’t particularly think the task-group machinery is a good abstraction for the problems people are trying to solve with it, and I don’t think making task creation in general have more overhead to support an even more complex case of it is a good idea.

Recently (it’s part of 3.14a), simply changing the structure tasks were stored in resulted in a roughly 10% performance uplift for asyncio applications. The performance of every asyncio application is extremely sensitive to task creation overhead, and I don’t think the additional complexity here needs to be owned by event loops; it should be applications choosing which tasks to block on, and properly ordering a mix of cancellation and signaled graceful shutdown as appropriate.

I do think we can make low-cost abstractions that can help both library and application code here, and I have some that are public and some that I cannot share yet^[1], but I don’t think it should increase the cost for creating tasks that aren’t using that abstraction, and that’s part of the issue I have with task groups as currently designed, and as proposed to extend further.

The other problem I have with task groups is that I think they get cancellation wrong in two ways. One of them is “obvious”^[2] with known issues. The other is a philosophical design difference that I’ll be happy to discuss in more depth after the holidays, if you’re interested in an alternative, less-intrusive way of structuring this for applications. The basics are that it is enough for applications to have an ordered set of sets of tasks to wait on, with the order corresponding to the lifecycle and not necessarily to the level of nesting at which the task is launched.
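One possible reading of "an ordered set of sets of tasks" is sketched below. This is only an interpretation for the sake of discussion, not the code referred to in this post; `shutdown_in_stages` and its parameters are hypothetical. It assumes the tasks have already been signaled to stop cooperatively (here via an `asyncio.Event`) and only cancels the ones that overstay a grace period.

```python
import asyncio


async def shutdown_in_stages(stages, grace):
    """Wait on each set of tasks in lifecycle order; cancel only stragglers."""
    cancelled_per_stage = []
    for tasks in stages:
        pending = {t for t in tasks if not t.done()}
        if pending:
            # Give cooperative tasks a chance to finish on their own.
            _, pending = await asyncio.wait(pending, timeout=grace)
        for task in pending:
            task.cancel()
        if pending:
            await asyncio.wait(pending)
        cancelled_per_stage.append(len(pending))
    return cancelled_per_stage


async def demo():
    stop = asyncio.Event()

    async def cooperative():
        await stop.wait()  # exits by itself once shutdown is signaled

    async def stubborn():
        await asyncio.sleep(3600)  # ignores the signal entirely

    handlers = {asyncio.create_task(cooperative()) for _ in range(2)}
    background = {asyncio.create_task(stubborn())}
    stop.set()
    # The list order is the lifecycle order, independent of where tasks were spawned.
    return await shutdown_in_stages([handlers, background], grace=0.05)
```

Nothing here touches task creation, so tasks that are not part of any stage pay no extra cost.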

I have written a bit about the issues people run into with cancellation vs cooperative shutdown before here, and it’s partially relevant to how I would suggest designing a good general-purpose abstraction here that doesn’t require hooking into tasks in general, but I’ll only be able to show what some people might view as a toy example for now^[1:1]

[1] Some production code I can’t currently share is in holiday-limbo right now, but I’m in the process of waiting on official permission to share the relevant bits of how asyncio is used at my day job, and why we avoid most APIs that implicitly cancel
[2] At least for those who have been following various issues, especially those spelled out in the rationale for PEP 533