Context: written for internal distribution at Anthropic and subsequent publication on the Python forum, to summarize lessons learned from a recent migration. I hope this writeup is of interest and perhaps use to the async user and maintainer community!
Background
We have a lot of async code at Anthropic. While not everyone is a fan, this has made implementing high-performing systems for sampling and reinforcement learning – where Python is acting as a glue layer and fancy scheduler for more compute-intensive pieces – much easier than would otherwise be the case.
We build this on Trio, which makes it possible (though not always easy!) to write concurrent programs which support timeouts and cancellation without ever dropping errors on the floor. I believe we’re among the larger-scale users of Trio and earliest adopters of ExceptionGroup
and Trio’s strict_exception_groups
setting, and so I thought it worth writing up a short experience report following our migration from MultiError
to ExceptionGroup
.[1]
Implementation
Update the pinned version of Trio, git grep MultiError
, and the required fixes are both mechanical and easy – our codebase had few enough use sites to fully update in a single PR. Incremental migration would work fine too for very large repos.
asyncio.TaskGroup
supports a similar programming style from Python 3.11 onwards, but I haven’t used that at any significant scale. While out of scope for this experience report, I hope some of the lessons are also useful for asyncio users and projects.
Expected and unexpected benefits
Clarified tracebacks with the PEP-654 tree structure was our leading motivation for this change; and it paid off. The flat/serial MultiError
reporting style could be confusing when viewing a tree of exceptions, or when viewing logs from multiple subprocesses, so we’ve been very happy with this change.
Consistent use of ExceptionGroup, even for single errors has substantially improved the robustness of our concurrent code. With MultiError
passing through single exceptions unchanged, the easy path meant failing to handle concurrent errors and the correct path was substantially less ergonomic.
We expected enabling strict_exception_groups=True
[2] to improve our code, but it substantially exceeded our expectations – after some teething issues as latent bugs were found and fixed we’ve reduced the crash rate of both our tests and production services. Highly recommended!
Problems and lessons learned
We hit and helped fix some bugs, which is not terribly unusual for new code. Our sincere thanks to the OSS community for helping to rapidly identify and fix them! Details for the interested in a footnote.[3]
Handling concurrent errors takes careful thought. For example, what should you do if you get a concurrent sys.exit(0)
and ConnectionClosedError
? Intuitively you might want to exit cleanly if the connection had been closed because the program is exiting, but fail if they were unrelated – but I’ve found that this information is not usually available, leaving us to rearchitect a great deal of code or simply guess.
Testing for expected errors requires unwrapping them, which was pretty easy with a new unwrap_exception()
context manager to place inside pytest.raises(...)
. If we don’t have a single “leaf” exception group, or a leaf of an unexpected type, just leave the group as-is. We don’t yet test for expected exception groups; for that see Design discussion: how should we test for expected `ExceptionGroup`s? · Issue #10441 · pytest-dev/pytest · GitHub
Error-safe handling of async generators is much harder than you think, even if you think it’s hard. The core difficulty is that GeneratorExit
will be wrapped up into a BaseExceptionGroup
if it’s coming out of a nursery; but this breaks its control-flow effects. I emphasize that this is not a new error; but rather a latent error which we were now forced to handle. My first response was to add our unwrap_exception()
context manager; but this restores the status quo ante instead of truly fixing the bug.
We’re now using an unwrap_generatorexit()
context manager which logs the group and does raise GeneratorExit from group
if it contains one or more GeneratorExit
exceptions. This restores the expected control flow, but at the cost of discarding whatever other exception(s) were in flight.
- GitHub - python-trio/async_generator: Making it easy to write async iterators in Python 3.5 doesn’t help with this; it backports async generators to Python 3.5 and related introspection tools to Python 3.6; and both are end-of-life.
- I haven’t dug deep enough into add @trio_async_generator by belm0 · Pull Request #9 · groove-x/trio-util · GitHub to know whether it would fix this.
- Our only substantial use of async generators is in order to implement async context managers
– but even here if you nest these you end up needing to[update: I now think I misdiagnosed this]unwrap_generatorexit()
and drop errors. - I’m inclined to ban async generators anywhere in our codebase, except for implementing
@asynccontextmanager
functions. Fortunately we’re very light users already, so this won’t be too hard.
I don’t feel like we understand good design patterns here, so these are low-confidence recommendations.
I misdiagnosed several problems between upgrading our dependencies and enabling strict_exception_groups
, where one upgrade triggered a latent error (involving GeneratorExit). That we’ve chosen to tolerate dirty shutdown of some services has saved substantial engineering effort, but also left a lot of misleading clues in various logs. I think it’s a good tradeoff but still paid for it here.
Conclusions
- Async Python offers us a productive toolkit for writing correct and efficient concurrent code
- Moving to ExceptionGroup has improved the readability of our logs and allowed us to identify and fix several latent errors which were regularly biting us in tests and in production. Thanks to everyone who contributed to making this possible!
- Async generators are very difficult to use correctly; consider avoiding them entirely in favor of memory channels between tasks.
-
As further context, I (Zac) have been pretty heavily involved with
ExceptionGroup
through my OSS work: as leader of Hypothesis (motivating me to write PEP-678), core maintainer of Pytest, and Trio committer, I’ve got a pretty strong interest in the existence of a well-supported standard. I’ve also written a bunch of ExceptionGroup support code in Pytest and pytest-trio; and Anthropic sponsored Trio’s existing MultiError-to-ExceptionGroup porting effort to help get that merged and released. It’s not all or even mostly me though; Anthropic chose this stack before I joined, and if this sounds like your kind of fun, yes, we’re hiring. ↩︎ -
This will be the default behavior in a future version of Trio, and
asyncio.TaskGroup
has never supported anything else. We enable it globally by monkey-patchingtrio.run = partial(trio.run, strict_exception_groups=True)
, and likewise fortrio.open_nursery
. ↩︎ -
The hard bug was that the
exceptiongroup
backport package could crash when displaying the traceback of an exception raisedfrom
an ExceptionGroup, only on Python 3.10 – and we only hit this insys.excepthook
on certain remote jobs! More prosaically,pytest-trio
would (with strict exceptiongroups) always raiseskip()
orxfail()
exceptions into a group, which Pytest then treats as an ordinary failure; it now unwraps groups containing exactly one of these magic exceptions. Thanks to everyone who helped identify and fix these issues ↩︎