Using ExceptionGroup at Anthropic (experience report)

Zac-HD · November 9, 2022, 7:21am

Context: written for internal distribution at Anthropic and subsequent publication on the Python forum, to summarize lessons learned from a recent migration. I hope this writeup is of interest and perhaps use to the async user and maintainer community!

Background

We have a lot of async code at Anthropic. While not everyone is a fan, this has made implementing high-performing systems for sampling and reinforcement learning – where Python is acting as a glue layer and fancy scheduler for more compute-intensive pieces – much easier than would otherwise be the case.

We build this on Trio, which makes it possible (though not always easy!) to write concurrent programs which support timeouts and cancellation without ever dropping errors on the floor. I believe we’re among the larger-scale users of Trio and earliest adopters of ExceptionGroup and Trio’s strict_exception_groups setting, and so I thought it worth writing up a short experience report following our migration from MultiError to ExceptionGroup.^[1]

Implementation

Update the pinned version of Trio, git grep MultiError, and the required fixes are both mechanical and easy – our codebase had few enough use sites to fully update in a single PR. Incremental migration would work fine too for very large repos.

asyncio.TaskGroup supports a similar programming style from Python 3.11 onwards, but I haven’t used that at any significant scale. While out of scope for this experience report, I hope some of the lessons are also useful for asyncio users and projects.

Expected and unexpected benefits

Clarified tracebacks with the PEP-654 tree structure was our leading motivation for this change; and it paid off. The flat/serial MultiError reporting style could be confusing when viewing a tree of exceptions, or when viewing logs from multiple subprocesses, so we’ve been very happy with this change.

Consistent use of ExceptionGroup, even for single errors has substantially improved the robustness of our concurrent code. With MultiError passing through single exceptions unchanged, the easy path meant failing to handle concurrent errors and the correct path was substantially less ergonomic.

We expected enabling strict_exception_groups=True^[2] to improve our code, but it substantially exceeded our expectations – after some teething issues as latent bugs were found and fixed we’ve reduced the crash rate of both our tests and production services. Highly recommended!

Problems and lessons learned

We hit and helped fix some bugs, which is not terribly unusual for new code. Our sincere thanks to the OSS community for helping to rapidly identify and fix them! Details for the interested in a footnote.^[3]

Handling concurrent errors takes careful thought. For example, what should you do if you get a concurrent sys.exit(0) and ConnectionClosedError? Intuitively you might want to exit cleanly if the connection had been closed because the program is exiting, but fail if they were unrelated – but I’ve found that this information is not usually available, leaving us to rearchitect a great deal of code or simply guess.

Testing for expected errors requires unwrapping them, which was pretty easy with a new unwrap_exception() context manager to place inside pytest.raises(...). If we don’t have a single “leaf” exception group, or a leaf of an unexpected type, just leave the group as-is. We don’t yet test for expected exception groups; for that see Design discussion: how should we test for expected `ExceptionGroup`s? · Issue #10441 · pytest-dev/pytest · GitHub

Error-safe handling of async generators is much harder than you think, even if you think it’s hard. The core difficulty is that GeneratorExit will be wrapped up into a BaseExceptionGroup if it’s coming out of a nursery; but this breaks its control-flow effects. I emphasize that this is not a new error; but rather a latent error which we were now forced to handle. My first response was to add our unwrap_exception() context manager; but this restores the status quo ante instead of truly fixing the bug.

We’re now using an unwrap_generatorexit() context manager which logs the group and does raise GeneratorExit from group if it contains one or more GeneratorExit exceptions. This restores the expected control flow, but at the cost of discarding whatever other exception(s) were in flight.

GitHub - python-trio/async_generator: Making it easy to write async iterators in Python 3.5 doesn’t help with this; it backports async generators to Python 3.5 and related introspection tools to Python 3.6; and both are end-of-life.
I haven’t dug deep enough into add @trio_async_generator by belm0 · Pull Request #9 · groove-x/trio-util · GitHub to know whether it would fix this.
Our only substantial use of async generators is in order to implement async context managers ~~– but even here if you nest these you end up needing to unwrap_generatorexit() and drop errors.~~ [update: I now think I misdiagnosed this]
I’m inclined to ban async generators anywhere in our codebase, except for implementing @asynccontextmanager functions. Fortunately we’re very light users already, so this won’t be too hard.

I don’t feel like we understand good design patterns here, so these are low-confidence recommendations.

I misdiagnosed several problems between upgrading our dependencies and enabling strict_exception_groups, where one upgrade triggered a latent error (involving GeneratorExit). That we’ve chosen to tolerate dirty shutdown of some services has saved substantial engineering effort, but also left a lot of misleading clues in various logs. I think it’s a good tradeoff but still paid for it here.

Conclusions

Async Python offers us a productive toolkit for writing correct and efficient concurrent code
Moving to ExceptionGroup has improved the readability of our logs and allowed us to identify and fix several latent errors which were regularly biting us in tests and in production. Thanks to everyone who contributed to making this possible!
Async generators are very difficult to use correctly; consider avoiding them entirely in favor of memory channels between tasks.

As further context, I (Zac) have been pretty heavily involved with ExceptionGroup through my OSS work: as leader of Hypothesis (motivating me to write PEP-678), core maintainer of Pytest, and Trio committer, I’ve got a pretty strong interest in the existence of a well-supported standard. I’ve also written a bunch of ExceptionGroup support code in Pytest and pytest-trio; and Anthropic sponsored Trio’s existing MultiError-to-ExceptionGroup porting effort to help get that merged and released. It’s not all or even mostly me though; Anthropic chose this stack before I joined, and if this sounds like your kind of fun, yes, we’re hiring. ↩︎
This will be the default behavior in a future version of Trio, and asyncio.TaskGroup has never supported anything else. We enable it globally by monkey-patching trio.run = partial(trio.run, strict_exception_groups=True), and likewise for trio.open_nursery. ↩︎
The hard bug was that the exceptiongroup backport package could crash when displaying the traceback of an exception raised from an ExceptionGroup, only on Python 3.10 – and we only hit this in sys.excepthook on certain remote jobs! More prosaically, pytest-trio would (with strict exceptiongroups) always raise skip() or xfail() exceptions into a group, which Pytest then treats as an ordinary failure; it now unwraps groups containing exactly one of these magic exceptions. Thanks to everyone who helped identify and fix these issues ↩︎

njs · November 9, 2022, 9:53pm

Hmm, can you give an example? I’m not thinking of any cases where GeneratorExit could come out of a nursery in correct Trio code. Because that would mean you need to have a yield inside the nursery block, and that’s only valid when you’re using @asynccontextmanager, and if you’re using @asynccontextmanager then your async generator should always be run to completion, not closed, so there is no GeneratorExit…?

I also would have expected @asynccontextmanager to work fine with ExceptionGroups, so I’m puzzled here too!

Zac-HD · November 11, 2022, 1:48am

Unfortunately we have known-unsafe Trio code which yields from inside a nursery block, so no mystery there! Some slipped in before we developed flake8-trio to defend against inevitable-at-scale human error; most have since been fixed and the rest are for a generator-based API in a key component which we’ll be replacing soon.

njs · November 11, 2022, 5:47am

Oh sure that makes sense. But you also said that even async context managers were broken, and that’s the part I’m confused about

Zac-HD · November 15, 2022, 4:45am

I spent the afternoon digging into this, and it looks like I was just mistaken: a non-contextmanager async generator threw a GeneratorExit out through one of our context managers, and in addition to fixing the actual problem I added my with unwrap_generatorexit(): hack to a context manager that didn’t actually need it. Apologies for the confusion, thanks for the prompt to investigate more fully, and post edited!