The standard library queue log handler can be backed by a multiprocessing.Queue; however, that class (and the multiprocessing module in general) has built-in logging of its own. If that logging is enabled, log messages get logged in the course of handling log messages, and we deadlock (see gh-91555 and gh-90321).
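For illustration, a minimal setup that shows the shape of the cycle (not an exact reproducer for the linked issues; here the queue handler is attached straight to the multiprocessing logger so the loop is easy to see, and exactly when it deadlocks depends on version and configuration):

```python
# Hedged illustration of the re-entrant cycle, not an exact reproducer.
import logging
import logging.handlers
import multiprocessing

mp_logger = multiprocessing.get_logger()   # the 'multiprocessing' logger
mp_logger.setLevel(logging.DEBUG)          # enable multiprocessing's own logging

queue = multiprocessing.Queue()
mp_logger.addHandler(logging.handlers.QueueHandler(queue))

# Handling this record calls queue.put(); the queue's internals then emit
# their own DEBUG records to the same logger, re-entering the handler that
# is already busy with the first record.
mp_logger.debug("hello")
```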
Options under consideration for dealing with this are:
1. Treat it as user error and close the issues as such.
2. Remove the logging from multiprocessing.Queue.
3. Suppress logging (on a per-thread-and-logger basis) while handling log messages.
The first option seems sensible at first glance. However, this is a nasty and non-obvious latent bug to leave lying around. Sentry’s logging integration, for example, monkey-patches logging internals in order to avoid it.
The second option was suggested in discussion of gh-91555: it would fix those two bugs, but would remove useful (and publicly documented) functionality from the module, and change logging behaviour for any existing code that uses it.
The third option is a small and simple fix (see the PR linked from gh-91555), and prevents this entire class of bug. However, there may be subtle problems with using TLS, or other concerns I’m missing.
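To make option 3 concrete, here is a rough sketch of the idea (not the actual PR, which changes logging internals; the wrapper class and variable names here are made up): a per-thread flag that silently drops records emitted while a handler is already running on that thread.

```python
# Sketch of option 3 as a wrapper handler; names are illustrative only.
import logging
import threading

_handling = threading.local()

class ReentrancyGuardHandler(logging.Handler):
    """Wraps another handler and drops re-entrant records on this thread."""

    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def emit(self, record):
        if getattr(_handling, "active", False):
            return  # already handling a record on this thread: drop it
        _handling.active = True
        try:
            self.inner.handle(record)
        finally:
            _handling.active = False
```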
I have not checked the code to see how feasible this is, but: wouldn’t it be possible to add a parameter to multiprocessing.Queue that suppresses all of its internal logging, for queues that logging is going to use? So essentially option 2, but behind a flag, in case people still find the logging useful?
The Queue is passed to the handler, not created by it, so this couldn’t be done with a constructor parameter. It wouldn’t be hard to add a flag member variable to disable logging, but the handler would have to set it when it is constructed and the queue is passed in. It all feels a bit hacky to me though, and it would not help other handlers that use third-party libraries with the same issue.
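Purely for illustration, the flag idea might look roughly like this; the _suppress_internal_logging attribute is hypothetical and nothing like it exists in multiprocessing today:

```python
# Hypothetical sketch only: the attribute below does not exist.
import logging.handlers

class SilencingQueueHandler(logging.handlers.QueueHandler):
    def __init__(self, queue):
        super().__init__(queue)
        # The handler, not the caller, would have to reach into the queue it
        # was given and switch off its internal logging after the fact.
        queue._suppress_internal_logging = True  # hypothetical flag
```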
I haven’t used contextvars before, but having a quick look I agree with Antoine: TLS is appropriate here. The logging module itself is synchronous by design, and an async context switch during message handling would not be safe since it might be holding arbitrary threading locks.
I don’t know if it’s a good pattern to follow, but Lib/multiprocessing/resource_tracker.py uses an RLock and its _recursion_count() method to detect recursive calls:
```python
def _reentrant_call_error(self):
    # gh-109629: this happens if an explicit call to the ResourceTracker
    # gets interrupted by a garbage collection, invoking a finalizer (*)
    # that itself calls back into ResourceTracker.
    #   (*) for example the SemLock finalizer
    raise ReentrantCallError(
        "Reentrant call into the multiprocessing resource tracker")

def _stop(self, ...):
    with self._lock:
        self._stop_locked()
    ...

def _stop_locked(...):
    # This shouldn't happen (it might when called by a finalizer)
    # so we check for it anyway.
    if self._lock._recursion_count() > 1:
        return self._reentrant_call_error()
    ...
```
Yes, something like this would be an option: detect the recursion and raise an exception. That would be much better than a deadlock and marginally better than a stack overflow.
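Adapted to a logging handler, that pattern might look roughly like this (a sketch only; _recursion_count() is a private RLock method and only present in recent CPython versions):

```python
# Sketch: detect re-entrance with an RLock and raise instead of deadlocking.
import logging
import threading

class ReentrancyErrorHandler(logging.Handler):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner
        self._rlock = threading.RLock()

    def emit(self, record):
        with self._rlock:
            # _recursion_count() is private and version-dependent.
            if self._rlock._recursion_count() > 1:
                raise RuntimeError("Reentrant call into logging handler")
            self.inner.handle(record)
```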
IMO though the best option is still to just ignore and silently drop re-entrant log messages: in practice that is what affected code will need to do to fix the problem anyway (e.g. see Sentry’s logging integration, referenced above).
I haven’t looked at exactly what state is being managed here, but we ought to be able to use contextvars anywhere we might use a thread local. They provide the same functionality, except contextvars give you task local, which is more “local” than thread local.
If the state being stored never persists outside the call (e.g. once the logging call returns, the info is never used again) and the code never yields within the call, then the choice doesn’t matter. But even when both of those hold, it should still be perfectly safe to use contextvars instead of thread locals, and we should generally prefer them unless there’s a really specific reason (such as needing to share state across tasks). I don’t see any such reason here.
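For comparison, the thread-local guard sketched earlier would look almost identical with a ContextVar (again just a sketch with made-up names):

```python
# Same re-entrance guard as before, but context-local instead of thread-local.
import contextvars
import logging

_handling = contextvars.ContextVar("handling_log_record", default=False)

class ContextGuardHandler(logging.Handler):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def emit(self, record):
        if _handling.get():
            return  # re-entrant within this context: drop the record
        token = _handling.set(True)
        try:
            self.inner.handle(record)
        finally:
            _handling.reset(token)
```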
We can’t use contextvars because a handler could copy the context into a thread and drag the logger’s disabled state along into that new thread.
Hmm, yeah, I see your point. That’s unfortunate (did this come up during the logging issue that was being worked on recently?)
Arguably the handler should not copy the context and should instead start with a default one, but there’s no way for it to know that. It might think it is the core dispatcher, especially given how mixed up sync and async code tends to get. (In any case where contextvars (or TLS) matter at all, there must be some central controller spinning up new tasks, such as a connection handler, and anything that isn’t that controller ought to start tasks fresh rather than copy task-local state.)
A fresh task, e.g. asyncio.run, asyncio.create_task, asyncio.to_thread (not a task, but an async function that starts a thread), etc., copies the current context by default. contextvars want to be copied into new threads, and only make sense when you have a tree of call stacks where some context needs to wrap all of them.
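A small example of that default copying (asyncio.to_thread runs the function in a worker thread with a copy of the caller’s context):

```python
# Demonstrates that asyncio.to_thread propagates contextvars into the thread.
import asyncio
import contextvars

flag = contextvars.ContextVar("flag", default=False)

def worker():
    return flag.get()  # sees True: the caller's context was copied

async def main():
    flag.set(True)
    print(await asyncio.to_thread(worker))  # prints True

asyncio.run(main())
```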
Another possible solution is to log events relating to multiprocessing queue internals to a different “meta” logger such as one named multiprocessing.meta. Then those who need to monitor/debug those internals can add handlers to that logger (I imagine it wouldn’t be a common case).
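A rough sketch of what that might look like (the multiprocessing.meta name is just the suggestion above; nothing like this exists today):

```python
# Hypothetical: queue internals would log to a separate "meta" logger that
# ordinary application handler chains never see unless someone opts in.
import logging

meta_logger = logging.getLogger("multiprocessing.meta")
meta_logger.propagate = False  # keep internals out of normal handler chains

# Someone who actually needs to debug queue internals opts in explicitly:
meta_logger.addHandler(logging.StreamHandler())
meta_logger.setLevel(logging.DEBUG)
```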