Architectural Discussion: Could Runtime Sandbox Isolation improve Python's resilience?

Hi everyone,

I am researching ways to improve Python’s resilience when dealing with unrecoverable infinite loops or runtime paradoxes, specifically in long-running or high-availability processes.

Currently, such cases often require a full process restart, which can be costly in terms of data retention and system overhead. I am exploring a theoretical concept called Dynamic Sandbox Isolation (DSI) to see if it’s feasible to isolate a trapped thread into a ‘micro-sandbox’ instead of terminating the process.

My current hypothesis involves:

Detection: Monitoring threads for extreme instruction pointer frequency (N > 1000 consecutive hits within <1ms).

Isolation: Re-mapping memory pages of the trapped thread to a read-only ‘sandbox’ via mmap.

State Preservation: Moving the thread to an ‘Active Idle’ state to nullify CPU usage without dropping the execution context.
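
As a rough illustration of the detection step only, the sketch below swaps hardware instruction-pointer monitoring for interpreter-level sampling with `sys._current_frames()`. The helper name `suspect_stuck`, the sample count, and the interval are all hypothetical choices for this example, and the heuristic cannot tell a genuine infinite loop from a long but finite computation.

```python
import sys
import threading
import time


def suspect_stuck(thread, samples=20, interval=0.005):
    """Heuristic (hypothetical helper): True if every sample finds the
    thread's innermost frame in the same code object."""
    seen = set()
    for _ in range(samples):
        frame = sys._current_frames().get(thread.ident)
        if frame is None:  # the thread is not running (e.g. it finished)
            return False
        seen.add(frame.f_code)
        time.sleep(interval)
    return len(seen) == 1


stop_flag = [False]  # exists only so the demo can end cleanly


def spin():
    # Stand-in for a runaway loop: no function calls, so the innermost
    # frame of this thread is always spin() itself.
    while not stop_flag[0]:
        pass


worker = threading.Thread(target=spin, daemon=True)
worker.start()
time.sleep(0.1)  # give the worker time to enter the loop

flagged = suspect_stuck(worker)

stop_flag[0] = True
worker.join()
print("suspected stuck:", flagged)
```

Note that this only flags a suspect; it does nothing to isolate or pause the thread, which is the harder part of the proposal.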

I am in the early stages of preparing a simulation environment using QEMU/Gem5 to test the performance impact of such a mechanism.

Before I go deeper into the implementation details, I would love to hear the thoughts of core developers: Has any approach similar to this been explored in Python’s past, and what are the major architectural hurdles in implementing memory isolation at the interpreter/runtime level?

Any guidance or pointers to relevant PEPs or discussions would be greatly appreciated.

Best regards,

Pyae

If your goal is to use isolation to detect software bugs and recover from those failures without restarting the process, you first need to figure out, at an architectural level, which parts of the process you actually think can be “saved” and kept, and which parts you need to throw away and reset to actually recover from the failure state.

If the program is in an infinite loop or crashes, it’s almost certainly because some memory contains variables and data that tell it to loop.

At the language level, there is no way to know which parts of a program’s memory are data you want to keep and which need to be reset to actually recover.

Without specific application logic, the only boundary that could be meaningfully reset without being equivalent to restarting the process is probably the interpreter state, and you need no isolation for that. (While in theory it should be possible to restart the interpreter, in practice much can go wrong between extensions with static variables, daemon threads that can’t be killed without stopping the process, and other resources that are hard to restore, and very little would be gained vs. just restarting.)
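
For comparison, here is a minimal sketch of the "just restart" baseline, using the process itself as the reset boundary. The helper name `run_with_restart`, the timeout values, and the retry count are assumptions made for this example; it relies only on the standard `subprocess.run` timeout behaviour, which kills the child before raising `TimeoutExpired`.

```python
import subprocess
import sys


def run_with_restart(code, timeout=0.5, max_attempts=2):
    """Run `code` in a fresh interpreter (hypothetical helper).

    Returns (attempts_used, stdout), or (max_attempts, None) if every
    attempt ran past the timeout and had to be killed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                timeout=timeout,
                capture_output=True,
                text=True,
            )
            return attempt, result.stdout.strip()
        except subprocess.TimeoutExpired:
            # The child has already been killed; all of its state is gone,
            # so the next attempt starts from a clean interpreter.
            continue
    return max_attempts, None


hung = run_with_restart("while True: pass", timeout=0.5, max_attempts=2)
fine = run_with_restart("print('ok')", timeout=10)
print(hung, fine)
```

Nothing from the hung child survives, which is exactly the trade-off being discussed: the reset is reliable because no decision is needed about which state to keep.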

Hi @tapetersen,

Thank you for this incredibly insightful feedback. You’ve hit on the exact core challenge of runtime state recovery.

To clarify my approach, the primary goal of Dynamic Sandbox Isolation (DSI) at this stage is not immediate state recovery/repair, but rather blast containment.

In high-availability, multi-threaded environments, if one worker thread enters an unrecoverable infinite loop, it can completely starve the CPU or block shared resources, eventually bringing down the entire process (and other healthy threads along with it).

My thought process behind using a page-table-level sandbox is:

Quarantine First: Once the paradox detection threshold is crossed, we immediately modify the page tables for that specific thread to strip write permissions or redirect execution. This prevents the rogue thread from further corrupting the global heap or exhausting CPU cycles.

Post-Mortem / Safe Degradation: Instead of trying to automatically guess which variables to reset at the language level (which, as you rightly pointed out, is nearly impossible without application logic), the DSI model treats the isolated thread as a “frozen snapshot.” The application can then safely degrade, trigger an alert, or allow an external supervisor to inspect the state without crashing the whole application.

You are entirely right that dealing with C-extensions and static variables introduces massive edge cases. I see DSI more as a hardware-assisted defense mechanism to keep the rest of the interpreter alive rather than a magic undo-button for logic bugs.

I’d love to know your thoughts—do you think focusing strictly on containment/quarantine (rather than automatic recovery) makes it more viable from an interpreter architecture standpoint?

Best regards,

I strongly suspect you’re using AI; if so, you should definitely disclose that before asking people for their time with highly technical questions.

Your analysis above has several large gaps hanging on very loose formulations, and for a research project it’s really up to you to do the job of filling in those gaps.

@tapetersen, I sincerely apologize. You are absolutely right. I am a young learner/researcher, and because my English isn’t very strong, I used an AI tool to help structure my thoughts and translate my ideas into technical terms. I didn’t mean to disrespect anyone’s time. I will take your advice, step back, and do the heavy lifting myself to bridge those technical gaps and research the code implementation before bringing it up again. Thank you for the reality check.

I think an unstated assumption in this is that threads do one thing and one thing only. That is not true: a thread running an asyncio event loop may have a coroutine that is in whatever state you are trying to isolate yet still periodically yields to other coroutines. Isolating this thread would take out all the other coroutines it is executing, which is almost certainly undesirable.

At the very least I think you would have to account for this and somehow isolate just the “bad” coroutine without using the remapping that seems to be the core of your idea.
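
A small sketch of that situation, with hypothetical coroutine names: one coroutine burns CPU in bursts but still yields, and a healthy coroutine runs on the same event loop. Both report the same thread id, so anything applied per thread would hit both.

```python
import asyncio
import threading


async def busy_but_yielding(rounds, log):
    # Stand-in for the "bad" coroutine: does CPU work each round,
    # but still yields control back to the event loop.
    for _ in range(rounds):
        sum(range(10_000))
        log.append(("busy", threading.get_ident()))
        await asyncio.sleep(0)


async def healthy(rounds, log):
    # An unrelated coroutine sharing the same event loop.
    for _ in range(rounds):
        log.append(("healthy", threading.get_ident()))
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(busy_but_yielding(3, log), healthy(3, log))
    return log


log = asyncio.run(main())
thread_ids = {tid for _, tid in log}
names = {name for name, _ in log}
print(len(thread_ids))  # 1: both coroutines ran on the same thread
```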