Cancelling threads

I don’t want this to sound confrontational, but what type of solution would you support that addresses the use case of wanting to terminate all outstanding threads in order to shut down the program? If your only answer is “use the OS to forcibly terminate the process”, then that isn’t sufficient for at least some of the use cases I’ve had in the past (to give a concrete example: terminating because the main thread detected some condition, and wanting to write a brief message explaining why before shutting down).

2 Likes

Daemon threads already do what you want here if the goal is to have all outstanding work be abandoned on main thread exit, so I’m not entirely sure why we need something that forcibly interrupts code which may not be suited to being interrupted.
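
For reference, this is the daemon-thread behaviour being described: outstanding work is simply abandoned when the main thread exits, with no interrupt delivered to the worker. A minimal sketch:

```python
import threading
import time

def worker():
    # Simulated long-running work that we are happy to abandon.
    while True:
        time.sleep(0.1)

# A daemon thread does not keep the interpreter alive: when the main
# thread exits, the process shuts down and the worker is abandoned
# mid-flight, with no exception raised inside it.
t = threading.Thread(target=worker, daemon=True)
t.start()

print("main thread exiting; daemon worker is abandoned")
```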

I’ve stated multiple times already that I’m fine with new things being interruptible such that old code’s assumptions aren’t violated with it.

I’m also fine with new abstractions, or new keyword args on existing abstractions, that wrap “dispose of this without caring about its status on exit”.

Anything that would break an explicit indication that something isn’t interruptible (such as the idea of both having a way to protect a critical section, but then choosing to ignore that protection some of the time) is a complete non-starter.

1 Like

Well, I thought I was done, but today I had a thought about a relatively safe way to do this that shouldn’t need any new support from the language, under the assumption that the person using the wrapper knows it is safe to abandon the work. It would be usable even in cases where that isn’t actually true, but the use pattern clearly delineates the intended safety, the use case, who is responsible for errors caused by it, and that it’s only for abandoning work at shutdown, not arbitrary cancellation of threads.

Specifically abandon, not interrupt. The thread once abandoned will park itself efficiently waiting on something that won’t ever happen.

The API includes:

  • a context manager that patches threading.Thread, making all threads created within the context daemon threads.
  • a method on the context manager to stop all patching after that point (irreversible via the public API)
  • patching of threading.Thread to wrap the given target function with:
    • a threading RLock per thread, accessible as a contextvar, usable from the thread to still be able to mark critical sections
    • temporary replacement of sys.excepthook
    • upon receiving an unhandled exception of the type described in the last bullet, park waiting on a newly created threading event not shared back to the main thread, completely halting any work in the thread and fully abandoning it rather than doing more work that has been marked as uncared about.
  • exiting the context manager is what triggers abandoning any not-done threads. It is not manually available in a public API (but isn’t so hidden that people couldn’t use it with big warning signs) by use of PyThreadState_SetAsyncExc to throw a new exception that doesn’t inherit from RuntimeError or SystemExit

Seems like it should apply equally in any kind of critical section, because they all have the same issue of a thread becoming uninterruptible otherwise.

The purpose of having critical sections isn’t to make things
uninterruptible. There will be other facilities for doing that if you want.

Critical sections exist to make it possible to ensure that when a
thread is interrupted, it always cleans up after itself properly.

The key word here is “possible”. Critical sections won’t automatically
guarantee proper cleanup in all situations; sometimes you will have to
write a bit of code to make it happen.

My conjecture is that you already have to write that code to deal with
non-interrupt failure conditions, so allowing the I/O to be interrupted
doesn’t make anything any worse. If it makes your code buggy, then it
was buggy to begin with.

To give a concrete example of what I mean, consider this context manager:

class OpenConnection:

    def __enter__(self):
        self.f = open_a_network_connection(interesting_address)
        return self.f

    def __exit__(self, exc_type, exc_value, traceback):
        self.f.close()

Here we don’t have to do anything special if open_a_network_connection
raises an exception, we can just let it propagate. But suppose we’ve
done something else before opening the connection:

class OpenConnection:

    def __enter__(self):
        important_lock.acquire()
        self.f = open_a_network_connection(interesting_address)
        return self.f

    def __exit__(self, exc_type, exc_value, traceback):
        self.f.close()
        important_lock.release()

Now we have a bug, because if open_a_network_connection fails, the lock
never gets released! So we need to do something like

    def __enter__(self):
        important_lock.acquire()
        try:
            self.f = open_a_network_connection(interesting_address)
        except BaseException:
            important_lock.release()
            raise
        return self.f

Here’s the kicker: we’ve also made it safe in the face of the I/O
getting interrupted, without even trying. Because it doesn’t matter if
the exception raised is ConnectionRefused or ThreadCancellation, things
get cleaned up either way.

It’s a bit annoying that we have to write the same cleanup code twice,
once in the except clause and then again in the exit method. I have
some ideas on how to fix that, but I’ll leave them for another post.

This won’t handle the case of interrupting a thread pool executor. This is because exiting the executor’s context manager joins all worker threads - and the join will hang because the threads never terminate.

1 Like

The difference here is that the interrupt can mask either success or failure, and there’s no reliable way to handle this. Interrupts raised external to the designed control flow, even if Python does this via synthesized exceptions, are inherently different from exceptions raised at the source of a concrete case within the control of the code in question.

Take for instance I/O writing a system file (after all, Python is used deep inside multiple distributions’ package management, as well as orchestration tooling like Ansible).

There is a very clear difference between handling success or a specific failure, and having the result be interrupted and not knowing the state to do cleanup from.

I’ve poked a few other holes in this since then, to the point where it probably needs an ast transformation, which is much more invasive, but yeah, won’t help with that case either.

In terms of things that are more achievable today, maybe some use cases would benefit from a daemon executor that, as the name suggests, creates a pool of daemon threads for the work and doesn’t join on exit?
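
A minimal sketch of what such an executor could look like (DaemonExecutor is a hypothetical name, not an existing API; a real implementation would return Future objects):

```python
import queue
import threading

class DaemonExecutor:
    """Hypothetical sketch: a fixed pool of daemon worker threads that
    is never joined, so unfinished work is simply abandoned when the
    interpreter exits."""

    def __init__(self, max_workers=4):
        self._work = queue.SimpleQueue()
        for _ in range(max_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn, args = self._work.get()
            try:
                fn(*args)
            except Exception:
                pass  # a real executor would record this on a Future

    def submit(self, fn, *args):
        self._work.put((fn, args))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Deliberately no join(): pending work is abandoned at exit.
        return False
```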

The other issue is that daemon threads (especially in conjunction with an executor that doesn’t wait for the thread pool to be idle before exiting) will abort work on program exit, so the developer has to manually do all the work to ensure a clean exit in the absence of cancellation. So daemon-thread-based executors will be harder for the average user to use correctly.

IMO, what’s really needed here is for people to let go of the idea that threads are an expert-only feature. Particularly now that free threading is becoming mainstream, “Python supports threads, but you’re not clever enough to use them” isn’t really an acceptable position to take.

People write single threaded code all the time that isn’t robust. Keyboard interrupts can leave partially written files and no-one thinks that’s a huge issue. Not everyone carefully protects the integrity of their data structures against every possible unexpected exception. And it’s fine - the code is being used in situations where this level of reliability is sufficient and acceptable. We shouldn’t be demanding that people with code like that can’t add a simple thread pool executor to run their network fetches in parallel (or to fire off a bunch of file copies or data processing jobs). We shouldn’t insist that they add reliability that they don’t need just to get extra speed that they do need. Nor should we be telling them that threads “aren’t the right tool for the job” - when the thread pool executor documentation uses exactly these sorts of examples.

Obviously any solution must allow people who do need high levels of reliability to get it. But the people who need that are the ones who should be assumed to be experts, and who are capable of writing extra code to get the reliability they need. But let’s remember that right now, even single-threaded try...except and context managers aren’t safe - there are sections where a keyboard interrupt can cause problems. Experts have to write code to deal with that, too.

Your point on backward compatibility is important. Having KeyboardInterrupt affect threads other than the main one is a breaking change[1]. But having a new cancel method on a thread, which raises a new exception in the thread, is allowed by the backward compatibility policy. It still has an impact, which we shouldn’t ignore, but it’s not technically backward incompatible. But that’s not what matters. We shouldn’t be quoting rules here. What matters is benefit vs harm. And people have different opinions there, which is why it’s ultimately the SC that decides.


  1. Although I’ll note that breaking changes aren’t absolutely prohibited, they just have to have an extremely strong justification, and a deprecation period. ↩︎

7 Likes

To be clear, if this new daemon executor were to be added, I would hope it was documented to only use this executor in cases where the work is truly abandonable/ephemeral/etc.

What worries me about a lot of the cancellation proposals is that people seem to want a one-size-fits-all solution, and I don’t think such a solution can exist, by the very nature of the competing needs.

With that in mind, it seems to me that new means make more sense than trying to stretch the existing means to cover incompatible needs.

I really do view adding cancel as breaking. I know it’s allowed by the current policy, but there are plenty of things we know are allowed by the current policy that we call breaking anyhow because of the context.

Just like the collections ABCs, threading.Thread is explicitly documented for user subclassing. If users have already implemented their own cancel methods with different semantics, this is probably something to consider as breaking, for the same reason the collections ABCs have the giant block comment: adding public methods to types intended to be subclassed is breaking.

Beyond the semantic argument about breaking, though, it’s adding a capability that people writing concurrent code can currently assume won’t ever be exercised (it does exist via the C API, but that invokes the same experts-only arguments), and telling them it’s now on them to either handle it, or tell their users that even though cancel exists, it should never be called, because the work in the thread assumes it won’t be interrupted abruptly. Is that really a positive direction that makes it easier for users to handle cancellation for threads, or would it be better to document the things that work today and add new abstractions that assist with cases that are simpler, but would still benefit from a high-level abstraction? I think I’ve made my own view on that balance clear by now.

Alright, after a first round of discussion, I wanted to give my opinionated opinions about the main points of discussion that have been floating around, especially w.r.t. my earlier idea about protected scopes.

I wanted to be thorough in addressing the various concerns raised here, so please bear with this very long wall of text.

Automatically interrupt all running threads on SIGINT by default

While this may be a useful change in a number of scenarios, it is also backwards incompatible.

Suppose for the sake of argument that version 3.15 implements this change; then all Python programs that were using threads in Python <= 3.14 are required to consider that KeyboardInterrupt can now show up in any thread, not just the main one. So a refactoring needs to be performed before upgrading to 3.15. Whereas having a Thread.interrupt() method that isn’t invoked implicitly by Python allows individual codebases to designate their own paths towards using interrupts in threads, should they desire to do so.

Don’t have uninterruptible scopes as a language feature, instead have a defer_interrupts context manager

Something along these lines:

with threading.defer_interrupts():
    dont_interrupt_me()
maybe_interrupt_here()

First off, note that this context manager must be implemented in C, without the additional language support proposed above. That is, the code in its __enter__ and __exit__ methods must itself not be interruptible, otherwise the context manager would fail to provide the very guarantee it exists for.

Secondly, however it’s implemented, the VM must be informed about the beginning and end of those deferred-interruption scopes, so the VM needs a way to keep track of them. This, I think, would end up recreating the previously described stack of protected scopes, or something else that still needs to be stored in PyThreadState. Therefore, providing a language feature for this use case would be no more costly (implementation-wise) than providing a special context manager.

Furthermore, a language-level feature would allow the protection to apply to all existing context managers already in use. I think that eventually, assuming the change is carried out, the inverse problem would apply: “Did you remember to wrap that context manager in threading.defer_interrupts?”

Instead, it should be fairly straightforward to implement this context manager in Python with the additional language support, so that the protection can be extended beyond the __enter__, __exit__, and finally scopes:

import contextlib

@contextlib.contextmanager
def defer_interrupts():
    try:
        pass
    finally:
        # Under the proposed semantics, code in a finally block is a
        # protected scope, so the with-body (which runs at this yield)
        # executes with interrupts deferred.
        yield

Add a Thread.interrupt method, but don’t add interrupt-protected scopes

I think it would be hard for this change to get accepted, because:

  • this behavior in other languages has already been painfully deprecated (e.g. Java’s Thread.stop);
  • the “naive misuse” alluded to in the docs of PyThreadState_SetAsyncExc is exactly the case in which necessary cleanup code never gets executed because an exception (interrupt) gets raised out-of-thin-air at just the wrong time.

In essence, this would likely become a footgun-printing machine that programmers must be warned against misusing. The addition of protected scopes does not magically make all programs safe, but it provides a language-level mechanism to guarantee the execution of necessary cleanup code, should there be any.
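
For the curious, the escape hatch already reachable today (via ctypes) and precisely why it is a footgun: the exception lands at an arbitrary bytecode boundary in the target thread, wherever that happens to be. Interrupted and worker are illustrative names:

```python
import ctypes
import threading
import time

class Interrupted(Exception):
    """Illustrative interrupt exception, raised 'out of thin air'."""

caught = []

def worker():
    try:
        while True:           # pretend this is real work
            time.sleep(0.01)
    except Interrupted:
        caught.append(True)   # landed at an arbitrary point in the loop

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)

# The existing C-API hatch, reachable via ctypes.  The return value is
# the number of thread states modified (1 on success).  The exception
# could just as easily have landed in the middle of cleanup code.
modified = ctypes.pythonapi.PyThreadState_SetAsyncExc(
    ctypes.c_ulong(t.ident), ctypes.py_object(Interrupted))
t.join(timeout=5)
```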

Instead of adding Thread.interrupt, add an interruptible thread subclass

I agree with what @gcewing has already said.

While this can be very useful to distinguish which code was written with interruptions in mind or not, I think it would also become problematic in a future where libraries always have to consider whether they’re being passed a Thread or a special InterruptibleThread, and have to avoid using interrupts in the first case.

Furthermore, libraries may get passed callables directly, in which case it’s up to the library to decide whether to dispatch the callable to a Thread, or an InterruptibleThread. And the library would essentially be forced to explicitly state in its documentation which one will be used. That would be no different than having an interrupt method in Thread and having the library explicitly state whether or not it will be used.

OTOH, if a codebase manages threads directly without third-party libraries, then it is up to the codebase’s own chosen standards whether to allow calling .interrupt() or not. Possibly, an existing codebase would never allow using that method if the cost of refactoring was too high. Regardless, the problem of upholding codebase-wide standards wouldn’t change if the codebase had to both use InterruptibleThread and then call .interrupt(), or just call .interrupt() on a regular Thread.

A keyword argument passed to the Thread constructor would be no different than having an InterruptibleThread class, in this sense.

User-defined Thread subclasses may already have an interrupt method

Then, those subclasses need not be changed. The existing interrupt method would get called instead of Thread.interrupt.

If the existing subclasses wanted to additionally use the new Thread.interrupt method they would only need to add a call to super, using the existing idioms.

This holds regardless of the name of the added method, although Thread.interrupt has come up repeatedly by multiple people, so I’ll take that as a sign that it is a good name.
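
To make the resolution order concrete: if a (hypothetical) Thread.interrupt were added, an existing subclass with its own interrupt keeps its semantics unchanged, and opting in would be a single super() call:

```python
import threading

class MyThread(threading.Thread):
    """A pre-existing user subclass with its own interrupt() semantics,
    here a flag the target is expected to poll cooperatively."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._stop_requested = threading.Event()

    def interrupt(self):
        # Normal method resolution finds this method, not the
        # (hypothetical) base-class Thread.interrupt.
        self._stop_requested.set()
        # Opting in to the new behaviour would be one line:
        # super().interrupt()
```

Today the base class has no interrupt method at all, so the subclass shadows nothing.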

Only have interruptions as a package on PyPI

Without language-level protections implemented with cooperation from the VM, this would be no better than what was already outlined above in Add a Thread.interrupt method, but don’t add interrupt-protected scopes.

Don’t defer interruptions at all, prioritize stopping the thread

Interruptions need not necessarily correlate with program shutdown. They may also be used as a way to handle timeouts, or other cancellations, possibly in the context of structured concurrency. In such a context, it may be important to a program that a thread stops for the sake of efficiency, when e.g. an operation is no longer necessary.

It may very well make sense for a thread to get interrupted, and decide to carry on doing something else. Thus, it is important that the thread maintains a consistent state, possibly by making use of the newly defined protected scopes.

ExitStack

This may already be the case if ExitStack was being used by the main thread and a KeyboardInterrupt was raised at just the wrong time inside enter_context(). With the proposed defer_interrupts() context manager it would be possible to fix the existing issue:

    def enter_context(self, cm):
        """Enters the supplied context manager.

        If successful, also pushes its __exit__ method as a callback and
        returns the result of the __enter__ method.
        """
        # We look up the special methods on the type to match the with
        # statement.
        cls = type(cm)
        try:
            _enter = cls.__enter__
            _exit = cls.__exit__
        except AttributeError:
            raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object does "
                            f"not support the context manager protocol") from None
        with defer_interrupts():
            result = _enter(cm)
            self._push_cm_exit(cm, _exit)
        return result

As a general rule, I’d keep these stdlib usages outside the scope of a possible draft PEP, instead keeping the PEP about the language change itself and leaving further uses of the change to specific issues on the cpython repo.

Have the context manager constructors also be part of the protected scope

This can be very useful to allow open to also enjoy this protection. In this snippet, the file is opened before the __enter__() call:

with open("spam.txt") as f:
    f.read()

I’d have to start writing a reference implementation to understand how hard it is to do, but I’m in favor.

Thread.interrupt should interrupt pending I/O, in addition to raising an exception

I also agree with this one. I think it may also be the current behavior; I’ll check. If it is, then there’s little to be done in the way of change here. Otherwise, I’ll weigh the implementation cost.

with and finally aren’t special enough to deserve special treatment

That may be so.

Let’s suppose, though, that we did want to implement thread cancellation, and also had the problem of cleanup handlers at heart. What do others do to implement this feature?

Let’s take POSIX which has a pthread_cancel API. It is noted that:

When a cancelation request is acted on, the following steps occur for thread (in this order):

  1. Cancelation clean-up handlers are popped (in the reverse of the order in which they were pushed) and called. (See pthread_cleanup_push(3).)
  2. Thread-specific data destructors are called, in an unspecified order. (See pthread_key_create(3).)
  3. The thread is terminated. (See pthread_exit(3).)

So to implement something similar we need a stack of clean-up handlers, to be pushed by the thread as it executes its code, and possibly popped by the OS if the thread gets cancelled. Or, popped by the thread if it doesn’t get cancelled.

This strongly resembles the with statement.

Maybe it’s not special enough to deserve special treatment, but it is no more and no less than what is needed to cleanup a thread that’s being cancelled.
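
The resemblance can be made concrete with contextlib.ExitStack, which maintains exactly such a LIFO stack of cleanup handlers:

```python
from contextlib import ExitStack

# Each callback pushed onto the stack is a cleanup handler; on exit
# they run in reverse push order, whether the block exits normally or
# via an exception, mirroring pthread_cleanup_push/pop semantics.
order = []
with ExitStack() as stack:
    stack.callback(order.append, "first pushed")
    stack.callback(order.append, "second pushed")

print(order)
```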

Regardless of the fact that with is exactly what’s needed to implement safe thread interruptions, it is already the de facto recommended best practice for resource cleanup. So providing extra protections to this statement may very likely improve the safety of already existing code, at the cost of slightly delayed KeyboardInterrupt exceptions.

What precisely are the semantics being proposed for thread cancellation?

The concern was raised here.

I would propose thread interruption, rather than cancellation, to be the ability granted to any thread to call a new Thread.interrupt method, which acts in a similar way to the current semantics for KeyboardInterrupt with additional guardrails for protected scopes.

Specifically, a call to Thread.interrupt would change the thread’s state to track that an exception (subclass of BaseInterrupt) is to be raised at some point. (Keeping it intentionally vague as it is now for KeyboardInterrupt.) Additionally, all subclasses of BaseInterrupt (including KeyboardInterrupt) may not be raised while a protected scope (__enter__, __exit__, and finally) is active. Thus, the interrupt sent to a thread will be deferred until the thread leaves all protected scopes.

Unsurprisingly, this doesn’t magically guarantee that the program is bug-free or infinite-loop-free. It is the job of the programmer to make sure that the interruption is received without bugs and as promptly as desired.

I heavily agree with Paul’s points here. There’s been a few situations where I wanted to cancel threads and every time I got really frustrated with the currently existing workarounds. Yes, there are deep technical difficulties in specific situations, but at the very least in all of my use cases those would never have come up. Of course, a solution that cleanly handles resource management would be perfect. But if that’s not doable, I’d rather have something that can be unsafe in the wrong circumstances than nothing at all.

Most execution time is spent in the actual working code, but for these issues to even come up the cancelling exception needs to be raised in a cleanup block. So even with no guardrails, things should work out fine the vast majority of the time. And in the cases where it doesn’t, something like a context manager that delays the raising of the exception until it’s over will be enough. Having to add safety code manually isn’t a great design principle, but if it’s not needed all that often, that should be fine.
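
Such a delaying context manager can already be sketched today for KeyboardInterrupt, though only in the main thread and only on platforms where SIGINT is catchable (deferred_sigint is an illustrative name; a general Thread.interrupt equivalent would need the VM support discussed above):

```python
import signal
from contextlib import contextmanager

@contextmanager
def deferred_sigint():
    """Record Ctrl-C during the block and raise KeyboardInterrupt only
    once the block has finished (main thread only)."""
    pending = []
    previous = signal.signal(signal.SIGINT, lambda s, f: pending.append(s))
    try:
        yield
    finally:
        # Restore the prior handler, then deliver any deferred interrupt.
        signal.signal(signal.SIGINT, previous)
        if pending:
            raise KeyboardInterrupt
```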

5 Likes

If it’s done properly, it shouldn’t mask anything. A read or write
system call only returns EINTR if it was interrupted before any data was
transferred. Otherwise, it returns normally and reports the number of
bytes transferred. The state of the system is known in either case.

And if the caller needs to distinguish between an interrupt and a failed
I/O operation, they can always inspect the exception raised.

I had totally missed that this existed:

Seems worthwhile to pick it up again.

2 Likes

I really think the concerns about new interrupts breaking uninterruptible code are misguided. There are loads of signals that can kill a process, some of which can’t be stopped (those wonderful antiviruses firing SIGKILLs around at random), and of those that can be stopped, some can only be stopped via ctypes (Windows’s CTRL_CLOSE_EVENT when the terminal is closed, or CTRL_SHUTDOWN_EVENT when the machine is shut down) and have very short timeouts before the OS jumps in and force-kills the process anyway.

If you’re adding a SIGINT handler to protect critical code, then your code is already broken. Robust, uninterruptible code should focus on recovery, using some form of set-restore-point, write, commit, clear-restore-point process like SQLite, which can withstand anything from SIGKILL to power failure without corrupting itself. The most that should be expected from a SIGINT handler is an opportunistic, more graceful abort, like getting to the next checkpoint or flushing the last few seconds of the user’s work into the recovery file.

2 Likes

I don’t entirely disagree, but the people who have to care about this also tend to run such code on systems where they actually get to control the timeout. Things that receive a signal to shut down should shut down, but there is a massive difference between tossing a new interrupt into the works and proper scheduling with checking for signals when writing code like this. Proper scheduling allows for significantly more reliable handling of that shutdown, and also code that is significantly easier to reason about and doesn’t break when a signal is sent twice for whatever reason.

1 Like

You missed the point here. If you can interrupt this with a new synthesized cancellation exception, you aren’t getting the exception that would inform you of this or the number of bytes, you’re getting the interrupt/cancellation exception. You’ve interrupted the place where the program would get that with something else.

If some bytes have been read or written, you don’t get an exception at
all, you just get a normal return that indicates the number of bytes.
The interrupt exception remains pending until the critical section has
finished.

Some finesse may be required to handle the case where the critical
section goes on to make another blocking I/O call. What probably needs
to happen is that if you’re about to make a blocking call with an
interrupt exception pending, the exception gets raised at that point, as
if the call had returned EINTR.

I’m wondering whether interrupt exceptions should be disabled when an interrupt occurs until explicitly re-enabled. It’s not really helpful to get a second interrupt when you’re busy cleaning up after the first one.