One other piece of advice: if you need to scrape a lot of web pages, I would use a semaphore or some other form of limiting concurrency so you’re only fetching a limited number at a time (although that number can still be high). I remember spawning a thousand coroutines to do a similar workload at once, and a bunch failing due to various timeouts.
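For anyone wanting to try the semaphore approach, here is a minimal sketch. The `fetch` coroutine is a hypothetical stand-in for a real HTTP request (it just sleeps), and the names `fetch_all` and `limit` are made up for illustration; only `asyncio.Semaphore` and `asyncio.gather` are the real API.

```python
import asyncio


async def fetch(url: str) -> str:
    # Hypothetical placeholder for a real HTTP request; swap in your client.
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def fetch_all(urls, limit=20):
    # At most `limit` fetches are in flight at any moment.
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


def demo():
    urls = [f"https://example.invalid/page/{i}" for i in range(100)]
    return asyncio.run(fetch_all(urls, limit=10))
```

All the coroutines are still created up front, but only `limit` of them get past the semaphore at once, which should keep the remote end (and your own timeouts) happier.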
@pf_moore When I’ve had to do this in the past, I’ve used aiosqlite. The main advantage (for me) is that because everything is async, you don’t have to deal with crossing the sync/async boundary from the perspective of application design.
edit: oops I see you’re already aware of it, carry on.
It doesn’t interrupt the sleep in the background thread, so if I increase the sleep to 20 seconds, the process continues running for 20 seconds after I hit Ctrl-C.
For sqlite I may be able to get around this by clever use of connection.interrupt() but the general problem remains. (It’s not an asyncio issue as such, it’s basically a fundamental issue with threads, and why “defer to a thread” isn’t always as useful an approach as you might want.)
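As a rough sketch of what the `connection.interrupt()` idea might look like: the query runs in a worker thread and another thread calls `interrupt()` on the same connection. The endless recursive CTE is just an artificial stand-in for a slow query, and `run_slow_query` and `demo` are hypothetical names. This only shows the sqlite side; it does nothing about wiring Ctrl-C to the interrupt.

```python
import sqlite3
import threading
import time


def run_slow_query(conn, result):
    # An endless recursive CTE, standing in for a long-running query.
    try:
        conn.execute(
            "WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n) "
            "SELECT count(*) FROM n"
        ).fetchone()
        result["outcome"] = "finished"
    except sqlite3.OperationalError as exc:
        # An interrupted statement surfaces as OperationalError.
        result["outcome"] = f"aborted: {exc}"


def demo():
    # check_same_thread=False lets the worker use a connection
    # that was created in this thread.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    result = {}
    worker = threading.Thread(target=run_slow_query, args=(conn, result))
    worker.start()
    time.sleep(0.2)  # give the query a moment to get going
    # Retry in case the first interrupt() lands before the query starts.
    while worker.is_alive():
        conn.interrupt()
        worker.join(timeout=0.1)
    conn.close()
    return result
```

The worker thread returns promptly once the interrupt is seen, which is exactly what a plain `time.sleep` in a thread cannot offer.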
Thanks, that’s a good tip, I’ve been getting the odd timeout and didn’t realise why.
Control-C manages to interrupt time.sleep because there’s some platform-specific machinery in CPython’s low-level signal handling code and in time.sleep that manually hooks them together. If you search for _PyOS_SigintEvent in the CPython source you can see the details. It’s a very manual kind of integration. I think if you wanted to make something similar work for the sqlite3 module, you’d have to change both upstream sqlite and CPython’s low-level signal handling code. It’s one of those things where it seems like it ought to be simple, but then you open the box and this ocean of gnarliness spills out.
A regular exception inside a user task is totally fine of course. But KeyboardInterrupt is special and weird, because it can suddenly materialize at any arbitrary bytecode instruction in your program. And I’m pretty sure that a KeyboardInterrupt at the wrong time can in fact corrupt asyncio’s internal data structures. It probably won’t, like, burn down your house or anything, but if asyncio is in the middle of manipulating some internal data structure, then a KeyboardInterrupt in the middle of that will generally leave the structure in an inconsistent state.
That’s what Trio does, but accomplishing this requires deep wizardry and hooks throughout Trio’s internals to detect when a control-C happens at an “unsafe” moment, and delaying the KeyboardInterrupt until it’s safe to handle. asyncio doesn’t have anything similar.
In asyncio, IIUC, if you want to guarantee fully defined behavior on control-C, then the official way is to use loop.add_signal_handler to convert control-C into a regular asyncio event that you can handle however you like. Unfortunately, this isn’t implemented on Windows…
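A minimal sketch of the `loop.add_signal_handler` approach, assuming a Unix event loop (on Windows the call raises `NotImplementedError`). The `stop` event and the "wait until asked to stop" body are hypothetical placeholders for whatever the application actually does.

```python
import asyncio
import signal


async def main():
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()

    # Unix only. From here on, control-C no longer raises KeyboardInterrupt;
    # it just sets the event from inside the event loop.
    loop.add_signal_handler(signal.SIGINT, stop.set)
    try:
        # Hypothetical application body: run until asked to stop.
        await stop.wait()
        # Tidy-up (closing connections and so on) would go here,
        # running as ordinary asyncio code.
        return "clean shutdown"
    finally:
        loop.remove_signal_handler(signal.SIGINT)
```

You would start this with `asyncio.run(main())`. Because the handler runs as a normal loop callback, the interrupt can only be observed at points where the loop is in a consistent state.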
When calling sqlite synchronously in the main thread: you want to make it so that control-C causes sqlite3_interrupt to be called immediately, without waiting for the current operation to finish. How can you wire these together? You can’t use the signal module to register a signal handler, because Python doesn’t run signal handlers while the main thread is blocked in C code…
When calling sqlite in a worker thread from asyncio: as described above, asyncio doesn’t guarantee that the event loop keeps functioning at all after control-C, so there’s not much point in trying to make stronger guarantees about specific operations… Also, just in general, I’m not sure how to define custom cancellation code in asyncio, because when integrating with other systems like threads you have to use Future. And Future's cancellation semantics kind of hard-code that when a Future is cancelled, the work actually keeps going in the background. In particular, Future.cancel immediately resolves the Future as cancelled, so you can’t wait for the work to clean up, and there’s no easy way to detect when Future.cancel has been called so you can issue a sqlite3_interrupt or anything.
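To make the Future point concrete, here is a small sketch. `blocking_work` is a hypothetical stand-in for a blocking call handed to a thread via `run_in_executor`, and the `threading.Event` objects exist only so the demo can observe what happened and when.

```python
import asyncio
import threading


async def demo():
    loop = asyncio.get_running_loop()
    started = threading.Event()
    release = threading.Event()
    finished = threading.Event()

    def blocking_work():
        # Stand-in for a blocking call such as a long sqlite query.
        started.set()
        release.wait(timeout=10)
        finished.set()

    fut = loop.run_in_executor(None, blocking_work)
    while not started.is_set():
        await asyncio.sleep(0.01)

    # Cancelling marks the asyncio Future as cancelled straight away...
    fut.cancel()
    observed = {
        "future_cancelled": fut.cancelled(),
        # ...but the thread has not been stopped or even notified.
        "work_finished_at_cancel": finished.is_set(),
    }

    release.set()  # let the thread run to completion
    while not finished.is_set():
        await asyncio.sleep(0.01)
    observed["work_finished_later"] = finished.is_set()
    return observed
```

Running `asyncio.run(demo())` shows the Future reporting itself as cancelled while the thread is still busy, and the work only completing later once it is released. Nothing in that sequence gives the worker a hook to call something like `sqlite3_interrupt`.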
Wow, that seems pretty bad to me. You’re basically saying that if I write an asyncio program, and the user hits Ctrl-C, then I may not even get control back to my application in a way that will allow me to sanely exit? In the sense that I do all of my tidying up, whatever that may involve, but that’s still not good enough?
Worse and worse…
I shall read that with interest (and a fair amount of fear and trepidation)
None of this gives me any sense of security when it comes to writing a database-backed application with asyncio. If I can’t get enough control back to cleanly tidy up my database connection, I need to assume any Ctrl-C is going to act as a connection abort. Databases are robust by design, so they can survive this, but it feels like subjecting them to unnecessary levels of abuse…
I’m very carefully trying not to look at this in terms of framework comparisons, but this would be a big selling point to me for trio. Whether it’s enough to counterbalance the “every async library supports asyncio” question, I’m not sure I can judge yet.
Thinking about it, should this be raised as a bug against Python? Something like “asyncio doesn’t protect its internal structures against the end user hitting Ctrl-C while the event loop is running”.
I’m not particularly comfortable raising a bug where I can’t offer a test case to demonstrate the problem, and I can’t clearly express what I want to see happen, beyond "please write your code to protect against KeyboardInterrupt". But conversely, I don’t like the idea that this isn’t recorded anywhere as being an issue.