I’m trying to build an application that scrapes webpages and loads the resulting data into a sqlite database. The synchronous version of it is basically as follows:
    for page in get_webpage_data():
        insert_into_db(connection, page)
There’s basically no point in making the database insert asynchronous, as sqlite doesn’t support multiple concurrent writers. Putting the database work into a thread has problems as well (tidying up the thread on errors, interrupt handling), which are basically just unnecessary work and trouble. (And in case anyone suggests it, aiosqlite puts the database on a thread, and the threading problems I hit from that are why I’m looking for alternatives…)
But getting the webpages is absolutely the sort of thing I’d want to use asyncio for - an async version of that code runs far faster than a synchronous version.
What I’d ideally like is a way of getting the asyncio loop to “run long enough to return one value”, then pause while I process that value, then continue, and so on. That way, I could write the internals of get_webpage_data() asynchronously, without needing to change the rest of my code. Yes, this would mean my web scraping can’t happen while my database updates are running, but I’m willing to pay that price (at least initially - if the price turns out to be too high, I can still switch to running the DB in a thread if I need to).
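To make that concrete, here is a rough sketch of the shape I have in mind. It assumes the loop can be driven one result at a time by calling `loop.run_until_complete()` repeatedly on an async generator. I don't know whether that is a recommended pattern. The names `fetch_page`, `_pages`, `_next_or_done`, `_DONE` and the `urls` argument are made up for illustration, and `fetch_page` just sleeps instead of doing a real HTTP request.

```python
import asyncio

_DONE = object()  # hypothetical sentinel: "the async generator is exhausted"


async def fetch_page(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def _pages(urls):
    # Start every fetch up front, then yield results as they complete.
    tasks = [asyncio.create_task(fetch_page(u)) for u in urls]
    try:
        for fut in asyncio.as_completed(tasks):
            yield await fut
    finally:
        # If the consumer stops early, cancel whatever is still pending.
        for t in tasks:
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def _next_or_done(agen):
    # Fetch the next item, or the sentinel once the generator finishes.
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


def get_webpage_data(urls):
    # Ordinary (synchronous) generator wrapping the async internals.
    loop = asyncio.new_event_loop()
    agen = _pages(urls)
    try:
        while True:
            # Run the loop only until one page is available...
            page = loop.run_until_complete(_next_or_done(agen))
            if page is _DONE:
                break
            # ...then hand it to the caller; the loop is idle meanwhile.
            yield page
    finally:
        loop.run_until_complete(agen.aclose())
        loop.close()
```

With something like this, the calling loop would stay synchronous (`for page in get_webpage_data(urls): ...`), and the fetches would only make progress while the caller is waiting for the next page, which is the trade-off described above.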
Is it possible to do something like this? I’ve tried checking the documentation, and couldn’t see anything, but honestly I don’t know if my lack of understanding means I can’t frame the question well enough, or if there’s a fundamental reason why what I want to do isn’t possible.
Can anyone help?