I have a frustrating problem. I’m writing an application that downloads files, unzips them and extracts a single file from the archive. I want to use asyncio to manage the downloading. See below for the background on why I’m using asyncio, but for now let’s take asyncio as given.
Because I’m ignoring 99% of the downloaded zipfile, I’d rather not download what I don’t need. Pip includes a rather nice “lazy zipfile” implementation that handles this using a filehandle that’s backed by a temporary file where chunks of data are downloaded on demand using range requests.
I’d like to use that if I could. It’s easy to implement the downloads as async code, but the problem is that the downloads are triggered by calls to the file object’s read() method, which is not async. And I don’t really have control over that - the stdlib zipfile module isn’t async, so it expects to make sync calls.
So the control flow is:
My code (async) → zipfile.read(name) (sync) → download (async)
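To make the sync middle of that chain concrete, here is a minimal sketch (not pip's implementation) of a file object whose read() fetches byte ranges on demand. The `LazyRangeFile` class and the `fetch_range` callable are made-up names, and the "download" is just a slice of an in-memory archive so the example runs offline. It also skips the temporary-file cache and simply fetches whatever each read() asks for.

```python
import io
import zipfile


class LazyRangeFile(io.RawIOBase):
    """Read-only, seekable file object that fetches byte ranges on demand."""

    def __init__(self, size, fetch_range):
        super().__init__()
        self._size = size
        # Hypothetical callable: fetch_range(start, end) -> bytes for [start, end).
        self._fetch_range = fetch_range
        self._pos = 0
        self.bytes_fetched = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def read(self, n=-1):
        start = min(self._pos, self._size)
        end = self._size if n is None or n < 0 else min(start + n, self._size)
        if end <= start:
            return b""
        # Over HTTP this would be a Range request; note that it is a sync call.
        data = self._fetch_range(start, end)
        self.bytes_fetched += len(data)
        self._pos = start + len(data)
        return data


def build_demo_archive():
    """Build a zip in memory: one large member we ignore, one small one we want."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("big.bin", bytes(200_000))
        zf.writestr("wanted.txt", "hello")
    return buf.getvalue()


def read_member_lazily(blob, name):
    """Read one member through LazyRangeFile and report how many bytes were fetched."""
    lazy = LazyRangeFile(len(blob), lambda start, end: blob[start:end])
    with zipfile.ZipFile(lazy) as zf:
        return zf.read(name), lazy.bytes_fetched
```

Every fetch here happens inside a plain sync read() call made by zipfile, which is exactly the spot where the async download would need to go.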
What I’d like to be able to do is something like:
async def getfile(url):
    with LazyZipfile(url) as z:
        await somehow_tell_sync_code_its_being_awaited(z.read(name))


class LazyHandle:
    def __init__(self, url):
        self.url = url
        self.fh = make_tempfile()

    def read(self, n):
        start, end = self.calculate_range(n)
        somehow_await_in_sync_code(self.download(start, end))
        return self.fh.read(n)

    async def download(self, start, end):
        ...  # Make a range request to self.url


class LazyZipfile:
    def __init__(self, url):
        self.lazyhandle = LazyHandle(url)
        self.prepare()  # Problem - see below, this really wants to be async too
        return ZipFile(self.lazyhandle)
But of course this doesn’t work, as read() isn’t async.
Short of rewriting zipfile to add await before all read calls, and cascading that upwards (anything with an await now needs an async def and itself needs to be called with await), is there anything I can do to make this work? I fear the answer is “no”, and I’m going to have to abandon asyncio for this project.
I could defer the sync code to a thread somehow, but I’m actively trying to avoid threads here, because if I’m going to use them, I might as well just do the whole thing with threading. Also, it’s not at all obvious to me how I’d defer just the middle bit of that async → sync → async “sandwich” to a thread.
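For reference, the thread-based version of the sandwich could look roughly like the sketch below, assuming Python 3.9+ for `asyncio.to_thread`. The sync middle layer runs in a worker thread, and its read() uses `asyncio.run_coroutine_threadsafe` to hand the download back to the event loop. Here `sync_consumer` is a made-up stand-in for zipfile, and `download` just sleeps and slices an in-memory payload instead of making a range request.

```python
import asyncio


class LazyHandle:
    """Sync file-like object whose read() calls back into the event loop."""

    def __init__(self, loop, payload):
        self.loop = loop
        self.payload = payload  # stands in for the remote file
        self.pos = 0

    async def download(self, start, end):
        await asyncio.sleep(0.01)  # placeholder for an async range request
        return self.payload[start:end]

    def read(self, n):
        # Runs in a worker thread: schedule the coroutine on the loop's thread
        # and block this worker (not the loop) until the result is ready.
        future = asyncio.run_coroutine_threadsafe(
            self.download(self.pos, self.pos + n), self.loop
        )
        data = future.result()
        self.pos += len(data)
        return data


def sync_consumer(handle):
    """Stand-in for zipfile: plain sync code that calls read()."""
    return handle.read(5) + handle.read(6)


async def getfile(payload):
    handle = LazyHandle(asyncio.get_running_loop(), payload)
    # The sync middle layer runs in a thread so the loop stays free.
    return await asyncio.to_thread(sync_consumer, handle)


async def main():
    return await asyncio.gather(getfile(b"hello world"), getfile(b"HELLO WORLD"))


results = asyncio.run(main())
```

This keeps the downloads themselves on the event loop, but it does still need one worker thread per archive being read, and cancelling the awaiting task does not stop that thread, so it doesn't address the Ctrl-C concern.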
The only other thing I can think of is to wrap read in the event loop’s run_until_complete, but that defeats the object, as then (I assume) all other downloads will be blocked until the inner event loop completes.
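A quick check suggests the nested call doesn't even get as far as blocking. As far as I can tell, CPython's standard event loop refuses to run run_until_complete while it is already running, and raises RuntimeError instead. The sketch below uses made-up `download` and `caller` coroutines to show that.

```python
import asyncio


async def download():
    return b"data"


async def caller():
    loop = asyncio.get_running_loop()
    coro = download()
    try:
        # Sync-style call made while the loop is already running this coroutine.
        return loop.run_until_complete(coro)
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"
    finally:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine


outcome = asyncio.run(caller())
```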
Going back to why I’m using asyncio, there are fundamentally two reasons. First, I want to work out when asyncio is a good fit for a problem. My feeling is that this experience says “only as long as you’re willing to rewrite any part of the stack that is currently sync but might need to call back into async code”, which is a rather big blocker for anything that might want to treat files and web requests similarly. My second reason is that interrupt handling is so much cleaner in async code - handling Ctrl-C in threaded code is an utter pain in my experience, and not being able to stop massive download jobs is hardly ideal. If it weren’t for the inability to cancel threads when I get a Ctrl-C, I’d probably have just used threads for this project and never looked at asyncio (apart from as an “I wonder what all the fuss is about” learning exercise).
I don’t think any of this is particularly new - I’ve read lots of articles that talk about “async all the way down” - but I think I’d hoped it would be less of a problem in practice, and I’m finding that might not be the case, unfortunately.
As something of a side issue, I also discovered that __init__ can’t be declared as async. Given that my prepare method above does some downloading, what’s the correct approach to writing classes that want to call async code during initialisation? I really don’t like the pattern of creating incomplete classes that need a finish_init call before they are usable. All I can think of is to use a factory function like the following, but that just hides the problem rather than solving it…
class Example:
    def __init__(self):
        ...  # do sync init

    async def async_init(self):
        ...  # do async init


async def make_example():
    e = Example()
    await e.async_init()
    return e


# Now, always use this:
my_obj = await make_example()
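One variation on the factory idea, sketched on the assumption that the async work can be done before the object exists: an async classmethod awaits everything first and only then calls `__init__`, so a half-initialised instance is never visible. The names `create` and `_fetch` are hypothetical, and `_fetch` is only a placeholder for the real async setup.

```python
import asyncio


class Example:
    def __init__(self, data):
        # Sync-only init: receives values that were already awaited.
        self.data = data

    @classmethod
    async def create(cls):
        data = await cls._fetch()  # all async work happens before __init__
        return cls(data)

    @staticmethod
    async def _fetch():
        await asyncio.sleep(0)  # placeholder for real async setup
        return "ready"


async def main():
    return await Example.create()


my_obj = asyncio.run(main())
```

It is still a factory, so it arguably hides the problem in the same way, but at least every `Example` that exists is fully initialised.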