I asked Google to export a couple of labels from my Gmail messages using their Takeout service. I had it drop the files into my Google Drive, which I mounted locally. I expected the result to be split into multiple mbox files of no more than 4 GB each, but instead I got one file per label: one of 98 GB and one of 40 GB. “Okay, let’s see what happens anyway.” Eventually I want to filter out messages that match certain criteria (who the message was sent to or from, date range, etc.). For now, though, I would be content just to iterate over them successfully. I tried this:
>>> from mailbox import mbox
>>> mailbox = mbox("bikes-004.mbox", create=False)
>>> it = mailbox.iteritems()
>>> msg = next(it)
That last statement took approximately forever; I killed it after a couple of minutes. I think it was trying to ingest the entire file instead of just identifying the next message and handing that back to me. I checked the file, and it does indeed have the canonical From_ initial line preceding each message, which I associate with mbox files.
It seems the code in the mbox class eventually descends to _generate_toc, which does indeed slurp in the entire file to generate a table of contents. Are there other options for reading a Unix mbox file more incrementally?
The environment is a Mac running Python 3.12.3.
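To make "more incrementally" concrete, here is a minimal sketch of the kind of streaming read I have in mind. It is only an illustration under some assumptions: `iter_mbox_messages` is a hypothetical helper of my own (not part of the `mailbox` module), it treats every line beginning with `From ` as a message separator, and it does not undo any `>From ` quoting in message bodies. The sample data is made up so the snippet can run on its own.

```python
import email
import io
from email.message import Message
from typing import BinaryIO, Iterator


def iter_mbox_messages(fileobj: BinaryIO) -> Iterator[Message]:
    """Yield one parsed message at a time from a binary mbox stream.

    Hypothetical helper: assumes any line starting with b"From " begins
    a new message, and leaves ">From " quoting in bodies untouched.
    """
    lines: list[bytes] = []
    in_message = False
    for line in fileobj:  # reads line by line, never the whole file
        if line.startswith(b"From "):
            if in_message:
                # The previous message is complete, so hand it back now.
                yield email.message_from_bytes(b"".join(lines))
            lines = []
            in_message = True
            continue  # skip the "From " separator line itself
        if in_message:
            lines.append(line)
    if in_message:
        # Flush the final message once the stream is exhausted.
        yield email.message_from_bytes(b"".join(lines))


# Made-up two-message mailbox standing in for the real file.
sample = (
    b"From alice@example.com Mon Jan  1 00:00:00 2024\n"
    b"From: alice@example.com\n"
    b"Subject: first\n"
    b"\n"
    b"hello\n"
    b"\n"
    b"From bob@example.com Tue Jan  2 00:00:00 2024\n"
    b"From: bob@example.com\n"
    b"Subject: second\n"
    b"\n"
    b"world\n"
)

subjects = [msg["Subject"] for msg in iter_mbox_messages(io.BytesIO(sample))]
first_sender = next(iter_mbox_messages(io.BytesIO(sample)))["From"]
```

Against the real file I would pass `open("bikes-004.mbox", "rb")` instead of the `io.BytesIO` object. Only one message is held in memory at a time, at the cost of losing the keyed, random access that the mbox class provides.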