Iterating over huge mbox file not working as I expected

smontanaro · May 6, 2024, 2:08pm

I asked Google to dump a couple of labels from my Gmail messages using their Takeout service. I had it drop the files into my Google Drive, which I mounted locally. Although I expected the result to be split into multiple mboxes of no more than 4 GB each, I got one of 98 GB and one of 40 GB, one for each label. “Okay, let’s see what happens anyway.” I eventually want to filter out messages which match certain criteria (who the message was sent to or from, date range, etc.). For now, though, I would be content to just iterate over them successfully. I tried this:

>>> from mailbox import mbox
>>> mailbox = mbox("bikes-004.mbox", create=False)
>>> it = mailbox.iteritems()
>>> msg = next(it)

That last statement took approximately forever. I killed it after a couple minutes. I think it was actually trying to ingest the entire file instead of just identifying the next message and handing that back to me. I checked the file and it does indeed have the canonical From_ initial line preceding each message, which I associate with mbox files.

It seems the code in the mbox class descends eventually to _generate_toc, which does, indeed, slurp in the entire file to generate a table of contents. Are there other options for reading a Unix mbox file more incrementally?

The environment is a Mac running 3.12.3.

cameron · May 6, 2024, 10:53pm

Your best bet is probably to scan the mbox yourself into a str
containing a single message, then make a message object using
email.parser.Parser().parsestr(msgtext). Mbox files are easy to split
up on the From_ lines. See: email.parser: Parsing email messages — Python 3.12.3 documentation

Notes:

- you might want the BytesParser class if you get encoding issues with
  Parser
- you might want to pass the headersonly=True parameter if you don’t
  care about the message body
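
A minimal sketch of that approach, in case it helps. The function name iter_mbox_messages is made up for illustration and is not part of the standard library. It assumes every line that starts with "From " is a message separator, i.e. that body lines beginning with "From " were escaped as ">From " when the mbox was written, which may not hold for every file. It uses BytesParser and reads in binary mode to sidestep the encoding issues mentioned above.

```python
from email.parser import BytesParser


def iter_mbox_messages(path, headersonly=False):
    """Yield one message object at a time from an mbox file.

    Hypothetical helper: reads the file line by line and keeps only the
    current message in memory, rather than building a table of contents.
    """
    parser = BytesParser()
    lines = None  # None until the first From_ separator has been seen
    with open(path, "rb") as f:
        for line in f:
            # Assumption: a line starting with "From " begins a new message.
            if line.startswith(b"From "):
                if lines:
                    yield parser.parsebytes(b"".join(lines), headersonly=headersonly)
                lines = []  # start collecting the next message, dropping the separator
            elif lines is not None:
                lines.append(line)
    # Emit the final message, which has no separator after it.
    if lines:
        yield parser.parsebytes(b"".join(lines), headersonly=headersonly)
```

Looping over iter_mbox_messages("bikes-004.mbox", headersonly=True) and looking at msg["From"], msg["To"] or msg["Date"] should then hand back the first message right away. Each message is still read in full before it is parsed, so a single enormous message will still cost memory, and the blank line that separates messages in the file ends up at the end of each body.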

smontanaro · May 6, 2024, 11:50pm

Thanks. I realized pretty quickly that creating my own reader made sense and wouldn’t be difficult (< 10 lines). No comment on the current iterator implementation.

blhsing · May 7, 2024, 6:22am

The mailbox module could indeed use a major rewrite so that it iterates lazily wherever applicable.