Pulling Data From Multiple Files

Hi everyone, I’m new to Python and this is my first post. I work in finance, and there is a task I would love to automate somehow, but I am frustrated trying to figure it out. I’d be grateful if someone could confirm whether it can even be automated. I’d be amazed if someone could provide a script.

So, in my role, I receive remittance advices via email. These are usually in either Word or PDF format. I then have to open each file to see which invoice numbers are being paid and their values. I enter these into a spreadsheet so I can reconcile payments received on my bank statement. I also like to rename the file using the customer name and total amount as a reference.

Is there a way to automate any of this? Many thanks in advance.

I assume the customer name is also in the remittance advice. And the total amount is coming from where?

In principle you can extract this info automatically, but how easy that is depends on the format of those remittance advices. Are they free, unstructured text? Or a fixed format? With an unstructured format you can still do this, but it becomes a bit trickier, and the tool must be able to detect when it is unsure whether it parsed the text correctly, so those files can be checked by hand.

Subtasks:

  • Scan downloaded emails for attachments
  • Determine attachment format (text, pdf, word doc)
  • Convert attachment to plain text[1]
  • Try to parse the text as remittance advice.
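
The first two subtasks can be sketched with the standard library alone. This is only a rough sketch, assuming the emails have been saved as raw message bytes (for example .eml files); the extension table, the function names, and the sample message are made up for illustration:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import Path

# Hypothetical mapping from file extension to a coarse format label.
KNOWN_FORMATS = {".pdf": "pdf", ".docx": "word", ".doc": "word", ".txt": "text"}


def list_attachments(raw_message: bytes):
    """Return (filename, format, payload_bytes) for each attachment in one email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        fmt = KNOWN_FORMATS.get(Path(name).suffix.lower(), "unknown")
        found.append((name, fmt, part.get_payload(decode=True)))
    return found


def build_sample_message() -> bytes:
    """Build a made-up email with one fake PDF attachment, for demonstration."""
    msg = EmailMessage()
    msg["Subject"] = "Remittance advice"
    msg["From"] = "customer@example.com"
    msg["To"] = "finance@example.com"
    msg.set_content("Please find our remittance advice attached.")
    msg.add_attachment(
        b"%PDF-1.4 dummy",
        maintype="application",
        subtype="pdf",
        filename="remit_0001.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    for name, fmt, payload in list_attachments(build_sample_message()):
        print(name, fmt, len(payload))
```

Going by file extension is the simplest check; if senders use odd file names, the part's content type (`part.get_content_type()`) is another signal you could look at.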

How difficult the parsing is depends on the general/usual format of those remittance advices. If those usually conform to a rigid format, you should be able to do this just with a regex. Otherwise, you need more advanced tokenization methods.
If you have quite a few samples where you have already extracted the invoice numbers etc., you could also train a neural net/classifier on your samples (or use an existing, pretrained LLM). But I’d first see how far you can get with one (or a few) regexes. Let us know if you need more hints about how to do that.
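
To give an idea of the regex route, here is a minimal sketch. The layout in SAMPLE is purely an assumption (I have not seen your files), so the patterns and names would need adapting. It returns None whenever a field is missing or the invoice amounts do not add up to the stated total, which is one simple way for the tool to signal that it is unsure. It also builds a file name from the customer name and total, as in the original question:

```python
import re
from decimal import Decimal

# Assumed layout, for illustration only; real remittance advices will differ.
SAMPLE = """\
Customer: Acme Ltd
Invoice INV-1001   1,250.00
Invoice INV-1002     349.50
Total: 1,599.50
"""

CUSTOMER_RE = re.compile(r"^Customer:[ \t]*(?P<name>.+?)[ \t]*$", re.MULTILINE)
INVOICE_RE = re.compile(
    r"^Invoice[ \t]+(?P<number>[A-Z]+-\d+)[ \t]+(?P<amount>[\d,]+\.\d{2})[ \t]*$",
    re.MULTILINE,
)
TOTAL_RE = re.compile(r"^Total:[ \t]*(?P<amount>[\d,]+\.\d{2})[ \t]*$", re.MULTILINE)


def to_decimal(text: str) -> Decimal:
    """Convert '1,250.00' to Decimal('1250.00')."""
    return Decimal(text.replace(",", ""))


def parse_remittance(text: str):
    """Return the parsed fields, or None if the text cannot be parsed confidently."""
    customer = CUSTOMER_RE.search(text)
    total = TOTAL_RE.search(text)
    invoices = [
        (m["number"], to_decimal(m["amount"])) for m in INVOICE_RE.finditer(text)
    ]
    if not customer or not total or not invoices:
        return None  # a field is missing: flag for manual review
    stated_total = to_decimal(total["amount"])
    if sum(amount for _, amount in invoices) != stated_total:
        return None  # amounts do not reconcile: flag for manual review
    return {"customer": customer["name"], "invoices": invoices, "total": stated_total}


def suggested_filename(parsed: dict, extension: str) -> str:
    """Build a file name from customer name and total, e.g. for renaming."""
    safe_name = re.sub(r"[^A-Za-z0-9]+", "_", parsed["customer"]).strip("_")
    return f"{safe_name}_{parsed['total']}{extension}"


if __name__ == "__main__":
    parsed = parse_remittance(SAMPLE)
    print(parsed)
    print(suggested_filename(parsed, ".pdf"))
```

The invoice rows can then be written to a spreadsheet, for instance as CSV with the standard `csv` module. Using Decimal rather than float avoids rounding surprises when comparing the sum to the total.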


  1. For PDF-to-text, see PyPDF2 on PyPI. Word-to-text can apparently be done with pypandoc, but I’ve never used it, so I can’t say more; there are probably other off-the-shelf tools for this as well.
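
For Word files, one alternative to an external tool (not pypandoc, just the standard library) relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. A rough sketch, assuming .docx rather than the older .doc format; the helper make_minimal_docx only fabricates a bare-bones test file and is not something you would need in practice:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Extract paragraph text from the bytes of a .docx file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{W_NS}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_minimal_docx(lines) -> bytes:
    """Fabricate a stripped-down archive with only document.xml (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w='
        '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["Customer: Acme Ltd", "Total: 1,599.50"])
    print(docx_to_text(sample))
```

This ignores headers, footers, and layout, and text inside tables comes out one cell paragraph per line, so a dedicated library may still be the better choice for messy documents.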

Btw - It turns out there are some parsers around, especially for medical remittances.
For instance: edi-835-parser · PyPI
And I also found this - which seems a very similar problem + parts of a solution:
119|PYTHON – Remittance Rodeo - Challenges - TechHub.training

Really, the first step in getting better at Python is getting better at googling :)