How to extract invoices from a PDF file?

c-rob · October 4, 2024, 1:22pm

I have Python 3.11 on Windows 10.

I have a PDF file with about 50 invoices in it. However, some invoices are 1 page, some are 2 pages, some are 3 pages, etc. I’d like to extract each invoice and write it to its own PDF file, regardless of how many pages the invoice has. Basically, we are separating each invoice from the big PDF.

I know some python but have never done this before.
I do not have a budget for this, the solution would have to be free.
I do not want the raw data. I would like to split up the PDF file itself. The process must extract the PDF pages themselves and write them to a new PDF file. That is, I cannot extract just the data and then write and format a PDF from that raw data.

Can this be done with Python or some other tool?

Thank you.

abessman · October 4, 2024, 1:34pm

It should be possible to:

1. Parse the PDF with pypdf
2. Detect each page where an invoice starts using some heuristic. This heuristic will depend entirely on what your invoices look like.
3. Split the PDF at those pages

Step 2 is of course going to be the tricky part. If all the invoices are in the same format, the heuristic could be fairly simple, like looking for a specific keyword. On the other hand, if the invoices look widely different from each other, it may be difficult to automate the detection with anything less than a language model trained specifically for that purpose.
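
As a rough sketch of those three steps, assuming the simple case where some fixed text appears only on the first page of each invoice: the `marker` string, the function names, and the output file naming below are all made up for illustration, and the right marker depends on what your invoices actually contain. Any pages before the first detected start are dropped.

```python
from pathlib import Path


def find_invoice_starts(page_texts, marker):
    """Return 0-based indexes of pages whose text contains the marker."""
    return [i for i, text in enumerate(page_texts) if marker in text]


def page_ranges(starts, total_pages):
    """Turn start indexes into (start, end) pairs, with end exclusive."""
    ends = starts[1:] + [total_pages]
    return list(zip(starts, ends))


def split_invoices(src, out_dir, marker):
    """Write one PDF per invoice; `marker` is assumed to appear only on first pages."""
    # Imported here so the two helpers above can be tried without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    # Step 1: parse the PDF and pull the text out of every page.
    texts = [page.extract_text() or "" for page in reader.pages]
    # Step 2: the heuristic. Here it is just a substring check.
    starts = find_invoice_starts(texts, marker)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: copy each run of pages into its own file.
    for n, (start, end) in enumerate(page_ranges(starts, len(texts)), start=1):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        writer.write(str(out_dir / f"invoice_{n:03d}.pdf"))
```

You would call it with something like `split_invoices("all_invoices.pdf", "out", marker="Page 1 of")`, where both the file names and the marker are placeholders. It is worth printing `texts[0]` first to see what pypdf actually extracts from a page before settling on a marker.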

c-rob · October 4, 2024, 4:19pm

Yes, all the invoices have the same format, with the same info in the same place on the page. They were all produced by the same software. Only data like the invoice number, dates, number of line items, and prices change.
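
With a fixed layout like that, one possible heuristic is to read the invoice number from each page's extracted text and start a new invoice whenever the number changes. The sketch below is only an illustration: the regular expression assumes the number is printed after a label such as "Invoice No:", which is a guess about the layout, and `group_pages_by_invoice` is a hypothetical helper name. Pages with no recognizable number are treated as a continuation of the previous invoice.

```python
import re

# Assumed label format; adjust to whatever the invoices really print.
INVOICE_NO = re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\S+)", re.IGNORECASE)


def group_pages_by_invoice(page_texts, pattern=INVOICE_NO):
    """Group consecutive page indexes that share the same invoice number."""
    groups = []  # list of (invoice_number, [page indexes])
    for index, text in enumerate(page_texts):
        match = pattern.search(text)
        number = match.group(1) if match else None
        # Same number as the previous page, or no number found: same invoice.
        if groups and (number is None or number == groups[-1][0]):
            groups[-1][1].append(index)
        else:
            groups.append((number, [index]))
    return groups
```

The page texts would come from pypdf's `page.extract_text()`, and each group's page indexes can then be written out with a `PdfWriter`. A side benefit is that the invoice number is available for naming the output files.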