How to extract data from a PDF I didn't make?

  1. I’m using Python 3.12 on Windows 10 Pro. I’m still somewhat new to Python, but I’ve been practicing by writing various programs that do different things.
  2. I didn’t make the PDF; it’s an 800-page bill from a vendor (Fedex). I don’t know what’s in it, but it’s fairly tabular in structure.
  3. I can’t post an example because it has private data in it.
  4. I will start by extracting data from the first 3 pages. That will be my test case.

I’ve found these options that I will be researching:

  1. Camelot: This library excels at extracting tabular data from PDFs. It identifies tables and extracts them into a structured format like a Pandas DataFrame (see the rough sketch just after this list).
  2. PDFQuery: This library allows you to extract data using CSS-like selectors to target specific elements within the PDF’s structure. It’s useful for PDFs with a consistent layout.
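
Something like this rough, untested sketch could be a starting point (assuming camelot-py is installed and the bill is a text-based PDF rather than a scan; “invoice.pdf” and the output name are just placeholders):

```python
import camelot

# "stream" infers tables from whitespace alignment; "lattice" needs ruled lines
# (and Ghostscript), so it may be worth trying both on the first three pages.
tables = camelot.read_pdf("invoice.pdf", pages="1-3", flavor="stream")

print(f"Found {tables.n} table(s) on pages 1-3")
if tables.n:
    print(tables[0].parsing_report)             # accuracy/whitespace stats as a sanity check
    print(tables[0].df.head())                  # each table is exposed as a pandas DataFrame
    tables.export("fedex_tables.csv", f="csv")  # writes one CSV per detected table
```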

But does anyone have any experience doing this? Do you have any tips or caveats?

Thank you! You’ve all been very helpful in my learning experience.

The PDF format is fairly complex. You can refer to the PDF Extract API How-to to get a sense of its structure.

I haven’t extensively used any PDF Python library since I usually refer directly to specifications and implement algorithms of interest.

PDF is completely and utterly not designed for this purpose. The most important piece of advice is to prepare to use very complex libraries and often get wrong or inaccurate results anyway. It’s basically the same sort of task as OCR, or attempts to solve CAPTCHAs automatically. In theory the internal structure of the document should make it a little easier than those other tasks, but this is often a false hope. Two characters that visually appear next to each other in the PDF could well have been represented as separate strings, translated into place. Or something that looks like it should be a Unicode superscript character might be an ordinary letter shrunk and shifted into position. Or any given supposed character might actually be a bitmap image. Or the document could be embedding its own pseudo-font using vector graphics primitives. Etc.
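
For instance, a rough sketch with pdfminer.six (my own pick here, with “invoice.pdf” as a placeholder) prints every character on page 1 with its own bounding box; what reads as one word on screen is really a pile of independently positioned glyphs:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

# Walk the layout tree of page 1: text boxes -> lines -> individual characters.
for page_layout in extract_pages("invoice.pdf", page_numbers=[0]):
    for box in page_layout:
        if not isinstance(box, LTTextBox):
            continue
        for line in box:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    x0, y0, x1, y1 = obj.bbox
                    print(f"{obj.get_text()!r} at x={x0:.1f}, y={y0:.1f}")
```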

Horrendous though the PDF spec is, there is structure there in most
documents. I imagine the libraries Chuck cited may work quite well for
documents not deliberately obfuscated.

Text is normally text (so no OCR needed) unless the document is just a
scan of some piece of paper.

Recognising tables might require some “are these things nicely aligned”
logic, but the text in the cells will probably be string objects.
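
A rough sketch of that “nicely aligned” logic, in plain Python with
made-up word boxes standing in for whatever your extraction library
hands back:

```python
from collections import defaultdict

def group_into_rows(words, y_tolerance=3.0):
    """Bucket (x, y, text) tuples into rows whose baselines roughly line up."""
    rows = defaultdict(list)
    for x, y, text in words:
        # Snap y to a bucket so small jitter still lands in the same row.
        rows[round(y / y_tolerance)].append((x, text))
    # Rows top-to-bottom (PDF y grows upward), words left-to-right within a row.
    return [[text for _, text in sorted(row)]
            for _, row in sorted(rows.items(), reverse=True)]

# Made-up coordinates purely for illustration.
words = [
    (72.0, 700.2, "Tracking"), (200.0, 700.0, "Service"), (320.0, 699.8, "Charge"),
    (72.0, 685.1, "123456789012"), (200.0, 685.0, "Ground"), (320.0, 684.9, "12.34"),
]
for row in group_into_rows(words):
    print(row)
```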

So I remain hopeful for Chuck, parsing a special purpose document like
his Fedex example.

PDF is completely and utterly not designed for this purpose.

Karl is correct. I’ve worked with extracting text (from PDF books) for about 4 years. The books I worked with are a bunch of text paragraphs with no embedded images or tables, and I would consider that the easiest case of extracting data from a PDF. Sometimes the PDF had scanned images of each page (common for Google books and archive.org and for older books in general), which meant OCR had to be used on each page. Sometimes the PDF was created another way, such as from a word processor, and the text is in there; in that case the extraction is a bit more accurate.

However there is more. The internal structure and order of the text in a PDF can be random, as we found out when we had Acrobat read the PDF. Having the computer read the PDF is required by US state standardized testing for people with disabilities, so we (and I specifically) had to go into PDFs and reorder all the text nodes into the correct order. (The nodes looked like XML to me.) It was very time-consuming and expensive, and I think some states just stopped paying for it.

PDF is an endpoint, and it’s very difficult to extract data from it accurately, which is why I was asking for ideas. And I don’t have another choice. I’m stuck with a PDF.

However there is more. The internal structure and order of the text
in a PDF can be random, as we found out when we had Acrobat read the
PDF. Having the computer read the PDF is required by US state
standardized testing for people with disabilities, so we (and I
specifically) had to go into PDFs and reorder all the text nodes into
the correct order.

The objects (strings, images etc) in a PDF document can be in an
arbitrary order. I believe this is an artifact of authoring systems
where later versions of a document are made by adding replacement
objects (the new or modified pages etc) to the end of the file, and
updating the index to refer to these new objects. (The new index also
starts at the end of the file. Yay.) In this way you can make a new
revision of the document entirely by appending to it.
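
A crude way to spot that history, assuming nothing more than the file
itself: every appended revision ends with its own %%EOF marker, so
counting them is a quick (if rough) heuristic.

```python
from pathlib import Path

raw = Path("invoice.pdf").read_bytes()
# More than one "%%EOF" usually means the file grew by incremental updates
# rather than being rewritten from scratch (treat it as a hint, not proof).
print("%%EOF markers:", raw.count(b"%%EOF"))
```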

These days files are mostly rewritten from scratch, so this “append new
data to the end and then write a new index” isn’t something you
encounter in most other formats.

Reading the text in a PDF in order is a matter of:

  • reading the index at the end to locate all the objects
  • reading the page index (it’s one of the objects)
  • rendering each page in numeric order

There’s a heap of mechanical stuff under those broad headings.
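
In practice a library does all of that for you. A rough sketch with
pypdf (my own choice here, not something anyone above has vouched for),
with “invoice.pdf” as a placeholder:

```python
from pypdf import PdfReader

reader = PdfReader("invoice.pdf")      # parses the trailer/xref and the page tree
for number, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""   # pulls whatever text the page's content stream draws
    print(f"--- page {number} ---")
    print(text[:200])                  # peek at the start of each page
```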

For added fun, each page is effectively a little programme which is a
list of instructions to draw objects on the page.
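
You can peek at that little programme directly; a rough sketch,
assuming pikepdf is available (again my own pick), dumping the first
few instructions of page 1:

```python
import pikepdf

with pikepdf.open("invoice.pdf") as pdf:
    # Each instruction pairs a list of operands with an operator:
    # Tj/TJ draw text, Td/Tm position the text cursor, cm transforms, etc.
    for operands, operator in pikepdf.parse_content_stream(pdf.pages[0])[:25]:
        print(operator, operands)
```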

Because of all these indices you can do random access to a PDF (once
you’ve accessed the indices), so you don’t have to render pages in
order.

(The nodes looked like XML to me.)

The object specifications are not XML. You’re probably seeing things
like /Contents, whereas actual XML and HTML are littered with < and >
delimiters.

PDF is an endpoint, and it’s very difficult to extract data from it
accurately, which is why I was asking for ideas. And I don’t have
another choice. I’m stuck with a PDF.

So Fedex can’t send you billing information in a better structured
format? Alas.

I’d try the libraries you mentioned, particularly the one which you say
is good at extracting tabular data, since it sounds like that’s what
you’ve got.

I have never used any of those libraries, so I don’t know how well they
will work for your purpose.

In principle, I’d expect the Fedex thing to have been written by a
programme and to have a regular order. In theory, if you’ve got
something to parse the PDF structure you might recognise the regular
content. Probably quite a lot of work.

There are several PDF parsing packages on PyPI. I don’t know how good
any of them are, though; lots seem to be for making PDFs rather than
decoding them.

Not right now they can’t. Their PDF bill does not match the Excel file we can pull from them even if we use the same date range for both. That’s our problem. Someone else at my employer is working on getting a more accurate spreadsheet from Fedex that will actually have the same charges as their PDF invoice.

There are reasons why we have to verify every charge from Fedex which I cannot get into. It’s a big mess. We have verified we are doing all processes correctly on our end. The problem is on the Fedex side.

Just wondering: are there charges on the PDF which never show up in
any Excel file? Or is it more an alignment problem (getting an Excel
file with the exact same list as in the PDF)?

I don’t suppose things are as simple as identifying charges with
particular package numbers? Sounds too easy.

I’m writing this software to help another person do their job. So I’m not actually checking the ship charges myself. I may have gotten some details wrong in my earlier messages.

3 years ago we could download a spreadsheet from the Fedex website with all the information we wanted. Then they changed their software on the website and we could not get any Excel file at all. Then they updated their software and it had all the info we needed except the tracking number. The tracking number is the link between their spreadsheet and our database, so at that point the spreadsheet was useless.

So we had to check shipping charges using another method, which was the PDF bill that is human-readable.

Now I get an update from my coworker and our Fedex customer rep may be pushing to update their software so we can get the tracking number and all the other data that we need into the spreadsheet.

It’s a real mess at Fedex and has been for 5+ years.

So using the PDF bill may be a moot point here.


Could this make your life easier: bash - How to find and replace text in a existing PDF file with PDFTK (or other command line application) - Stack Overflow?

I don’t think so. I’ve used PDFTK before and it seems this is just a CLI version of the Acrobat find and replace feature. Thank you though.