However, there is more. The internal structure and order of the text
in a PDF can be random, as we found out when we had Acrobat read the
PDF. Having the computer read the PDF is required by US state
standardized testing for people with disabilities, so we (and I
specifically) had to go into PDFs and reorder all the text nodes to be
in the correct order.
The objects (strings, images etc) in a PDF document can be in an
arbitrary order. I believe this is an artifact of authoring systems
where later versions of a document are made by adding replacement
objects (the new or modified pages etc) to the end of the file, and
updating the index to refer to these new objects. (The new index also
starts at the end of the file. Yay.) In this way you can make a new
revision of the document entirely by appending to it.
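To make the append-only idea concrete, here is a rough sketch (standard library only) that lists the index offset recorded by each revision's trailer. The sample bytes and the `revision_indexes` name are made up for illustration, and scanning the whole file like this is only a heuristic: a real reader starts from the last `startxref` and follows the `/Prev` links, since the marker text could also turn up inside a stream.

```python
import re

def revision_indexes(data: bytes) -> list:
    """Return the index offset recorded by each 'startxref ... %%EOF' trailer, oldest first."""
    return [int(m) for m in re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)]

# Made-up file: an original revision plus one appended update. The offsets are fake.
sample = (
    b"%PDF-1.4\n...original objects...\nstartxref\n120\n%%EOF\n"
    b"...replacement objects...\nstartxref\n480\n%%EOF\n"
)

offsets = revision_indexes(sample)
# The last entry belongs to the newest revision, which is the one a reader starts from.
print(len(offsets), "revision(s); newest index at byte", offsets[-1])
```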
These days files are mostly rewritten from scratch, so this “append new
data to the end and then write a new index” isn’t something you
encounter in most other formats.
Reading the text in a PDF in order is a matter of:
- reading the index at the end to locate all the objects
- reading the page index (it’s one of the objects)
- rendering each page in numeric order
There’s a heap of mechanical stuff under those broad headings.
For added fun, each page is effectively a little programme which is a
list of instructions to draw objects on the page.
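As a toy illustration of why the text order can come out scrambled, the sketch below pulls text out of a hand-written, uncompressed page programme. `Td` moves the text position and `Tj` draws a string; the page origin is at the bottom left, so a larger y is higher up the page. Everything here (the `STREAM` contents, the helper names) is invented for the example, and it ignores compression, escapes, `TJ` arrays, text matrices and fonts, all of which a real page will have.

```python
import re

# Hand-written page programme: the "Total" line is drawn first although it sits lower.
STREAM = b"""
BT /F1 12 Tf 72 700 Td (Total: 12.50) Tj ET
BT /F1 12 Tf 72 720 Td (Invoice 42) Tj ET
"""

TEXT_BLOCK = re.compile(rb"BT(.*?)ET", re.S)              # one BT ... ET text object
MOVE = re.compile(rb"(-?[\d.]+)\s+(-?[\d.]+)\s+Td")       # x y Td
SHOW = re.compile(rb"\((.*?)\)\s*Tj", re.S)               # (string) Tj

def text_runs(stream: bytes) -> list:
    """Return (x, y, text) tuples in the order the instructions appear."""
    runs = []
    for block in TEXT_BLOCK.finditer(stream):
        body = block.group(1)
        move, show = MOVE.search(body), SHOW.search(body)
        if move and show:
            x, y = float(move.group(1)), float(move.group(2))
            runs.append((x, y, show.group(1).decode("latin-1")))
    return runs

def reading_order(runs: list) -> list:
    """Sort top to bottom (larger y first), then left to right."""
    return sorted(runs, key=lambda run: (-run[1], run[0]))

runs = text_runs(STREAM)
print([text for _, _, text in runs])                 # order in the file
print([text for _, _, text in reading_order(runs)])  # order on the page
```

Sorting by position is roughly what text extraction tools have to do to recover a sensible reading order.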
Because of all these indices you can do random access to a PDF (once
you’ve accessed the indices), so you don’t have to render pages in
order.
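The random access part looks roughly like this. The sketch builds a tiny PDF-like byte string with a classic index table, reads the table from the offset given at the end of the file, then jumps straight to one object. The helper names are mine, and it assumes a single-section, uncompressed table; many modern files store the index as a compressed stream instead, which this does not handle.

```python
import re

def build_sample() -> bytes:
    """Assemble a tiny PDF-like byte string with a classic xref table (illustrative only)."""
    objects = [b"1 0 obj\n(first)\nendobj\n", b"2 0 obj\n(second)\nendobj\n"]
    out = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # byte offset where this object starts
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size 3 >>\nstartxref\n%d\n%%%%EOF\n" % xref_pos
    return out

def read_xref(data: bytes) -> dict:
    """Map object number -> byte offset, using the index named at the end of the file."""
    found = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if found is None:
        raise ValueError("no startxref at end of file")
    lines = data[int(found.group(1)):].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table")
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":  # "n" means in use, "f" means free
            table[first + i] = int(offset)
    return table

def read_object(data: bytes, table: dict, number: int) -> bytes:
    """Jump directly to one object without scanning the ones before it."""
    start = table[number]
    end = data.index(b"endobj", start) + len(b"endobj")
    return data[start:end]

data = build_sample()
table = read_xref(data)
print(read_object(data, table, 2))
```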
(The nodes looked like XML to me.)
The object specifications are not XML. You’re probably seeing things like
/Contents, but XML and HTML etc are littered with < and >
delimiters.
PDF is an endpoint, and it’s very difficult to extract data from it
accurately, which is why I was asking for ideas. And I don’t have
another choice. I’m stuck with a PDF.
So Fedex can’t send you billing information in a better structured
format? Alas.
I’d try the libraries you mentioned, particularly the one which you say
is good at extracting tabular data, since it sounds like that’s what
you’ve got.
I have never used any of those libraries, so I don’t know how well they
will work for your purpose.
In principle, I’d expect the Fedex thing to have been written by a
programme and to have a regular order. In theory, if you’ve got
something to parse the PDF structure you might recognise the regular
content. Probably quite a lot of work.
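If the layout really is regular, the recognising step might be as simple as a pattern match over the extracted lines. This is only a sketch: the row layout (12-digit tracking number, ISO date, dollar amount), the sample lines and the `parse_rows` name are all invented for illustration and are not the real Fedex format, which you would have to work out from your own files.

```python
import re

# Hypothetical row layout: tracking number, date, amount. Not the real invoice format.
ROW = re.compile(
    r"^(?P<tracking>\d{12})\s+(?P<date>\d{4}-\d{2}-\d{2})\s+\$(?P<amount>\d+\.\d{2})$"
)

def parse_rows(lines: list) -> list:
    """Keep only the lines that match the expected row shape; skip headers and footers."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "tracking": match["tracking"],
                "date": match["date"],
                "amount": float(match["amount"]),
            })
    return rows

# Invented lines standing in for whatever the PDF text extraction produces.
sample_lines = [
    "Invoice summary",
    "123456789012  2023-01-15  $12.50",
    "Page 1 of 3",
    "210987654321  2023-01-16  $8.00",
]

for row in parse_rows(sample_lines):
    print(row)
```

Lines that do not fit the pattern are silently dropped here; for billing data you would probably want to log them instead, so that a layout change does not go unnoticed.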
There are several PDF parsing things in PyPI:
I don’t know how good any of them are though. Lots seem to be for
making PDFs rather than decoding them.