I am using the pypdf package to extract text from PDF files obtained from an external source. The examples given for pypdf have provided me with the necessary framework for extraction; however, there are some subtlies in the PDF that are not reflected in the extracted text. This leads me to ask: where is the best place to discuss how to deal with these subtlies and get a better extraction?
What is “subtlies”?
It’s a typo for things that are subtle.
@glimfeather - logically I would usually go look at the source repo and from there you’ll see there is a section Q&A which points to discussions here:
So seems worth giving that a try?
Have you tried using another library? Does the problem persist?
Out of curiosity I tried a program that relies on a Java equivalent (Apache’s Tika) to extract text from PDF files, and it gives a much better approximation to the underlying text. The text can be recovered in the way that I want, but for workflow reasons I would prefer to use Python (and pypdf seems to be the only actively developed and undeprecated package).
And in answer to your other question a typo for subtles; I blame my dylsexia for that one.
Thanks for that pointer.
However, embedded deep in those GitHub discussions is a pointer to using Stack Overflow to obtain support. I so love indirect addressing.
There isn’t a lot you can do about this. PDF is not designed to store structured data in a way that other programs can use. It’s designed to make things that look nice when printed. As far as the PDF file is concerned, text characters are just another kind of graphical element.
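To make the "text is just positioned graphics" point concrete, here is a rough sketch of how you might rebuild lines yourself from the positions pypdf reports. It uses the `visitor_text` callback of `page.extract_text()`. The helper names `fragments_to_lines` and `collect_fragments` and the `y_tolerance` value are my own inventions for illustration, not part of pypdf. It also assumes the text matrix alone gives a usable position (it ignores the `cm` transformation), which will not hold for every PDF.

```python
from collections import defaultdict


def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    Hypothetical helper: y_tolerance is an assumed bucket size in PDF units.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        if text.strip():
            # Snap y to a bucket so fragments on nearly the same baseline
            # end up in the same row.
            rows[round(y / y_tolerance)].append((x, text))
    lines = []
    # The PDF y axis grows upward, so the largest y is the top of the page.
    for key in sorted(rows, reverse=True):
        lines.append(" ".join(t.strip() for _, t in sorted(rows[key])))
    return lines


def collect_fragments(pdf_path, page_number=0):
    """Collect (x, y, text) fragments from one page using pypdf's visitor hook."""
    from pypdf import PdfReader  # third-party; imported here so the helper above works without it

    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; tm[4] and tm[5] are its x and y translation.
        # Ignoring cm is a simplification (assumption).
        fragments.append((tm[4], tm[5], text))

    page = PdfReader(pdf_path).pages[page_number]
    page.extract_text(visitor_text=visitor)
    return fragments
```

Something like `fragments_to_lines(collect_fragments("some.pdf"))` (the file name is a placeholder) would then give you the lines in reading order for simple single-column pages. Recent pypdf versions also accept `extraction_mode="layout"` in `extract_text()`, which may be worth trying before rolling your own.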
As I cannot see pypdf in the Python Module Index, I would discuss it on its GitHub and, if that is not available, then here or in any other general Python discussion venue such as FB groups.