Where to discuss using particular packages to best advantage?

glimfeather · January 4, 2024, 12:52pm

I am using the pypdf package to extract text from PDF files obtained from an external source. The examples given for pypdf have provided me the necessary framework for extraction; however, there are some subtlies in the PDF that are not reflected in the extracted text. This leads me to ask: where is the best place to discuss how to deal with these subtlies and get a better extraction?

elis.byberi · January 4, 2024, 1:29pm

What is “subtlies”?

nmstoker · January 4, 2024, 1:31pm

It’s a typo for things that are subtle.

@glimfeather - logically I would usually go look at the source repo and from there you’ll see there is a section Q&A which points to discussions here:

So seems worth giving that a try?

elis.byberi · January 4, 2024, 1:38pm

Have you tried using another library? Does the problem persist?

glimfeather · January 4, 2024, 2:48pm

Out of curiosity, I tried using a program that relies on a Java equivalent (Apache’s Tika) to extract text from PDF files, and it gives a much better approximation to the underlying text. The text can be recovered in the way that I want, but for workflow reasons I would prefer to use Python (and pypdf seems to be the only actively developed and undeprecated package).

And in answer to your other question a typo for subtles; I blame my dylsexia for that one.

glimfeather · January 4, 2024, 2:56pm

Thanks for that pointer.

However, embedded deep in those GitHub discussions is a pointer to using Stack Overflow to obtain support. I so love indirect addressing.

kknechtel · January 4, 2024, 10:40pm

There isn’t a lot you can do about this. PDF is not designed to store structured data in a way that other programs can use. It’s designed to make things that look nice when printed. As far as the PDF file is concerned, text characters are just another kind of graphical element.
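
To illustrate the point above, here is a minimal sketch of what an extractor has to do when all it gets is text fragments placed at page coordinates. It is not how pypdf or Tika work internally; the `Glyph` class, the `rebuild_lines` function and the `line_tolerance` value are all hypothetical names and numbers chosen for the example. It only assumes that PDF page coordinates have y increasing upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A hypothetical text fragment as a PDF might place it: a position plus a string."""
    x: float
    y: float
    text: str


def rebuild_lines(glyphs, line_tolerance=2.0):
    """Guess the reading order: group fragments into lines by y, then sort each line by x."""
    lines = []  # each entry is [reference_y, [fragments on that line]]
    # Higher y is nearer the top of the page, so visit fragments top to bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)  # close enough vertically: same line
        else:
            lines.append([g.y, [g]])  # start a new line
    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right within a line
        result.append(" ".join(g.text for g in row))
    return result


# Fragments stored in an order unrelated to the reading order.
sample = [
    Glyph(120, 700, "world"),
    Glyph(72, 700, "Hello"),
    Glyph(72, 686, "second"),
    Glyph(110, 686.5, "line"),
]
print(rebuild_lines(sample))  # ['Hello world', 'second line']
```

Different tools make different guesses about tolerances, spacing and column layout, which is one plausible reason two extractors can give different text for the same file.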

Juandev · January 5, 2024, 10:11am

As I cannot see pypdf in the Python Module Index, I would discuss it on its GitHub and, if that is not available, then here or in any other general Python discussion venue such as FB groups.
