Convert PDF into TXT

olivalej · April 12, 2023, 1:46pm

Good day community,

I’m trying to write some code to convert a PDF to text, but the result is not what I expected. I have tried different libraries such as pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. The last two scripts I used are these:

CODE 1
import pytesseract
from pdf2image import convert_from_path

# Configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
# Note: pytesseract does not appear to read a tessdata_dir_config attribute, so this line likely has no effect.
# To use a custom tessdata directory, pass --tessdata-dir "<path>" in the config string instead.
pytesseract.pytesseract.tessdata_dir_config = "/usr/share/tesseract-ocr/4.00/tessdata"

# Path to the PDF file (change it to the actual location of your file)
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"

# Convert the PDF to high-quality images
images = convert_from_path(pdf_path, dpi=300, fmt="PNG", thread_count=4)

# Extract text from the images
texts = [pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 11") for img in images]

# Print the extracted text
for i, text in enumerate(texts):
    print(f"Text of page {i + 1}:\n{text}\n")

CODE 2
from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    text = extract_text(path)
    return text

# Change the path to match the location of your PDF file
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"

# Convert the PDF to text
texto = convert_pdf_to_txt(pdf_path)

# Print the text to the console
print(texto)

However, when I use online PDF-to-text converters, the conversion comes out very well, almost perfect, without the errors that I encounter with both scripts. I attach the PDF that I want to convert and the results that I get from both scripts.

rob42 · April 12, 2023, 2:31pm

I’ve done some Python coding with PDF files, but not so much with the text side of things – I needed a way to get photo images (selectively) from multiple PDF files and write them to a new PDF file, which I’ve done.

While I was working on that project, I did discover a way to extract ‘text’ and I have still got some code that does that.

It’s far from a working solution, but if you can use it and develop it to do what you need to do, then you’re welcome to it:

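import PyPDF2

filename = "lab.pdf"

# PdfFileReader, getNumPages, getPage and extractText were removed in PyPDF2 3.0;
# the names used below are their replacements.
with open(filename, 'rb') as pdfObj:
    reader = PyPDF2.PdfReader(
        pdfObj,
        strict=True,
    )

    print(len(reader.pages))
    print(reader.pages[0].extract_text())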

To be clear: the above code is simply a working demo and I don’t even know if what I have here is the correct way to tackle the task – I did not need this feature for my project and as such I did not take it any further than this rough bit of dev code.
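For anyone who wants to take the demo a step further, here is a minimal sketch that reads every page rather than only the first and writes the result to a text file. It assumes PyPDF2 3.x (`PdfReader`, `reader.pages`, `extract_text()`); the function names `format_pages` and `pdf_to_txt` are made up for illustration, and it has not been tried on the PDF discussed in this thread.

```python
from pathlib import Path


def format_pages(page_texts):
    """Join per-page strings into one document with a header per page."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Pages without a text layer (e.g. scanned images) may yield nothing
        parts.append(f"--- Page {number} ---\n{(text or '').strip()}\n")
    return "\n".join(parts)


def pdf_to_txt(pdf_path, txt_path):
    """Extract the text layer of every page and save it as UTF-8."""
    # Imported here so format_pages() can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        page_texts = [page.extract_text() for page in reader.pages]
    Path(txt_path).write_text(format_pages(page_texts), encoding="utf-8")
    return len(page_texts)
```

Calling `pdf_to_txt("lab.pdf", "lab.txt")` would write one section per page and return the page count. If most sections come out empty, the PDF probably has no text layer and would need OCR instead.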

Link: PyPDF2 · PyPI

olivalej · April 12, 2023, 3:01pm

Hello Bob,

Thank you for your response. I tried running the code you provided, but the results are the same as before. Would it be possible for me to share my results and the PDF file I am trying to convert with you, so that you can review the issue?

rob42 · April 12, 2023, 3:15pm

You are welcome and by all means link-up your PDF file, so long as it does not contain any sensitive information.

I don’t know if I’ll be able to add anything more to what I have already added, but maybe, if I can’t, someone else can.

olivalej · April 12, 2023, 4:17pm

Could you share your email with me so I can forward my results to you?

rob42 · April 12, 2023, 4:26pm

I’ll PM you.

olivalej · April 12, 2023, 4:27pm

Thank you!!

rob42 · April 12, 2023, 6:55pm

Bringing this back to this thread…

It could be that the on-line converters are using OCR rather than pulling the PDF file apart.

Trying to code this kind of app without a deep understanding of PDF files is going to be ‘a challenge’.

Have a read of this:

… to get a better understanding of what is involved.
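One way to act on that idea is to try the embedded text layer first and fall back to OCR only for pages where it is missing or nearly empty. The sketch below is an assumption about how that could be wired up, not a description of what the online converters actually do: `has_usable_text`, `extract_page_texts`, and the 20-character threshold are hypothetical, and it assumes PyPDF2 3.x, pdf2image, and pytesseract are installed.

```python
def has_usable_text(text, min_chars=20):
    """Heuristic: a page has a usable text layer if it yields at least
    min_chars non-whitespace characters."""
    if not text:
        return False
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def extract_page_texts(pdf_path, dpi=300, lang="eng"):
    """Return one string per page, using OCR only where needed."""
    # Third-party imports are kept local so the heuristic is usable alone
    from PyPDF2 import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    results = []
    reader = PdfReader(pdf_path)
    for index, page in enumerate(reader.pages):
        text = page.extract_text()
        if not has_usable_text(text):
            # Render just this page (pdf2image page numbers are 1-based)
            images = convert_from_path(
                pdf_path, dpi=dpi, first_page=index + 1, last_page=index + 1
            )
            text = pytesseract.image_to_string(images[0], lang=lang)
        results.append(text)
    return results
```

The threshold is deliberately crude. A page whose text layer exists but is garbled (for example because of a missing font mapping) would still pass it, so the output is worth checking by eye.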

olivalej · April 12, 2023, 10:03pm

Thank you, Rob. The provided solution doesn’t work as it should. I think this code can be improved: it almost works properly, but it needs some changes to give a more accurate result.

import pytesseract
from pdf2image import convert_from_path

# Configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
# Note: pytesseract does not appear to read a tessdata_dir_config attribute, so this line likely has no effect.
# To use a custom tessdata directory, pass --tessdata-dir "<path>" in the config string instead.
pytesseract.pytesseract.tessdata_dir_config = "/usr/share/tesseract-ocr/4.00/tessdata"

# Path to the PDF file
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"

# Convert the PDF to high-quality images
images = convert_from_path(pdf_path, dpi=300, fmt="PNG", thread_count=4)

# Extract text from the images
texts = [pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 11") for img in images]

# Save the extracted text to a text file
with open("texto_extraido2.txt", "w") as f:
    for i, text in enumerate(texts):
        f.write(f"Text of page {i + 1}:\n{text}\n\n")
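One setting worth revisiting in this script is the page segmentation mode. `--psm 11` is Tesseract's sparse-text mode, which looks for text fragments in no particular order, so running text can come back jumbled. `--psm 3` (fully automatic, the default) and `--psm 6` (a single uniform block of text) are the usual choices for ordinary pages. The sketch below makes the mode easy to switch; `build_config` and `ocr_pdf` are hypothetical helper names, and whether a different mode helps with this particular PDF is untested.

```python
def build_config(psm=3, oem=1):
    """Build a Tesseract option string; note the double hyphens."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


def ocr_pdf(pdf_path, txt_path, psm=3, dpi=300, lang="eng"):
    """OCR every page of pdf_path and write the text as UTF-8."""
    # Imported locally so build_config() works without the OCR stack
    import pytesseract
    from pdf2image import convert_from_path

    config = build_config(psm=psm)
    images = convert_from_path(pdf_path, dpi=dpi)
    with open(txt_path, "w", encoding="utf-8") as f:
        for number, image in enumerate(images, start=1):
            text = pytesseract.image_to_string(image, lang=lang, config=config)
            f.write(f"Page {number}:\n{text}\n\n")
    return len(images)
```

Running `ocr_pdf` once with `psm=3` and once with `psm=6` and comparing the two output files against the online converter's result should show fairly quickly whether segmentation is the main source of the errors.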