Good day community,
I’m trying to write some code to convert a PDF to text, but the result is not what I expected. I have tried different libraries such as pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. The last two scripts that I used are these:
CODE 1
import pytesseract
from pdf2image import convert_from_path
# Configure pytesseract
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"
pytesseract.pytesseract.tessdata_dir_config = '/usr/share/tesseract-ocr/4.00/tessdata'
# Path to the PDF file
pdf_path = "/content/drive/MyDrive/PDF/file.pdf"  # Make sure to replace 'file.pdf' with the real name of your file
# Convert the PDF to high-quality images
images = convert_from_path(pdf_path, dpi=300, fmt="PNG", thread_count=4)
# Extract text from the images
texts = [pytesseract.image_to_string(img, lang="eng", config="--oem 1 --psm 11") for img in images]
# Print the extracted text
for i, text in enumerate(texts):
    print(f"Text from page {i + 1}:\n{text}\n")
CODE 2
from pdfminer.high_level import extract_text
def convert_pdf_to_txt(path):
    # Extract the embedded text layer of the PDF
    text = extract_text(path)
    return text
# Change the file path to match the location of your PDF file
pdf_path = '/content/drive/MyDrive/PDF/file.pdf'
# Convert the PDF to text
texto = convert_pdf_to_txt(pdf_path)
# Print the text to the console
print(texto)
However, when I use online PDF to text converters, the conversion comes out very well, almost perfect, without the errors that I encounter in both codes. Here I attach the PDF that I want to convert to text and the results that I get from both codes when I try to convert my file.
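To put a number on "incomplete or with errors", a rough comparison of each script's output against the online converter's text could look like the sketch below. It only uses the standard library `difflib` module; `normalize`, `similarity`, and `missing_words` are hypothetical helper names chosen for illustration, and the sample strings are made up, not taken from my PDF.

```python
import difflib


def normalize(text):
    """Collapse whitespace runs so layout differences are ignored."""
    return " ".join(text.split())


def similarity(candidate, reference):
    """Return a ratio between 0 and 1 for how closely candidate matches reference."""
    matcher = difflib.SequenceMatcher(
        None, normalize(candidate), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def missing_words(candidate, reference):
    """List the words of reference that never appear in candidate (case-insensitive)."""
    seen = set(normalize(candidate).lower().split())
    return [word for word in normalize(reference).split() if word.lower() not in seen]


if __name__ == "__main__":
    # Made-up sample strings standing in for the real outputs.
    online_text = "Invoice number 12345 issued on 2023-01-15"
    script_text = "Invoice  number 1234S\nissued on 2023-01-15"
    print("similarity:", similarity(script_text, online_text))
    print("missing words:", missing_words(script_text, online_text))
```

In practice, `script_text` would be the string returned by CODE 1 or CODE 2 and `online_text` the text copied from the online converter, which would show which of the two scripts is closer and which words are being lost.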