Question on: Scanned PDF, OCR, metadata, naming and saving in folders

Dear all,
I have some spare time and have decided to start learning Python through practical problems from my daily life.
Whenever I receive paper mail, my first step is to scan it and save a soft copy on my network.
I then run software to OCR the document (with rather poor results :frowning: ), then rename it based on criteria such as the company it came from, the date it was sent, keywords, …

My idea is to automate this process as much as possible:

Python code objectives:

  • Convert the scanned PDF into an OCR'd (searchable) PDF
  • Extract some information: header/footer, keywords
  • Save the searchable PDF under a name built from the header/footer information, and add keywords to the metadata for future searches
  • Nice to have: automatically file the documents into dedicated folders based on the structure of the document name
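
To make the naming and filing objectives concrete, here is a minimal sketch using only the standard library. It assumes the text of the first page is already available as a string. The company list, the DD/MM/YYYY date format, the `<date>_<company>.pdf` naming scheme and all function names are hypothetical choices for illustration, not requirements from the list above.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical sender list; replace with the companies you actually receive mail from.
KNOWN_COMPANIES = ["Acme Insurance", "City Water", "Example Bank"]

# Matches dates written as DD/MM/YYYY or DD.MM.YYYY (an assumed format).
DATE_PATTERN = re.compile(r"\b(\d{2})[./](\d{2})[./](\d{4})\b")


def find_company(text: str) -> str:
    """Return the first known company mentioned in the text, else 'Unknown'."""
    lowered = text.lower()
    for company in KNOWN_COMPANIES:
        if company.lower() in lowered:
            return company
    return "Unknown"


def find_date(text: str) -> str:
    """Return the first valid date as YYYY-MM-DD, else 'undated'."""
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            return datetime(int(year), int(month), int(day)).strftime("%Y-%m-%d")
        except ValueError:
            continue  # skip impossible dates such as 31/02/2024
    return "undated"


def build_name(text: str) -> str:
    """Build a file name such as '2024-03-05_Acme-Insurance.pdf'."""
    company = find_company(text).replace(" ", "-")
    return f"{find_date(text)}_{company}.pdf"


def file_document(pdf_path: Path, text: str, archive_root: Path) -> Path:
    """Move the PDF to archive_root/<company>/<new name> and return the new path."""
    target_dir = archive_root / find_company(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / build_name(text)
    shutil.move(str(pdf_path), str(target))
    return target
```

Calling `file_document(Path("mydocument.pdf"), text, Path("archive"))` would move the file into a per-company subfolder. Two documents from the same sender with the same date would get the same name, so a real version would probably need a counter or an extra keyword in the name.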

I have tried many approaches to get what I am looking for, but since I do not yet have all the skills and knowledge for this, I have only managed to get limited functionality working.

The most promising attempt so far, although with poor OCR results, was this:

# Test: text extraction with pdfplumber - poor results

# This project is meant to OCR documents and automatically rename them based on predefined parameters

# Import libraries

import os
import pdfplumber

# Set and check the working directory for Python (IPython/Jupyter magic command)
`%cd "C:\Users\XX\Documents\Scanned documents\test environment"`

>C:\Users\XX\Documents\Scanned documents\test environment

# Check the content of the directory
print(os.listdir())

OCR_PDF = 'mydocument.pdf'
# Open the PDF and read the text of the first page
with pdfplumber.open(OCR_PDF) as pdf:
    page = pdf.pages[0]
    # Fall back to an empty string if the page has no extractable text
    text = page.extract_text() or ''
lines = text.split('\n')
# Display the extracted lines as a comma-separated list
print(lines)
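
One possible reason for the poor results: `extract_text()` in pdfplumber reads text that is already embedded in the PDF and does not run OCR on page images. A possible alternative, sketched below, is the OCRmyPDF package, which uses Tesseract to add a text layer to a scanned PDF and can also write keywords into the metadata. This assumes `ocrmypdf` and Tesseract are installed. The `_ocr` suffix, the function names and the keyword handling are hypothetical choices for illustration.

```python
from pathlib import Path


def ocr_output_path(scan: Path) -> Path:
    """Return the path for the searchable copy, e.g. scan.pdf -> scan_ocr.pdf."""
    return scan.with_name(f"{scan.stem}_ocr{scan.suffix}")


def make_searchable(scan: Path, keywords: list[str], languages=("eng",)) -> Path:
    """Run OCR on a scanned PDF and store keywords in its metadata."""
    # Imported here so the path helper above works without OCRmyPDF installed.
    import ocrmypdf

    output = ocr_output_path(scan)
    ocrmypdf.ocr(
        scan,
        output,
        language=list(languages),      # Tesseract language codes, e.g. "eng", "fra"
        deskew=True,                   # straighten slightly rotated scans
        skip_text=True,                # leave pages that already contain text untouched
        keywords=", ".join(keywords),  # stored in the PDF's Keywords metadata field
    )
    return output
```

For example, `make_searchable(Path("mydocument.pdf"), ["invoice", "insurance"])` should write `mydocument_ocr.pdf` next to the original. That file can then be opened with pdfplumber as above, and `extract_text()` should return the recognized text. Recognition quality will still depend on scan resolution and on choosing the right language codes.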

Seeing that it was not recognizing the text well enough, I banged my head against the wall once more and decided to ask for help, so here I am :slight_smile:
Could you please help me find a way to meet these requirements?
Thank you for your support,