Question on: Scanned PDF, OCR, metadata, naming and saving in folders

Ronandeblois · March 6, 2021, 2:59pm

Dear all,
I have some spare time and have decided to start learning Python by applying it to practical problems from my daily life.
Whenever I receive paper mail, my first action is to scan it and save a soft copy on my network.
I then run some software to OCR the document (with rather poor results, actually), then rename it based on some criteria, for example the company it originates from, the date it was sent, keywords, …

My idea is to automate this process as much as possible.

Python code objectives:

Transform the scanned PDF into an OCR'd (text-searchable) PDF
Extract some information: header/footer, keywords
Save the result as a searchable PDF, named from the header/footer information, and add some keywords to the metadata for future searches
Nice to have: automatically save the documents to dedicated folders based on the structure of the document name
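
To make the naming and folder objectives more concrete, here is a minimal sketch using only the standard library. The naming scheme (`YYYY-MM-DD_Company_keywords.pdf`), the folder layout (`<root>/<Company>/<year>`), and all function names are assumptions chosen for illustration; the company, date, and keywords are assumed to have already been extracted from the OCR text.

```python
import re
import shutil
from datetime import date
from pathlib import Path


def slugify(text: str) -> str:
    """Reduce text to letters, digits and hyphens so it is safe in a file name."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", text.strip())
    return cleaned.strip("-")


def build_filename(company: str, sent_on: date, keywords: list[str]) -> str:
    """Hypothetical scheme: YYYY-MM-DD_Company_keyword1-keyword2.pdf"""
    parts = [sent_on.isoformat(), slugify(company)]
    kw = "-".join(s for s in (slugify(k) for k in keywords) if s)
    if kw:
        parts.append(kw)
    return "_".join(parts) + ".pdf"


def target_folder(root: Path, filename: str) -> Path:
    """Derive <root>/<Company>/<year> from a name produced by build_filename."""
    sent, company = filename.split("_")[:2]
    return root / company / sent[:4]


def file_document(pdf_path: Path, root: Path, company: str,
                  sent_on: date, keywords: list[str]) -> Path:
    """Move the PDF into its folder under the new name and return the new path."""
    name = build_filename(company, sent_on, keywords)
    folder = target_folder(root, name)
    folder.mkdir(parents=True, exist_ok=True)
    destination = folder / name
    shutil.move(str(pdf_path), str(destination))
    return destination
```

Because the folder is derived from the file name itself, the folder structure automatically follows whatever naming convention is chosen.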

I have tried many ways to get what I was looking for, but since I do not have all the skills and knowledge for this, I have only managed to get limited functions working.

The most promising attempt so far, though with poor OCR results, was this:

#test OCR with pdfplumber - poor results

#This project is to OCR documents and automatically rename them based on predefined parameters

#Importing Libraries

import os
import requests
import pdfplumber

#defining and Checking the working directory for Python
`%cd "C:\Users\XX\Documents\Scanned documents\test environment"`

>C:\Users\XX\Documents\Scanned documents\test environment

#checking the content of the directory

os.listdir()
>mydocument.pdf
OCR_PDF = 'mydocument.pdf'
OCR_PDF
'mydocument.pdf'
with pdfplumber.open(OCR_PDF) as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)
lines = text.split('\n')
lines
#shows the extracted lines as a Python list (one string per line, separated by commas)
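
Once a page's text is available as a string, the header/footer and keyword objective could be approached roughly as follows. This is only a sketch under assumptions: the header and footer are taken to be the first and last few non-empty lines, keywords are matched against a hand-written candidate list, and both function names are hypothetical. It also guards against a page that yields no extractable text at all.

```python
def split_header_footer(text, n_lines=3):
    """Return (header, footer): the first and last n_lines non-empty lines."""
    # Treat a missing or empty extraction result as "no lines".
    lines = [ln.strip() for ln in (text or "").splitlines() if ln.strip()]
    header = lines[:n_lines]
    # Take the footer from what is left so it never overlaps the header.
    footer = lines[n_lines:][-n_lines:]
    return header, footer


def find_keywords(text, candidates):
    """Return the candidate keywords that occur in the text (case-insensitive)."""
    lowered = (text or "").lower()
    return [kw for kw in candidates if kw.lower() in lowered]
```

With the `text` variable from the code above, `split_header_footer(text)` would give the lines most likely to contain the sender and date, and `find_keywords(text, [...])` the keywords to reuse in the file name and metadata.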

Seeing that it was not recognizing the text well enough, I banged my head against the wall again and decided to call for help, so here I am.
Could you please help me find a way to fulfill my requirements?
Thank you for your support,
Ronan