Dear Python community,
I want to convert several PDF files into a CSV file. Each row of the CSV should contain the name of the file and the content of the PDF. For example:
title; content;
john; blabla;
mary; bla bla bla;
…
I found some code online that converts a single PDF into text. Here it is:
from pdfminer3.layout import LAParams, LTTextBox
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import PDFPageAggregator
from pdfminer3.converter import TextConverter
import io
# Set up pdfminer3 so that the extracted text is written to an in-memory buffer
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)
# Process every page of the PDF, then read the accumulated text
with open('C:/mydirectory/myfile1.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()
# Close the converter and the buffer once the text has been read
converter.close()
fake_file_handle.close()
print(text)
I prefer pdfminer3 to PyPDF2 or pdfplumber: I compared the output of all three packages, and pdfminer3 seemed to work best for my kind of text (some of my PDFs have the text laid out in columns).
The problem is that I do not know how to adapt this code to process several PDF files and save the extracted text to a CSV file. I am still learning Python.
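One possible way to approach this, as a minimal sketch rather than a definitive solution: wrap the single-file code in a function, loop over every PDF in a folder, and write one row per file with Python's built-in csv module. The names pdf_to_text, pdfs_to_csv and the extract parameter are hypothetical, chosen only for illustration. The sketch assumes all PDFs sit directly in one folder, that the file name without its .pdf extension is the wanted title, and that a semicolon delimiter matches the layout shown in the example.

```python
import csv
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract the text of one PDF with pdfminer3 (same calls as the single-file code)."""
    # Imported here so that pdfs_to_csv can also be used with another extractor.
    import io
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = io.StringIO()
    converter = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, converter)
    try:
        with open(pdf_path, "rb") as fh:
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                interpreter.process_page(page)
        return buffer.getvalue()
    finally:
        # Always release the converter and the buffer, even if a PDF fails.
        converter.close()
        buffer.close()


def pdfs_to_csv(pdf_dir, csv_path, extract=pdf_to_text):
    """Write one CSV row per PDF in pdf_dir: file name (no extension) and text.

    Returns the number of PDF files that were processed.
    """
    pdf_paths = sorted(Path(pdf_dir).glob("*.pdf"))
    # newline="" lets the csv module handle line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=";")
        writer.writerow(["title", "content"])
        for pdf_path in pdf_paths:
            # Text containing ";" or line breaks is quoted automatically.
            writer.writerow([pdf_path.stem, extract(pdf_path)])
    return len(pdf_paths)
```

With the folder from the code above, a call could look like pdfs_to_csv("C:/mydirectory", "C:/mydirectory/output.csv"), where output.csv is just a placeholder name. Because the extracted text usually contains line breaks, a single CSV row may span several lines in a text editor; a CSV reader that respects quoting should still treat it as one row.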