Right now I have to copy one file into the folder, rename it to input.pdf, and run the script. For each new file I replace input.pdf with the new file and run the program again, and so on…
Now I want to iterate through all the .pdf files in a folder, perform the task specified in a function on each one without renaming the file, and save the output under the same name but with a .xlsx or .csv extension.
from PyPDF2 import PdfReader
import pandas as pd
import tabula

def main(pdf_file):
    name=''
    full_address=''
    contact_no=''
    email_id=''
    # creating a pdf object
    # pdf_obj=PdfReader('input.pdf')
    pdf_obj=PdfReader(pdf_file)
    pdf=pdf_obj.pages[0] # to get the first page of the PDF (getPage is gone in PyPDF2 3.x)
    pdf_text=pdf.extract_text() # now to get the text from the page (extractText is gone in PyPDF2 3.x)
    #print(pdf_text)
    address=pdf_text.split('Contact No.') # splitting the complete text up to the Contact No.
    address=address[0] # to get the first part of the split
    address=address.split('\n') # now splitting the address by new lines
    for line in address: # iterating through each line to pick out the name, if there is one
        if 'MR.' in line or 'MS.' in line or 'MRS.' in line:
            name+=line+'\n'
            continue
        full_address+=line+'\n'
    lines=pdf_text.split('\n') # splitting the pdf text by new lines
    # print(pdf_text)
    for line in lines:
        if 'Contact No.' in line: # taking the line with contact no
            contact_no=line.split('Contact No.:')[-1].strip()
        if 'Email Id' in line: # taking the line with email id
            email_id=line.split('Email Id:')[-1].split()[3]
        # print(line)
    print('**************')
    lines=pdf_text.split('Lien in Favour of')[-1] # splitting the page text and taking the last part
    lines=lines.split('\n') # now taking lines as a list by splitting on newlines
    # for line in lines:
    #     print(line)
    # replacing unnecessary characters
    booking_id=lines[0].replace(':','')
    ref_no=lines[1].replace(':','')
    tower=lines[2].replace(':','')
    appart_no=lines[3].replace(':','')
    floor=lines[4].replace(':','')
    super_area=lines[5].replace(':','')
    # checking if the "Lien in Favour of" value spans multiple lines
    # (this must index the list `lines`, not the leftover loop variable `line`)
    if 'Applicant Ledger' not in lines[7]:
        lien_favour=lines[6].replace(':','')+'\n'+lines[7]
    else:
        lien_favour=lines[6].replace(':','')
    # print(name)
    # print(full_address)
    # print(contact_no)
    # print(email_id)
    # print(booking_id)
    # print(ref_no)
    # print(tower)
    # print(appart_no)
    # print(floor)
    # print(super_area)
    # print(lien_favour)
    # storing the data in dictionary format
    data={
        'Name':[name],
        'Full_Address':[full_address],
        'Contact_No':[contact_no],
        'Email_Id':[email_id],
        'Booking_Id':[booking_id],
        'Ref_No':[ref_no],
        'Tower':[tower],
        'Apartment_No.':[appart_no],
        'Floor':[floor],
        'Super_Area':[super_area],
        'Lien_in_Favour_of':[lien_favour],
    }
    df=pd.DataFrame(data) # creating a pandas dataframe
    # read_pdf returns a list of tables; take the first one (stream is a boolean, not the string 'True')
    table = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', stream=True)[0])
    df_new = pd.concat([df, table], axis=1)
    df_new.to_excel('data.xlsx',index=False) # saving the data in the data.xlsx output file
    #print(df_new)
    #print(table)
    #print(type(table))
    #print(type(df))

main(pdf_file='input.pdf')
Right. So you’ve got main(pdf_file) - you’re already set up to process
a file with an arbitrary name. I’d rename main() to something more
meaningful like process_pdf or the like.
What remains is:
- iterate through all the PDF files
- compute the resulting output filename as pdf_file but with a different extension
You can iterate through the names in a directory with os.listdir, and
pick out the *.pdf names with the fnmatch.fnmatch function.
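As a rough sketch of that iteration step (the helper name list_pdfs is made up for illustration, and lowercasing the name so that .PDF also matches is my assumption about what you want):

```python
import fnmatch
import os


def list_pdfs(folder):
    """Return the full paths of the *.pdf files directly inside folder."""
    names = sorted(os.listdir(folder))  # listdir order is arbitrary, so sort it
    return [
        os.path.join(folder, name)
        for name in names
        # Lowercase the name so "REPORT.PDF" matches the pattern as well.
        if fnmatch.fnmatch(name.lower(), "*.pdf")
    ]
```

You would then loop over list_pdfs("some_folder") and call your function on each path. This does not descend into subfolders.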
You can use the os.path.splitext function to break a filename into a
prefix and the extension. You can use that to get the prefix of the PDF
filename and append the appropriate .csv or other extension according
to the type of output you’re creating.
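For the output name, something along these lines should do (output_name and its new_ext parameter are hypothetical names, not anything from your script):

```python
import os


def output_name(pdf_file, new_ext=".xlsx"):
    """Swap the extension of pdf_file for new_ext, keeping any folder part."""
    stem, _old_ext = os.path.splitext(pdf_file)  # "input.pdf" -> ("input", ".pdf")
    return stem + new_ext


print(output_name("input.pdf"))            # input.xlsx
print(output_name("a.b.pdf", ".csv"))      # a.b.csv
```

Only the last extension is replaced, so dots elsewhere in the filename are left alone.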
The other thing you may want to consider is: does copy/pasting a file
into a directory use a temporary name and only rename to the final name
when all the data have been copied into the file?
If that is not the case, it is possible that you might try to process a
file before it is complete.
There are various ways to address that issue, if it is an issue. I’d
tackle it after other things are already working.
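If it does turn out to be an issue, one crude heuristic (not a guarantee) is to check that the file size has stopped changing before processing it. The name looks_complete and the one-second default wait are arbitrary choices of mine:

```python
import os
import time


def looks_complete(path, wait_seconds=1.0):
    """Heuristic: True if the file size did not change over wait_seconds."""
    size_before = os.path.getsize(path)
    time.sleep(wait_seconds)  # give an in-progress copy time to grow the file
    return os.path.getsize(path) == size_before
```

A slow or stalled copy can still fool this, so treat it as a first approximation only.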
OK. I see in your main function that every output file is named ‘data.xlsx’. Change the final line of the function to instead reuse the same name as the input file:
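One way that change could look, as a sketch only: give the function a second parameter for the output path, so its last line becomes df_new.to_excel(out_path, index=False), and let a small driver compute that path for each PDF. Here process_folder and the two-argument process_pdf callback are assumptions for illustration, and glob is just one of the ways to find the files:

```python
import glob
import os


def process_folder(folder, process_pdf):
    """Call process_pdf(pdf_path, out_path) for every *.pdf file in folder."""
    outputs = []
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        # Same name as the input, but with .xlsx instead of .pdf.
        out_path = os.path.splitext(pdf_path)[0] + ".xlsx"
        process_pdf(pdf_path, out_path)
        outputs.append(out_path)
    return outputs
```

With your function adapted to take (pdf_file, out_path), you would call process_folder("some_folder", main). Note that the "*.pdf" pattern is case-sensitive on Linux and macOS.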
(So far at least four different approaches have been suggested: pathlib, os.listdir, fnmatch, and glob. So much for “there should be one-- and preferably only one --obvious way to do it.”)