How to iterate through all the files in the directory

arvin_90 · November 15, 2022, 7:49am

Right now I have to copy one file into the folder, rename it to input.pdf, and run the script. For each new file I replace input.pdf with it and run the program again, and so on…

Now I want to iterate through all the .pdf files in a folder and run the task defined in a function on each one, without renaming the file, and save the output under the same name but with an .xlsx or .csv extension.

from PyPDF2 import PdfReader
import pandas as pd
import tabula

def main(pdf_file):

    name=''
    full_address=''
    contact_no=''
    email_id=''

    # creating a PDF reader object
    # pdf_obj=PdfReader('input.pdf')
    pdf_obj=PdfReader(pdf_file)


    pdf=pdf_obj.pages[0] # get the first page of the PDF (getPage() is deprecated)
    pdf_text=pdf.extract_text() # get the text from the page (extractText() is deprecated)

    #print(pdf_text)

    address=pdf_text.split('Contact No.') # splitting the complete text at 'Contact No.'
    address=address[0]  #to get the first part of the split
    address=address.split('\n')  #now splitting the address by new lines


    for line in address: # iterating through each line and collecting the name, if present
        if 'MR.' in line or 'MS.' in line or 'MRS.' in line:
            name+=line+'\n'
            continue
        full_address+=line+'\n'


    lines=pdf_text.split('\n')  # splitting the pdf text by new lines
    # print(pdf_text)
    for line in lines:
        if 'Contact No.' in line:  #taking the line with contact no
            contact_no=line.split('Contact No.:')[-1].strip() 
        if 'Email Id' in line:     #taking the line with email id
            email_id=line.split('Email Id:')[-1].split()[3]
            # print(line)


    print('**************')

    lines=pdf_text.split('Lien in Favour of')[-1]  # splitting the page text and taking the last part

    lines=lines.split('\n') # now taking lines as a list by splitting for newlines
    # for line in lines:
    #     print(line)

    # replacing unnecessary characters 
    booking_id=lines[0].replace(':','')
    ref_no=lines[1].replace(':','')
    tower=lines[2].replace(':','')
    appart_no=lines[3].replace(':','')
    floor=lines[4].replace(':','')
    super_area=lines[5].replace(':','')

    if 'Applicant Ledger' not in lines[7]: # checking whether "Lien in Favour of" spans multiple lines
        lien_favour=lines[6].replace(':','')+'\n'+lines[7]
    else:
        lien_favour=lines[6].replace(':','') 


    # print(name)
    # print(full_address)
    # print(contact_no)
    # print(email_id)
    # print(booking_id)
    # print(ref_no)
    # print(tower)
    # print(appart_no)
    # print(floor)
    # print(super_area)
    # print(lien_favour)

    #storing the data in dictionary format
    data={
    'Name':[name],
    'Full_Address':[full_address],
    'Contact_No':[contact_no],
    'Email_Id':[email_id],
    'Booking_Id':[booking_id],
    'Ref_No':[ref_no],
    'Tower':[tower],
    'Apartment_No.':[appart_no],
    'Floor':[floor],
    'Super_Area':[super_area],
    'Lien_in_Favour_of':[lien_favour],
    }

    df=pd.DataFrame(data) #creating a pandas dataframe
    table = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', stream=True)[0])
    df_new = pd.concat([df, table],axis = 1)
    df_new.to_excel('data.xlsx',index=False) # saving the data to the data.xlsx output file
    #print(df_new)
    #print(table)
    #print(type(table))
    #print(type(df))

main(pdf_file='input.pdf')

abessman · November 15, 2022, 8:08am

Iterating over all files ending in .pdf in the current directory:

from pathlib import Path

pdf_files = Path(".").glob("*.pdf")
for pdf in pdf_files:
    ...  # Do something with each pdf here.

Getting the filename without file type suffix:

filename = pdf.stem
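
As a small sketch of how these two pieces fit together, the output name can be derived directly from each PDF path. The helper name `output_name_for` is hypothetical (it is not from the thread), and `with_suffix` is used here as an alternative to `stem + ".xlsx"`:

```python
from pathlib import Path


def output_name_for(pdf_path, suffix=".xlsx"):
    """Return the sibling path with the same base name and a new suffix."""
    return Path(pdf_path).with_suffix(suffix)


if __name__ == "__main__":
    # List each PDF in the current directory next to its planned output name.
    for pdf in sorted(Path(".").glob("*.pdf")):
        print(pdf.name, "->", output_name_for(pdf).name)
```

Only the last suffix is replaced, so a name that contains extra dots keeps everything before the final `.pdf`.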

arvin_90 · November 15, 2022, 8:25am

Thank you for your prompt response.
Can you tell me where I should put the above code?

abessman · November 15, 2022, 8:46am

Does your main function already do what you want for a single pdf-file? If so, you can just call it inside the for loop.

cameron · November 15, 2022, 9:00am

Right. So you’ve got main(pdf_file) - you’re already set up to process
a file with an arbitrary name. I’d rename main() to something more
meaningful like process_pdf or the like.

What remains is:

- iterate through all the PDF files
- compute the resulting output filename as pdf_file but with a
  different extension

You can iterate through the names in a directory with os.listdir, or
match the *.pdf names with the fnmatch function.

You can use the os.path.splitext function to break a filename into a
prefix and the extension. You can use that to get the prefix of the PDF
filename and append the appropriate .csv or other extension according
to the type of output you’re creating.
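
A minimal sketch of that approach, assuming the PDFs sit directly in one directory. The generator name `pdf_jobs` and its parameters are hypothetical, chosen only for illustration:

```python
import fnmatch
import os


def pdf_jobs(directory=".", out_ext=".csv"):
    """Yield (input_path, output_path) pairs for every *.pdf in directory."""
    # os.listdir returns names in arbitrary order, so sort for repeatable runs.
    for name in sorted(os.listdir(directory)):
        if fnmatch.fnmatch(name, "*.pdf"):
            # splitext("report.pdf") gives ("report", ".pdf")
            prefix, _ext = os.path.splitext(name)
            yield (os.path.join(directory, name),
                   os.path.join(directory, prefix + out_ext))


if __name__ == "__main__":
    for pdf_path, out_path in pdf_jobs("."):
        print(pdf_path, "->", out_path)
```

Each pair could then be handed to the processing function, with the second element used as the output filename.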

The other thing you may want to consider is: does copy/pasting a file
into a directory use a temporary name and only rename to the final name
when all the data have been copied into the file?

If that is not the case, it is possible that you might try to process a
file before it is complete.

There are various ways to address that issue, if it is an issue. I’d
tackle it after other things are already working.
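
One possible heuristic, shown only as a sketch: treat a file as complete if its size stays the same over a short wait. The function name `is_stable` and the one-second default are assumptions, and this check cannot detect a copy that has merely stalled:

```python
import os
import time


def is_stable(path, wait=1.0):
    """Return True if the file size did not change during `wait` seconds."""
    size_before = os.path.getsize(path)
    time.sleep(wait)
    return os.path.getsize(path) == size_before
```

A batch loop could skip any file for which `is_stable` returns False and pick it up on the next run.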

Cheers,
Cameron Simpson cs@cskk.id.au

abessman · November 15, 2022, 9:07am

This will certainly work, but I would say that using pathlib, as I suggested above, is a more modern approach to achieve the same thing.

arvin_90 · November 15, 2022, 9:19am

No…I have to rename the file as input.pdf.

steven.daprano · November 14, 2022, 12:23pm

main(pdf_file='input.pdf')

Replace that last line with something like this:

import glob

for name in glob.glob('*.pdf'):
    main(pdf_file=name)

Warning: I have not tested the code, so it is possible it may not work as I expect.

abessman · November 15, 2022, 9:37am

But your goal is to not need to do that, right?

OK. I see in your main function that every output file is named ‘data.xlsx’. Change the final line of the function to instead reuse the same name as the input file:

df_new.to_excel(pdf_file.stem + ".xlsx", index=False)

Then, simply replace the final line of the script with this:

pdf_files = Path(".").glob("*.pdf")
for pdf_file in pdf_files:
    process_pdf(pdf_file)  # Rename your 'main' function per Cameron's suggestion.

That should do it. Just run it in the same folder where you keep the pdf-files, and you will get one .xlsx file for each pdf, with the same base name.
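
If one malformed PDF should not stop the whole batch, the loop can be wrapped as in the sketch below. The wrapper name `process_all` and the error handling are additions for illustration, not part of the suggestion above; `process_pdf` stands for whatever function handles a single file:

```python
from pathlib import Path


def process_all(folder, process_pdf):
    """Call process_pdf on every *.pdf in folder; return the names that failed."""
    failed = []
    for pdf_file in sorted(Path(folder).glob("*.pdf")):
        try:
            process_pdf(pdf_file)
        except Exception as exc:
            # Keep going so one bad PDF does not stop the rest of the batch.
            print(f"Skipping {pdf_file.name}: {exc}")
            failed.append(pdf_file.name)
    return failed
```

Calling `process_all(".", process_pdf)` would then run over the current folder and report which files need a second look.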

abessman · November 15, 2022, 9:43am

(So far at least four different approaches have been suggested: pathlib, os.listdir, fnmatch, and glob. So much for “there should be one-- and preferably only one --obvious way to do it.”)

arvin_90 · November 15, 2022, 11:13am

Thank you so much for your support, I really appreciate it.

cameron · November 15, 2022, 10:37pm

Indeed. Your messages were not in front of my face when I composed mine,
and I do not yet use pathlib very much, so it isn’t what I reach for.

I access this forum via email 99% of the time and my mail folder view
was clearly some… minutes out of date


arvin_90 · November 20, 2022, 3:11pm

Guys, I have one problem. While extracting tables using Tabula, I am getting only the table on the first page.

vovavili · November 20, 2022, 3:24pm

Specify multiple_tables=True and pages="all" arguments.
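
A possible contributing factor is the `[0]` index in the original script, which keeps only the first element of the list that `tabula.read_pdf` returns. The sketch below combines every returned table instead. The function name `read_all_tables` is hypothetical, and the code assumes tabula-py (with Java available) and pandas are installed:

```python
def read_all_tables(pdf_path):
    """Return every table tabula finds in the PDF, stacked into one DataFrame."""
    # Imported inside the function so this file can be loaded
    # even on a machine where tabula-py is not installed.
    import pandas as pd
    import tabula

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all",
                             multiple_tables=True, stream=True)
    if not tables:
        return pd.DataFrame()
    # Tables with different columns are aligned by column name;
    # cells that have no value in a given table become NaN.
    return pd.concat(tables, ignore_index=True)
```

Whether stacking is appropriate depends on the PDFs: if the tables on different pages have unrelated layouts, it may be better to keep the list and write each table to its own sheet or file.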

steven.daprano · November 19, 2022, 6:31pm

Thanks for sharing. I hope you solve your problem soon.

If you want help solving your problem, you should:

- Start a new topic, don’t just dump it at the end of a completely unrelated topic where many people will ignore it or not see it.
- Actually ask a question. Be precise about what you want. Saying please and thank you also helps.
- Tell us what you tried. We aren’t mind readers.

Help us to give good answers by asking good questions.

arvin_90 · November 20, 2022, 3:52pm

multiple_tables is used when we have more than one table on a page.