How to iterate through all the files in the directory

Right now I have to copy one file into the folder, rename it to input.pdf, and run the script. For each new file, I replace input.pdf with that file and run the program again, and so on…

Now I want to iterate through all the .pdf files in a folder, run the function below on each one without renaming it, and save the output under the same name but with an .xlsx or .csv extension.

from PyPDF2 import PdfReader
import pandas as pd
import tabula

def main(pdf_file):

    name=''
    full_address=''
    contact_no=''
    email_id=''

    # creating a PDF reader object
    # pdf_obj=PdfReader('input.pdf')
    pdf_obj=PdfReader(pdf_file)


    pdf=pdf_obj.pages[0] #to get the first page of the PDF
    pdf_text=pdf.extract_text() # now to get the text from the page

    #print(pdf_text)

    address=pdf_text.split('Contact No.') #splitting the complete text at 'Contact No.'
    address=address[0]  #to get the first part of the split
    address=address.split('\n')  #now splitting the address by new lines


    for line in address: # iterate through each line and collect the name, if present
        if 'MR.' in line or 'MS.' in line or 'MRS.' in line:
            name+=line+'\n'
            continue
        full_address+=line+'\n'


    lines=pdf_text.split('\n')  #splitting the pdf text by new lines
    # print(pdf_text)
    for line in lines:
        if 'Contact No.' in line:  #taking the line with contact no
            contact_no=line.split('Contact No.:')[-1].strip() 
        if 'Email Id' in line:     #taking the line with email id
            email_id=line.split('Email Id:')[-1].split()[3]
            # print(line)


    print('**************')

    lines=pdf_text.split('Lien in Favour of')[-1]  #splitting the page text and taking the last part

    lines=lines.split('\n') # now taking lines as a list by splitting for newlines
    # for line in lines:
    #     print(line)

    # replacing unnecessary characters 
    booking_id=lines[0].replace(':','')
    ref_no=lines[1].replace(':','')
    tower=lines[2].replace(':','')
    appart_no=lines[3].replace(':','')
    floor=lines[4].replace(':','')
    super_area=lines[5].replace(':','')

    if 'Applicant Ledger' not in lines[7]: #checking if the "Lien in Favour of" is in multiple lines
        lien_favour=lines[6].replace(':','')+'\n'+lines[7]
    else:
        lien_favour=lines[6].replace(':','') 


    # print(name)
    # print(full_address)
    # print(contact_no)
    # print(email_id)
    # print(booking_id)
    # print(ref_no)
    # print(tower)
    # print(appart_no)
    # print(floor)
    # print(super_area)
    # print(lien_favour)

    #storing the data in dictionary format
    data={
    'Name':[name],
    'Full_Address':[full_address],
    'Contact_No':[contact_no],
    'Email_Id':[email_id],
    'Booking_Id':[booking_id],
    'Ref_No':[ref_no],
    'Tower':[tower],
    'Apartment_No.':[appart_no],
    'Floor':[floor],
    'Super_Area':[super_area],
    'Lien_in_Favour_of':[lien_favour],
    }

    df=pd.DataFrame(data) #creating a pandas dataframe
    table = pd.DataFrame(tabula.read_pdf(pdf_file, pages='all', stream=True)[0])
    df_new = pd.concat([df, table],axis = 1)
    df_new.to_excel('data.xlsx',index=False) # saving the data in the data.xlsx output file
    #print(df_new)
    #print(table)
    #print(type(table))
    #print(type(df))

main(pdf_file='input.pdf')

Iterating over all files ending in .pdf in the current directory:

from pathlib import Path

pdf_files = Path(".").glob("*.pdf")
for pdf in pdf_files:
    ...  # Do something with each pdf here.

Getting the filename without file type suffix:

filename = pdf.stem
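
As a small illustration of what pathlib offers for the naming part, the sketch below derives an output name from an input path. The file name reports/invoice_01.pdf and the helper output_name are made up for the example.

```python
from pathlib import Path


def output_name(pdf_path, new_suffix=".xlsx"):
    """Return pdf_path with its extension swapped for new_suffix."""
    return Path(pdf_path).with_suffix(new_suffix)


pdf = Path("reports/invoice_01.pdf")  # hypothetical file name
print(pdf.stem)          # base name without the extension
print(pdf.suffix)        # the extension, including the dot
print(output_name(pdf))  # same folder and base name, new extension
```

with_suffix keeps the folder part of the path, so the output lands next to the input; pdf.stem alone gives just the bare name.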

Thank you for your prompt response.
Can you tell me where I have to put the above code?

Does your main function already do what you want for a single pdf-file? If so, you can just call it inside the for loop.

Right. So you’ve got main(pdf_file) - you’re already set up to process
a file with an arbitrary name. I’d rename main() to something more
meaningful like process_pdf or the like.

What remains is:

  • iterate through all the PDF files
  • compute the resulting output filename as pdf_file but with a
    different extension

You can iterate through the names in a directory with os.listdir, or
match the *.pdf names with the fnmatch function.

You can use the os.path.splitext function to break a filename into a
prefix and the extension. You can use that to get the prefix of the PDF
filename and append the appropriate .csv or other extension according
to the type of output you’re creating.
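
A minimal sketch of that approach, using only the standard library. The helper names list_pdfs and swap_extension are invented for illustration, and the loop at the end only prints the names it would use.

```python
import fnmatch
import os


def list_pdfs(folder="."):
    """Return the sorted names in folder that match *.pdf."""
    return sorted(fnmatch.filter(os.listdir(folder), "*.pdf"))


def swap_extension(filename, new_ext=".csv"):
    """Replace the extension of filename with new_ext."""
    prefix, _old_ext = os.path.splitext(filename)
    return prefix + new_ext


for name in list_pdfs("."):
    print(name, "->", swap_extension(name))
```

Note that os.listdir returns bare names, so for a folder other than the current one you would join each name with the folder (os.path.join) before opening it.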

The other thing you may want to consider is: does copy/pasting a file
into a directory use a temporary name and only rename to the final name
when all the data have been copied into the file?

If that is not the case, it is possible that you might try to process a
file before it is complete.

There are various ways to address that issue, if it is an issue. I’d
tackle it after other things are already working.
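
One of those ways, sketched here as a heuristic rather than a guarantee: skip any file that was modified very recently and pick it up on a later run. The helper name looks_complete and the 5-second threshold are assumptions for illustration, not something from the script above.

```python
import os
import time


def looks_complete(path, min_age_seconds=5.0, now=None):
    """Guess whether path has finished being written.

    Heuristic only: the file counts as complete if it was last modified
    at least min_age_seconds ago. A stalled copy would still fool it.
    """
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) >= min_age_seconds
```

Inside the processing loop this would be used as a filter, for example: if not looks_complete(name): continue.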

Cheers,
Cameron Simpson cs@cskk.id.au

This will certainly work, but I would say that using pathlib, as I suggested above, is a more modern approach to achieve the same thing.

No…I have to rename the file as input.pdf.

main(pdf_file='input.pdf')

Replace that last line with something like this:

import glob

for name in glob.glob('*.pdf'):
    main(pdf_file=name)

Warning: I have not tested the code, so it is possible it may not work as I expect.

But your goal is to not need to do that, right?

OK. I see in your main function that every output file is named ‘data.xlsx’. Change the final line of the function to instead reuse the same name as the input file:

df_new.to_excel(pdf_file.stem + ".xlsx", index=False)

Then, simply replace the final line of the script with this:

pdf_files = Path(".").glob("*.pdf")
for pdf_file in pdf_files:
    process_pdf(pdf_file)  # Rename your 'main' function per Cameron's suggestion.

That should do it. Just run it in the same folder where you keep the pdf-files, and you will get one .xlsx file for each pdf, with the same base name.
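
For reference, here is the same loop as a self-contained sketch that can be tried without any PDF libraries installed. The names process_all and show are hypothetical; show is only a stand-in for the real processing function and would be replaced by one that reads the PDF and writes out_file.

```python
from pathlib import Path


def process_all(folder, process):
    """Call process(pdf_file, out_file) for every *.pdf in folder."""
    outputs = []
    for pdf_file in sorted(Path(folder).glob("*.pdf")):
        out_file = pdf_file.with_suffix(".xlsx")  # same folder, same base name
        process(pdf_file, out_file)
        outputs.append(out_file)
    return outputs


def show(pdf_file, out_file):
    # Stand-in for the real work; it only reports what would happen.
    print(f"{pdf_file} -> {out_file}")


process_all(".", show)
```

This variant uses with_suffix instead of stem, so each output is written next to its input even when the script is run from a different folder. On case-sensitive file systems the pattern "*.pdf" will not match names ending in ".PDF".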

(So far at least four different approaches have been suggested: pathlib, os.listdir, fnmatch, and glob. So much for “there should be one-- and preferably only one --obvious way to do it.” :upside_down_face:)

Thank you so much for your support, I really appreciate it. :slight_smile:

Indeed. Your messages were not in front of my face when I composed mine,
and I do not yet use pathlib very much, so it isn’t what I reach for.

I access this forum via email 99% of the time and my mail folder view
was clearly some… minutes out of date :slight_smile:

Cheers,
Cameron Simpson cs@cskk.id.au

Guys, I have one problem. While extracting tables using Tabula, I am getting only the table on the first page.

Specify multiple_tables=True and pages="all" arguments.
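
Note also that the script earlier in this thread takes element [0] of the result, which keeps only the first table that tabula returns. A rough sketch of collecting all of them instead, assuming tabula-py (with Java) is installed and that the tables share the same columns; the helper names combine_tables and read_all_tables are made up for the example.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one; an empty list gives an empty frame."""
    if not tables:
        return pd.DataFrame()
    return pd.concat(tables, ignore_index=True)


def read_all_tables(pdf_path):
    # Imported here so combine_tables can be used without tabula-py/Java.
    import tabula

    # With multiple_tables=True, read_pdf returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    return combine_tables(tables)
```

If the tables have different columns, pd.concat still works but fills the gaps with NaN, so it may be better to keep them as separate frames in that case.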

Thanks for sharing. I hope you solve your problem soon.

If you want help solving your problem, you should:

  1. Start a new topic, don’t just dump it at the end of a completely unrelated topic where many people will ignore it or not see it.
  2. Actually ask a question. Be precise about what you want. Saying please and thank you also helps.
  3. Tell us what you tried. We aren’t mind readers.

Help us to give good answers by asking good questions.

multiple_tables is used when there is more than one table on a page.