Print 5 lines before and after a keyword found in a PDF

Hi all - I am new to Python and need help printing the 5 lines before and after a search keyword when it is found while reading a PDF file. My code only prints the current line when the keyword is found.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split('\n'):
lines.append(line)
print(line)
if line.find(word) != -1:

                    print('\n')
                    print(word, 'string exists in file')
                    print('Line Number:', lines.index(line))
                    print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
                    print('Line:', line)

Please wrap code in triple backticks to preserve the formatting:

```python
if True:
    print('Hello world!')
```

You could read all of the lines and then look for the search keyword in those lines, printing out the surrounding lines when you find a match:

for i, line in enumerate(lines):
    if word in line:
        # Using 'max' in case there's a match near the start of the list.
        print(lines[max(i - 5, 0) : i + 6])

If the text is very big, you’ll have to limit the number of lines in memory at any one time.
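For a very large text, one possible approach is a sliding window that holds only the lines around the current position. The sketch below is an illustration, not part of the original reply: the function name `search_with_context` and its `context` parameter are hypothetical, and it assumes the lines arrive from any iterable, such as a generator that yields lines page by page.

```python
from collections import deque


def search_with_context(lines, word, context=5):
    """Yield (line_number, surrounding_lines) for each line containing `word`.

    `lines` can be any iterable, so at most 2 * context + 1 lines
    are kept in memory at any one time.
    """
    window = deque(maxlen=2 * context + 1)  # (line_number, text) pairs
    pending = deque()  # line numbers of matches still waiting for following lines

    def snapshot(match_no):
        # Pick out the lines within `context` of the match from the window.
        nearby = [text for no, text in window
                  if match_no - context <= no <= match_no + context]
        return match_no, nearby

    for line_no, text in enumerate(lines):
        window.append((line_no, text))
        if word in text:
            pending.append(line_no)
        # A match is complete once `context` lines after it have been read.
        while pending and line_no - pending[0] >= context:
            yield snapshot(pending.popleft())

    # End of input: the remaining matches have fewer following lines.
    while pending:
        yield snapshot(pending.popleft())


sample = (f"line {n}" if n != 10 else "has keyword" for n in range(20))
for number, nearby in search_with_context(sample, "keyword", context=2):
    print(number, nearby)
# 10 ['line 8', 'line 9', 'has keyword', 'line 11', 'line 12']
```

Because the function is a generator, each match is reported as soon as its following lines have been read, and earlier lines drop out of the window automatically.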

Thank you Matthew. This works. The only problem is that the output is repeated several times instead of appearing once. Is there anything I need to change in the code below? Thanks again.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split('\n'):
lines.append(line)
#print(line)
for i, line in enumerate(lines):
if word in line:
print('\n')
#print(word, 'string exists in file')
print('Line Number:', lines.index(line))
print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
#print('Line:', line)
print(lines[max(i - 5, 0) : i + 6])

The formatting is still messed up because you didn’t wrap it in backticks.

Also, you don’t need lines.index(line) because i is the line number. In fact, lines.index(line) could give you the wrong answer because it’ll stop at the first match, which is a problem if that line occurs more than once.
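A small made-up example of that difference (the sample lines are invented for illustration): when the same text appears twice, `enumerate` reports each position, while `list.index` always returns the first one.

```python
lines = ["total: 10", "header", "total: 10", "footer"]

# Positions reported by enumerate: one per matching line.
by_enumerate = [i for i, line in enumerate(lines) if "total" in line]

# Positions reported by list.index: always the first equal line.
by_index = [lines.index(line) for line in lines if "total" in line]

print(by_enumerate)  # [0, 2]
print(by_index)      # [0, 0]
```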

Thanks Matthew. Something is still not right in the code: the same lines are repeated again and again.

You’re searching the lines you’ve collected and printing out the results while you’re iterating over the PDF and adding new lines. Fix the indentation so that you collect the lines first and search them all afterwards.

Thanks Matthew. The backticks and formatting work fine in Sublime Text, where I have set the tab width to 4. The formatting gets messed up when I paste the code here. Please can you guide me on where I need to make corrections so that the output is not repeated? Thanks again for all your help.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()

    for i, line in enumerate(text.split('\n')):
        lines.append(line)
        #print(line)
        if line.find(word) != -1:
                print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
                    #print("\" ****\n")
                print("Page No: ",i)
                print(lines[max(i - 5, 0) : i + 6]) 
        #for i, line in enumerate(lines):

When posting code here, please wrap code in triple backticks to preserve the formatting:

```python
if True:
    print('Hello world!')
```

As I said, you need to fix the indentation where you search the lines, like this:

import pdfplumber

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            #print(line)
            
for i, line in enumerate(lines):
    if word in line:
        print('\n')
        #print(word, 'string exists in file')
        print('Line Number:', i)
        print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
        #print('Line:', line)
        print(lines[max(i - 5, 0) : i + 6])

Thank you so very much, Matthew. Much appreciated, and I am grateful for your help. Merry Christmas and happy new year.

One final thing, if I may ask. I want to search for a list of keywords and print the line whenever any of these keywords is found. I have tweaked the code, but there is no output. Please can you let me know what modifications I need to make?

for i, line in enumerate(lines):
… if line in ["with","ethical","leader"]:
… print('\n')
… print(word, 'string exists in file')
… print('Line Number:', i)
… print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
… print('Line:', line)

That checks whether the line is in the list, which is true only if the whole line is exactly equal to “with”, “ethical”, or “leader”.

If you want to look for any of those words, do:

    if any(word in line for word in ["with", "ethical", "leader"]):
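To make the difference concrete, here is a short illustration with an invented sample line. The names `keywords`, `exact`, `contains`, and `found` are chosen for this example only.

```python
keywords = ["with", "ethical", "leader"]
line = "An ethical leader acts with integrity"

# True only when the whole line equals one of the keywords.
exact = line in keywords

# True when at least one keyword occurs somewhere in the line.
contains = any(keyword in line for keyword in keywords)

# The keywords that actually occur in the line.
found = [keyword for keyword in keywords if keyword in line]

print(exact)     # False
print(contains)  # True
print(found)     # ['with', 'ethical', 'leader']
```

Note that `keyword in line` is a substring test, so "leader" would also match a line containing "leadership".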

Thanks once again. Works perfectly fine

One more thing, please. I want to loop through all the PDF files in a folder, look for the keywords, and print each line where a keyword is found together with the name of the PDF it was found in. The code below works fine for the first few PDFs. After that, it starts to append lines and prints the wrong PDF name against them. Please can you let me know which lines I need to change? Thanks again.

Importing required modules

import PyPDF2
import os
import PyPDF2
import re
import os
word = ["countersigned", "Authority", "Company"]

lines = []

for foldername,subfolders,files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):

for file in files:

    # open the pdf file
    pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
   
    # get number of pages
    pages = pdfReader.getNumPages()

Creating a pdf reader object

    #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Getting number of pages in pdf file

#pages = pdfReader.numPages
#Pages = pdfFileObj.getNumPages()

Loop for reading all the Pages

    for i in range(pages):

    # Creating a page object
        pageObj = pdfReader.getPage(i)
    # Printing Page Number
        #print("Page No: ",i)
    # Extracting text from page
    # And splitting it into chunks of lines
    #text = pageObj.extractText().split("  ")
        text = pageObj.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            #print(line)

    for i, line in enumerate(lines):

        if any(word in line for word in word):

#if line in ["with","ethical","leader"]:
           print('\n')
           print(file)  
           #print(file)
           #print(word, 'string exists in file')
           #print('Line Number:', i)
           #print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
           print('Line:', line)
           
            
    
    #print(lines[max(i - 2, 0) : i + 2])

You’re still not wrapping the code in backticks to preserve its formatting when you post.

Sure. Please see the code below with backticks. One more thing, please. I want to loop through all the PDF files in a folder, look for the keywords, and print each line where a keyword is found together with the name of the PDF it was found in. The code below works fine for the first few PDFs. After that, it starts to append lines and prints the wrong PDF name against them. Please can you let me know which lines I need to change? Thanks again.

Importing required modules

```import os
```import PyPDF2
```import re
```import os
```word = [“countersigned”, “Authority”, “Company”]

```lines =[]

```for foldername,subfolders,files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):


``````for file in files:

    # open the pdf file
    ```pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
   
    # get number of pages
    ```pages = pdfReader.getNumPages()


# Creating a pdf reader object


    #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)


# Getting number of pages in pdf file

#pages = pdfReader.numPages
#Pages = pdfFileObj.getNumPages()

# Loop for reading all the Pages


   ```for i in range(pages):

    # Creating a page object
      ```pageObj = pdfReader.getPage(i)
    # Printing Page Number
        #print("Page No: ",i)
    # Extracting text from page
    # And splitting it into chunks of lines
    #text = pageObj.extractText().split("  ")
        ```text = pageObj.extract_text()
        ```for line in text.split('\n'):
            ```lines.append(line)
            #print(line)

   ``` for i, line in enumerate(lines):

        ```if any(word in line for word in word):

#if line in ["with","ethical","leader"]:
           ```print('\n')
           ```print(file)  
           #print(file)
           #print(word, 'string exists in file')
           #print('Line Number:', i)
           #print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
          ``` print('Line:', line)
           
            
    
    #print(lines[max(i - 2, 0) : i + 2])

By “wrapping the code in backticks” I mean put 3 backticks ``` on a line before and a line after the code, as in my original reply.

As for the other matter, you just need to clear the lines for each file. I’ve cleaned up the code a little and removed some duplicated imports and unused lines below:

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Authority", "Company"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('file:', file)
        
        # Collect the lines for this file
        lines = []

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)

            # Printing page number
            #print("Page No: ", i)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines.extend(text.splitlines())

            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')
                    print('Line Number:', i)
                    print('Line:', line)

OK, thanks. By clearing the lines for each file, do you mean removing unused lines?

There seems to be one problem in the code. If the word is found on the last page, the line containing it is printed only once. However, if, say, the word is found on the 8th page of a 10-page document, the same line is printed three times: once each for the 8th, 9th, and 10th pages.

@LPYTHON, MRAB’s shown you the code-wrapper syntax, possibly only once
overtly.

When pasting your code into a post here, please put it between “code
fences”, like this:

 ```python
 your python code
 goes here
 ```

See the marker lines before and after, above?

There’s a button in the forum editor window like this </> for making
these markers.

Thank you,
Cameron Simpson cs@cskk.id.au

Thanks for letting me know. Please see the code below.

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')
        # Collect the lines for this file
        lines = []

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)

            # Printing page number
            print("Page No: ", i)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines = text.splitlines()

            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')

                    #print('Line Number:', i)
                    print('Line:', line)
                    #print('file:', file)

I see what you mean. In that case, just clear the lines at the start of each page:

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for pageNum in range(numPages):
            # Collect the lines for this page
            lines = []

            # Creating a page object
            pageObj = pdfReader.getPage(pageNum)

            # Printing page number
            print("Page No: ", pageNum)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines = text.splitlines()

            for lineNum, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')

                    #print('Line Number:', lineNum)
                    print('Line:', line)
                    #print('file:', file)
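The page-by-page search above could also be factored into a small helper that is independent of the PDF library. This is only a sketch: `find_keyword_lines` is a hypothetical name, and it assumes the page texts have already been extracted (for example with `extract_text()` as in the code above) and are passed in as plain strings. The sample pages are invented.

```python
def find_keyword_lines(page_texts, keywords):
    """Yield (page_number, line_number, line) for each line containing a keyword.

    `page_texts` is an iterable of strings, one per page. Each page is
    searched on its own, so a matching line is reported exactly once.
    """
    for page_num, text in enumerate(page_texts):
        # Defensive assumption: treat a missing page text (None) as empty.
        for line_num, line in enumerate((text or "").splitlines()):
            if any(keyword in line for keyword in keywords):
                yield page_num, line_num, line


pages = ["Title page\nsigned and countersigned", "", "Digital copy\nend"]
for page_num, line_num, line in find_keyword_lines(pages, ["countersigned", "Digital"]):
    print(f"page {page_num}, line {line_num}: {line}")
# page 0, line 1: signed and countersigned
# page 2, line 0: Digital copy
```

Keeping the search separate from the file handling also makes it easy to check with plain strings before pointing it at real PDFs.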