Print 5 lines before and after a keyword is found in pdf

LPYTHON · December 13, 2022, 9:35pm

Hi all - I am new to Python and need help printing the 5 lines before and after a search keyword when reading a PDF file. My code only prints the current line when the keyword is found.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            print(line)
            if line.find(word) != -1:
                print('\n')
                print(word, 'string exists in file')
                print('Line Number:', lines.index(line))
                print("\n**** Lines containing Keyword: \"" + word + "\" ****\n")
                print('Line:', line)

MRAB · December 13, 2022, 11:58pm

Please wrap code in triple backticks to preserve the formatting:

```python
if True:
    print('Hello world!')
```

You could read all of the lines and then look for the search keyword in those lines, printing out the surrounding lines when you find a match:

for i, line in enumerate(lines):
    if word in line:
        # Using 'max' in case there's a match near the start of the list.
        print(lines[max(i - 5, 0) : i + 6])

If the text is very big, you’ll have to limit the number of lines in memory at any one time.
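
One way to limit memory, as a rough sketch: keep only the last few lines in a `collections.deque` and count down the lines still owed after a match. The name `iter_context` and its parameters are made up for illustration. It accepts any iterable of lines, so it doesn't depend on how the PDF text is read.

```python
from collections import deque


def iter_context(lines, word, before=5, after=5):
    """Yield each line that belongs to the context of a match, at most once.

    Only the last `before` lines are held in memory at any time.
    """
    previous = deque(maxlen=before)  # most recent lines not yet yielded
    remaining = 0                    # lines still to yield after the last match

    for line in lines:
        if word in line:
            # Flush the buffered lines before the match, then the match itself.
            yield from previous
            previous.clear()
            yield line
            remaining = after
        elif remaining > 0:
            yield line
            remaining -= 1
        else:
            previous.append(line)


sample = [f"line {n}" for n in range(20)]
sample[10] = "line 10 keyword"
for line in iter_context(sample, "keyword", before=2, after=2):
    print(line)
```

Unlike slicing a full list, this yields every line at most once, so two matches close together share their context instead of printing it twice.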

LPYTHON · December 14, 2022, 6:50am

Thank you Matthew. This works. The only problem is that the output is repeated several times instead of being printed once. Is there anything I need to change in the code below? Thanks again.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            #print(line)
        for i, line in enumerate(lines):
            if word in line:
                print('\n')
                #print(word, 'string exists in file')
                print('Line Number:', lines.index(line))
                print("\n**** Lines containing Keyword: \"" + word + "\" ****\n")
                #print('Line:', line)
                print(lines[max(i - 5, 0) : i + 6])

MRAB · December 14, 2022, 6:37pm

The formatting is still messed up because you didn’t wrap it in backticks.

Also, you don’t need lines.index(line) because i is the line number. In fact, lines.index(line) could give you the wrong answer because it’ll stop at the first match, which is a problem if that line occurs more than once.
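
A small illustration of the difference, using a made-up list in which one line occurs twice:

```python
lines = ["total: 10", "header", "total: 10", "footer"]

# enumerate() gives the position of the line currently being visited.
positions = [i for i, line in enumerate(lines) if "total" in line]

# list.index() always returns the position of the first equal element.
first_only = [lines.index(line) for line in lines if "total" in line]

print(positions)   # [0, 2]
print(first_only)  # [0, 0]
```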

LPYTHON · December 18, 2022, 6:39pm

Thanks Matthew. Something is not right in the code, please. The same lines are repeated again and again.

MRAB · December 18, 2022, 7:24pm

You’re searching the lines you’ve collected and printing out the results while you’re iterating over the PDF and adding new lines. Fix the indentation so that you collect the lines first and search them all afterwards.
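
To show the effect without a PDF, here is a sketch with three made-up page strings standing in for the extracted text. Both function names are hypothetical. The first one searches the growing list after every page, which is what I believe your code is doing; the second collects everything first and searches once.

```python
pages = ["alpha\nkeyword here", "beta\ngamma", "delta"]
word = "keyword"


def search_while_collecting(pages, word):
    """Search the growing list after every page (the repeated-output version)."""
    lines, hits = [], []
    for text in pages:
        lines.extend(text.split("\n"))
        # This search is nested in the page loop, so earlier lines are re-checked.
        hits.extend(i for i, line in enumerate(lines) if word in line)
    return hits


def collect_then_search(pages, word):
    """Collect every line first, then search the finished list once."""
    lines = []
    for text in pages:
        lines.extend(text.split("\n"))
    return [i for i, line in enumerate(lines) if word in line]


print(search_while_collecting(pages, word))  # [1, 1, 1]
print(collect_then_search(pages, word))      # [1]
```

The match on the first page is reported once for every page read after it, which is the repetition you're seeing.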

LPYTHON · December 23, 2022, 6:18am

Thanks Matthew. The backticks and formatting work fine in Sublime Text, where I have set the tab width to 4. The formatting gets messed up when I paste the code here. Please can you guide me on where I need to make corrections so that I don't get repeated output? Thanks again for all your help.

import pdfplumber
import pandas as pd

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()

        for i, line in enumerate(text.split('\n')):
            lines.append(line)
            #print(line)
            if line.find(word) != -1:
                print("\n**** Lines containing Keyword: \"" + word + "\" ****\n")
                #print("\" ****\n")
                print("Page No: ", i)
                print(lines[max(i - 5, 0) : i + 6])
        #for i, line in enumerate(lines):

MRAB · December 23, 2022, 7:10pm

When posting code here, please wrap code in triple backticks to preserve the formatting:

```python
if True:
    print('Hello world!')
```

As I said, you need to fix the indentation where you search the lines, like this:

import pdfplumber

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            #print(line)
            
for i, line in enumerate(lines):
    if word in line:
        print('\n')
        #print(word, 'string exists in file')
        print('Line Number:', i)
        print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
        #print('Line:', line)
        print(lines[max(i - 5, 0) : i + 6])
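
If the printed Python list is hard to read, one option is to print the surrounding lines one per row with their line numbers. A minimal sketch follows; `context` is a hypothetical helper, not part of pdfplumber, and the sample lines are made up. It also shows why the `max` is needed: a negative start index in a slice counts from the end of the list.

```python
def context(lines, i, before=5, after=5):
    """Return (line_number, line) pairs around index i, clipped to the list."""
    start = max(i - before, 0)             # a negative start would wrap around
    stop = min(i + after + 1, len(lines))  # don't run past the last line
    return [(n, lines[n]) for n in range(start, stop)]


lines = [f"text {n}" for n in range(8)]
lines[1] = "text 1 keyword"

for i, line in enumerate(lines):
    if "keyword" in line:
        for n, text in context(lines, i, before=2, after=2):
            marker = ">" if n == i else " "  # flag the matching line
            print(f"{marker} {n}: {text}")
```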

LPYTHON · December 24, 2022, 7:48pm

Thank you so very much Matthew. I really appreciate your help and am grateful for it. Merry Christmas and happy new year.

LPYTHON · December 24, 2022, 10:17pm

One final thing, please, if I may ask. I want to search for a list of keywords and print the line when any of these keywords is found. I have tweaked the code but there is no output. Please can you let me know what modifications I need to make?

for i, line in enumerate(lines):
    if line in ["with", "ethical", "leader"]:
        print('\n')
        print(word, 'string exists in file')
        print('Line Number:', i)
        print("\n**** Lines containing Keyword: \"" + word + "\" ****\n")
        print('Line:', line)

MRAB · December 24, 2022, 11:24pm

That will look for the line in the list, which is true only if the line is equal to “with”, “ethical”, or “leader”.

If you want to look for any of those words, do:

    if any(word in line for word in ["with", "ethical", "leader"]):
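
Here is a short comparison of the two tests on made-up lines. The last loop is an optional extra: it keeps the keywords that matched so they can be reported.

```python
keywords = ["with", "ethical", "leader"]
lines = ["An ethical leader", "Nothing to see", "with"]

# Membership test: true only when the whole line equals one of the keywords.
exact = [line for line in lines if line in keywords]

# Substring test: true when any keyword occurs somewhere in the line.
containing = [line for line in lines if any(k in line for k in keywords)]

# Keeping the matched keywords makes it possible to report them.
for i, line in enumerate(lines):
    found = [k for k in keywords if k in line]
    if found:
        print(i, found, line)

print(exact)       # ['with']
print(containing)  # ['An ethical leader', 'with']
```

Note that this is a case-sensitive substring test, so "with" would also match inside "within" but would not match "With".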

LPYTHON · December 25, 2022, 12:09am

Thanks once again. Works perfectly fine

LPYTHON · December 26, 2022, 11:35am

One more thing, please. I want to loop through all the PDF files in a folder, look for the keywords, and print the lines where a keyword is found together with the name of the PDF that contains it. The code below works fine for the first few PDFs. After that it keeps appending lines and prints the name of the wrong PDF against them. Please can you let me know which lines I need to change? Thanks again.

# Importing required modules

import PyPDF2
import os
import PyPDF2
import re
import os
word = ["countersigned", "Authority", "Company"]

lines = []

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):

    for file in files:

        # open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # get number of pages
        pages = pdfReader.getNumPages()

        # Creating a pdf reader object
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

        # Getting number of pages in pdf file
        #pages = pdfReader.numPages
        #Pages = pdfFileObj.getNumPages()

        # Loop for reading all the Pages
        for i in range(pages):

            # Creating a page object
            pageObj = pdfReader.getPage(i)
            # Printing Page Number
            #print("Page No: ",i)
            # Extracting text from page
            # And splitting it into chunks of lines
            #text = pageObj.extractText().split("  ")
            text = pageObj.extract_text()
            for line in text.split('\n'):
                lines.append(line)
                #print(line)

        for i, line in enumerate(lines):

            if any(word in line for word in word):

                #if line in ["with","ethical","leader"]:
                print('\n')
                print(file)
                #print(file)
                #print(word, 'string exists in file')
                #print('Line Number:', i)
                #print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
                print('Line:', line)

        #print(lines[max(i - 2, 0) : i + 2])

MRAB · December 26, 2022, 6:00pm

You’re still not wrapping the code in backticks to preserve its formatting when you post.

LPYTHON · December 27, 2022, 8:53pm

Sure. Please see the code below with backticks. One more thing, please. I want to loop through all the PDF files in a folder, look for the keywords, and print the lines where a keyword is found together with the name of the PDF that contains it. The code below works fine for the first few PDFs. After that it keeps appending lines and prints the name of the wrong PDF against them. Please can you let me know which lines I need to change? Thanks again.

Importing required modules

```import os
```import PyPDF2
```import re
```import os
```word = ["countersigned", "Authority", "Company"]

```lines =[]

```for foldername,subfolders,files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):


``````for file in files:

    # open the pdf file
    ```pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
   
    # get number of pages
    ```pages = pdfReader.getNumPages()


# Creating a pdf reader object


    #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)


# Getting number of pages in pdf file

#pages = pdfReader.numPages
#Pages = pdfFileObj.getNumPages()

# Loop for reading all the Pages


   ```for i in range(pages):

    # Creating a page object
      ```pageObj = pdfReader.getPage(i)
    # Printing Page Number
        #print("Page No: ",i)
    # Extracting text from page
    # And splitting it into chunks of lines
    #text = pageObj.extractText().split("  ")
        ```text = pageObj.extract_text()
        ```for line in text.split('\n'):
            ```lines.append(line)
            #print(line)

   ``` for i, line in enumerate(lines):

        ```if any(word in line for word in word):

#if line in ["with","ethical","leader"]:
           ```print('\n')
           ```print(file)  
           #print(file)
           #print(word, 'string exists in file')
           #print('Line Number:', i)
           #print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n") 
          ``` print('Line:', line)
           
            
    
    #print(lines[max(i - 2, 0) : i + 2])

MRAB · December 27, 2022, 9:20pm

By “wrapping the code in backticks” I mean put 3 backticks ``` on a line before and a line after the code, as in my original reply.

As for the other matter, you just need to clear the lines for each file. I’ve cleaned up the code a little and removed some duplicated imports and unused lines below:

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Authority", "Company"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('file:', file)
        
        # Collect the lines for this file
        lines = []

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)

            # Printing page number
            #print("Page No: ", i)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines = text.splitlines()

            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')
                    print('Line Number:', i)
                    print('Line:', line)

LPYTHON · December 27, 2022, 10:47pm

OK, thanks. By clearing the lines for each file, do you mean removing unused lines?

There seems to be one problem in the code. If the word is found on the last page, the line containing it is printed only once. However, if, let's say, the word is found on the 8th page of a 10-page document, the same line is printed three times: once each for the 8th, 9th, and 10th pages.

cameron · December 27, 2022, 10:54pm

@LPYTHON, MRAB’s shown you the code-wrapper syntax, possibly only once
overtly.

When pasting your code into a post here, please put it between “code
fences”, like this:

 ```python
 your python code
 goes here
 ```

See the marker lines before and after, above?

There’s a button in the forum editor window like this </> for making
these markers.

Thank you,
Cameron Simpson cs@cskk.id.au

LPYTHON · December 27, 2022, 11:30pm

Thanks for letting me know. Please see the code below.

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')
        # Collect the lines for this file
        lines = []

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)

            # Printing page number
            print("Page No: ", i)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines = text.splitlines()

            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')

                    #print('Line Number:', i)
                    print('Line:', line)
                    #print('file:', file)

MRAB · December 28, 2022, 1:14am

I see what you mean. In that case, just clear the lines at the start of each page:

# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')

        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))

        # Get the number of pages
        numPages = pdfReader.getNumPages()

        # Loop for reading all the pages
        for pageNum in range(numPages):
            # Collect the lines for this page
            lines = []

            # Creating a page object
            pageObj = pdfReader.getPage(pageNum)

            # Printing page number
            print("Page No: ", pageNum)

            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()

            for line in text.split('\n'):
                lines.append(line)
                #print(line)

            # The above could be shortened to:
            # lines = text.splitlines()

            for lineNum, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')

                    #print('Line Number:', lineNum)
                    print('Line:', line)
                    #print('file:', file)
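
As a possible follow-up, here is a rough sketch of two helpers that are not in the code above. `find_pdfs` (a hypothetical name) keeps only files ending in `.pdf`, since `os.walk` hands every file in the folder to the reader. `search_pages` (also hypothetical) takes the page texts as plain strings and yields the page number, line number and line for each hit. I've left the PDF library out so that it runs on its own; the page texts would come from calling `extract_text()` on each page. Treating a missing text as empty is an assumption about what the extraction returns for pages without text.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the paths of all .pdf files under folder, sorted for stable output."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def search_pages(page_texts, keywords):
    """Yield (page_number, line_number, line) for every line containing a keyword.

    page_texts is an iterable of strings, one per page. A page with no text
    (None or an empty string) is skipped.
    """
    for page_num, text in enumerate(page_texts):
        # Line numbers restart on every page, so nothing is carried over.
        for line_num, line in enumerate((text or "").splitlines()):
            if any(k in line for k in keywords):
                yield page_num, line_num, line


pages = ["first page\ncountersigned by A", None, "Digital copy"]
for hit in search_pages(pages, ["countersigned", "Digital"]):
    print(hit)
```

Because each page is searched on its own, a match is reported once, together with the page it was found on.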