Hi all, I am new to Python and need help printing the 5 lines before and after a search keyword when reading a PDF file. My code only prints the current line when the keyword is found:
import pdfplumber
import pandas as pd
file = ‘C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf’
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split(‘\n’):
lines.append(line)
print(line)
if line.find(word) != -1:
Please wrap code in triple backticks to preserve the formatting:
```python
if True:
    print('Hello world!')
```
You could read all of the lines and then look for the search keyword in those lines, printing out the surrounding lines when you find a match:
```python
for i, line in enumerate(lines):
    if word in line:
        # Using 'max' in case there's a match near the start of the list.
        print(lines[max(i - 5, 0) : i + 6])
```
If the text is very big, you’ll have to limit the number of lines in memory at any one time.
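One way to do that, as a rough sketch rather than a definitive implementation, is to keep only the last few lines in a `collections.deque` and to hold on to each match until its trailing lines have arrived. The function name `context_matches` and its `before`/`after` parameters are made up for this illustration:

```python
from collections import deque

def context_matches(lines, word, before=5, after=5):
    """Yield (line_number, context_lines) for every line containing word.

    Only the last `before` lines are kept in memory, plus any matches
    that are still waiting for their trailing lines.
    """
    recent = deque(maxlen=before)  # the most recent `before` lines
    pending = []                   # [line_number, context, lines_still_needed]
    for i, line in enumerate(lines):
        # Feed this line to matches still collecting their "after" lines.
        for entry in pending:
            entry[1].append(line)
            entry[2] -= 1
        if word in line:
            pending.append([i, list(recent) + [line], after])
        # Emit every match whose context is now complete.
        while pending and pending[0][2] == 0:
            num, context, _ = pending.pop(0)
            yield num, context
        recent.append(line)
    # Matches near the end of the input have fewer trailing lines.
    for num, context, _ in pending:
        yield num, context

sample = [f"row {n}" for n in range(20)]
sample[10] = "needle here"
for num, context in context_matches(sample, "needle"):
    print('Line Number:', num)
    print(context)
```

It accepts any iterable of lines, so a generator that yields lines page by page could be passed in instead of a list.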
Thank you, Matthew. This works. The only problem is that the output is repeated several times instead of appearing once. Is there anything I need to change in the code below? Thanks again.
import pdfplumber
import pandas as pd
file = ‘C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf’
word = input(“Enter the search Keyword: “)
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for line in text.split(‘\n’):
lines.append(line) #print(line)
for i, line in enumerate(lines):
if word in line:
print(‘\n’) #print(word, ‘string exists in file’)
print(‘Line Number:’, lines.index(line))
print(”\n**** Lines containing Keyword: "” +word+ “" ****\n”) #print(‘Line:’, line)
print(lines[max(i - 5, 0) : i + 6])
The formatting is still messed up because you didn’t wrap it in backticks.
Also, you don’t need lines.index(line) because i is the line number. In fact, lines.index(line) could give you the wrong answer because it’ll stop at the first match, which is a problem if that line occurs more than once.
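To illustrate the difference, here is a tiny example with made-up data in which the same line occurs twice:

```python
lines = ["alpha", "needle", "beta", "needle", "gamma"]

index_based = []      # what lines.index(line) reports
enumerate_based = []  # what the loop counter i reports
for i, line in enumerate(lines):
    if "needle" in line:
        index_based.append(lines.index(line))
        enumerate_based.append(i)

print(index_based)      # [1, 1]  (the second match is reported at the wrong position)
print(enumerate_based)  # [1, 3]
```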
You’re searching the lines you’ve collected and printing out the results while you’re iterating over the PDF and adding new lines. Fix the indentation so that you collect the lines first and search them all afterwards.
Thanks, Matthew. The backticks and formatting work fine in Sublime Text, where I have set the tab width to 4. The formatting gets messed up when I paste the code here. Please can you guide me on where I need to make corrections so that the output is not repeated? Thanks again for all your help.
import pdfplumber
import pandas as pd
file = ‘C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf’
word = input("Enter the search Keyword: ")
lines = []
rows = []
with pdfplumber.open(file) as pdf:
pages = pdf.pages
for page in pdf.pages:
text = page.extract_text()
for i, line in enumerate(text.split('\n')):
lines.append(line)
#print(line)
if line.find(word) != -1:
print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
#print("\" ****\n")
print("Page No: ",i)
print(lines[max(i - 5, 0) : i + 6])
#for i, line in enumerate(lines):
When posting code here, please wrap code in triple backticks to preserve the formatting:
```python
if True:
    print('Hello world!')
```
As I said, you need to fix the indentation where you search the lines, like this:
```python
import pdfplumber

file = 'C:/Users/ambar/OneDrive/Desktop/dummy/ab.pdf'
word = input("Enter the search Keyword: ")
lines = []

# First collect all of the lines from every page...
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
            #print(line)

# ...and only then search them.
for i, line in enumerate(lines):
    if word in line:
        print('\n')
        #print(word, 'string exists in file')
        print('Line Number:', i)
        print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
        #print('Line:', line)
        print(lines[max(i - 5, 0) : i + 6])
```
One final thing, please, if I may ask. I want to search for a list of keywords and print the line whenever one of them is found. I have tweaked the code but there is no output. Please can you let me know what modifications I need to make?
for i, line in enumerate(lines):
… if line in [“with”,“ethical”,“leader”]:
… print(‘\n’)
… print(word, ‘string exists in file’)
… print(‘Line Number:’, i)
… print(“\n**** Lines containing Keyword: "” +word+ “" ****\n”)
… print(‘Line:’, line)
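A likely reason for the missing output is that `line in [...]` is only true when the whole line is exactly equal to one of the keywords. To test whether any keyword occurs somewhere inside the line, `any()` can be used instead. Here is a minimal sketch with made-up sample lines; the helper name `find_keyword_lines` is hypothetical, and the match is case-sensitive:

```python
keywords = ["with", "ethical", "leader"]
lines = [
    "An ethical approach matters",
    "nothing relevant here",
    "leader",
]

def find_keyword_lines(lines, keywords):
    """Return (line_number, line, matched_keywords) for lines containing any keyword."""
    results = []
    for i, line in enumerate(lines):
        # Substring test against every keyword, not whole-line equality.
        matched = [kw for kw in keywords if kw in line]
        if matched:
            results.append((i, line, matched))
    return results

for i, line, matched in find_keyword_lines(lines, keywords):
    print('Line Number:', i)
    print('Keywords:', matched)
    print('Line:', line)

# Whole-line equality only matches the line that is exactly "leader":
print([line in keywords for line in lines])  # [False, False, True]
```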
One more thing, please. I want to loop through all the PDF files in a folder, look for keywords, and print the lines where a keyword is found along with the name of the PDF it was found in. The code below works fine for the first few PDFs. After that it starts to append lines and prints the wrong PDF name against them. Please can you let me know which lines I need to change? Thanks again.
Importing required modules
import PyPDF2
import os
import PyPDF2
import re
import os
word = [“countersigned”, “Authority”, “Company”]
lines = []
for foldername,subfolders,files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
for file in files:
# open the pdf file
pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
# get number of pages
pages = pdfReader.getNumPages()
for i in range(pages):
# Creating a page object
pageObj = pdfReader.getPage(i)
# Printing Page Number
#print("Page No: ",i)
# Extracting text from page
# And splitting it into chunks of lines
#text = pageObj.extractText().split(" ")
text = pageObj.extract_text()
for line in text.split('\n'):
lines.append(line)
#print(line)
for i, line in enumerate(lines):
if any(word in line for word in word):
#if line in ["with","ethical","leader"]:
print('\n')
print(file)
#print(file)
#print(word, 'string exists in file')
#print('Line Number:', i)
#print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
print('Line:', line)
#print(lines[max(i - 2, 0) : i + 2])
Sure, please see the code below with backticks. One more thing, please. I want to loop through all the PDF files in a folder, look for keywords, and print the lines where a keyword is found along with the name of the PDF it was found in. The code below works fine for the first few PDFs. After that it starts to append lines and prints the wrong PDF name against them. Please can you let me know which lines I need to change? Thanks again.
Importing required modules
```import os
```import PyPDF2
```import re
```import os
```word = [“countersigned”, “Authority”, “Company”]
```lines =[]
```for foldername,subfolders,files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
``````for file in files:
# open the pdf file
```pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername,file))
# get number of pages
```pages = pdfReader.getNumPages()
# Creating a pdf reader object
#pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Getting number of pages in pdf file
#pages = pdfReader.numPages
#Pages = pdfFileObj.getNumPages()
# Loop for reading all the Pages
```for i in range(pages):
# Creating a page object
```pageObj = pdfReader.getPage(i)
# Printing Page Number
#print("Page No: ",i)
# Extracting text from page
# And splitting it into chunks of lines
#text = pageObj.extractText().split(" ")
```text = pageObj.extract_text()
```for line in text.split('\n'):
```lines.append(line)
#print(line)
``` for i, line in enumerate(lines):
```if any(word in line for word in word):
#if line in ["with","ethical","leader"]:
```print('\n')
```print(file)
#print(file)
#print(word, 'string exists in file')
#print('Line Number:', i)
#print("\n**** Lines containing Keyword: \"" +word+ "\" ****\n")
``` print('Line:', line)
#print(lines[max(i - 2, 0) : i + 2])
By “wrapping the code in backticks” I mean put 3 backticks ``` on a line before and a line after the code, as in my original reply.
As for the other matter, you just need to clear the lines for each file. I’ve cleaned up the code a little and removed some duplicated imports and unused lines below:
```python
# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Authority", "Company"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('file:', file)
        # Collect the lines for this file
        lines = []
        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))
        # Get the number of pages
        numPages = pdfReader.getNumPages()
        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)
            # Printing page number
            #print("Page No: ", i)
            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()
            for line in text.split('\n'):
                lines.append(line)
                #print(line)
            # The above could be shortened to:
            # lines = text.splitlines()
            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')
                    print('Line Number:', i)
                    print('Line:', line)
```
OK, thanks. By clearing the lines for each file, do you mean removing unused lines?
There seems to be one problem in the code. If the word is found on the last page, the line containing it is printed only once. However, if the word is found on, say, the 8th page of a 10-page document, the same line is printed three times: once each for the 8th, 9th and 10th pages.
```python
# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')
        # Collect the lines for this file
        lines = []
        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))
        # Get the number of pages
        numPages = pdfReader.getNumPages()
        # Loop for reading all the pages
        for i in range(numPages):
            # Creating a page object
            pageObj = pdfReader.getPage(i)
            # Printing page number
            print("Page No: ", i)
            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()
            for line in text.split('\n'):
                lines.append(line)
                #print(line)
            # The above could be shortened to:
            # lines = text.splitlines()
            for i, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')
                    #print('Line Number:', i)
                    print('Line:', line)
                    #print('file:', file)
```
I see what you mean. In that case, just clear the lines at the start of each page:
```python
# Importing required modules
import PyPDF2
import os
import re

word = ["countersigned", "Digital", "recklessly"]

for foldername, subfolders, files in os.walk(r"C:/Users/ambar/OneDrive/Desktop/dummy/dummy1"):
    for file in files:
        print('\n')
        print('file:', file)
        print('\n')
        # Open the pdf file
        pdfReader = PyPDF2.PdfFileReader(os.path.join(foldername, file))
        # Get the number of pages
        numPages = pdfReader.getNumPages()
        # Loop for reading all the pages
        for pageNum in range(numPages):
            # Collect the lines for this page
            lines = []
            # Creating a page object
            pageObj = pdfReader.getPage(pageNum)
            # Printing page number
            print("Page No: ", pageNum)
            # Extracting text from page
            # and splitting it into lines
            text = pageObj.extract_text()
            for line in text.split('\n'):
                lines.append(line)
                #print(line)
            # The above could be shortened to:
            # lines = text.splitlines()
            for lineNum, line in enumerate(lines):
                if any(word in line for word in word):
                    print('\n')
                    #print('Line Number:', lineNum)
                    print('Line:', line)
                    #print('file:', file)
```
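To see why clearing the lines per page matters, here is a small simulation that needs no PDF library. The `pages` list is a hypothetical stand-in for the text of a 3-page document, and the two function names are made up for this sketch. The keyword list is also called `keywords` here, so that the loop variable in `any(...)` does not reuse the name of the list:

```python
# Hypothetical stand-in for the extracted text of a 3-page PDF.
pages = [
    "intro\nsigned and countersigned",
    "nothing to see",
    "the end",
]
keywords = ["countersigned", "Digital", "recklessly"]

def search_accumulating(pages, keywords):
    """Lines are collected once per file but searched again after every page."""
    hits = []
    lines = []
    for page_num, text in enumerate(pages):
        lines.extend(text.splitlines())
        for line in lines:
            if any(kw in line for kw in keywords):
                hits.append((page_num, line))
    return hits

def search_per_page(pages, keywords):
    """Lines are reset for every page, so each line is searched only once."""
    hits = []
    for page_num, text in enumerate(pages):
        lines = text.splitlines()
        for line in lines:
            if any(kw in line for kw in keywords):
                hits.append((page_num, line))
    return hits

print(len(search_accumulating(pages, keywords)))  # 3: the same line is reported on every later page
print(search_per_page(pages, keywords))           # [(0, 'signed and countersigned')]
```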