I am trying to get Python code to read my PDF and search for a certain keyword, but I keep coming up empty. Could anyone help me out?
What are you using to read the PDF?
PDFs usually compress their contents to reduce their size, so you can’t necessarily just read it as a plain text file.
I have tried a few things, like pdfminer and PyPDF2.
Is it possible to read a PDF file with Python?
Why are you asking whether it’s possible when you’ve already used pdfminer and PyPDF2?
Surely the question should be why it’s not finding the keyword!
What code did you use when you were trying with, say, PyPDF2?
Simply set up a file handler, then use the .extract_text() method on each page, in a loop; that should do it.
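As a rough sketch of that loop applied to the keyword search, something like the following might work. The function names are made up for illustration, and it assumes PyPDF2 3.x is installed and that the PDF actually contains selectable text.

```python
def find_keyword_pages(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    hits = []
    for number, text in enumerate(page_texts, start=1):
        # Treat a missing or empty extraction result as an empty string.
        if needle in (text or "").lower():
            hits.append(number)
    return hits


def read_page_texts(pdf_path):
    """Extract the text of every page (hypothetical helper, PyPDF2 3.x assumed)."""
    from PyPDF2 import PdfReader  # imported here so the search helper works without it

    with open(pdf_path, "rb") as filehandle:
        reader = PdfReader(filehandle)
        return [page.extract_text() for page in reader.pages]
```

You would call it as find_keyword_pages(read_page_texts(your_pdf_path), "Mix Design"). An empty list means the keyword was not found in any extracted text, which is not the same as it not being visible on the page.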
Well, I'm using ChatGPT for my code, but it can't answer all my questions.
I tried that and I still come up empty.
Although ChatGPT can give you code, it won’t necessarily be correct. Sometimes it won’t even run!
Without any code or the PDF, there’s no way to know what’s wrong.
import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])
        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height
        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()
        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number
mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")
I’ll let you have the code that I use, but as you have a different use case, you’ll have to adapt it.
This extracts the text from each page and saves it to a text file. I know it works for the PDF files I’ve used it on, but none of those were protected in any way. If you’re not seeing any results, protection may be why, and I don’t have a solution for protected files.
from PyPDF2 import PdfReader

pdf_document = input("file: ")
file_name = pdf_document.split('.')
text_file = f'{file_name[0]}.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)
I'm still new to this, so bear with me!
from PyPDF2 import PdfReader

pdf_document = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.pdf'
file_name = pdf_document.split('.')
text_file = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)
Is this right?
To preserve the formatting of the code (and, in the case of an error, the traceback), select the code and then click the </> button.
import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])
        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height
        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()
        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number
mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")
Yeah looks okay to me.
As you’re using hard-coded file paths, you can drop the file_name = pdf_document.split('.') line, but it won’t do any harm to leave it there; it’s just not doing anything, is all.
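If you ever go back to deriving the output name from the input path, note that split('.') cuts at the first dot anywhere in the path, so a folder name containing a dot would give the wrong result. A small sketch using pathlib from the standard library avoids that; text_path_for is just an illustrative name.

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Return a path next to the PDF with the extension swapped for .txt."""
    # with_suffix() only replaces the final extension of the file name,
    # so dots in folder names are left alone.
    return Path(pdf_path).with_suffix(".txt")
```

For example, text_path_for("reports/v1.2/batch.pdf") points at reports/v1.2/batch.txt, whereas "reports/v1.2/batch.pdf".split('.')[0] would stop at "reports/v1".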
This is all that's in the txt file:
Page: 1
Are you sure that the PDF files have text, and not simply images of text? Can you select the text with your mouse, as you would in a regular text document?
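You can also check this from Python. Here is a minimal sketch (the helper names are my own, and it assumes PyPDF2 3.x) that lists the pages for which extract_text() gives back nothing but whitespace.

```python
def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def check_pdf(pdf_path):
    """List the pages of a PDF with no extractable text (hypothetical helper)."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x is installed

    with open(pdf_path, "rb") as filehandle:
        reader = PdfReader(filehandle)
        texts = [page.extract_text() for page in reader.pages]
    return pages_without_text(texts)
```

If every page number comes back, the PDF most likely holds scanned images rather than text, and PyPDF2 on its own won't be able to read it; that would need OCR with a different tool.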
I downloaded PyPDF2 3.0.1 from PyPI and found that page.mediaBox is now page.mediabox and there’s no page.crop.
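One possible way to get a similar effect in PyPDF2 3.x is to work out the rectangle yourself and filter the text with the visitor_text argument of extract_text(). This is only a sketch: the function names are made up, and it assumes the position in the text matrix (tm[4], tm[5]) is close enough to page coordinates, which may not hold for every PDF.

```python
def fraction_box_to_pdf_rect(x1, y1, x2, y2, left, bottom, right, top):
    """Convert fractions measured from the top-left corner into PDF points.

    PDF coordinates start at the bottom-left, so the y axis is flipped.
    Returns (x_min, y_min, x_max, y_max).
    """
    width = right - left
    height = top - bottom
    return (
        left + x1 * width,
        bottom + (1 - y2) * height,
        left + x2 * width,
        bottom + (1 - y1) * height,
    )


def extract_text_in_region(pdf_path, x1, y1, x2, y2, page_index=0):
    """Collect text fragments positioned inside the region (hypothetical helper)."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x is installed

    with open(pdf_path, "rb") as filehandle:
        page = PdfReader(filehandle).pages[page_index]
        box = page.mediabox  # lowercase in PyPDF2 3.x
        x_min, y_min, x_max, y_max = fraction_box_to_pdf_rect(
            x1, y1, x2, y2,
            float(box.left), float(box.bottom), float(box.right), float(box.top),
        )
        parts = []

        def visitor(text, cm, tm, font_dict, font_size):
            # Assumption: tm[4] and tm[5] give the fragment's x and y position.
            x, y = tm[4], tm[5]
            if x_min <= x <= x_max and y_min <= y <= y_max:
                parts.append(text)

        page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

The regex search from the original code could then be run on the string that extract_text_in_region returns.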
How could I show you my PDF?
I know about the update; that's just the code ChatGPT gave me.