PDF File reader

I am trying to get Python code to read my PDF and search for a certain keyword, but I keep coming up empty. Could anyone help me out?

What are you using to read the PDF?

PDFs usually compress their contents to reduce their size, so you can’t necessarily just read it as a plain text file.
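To illustrate the compression point: PDF content streams are commonly deflate-compressed, so a keyword that is visible on the page generally won't show up as plain bytes in the file. Here is a minimal sketch using only the standard library's zlib, with a made-up string standing in for a real PDF stream (a real PDF has more layers than this, such as font encodings).

```python
import zlib

# Hypothetical page text standing in for a real PDF content stream.
page_text = b"Mix Design: 50\n" * 20
keyword = b"Mix Design"

# Deflate-compress it, similar in spirit to a compressed PDF stream.
compressed = zlib.compress(page_text)

# A naive byte search of the compressed data misses the keyword...
found_in_raw = keyword in compressed
# ...but the keyword is there once the data has been decompressed.
found_after_inflate = keyword in zlib.decompress(compressed)

print(found_in_raw, found_after_inflate)
```

This is why a library such as pdfminer or PyPDF2 is needed rather than opening the file as text.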

I have tried a few things, like pdfminer and PyPDF2.
Is it possible to read a PDF file with Python?

Why are you asking whether it’s possible when you’ve already used pdfminer and PyPDF2?

Surely the question should be why it’s not finding the keyword!

What code did you use when you were trying with, say, PyPDF2?

Simply set up a file handler, then use the .extract_text() method on each page, in a loop; that should do it.
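As a rough sketch of that loop, assuming PyPDF2 3.x is installed: the helper names find_keyword_pages and read_page_texts are made up for illustration, and the search is kept separate from the PDF reading so it can be tried on plain strings first.

```python
import re


def find_keyword_pages(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if pattern.search(text or "")
    ]


def read_page_texts(pdf_path):
    """Extract the text of every page with PyPDF2 (third-party package)."""
    from PyPDF2 import PdfReader  # imported here so the search helper works without it

    with open(pdf_path, "rb") as filehandle:
        reader = PdfReader(filehandle)
        return [page.extract_text() for page in reader.pages]


if __name__ == "__main__":
    # Stand-in strings; with a real file use read_page_texts("your.pdf") instead.
    sample_pages = ["Batch report", "Mix Design: 50", ""]
    print(find_keyword_pages(sample_pages, "mix design"))
```

If every page comes back as an empty string, the problem is with the text extraction rather than with the keyword search.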

Well, I'm using ChatGPT for my code, but it can't answer all my questions.

I tried that and I come up empty.

Although ChatGPT can give you code, it won’t necessarily be correct. Sometimes it won’t even run!

Without any code or the PDF, there’s no way to know what’s wrong.

import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])

        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height

        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()

        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number

mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")

I’ll let you have the code that I use, but as you’ve a different use case, you’ll have to adapt.

What this does is extract the text from each page and save it to a text file. I know it works for the PDF files I've used it on, but those files are not protected in any way. If you're not seeing any results, protection may be why, and I don't have a solution for protected files.

from PyPDF2 import PdfReader

pdf_document = input("file: ")

file_name = pdf_document.split('.')

text_file = f'{file_name[0]}.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)
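One small caveat with building the output name via split('.'): it cuts at the first dot anywhere in the path, so a folder name containing a dot would give the wrong file name. A possible alternative, sketched with the standard library's pathlib (the helper name text_path_for and the example path are made up for illustration):

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Return pdf_path with only its final extension swapped for .txt."""
    return Path(pdf_path).with_suffix(".txt")


# Hypothetical path with a dot in a folder name.
example = "reports/batch.v2/1.pdf"
print(example.split(".")[0])   # everything before the FIRST dot
print(text_path_for(example))  # only the final suffix is replaced
```

For simple file names with a single dot, both approaches give the same result.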

I'm still new to this, so bear with me!

from PyPDF2 import PdfReader

pdf_document = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.pdf'
file_name = pdf_document.split('.')

text_file = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)

Is this right?

To preserve the formatting of the code (and, in the case of error, the traceback), select the code and then click the </> button.

import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])

        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height

        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()

        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number

mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")

Yeah looks okay to me.

As you're using hard-coded file paths, you can drop the file_name = pdf_document.split('.') line. It won't do any harm to leave it there; it's just not doing anything, is all.

This is all that's in the txt file:

Page: 1

Are you sure that the PDF files have text, and not simply images of text? Can you select the text with your mouse, as you would in a regular text document?
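A quick way to check this from code is to see which pages come back with no text at all. The sketch below works on a list of already-extracted page strings; the function names are made up for illustration, and an empty result only hints at a scanned PDF rather than proving it.

```python
def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def describe(page_texts):
    """Summarise whether the pages look like they contain extractable text."""
    empty = pages_without_text(page_texts)
    if page_texts and len(empty) == len(page_texts):
        return "no extractable text on any page (possibly scanned images)"
    if empty:
        return f"no extractable text on page(s): {empty}"
    return "every page has extractable text"


# Stand-in data; with PyPDF2 the list could be built as
# [page.extract_text() for page in PdfReader(path).pages]
print(describe([""]))
print(describe(["Mix Design: 50", "  \n"]))
```

If the PDF turns out to be images of text, it would need OCR, which PyPDF2 does not do.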

I downloaded PyPDF2 3.0.1 from PyPI and found that page.mediaBox is now page.mediabox and there’s no page.crop.
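For what it's worth, here is one possible way to get region-based text without page.crop. It is a sketch only, assuming a PyPDF2 3.x install where extract_text accepts a visitor_text callback. The names make_region_collector and extract_region_text are made up for illustration, and the fractions follow the top-left convention of the code above. It also assumes the page's mediabox starts at (0, 0) and that the text matrix holds page coordinates, which is not true for every PDF.

```python
def make_region_collector(page_width, page_height, x1, y1, x2, y2):
    """Build a visitor_text callback that keeps text inside a fractional region.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    given as fractions of the page size measured from the top-left.
    """
    left, right = x1 * page_width, x2 * page_width
    # PDF coordinates start at the bottom-left, so flip the y axis.
    bottom, top = (1 - y2) * page_height, (1 - y1) * page_height
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        x, y = tm[4], tm[5]  # text position taken from the text matrix
        if left <= x <= right and bottom <= y <= top:
            parts.append(text)

    return visitor, parts


def extract_region_text(pdf_path, x1, y1, x2, y2, page_index=0):
    """Return the text PyPDF2 reports inside the region on one page."""
    from PyPDF2 import PdfReader  # third-party; assumed to be PyPDF2 3.x

    reader = PdfReader(pdf_path)
    page = reader.pages[page_index]
    width = float(page.mediabox.width)    # lowercase mediabox in 3.x
    height = float(page.mediabox.height)
    visitor, parts = make_region_collector(width, height, x1, y1, x2, y2)
    page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

The regex check for 'Mix Design:' can then be run on the returned string. It is probably worth printing the whole page's text first, since an empty page would make any region come back empty too.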

How could I show you my PDF?

Thank you.

@cadrenw I think that this post was meant to be directed at you.

I know about the update; that's just the code ChatGPT gave me.