PDF File reader

cadrenw · July 6, 2023, 10:51pm

I am trying to get Python code to read my PDF and search for a certain keyword, but I keep coming up empty. Could anyone help me out?

MRAB · July 6, 2023, 11:01pm

What are you using to read the PDF?

PDFs usually compress their contents to reduce their size, so you can’t necessarily just read it as a plain text file.
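A small illustration of that point, using only the standard library: page content in a PDF is commonly stored in zlib-compressed streams, so a keyword you can see on the page need not appear in the file’s raw bytes. The stream content below is made up for the demonstration and is not taken from any real PDF.

```python
import zlib

# Made-up page content; in a real PDF this would sit inside a content stream.
content = b"BT /F1 12 Tf (Mix Design: 50) Tj ET\n" * 3

# Compressed streams are written to the file in this form, not as plain text.
raw_bytes_in_file = zlib.compress(content)

keyword = b"Mix Design"
found_in_raw = keyword in raw_bytes_in_file
found_after_decompress = keyword in zlib.decompress(raw_bytes_in_file)

print("found in raw bytes:", found_in_raw)
print("found after decompressing:", found_after_decompress)
```

This is why a PDF library is needed: it decompresses the streams and interprets the text operators for you.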

cadrenw · July 6, 2023, 11:04pm

I have tried a few things, like pdfminer and PyPDF2.
Is it possible to read a PDF file with Python?

MRAB · July 6, 2023, 11:12pm

Why are you asking whether it’s possible when you’ve already used pdfminer and PyPDF2?

Surely the question should be why it’s not finding the keyword!

What code did you use when you were trying with, say, PyPDF2?

rob42 · July 6, 2023, 11:35pm

Simply set up a file handler, then use the .extract_text() method on each page, in a loop; that should do it.
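A minimal sketch of that loop, extended to search for a keyword. It assumes PyPDF2 is installed and that the PDF contains real text; the function names `load_page_texts` and `pages_containing` and the sample strings are made up for illustration.

```python
def load_page_texts(pdf_path):
    """Return the extracted text of every page (empty string if there is none)."""
    # Imported here so the search helper below can be tried without PyPDF2.
    from PyPDF2 import PdfReader

    with open(pdf_path, "rb") as filehandle:
        reader = PdfReader(filehandle)
        return [page.extract_text() or "" for page in reader.pages]


def pages_containing(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Made-up page texts standing in for load_page_texts("your_file.pdf"):
sample = ["Batch report\nMix Design: 50", "", "Totals\nmix design: 72"]
print(pages_containing(sample, "Mix Design"))
```

Keeping the search separate from the PDF reading makes it easy to print the extracted text first and check whether anything came out at all.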

cadrenw · July 6, 2023, 11:50pm

Well, I’m using ChatGPT for my code, but it can’t answer all my questions.

cadrenw · July 7, 2023, 12:11am

I tried that and I still come up empty.

MRAB · July 7, 2023, 12:15am

Although ChatGPT can give you code, it won’t necessarily be correct. Sometimes it won’t even run!

Without any code or the PDF, there’s no way to know what’s wrong.

cadrenw · July 7, 2023, 12:17am

import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])

        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height

        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()

        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number

mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")

rob42 · July 7, 2023, 12:18am

I’ll let you have the code that I use, but as you’ve a different use case, you’ll have to adapt.

What this does is to extract the text from each page and save the text in a text file. I know it works for the PDF files that I used it on, but the files I’ve used it on are not protected in any way, so if you’re not seeing any results, then that’ll be why and I don’t have a solution for protected files.

from PyPDF2 import PdfReader

pdf_document = input("file: ")

file_name = pdf_document.split('.')

text_file = f'{file_name[0]}.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)

cadrenw · July 7, 2023, 12:26am

I’m still new to this, so bear with me!

from PyPDF2 import PdfReader

pdf_document = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.pdf'
file_name = pdf_document.split('.')

text_file = 'C:/Users/Operator/Onedrive/Desktop/Batch Reports/New Folder/1.txt'

with open(pdf_document, "rb") as filehandle, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(filehandle)
    num_of_pages = len(pdf.pages)
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)

Is this right?

MRAB · July 7, 2023, 12:29am

To preserve the formatting of the code (and, in the case of error, the traceback), select the code and then click the </> button.

cadrenw · July 7, 2023, 12:31am

import PyPDF2
import re

def extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        page = reader.pages[0]
        page_width = float(page.mediaBox.upper_right[0])
        page_height = float(page.mediaBox.upper_right[1])

        crop_x1 = x1 * page_width
        crop_x2 = x2 * page_width
        crop_y1 = (1 - y2) * page_height
        crop_y2 = (1 - y1) * page_height

        cropped_page = page.crop((crop_x1, crop_y1, crop_x2, crop_y2))
        text = cropped_page.extract_text()

        # Use regex pattern to find the mix design number
        pattern = r'Mix Design:\s*(\d+)'
        match = re.search(pattern, text)
        if match:
            extracted_mix_design_number = match.group(1)
            if extracted_mix_design_number == desired_mix_design_number:
                return True
            else:
                return False
        else:
            return False

# Example usage
pdf_path = 'C:/Users/Operator/Desktop/Batch Reports/New Folder/1.pdf'  # Replace with the actual path to your PDF file
x1 = 0.8  # Replace with the x-coordinate of the top-left corner of the rectangular region
x2 = 0.9  # Replace with the x-coordinate of the bottom-right corner of the rectangular region
y1 = 0.1  # Replace with the y-coordinate of the top-left corner of the rectangular region
y2 = 0.2  # Replace with the y-coordinate of the bottom-right corner of the rectangular region
desired_mix_design_number = '50'  # Replace with the desired mix design number

mix_design_found = extract_mix_design_number(pdf_path, x1, y1, x2, y2, desired_mix_design_number)
if mix_design_found:
    print("Mix Design Number found in the specified region")
else:
    print("Mix Design Number not found in the specified region")

rob42 · July 7, 2023, 12:35am

Yeah looks okay to me.

As you’re using hard-coded file paths, you can drop the file_name = pdf_document.split('.') line, but it’ll do no harm to leave it there; it’s just not doing anything, is all.
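For the version that asks for the file name, one thing to watch is that split('.') cuts at the first dot anywhere in the path, not just at the extension. A small sketch of an alternative using pathlib from the standard library; the helper name `text_path_for` and the example path are made up for illustration.

```python
from pathlib import Path


def text_path_for(pdf_path):
    """Return the same path with its final extension replaced by .txt."""
    return Path(pdf_path).with_suffix(".txt")


# A made-up path with a dot in a folder name shows the difference.
example = "reports/v1.2/batch.pdf"
naive = f"{example.split('.')[0]}.txt"  # cuts at the first dot in the path

print(naive)
print(text_path_for(example))
```

With hard-coded paths, as in the script above, neither approach is needed.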

cadrenw · July 7, 2023, 1:00am

This is all that’s in the txt file:

Page: 1

rob42 · July 7, 2023, 1:06am

Are you sure that the PDF files have text, and not simply images of text? Can you select the text with your mouse, as you would in a regular text document?
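The same check can be sketched in code: if every page comes back with no text, the PDF is quite possibly a scan (images of text), and extract_text() has nothing to return. The helper name `pages_without_text` and the sample list are made up; the list stands in for the per-page results of extract_text().

```python
def pages_without_text(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


# Made-up results standing in for [page.extract_text() for page in pdf.pages]:
extracted = ["Page one has text", "   \n", "", None]
empty_pages = pages_without_text(extracted)

print("pages with no extractable text:", empty_pages)
if len(empty_pages) == len(extracted):
    print("No page has text; the PDF may contain only images.")
```

A text file that contains only the "Page: 1" header, as above, corresponds to the case where the first page is in that list.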

MRAB · July 7, 2023, 1:07am

I downloaded PyPDF2 3.0.1 from PyPI and found that page.mediaBox is now page.mediabox and there’s no page.crop.
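Since there is no page.crop, one possible way to limit extraction to a region in PyPDF2 3.x is the visitor_text argument of extract_text(), which calls a function for each piece of text along with its text matrix. This is a rough sketch, not a tested solution for your file: the function names are made up, it assumes the page’s lower-left corner is at (0, 0), and it treats tm[4] and tm[5] as the text position, which can be off if the PDF applies extra transformations.

```python
def make_region_visitor(x_min, y_min, x_max, y_max):
    """Build a visitor that keeps text whose origin lies inside the rectangle.

    Coordinates are PDF points with the origin at the bottom-left of the page.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        x, y = tm[4], tm[5]  # translation part of the text matrix
        if x_min <= x <= x_max and y_min <= y <= y_max:
            parts.append(text)

    return visitor, parts


def extract_region_text(pdf_path, x1, y1, x2, y2, page_index=0):
    """Extract text from a region given as fractions measured from the top-left."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 3.x

    with open(pdf_path, "rb") as file:
        page = PdfReader(file).pages[page_index]
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        # Flip the y axis: PDF coordinates start at the bottom of the page.
        visitor, parts = make_region_visitor(
            x1 * width, (1 - y2) * height, x2 * width, (1 - y1) * height
        )
        page.extract_text(visitor_text=visitor)
    return "".join(parts)


# The visitor can be tried without a PDF by calling it with made-up matrices:
visitor, parts = make_region_visitor(100, 600, 300, 700)
visitor("Mix Design: 50", None, [1, 0, 0, 1, 150, 650], None, 12)
visitor("Footer", None, [1, 0, 0, 1, 150, 40], None, 12)
print(parts)
```

If the whole-page extract_text() already returns nothing, restricting it to a region will not help; check that first.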

cadrenw · July 7, 2023, 1:07am

How could I show you my PDF?

rob42 · July 7, 2023, 1:11am

Thank you.

@cadrenw I think that this post was meant to be directed at you.

cadrenw · July 7, 2023, 1:13am

I know about the update; that’s just the code ChatGPT gave me.
