Python Script Help - Exporting PDF Text and Images to a Word Document

BrittNBI · June 4, 2024, 7:09pm

I need help modifying the code below. Its purpose is to export the paragraphs in this PDF that contain an asterisk, along with the photos directly below those paragraphs. The issue I am running into is that it exports ALL the images on the page with each paragraph. I only need the images directly below the paragraphs with the asterisk.

import fitz # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")

    for block in blocks:
        block_text = block[4]

        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)

            # Extract images associated with this paragraph
            image_list = page.get_images(full=True)
            for image_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]

                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{image_index}.png"
                image.save(image_filename)

                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")

rhj09 · June 20, 2024, 6:18pm

Hi there,

Here’s a concise version:

import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches
import io
from PIL import Image

# Load the PDF document
pdf_document = fitz.open("Sample Home.pdf")

# Create a Word document
word_document = Document()

# Iterate through each page of the PDF
for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    blocks = page.get_text("blocks")

    for block in blocks:
        block_text = block[4]

        # Check if the paragraph includes an asterisk
        if '*' in block_text:
            # Add the paragraph to the Word document
            word_document.add_paragraph(block_text)

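            # Extract images directly below this paragraph. get_images() cannot
            # filter by position, so compare bounding boxes: keep images whose
            # top edge lies between the bottom of this paragraph and the top of
            # the next text block further down the page.
            para_bottom = block[3]
            next_tops = [b[1] for b in blocks if b[6] == 0 and b[1] >= para_bottom]
            limit = min(next_tops, default=float("inf"))
            images_below = [
                info for info in page.get_image_info(xrefs=True)
                if info["xref"] > 0 and para_bottom <= info["bbox"][1] < limit
            ]
            images_below.sort(key=lambda info: info["bbox"][1])

            for img_info in images_below:
                xref = img_info["xref"]
                base_image = pdf_document.extract_image(xref)
                image_bytes = base_image["image"]

                # Load image using PIL
                image = Image.open(io.BytesIO(image_bytes))
                image_filename = f"image_{page_num}_{xref}.png"
                image.save(image_filename)

                # Add image to the Word document
                word_document.add_picture(image_filename, width=Inches(5))

# Save the Word document
word_document.save("Extracted_Paragraphs_and_Images.docx")

Short Explanation:

page.get_image_info(xrefs=True): Returns every image on the page together with its bounding box and xref. page.get_images() has no way to filter by position, and block[0] is the block's left x-coordinate rather than an xref, so the bounding boxes are what tie an image to a paragraph.
images_below: Keeps only the images whose top edge falls between the bottom of the asterisk paragraph (block[3]) and the top of the next text block, sorted top to bottom.

With this approach, your Word document should include the paragraphs with asterisks and only the images located directly beneath them in the PDF. It assumes a single-column layout where each photo starts below its paragraph; if an image slightly overlaps the paragraph's bottom edge, subtract a few points from para_bottom in the comparison.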

Adjust filenames and specifics as needed for your project.
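
If it helps to tune the "directly below" rule before running it on the real PDF, the position test can be tried in isolation. Below is a minimal sketch in plain Python, assuming bounding boxes are (x0, y0, x1, y1) tuples with y growing downward, as in the block tuples above. The helper name images_directly_below, the tolerance parameter, and the horizontal-overlap check are my own assumptions for illustration, not part of PyMuPDF.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def images_directly_below(
    paragraph: BBox,
    text_blocks: Sequence[BBox],
    images: Sequence[BBox],
    tolerance: float = 2.0,
) -> List[int]:
    """Return indexes of images between the paragraph and the next text block.

    Hypothetical helper: an image qualifies when its top edge is at or below
    the paragraph's bottom edge (minus a small tolerance), above the next text
    block, and it overlaps the paragraph horizontally.
    """
    para_bottom = paragraph[3]
    # Top edge of the nearest text block that starts below the paragraph.
    next_tops = [b[1] for b in text_blocks if b[1] >= para_bottom]
    limit = min(next_tops, default=float("inf"))

    selected = []
    for index, (x0, y0, x1, y1) in enumerate(images):
        starts_below = y0 >= para_bottom - tolerance
        before_next_text = y0 < limit
        overlaps_horizontally = x0 < paragraph[2] and x1 > paragraph[0]
        if starts_below and before_next_text and overlaps_horizontally:
            selected.append(index)

    # Order top to bottom so pictures are added in reading order.
    selected.sort(key=lambda i: images[i][1])
    return selected


paragraph = (50, 100, 500, 140)
text_blocks = [(50, 40, 500, 80), paragraph, (50, 400, 500, 440)]
images = [
    (50, 10, 200, 35),    # above the paragraph
    (50, 150, 300, 380),  # directly below the paragraph
    (50, 450, 300, 600),  # below the next text block
]
print(images_directly_below(paragraph, text_blocks, images))  # [1]
```

The horizontal-overlap check is there for pages with more than one column; drop it if every photo spans the full width. In the script, the paragraph and text boxes would come from block[:4] and the image boxes from the "bbox" entries.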