The problem of inserting text into a PDF

kongjin81-China · October 24, 2024, 9:57am

I use the code below to insert text into a PDF. The inserted content, 1111, is in the correct position when I open the PDF, but when I query the file with the read_pdf_with_fitz function, it is found at the end of the page's text. Please help me see what modifications are needed to make this work.

import fitz  # PyMuPDF

def replace_text_in_pdf(input_file, output_file, target_text, modified_text):
    pdf_document = fitz.open(input_file)  # Replace this path with the path to your PDF file

    for page in pdf_document:
        text_instances = page.search_for(target_text)  # Find all instances of the target text

        for inst in text_instances:
            # Redact (remove) the original text
            page.add_redact_annot(inst)
            page.apply_redactions()

            x0, y0, x1, y1 = inst  # Get the position of the original text

            page.insert_text((x0, y0), modified_text + " " + target_text, fontsize=10, color=(0, 0, 0), fontname="helv")  # Insert the new text

    pdf_document.save(output_file)  # Replace this path with the path where you want to save the PDF
    pdf_document.close()  # Close the PDF file

def read_pdf_with_fitz(file_path, modified_text):
    pdf_document = fitz.open(file_path)

    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        text = page.get_text()  # Extract the page text
        if modified_text in text:
            print(f"Page {page_number + 1} text found successfully.")
        print(f"Page {page_number + 1}:\n{text}\n{'-' * 40}")

input_pdf = "input1.pdf"
output_pdf = "output22.pdf"
target_text = '因此，通过数据计算项⽬的收益相当可观。'
modified_text = '1111'

replace_text_in_pdf(input_pdf, output_pdf, target_text, modified_text)

read_pdf_with_fitz(output_pdf, modified_text)

c-rob · October 24, 2024, 9:11pm

I do not know how to fix your code, but I can explain why this may be happening. I worked for a client to make sure that PDFs read by a computer were read in the same order as the text shown on the page. Working with the internal text of the PDF pages, I (and others who worked on this project) noticed that the internal text, which is what the computer reads, was not in the same order as it appeared on the page.

This may or may not have something to do with your issue.

PDF internal structure is bizarre and didn’t make sense to us.
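
A minimal sketch of that mismatch, in plain Python with made-up coordinates rather than a real PDF. It assumes each piece of text comes with a bounding box (x0, y0, x1, y1) where y grows downward, and that newly inserted text is stored after everything that was already on the page. The function names and the line_tolerance parameter are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a bounding box plus the text inside it.
# Assumption: y grows downward, so a smaller y0 means higher on the page.
Block = Tuple[float, float, float, float, str]


def text_in_stream_order(blocks: List[Block]) -> str:
    """Join block texts in the order in which they are stored."""
    return "\n".join(block[4] for block in blocks)


def text_in_reading_order(blocks: List[Block], line_tolerance: float = 3.0) -> str:
    """Join block texts top-to-bottom, then left-to-right.

    A block joins the current line when its top edge is within
    line_tolerance of the top edge of that line's first block.
    """
    lines: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b[1]):
        if lines and abs(block[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    ordered = [b for line in lines for b in sorted(line, key=lambda b: b[0])]
    return "\n".join(b[4] for b in ordered)


# Made-up page: the "1111" block sits in the middle of the page visually,
# but it was added last, so it is stored last.
blocks: List[Block] = [
    (72.0, 100.0, 300.0, 112.0, "First paragraph"),
    (72.0, 160.0, 300.0, 172.0, "Third paragraph"),
    (72.0, 130.0, 300.0, 142.0, "1111 Second paragraph"),
]

print(text_in_stream_order(blocks))
print("-" * 40)
print(text_in_reading_order(blocks))
```

In PyMuPDF, page.get_text("blocks") returns tuples that begin with the same four coordinates followed by the text, and page.get_text(sort=True) asks the library to apply a comparable top-to-bottom sort itself. Whether that fully resolves the case in this thread has not been tested here.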

funkyfuture · October 25, 2024, 11:06am

that’s because the format is designed to produce highly accurate prints. it has to make sense for a printer, not a human. if you intend to use PDF for something other than providing input to a printing device, that’s the fundamental error you have to correct.

besides that, @kongjin81-China, if you ask for help, make sure that your question is comprehensible to others.

c-rob · October 28, 2024, 9:18am

Frank, people who use ideographic languages, like Mandarin or Cantonese, often use machine translation, and free machine translation services are not that great. So we will see plenty of bad translations. Even for languages based on the Latin character set, machine translation is not that great.

We used machine translation a few years back and got a human translator to check it. The results from the translation software were awful.

kongjin81-China · October 30, 2024, 2:54am

I appreciate your help.

Rosuav · October 30, 2024, 6:47am

Oh, I’m pretty sure the internal structure of a PDF doesn’t make sense to a printer either. It’s not quite PSD level insane but it’s up there. Some of PDF’s weirdnesses make sense in very specific, narrow circumstances (eg there are some features that are bizarre but make it possible to digitally sign the file multiple times by different people), others look like what happens when a file format grows and evolves over time while maintaining backward compatibility with older publishers and readers (there are three or four different ways to do the same thing, but all modern PDFs use just one of them), but then there are some just downright bizarre mixtures.

PostScript was (presumably) designed with the printer in mind. PDF? Not so sure. If you told me that its primary purpose was to boost sales of alcohol, I wouldn’t be able to prove you wrong.