Convert any type of pdf to xml formatted

tanvi · December 19, 2022, 4:58am

Hello,

Below is my code, where I am converting pdf to xml format.
But this gives me xml formatted file only if I used XFA-PDF(pdf form) formatted pdf.
I need to convert any type of pdf to xml format and xml contain information about text value, tables, images, objects/drawings and their x,y co-ordinates.
Is there any way to get this type of xml from pdf?

Thank you!

import PyPDF2
import re
def findInDict(needle, haystack):
    for key in haystack.keys():
        try:
            value=haystack[key]
        except:
            continue
        if key==needle:
            return value
        if isinstance(value,dict):          
            x=findInDict(needle,value)            
            if x is not None:
                return x

def create_xml_PDFform(xfa):
    for i in range(0,len(xfa)):
        try:
            xml = xfa[i].getObject().getData()
            f = open('C:\\Users\\tanvi_karekar\\'+str(pdf_file)+'.xml', 'ab')
            f.write(xml)
            f.close()
        except:
            continue

if __name__ == '__main__':
    pdf_file = 'sampleDoc3'
    pdf_file_path = 'C:\\Users\\tanvi_karekar\\'+str(pdf_file)+'.pdf'
    pdfobject = open(pdf_file_path,'rb')
    pdf = PyPDF2.PdfFileReader(pdfobject)
    xfa = findInDict('/XFA',pdf.resolved_objects) 
    create_xml_PDFform(xfa)

dlshriver · December 22, 2022, 9:53pm

It might help to provide some examples of what you are trying to do. It’s not clear to me that all pdfs can be converted to xml, or would be useful as xml. What is the problem you are trying to solve?

tanvi · December 26, 2022, 8:54am

Basically, I want structure of pdf into xml format. Suppose pdf contain table. row1 has columns has heading “image” and “name” and row2 has first column contain image1 & second contain its name like “Image123”. Then I want xml like,

<Page1>
<Table>
<Row1>
<Col1>
       <text>image</text>
</Col1>
<Col2>
       <text>name</text>
</Col2>
</Row1>
<Row2>
<Col1>
       <img name = "image1.jpg"></img>
</Col1>
<Col2>
       <text>Image123</text>
</Col2>
</Row2>
</Table>
</Page1>

Not exactly same but like that.

dlshriver · December 26, 2022, 4:08pm

Have you looked at pdfminer (https://pypi.org/project/pdfminer/)? It claims to be able to convert PDFs to xml.