Extracting XML from PDFs

Hello people,

This is my first post, so please don't be too harsh with me :)
Here is my current situation. I want to write a program that extracts the XML files from a PDF and combines multiple XML files into one. I tried doing it with pdfminer.six, and the error message I get is: No module named 'miner_text_generator'. What am I missing here? Am I using an outdated version, or am I just not understanding something? (I have been coding in Python for two months now.)

The code I currently have:

import os
import xml.etree.ElementTree as xml
from miner_text_generator import extract_text_by_page
from xml.dom import minidom
from tkinter.filedialog import askopenfilename
from datetime import datetime
import sys

openfile = askopenfilename()
today = datetime.now()

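def export_as_xml(pdf_path, xml_path):
    # Use the PDF's file name (without extension) as the root tag.
    # Note: the name must be a valid XML tag (no spaces, no leading digit).
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    root = xml.Element(base_name)
    pages = xml.SubElement(root, 'Pages')

    # One child element per page, holding the first 100 characters of its text.
    for counter, page in enumerate(extract_text_by_page(pdf_path), start=1):
        text = xml.SubElement(pages, 'Page_{}'.format(counter))
        text.text = page[0:100]

    # Round-trip through minidom to get indented output.
    xml_string = xml.tostring(root, 'utf-8')
    pretty_string = minidom.parseString(xml_string).toprettyxml(indent='  ')

    # 'w' rather than 'a': appending a second document to the same file
    # would leave it with two root elements, which is not valid XML.
    with open(xml_path, 'w', encoding='utf-8') as fh:
        fh.write(pretty_string)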

if __name__ == '__main__':
    pdf_path = openfile
    xml_path = 'my_file.xml'
    export_as_xml(pdf_path, xml_path)

Hello Benjamin,

The line

    from miner_text_generator import extract_text_by_page

is trying to import a function from a module called miner_text_generator, i.e., from a file miner_text_generator.py. The error message is telling you that this module (this file) does not exist. It is not a standard module and not part of pdfminer.six, so you have to supply it yourself.

I assume you copied this line from somewhere. Wherever you got it from should also tell you what the miner_text_generator module is supposed to be, and where to find it.
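If you want to check for yourself where Python would load a module from, something like the following might help. It is only a small sketch using the standard library's importlib; the helper name describe_module is made up for this example.

```python
import importlib.util


def describe_module(name):
    """Report where Python would load `name` from, or that it is missing."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "{}: not found on sys.path".format(name)
    return "{}: found at {}".format(name, spec.origin)


# A standard-library module is always found.
print(describe_module("xml.dom.minidom"))
# This one is only found if miner_text_generator.py is somewhere Python looks,
# for example in the same folder as the script you are running.
print(describe_module("miner_text_generator"))
```

If the second line reports "not found", the import in your script will fail with exactly the error you are seeing.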

Thanks for the response. I thought I had installed everything necessary. I'm going to double-check that. Thanks a lot.

Thanks for getting me to look into it again. I found out I had to write another program and put it in the same folder as my current program so that the import can find it. The good thing is I learned something new today :)
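For anyone who lands here with the same error: the helper file could look roughly like the sketch below. This is an assumption on my part, not necessarily what the original tutorial or the poster's own file contains. It assumes pdfminer.six is installed (pip install pdfminer.six) and that the file is saved as miner_text_generator.py in the same folder as the main script. Only the function name extract_text_by_page comes from the import in the question.

```python
# miner_text_generator.py (file name assumed from the import in the question)
import io


def extract_text_by_page(pdf_path):
    """Yield the text of each page of the PDF at pdf_path, one string per page."""
    # Imported here so that a missing pdfminer.six installation shows up as an
    # ImportError on first use rather than when this file is imported.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh):
            # A fresh in-memory buffer per page keeps the pages separate.
            buffer = io.StringIO()
            converter = TextConverter(resource_manager, buffer, laparams=LAParams())
            try:
                PDFPageInterpreter(resource_manager, converter).process_page(page)
                text = buffer.getvalue()
            finally:
                converter.close()
            yield text
```

With this file next to the main script, the line "from miner_text_generator import extract_text_by_page" resolves, and the loop in export_as_xml receives one string per page.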