Insert hierarchical data into a SQL file

Hello,

I have spent the past few days working on a script that reads a PDF and assigns each line to a category level based on its font size (pdfminer.six reports this in PDF points rather than pixels). That part is working.

Each line is then written out as a MySQL INSERT statement.

However, the INSERT statements are meant to populate a database table, which will in turn be used to populate HTML selects. The selects are dynamically dependent on each other, so each record needs two levels: its own and the one above it.

So, how would you get the nearest level up based on font size?

I had an idea: maybe right after this line in the script, with open('/path-to-file/insert_master_format.sql') as fileobj: text = fileobj.read(), you could look at the nearest higher level above the current record and copy its value into the current record. For example, given ('NULL', 'NULL', 'NULL', 'Project Directory ', 'NULL'); the value 'Project Directory ' would be copied into the record for the next lvl_5?

But I'm not sure if that's feasible… any ideas? The trouble I see is that by the time the record is in the SQL file, it no longer has the column designation, meaning it's a string, not a list. In SQL this is pretty easy, but I'm really new to Python.
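One way this might work is to fill in the parent levels while the values are still Python lists, before anything is written to the SQL file. Below is a minimal sketch of that idea, not a tested solution for the real PDF. The names `SIZE_TO_LEVEL` and `fill_ancestors` are made up for illustration, the font sizes are the ones from my script, and the sample lines are invented placeholders rather than real MasterFormat entries.

```python
# Rounded font size -> position in the record (0 = lvl_1 ... 4 = lvl_5).
# The sizes are the ones used in the script in this post.
SIZE_TO_LEVEL = {20: 0, 14: 1, 12: 2, 11: 3, 10: 4}


def fill_ancestors(lines):
    """Yield one 5-item record per line, with the levels above it filled in.

    `lines` is an iterable of (font_size, text) pairs in document order.
    """
    current = [None] * 5  # most recent value seen at each level
    for font_size, text in lines:
        level = SIZE_TO_LEVEL.get(font_size)
        if level is None:
            continue  # unknown font size: ignore the line
        current[level] = text.strip()
        # A new heading invalidates everything nested below it.
        for deeper in range(level + 1, 5):
            current[deeper] = None
        yield list(current)


# Invented placeholder lines, only to show the carry-down behaviour.
sample = [
    (20, "Group A\n"),
    (14, "Division A\n"),
    (10, "Item 1\n"),
    (14, "Division B\n"),
    (10, "Item 2\n"),
]
records = list(fill_ancestors(sample))
for record in records:
    print(record)
```

Each record keeps `None` for levels that have no value, so the conversion to SQL text (and the choice between a real NULL and the string 'NULL') can happen in one place at the very end.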


Here’s a link to a pdf for testing: https://specsintact.ksc.nasa.gov/PDF/MasterFormat/MasterFormat-2016.pdf

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LAParams

# NOTE: this LAParams object is created but never passed to extract_pages(),
# so the extraction below runs with pdfminer's default layout settings.
LAParams(line_overlap=20, char_margin=2.0, line_margin=5, word_margin=.01, boxes_flow=-1, detect_vertical=False, all_texts=True)

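def switch_lvl(firstChar, characters):
    # firstChar: rounded font size of the line's first character.
    # characters: full text of the line.

    # Drop the trailing newline that get_text() adds and escape single quotes for SQL.
    characters = characters.rstrip('\n').replace("'", "''")

    # Put the text in the column that matches its font size.
    # The other columns get the string 'NULL' (a quoted string, not a real SQL NULL).
    switch = {
        'lvl_1': characters if firstChar == 20 else 'NULL',
        'lvl_2': characters if firstChar == 14 else 'NULL',
        'lvl_3': characters if firstChar == 12 else 'NULL',
        'lvl_4': characters if firstChar == 11 else 'NULL',
        'lvl_5': characters if firstChar == 10 else 'NULL',
    }

    record = list(switch.values())

    filedata = "INSERT INTO master_formats (lvl_1, lvl_2, lvl_3, lvl_4, lvl_5) VALUES ('" + "', '".join(record) + "');"

    # The .sql file must already exist, otherwise this read fails.
    with open('/path-to-file/insert_master_format.sql') as fileobj:
        text = fileobj.read()

    # Append the statement on its own line; the with block closes the file.
    with open('/path-to-file/insert_master_format.sql', 'a') as file:
        if not text.endswith('\n'):
            file.write('\n')
        file.write(filedata)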

for page_layout in extract_pages("/path-to-file/MasterFormat-2016-1.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                characters = text_line.get_text()
                # Use the font size of the first real character to pick the level.
                for character in text_line:
                    if isinstance(character, LTChar):
                        firstChar = round(character.size)
                        switch_lvl(firstChar, characters)
                        break

Also, has anyone found a way to keep a wrapped line together when using pdfminer.six? For example, this entry wraps onto a second line in the PDF:

00 00 00 Procurement and Contracting
         Requirements

The wrapped part after "Contracting" is not kept with the first line, and pdfminer then seems to interpret the page as two columns.

Any tips on getting around this? Pretty tricky from what I can tell.
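One workaround I am considering is to repair the wraps after extraction, on the text itself, rather than through pdfminer's layout settings. This is only a rough sketch under a strong assumption: every real entry starts with a section number like "00 00 00", so any line without one is treated as the continuation of the line before it. Headings that have no number would be merged by mistake, so they would need separate handling (for example by font size). `SECTION_RE` and `merge_wrapped` are names I made up, and the second sample entry is invented.

```python
import re

# Assumption: a real entry begins with a section number such as "00 00 00".
SECTION_RE = re.compile(r"^\d{2} \d{2} \d{2}\b")


def merge_wrapped(lines):
    """Join lines that lack a section number onto the previous kept line."""
    merged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines
        if SECTION_RE.match(text) or not merged:
            merged.append(text)  # start of a new entry
        else:
            merged[-1] = merged[-1] + " " + text  # continuation of the previous entry
    return merged


wrapped = [
    "00 00 00 Procurement and Contracting\n",
    "         Requirements\n",
    "00 00 01 Example Title\n",  # invented entry, for illustration only
]
for entry in merge_wrapped(wrapped):
    print(entry)
```

This would have to run on the extracted lines before they are turned into INSERT statements.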