Hello,
I've spent the past few days working out this script to read a PDF and sort each line of text into a category column based on its font size (as reported by pdfminer). That part is working.
Each line is then written out as a MySQL INSERT statement.
However, the INSERT statements are meant to populate a db table, which will in turn populate HTML selects. The selects are dynamically dependent, so each record needs 2 levels: its own level plus the nearest level above it.
So, how would you get the nearest level up based on font size?
I had an idea: maybe right after this line

with open('/path-to-file/insert_master_format.sql') as fileobj: text = fileobj.read()

you could look up the nearest higher-level record above the current one, copy its text, and paste it into the current record. For example, given the record

('NULL', 'NULL', 'NULL', 'Project Directory ', 'NULL');

'Project Directory ' would be copied into the next lvl_5 record.
But I’m not sure if that’s feasible... any ideas? The trouble I see is that by the time the record is in the SQL file, it no longer has its column designation, meaning it’s a string, not a list. In SQL this is pretty easy, but I’m really new to Python.
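One way the "copy the level above" idea might be sketched is to do it before anything is written to the SQL file, while each record is still a list. This is only a rough sketch under assumptions: `SIZE_TO_LEVEL` and `fill_parents` are hypothetical names, the font sizes are the ones used in the script in this post, and every sample line except 'Project Directory ' is made up for illustration.

```python
# Hypothetical mapping from rounded font size to column index (lvl_1 .. lvl_5).
SIZE_TO_LEVEL = {20: 0, 14: 1, 12: 2, 11: 3, 10: 4}


def fill_parents(lines):
    """Yield one 5-column record per (font_size, text) pair.

    `lines` is an iterable of (rounded_font_size, text) tuples in document
    order. Each record carries the most recent heading seen at every level
    above it, so the nearest parent is already in the row.
    """
    current = [None] * 5  # most recent text seen at each level
    for size, text in lines:
        level = SIZE_TO_LEVEL.get(size)
        if level is None:
            continue  # font size not in the mapping: skip the line
        current[level] = text.strip()
        # A new heading invalidates whatever was stored below it.
        for deeper in range(level + 1, 5):
            current[deeper] = None
        yield list(current)  # copy, so later updates do not change this record


# Made-up sample lines; only 'Project Directory ' comes from the post.
sample = [
    (11, 'Project Directory '),
    (10, 'First child '),
    (10, 'Second child '),
    (11, 'Next Heading '),
    (9, 'ignored'),
]

for record in fill_parents(sample):
    print(record)
```

Because the state lives in the `current` list, nothing has to be parsed back out of the SQL file. The nearest level up for any record is simply the last non-empty column to the left of its own.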
Here’s a link to a pdf for testing: https://specsintact.ksc.nasa.gov/PDF/MasterFormat/MasterFormat-2016.pdf
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LAParams

# Note: this object is never passed to extract_pages() below, so the default
# layout parameters are what actually get used.
laparams = LAParams(line_overlap=20, char_margin=2.0, line_margin=5, word_margin=.01, boxes_flow=-1, detect_vertical=False, all_texts=True)


def switch_lvl(firstChar, characters):
    # Put the text in the column matching its rounded font size;
    # every other column gets the string 'NULL'.
    switch = {
        'lvl_1': characters if firstChar == 20 else 'NULL',
        'lvl_2': characters if firstChar == 14 else 'NULL',
        'lvl_3': characters if firstChar == 12 else 'NULL',
        'lvl_4': characters if firstChar == 11 else 'NULL',
        'lvl_5': characters if firstChar == 10 else 'NULL',
    }
    record = list(switch.values())
    filedata = ('INSERT INTO master_formats (lvl_1, lvl_2, lvl_3, lvl_4, lvl_5)' + ' VALUES (\'' + "', '".join(record) + '\');')
    # Read what is already in the file (the file must exist beforehand).
    with open('/path-to-file/insert_master_format.sql') as fileobj:
        text = fileobj.read()
    # Append the new statement on its own line; the with block closes the file.
    with open('/path-to-file/insert_master_format.sql', 'a') as file:
        if not text.endswith('\n'):
            file.write('\n')
        file.write(filedata)


for page_layout in extract_pages("/path-to-file/MasterFormat-2016-1.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                characters = (text_line.get_text())
                for character in text_line:
                    if isinstance(character, LTChar):
                        # Only the first character's size decides the level.
                        firstChar = round(character.size)
                        switch_lvl(firstChar, characters)
                        break
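
For the last step, a record that is still a list could be turned into an INSERT statement roughly like this. It is a minimal sketch, assuming empty columns are stored as `None` rather than the string 'NULL'; `to_insert` is a hypothetical helper, and the value "Owner's Rep" is made up to show the quoting. It only doubles single quotes and does not handle backslashes, so passing the values as query parameters through a database driver would likely be the safer route.

```python
COLUMNS = ['lvl_1', 'lvl_2', 'lvl_3', 'lvl_4', 'lvl_5']


def to_insert(record, table='master_formats'):
    """Build one INSERT statement from a 5-item record.

    None becomes an unquoted SQL NULL; strings are wrapped in single
    quotes, with any embedded single quote doubled.
    """
    values = []
    for value in record:
        if value is None:
            values.append('NULL')
        else:
            values.append("'" + value.replace("'", "''") + "'")
    return 'INSERT INTO {} ({}) VALUES ({});'.format(
        table, ', '.join(COLUMNS), ', '.join(values)
    )


# Made-up record for illustration.
statement = to_insert([None, None, None, 'Project Directory', "Owner's Rep"])
print(statement)
```

Writing an unquoted NULL means the database stores a real NULL instead of the four-letter string 'NULL', which should make it easier to test for empty levels when filling the selects.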