Tabula not placing nested columns properly in CSV when using read_pdf and the DataFrame to_csv function

krishna · January 6, 2024, 3:45pm

I am a beginner in Python. I was trying to read PDF tables using the read_pdf function in tabula.

I read the data with read_pdf and wrote it to a CSV file using the “dataframe.to_csv” function.

My PDF has nested columns (multiple columns under a single column). For those, the values are not placed properly.
My requirement is to read the data under each column and save it into the corresponding columns in my database table.

My plan was to convert to CSV and then read from the CSV into a data array.

How can I get the nested column values properly into a data array using tabula?

I am attaching the PDF structure here.

The columns in the red-marked area are not read properly.
Here is the code I have used:

import tabula
import pandas as pd

infile = "demo.pdf"

# Read page 1 and take the first table returned.
df_data = tabula.read_pdf(
    infile,
    pages="1",
    multiple_tables=False,
    lattice=True,  # detect cells from the ruling lines drawn in the PDF
    # pandas_options={'skiprows': 1}
    # pandas_options={'header': [0, 1]}
)[0]

# Write the extracted table to CSV.
df_data.to_csv("filename_2.csv")
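
For the second half of the plan (reading the CSV back into a data array), here is a minimal sketch of one way to turn two header rows into flat column names. It assumes the exported CSV has a parent header row followed by a sub-column row, and that a blank parent cell belongs to the parent on its left. The sample data and the helper names flatten_header and read_nested_csv are hypothetical and not taken from the actual PDF; the real export may be laid out differently.

```python
import csv
import io

# Hypothetical sample: two header rows, with blank cells where a parent
# column spans several sub-columns. The real export may look different.
SAMPLE_CSV = """Name,Marks,,Total
,Theory,Practical,
Asha,40,45,85
Ravi,35,42,77
"""


def flatten_header(parent_row, child_row):
    """Combine a parent header row and a sub-header row into flat names."""
    names = []
    last_parent = ""
    for parent, child in zip(parent_row, child_row):
        parent, child = parent.strip(), child.strip()
        if parent:
            last_parent = parent
        elif child:
            # Blank parent above a sub-column: inherit the parent to the left.
            parent = last_parent
        names.append(f"{parent}_{child}" if parent and child else parent or child)
    return names


def read_nested_csv(text):
    """Return data rows as a list of dicts keyed by flattened column names."""
    rows = list(csv.reader(io.StringIO(text)))
    columns = flatten_header(rows[0], rows[1])
    return [dict(zip(columns, row)) for row in rows[2:]]


records = read_nested_csv(SAMPLE_CSV)
```

Each entry in records is then a dict such as {"Name": "Asha", "Marks_Theory": "40", ...}, which maps directly onto database columns. All values stay strings here, so any numeric conversion would have to happen before the insert.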