Hi all, just joined this forum, nice to meet you.
I wrote some code that loops through pages of PDF tables (as images), recognizes the data, and saves it in a DataFrame. (It’s based on the excellent article “A table detection, cell recognition and text extraction algorithm to convert tables in images to excel files” on Towards Data Science.)
I need help creating a page ID column, so that each row records the page it came from. Here is the relevant excerpt of the code; please let me know if it’s not enough:
for p in range(count):  # pages cycle
    outer = []
    for i in range(len(finalboxes)):
        for j in range(len(finalboxes[i])):
            inner = ''
            if len(finalboxes[i][j]) == 0:
                outer.append(' ')
            else:
                for k in range(len(finalboxes[i][j])):
                    y, x, w, h = finalboxes[i][j][k][0], finalboxes[i][j][k][1], finalboxes[i][j][k][2], finalboxes[i][j][k][3]
                    # image manipulation goes here, for each cell (box) in tables
                    out = pytesseract.image_to_string(cell_final, config=custom_config1)
                    if len(out) == 0:
                        out = pytesseract.image_to_string(cell_final, config='--psm 3')
                    inner = inner + out
                outer.append(inner)
    arr = np.array(outer)
    df = pd.concat([df,
                    pd.DataFrame(arr.reshape(len(row), countcol)).replace(r'\r+|\n+|\t+', '', regex=True)])
                    # ignore_index = True)
    # df['PageNum'] = str(p+1)  # this works but fills all values with last page number
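To make the problem easier to reproduce without the OCR part, here is a minimal sketch with made-up data. The `pages` list and its values are hypothetical and only stand in for the recognized cell text. The first loop mirrors my code above and rewrites the whole column on every pass. The second loop assigns the page number to the per-page frame before concatenating, which seems to give what I'm after, though I'm not sure it's the idiomatic way to do it.

```python
import pandas as pd

# Hypothetical stand-in for the OCR output: one list of table rows per page.
pages = [
    [["a", "1"], ["b", "2"]],  # page 1 (two rows)
    [["c", "3"]],              # page 2 (one row)
]

# Variant 1: same pattern as my loop, column assigned AFTER the concat.
df_overwritten = pd.DataFrame()
for p, rows in enumerate(pages):
    df_overwritten = pd.concat([df_overwritten, pd.DataFrame(rows)])
    # This assigns to every row accumulated so far, so earlier pages are overwritten.
    df_overwritten['PageNum'] = str(p + 1)

# Variant 2: tag the per-page frame first, then append it to the running result.
df_tagged = pd.DataFrame()
for p, rows in enumerate(pages):
    page_df = pd.DataFrame(rows)
    page_df['PageNum'] = str(p + 1)  # only touches this page's rows
    df_tagged = pd.concat([df_tagged, page_df], ignore_index=True)

print(df_overwritten['PageNum'].tolist())
print(df_tagged['PageNum'].tolist())
```

In the first variant every row ends up with the last page number. In the second, each row keeps the number of the page it was read from.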
Thanks in advance!