Adding data from PDF pages to dataframe using loop, would love to have page ID variable

Hi all, just joined this forum, nice to meet you.

I wrote some code that loops through the pages of a PDF containing tables (as images), recognizes the data, and saves it in a dataframe. (It’s based on the awesome article “A table detection, cell recognition and text extraction algorithm to convert tables in images to excel files” on Towards Data Science.)
I need help creating a page ID variable. Here is an excerpt of the code in question; please let me know if it’s not enough:

for p in range(count):  # pages cycle
    outer=[]
    for i in range(len(finalboxes)):
        for j in range(len(finalboxes[i])):
            inner=''
            if(len(finalboxes[i][j])==0):
                outer.append(' ') 
            else:
                for k in range(len(finalboxes[i][j])):
                    y,x,w,h = finalboxes[i][j][k][0], finalboxes[i][j][k][1], finalboxes[i][j][k][2], finalboxes[i][j][k][3]
                    # image manipulation goes here, for each cell (box) in tables
                    out = pytesseract.image_to_string(cell_final, config=custom_config1)
                    if(len(out)==0):
                        out = pytesseract.image_to_string(cell_final, config='--psm 3')
                    inner = inner + out
                outer.append(inner)
    arr = np.array(outer)
    df = pd.concat([df, 
                    pd.DataFrame(arr.reshape(len(row),countcol)).replace(r'\r+|\n+|\t+','', regex=True)]) 
                    #ignore_index = True)
    #df['PageNum'] = str(p+1)   # this works but fills all values with last page number

Thanks in advance!

One way to do it is by placing this after the for loop:

df["PageNum"] = range(1, count+1)

Thanks… but it throws

ValueError: Length of values (7) does not match length of index (21)

I have 7 pages, and for test runs I take only the top 3 rows from each.
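
For reference, the mismatch can be reproduced without any OCR. This is only a minimal sketch, assuming 7 pages with 3 rows each as in the test run; the column name `text` and the dummy cell values are made up for illustration. It shows that `range(1, count + 1)` supplies one value per page, while the dataframe has one row per table row.

```python
import pandas as pd

count = 7            # number of pages, as in the test run
rows_per_page = 3    # fixed here only for the demonstration

# Stand-in for the concatenated OCR output: 7 * 3 = 21 rows.
# The column name "text" is hypothetical.
df = pd.DataFrame({"text": ["cell"] * (count * rows_per_page)})

try:
    # 7 values are offered for a column that needs 21.
    df["PageNum"] = range(1, count + 1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```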

So the dataframe has multiple rows which should have the same page number? Is it always three rows per page?

Is the dataframe empty before this loop, or does it already contain something?

Yes, multiple rows should have the same page number… and it’s not always three rows per page; there can be more or fewer, depending on the PDF pages the code is dealing with.
The dataframe is empty before the loop.

If every page contributes exactly three rows, you can repeat each page number:

df["PageNum"] = [p for p in range(1, count + 1) for _ in range(3)]

Thx! Will give it a shot tomorrow…

That did not work either; it gave me a size mismatch error.
The good news is that I figured it out: the code below, used inside the page loop in place of the original pd.concat call, did the trick. Thanks for all the help!

    page_df = pd.DataFrame(arr.reshape(len(row),countcol)).replace(r'\r+|\n+|\t+','', regex=True)  #  reshape(len(row),countcol) to save time, see above
    page_df['PageNum'] = p+1
    df = pd.concat([df, page_df], ignore_index = True)
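
For anyone landing here later, the same idea can be tried as a self-contained sketch. The reason the commented-out `df['PageNum'] = str(p+1)` filled every row with the last page number is that assigning a scalar to a column overwrites the whole column on each pass through the loop; tagging the per-page frame before concatenating avoids that. The `pages` list below is a hypothetical stand-in for the OCR output (one flat list of cell strings per page), and `reshape(-1, countcol)` is used instead of `len(row)` only so the example runs without the rest of the original code. Collecting the frames in a list and concatenating once at the end is an optional variation, not what the original loop does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the OCR output: one flat list of cell strings per page.
pages = [
    ["a1", "b1\n", "a2", "b2"],                # page 1: 2 rows
    ["c1", "d1", "c2\t", "d2", "c3", "d3"],    # page 2: 3 rows
]
countcol = 2  # number of table columns, assumed constant across pages

frames = []
for p, outer in enumerate(pages):
    arr = np.array(outer)
    # Build this page's table and strip stray line breaks and tabs left by OCR.
    page_df = pd.DataFrame(arr.reshape(-1, countcol)).replace(
        r"\r+|\n+|\t+", "", regex=True
    )
    # The scalar is broadcast to this page's rows only.
    page_df["PageNum"] = p + 1
    frames.append(page_df)

# Concatenate all pages at once and renumber the index from 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Because the page number is attached before concatenation, it does not matter how many rows each page contributes.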