Beginner help with tables

mischief · November 20, 2022, 2:33pm

Hi All,
I am trying to learn from the ground up with what I thought would be an easy exercise.

What I would like to do: Read a table from a PDF file (lab data reports), extract the data, and write them to a CSV file. Eventually I would like to read tables from several PDF files and write them to a single CSV file.

What I did: I found a library called pdftextract (pdftextract · PyPI), which at least let me read in the PDF and find the table I want. I thought this would be good because the author mentions you can use regex on the data.

Here’s what I have, just playing with it:

from pdftextract import XPdf 
import re
import csv

file_path = "Test.pdf"

pdf = XPdf(file_path)

txt = pdf.to_text(table=True)

tables = pdf.table[:]

print(len(tables))  # prints how many tables were found in the PDF
print(tables[3])  # prints the formatted content of table 3
table3_data = tables[3].data  # returns all rows in table 3 except the headers
print(table3_data)


tables[3].to_csv(f"text.csv")

The resulting table has some characters and columns I don’t need. The first column is the chemical name; that’s fine. The second column has the lab result, but with some letters in front of and behind it that I would like to remove. The last column I don’t need at all.

I thought I could use some regex commands (like “pattern”, since all the letters to be removed are the same) to remove the letters from the second column, but I don’t fully understand how to loop through the rows of a table that was read in through this library. I tried table3_data.replace('ug/L ', '') but got AttributeError: 'list' object has no attribute 'replace'.

I think in general I am having a hard time understanding how to manipulate tables in Python, and searching online only gives me results about creating tables. I would like to do this without any additional libraries if possible, just so I can better learn the syntax and see the logic in front of me.
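The AttributeError in the question comes from calling a string method on a list: replace exists on each cell, not on the table as a whole. Below is a minimal standard-library sketch of looping over the rows instead. It assumes the table data is a list of rows, each a list of strings; the sample rows, the clean_row helper, and the cleaned.csv file name are hypothetical stand-ins, not real lab data or part of pdftextract.

```python
import csv
import re

# Hypothetical rows standing in for tables[3].data:
# [chemical name, result with extra letters, unwanted column]
rows = [
    ["Benzene", "R 0.5 ug/L", "method A"],
    ["Toluene", "R 1.2 ug/L", "method A"],
]

# First number in a cell, with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def clean_row(row):
    """Keep the name, reduce the result to its number, drop other columns."""
    name, result = row[0], row[1]
    match = NUMBER.search(result)
    return [name.strip(), match.group(0) if match else ""]


# Apply the cleaning to every row, one row at a time.
cleaned = [clean_row(row) for row in rows]

# Write a header line followed by the cleaned rows.
with open("cleaned.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Parameter", "Result"])
    writer.writerows(cleaned)
```

Pulling out the number with a pattern, rather than deleting 'ug/L ' with replace, also covers the letters in front of the value. If a cell contains no number at all, this sketch writes an empty string for it.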

vovavili · November 20, 2022, 3:30pm

The standard way of extracting dataframes from a PDF is by using Tabula, which gives you pandas dataframes that you can then write to a CSV with df.to_csv.
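A rough sketch of that workflow, assuming the tabula-py package (which needs a Java runtime) and pandas are installed. The helper names read_tables and tables_to_csv and the file names in the usage comment are made up for illustration, and the sketch assumes every report's table has the same columns.

```python
import pandas as pd


def read_tables(pdf_path):
    """Return every table tabula-py finds in the PDF as a list of DataFrames."""
    import tabula  # provided by the tabula-py package; imported only when needed

    return tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)


def tables_to_csv(frames, csv_path):
    """Stack DataFrames that share the same columns and write one CSV file."""
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(csv_path, index=False)
    return combined


# Possible usage with several reports (hypothetical file names, and assuming
# the wanted table is the first one found in each PDF):
# frames = [read_tables(path)[0] for path in ["report1.pdf", "report2.pdf"]]
# tables_to_csv(frames, "all_reports.csv")
```

Which index holds the wanted table depends on the PDF, so it is worth printing each DataFrame from read_tables once before picking one.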

I think in general I am having a hard time understanding how to manipulate tables in python

99% of the time you use pandas for that. Don’t worry about the remaining 1% for now.

I would like to do this without any additional libraries if possible

I strongly recommend you to avoid this mentality. If there is one library you have to learn to be decent at Python data wrangling, it absolutely has to be pandas. Trying to go around that is like trying to build a house out of bricks without using any cement - doable, but probably a huge waste of time.

mischief · November 20, 2022, 3:52pm

OK, thank you so much. I had read somewhere else not to rely on libraries at the beginning. I waffled a bit when I read that, so it’s good to be told that’s wrong. Everything I searched on the topic came up with pandas.

I will retry and attack this from that approach, as you suggest. At the very least there are a lot more resources and examples.

vovavili · November 20, 2022, 4:23pm

This is a terrible take. The whole point of Python is its gamut of amazing libraries that make the task at hand much easier to understand. Since I am an experienced Python user, I can personally rely on raw sqlite3 to write my SQL queries, navigate the web with urllib, or use ElementTree to parse my XML files, but good luck trying to write SQL without an ORM like SQLAlchemy, send a GET request without requests, or navigate XML without xmltodict if you struggle to write a for loop. You’d literally be 5 times less productive. The very few kinds of libraries you’d probably want to avoid as a beginner are optimization-focused ones, like orjson as opposed to json or cytoolz as opposed to functools, which basically do the same thing productivity-wise but in a more performant or elegant way.

mischief · November 20, 2022, 5:16pm

OK, thank you - I understand your point of view, but as a beginner who grew up on Fortran, it’s such a new way of life.

Back to my little code: I got Tabula and everything up and running. I converted the PDF, managed to get my table off the second page, and deleted the columns I don’t need. Much easier, for sure.

Now my issue is this: I want to delete rows that have the word “sum” in a column called “Parameter”, whether the value is “sum” or “total_sum”.

I tried the following. The code runs, but it does not drop the rows:


discard = ['Sum', 'sum']
table[~table1.Parameter.str.contains('|'.join(discard))]

Any ideas?

vovavili · November 20, 2022, 7:41pm

Just like in Fortran, you would need to reassign the expression return value to a variable:

table = table[~table.Parameter.str.lower().str.contains("sum")]
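To make the difference concrete, here is a small sketch on a made-up dataframe (the parameter names and values are hypothetical). It uses case=False on str.contains as an alternative to lowercasing first; both approaches match “Sum” and “sum”.

```python
import pandas as pd

# Hypothetical lab table with two summary rows to drop.
table = pd.DataFrame(
    {
        "Parameter": ["Benzene", "Sum BTEX", "Toluene", "total_sum"],
        "Result": [0.5, 1.7, 1.2, 3.4],
    }
)

# True for every row whose Parameter contains "sum" in any letter case.
has_sum = table["Parameter"].str.contains("sum", case=False)

# Evaluated on its own, the filter builds a new dataframe that is thrown away.
table[~has_sum]
rows_without_reassignment = len(table)  # the original still has all its rows

# Assigning the result back to the name is what actually "drops" the rows.
table = table[~has_sum]
rows_after_reassignment = len(table)

print(rows_without_reassignment, rows_after_reassignment)
```

The filtered dataframe keeps the original index labels; calling reset_index(drop=True) afterwards renumbers the rows if that matters for later steps.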

grandwazoo · November 21, 2022, 2:37pm

In every language I know, most of the features are provided in libraries. This is true of C, C++, Java, etc. Python fits right in with that approach.

vovavili · November 21, 2022, 3:10pm

Yeah, and for example in R if you don’t use tidyverse and especially magrittr pipes you may as well have not used the language at all.