I am trying to learn from the ground up with what I thought would be an easy exercise.
What I would like to do: Read a table from a PDF file (lab data reports), extract the data, and write them to a CSV file. Eventually I would like to read tables from several PDF files and write them to a single CSV file.
What I did: I found a library called pdftextract (on PyPI), which at least let me read in the PDF and find the table I want. I thought this would be good because the author mentions you can use regex on the data.
Here's what I have, just playing with it:
from pdftextract import XPdf
file_path = "Test.pdf"
pdf = XPdf(file_path)
txt = pdf.to_text(table=True)
tables = pdf.table[:]
print(len(tables))  # prints how many tables were found in the PDF
print(tables[2])  # print formatted content of table 3
table3_data = tables[2].data  # will return all rows in table 3 except headers
The resulting table has some characters and columns I don't need. The first column is the chemical name, and that's fine. The second column has the lab result, but with some letters in front of and behind it that I would like to remove. The last column I don't need at all.
I thought I could use some regex commands (a single pattern, since the letters to be removed are the same in every row) to remove the letters from the second column, but I don't fully understand how to loop through the rows of a table that was read in through this library. I tried table3_data.replace('ug/L ', '') but got AttributeError: 'list' object has no attribute 'replace'.
I think in general I am having a hard time understanding how to manipulate tables in Python, and searching online only gives me results about creating tables. I would like to do this without any additional libraries if possible, just so I can better learn the syntax and see the logic in front of me.
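For reference, the loop asked about here can be sketched with only the standard library. The error comes from calling .replace on the list itself; it is a string method, so it (or a regex) has to be applied to each cell. This is a minimal sketch, assuming table3_data is a list of rows and each row is a list of strings. The sample rows, the column names "Parameter" and "Result", and the helper names clean_row and write_rows are made up for illustration.

```python
import csv
import re

# Hypothetical rows standing in for table3_data:
# [chemical name, raw result, unwanted last column]
table3_data = [
    ["Benzene", "ug/L 1.2 J", "EPA 8260"],
    ["Toluene", "ug/L 0.45 U", "EPA 8260"],
]

# Matches the first decimal number in a string, e.g. "1.2" in "ug/L 1.2 J".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def clean_row(row):
    """Keep the name, reduce the result to its number, drop the last column."""
    name, raw_result = row[0], row[1]
    match = NUMBER.search(raw_result)
    result = match.group(0) if match else ""  # empty string if no number found
    return [name.strip(), result]


def write_rows(rows, path):
    """Write a header line plus the cleaned rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Parameter", "Result"])  # assumed column names
        writer.writerows(rows)


# Loop over the rows one at a time and clean each of them.
cleaned = [clean_row(row) for row in table3_data]
```

Calling write_rows(cleaned, "lab_results.csv") would then produce the CSV file. Pulling out the number, rather than replacing "ug/L " and the trailing letters one by one, is a design choice that keeps the pattern to a single regex; it assumes each result cell contains exactly one number.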
The standard way of extracting tables from a PDF as DataFrames is Tabula (the tabula-py package in Python); you can then write a DataFrame to a CSV with df.to_csv.
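A rough sketch of that workflow, extended to the "several PDFs into one CSV" goal from the question. It assumes the tabula-py package (which needs a Java runtime) and pandas are installed, and that the tables share the same columns. The function names and file names are hypothetical.

```python
import pandas as pd


def read_tables(pdf_path):
    """Return every table tabula-py finds in one PDF as a list of DataFrames."""
    import tabula  # provided by the tabula-py package; requires Java

    return tabula.read_pdf(pdf_path, pages="all")


def combine(frames):
    """Stack DataFrames with matching columns into a single DataFrame."""
    return pd.concat(frames, ignore_index=True)


# Example usage (file names are placeholders):
# frames = []
# for path in ["report1.pdf", "report2.pdf"]:
#     frames.extend(read_tables(path))
# combine(frames).to_csv("all_reports.csv", index=False)
```

In practice the tables pulled from different pages rarely line up perfectly, so some per-table cleanup before combining is usually needed.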
I think in general I am having a hard time understanding how to manipulate tables in python
99% of the time you use pandas for that. Don’t worry about the remaining 1% for now.
I would like to do this without any additional libraries if possible
I strongly recommend you to avoid this mentality. If there is one library you have to learn to be decent at Python data wrangling, it absolutely has to be pandas. Trying to go around that is like trying to build a house out of bricks without using any cement - doable, but probably a huge waste of time.
OK, thank you so much. I had read somewhere else not to rely on libraries at the beginning, and I waffled a bit when I read that, so it's good to be told that's wrong. Everything I searched on the topic came up with pandas.
I will retry and attack this from that approach, as you suggest. At the very least, there are a lot more resources and examples.
This is a terrible take. The whole point of Python is its gamut of amazing libraries that make the task at hand much easier to handle. Since I am an experienced Python user, I can personally rely on raw sqlite3 to write my SQL queries, navigate the web with urllib, or use ElementTree to parse my XML files. But good luck trying to write SQL without an ORM like SQLAlchemy, send a GET request without requests, or navigate XML without xmltodict if you struggle to write a for loop. You'd literally be 5 times less productive. The very few kinds of libraries you'd probably want to avoid as a beginner are optimization-focused ones, like orjson as opposed to json or cytoolz as opposed to functools, which basically do the same thing productivity-wise but in a more performant or elegant way.
OK, thank you. I understand your point of view, but for a beginner who grew up on Fortran, it's such a new way of life.
Back to my little code: I got Tabula and everything up and running. I converted the PDF, managed to get my table off the second page, and deleted the columns I don't need. Much easier, for sure.
Now my issue is this: I want to delete rows that have the word "sum" in the column called "Parameter", whether the value is "sum" or "total_sum".
I tried the following, and the code runs but does not drop the rows:
discard = ['Sum', 'sum']
Just like in Fortran, you would need to reassign the expression return value to a variable:
table = table[~table.Parameter.str.lower().str.contains("sum")]
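A self-contained version of that one-liner, with made-up data so it can be run as is. It assumes the column is called "Parameter" as in the question; the sample values are illustrative only. Passing case=False replaces the explicit .str.lower(), and na=False treats missing values as "does not contain", which avoids an error when the column has empty cells.

```python
import pandas as pd

# Made-up data standing in for the table read from the PDF.
table = pd.DataFrame(
    {
        "Parameter": ["Benzene", "Toluene", "Sum", "total_sum", None],
        "Result": [1.2, 0.45, 1.65, 1.65, 0.0],
    }
)

# Boolean mask: True for rows whose Parameter contains "sum" in any casing.
has_sum = table["Parameter"].str.contains("sum", case=False, na=False)

# Filtering returns a new DataFrame, so the result must be assigned back.
table = table[~has_sum].reset_index(drop=True)
print(table)
```

Note that this is a substring match, so it would also drop any parameter whose name merely contains the letters "sum"; a stricter pattern may be needed for real data.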
In every language I know, most of the features are provided in libraries. This is true of C, C++, Java, etc. Python fits right in with that approach.
Yeah, and for example in R if you don’t use tidyverse and especially magrittr pipes you may as well have not used the language at all.