I’m new to Python and trying to write a script that reads a pdf of my bank statement for later use in spreadsheets.
The bank statement has withdrawals & deposits in separate columns. When I read the text in the pdf with PyPDF2, it just grabs chunks of text and separates them by a space. The problem is I cannot differentiate between withdrawals and deposits because they have the same sign and I don’t know which column the values came from. Is there a solution to this?
Even for someone who is not new to Python this would probably be pretty daunting -
The following SO post may have some useful links:
Oh wow! Daunting indeed. Looks like my bank has the option to download csv files… I guess that would be easier.
That is infinitely more easy. Most banks and other similar institutions offer .csv downloads.
In that case using the builtin
csv module is not too hard, but it may be easier to use
pandas (after doing a
pip install pandas):
import pandas as pd
df = pd.read_csv("statements.csv")
But simply loading the .csv as spreadsheet can also be done directly in programs like Excel or Numbers (on Mac).
Yes; most likely they do this specifically because CSV is designed for doing further work with (it’s the “plain text” of spreadsheet programs), while PDF is designed to look nice when printed.
The annoying thing about the csv is that I have to manually set a bunch of options on the bank’s page before they generate it (date range, account, format, etc). I wonder if I can write a script to automate that part?
It is certainly possible to automate interactions with a website. The problem would be whether your bank allows you to do so in their terms and conditions. The practice is called “web scraping”. Real Python has a tutorial here: