I’m new to Python and trying to write a script that reads a pdf of my bank statement for later use in spreadsheets.
The bank statement has withdrawals & deposits in separate columns. When I read the text in the pdf with PyPDF2, it just grabs chunks of text and separates them by a space. The problem is I cannot differentiate between withdrawals and deposits because they have the same sign and I don’t know which column the values came from. Is there a solution to this?
That is infinitely more easy. Most banks and other similar institutions offer .csv downloads.
In that case using the builtin csv module is not too hard, but it may be easier to use pandas (after doing a pip install pandas):
Yes; most likely they do this specifically because CSV is designed for doing further work with (it’s the “plain text” of spreadsheet programs), while PDF is designed to look nice when printed.
The annoying thing about the csv is that I have to manually set a bunch of options on the bank’s page before they generate it (date range, account, format, etc). I wonder if I can write a script to automate that part?
It is certainly possible to automate interactions with a website. The problem would be whether your bank allows you to do so in their terms and conditions. The practice is called “web scraping”. Real Python has a tutorial here: