Tabula has been incredibly useful for extracting tables from PDF files, and there is even an R library to automate the process further.
I can recommend Tabula (and tabula-java). There is also an R package that brings tabula-java to R. I recently used Tabula and tabula-java on a large number of PDFs and it worked like a charm. A friend of mine used the R package 'tabulizer' and called it "utter f***ing magic".
You could write a Python script to extract the data.
Or, if that sounds ominous to you, you could check out this nifty tool, Tabula. With it you can just drag a selection box over each table and it'll save the table data in one CSV, TSV, or JSON file. It'll take a little bit of manual labor, but probably not much more than 20 minutes.
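If you end up with several CSV exports (say, one per page) and want a single table, a small Python script can stitch them together. This is only a minimal sketch using the standard library: it assumes every export repeats the same header row, and the names `merge_tabula_csvs` and `to_csv_text` are made up for illustration, not part of Tabula.

```python
import csv
import io


def merge_tabula_csvs(csv_texts):
    """Merge CSV exports that share one header row (an assumption)."""
    merged = []
    header = None
    for text in csv_texts:
        # Parse the export and drop rows that are completely empty.
        rows = [r for r in csv.reader(io.StringIO(text))
                if any(cell.strip() for cell in r)]
        if not rows:
            continue
        if header is None:
            # The first non-empty export supplies the header.
            header = rows[0]
            merged.append(header)
        # Skip the header row when an export repeats it.
        start = 1 if rows[0] == header else 0
        merged.extend(rows[start:])
    return merged


def to_csv_text(rows):
    """Serialize rows back to CSV text with Unix line endings."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

To use it on real files, read each exported file into a string, pass the list to `merge_tabula_csvs`, and write the result of `to_csv_text` to a new file.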
Tabula is a free, offline option for converting PDFs to Excel. For a more robust, paid option, I like Able2Extract. I've had PDFs that Tabula struggled with that Able2Extract easily handled.
If the statements are 3 pages or fewer, or if you're willing to split the statements into 3-page pieces, you could try the trial for Able2Extract. Otherwise, you could try Tabula. If the statements are scans, they have to be OCR'd before you can use them with Tabula (and possibly with Able2Extract as well, I can't remember), but the results won't be as nice as with PDFs that are already text-based.
Just do all the steps in Power Query (Get & Transform).
You have to manually do it the first time, but for any updates you can just import the new files automatically and have it spit out the final file.
I highly suggest converting the PDF to HTML in Word first, though,
or run it through http://tabula.technology/
I'm not able to access the source PDFs at the moment, but if there are tables, Tabula may help. It is Ruby-based, but it could be used as a step in the pipeline when extracting the data.
This dashboard displays annual endowment market value totals for approximately 800 US colleges and universities with the largest endowments. The data is provided by NACUBO. I downloaded the PDF tables, extracted them with [Tabula](http://tabula.technology/), and used Tableau to create the visualizations.
Ah wow haven't heard of this service. I will be reading into it this weekend! Thanks!
I guess since we're on the subject, would something like this project be something of interest to you? (If we can narrow down the details of how this would actually work?)
edit: just found this for PDFs, which seems to be kinda nifty from my brief testing: http://tabula.technology
If there's a way to have this know what is probably an address or name/skills etc., then this could be a great starting point for PDFs.
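As far as I can tell, Tabula only hands back the cell text, so any "this is probably an address" logic would have to be a post-processing step. Here is a rough Python sketch of what that could look like. Everything in it is an assumption for illustration: the function name `guess_field`, the labels, and the regexes (which only cover e-mail addresses, phone-like numbers, and simple US-style street addresses). Names and skills would need something smarter, such as a lookup list.

```python
import re

# Hypothetical, deliberately simple patterns; real data will need tuning.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,}$")
# Rough US-style street address: house number, street name, common suffix.
ADDRESS_RE = re.compile(
    r"^\d+\s+[\w\s.]+\b(St|Street|Ave|Avenue|Rd|Road|Blvd|Dr|Drive|Ln|Lane)\b\.?",
    re.IGNORECASE,
)


def guess_field(text):
    """Return a best-guess label for one extracted cell."""
    value = text.strip()
    if EMAIL_RE.match(value):
        return "email"
    if ADDRESS_RE.match(value):
        return "address"
    # Require at least 7 digits so stray punctuation is not a "phone".
    if PHONE_RE.match(value) and sum(c.isdigit() for c in value) >= 7:
        return "phone"
    return "unknown"
```

Running `guess_field` over every cell Tabula extracts would give a first-pass labeling to review by hand.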