What is Reddit's opinion of Tabula?
From 3.5 billion Reddit comments

9 reviews of this app found across Reddit:

Tabula has been incredibly useful for extracting tables from pdf files, and even has an R library to further automate the process.

I can recommend tabula (and tabula-java). There is also an R package which brings tabula-java to R. I recently used tabula and tabula-java on a large number of PDFs and it worked like a charm. A friend of mine used the R package 'tabulizer' and called it "utter f***ing magic".

http://tabula.technology/

https://github.com/tabulapdf/tabula-java

https://github.com/ropensci/tabulizer

You could write a python script to extract the data.

Or, if that sounds ominous to you, you could check out this nifty tool, tabula. With this you can just drag a window over each table and it'll save the table data in one csv, tsv, or json file. It'll take a little bit of manual labor but probably not much more than 20 minutes.
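For anyone who does want the scripted route after exporting, here is a minimal sketch of tidying a CSV export in Python. It assumes only that the export is ordinary comma-separated text; the function name `clean_table` and the sample rows are made up for illustration and are not part of Tabula.

```python
import csv
import io


def clean_table(csv_text):
    """Parse CSV text, trim whitespace in each cell, and drop rows that are entirely empty."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        # Keep the row only if at least one cell still has content.
        if any(cells):
            rows.append(cells)
    return rows


if __name__ == "__main__":
    # Hypothetical export text; a real run would read the saved file instead.
    sample = "name, amount\n,\n widget , 12\n"
    for row in clean_table(sample):
        print(row)
```

Reading the saved file with `open(path, newline="")` and passing its contents to `clean_table` would work the same way.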

Tabula is a free, offline option for converting PDFs to Excel. For a more robust paid option, I like Able2Extract. I've had PDFs that Tabula struggled with that Able2Extract easily handled.

If the statements are 3 pages or less, or if you're willing to split the statements into 3-page chunks, you could try the trial for Able2Extract. Otherwise, you could try Tabula. If the statements are scans, they have to be OCR'd before you can use them with Tabula (and possibly with Able2Extract too, can't remember), but the results won't be as nice compared to PDFs that are already text-based.

Just do all the steps in PowerQuery/Get&Transform

You have to manually do it the first time, but for any updates you can just import the new files automatically and have it spit out the final file.

I highly suggest converting the pdf --> html in word first though

or run it through http://tabula.technology/

I'm not able to access the source PDFs at the moment, but if there are tables, tabula may help. It is Ruby based, but could be used as a step in the pipeline when extracting the data.
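As a rough sketch of what such a pipeline step could look like, the snippet below shells out to the tabula-java command-line jar mentioned earlier in the thread. The jar location and file names are placeholders, and it assumes Java is installed; `-p`, `-f`, and `-o` are tabula-java's pages, format, and output-file options.

```python
import subprocess


def tabula_command(jar_path, pdf_path, out_path, pages="all", fmt="CSV"):
    """Build the argument list for a tabula-java run (paths are caller-supplied placeholders)."""
    return [
        "java", "-jar", jar_path,
        "-p", pages,      # which pages to scan, e.g. "all" or "1-3"
        "-f", fmt,        # output format: CSV, TSV, or JSON
        "-o", out_path,   # file to write the extracted tables to
        pdf_path,
    ]


def extract_tables(jar_path, pdf_path, out_path):
    """Run tabula-java and raise CalledProcessError if it exits with an error."""
    subprocess.run(tabula_command(jar_path, pdf_path, out_path), check=True)
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact command before anything is run.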

This dashboard displays annual endowment market value totals for approximately 800 US colleges and universities with the largest endowments. The data is provided by NACUBO. I downloaded the PDF tables, extracted them with Tabula (http://tabula.technology/), and used Tableau to create the visualizations.

Ah wow haven't heard of this service. I will be reading into it this weekend! Thanks!

I guess since we're on the subject, would something like this project be something of interest to you? (If we can narrow down the details of how this would actually work?)

edit: just found this for PDFs, which seems to be kinda nifty from my brief testing: http://tabula.technology

If there's a way to have this know what is probably an address or name/skills etc., then this could be a great starting point for PDFs.
