I would probably approach this by writing a parser in Python. If you want to learn how to do that, find yourself a copy of "Automate the Boring Stuff with Python."
For your immediate problem, maybe try this?
To my knowledge, none of the automation tools in Salesforce have this capability natively. Here's a third-party tool that might help with what you're looking for: https://docparser.com/blog/pdf-salesforce-integration/
Disclaimer: I'm a team member of DocuSutra.
We created the product precisely to let end users "teach" the system to parse documents that follow a similar format. It is simply a point-and-click interface and requires no knowledge of coding.
The popular existing solution in the market is DocParser, which takes a template-based approach that requires users to have background knowledge of regular expressions.
Have you looked at services like this one? https://docparser.com
I have written my own parser for extracting tables from documents I have, but PDF is not a very good format for data processing, so it might be tough.
Let me know if you need more help.
PDFs are sort of infamous for being annoying to work with. Just google "PDF parser" to see what I mean--there's ready-built software for it, there are programming-language modules for it, but it's always a pain. At the minimum you want your database to include the article title, the article source, and the article text. I'm not sure that there are a lot of PDF-parsing tools that can "click" the internal links--maybe there are, maybe there are not, not sure. So you have to hope that the text itself is structured in such a way that you can identify the links, the start of each article associated with the link, and the end of the article. This may require regular expressions. All of this will definitely require a general-purpose programming language.
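A minimal sketch of that "find the start and end of each article" step in Python, assuming the PDF text has already been extracted to a plain string and that every article begins with a `Title:` line followed by a `Source:` line. Those two labels, `ARTICLE_RE`, and `split_articles` are hypothetical names for illustration; real files will almost certainly need a different pattern.

```python
import re

# Assumed layout (hypothetical): each article starts with a "Title:" line,
# then a "Source:" line, and the body runs until the next "Title:" line
# or the end of the text.
ARTICLE_RE = re.compile(
    r"^Title:[ \t]*(?P<title>[^\n]+)\n"
    r"Source:[ \t]*(?P<source>[^\n]+)\n"
    r"(?P<text>.*?)(?=^Title:|\Z)",
    re.MULTILINE | re.DOTALL,
)


def split_articles(raw_text):
    """Return one dict (title, source, text) per article found in raw_text."""
    articles = []
    for match in ARTICLE_RE.finditer(raw_text):
        articles.append(
            {
                "title": match.group("title").strip(),
                "source": match.group("source").strip(),
                "text": match.group("text").strip(),
            }
        )
    return articles
```

If the PDFs have no consistent marker like this, no regular expression will rescue them, which is why the structure of the text matters so much.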
If you CAN successfully parse the PDFs, then your parser can store the article title, the source, and the text in your Access database . . . except Access is sort of entry-level and may not allow you to store the full article text in a field, so then you might have to use Postgres or MySQL or purchase MS SQL Server to do this, or find a database system optimized for full-text articles if you're going to eventually want to collect gazillions of these (possibly a "noSQL" solution).
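As a rough illustration of the storage step, here is a sketch using SQLite through Python's built-in `sqlite3` module. SQLite is swapped in only because it needs no server; it is not one of the databases named above. The table name, column names, and helper functions are all made up for the example.

```python
import sqlite3


def open_article_db(path=":memory:"):
    """Open (or create) a database with a hypothetical 'articles' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "source TEXT NOT NULL, "
        "body TEXT NOT NULL)"
    )
    return conn


def add_article(conn, title, source, body):
    """Insert one parsed article and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO articles (title, source, body) VALUES (?, ?, ?)",
            (title, source, body),
        )
    return cur.lastrowid


def search_articles(conn, term):
    """Return (title, source) pairs whose body contains term.

    Uses a plain LIKE substring match, so '%' and '_' inside term act as
    wildcards. This is a simple scan, not a full-text index.
    """
    rows = conn.execute(
        "SELECT title, source FROM articles WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

A `LIKE` scan is fine for a modest collection; for a very large one, a database with proper full-text indexing would be the thing to look at.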
So I really think the key here is the group preparing these PDFs for you. Even if they're doing it purely manually, the correct starting point for your database pipeline is whatever they do BEFORE they make your PDF. At that point, it'll be a lot easier to make entries into this database.
And if the narrow goal is just to be able to search all of your PDFs and not to build a robust article database, I really think just using a free PDF combiner software or paying for Adobe to do it (if it's not free on Adobe) is the best bet. Maybe every 1000 pages you just make a new combo file.
Just my perspective though, and I may be missing something crucial.
>^.+(?:\n.+)+\n\n^.+(?:\n.+)+
I like how it showed up as one match rather than 2 separate groups. Unfortunately, it doesn't work in docparser.com. Their regex feature seems iffy with quantifiers and backreferences. @mfb-'s solution seems to be the only one that works, though. Thanks so much for the help!
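For anyone who wants to check the quoted pattern outside Docparser, here is a small sketch using Python's `re` module. The sample text and the function name `find_two_blocks` are invented for illustration, and Python's regex engine may not behave the same way as Docparser's.

```python
import re

# The quoted pattern: two blocks of at least two non-empty lines each,
# separated by one blank line. MULTILINE lets ^ match at each line start.
TWO_BLOCKS = re.compile(r"^.+(?:\n.+)+\n\n^.+(?:\n.+)+", re.MULTILINE)


def find_two_blocks(text):
    """Return the first span covering two adjacent multi-line blocks, or None."""
    match = TWO_BLOCKS.search(text)
    return match.group(0) if match else None
```

Because the pattern uses only non-capturing groups `(?:...)`, the result comes back as a single match with no separate capture groups, which is the behavior described above.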
Hello! DocSpring might make the PDF filling step easier once you have all of the data ready. I can also recommend the DocParser service to extract data from PDFs.
If you are interested, I would be happy to give you a free DocSpring account (up to 1000 generated PDFs/mo) since this is for your masters exam. I'm not affiliated with DocParser but you could reach out to them as well and see if they can offer a free student account.
I see. I work at Fluix.io, and we have a B2B service that makes this possible within team environments, but not quite like the use case you're mentioning.
I wonder if DocParser would be a good fit here: https://docparser.com/solutions/form-pdf-to-text.
Apparently ABBYY (and other existing software) can already handle tables and bank statements, so this isn't really new.
As for something I've actually used, however: I just made free accounts at https://docparser.com/ for OCR'ing bank statements.
Great if you find someone to make an in-house solution, but this software already exists in many forms.
It's called OCR. Acrobat has it built in.
If you are exporting to a database, then you want something like this
There are many options. Basically every large Business Intelligence company at some point has consumed a scan to text to database/spreadsheet company.
If you are willing to use a paid solution, I suggest a little piece of software I use called "Docparser." It can extract data from PDF files and upload it straight into a spreadsheet of your choosing. It is not expensive: you are charged per month, you choose your own monthly plan, and plans start at $40 a month. You can upload the PDF files to a Dropbox folder, and the program can automatically extract the data from whatever PDF files you have uploaded to that folder.
Check it out: https://docparser.com/
So, I spent a week tweaking to get Tesseract to work. My bank statements had horizontal dotted lines between every transaction, and those were almost impossible to remove and kept being recognized as text.
In the end I googled around and found docparser.com.
Night and day. It took me a few minutes to realize that I'd been wasting my time.
Also, go into the advanced options. There are some pre-processing filters there that will auto-rotate, etc.
Anyway. Cost me $20, worth every damn cent.
Given that constraint, a solution could be parsing their work order forms programmatically and entering that data into your system. The difficulty depends on the format and consistency of their forms, though, and there are various approaches to tackling something like this. You can develop it yourself for maximum control, or use something like https://docparser.com/ (found with a quick Google search), where you set parsing rules and have it spit out the data you need in a format you can use.
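If you go the develop-it-yourself route, the "parsing rules" idea can be sketched in Python roughly like this. The field labels (`Work Order`, `Customer`, `Due Date`), the `RULES` table, and both function names are assumptions for illustration; the real forms will dictate the actual rules, and the text is assumed to have been extracted from the form already.

```python
import json
import re

# Hypothetical rules: one regular expression per field, each with a single
# capture group holding the value. Labels are guesses at a form layout.
RULES = {
    "work_order": re.compile(r"Work Order[ \t]*#?[ \t]*:?[ \t]*(\w[\w-]*)", re.IGNORECASE),
    "customer": re.compile(r"Customer[ \t]*:[ \t]*(.+)", re.IGNORECASE),
    "due_date": re.compile(r"Due Date[ \t]*:[ \t]*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def parse_form(text):
    """Apply every rule to the form text; missing fields come back as None."""
    result = {}
    for field, rule in RULES.items():
        match = rule.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


def form_to_json(text):
    """Serialize the parsed fields as JSON for import into another system."""
    return json.dumps(parse_form(text), sort_keys=True)
```

Keeping the rules in one table means a change in their form layout only requires editing a pattern, not the parsing logic. Fields that come back as `None` are a useful signal that a form did not match the expected format.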
This is an advertisement for a product that does what you want, and it should give you some good terms to search on to find others.
https://docparser.com/blog/extract-data-from-pdf/
I have no connection to this company in any way - just the first thing I found with google.