> the main difficulty is in OCR on the pdf's
I've achieved pretty good results with OCRmyPDF
I'm taking a different approach with my database. It's based on user-generated input, meaning every report has to be manually entered. Of course this would rely on the help of quite a few people to achieve a meaningful number of reports. On the other hand, you'll be able to query the DB in a very structured and granular form. The data model in its current form would allow you to search the DB for very complex terms, like "Every report from credible witnesses in which a UFO landed next to a street whilst changing colors from green to red, and in which the witness approached the craft and was paralyzed, after which the craft took off at a 45 degree angle and left marks on the ground". I think it's worth putting in the effort to create this database in return for the research capabilities that it would offer.
I'll be linking to the documents that these reports were generated from, and I was thinking of running OCR over them too, so maybe you'd be interested in a collaboration of sorts.
You can do it directly using e.g. tesseract, if I recall correctly. But there are also extended tools for that. I think I've used https://github.com/jbarlow83/OCRmyPDF before, and for DjVu, "ocrodjvu": http://jwilk.net/software/ocrodjvu
Edit: These are Linux/unix/macos solutions. Did not see the windows tag before, but I'll leave it here just in case.
My problem with Paperless is how it stores the documents: I'd rather the documents not be encrypted, and I'd like them stored in a logical folder hierarchy. There's an open ticket for a similar request (although it's mostly to do with Owncloud) and it looks like the dev's not interested. I've ended up using https://github.com/jbarlow83/OCRmyPDF to help index my documents for Lucene, and I'm working on an auto filenaming/moving script.
Everyone's already mentioned what I would suggest:
Another option though is to split things up. I have https://github.com/jbarlow83/OCRmyPDF running and it's great. It's what paperless uses (or used to use). Right now I just have this monitoring a folder on my nas and then it moves the files elsewhere on the nas. So if you can find some software that can search inside of OCR'd PDFs, you can use this first and then stash them inside the other software.
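For anyone wanting to try the folder-monitoring part, here's a rough sketch of how it could look in Python. This is not my actual setup, just an illustration: the function names, the polling interval, and the choice to delete the original after a successful run are all assumptions. It only relies on the basic `ocrmypdf input.pdf output.pdf` invocation.

```python
import subprocess
import time
from pathlib import Path


def pending_pdfs(inbox: Path) -> list[Path]:
    """Return the PDFs waiting in the inbox, oldest first."""
    pdfs = [p for p in inbox.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"]
    return sorted(pdfs, key=lambda p: p.stat().st_mtime)


def process_once(inbox: Path, outbox: Path) -> list[Path]:
    """OCR each waiting PDF into the outbox; delete the original only on success."""
    outbox.mkdir(parents=True, exist_ok=True)
    finished = []
    for src in pending_pdfs(inbox):
        dst = outbox / src.name
        # Basic invocation: ocrmypdf <input.pdf> <output.pdf>
        result = subprocess.run(["ocrmypdf", str(src), str(dst)])
        if result.returncode == 0:
            src.unlink()
            finished.append(dst)
        # On failure the source stays in the inbox and is retried on the next pass.
    return finished


def watch(inbox: Path, outbox: Path, interval_s: float = 30.0) -> None:
    """Poll forever. A file the scanner is still writing may need extra handling."""
    while True:
        process_once(inbox, outbox)
        time.sleep(interval_s)
```

Polling is the simplest thing that works on a NAS share, where filesystem change notifications are often unreliable. Something like `watch(Path("/nas/incoming"), Path("/nas/ocr"))` would start it, with both paths being placeholders.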
How to get the files is already covered, so here's how to OCR them.
I would use OCRmyPDF; I've had good experience with it. It uses tesseract as its underlying image-to-text engine.
You can either write a bulk script yourself, or I bet you'll find one online that suits your needs.
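If you go the write-it-yourself route, a bulk script could look roughly like the sketch below. Treat it as a starting point under my own assumptions: the function names and the mirrored output tree are made up for illustration. The only OCRmyPDF options used are `-l` (language) and `--skip-text` (leave pages that already contain text alone).

```python
import subprocess
from pathlib import Path


def plan_batch(src_root: Path, dst_root: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF under src_root with a mirrored path under dst_root,
    skipping files whose output already exists."""
    jobs = []
    # Note: "*.pdf" is matched case-sensitively on most Linux filesystems.
    for src in sorted(src_root.rglob("*.pdf")):
        dst = dst_root / src.relative_to(src_root)
        if not dst.exists():
            jobs.append((src, dst))
    return jobs


def run_batch(src_root: Path, dst_root: Path, language: str = "eng") -> list[Path]:
    """Run ocrmypdf over the planned jobs and return the inputs that failed."""
    failed = []
    for src, dst in plan_batch(src_root, dst_root):
        dst.parent.mkdir(parents=True, exist_ok=True)
        # -l picks the Tesseract language; --skip-text leaves pages that
        # already have text untouched instead of aborting on them.
        cmd = ["ocrmypdf", "-l", language, "--skip-text", str(src), str(dst)]
        if subprocess.run(cmd).returncode != 0:
            failed.append(src)
    return failed
```

Because finished outputs are skipped, the script can be re-run after an interruption without redoing everything. Keep the output directory outside the input directory, otherwise the next run will pick up the results as new inputs.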
ocrmypdf is what I’d normally suggest if you’re wanting to just apply OCR to an entire PDF of scanned pages.
If you have an electronically created pdf (not scanned) and you’re just wanting to run OCR on embedded images then you’ll want a pdf library that can extract the figure images for you, and then you can use tesserocr to run OCR on those images.
If you’ve got a more complicated case of scanned pages that have some text and some figures on them then you’ll need to find a way of classifying which area is text and which area is figure (which could probably be done quite effectively with a gradient-magnitude approach, where regions of an image that have significant changes rapidly everywhere are probably text, and regions that have larger flat blocks are probably figures).
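To make the gradient-magnitude idea a bit more concrete, here is a toy sketch in plain Python. It assumes the page has already been cut into small grayscale tiles (lists of rows with values 0-255); the thresholds and function names are guesses for illustration, not tuned values. In practice you would probably do this with numpy or opencv on real images.

```python
def gradient_density(tile, threshold=40):
    """Fraction of pixels whose gradient magnitude exceeds the threshold."""
    rows, cols = len(tile), len(tile[0])
    strong = 0
    total = 0
    for y in range(rows - 1):
        for x in range(cols - 1):
            gx = tile[y][x + 1] - tile[y][x]  # horizontal change
            gy = tile[y + 1][x] - tile[y][x]  # vertical change
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                strong += 1
            total += 1
    return strong / total if total else 0.0


def classify_tile(tile, threshold=40, text_density=0.15):
    """Label a tile 'text' when sharp changes are dense, otherwise 'figure'."""
    if gradient_density(tile, threshold) >= text_density:
        return "text"
    return "figure"
```

A tile of alternating black and white columns comes out as "text", while a flat gray tile or a smooth ramp comes out as "figure". Real figures with fine hatching or halftone patterns would fool this, so expect to tune the two thresholds on your own scans.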
This issue from the ocrmypdf GitHub (the open-source PDF OCR program) is likely worth reading.
The gist is that to detect tables it’s best to have something that knows what tables are (which neither tesseract nor PDFs do), so you can either use something made for it, or use something like opencv to split the table into cells, and then OCR each cell and recreate the table in the PDF. For crisp lines you’d need to remove the existing image and create a new one by adding lines yourself.
More generally, if you’re doing OCR with Python you should use tesserocr rather than pytesseract - it’s an actual binding to the tesseract library and is better in practically every way.
So do they all use the same open source OCR engine?
How about multi-user?
How about full-text search?
Do they offer easy export/backup ?
Auto-tagging based on rules?
What other criteria are there to consider?
As far as OCR is concerned I was pretty disappointed by the results that papermerge and paperless gave. Incomplete texts, oversensitive to images that are not 100% black and white like from a phone camera. They both use tesseract as far as I can tell. I found ocrmypdf (https://github.com/jbarlow83/OCRmyPDF) to yield better results due to its configurability, but I'd rather have an OCR assistant UI where it's easy to adjust settings on the fly for each document if default settings did not work well. Haven't found anything like that yet, though.
Hazel can only organize files, but it can also run scripts. It shouldn't be too difficult to brew install ocrmypdf or pdfsandwich and script it from there.
Our setup:
Scanner: Brother 1700w (saves PDFs to the network drive)
Service that looks for new files and invokes OCR (custom)
OCRmyPDF (extracts text from the PDF and inserts a text layer into the PDF)
Service that sorts by keywords in the text layer
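The last step of a setup like this, sorting by keywords, could be sketched roughly as below. This is not the actual service, only an illustration: the rules, folder names, and function name are hypothetical, and the text is assumed to have been pulled out of the PDF's text layer already.

```python
# Hypothetical rules: folder name -> keywords that suggest it.
RULES = {
    "invoices": ("invoice", "amount due"),
    "insurance": ("policy", "premium"),
    "tax": ("tax office", "assessment"),
}


def choose_folder(text: str, rules: dict = RULES, default: str = "unsorted") -> str:
    """Return the folder whose keywords occur most often in the text."""
    lowered = text.lower()
    best_folder, best_hits = default, 0
    for folder, keywords in rules.items():
        hits = sum(lowered.count(keyword) for keyword in keywords)
        if hits > best_hits:
            best_folder, best_hits = folder, hits
    return best_folder
```

Counting hits rather than stopping at the first match means a letter that mentions "invoice" once but "policy" five times still lands in the insurance folder. Anything with no hits at all goes to the default folder for manual sorting.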
For the more technically inclined who just want to OCR their pdfs check out this instead
https://github.com/jbarlow83/OCRmyPDF
there is even a docker image to simplify setup.
You can convert a CBR losslessly to PDF, then OCR the PDF.
If you need to revert to CBR for some reason, use pdfimages to get the page images back out.
With Firefox, you could try an extension like these: Evernote could be a good option, as well as PDF Mage, Save as PDF, or the NIMBUS solution there.
Check whether the saved PDFs actually contain searchable text (ideally as PDF/A). If not, you could add a text layer using OCRmyPDF, which can even be run from a batch file.
https://github.com/jbarlow83/OCRmyPDF/blob/7691ba8535cf65da2c790f48c9cba69203d05504/docs/docker.rst
This is a good starting point. You either have to use the command line or set up the container to start the web service, obviously on a different port than 5000 in your case. I've never used ocrmypdf, so I can't help you much further than that.
Is this basically what you're doing: https://github.com/jbarlow83/OCRmyPDF/issues/180 ?
Any issues with the OCRmyPDF OCR engine? Do you change the file names later or just scan by date?
Ah right, yeah for OCR - I've used OCRmyPDF with some success - it uses Tesseract "under the hood"
https://github.com/jbarlow83/OCRmyPDF
If you don't need OCR - you may want to try PyMuPDF
I looked for a solution for a long time, and ultimately settled on simply leveraging macOS Finder to create a directory structure. I can tag files in Finder to assist in searching, but usually I’m too lazy and a date-based directory structure works fine. All of the files I store are OCR’d, so Spotlight automatically indexes the content and makes it searchable.
To scan and OCR files, I use a Brother document scanner that dumps files to an incoming folder. I then have a little program that watches for new files and runs them through the open source OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) program to produce a copy with searchable text that is dumped into an output folder by month. Periodically I’ll go update the filenames from a date/time to a meaningful name.
Pretty simple, and free from some overly proprietary storage database.
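For the "output folder by month" part, here is a rough sketch of how the destination could be worked out. It is not the actual program, just an illustration: the function names and the `YYYY-MM` folder naming are assumptions, and the scan time is taken from the file's modification time.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def monthly_destination(output_root: Path, src: Path, scanned_at: datetime) -> Path:
    """Build output_root/YYYY-MM/<original name> for a scanned file."""
    return output_root / scanned_at.strftime("%Y-%m") / src.name


def ocr_into_month_folder(src: Path, output_root: Path) -> Path:
    """OCR one file into the month folder derived from its modification time."""
    scanned_at = datetime.fromtimestamp(src.stat().st_mtime)
    dst = monthly_destination(output_root, src, scanned_at)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)
    return dst
```

Keeping the path logic in its own function makes it easy to change the layout later (for example to year/month subfolders) without touching the OCR call.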
Save the document as a PDF file. Then use ocrmypdf, a free OCR program that does a good job of extracting text from PDF files. You can dump the text into a separate file, or integrate the text into the PDF. With the text in the PDF, the PDF is then searchable.
You can easily install ocrmypdf using Homebrew.
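As a rough sketch of doing both at once (searchable PDF plus a separate text file), something like the following should work. The helper names and the output file naming are my own assumptions; the `--sidecar` option is the OCRmyPDF flag that writes the recognized text to a file.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_command(src: Path, dst: Path, sidecar: Optional[Path] = None) -> list[str]:
    """Assemble an ocrmypdf call; with sidecar set, the text also goes to a .txt file."""
    cmd = ["ocrmypdf"]
    if sidecar is not None:
        cmd += ["--sidecar", str(sidecar)]
    cmd += [str(src), str(dst)]
    return cmd


def ocr_with_text_dump(src: Path) -> tuple[Path, Path]:
    """Write <name>_ocr.pdf and <name>.txt next to the source file."""
    dst = src.with_name(src.stem + "_ocr.pdf")
    sidecar = src.with_suffix(".txt")
    subprocess.run(build_command(src, dst, sidecar), check=True)
    return dst, sidecar
```

Calling `ocr_with_text_dump(Path("letter.pdf"))` would leave `letter_ocr.pdf` and `letter.txt` beside the original, with `letter.pdf` being a placeholder name.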
It looks feasible, but not super straightforward.
Automation of Outlook is fairly well documented/searchable - this usually uses win32com.
Automation of Adobe Acrobat doesn't look easy - it might be easier to use another method to make these documents searchable. OCRmyPDF looks promising.
Scheduling - use the windows Task Scheduler to get this to run regularly.
It sounds like if you can just automate the scraping and sending of emails, this would already present some substantial time savings, as you can already at least batch the OCR conversion.
I've had fairly good luck with this before: https://github.com/jbarlow83/OCRmyPDF. It's a little bit complicated to use (it's command-line based, there's no UI at all), but it's multithreaded, and the performance seems to be fairly good. On my dual-core laptop, it took around 7 minutes to convert a 60-page PDF; I would assume your laptop would take approximately 1/3 of the time to do the same thing.
I use a Brother wireless document scanner (on mobile and not sure of the model) that scans to a simple ftp server on my mac. Then I OCR the scanned PDFs with https://github.com/jbarlow83/OCRmyPDF, and sort into folders by year. For searching - I just use Finder which already indexes the pdf text content. It’s not elegant, but it is cheap/free and doesn’t tie my documents into any kind of proprietary software. Works out well enough.
If you are willing to put in some work and have some linux knowledge, I would suggest https://github.com/jbarlow83/OCRmyPDF It uses tesseract and has done a good job for me in the past. If you are good with scripting, you could automate it. It will use all available cores which is much better than the per-core pricing of ABBYY.