Doing it in AWS was mainly because I don't have root on the machines I have access to at work, so I couldn't install the tools I needed. AWS gives you that in a few clicks.
From there all I did was write a script that uses wkhtmltopdf to generate the screenshot, and then copies the result somewhere that I've given Destin access to.
I added the script to cron and told it to run every 15 minutes. Now I can forget about it until I run out of disk space at the target location, run out of my free quota on AWS, or someone reminds me :)
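Not the actual script, but a minimal sketch of what a job like that could look like in Python. The URL, the shared folder and the file naming are placeholders I made up; the only real pieces are the wkhtmltopdf call and the copy.

```python
import shutil
import subprocess
import tempfile
from datetime import datetime
from pathlib import Path

# Hypothetical values: replace with the page to capture and the shared location.
PAGE_URL = "https://example.com/"
SHARED_DIR = Path("/srv/shared/snapshots")


def snapshot_name(now: datetime, suffix: str = ".pdf") -> str:
    """Build a sortable, timestamped file name for one capture."""
    return f"snapshot-{now:%Y%m%d-%H%M}{suffix}"


def take_snapshot(url: str, dest_dir: Path, now: datetime) -> Path:
    """Render url with wkhtmltopdf into a temp file, then copy it into dest_dir."""
    work_file = Path(tempfile.gettempdir()) / snapshot_name(now)
    subprocess.run(["wkhtmltopdf", url, str(work_file)], check=True)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(work_file, dest_dir))
```

Calling take_snapshot(PAGE_URL, SHARED_DIR, datetime.now()) at the bottom of the file and adding a crontab entry along the lines of */15 * * * * python3 /path/to/snapshot.py (path is hypothetical) would give the every-15-minutes behaviour.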
wget --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --no-parent
I've found the best way to generate PDFs is to create HTML and then convert that to a PDF. Here is a great program that does it: wkhtmltopdf. All the flexibility of HTML, none of the pain of PDFs.
Probably more work than you want to do but hey this is /r/selfhosted, I can't assume that. First, are you just looking to save attachments to disk or do you need to do some conversion? If the former all you need is fetchmail, procmail and metamail on the server set up to poll for email and save the attachments to a directory. This will probably help: https://kuther.net/howtos/howto-receive-mail-and-save-attachment-fetchmail-procmail-and-metamail
If the latter you're probably going to need to do some coding. I would probably do the steps I previously discussed then write a python script to monitor the email directory and feed the new files into a converter. Here are the two main pieces you need: https://sourceforge.net/projects/python-fam/
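For the monitoring part, here is a rough sketch that simply polls the directory on a timer instead of using python-fam (a deliberate simplification, not the same technique). The converter step is a stand-in: it assumes the attachments are HTML and sends them through wkhtmltopdf, so swap in whatever conversion you actually need.

```python
import subprocess
import time
from pathlib import Path


def new_files(directory: Path, seen: set) -> list:
    """Return files in directory that have not been seen before, and remember them."""
    fresh = sorted(p for p in directory.iterdir() if p.is_file() and p.name not in seen)
    seen.update(p.name for p in fresh)
    return fresh


def convert(path: Path, out_dir: Path) -> None:
    """Placeholder converter: assumes HTML input and renders it to PDF."""
    target = out_dir / (path.stem + ".pdf")
    subprocess.run(["wkhtmltopdf", str(path), str(target)], check=True)


def watch(directory: Path, out_dir: Path, interval: float = 30.0) -> None:
    """Poll directory forever, converting each file once as it appears."""
    seen = set()
    while True:
        for path in new_files(directory, seen):
            convert(path, out_dir)
        time.sleep(interval)
```

Polling is cruder than a file alteration monitor, but it has no dependencies and is usually good enough for a mailbox that is only fetched every few minutes anyway.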
Use wkhtmltopdf. I believe it is what PDFCrowd uses on their end. It basically uses an embedded, headless WebKit (QtWebKit) browser to create a Print Preview of a page and then saves that preview as a PDF.
It's a command line util and I've never come across a Laravel package for it. mikehaertl/phpwkhtmltopdf is a decent wrapper for it, though. I've done it that way and by calling the application directly using exec. They both work fine.
You will need root access to whatever server this is going on, however. wkhtmltopdf requires libxrender, which I don't believe is installed by default on most Linux distros, at least not Debian or CentOS which I've used it on.
http://wkhtmltopdf.org/
https://github.com/mikehaertl/phpwkhtmltopdf
When I had to convert HTML+CSS Rmd reports to PDF a couple of years ago, I ultimately found that it was easiest to just pipe the HTML output itself through wkhtmltopdf, which uses WebKit to render your report as PDF. It's basically like opening the HTML report in your browser and hitting Print.
Three big advantages of this approach:
1. You don't have to learn a new markup language or rewrite any report.
2. It preserves your CSS styling in the PDF (which Pandoc won't).
3. You can use CSS' media queries and wkhtmltopdf's --print-media-type flag to have different styling for HTML and PDF versions of the same report (the page-break stuff is really useful).
But - it was two years ago and I no longer have access to that code, so proceed with caution.
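Since the original code is gone, here is a small sketch of the idea in point 3, under my own assumptions: the HTML, class names and file names are invented for illustration. The --print-media-type flag makes wkhtmltopdf apply the @media print rules instead of the screen ones.

```python
import subprocess
from pathlib import Path

# Illustrative report; the markup and class names are made up for this sketch.
REPORT_HTML = """<!DOCTYPE html>
<html><head><meta charset="utf-8">
<style>
  .nav { background: #eee; }
  @media print {
    .nav { display: none; }            /* hide on-screen navigation in the PDF */
    h2 { page-break-before: always; }  /* start each section on a new page */
  }
</style></head>
<body><div class="nav">menu</div><h2>Section 1</h2><p>...</p></body></html>
"""


def pdf_command(html_path: str, pdf_path: str) -> list:
    """wkhtmltopdf call that uses the print stylesheet rather than the screen one."""
    return ["wkhtmltopdf", "--print-media-type", html_path, pdf_path]


def render_report(workdir: Path) -> Path:
    """Write the sample report into workdir and convert it to PDF."""
    html_path = workdir / "report.html"
    pdf_path = workdir / "report.pdf"
    html_path.write_text(REPORT_HTML, encoding="utf-8")
    subprocess.run(pdf_command(str(html_path), str(pdf_path)), check=True)
    return pdf_path
```

The same HTML file then serves both audiences: browsers ignore the print block on screen, and the PDF picks it up.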
I was using tcpdf then mpdf for years, and then I discovered wkhtmltopdf and it changed my life. If you can exec() then you will enjoy super fast operation, no more memory/process time outs, and overall easier and better PDFs...
Check out http://wkhtmltopdf.org/ ... binaries for most platforms...
Here is an example of some code I used...
//Setup file parameters
$cwd = getcwd();
$stamp = date('Y-m-d-H-i-s'); //H = 24-hour clock, i = minutes ('m' would be the month)
$htmlfile = $cwd.DIRECTORY_SEPARATOR.'tmp'.DIRECTORY_SEPARATOR.$stamp.'.html';
$pdffile = $cwd.DIRECTORY_SEPARATOR.'tmp'.DIRECTORY_SEPARATOR.$stamp.'.pdf';
//Fat Free Framework code that generates a valid html file and writes to disk
$html = \Template::instance()->render('pdf/order.html');
$f3->write($htmlfile, $html);
//Convert HTML to PDF (paths are escaped so spaces in them don't break the command)
exec('"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" '.escapeshellarg($htmlfile).' '.escapeshellarg($pdffile));
//Output the file to browser
header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($pdffile));
readfile($pdffile);
//Delete the junk and bye!
unlink($htmlfile);
unlink($pdffile);
die();
This is completely doable for beginners.
The part that scales in difficulty is in finding the "trademark colour"; it depends on how you want to do it. The method you described would work to some degree, but you may run into various difficulties (e.g. external CSS files) and limitations. A more sophisticated method would be to render the page using a Web renderer (either a library or a program) and then do some analysis on the image.
> * what libraries should I focus on?
> * what functions will be crucial?
See Python's urllib module for reading remote files.
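To make that concrete, here is a rough sketch that assumes the simple approach is "download the page and count which hex colours its markup mentions". That reading of the method is my assumption, and the function names are made up. It only sees colours written inline in the HTML, which is exactly the external-CSS limitation mentioned above, and it can be fooled by things like href="#abc" anchors.

```python
import re
from collections import Counter
from urllib.request import urlopen

# Matches #rgb and #rrggbb colour literals.
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def fetch_html(url: str) -> str:
    """Download a page and decode it using the charset the server reports."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def colour_counts(html: str) -> Counter:
    """Count how often each hex colour appears in the markup (case-insensitive)."""
    return Counter(match.lower() for match in HEX_COLOUR.findall(html))
```

Something like colour_counts(fetch_html(url)).most_common(3) would then give the top candidates for a given page.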
> * where can I upload my app to actually run the scrapper? I have a hosting plan but I think I can't host my own stuff because it's a shared account.
I don't recommend thinking about making this a dynamic website just yet. That's another bag of snakes.
> * Am I ready to do this? I haven't done any big projects apart from those in the tutorials, like tictactoe.
I think so. It's a great idea for a beginner project: simple, useful, and has room to grow.
I use a very similar routine for generating invoices for my consulting jobs, though I use wkhtmltopdf to convert the HTML into the final PDF invoice. It's a CLI tool that uses headless webkit to convert an HTML file to PDF, so no need to manually load that HTML in a browser and print to PDF anymore.
I tried out http://wkhtmltopdf.org/. It uses an older build of WebKit for the PDF rendering. It is pretty cool and easy to use. You can set the page width and height via CSS, e.g. width: 210mm; height: 297mm; etc.
Prawn is pretty cool, and generates nice PDFs. Right now I think that if I had to do a new PDF project, though, I'd use wkhtmltopdf. Being able to use HTML / CSS to create a PDF makes it a lot easier for front-end folks to contribute effectively; sure they can pick up the Prawn DSL, but it's nicer if they can just use what they know.
I'm a PHP developer, so I would have to recommend PHP ->
Laravel is the framework of the moment. Use it to generate your HTML pages from your DB, then you could use something like http://wkhtmltopdf.org/
Call it as a command-line program using PHP's exec function.
Or a quick search led me to https://github.com/thujohn/pdf-l4, which seems straightforward.
wkhtmltopdf is, in my opinion, the best stream to PDF option available right now.
So the way that works is that you create the page in HTML to look like you want with the dynamic data, then you run your HTML through the wkhtmltopdf parser and poof, out the other end comes a PDF that looks like the web page you created. Images, CSS, content, and all.
wkhtmltoimage has worked great for me in the past to take screen shots of a website from the command line.
You can use something like xvfb to render the page at the dimensions you want. So something implemented like this.
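As a sketch of how that combination might be wired up (the default dimensions and the function name are my own choices): xvfb-run starts a virtual X display of the requested size, and wkhtmltoimage is told to render at the same width and height.

```python
import subprocess


def screenshot_command(url, out_path, width=1280, height=1024, use_xvfb=True):
    """Build the argument list for a fixed-size screenshot of url."""
    cmd = ["wkhtmltoimage", "--width", str(width), "--height", str(height), url, out_path]
    if use_xvfb:
        # Virtual screen 0 with the same geometry and 24-bit colour depth.
        screen = f"-screen 0 {width}x{height}x24"
        cmd = ["xvfb-run", "--auto-servernum", f"--server-args={screen}"] + cmd
    return cmd


def take_screenshot(url, out_path, **kwargs):
    """Run the command; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(screenshot_command(url, out_path, **kwargs), check=True)
```

Whether xvfb is needed at all depends on the wkhtmltoimage build you have, so use_xvfb=False is worth trying first.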
I see the following in the documentation http://wkhtmltopdf.org/usage/wkhtmltopdf.txt
--custom-header <name> <value> Set an additional HTTP header (repeatable)
I am trying to leverage this option to set an additional header field in the request that is sent to retrieve the page (to be more specific, setting the WebSocket version that I am using).
When I fire the command (wkhtmltopdf --custom-header "name" "value" www.xyz.com ~/path/to/file) from the terminal and inspect the request, I don't see the additional header field.
Since this switch/option has been there for some time, I am expecting it to work. I am guessing that I am missing something here.
Please let me know if I am not being clear.
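For anyone trying to reproduce this, a small sketch of how the command could be assembled from a script. The helper name and the example header are hypothetical. The same usage document also lists --custom-header-propagation, which adds the custom headers to each resource request; whether that is related to the missing header here is only a guess.

```python
import subprocess


def header_command(url, pdf_path, headers, propagate=True):
    """Build a wkhtmltopdf call that sends extra HTTP headers with the request."""
    cmd = ["wkhtmltopdf"]
    for name, value in headers.items():
        # --custom-header is repeatable: one name/value pair per occurrence.
        cmd += ["--custom-header", name, value]
    if propagate:
        # Also send the custom headers when fetching images, CSS, scripts, etc.
        cmd.append("--custom-header-propagation")
    return cmd + [url, pdf_path]


def render_with_headers(url, pdf_path, headers, propagate=True):
    """Run the command; raises CalledProcessError if wkhtmltopdf fails."""
    subprocess.run(header_command(url, pdf_path, headers, propagate), check=True)
```

Pointing the URL at a local server that logs incoming request headers is probably the quickest way to see what is actually sent.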
I think the issue is that MIT Press doesn't allow sharing of the PDF files, even if it is a draft version.
See the comments in one of the author's G+ announcement :
>Ian Goodfellow | Jun 2, 2015
>@Adam Goodkind, please don't share the PDF.
Yoshua Bengio shared the original announcement and somebody pointed it out there as well :
>Dan Farmer | May 21, 2015
>Unless something has changed since the last post MIT Press isn't allowing PDFs to be posted. Still very exciting!
btw. you could use something like wkhtmltopdf to get your own fresh PDF.
I use wkhtmltopdf, because it uses the webkit engine to render the output, so it's very accurate. I had issues with complex layouts using CSS with the pure PHP conversion libs (dompdf, mpdf, etc).
wkhtmltopdf is a command-line app, but there's a php wrapper to make it easier to use here: https://github.com/mikehaertl/phpwkhtmltopdf
I've used wkhtmltopdf before with great success. It's server side software that you run via PHP, and it turns your webpage into a PDF. Can take a bit of fiddling with the styles to get it looking right.
You won't be able to install stuff like this on a lot of shared hosts, though.
INFO: Could not find files for the given pattern(s).
INFO: Could not find files for the given pattern(s).
INFO: Could not find files for the given pattern(s).
Traceback (most recent call last):
  File "C:\Users\antek\Pulpit\untitled\venv\lib\site-packages\imgkit\config.py", line 30, in __init__
    with open(self.wkhtmltoimage):
FileNotFoundError: [Errno 2] No such file or directory: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/Users/antek/Pulpit/untitled/scsc.py", line 7, in <module>
    imgkit.from_url('http://google.com', 'blank.jpg')
  File "C:\Users\antek\Pulpit\untitled\venv\lib\site-packages\imgkit\api.py", line 31, in from_url
    cover_first=cover_first)
  File "C:\Users\antek\Pulpit\untitled\venv\lib\site-packages\imgkit\imgkit.py", line 34, in __init__
    self.config = Config() if not config else config
  File "C:\Users\antek\Pulpit\untitled\venv\lib\site-packages\imgkit\config.py", line 36, in __init__
    'http://wkhtmltopdf.org\n'.format(self.wkhtmltoimage))
OSError: No wkhtmltoimage executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - http://wkhtmltopdf.org
Process finished with exit code 1
Does this cover the automation side of things because there are 511 lectures to this Linux course.
I've clicked on your link and my workplace has blocked it due to:
"Not allowed to browse Shareware download category.
You tried to visit:
This "imgkit" depends on wkhtmltopdf (as it already says in the error message). Have you installed wkhtmltopdf? This is actually a web browser that can be scripted (not Python specific) to output to PDF or image (PNG, JPG etc.). You need to install it before running "imgkit".
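A quick way to check this from Python before calling imgkit, as a sketch: the helper names are mine, and the explicit path is something you would fill in yourself if the installer did not add the tool to PATH.

```python
import shutil


def find_wkhtmltoimage(explicit_path=None):
    """Return a usable path to wkhtmltoimage, or None if it cannot be found."""
    return shutil.which(explicit_path or "wkhtmltoimage")


def screenshot(url, out_path, explicit_path=None):
    """Take a screenshot via imgkit, failing early with a clear message."""
    binary = find_wkhtmltoimage(explicit_path)
    if binary is None:
        raise RuntimeError("wkhtmltoimage not found; install wkhtmltopdf or pass its full path")
    import imgkit  # third-party; imported lazily so the check above works without it
    config = imgkit.config(wkhtmltoimage=binary)
    return imgkit.from_url(url, out_path, config=config)
```

If find_wkhtmltoimage() returns None, the error in the traceback is expected: imgkit has nothing to run.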
I did spin up a small backend service that uses http://wkhtmltopdf.org/ and it has been a godsend.
Before, I was using window.print() and a good amount of CSS @media print, but client browsers were crashing when generating too much data.
I wonder if the website isn't the problem. Re: the ridonkulously huge image at the top of the page – screenshotmachine.com/serve.php?img=apec2015-ph-FULL-70080c.png.
Maybe you should try doing it with a command-line tool? Re: wkhtmltopdf.org
It's only one piece of the puzzle, but since you're already using HTML to make your PDFs you might look at WkHtmlToPdf. It's a headless WebKit browser that loads your HTML and then spits out a PDF, and you just call it from the command line.
I'd do it first with plain, simple HTML.
Have a simple webpage on your domain and allow registered users to copy it.
It's not as easy as it sounds; you may want to use a web framework and RESTful APIs.
Then, when all of this is working, use this -> http://wkhtmltopdf.org/
I think you are looking for this http://www.nrecosite.com/html_to_image_generator_net.aspx which is based on http://wkhtmltopdf.org/
The alternate would be to use a HTML to PDF library which has output as an image option.
It should work on 3 without much problem.
html2pdf is a completely different PHP-based service, which you would run separately and make API requests to from your Flask application.
wkhtmltopdf uses the Qt WebKit rendering engine to lay out the HTML and render it to a PDF. It's a command-line Linux application and library. You need to install wkhtmltopdf separately from your Python packages; PDFKit and Pywhtmltopdf both just call the command-line software and interface with it.
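To show what "just call the command line software" amounts to, here is a minimal sketch that skips the wrapper packages and pipes HTML straight through the binary. It relies on wkhtmltopdf accepting - for both input and output (read HTML from stdin, write the PDF to stdout); the function name and the binary parameter are my own.

```python
import subprocess


def html_to_pdf_bytes(html: str, binary: str = "wkhtmltopdf") -> bytes:
    """Pipe HTML into wkhtmltopdf on stdin and collect the PDF from stdout."""
    result = subprocess.run(
        [binary, "--quiet", "-", "-"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

In a Flask view the returned bytes could be sent back with an application/pdf content type, with no temp files involved.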
xvfb is a Linux application as well, and is only needed for certain PDF features if you have a limited build of wkhtmltopdf. See the FAQ at http://wkhtmltopdf.org/downloads.html, but you basically need a "static" build for things to work perfectly without xvfb or xephyr.
The issues with wkhtmltopdf arise when you start to generate large documents. I found that anything over 5 or so pages started to use memory exponentially, max out my CPU, and lock things up; 10+ pages often caused crashes in cairo.
You might not be able to get wkhtmltopdf installed on your hosting, so this might all be moot. For Ubuntu, I use the binaries provided through the website. You might be able to include libwkhtmltox.so.0 and the wkhtmltopdf binaries in your upload files, though.
I would stick to wkhtml since its entire purpose is rendering, and it also shares the WebKit base with PhantomJS. Try playing around with the viewport, image and resolution size parameters in wkhtml: http://wkhtmltopdf.org/usage/wkhtmltopdf.txt. Also check what your pages look like in the print (vs "screen") CSS media type.
I started a project which I'll be using for prototyping which takes csv files and then puts the data into jade templates and runs that through wkhtmltopdf to generate the pdfs. I've gotten the basic system working for one prototype I did, but I'm working on making it more general purpose. Long term, I'm hoping to have a lot of templates to choose from, but also make it easy to use html/css to create custom designs. I don't know if that's something which would be interesting to anyone. I'd be happy to hear feedback on what features people would want, or how to make it more useful/easier to use. I'm not really planning on doing anything WYSIWYG - html/css are exactly what I want, but I'll be making it open source, so who knows, maybe someday.
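For anyone curious what that pipeline looks like in miniature, here is a sketch of the CSV-to-HTML half in Python. It is not the project's code: it uses string.Template instead of jade, and the column names and markup are invented. Each resulting HTML string would then be handed to wkhtmltopdf.

```python
import csv
import html
import io
from string import Template

# Hypothetical template; $name and $cost must match the CSV header line.
ITEM_TEMPLATE = Template(
    "<html><body><div class='item'><h1>$name</h1><p>Cost: $cost</p></div></body></html>"
)


def render_rows(csv_text: str) -> list:
    """Return one HTML document per CSV row, with values HTML-escaped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pages = []
    for row in reader:
        safe = {key: html.escape(value) for key, value in row.items()}
        pages.append(ITEM_TEMPLATE.substitute(safe))
    return pages
```

Keeping the templating step separate from the PDF step makes it easy to preview the HTML in a browser while iterating on the design.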
You are going down a dark and terrible road.
These two are the best for different reasons: wkhtmltopdf and htmldoc.
You need an X server for wkhtml to work and nothing for htmldoc to work. For modern day pages, wkhtml is better. htmldoc is better if you want to do text reports and such.
This is an excellent idea, but I'd make a small modification. You can convert HTML to PDF quite easily - for example, you can use a tool like wkhtmltopdf. It's pretty easy to find the coordinates of HTML elements. You can modify the header of the HTML file to include jQuery and then do something like $('.my-target-element-class').offset().
Quite easy to install, use and automate. Uses WebKit as its rendering engine. You'd have to deal with generating only the HTML you want to render as PDF yourself, however.
If you are open to rendering a pdf from HTML on the clientside, I have used this tool before: http://wkhtmltopdf.org/. It basically takes a webpage, loads it and then renders a pdf from it. Just a thought
In my experience iTextSharp is extremely low level. It's a huge time-sink. Are you sure you really want to essentially implement a manual layout engine yourself? The project I'm currently on already has iTextSharp deeply ingrained, but for the next project, I'd pick something else - perhaps wkhtmltopdf or PrinceXML.
Another option you have (if you can install stuff on your box) is wkhtmltopdf (http://wkhtmltopdf.org/).
Convert the PDF to HTML, use your preferred templating system to drop in the user-submitted values, shoot the output to wkhtmltopdf. Boom, filled-in PDF.