Absolutely. Don't forget that a lot of scraping these days is done using headless browsers (e.g., https://github.com/GoogleChrome/puppeteer). In this case, when done right, there is virtually no difference between a human and a machine. You will see bots moving the mouse, scrolling, and doing whatever else a human would do to perform an equivalent task. Bear in mind that this method produces a lot more requests against the website (as it needs to load CSS, JavaScript, and all the other elements of the page a simple scraper wouldn't bother with). The only reason bot authors opt for it is that website owners block the simple ways.
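A minimal sketch of that kind of human-imitating input with Puppeteer (URL, coordinates, and step counts are all illustrative):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com'); // illustrative URL

  // Move the mouse in small steps instead of teleporting it,
  // then click and scroll -- roughly what a human session looks like.
  await page.mouse.move(200, 300, { steps: 25 });
  await page.mouse.click(200, 300);
  await page.evaluate(() => window.scrollBy(0, 500));

  await browser.close();
})();
```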
Built using Google’s new headless browser API Puppeteer, which was released yesterday.
The program will take a URL and a list of devices to emulate. It will then modify the browser resolution and user agent to match that of the device. For each device it will take a screenshot (can also be a full-page screenshot) and put it in a generated folder.
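Something along these lines with Puppeteer's device descriptors (a sketch; the device list and output path are illustrative, and older releases expose the descriptors via require('puppeteer/DeviceDescriptors') instead):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  for (const name of ['iPhone 6', 'iPad']) { // illustrative device list
    const page = await browser.newPage();
    await page.emulate(puppeteer.devices[name]); // sets viewport + user agent
    await page.goto('https://example.com');
    // fullPage: true captures the whole scrollable page; assumes shots/ exists
    await page.screenshot({ path: `shots/${name}.png`, fullPage: true });
    await page.close();
  }
  await browser.close();
})();
```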
I used PhantomJS because I had previous experience working with the PhantomJS API. I explored headless Chrome, but it didn't support custom headers and footers until very recently (https://github.com/GoogleChrome/puppeteer/issues/373 - 5 days ago).
No, you are not required to use cheerio for web scraping at all. You can use request() and parse the HTML yourself... but realize this: Node.js is not a browser; it doesn't understand HTML or the DOM. People use cheerio because it makes their life easier when web scraping. It's not a full version of jQuery, just the DOM parsing and traversal parts.
Also, if you are scraping more modern websites, where data comes from APIs instead of being rendered on the server, you might want to look into puppeteer (https://github.com/GoogleChrome/puppeteer). Think of it as a scriptable "headless" browser. I have stopped using request+cheerio and moved my scrapers over to it.
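For example, a bare-bones sketch of a puppeteer scraper (URL and selector are placeholders):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // networkidle2 waits until the page has (mostly) stopped making requests,
  // so API-driven content has a chance to render before we read the DOM.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const items = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.item'), el => el.textContent.trim())
  );
  console.log(items);
  await browser.close();
})();
```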
I've had great success with puppeteer for the actual act of web scraping. What you're likely looking for is a headless browser plus an API to programmatically pull out elements. `fetch` requests can work for simple sites, but for more complex ones that use JS or other techniques to asynchronously render content, you need a proper browser engine to parse the page, execute the JS, and load the additional content.
Periodically running the 'fetch' is your best bet for keeping up to date. Pick an interval that makes sense and is 'reasonable': does your University need <10s latency between updates? <10m? Finding that lower limit gives you a guideline for how frequently you need to poll.
Some sites pass a hash back inside an `ETag` header, which gives you a very cheap option: you can make a `HEAD` request that should return just the headers, compare the `ETag` to the one you previously fetched, and refetch the page only if it has changed. This depends heavily on the website you're targeting, so I wouldn't rely on it, but it's a good way to reduce the bandwidth you're consuming.
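A sketch of that check, assuming Node with node-fetch (some sites won't send an ETag at all):

```js
const fetch = require('node-fetch');

let lastEtag = null;

async function pageChanged(url) {
  const head = await fetch(url, { method: 'HEAD' }); // headers only, no body
  const etag = head.headers.get('etag');
  if (etag && etag === lastEtag) return false; // unchanged: skip the refetch
  lastEtag = etag;
  return true; // changed, or the site doesn't send an ETag
}
```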
After you run the 'fetch', you probably want to hash whatever content you've pulled out and compare it to the previous hash. It's cheaper than storing the whole page (although that works too) and is "good enough" for change detection in most cases.
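For instance, with Node's built-in crypto module (a sketch; hash only the slice of the page you care about, to avoid false positives from timestamps and ads):

```js
const crypto = require('crypto');

let lastHash = null;

function contentChanged(content) {
  const hash = crypto.createHash('sha256').update(content).digest('hex');
  const changed = hash !== lastHash;
  lastHash = hash; // store just the digest, not the whole page
  return changed;
}
```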
Make sure you check the source website's robots.txt and Terms of Use. Both can specify terms for how you scrape their site that you should be aware of.
Finally - if it's a small site or another department in your university, consider reaching out via phone or email. They might be willing to open up read-only access to their database directly, which could save you a fair amount of time.
PhantomJS is also out of development now that Google has officially released headless Chrome and Puppeteer. Granted, that is very recent, and it doesn't mean Phantom is unusable. But I imagine people will completely switch over in time.
I use Google headless Chrome w/ Puppeteer running on Lambda for crawling.
Visual differencing (as well as HTML + text differencing) is done on another Lambda function.
I have run into this request so many times because our customers LOVE their PDFs, for whatever reason. My comrades and I have tried repeatedly to get them looking good cross-browser on the front end, and it's just way too labor-intensive.
We've finally settled into doing it server side with puppeteer.
We have a standalone Node REST API service we keep just for PDF generation. We typically use React on the front end, but I see no reason this can't work with Angular. Here's our flow:
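Roughly, as a sketch (assuming Express; the route, limits, and options here are illustrative, not our exact code):

```js
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
app.use(express.json({ limit: '5mb' })); // the front end POSTs rendered HTML

app.post('/pdf', async (req, res) => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setContent(req.body.html);
    const pdf = await page.pdf({ format: 'A4', printBackground: true });
    res.type('application/pdf').send(pdf);
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
});

app.listen(3000);
```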
Pretty simple overall. Puppeteer has a great API and documentation, and it is better at consistently creating well-formatted PDFs from HTML & CSS than any other method we've used. It's minimal effort. Our clients get their PDFs and are happy. We don't have to work too hard to do it, so we're happy too.
An easy one I've been using lately is puppeteer, made by the Chrome team at Google.
https://github.com/GoogleChrome/puppeteer/blob/v1.5.0/docs/api.md
Have a look through their API and see if it has what you are after.
You can find some other examples here https://medium.com/@e_mad_ehsan/getting-started-with-puppeteer-and-chrome-headless-for-web-scrapping-6bf5979dee3e
Good luck!
You have to use something to run a headless browser. Google released puppeteer last I/O.
https://github.com/GoogleChrome/puppeteer
I haven't messed with it myself, but a few people at work have been using it and say it's a dream to use over Phantom or Nightmare.
If you're not testing React, there's no point in using Jest imo. It just makes your life harder with config and gotchas that you don't really need.
I would suggest mocha as a test runner, as it's blazing fast and lightweight, and you can pair it up with a good assertion library to give you more expressive power when writing your tests (see Mocha - Assertions). It also supports async out of the box, which is most likely something you'll need at some point.
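For instance, with chai as the assertion library (a minimal sketch):

```js
// test/example.spec.js -- run with `mocha`
const { expect } = require('chai');

describe('Array#indexOf', () => {
  it('returns -1 when the value is not present', () => {
    expect([1, 2, 3].indexOf(4)).to.equal(-1);
  });

  it('handles async code out of the box', async () => {
    const value = await Promise.resolve(42);
    expect(value).to.equal(42);
  });
});
```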
If you're looking into integration testing (e.g. automatically clicking around to simulate user interactions), I've had good experiences with NightwatchJS. On the same topic, I've heard great things about Puppeteer, which takes away the somewhat painful part of configuring your integration test suite and also exposes a nice testing API.
Hope that helps, and good luck!
Colly doesn't execute JS, so you won't be able to catch AJAX calls. You need something like headless Chrome - take a look at https://github.com/chromedp/chromedp, or https://github.com/GoogleChrome/puppeteer if you are OK using Node.js.
Cheerio is an option; however, it can only scrape sites that are rendered server-side, not client-side. For any SPA built with React, Vue, Angular, etc., you'll need something like Puppeteer.
Ditch Selenium and look at Google's Puppeteer library they released last week. It drives the new headless feature of Chrome with a much simpler API deliberately meant for automation without a primary focus on automated testing.
Their README example is actually about ten lines long and shows how to take screenshots.
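From memory it's roughly this:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
```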
If the website is dynamic you might want to look into a headless browser API like Puppeteer. This will execute the JavaScript needed to "render" the full page. Then you can scrape your target.
Do you have a repo or anything that you'd be willing to share?
I played around with making a similar Wikipedia web scraper in JavaScript. It (should) scrape the grey info box when run as a snippet in Chrome and Puppeteer. I agree it's loads of fun!
I wrote a script that used puppeteer to control a headless Google Chrome instance. It went to the cal poly login page, logged in, then navigated to the status page. It then returned the innerText of the status line element on the page.
If you know Node/JS, it's simple stuff. Otherwise, it might be a bit difficult.
Your error is

> TypeError: "listener" argument must be a function

suggesting there's a function call being made that expects a function but got something else - or nothing at all. It happened immediately after `console.log("immediately after await .goto");`, which is where `on` was being called - `on` being a common alias for `addEventListener` (or similar). In fact, the stack trace includes a reference to `Page.addListener`, which makes sense because `Page` extends `EventEmitter`.

The documentation you're referencing only uses `page.on('domcontentloaded')` as a title to show what's being covered. It is not an actual, usable code snippet. If you look at the top of the page, you can see a real code sample with the correct usage of `on`:
```js
function logRequest(interceptedRequest) {
  console.log('A request was made:', interceptedRequest.url());
}
page.on('request', logRequest);

// Sometime later...
page.removeListener('request', logRequest);
```
Take a look at headless Chrome as well: https://github.com/GoogleChrome/puppeteer. It's not Python, but it should be much simpler to implement a scraper with a specific scenario using headless Chrome.
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagepdfoptions
That's the key part of the API. I have a small Node.js service using Puppeteer and Koa that opens a new tab with the page to be converted, prints it to PDF, then throws the result onto DO's Spaces after closing the tab. I'll probably switch it to AWS S3, but DO was easy since I was already using it.
I recommend headless Chrome / Puppeteer. It's easy to do in Node.js, and you can set the resolution of your rendering using the page.setViewport(viewport) function.
https://developers.google.com/web/updates/2017/06/headless-karma-mocha-chai
According to this, you can access the raw DevTools protocol with `page._connection`, and then you'll have access to e.g. resource timings (enable request interception and listen to `Network.responseReceived` events).
The reason this project was possible is that Google released a headless Chrome API for Node. If Opera had a headless API, then it would be possible. I don't think it does, though...
CasperJS caters mostly for testing but it sounds like you're just after some simple automation.
In that case I would check out some of the new libraries popping up that implement Chrome's DevTools protocol and thus can drive the new (and awesome!) headless feature of Chrome itself.
For JavaScript, check out the official Google API released last week: Puppeteer. For Python there are some in the works too: WebFriend, Chromote.
Still no success. Now I get a different error message: Failed to launch Chrome! TROUBLESHOOTING: https://GitHub.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md. There's nothing in the docs I can identify as useful. I can start Chrome from the CLI with a simple google-chrome command.
I am using puppeteer to run automated testing for our company's app. It's pretty cool figuring out all of the things you can get it to do. Here's the doc if anyone is interested.
Does "download a pdf version" mean the websites are normal HTML, and you want to essentially "print to PDF", or that there are pdfs on the websites and you just want to download them?
One of the standard examples for "puppeteer" is save as PDF, and that library is designed to be used from node.js, but what I don't know is what its characteristics are for running "at scale": does it leak memory, does it close when asked to, how much CPU does that process use per webpage, that kind of thing.
> I'm essentially looking for daily updates on specific information across thousands of websites-no idea if this is realistically possible.
Be aware that the latter half of your question requires a quite different amount of energy than the first half. Getting updates on thousands of websites is absolutely trivial with Scrapy or any number of existing web scraping toolkits. Converting a webpage to PDF, however, requires rendering it, which means you need a full-blown webbrowser. See the difference?
Looks like it's a major version bump for two breaking changes:
>Puppeteer now requires Node.js v8+; Node.js v6 is no longer supported
>
>page.screenshot now clips elements to the viewport (#5080)
I tried a lot of different Selenium implementations a few years ago. I really wanted to like NightwatchJS, but it wasn't very good (it could have improved since then). I ended up going with Python because it just made more sense to me than JavaScript. But we tend to have Python backends, so it made sense.
If you're on a node app, I would look at Webdriver.io or Puppeteer if you're more concerned with functionality than compatibility.
There is no simple way to do this. What I can suggest: if you use puppeteer, or any manager for the Chrome debug protocol, you can prohibit certain requests, by request type or by name. You intercept requests before they are sent to the server and abort them; to the page it will look like the network is down for those requests. You need to find out yourself what to block. Like this one: https://github.com/GoogleChrome/puppeteer/blob/master/examples/block-images.js. You may only need to block a couple of requests, and that would block the whole remaining chain.
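The linked example boils down to something like this (a sketch; choose which resource types or URLs you abort):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    // Abort image requests; to the page they look like failed network calls.
    if (request.resourceType() === 'image') request.abort();
    else request.continue();
  });
  await page.goto('https://example.com');
  await browser.close();
})();
```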
Yeah, you can do that with JS.
You could create a chrome extension, or just run a script in the js console. Or if you have Node, you can use something like puppeteer: https://github.com/GoogleChrome/puppeteer.
Basically you'll need to query the DOM for the select input, loop through the options, and trigger the downloads. You may need to do some async work where you wait for downloads to finish before going to the next.
I've used (and am still using) puppeteer a lot. https://github.com/GoogleChrome/puppeteer The examples are very well done; I also wrote a post about it on my blog a while ago.
https://coding.napolux.com/how-to-scrap-web-page-nodejs-puppeteer/
Do you have any experience with a particular language?
As for automating Chrome - it's probably simpler to use Selenium instead of creating a plugin.
I've not used Powershell - but from a quick search - it does look like you can use selenium from it.
I have used the Python bindings - http://selenium-python.readthedocs.io/
There are also things like puppeteer - https://github.com/GoogleChrome/puppeteer
It's not really a language specific thing - so if you have experience already - you should probably start there.
While it is true that Puppeteer does not have direct access to any variables declared on the `window` object, it is not true that you cannot access them at all. Have a look at the evaluate method. As you can see in the waitForTarget example, inside the function given to `evaluate` you can access `window.open`, so there is no reason why you should not be able to access your custom variables like `window.foo.bar.baz` the same way.
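In other words, something like this (where window.foo.bar.baz stands in for whatever your page actually defines):

```js
// The function runs in the page context, so `window` is the page's window.
const value = await page.evaluate(() => window.foo.bar.baz);
console.log(value);
```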
No idea, as I've never done it, but I would start with these:
The Google Cast Framework API, for Chrome https://www.youtube.com/watch?v=pvyfddIFsVA
and this Puppeteer, an API to control a headless chrome instance https://github.com/GoogleChrome/puppeteer
If you'd like a lot of control, you could use Puppeteer.
> Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
All of the options below are good for scraping websites. I'm just starting to learn Puppeteer and liking it.
C#, Python - Selenium
NodeJS - Puppeteer (By Google)
JavaScript - PhantomJS (no longer updated but still usable)
The one-step-further option would be to pull from the JSON API they call directly and parse it out using one of the languages above. You can see the response in the Network tab when you right-click and inspect the page.
You might check out puppeteer -- It's a nodejs library but pretty simple to pick up. I'm way more proficient at python than js and had no issues banging out what I wanted with some tutorials online. Can't comment on error code though 🤔
You can use puppeteer to automate scraping, taking a screenshot, and it does by running a "headless" Chrome. You might find other ways to do this by searching on Google for "headless" browser automation.
Here is an example of how to take a screenshot with Puppeteer.
Take a look at Puppeteer-Sharp; it should do what you need. It's a .NET API for Puppeteer, which drives a headless browser.
Seconded. Would recommend https://github.com/GoogleChrome/puppeteer as it is the official driver made by Google and is orders of magnitude more performant and less buggy than Selenium.
Follow the tutorial to build a step by step automated bot that will do the tasks you need.
Any plans to support Firefox, as Puppeteer is working on?
Ideally you would have a solution that works for both browsers, because switching between Chrome and FF is useful in case one is faster than the other for a specific PDF (although I'm not sure if the difference would be significant).
This is one of those problems where the complexity of the solution is going to vary a lot depending on the details of your requirements that you haven't specified, to the point where there isn't really a one-size-fits-all tutorial to point you at.
If you just want to draw a one-line text string at a specific X/Y coordinate on a blank canvas of known dimensions, that's really easy. If you want to handle stuff like text wrapping, and adjusting the position and dimensions of multiple text fields relative to each other (e.g. correctly positioning the job description below the title) that makes it somewhat harder. And if you need to handle rich text formatting, doing all of the layout from scratch is going to get quite complicated indeed.
Also, the available tools for something like this will vary depending on what languages you're comfortable with. For really simple stuff, you could use a library like Pillow (formerly known as the Python Imaging Library). If you need lots of layout flexibility, probably your best bet is to convert your text to HTML and then render it using something like Puppeteer, which lets you use the full power of Chrome's rendering engine.
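A sketch of that Puppeteer route (markup, sizes, and output path are illustrative):

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setViewport({ width: 800, height: 400 });
  // Let Chrome handle layout: wrapping, relative positioning, rich text.
  await page.setContent(`
    <div style="font-family: sans-serif; padding: 24px;">
      <h1 style="margin: 0 0 8px;">Job title</h1>
      <p style="max-width: 600px;">A longer description that wraps on its own,
      positioned below the title by normal document flow.</p>
    </div>
  `);
  await page.screenshot({ path: 'card.png' });
  await browser.close();
})();
```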
What you can do is basically this: to make an external API call, you will need to upgrade your Cloud Functions from the fully free tier to the pay-as-you-go tier.
I don't know of any examples that go through the whole flow, but hopefully this will give you a good sense of how to build it.
I would avoid manually copying it, though it might be fine; make sure you tell them in a comment or something, either way.
I would use puppeteer to fetch the data, or selenium.
https://github.com/GoogleChrome/puppeteer/blob/master/README.md
Follow the instructions on github or maker-tutorials.com
I use puppeteer
You might try using Puppeteer to render a PDF from the web page: https://github.com/GoogleChrome/puppeteer/blob/master/examples/pdf.js
Puppeteer might not be needed, though. You mention that you already have the page looking good in HTML. If it's just a one-off task, you might just need a print stylesheet to ensure that the page prints nicely when you manually save it as PDF from the browser: https://www.smashingmagazine.com/2011/11/how-to-set-up-a-print-style-sheet/
I can probably do this in a couple of hours if you want to contract it out to me. Find my email on my personal site: https://kayce.basqu.es
For web scraping I've used puppeteer, Google's scraper; it's pretty easy to get going with. If what you actually want is to build your own website and transpile it into a static website, there are many static site generators depending on the language you used, like Gatsby for React websites.
PowerShell is not the tool for every job; the thing you're looking to do requires executing JS. I would recommend you look into Puppeteer. With this library you write simple JS that controls a headless Chrome (headless Chrome is all the rage these days).

If you want to do this properly, I'd learn the tiny bit of JS required, and make a Node CLI app that writes the data you want to stdout as JSON for PowerShell to parse, then resume your PowerShell operations from there.

EDIT: I saw you can only do this in pure PowerShell. I would then recommend you set up a server elsewhere that can run Node.js and do the same thing as suggested above: output JSON that your "pure PowerShell script" can consume.
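The Node side of that split could look like this (a sketch; the script name, URL, and fields are made up):

```js
// scrape.js -- prints a JSON object to stdout for PowerShell to consume
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const data = await page.evaluate(() => ({
    title: document.title,
    heading: document.querySelector('h1') && document.querySelector('h1').textContent,
  }));
  console.log(JSON.stringify(data)); // only JSON goes to stdout
  await browser.close();
})();
```

Then on the PowerShell side it's just `node scrape.js | ConvertFrom-Json`.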
To add on to this, if you are proficient in JavaScript + Node.js you can use Puppeteer which is a headless Chrome API like /u/bagelmountain mentions. Fairly easy to use
puppeteer is essentially a library that lets you automate regular user actions in the browser, but with Node.js. It drives headless Chrome.
It works well with pages that are dynamically loaded (e.g. AJAX, or SPAs like React/Vue-based websites).
If you can't use an API I'd look for a very friendly scraping tool. I really like puppeteer: https://github.com/GoogleChrome/puppeteer but it's for Javascript and uses headless chrome.
I have no problem with cheerio/jQuery. Haven't seen easier ways to work with XML documents in JS so far.
What I do have a problem with is using axios (or any other HTTP client like request) for this. At first this might seem like a good idea because, well, you need to crawl an HTTP resource.
But the thing is that a lot of (not all) websites these days are aware of being scraped and have implemented ways to prevent it. Medium's strange hashes in the URL come to mind. Also, your target site just has to sit behind something like Cloudflare, and that makes it inaccessible to a simple HTTP client too.
A headless/automated browser is the solution. See puppeteer, for example.
There's nothing wrong with cheerio. The arguments here about jQuery have to do with the frontend.
I would advise against PhantomJS because it hasn't been supported for a while, but if you need to simulate a browser, Google's Puppeteer or NightmareJS are decent options. These aren't really solving the same problem, though: if you're scraping static data and don't need to execute any JS on the page, then stick with cheerio.
The Puppeteer project has a Docker container/instructions. It's specifically intended for testing.
Look for "Docker" on this page to get started: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md
Hey, yeah you're spot-on for the intent!
Appreciate the github suggestions as well, I'll write a more verbose description and better README for the next version.
The site-selection algorithm is a placeholder; it's literally just random. It might work for now because there's unlikely to be much filtering by the tech giants, but that would change if this idea got any kind of traction. Ideally you'd want to train a model on how people browse and use that to trick the filters.
I think puppeteer can be used to have the fake browser act more like a real person. The perfect version of this app would mimic mouse and keyboard behavior on the page. I don't have any idea how to do that yet.
Thanks for the well considered feedback!
You need to call the selector on `page` using this function. If you use `$` by itself, you aren't pointing at the page you loaded.
Example:
```js
let numItems = await page.$$('h2.s1okktje-0'); // Don't use .length here
console.log("The number of items with class 's1okktje-0' is ", numItems.length);
```
Take a look at this thread: https://github.com/GoogleChrome/puppeteer/issues/478. I don't know if the answer is there, but my first guess was that puppeteer should allow this. From scrolling through the issue it might not be that simple, but maybe one of the packages listed there could help you.
I would use puppeteer to scrape the site as it appears the data you want is dynamically loaded. If you have any specific question about how you might use puppeteer, feel free to ask, though I am not an expert.
The API doc for Puppeteer is pretty good.
I quite recently made an app with Puppeteer. Here's one before I changed it to use classes. Maybe it can help you a bit?
If you just want to get the elements' data, then you can use $$eval to select the elements and iterate over them.
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pageevalselector-pagefunction-args
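For example (the selector is a placeholder):

```js
// Runs the callback in the page against all matches and returns the result.
const titles = await page.$$eval('h2.result', els =>
  els.map(el => el.textContent.trim())
);
```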
Would need to see the full code about the errors.
Hi!
I would check this issue raised on Github for different ways to address this issue.
https://github.com/GoogleChrome/puppeteer/issues/1149
Btw, if the solutions in that link don't work for you, please share more about your specific circumstances, as those attempts in the GH link would likely be my first guesses as well.
Good luck and let me know if this resolves the issue.
I wouldn't dare put up my experimental Casper.js scripts. Spaghetti but functional.
If you want to give it a go, there are quite a few options; I'd probably pick NightmareJS, which is supposed to be simpler than Phantom/Casper.
https://github.com/segmentio/nightmare
or you can always try googles puppeteer: https://github.com/GoogleChrome/puppeteer?ref=stackshare
Node is actually very accessible and easy to learn if you already know JavaScript. There are examples provided on the Puppeteer Github repo:
https://github.com/GoogleChrome/puppeteer/tree/master/examples
Install Node, then to run the examples open your terminal in the examples directory and run:
node ${script-name}
The screenshot example looks pretty simple and a good entry point, so that would be:
node screenshot.js
Headless Chrome. You use Chrome's DOM rendering engine and then consume the result. That way you get exactly what the page generates.

And that opens up a whole world... puppeteer, the DevTools protocol, etc... I used browserless in a Docker container because it gave me everything ready to go, but you could also run it on your own machine.
Should definitely be able to pull the PickCenter info. I wasn't sure how far back it went, or even whether it kept open/close lines for completed games, so that's good to know.
For things that would need to be scraped (i.e. not through an existing API), I've traditionally used something like Puppeteer (which is a library for using headless Chrome) or request-promise in conjunction with cheerio. I know a lot of people here use Python, though, so I know that's not super helpful. If I had the URLs they were using to pull that data, then it should be relatively easy to create something similar.
It should be pretty straightforward to save an html document as a pdf with chrome headless via puppeteer - here is an example: https://github.com/GoogleChromeLabs/puppeteer-examples/blob/master/element-to-pdf.js
I noticed that import.io is not free anymore (it used to be). In that case I suggest you create a crawler of your own in your favorite programming language.
I mostly use JavaScript:
https://github.com/GoogleChrome/puppeteer
Or PHP for this:
http://docs.guzzlephp.org + https://symfony.com/doc/current/components/dom_crawler.html
Plenty of reading material online on how to build one. Also, don't actually use that data in production, because you will get hit with the holy SEO sledgehammer of Google due to duplicate content and whatnot.
Sure -- looks like their documentation has been updated since the last time I tried.
I basically had to put together a subset of that example Dockerfile in order to make it work inside my node.js test runtime, since we're running our tests inside a container inside docker-compose.
This was relevant, too.
> ...Puppeteer is bundled with Chromium--not Chrome--and so by default, it ...... (However, it is possible to force Puppeteer to use a separately-installed version of Chrome instead of Chromium via the executablePath option to puppeteer.launch...
I started scraping with Perl, regex, and split functions before most parsing libraries were known. Later I started with Python in combination with BeautifulSoup and Selenium. Selenium is too unstable for production use and no active development is done. Now I work with Puppeteer, and that looks promising.
It's more bizarre how OP plans to monetise a feature that is provided by the headless Chrome API. See here. I suppose writing an API wrapper is now a business of its own.
I found some people at my work using PhantomJS or Puppeteer. It seems to be not as easy as I would like (a simple library would have been nice), but that may work. I'll try that.
>Cheerio is not a web browser
>Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, you should consider projects like PhantomJS or JSDom.
This is from their GitHub. Basically you can't scrape anything that isn't a server-side rendered website, which makes cheerio a poor tool for scraping client-rendered sites.
https://github.com/GoogleChrome/puppeteer
This is what you actually want. Edit: or, well, PhantomJS as they recommend, but headless Chrome is better.
The first one. I just want to make hotkeys for myself to use and have functions run based on what key I pressed.
The Keyboard API is in the documentation, but the examples aren't helping me.
https://bpaste.net/show/9807abd3b440
I tried this but nothing happens when I press the key.
I've always used the wkhtmltopdf binaries from their website and never had any problems (used it for 5+ years, across the last 3 Debian versions).
Apart from that, I would really recommend headless Chrome / Puppeteer as a driver for HTML-to-PDF. While wkhtmltopdf was for a very long time the least bad converter around, it has many quirks. Using Chromium and Puppeteer as the converter makes it a lot easier to test your printable files.
Yes, in theory, but in practice it's much more difficult
IIRC the page reloads, so a bookmarklet will not work, as it would need to be rerun every time the page loads. If that's wrong, though, a bookmarklet could do it easily.
May be possible with a GreaseMonkey/TamperMonkey/etc user script + using cookies or some form of web storage to keep track, and would be able to run in a default browser without the user needing to login, but it would need a bit of state-management code to decide when it should continue running after page reload, and when it's finished running.
Puppeteer (a headless or full [e.g. with GUI] version of Chrome or Chromium useful through NodeJS) could do it but would require users to enter their login info (it may be possible without doing so by copying cookies, but it would not be a great idea to use a hack). It can click, type, etc, just as if a human were doing the actions.
node-fetch as the web request/response handler and cheerio as a jQuery-like library for parsing/selecting elements from the HTML response, to drive the logic that decides what to do, would also handle this pretty well, and would likely end up much lighter and slightly faster than Puppeteer.
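A sketch of that lighter approach (URL and selectors are illustrative):

```js
const fetch = require('node-fetch');
const cheerio = require('cheerio');

(async () => {
  const res = await fetch('https://example.com');
  const $ = cheerio.load(await res.text());
  // Decide what to do based on what's in the markup -- no browser needed.
  $('table tr').each((i, row) => {
    console.log($(row).find('td').first().text().trim());
  });
})();
```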
The backend uses Puppeteer to login on your behalf onto Waterloo Works and grab whatever data you requested. Nothing is stored and the session is killed after the response is sent. I didn't find any headless browsers that work with React Native. That's why there's a backend, also others can also use it to build their own software.
I've used wkhtmltopdf for this purpose before, but now I'd use puppeteer. Make a basic html page styled the way you want (use the css print media) and use Puppeteer to print the page to PDF.
Don’t use phantomJS! Use headless Chrome and a library like puppeteer (https://github.com/GoogleChrome/puppeteer). Its API should lend itself well for scraping (and potentially saving cookies so you’ll keep your auth session).
If this is something you’re looking to productionalize you might also find https://browserless.io/ useful as well (full disclosure I’m the author/founder of that).
And multiple viewport widths for responsive screenshots! To be honest, this is really more of a job for headless mode.
For Chrome, I use puppeteer. I already use it to grab screenshots in mobile, tablet and desktop. It seems like @2x would be the next step.
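If I'm reading the API right, @2x should mostly be the deviceScaleFactor option on the viewport (a sketch):

```js
// Render at 2x pixel density, like a Retina display.
await page.setViewport({ width: 1280, height: 800, deviceScaleFactor: 2 });
await page.screenshot({ path: 'desktop@2x.png' });
```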
It seems like it would be an easy change.
You'd want to take the HTML as a POST parameter and then change the call to page.goto(opts.url, opts.goto) in core.js to page.setContent(opts.html).
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagesetcontenthtml
I'm not sure whether it's better than WebdriverIO or not. But the web automation testing community is moving fast towards headless Chrome. PhantomJS and Selenium IDE, previously used for this task, have been discontinued. Those tools were also used for web scraping modern websites with client-side rendering, so over the next few months developers will be moving quickly to headless Chrome even for web scraping. You can gauge its popularity by the stars the Puppeteer repository has acquired in just under 10 days since its release: https://github.com/GoogleChrome/puppeteer
(BTW: Puppeteer is an npm package with APIs for Chrome.)
I don't know how those libraries work, so I can't tell you the benefits compared to them, sorry :(
I can speak to the strong points of exquisite, though (of course, exquisite is a WIP and in beta stage!).