> scraping off an app is different to HTML scraping
for sure, yes, but in my experience it can also be much, much easier because it is highly unlikely that your target will send down presentation stuff (i.e. HTML) to the app -- they will send down only the data, which is what you wanted to begin with.
That said, there are different hurdles to overcome when going after app data: authentication is almost surely involved, there could be rate limiting per login, and they are (strangely) able to change the format or data sent down almost arbitrarily, which isn't typically true for web targets.
> I intend to use an extension to run the android app
I'm not certain what that means, but I guess so long as you know and are comfortable with it, then try it out. My experience has been a mixture of man-in-the-middle interception and decompiling the app to learn the URLs and any auth schemes. But, just like going after a web target, almost every job differs.
I don't at all mean to dissuade you from the app-centric approach, but also be sure to look at any XHRs on their current website, as it may very well be sending down the JSON you want without the authentication and other hurdles you may be able to avoid. It can be the best of both worlds: just the data, thankyouverymuch, but without all the energy expended learning the app's URLs and responses.
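To make that concrete, here's a minimal sketch of hitting such an XHR endpoint directly with Python's requests library. The URL and parameters are made up; you'd substitute whatever shows up in your browser's network tab.

    import requests

    # Hypothetical endpoint and parameters -- substitute whatever XHR
    # URL the browser's network tab shows when the page loads its data.
    url = "https://www.example.com/api/jobs/search"
    params = {"q": "python", "start": 0}

    resp = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    data = resp.json()  # the same JSON the site's own frontend consumes
    print(data)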
Thanks for your answer. I checked their robots.txt (http://www.indeed.com/robots.txt) and found that the directories for what I want to scrape are disallowed. However, I have not seen any clear statement about it in the terms and conditions. Do you mind suggesting what to look for in their terms and conditions? https://www.indeed.com/legal
Does disallowing it in their robots.txt, without any clear statement in the terms, mean that scraping them is illegal?
Sorry if my questions seem repetitive. Really appreciate your answer.
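As an aside, if you just want to check programmatically which paths a robots.txt disallows, Python's standard library can read it for you. A small sketch; the job-search URL below is only a placeholder, not necessarily a path you care about.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.indeed.com/robots.txt")
    rp.read()

    # "*" is the generic user agent; the path below is just an example.
    print(rp.can_fetch("*", "https://www.indeed.com/jobs?q=python"))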
Here's a way to scrape Twitter data in 5 minutes without using the Twitter API, Tweepy, Python, or writing a single line of code: an automated web scraping tool, Octoparse.
As Octoparse simulates human interaction with a webpage, it allows you to pull all the information you see on any website, such as Twitter.
For example, you can easily extract the tweets of a handle, tweets containing certain hashtags, or tweets posted within a specific time frame. All you need to do is grab the URL of your target webpage and paste it into Octoparse's built-in browser. With a few points and clicks, you will be able to create a crawler from scratch by yourself. When the extraction is completed, you can export the data to Excel, CSV, HTML, or SQL, or you can stream it into your database in real time via the Octoparse APIs.
Does "download a pdf version" mean the websites are normal HTML, and you want to essentially "print to PDF", or that there are pdfs on the websites and you just want to download them?
One of the standard examples for "puppeteer" is save-as-PDF, and that library is designed to be used from Node.js, but what I don't know is what its characteristics are for running "at scale": does it leak memory, does it close when asked to, how much CPU does the process use per webpage, that kind of thing.
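For what it's worth, here's a rough Python equivalent using Playwright rather than puppeteer itself, since I can't speak to the Node.js side. Same idea: render the page headlessly, then ask the browser to save it as a PDF. The URL is a placeholder.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()      # headless Chromium by default
        page = browser.new_page()
        page.goto("https://example.com")   # placeholder URL
        page.pdf(path="example.pdf")       # PDF export is Chromium-only
        browser.close()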
> I'm essentially looking for daily updates on specific information across thousands of websites -- no idea if this is realistically possible.
Be aware that the latter half of your question requires quite a different amount of energy than the first half. Getting updates on thousands of websites is absolutely trivial with Scrapy or any number of existing web scraping toolkits. Converting a webpage to PDF, however, requires rendering it, which means you need a full-blown web browser. See the difference?
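To give a sense of the first half, here's a minimal Scrapy sketch. The URLs and the CSS selector are placeholders, but fetching thousands of pages and pulling one field from each really is about this much code.

    import scrapy

    class DailyUpdateSpider(scrapy.Spider):
        name = "daily_updates"
        # Placeholder URLs -- in practice you'd load the thousands of
        # sites from a file or a database.
        start_urls = [
            "https://example.com/news",
            "https://example.org/status",
        ]

        def parse(self, response):
            yield {
                "url": response.url,
                # Stand-in for whatever "specific information" you're after.
                "title": response.css("title::text").get(),
            }

Run it with something like "scrapy runspider daily_updates.py -o updates.json", and a cron entry covers the "daily" part.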
Hi, it's me again :-) I was on my phone at the time and couldn't load your link, and I hoped someone hanging out here would be able to help you.
Now that I've seen the content, I'm sorry to say it's just going to be a grind. They are one of the few websites left in the world that doesn't use a JavaScript API (which would have made loading the data super, super easy), and the site is so old that there aren't any meaningful labels in the page source that would give away the "field" versus the "value," at least not in a way a computer can easily tell. For example, the "Mailing Address" label spanning two table cells is the kind of irregularity that drives computers crazy. Then the table underneath the main one switches from horizontal label-value to vertical label-value. That kind of stuff.
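Just to illustrate the grind: with something like BeautifulSoup you end up walking the table rows yourself and special-casing the irregular ones. The URL and selectors below are placeholders, not their actual page.

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.gov/record/123").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    fields = {}
    for row in soup.select("table tr"):
        cells = row.find_all("td")
        # The easy case: a clean horizontal label/value row.
        if len(cells) == 2:
            fields[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
        # Rows where a label spans two cells, or where the layout flips
        # to vertical label-value, each need their own special case here.

    print(fields)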
But the good news is that there doesn't appear to be very much hidden content, by which I mean data that is only in the page source.
I do hear you that programming is not your strong suit, but take a look at Scrapely and its friend Portia and see if any of the words make sense. It's hard to judge if those links are interesting, helpful, or just intimidating, because I don't know your background.
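In case it helps the words make sense: the whole idea of Scrapely is that you show it one example page with the values you want, and it learns a template it can apply to similar pages. Roughly like this -- the URLs and field names are made up:

    from scrapely import Scraper

    s = Scraper()

    # Train on one example page by pointing at values you can see on it...
    s.train("https://example.gov/record/123",
            {"name": "ACME LLC", "mailing_address": "123 Main St"})

    # ...then apply the learned template to other, similar pages.
    print(s.scrape("https://example.gov/record/456"))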
Separately, there have been several products/browser extensions/etc. that have claimed to do point-and-click page extraction, but I don't have enough experience with them to recommend one over another.
But, as I mentioned before, feel free to come back and ask more questions. This stuff really is good fun and really empowering; it just takes a little getting used to asking the computer in the right way.
Actually, I just realized a tool I used heavily a little while back now does almost exactly this: https://repl.it
However, you must sign up, and it's not exactly stable/reliable from what I remember. My account got reset and all my data disappeared, and I've had to change the password more than once because it stopped working.
I'm still open to suggestions if anyone has any :)