So here's a slightly better way to go about this. It goes a bit outside of Python, but it's a super useful skill to learn.
The idea is to find the data being sent to the page through its internal API, rather than trying to scrape the page itself.
I went to the site and used Chrome's inspector to look through all of the other requests the page made until I found something that looked like it returned only data, preferably in JSON format. (EDIT: You can also filter by XHR requests only to find the most likely candidates.) It looks like this:
http://imgur.com/9mY7t7Q http://imgur.com/6IrEYm4
Notice one thing in particular: it's a POST request, not a GET request. We can then access the companies like so:
    import requests

    r = requests.post('https://siftery.com/product-json/microsoft-outlook')
    data = r.json()
    content = data['content']
    companies = content['companies']
    print(companies.keys())
Always better to use data that is already structured if it's available!
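Since that endpoint may change or disappear, you can exercise the same nested-access pattern offline with a stand-in payload. The `content` -> `companies` nesting below matches the response shown above; the company entries themselves are invented for illustration:

```python
import json

# Stand-in for the /product-json/ response: the "content" -> "companies"
# nesting comes from the real response, but these entries are made up.
sample = json.loads("""
{
    "content": {
        "companies": {
            "acme-corp": {"name": "Acme Corp"},
            "globex": {"name": "Globex"}
        }
    }
}
""")

# Same access pattern as the live request above.
companies = sample['content']['companies']
print(sorted(companies.keys()))
```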
DynamoDB (DDB) and MongoDB (MDB), as products, are pretty different. One is an open-source application that can be run anywhere, while the other is proprietary and can only be run on AWS. DDB is expensive in terms of throughput (reads/writes), while MDB is expensive in terms of storage. The tech behind them is comparable, but MDB clearly wins when it comes to marketing, which is where it matters for the most part in this case.
>Both are fully managed and pay as you go, great for startups.
Being "fully managed" and "pay as you go" doesn't make it good. Practical? Possibly. But certainly not good. DDB, for instance, has some pretty major issues that have caused companies to move away from it, while MDB is doing extremely well.
> Their support contracts are the only real revenue stream, but again as ppl move to major cloud solutions that support is included by cloud provider for their version. So CFO gonna ask why am I paying twice?
Paying twice for what? MDB is free. Companies are willing to pay for enterprise support; see Red Hat and CentOS.
>They need to stop bleeding money and get cash flow positive. That does not appear to be in cards anytime soon.
Many companies could turn profitable at any time but prefer to reinvest that money into the company in order to grow. Uber, Reddit, and Spotify are all currently unprofitable because they're focusing on growth. Amazon was predicted to fail within 10 months because of its massive losses, and that was 17 years ago.
Same concept, just a different URL. You still don't need to sign in, since you aren't scraping the page retrieved by the GET request, but rather the JSON returned by the POST request.
    r = requests.post('https://siftery.com/company/sharethrough')
    data = r.json()
    products = data['response']['products']
    for product_name in products:
        print(product_name)