Python is definitely a good language for that. A few people mentioned BeautifulSoup, which can be used to parse HTML. If you want a full scraping framework you should check out scrapy.
There are two different projects:
Scrapy | A Fast and Powerful Scraping and Web Crawling Framework http://scrapy.org/
Scapy is a powerful interactive packet manipulation program www.secdev.org/projects/scapy/
Good for you! Python is a great language to do web-scraping with. In terms of webscraping tools, I would check out either Scrapy or Beautiful Soup.
They both have pretty good tutorials that will get you started. The tutorial I followed with Beautiful Soup is here: http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
Good luck!
For Python, if you want to scrape I suggest looking at Scrapy or BeautifulSoup.
I have used BeautifulSoup in the past and it was pretty awesome.
It's funny that you should mention wisdom and necromancy, that's totally my thing.
I hate to tell you this, but there is a popular web scraping tool called "Scrapy"; you might actually want to use it for your project.
API design is all about finding the correct abstractions and paying attention to scope. I would suggest creating a TheTVDB api first, and then build a command line / gui program to go along with that. The renaming files bit might fall outside of the scope of a pure TheTVDB api. I could imagine TheTVDB api might have functions like thetvdb.get_series(), and series.get_episode().
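A minimal sketch of what that separation could look like; all the names, the endpoint URL, and the response fields here are hypothetical placeholders, not the real TheTVDB API:

    # Hypothetical sketch of a thin TheTVDB wrapper; the URL and field names
    # are made up for illustration, not taken from the real API docs.
    import requests

    class Series:
        def __init__(self, api, series_id, name):
            self._api = api
            self.id = series_id
            self.name = name

        def get_episode(self, season, number):
            # Delegates the HTTP details back to the api object.
            return self._api._get("/series/{}/episodes".format(self.id),
                                  params={"season": season, "number": number})

    class TheTVDB:
        BASE_URL = "https://example-tvdb-api.invalid"  # placeholder

        def _get(self, path, params=None):
            resp = requests.get(self.BASE_URL + path, params=params)
            resp.raise_for_status()
            return resp.json()

        def get_series(self, name):
            data = self._get("/search/series", params={"name": name})
            return Series(self, data["id"], data["name"])

The file-renaming logic would then live in the command line / GUI program that uses this wrapper, not in the wrapper itself.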
Also remember that in Python we generally use function names_with_underscores() as opposed to lowerCamelCase().
Finally, consider this function to use with the levenshtein function, so that the user can compare based on the percentage likeness of two strings, rather than the number of changes, insertions, and deletions, which is hard to conceptualize:
    def compare_strings(a, b):
        """
        Makes the levenshtein into a simple little coefficient, so that it can
        be rated as a 0 to 1 value.
        """
        coef = 1 - levenshtein(a, b) / float((len(a) + len(b)) / 2)
        if coef < 0:
            return 0
        return coef
https://github.com/fabpot/goutte is a simple webscraper from the maker of Symfony 2. It has great documentation and should be pretty easy to use.
If you need more advanced features I would use http://scrapy.org/ which is a scraping framework for Python, and it's extremely powerful and extensible.
What do you currently have? You also don't need to use your machine remember. If you learnt something like Scrapy you could host your project on Scrapinghub for a small fee or in some cases completely free.
That's for the code itself, but also worth pointing out that Scrapy solves the web crawling issue very well. Very sadly it does not fully support Python 3 (PRs welcome hint hint), but they are working on it.
If you want to start over:
If the sites you're scraping require JavaScript, use selenium.
If they don't require JavaScript but the job is complex, use scrapy.
If they don't require JavaScript and the job is simpler, use BeautifulSoup.
Web crawling is complex enough that you probably don't want to roll your own if you don't have to. Take a look at scrapy, which is a robust and well-maintained library for web scraping.
Try this. It might take a little while to get the hang of, but once you understand it you'll be in good shape. You'll scrape the page into CSV and then you can import it into whatever spreadsheet or DB you want.
yes! agreed. our urls are too long. we got a fix for that on the way so hold tight. galleries are in the works too, but we're still planning.
Which gifs are yours? Happy to properly source them to you. We're big on pushing gif artists.
As for the script, you mean our crawler? It's not open-source but is built in python and php.
If you're looking to build a crawler, check out
http://splinter.cobrateam.info/
and
HTH!
Until it is out you can use PhantomJS or Scrapy. Scrapy is very similar to this: you specify what you want to parse with XPath or CSS queries, name the data fields, and you get JSON/CSV/XML out. You can also specify which links it should follow. The best part of Kimono, IMHO, is graphically specifying what you want to scrape.
What you're describing is called web scraping and python does really well at it. You can use beautifulsoup or scrapy to scrape the pages.
http://www.crummy.com/software/BeautifulSoup/
The book Automate the Boring Stuff with Python has a chapter on web scraping using BeautifulSoup.
It's not really any harder than threads. In many ways, it's easier. No locking necessary, no "zombie" coroutines.
It used to be complicated before coroutines and Twisted's @inlineCallbacks came along, and really complicated when you only had async_chat…

I recently started playing around with asyncio and aiohttp (I couldn't resist trying the new async and await keywords), and it's pretty straightforward in practice. The main problem I found was that too much documentation (tutorials, too) assumes you already know all about event loops and how the whole async thing works. If you don't, it can all appear more complicated than it really is.
The only bits that might take a bit of work to get your head around are the server-/GUI app-like event-based execution model (although you can also use event loops like thread pools) and the whole "futures" business (which is no different to waiting on threads, really).
I'd recommend looking at Scrapy. It's based on Twisted, not asyncio, but the principle is exactly the same, and Scrapy is a marvellous example of (a) what async networking is great for, and (b) exactly how to write async software.
Whether asyncio or similar is a better choice than threads/multiprocessing is largely a question of how many IO streams it's useful to be able to handle simultaneously. I wrote an RSS aggregator based on Twisted, and I benchmarked it at 1800 feeds downloaded (not parsed) in 45 seconds, which is 40 feeds/second.
Try managing that with threads and/or multiprocessing…
(You can combine asynchronous IO with parsing in multiple threads/processes for maximum performance.)
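For example, here is a minimal asyncio + aiohttp sketch that downloads a handful of feeds concurrently; the URLs are placeholders and error handling is deliberately left out:

    # Minimal concurrent-download sketch with asyncio + aiohttp.
    import asyncio
    import aiohttp

    FEEDS = ["https://example.com/feed1.xml", "https://example.com/feed2.xml"]

    async def fetch(session, url):
        async with session.get(url) as resp:
            return url, await resp.text()

    async def main():
        async with aiohttp.ClientSession() as session:
            # gather() runs all the downloads on one event loop, concurrently.
            results = await asyncio.gather(*(fetch(session, u) for u in FEEDS))
        for url, body in results:
            print(url, len(body))

    asyncio.run(main())

The parsing step can then be handed off to a thread or process pool so the event loop stays free for IO.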
> Python framework is a general question. Django is a web-framework.
Really interesting, what you said. I know of only the Scrapy framework (http://scrapy.org/) - are there any other Python frameworks that are not web frameworks?
It really depends on how much you need your crawler to do. As a beginner you might be better off trying scrapy, which handles a lot of the details for you, so you can focus on getting what you need from the content of the pages. If you write your own, you'll have to handle those details yourself (which can get pretty involved).
From a skim of OP's link, this looks like a pretty basic crawler, which just grabs links. How long it would take to write will depend mostly on your understanding of the major components involved: concurrency (asyncio in this case), the HTTP protocol, HTML, etc. That said, writing one is a good way to learn :)
Don't worry about how long it would take. The best motivator will be a project that interests you.
You're just looking for better/more concise ways to scrape web data?
Man, you've been working yourself too hard. Check out these two libraries.
While not an answer to your question, I suspect you'll find some level of relevant discussion over on the scrapy-users list. If nothing else, it might be worth your while to ask your question there, since that audience will be more scraping oriented than /r/datasets. Plus Scrapy is an amazing framework and I encourage every non-trivial scraper to learn it.
Alternatively, you could just ask the question (here or over there) that you would expect a hypothetical "/r/scraping" would discuss.
First off, I don't know the answer. That being said, this is The Internet... where not being knowledgeable about something rarely stops anyone from sharing their opinion anyway...
Personally, I plan to use Scrapy for any web crawlers I write in the future. Is there a reason you're not just using it?
If your Python script uses Scrapy, you can ~easily integrate the Splash headless browser with scrapy-splash. I used it for a project lately and it's really handy for javascript-built pages.
Things occasionally get a bit hairy if you need to fetch data outside the rendered html, but Splash has a baked-in scripting environment that sandboxes the browser tab neatly; it's in Lua, and you can inject Javascript into the page and recover page-context javascript. That proved handy when I needed to grab the URL of an XMLHttpRequest.
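For reference, a minimal scrapy-splash spider sketch; it assumes a local Splash instance and that the scrapy-splash middlewares are enabled in settings as described in its README, and the URL and selector are placeholders:

    # Minimal scrapy-splash sketch; assumes Splash is running locally and the
    # scrapy-splash downloader/spider middlewares are configured per its README.
    import scrapy
    from scrapy_splash import SplashRequest

    class JsPageSpider(scrapy.Spider):
        name = "js_page"
        custom_settings = {"SPLASH_URL": "http://localhost:8050"}

        def start_requests(self):
            # 'wait' gives the page's JavaScript a moment to finish rendering.
            yield SplashRequest("https://example.com/js-built-page",
                                self.parse, args={"wait": 1.0})

        def parse(self, response):
            # response now contains the rendered HTML, not the bare source.
            yield {"title": response.css("title::text").get()}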
Find out if the sites have public APIs that might make accessing that data easier, for example: 'https://somenewssite.com/api/v1/articles/latest.json'
If they don't, then what I would do is write a program that periodically downloads their webpages and then scrapes those pages for links.
Alternatively, a library like http://scrapy.org might make this process easier.
You could then save the new links in a database
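A rough sketch of that periodic approach, assuming requests + BeautifulSoup and an SQLite table for the links; the site URL and link selection are placeholders:

    # Rough sketch: download a page, pull out links, and store any new ones
    # in SQLite. Run it on a schedule (cron, systemd timer, etc.).
    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    db = sqlite3.connect("links.db")
    db.execute("CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY)")

    def collect(page_url):
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select("a[href]"):
            # INSERT OR IGNORE keeps previously seen links from duplicating.
            db.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (a["href"],))
        db.commit()

    collect("https://somenewssite.com/latest")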
Hey,
I suggest you look into web scraping (a way to extract data from websites that don't have an API); it will help you with collecting the data you want from the hp-lexicon website.
That said, Python has a great scraping framework called Scrapy, and Ruby has one called Nokogiri.
Ruby and Python should make it easy to collect this data and turn it into JSON for your API.
> To learn the API provided by the website is going to very specific, not in general
That's not true: most APIs are based on the same basic principles (REST APIs, for instance), so time spent learning how to use a specific API is also time spent learning those common principles.
Like /u/thetechfreak said in his comments, "If they do provide an API always, always use that instead of scraping the website": you will write less code, the API will be documented, and you won't have to reverse-engineer the HTML's structure and deal with a million special cases.
Bonus : have a look at scrapy !
Do you have mad tech chops?
Right off the bat, I'm thinking of a web crawler. Just write a spider in Python (Scrapy framework): provide the names of actors in an index array, iterate through the array and concatenate each one onto the wiki URL, pass that as a param/directive for your crawler, target the info you're looking for with a CSS selector, and dump it into a DB or CSV.
Edit:
Alternative, slow solution: create an index of actors, loop through it in a bash script, and use wget to download the wiki pages. Parse the raw text/HTML for a CSS selector or a keyword, and dump to an empty text file.
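A rough sketch of the Scrapy route described above; the actor list, URL pattern, and selector are illustrative placeholders, and cb_kwargs needs a reasonably recent Scrapy:

    # Rough sketch of the spider approach above, not a tested crawler.
    import scrapy

    ACTORS = ["Alan Rickman", "Maggie Smith"]  # placeholder index of actors

    class ActorSpider(scrapy.Spider):
        name = "actors"

        def start_requests(self):
            for actor in ACTORS:
                url = "https://en.wikipedia.org/wiki/" + actor.replace(" ", "_")
                yield scrapy.Request(url, self.parse, cb_kwargs={"actor": actor})

        def parse(self, response, actor):
            # Grab whatever info you're after with a CSS selector.
            born = response.css("span.bday::text").get()
            yield {"actor": actor, "born": born}

    # Run with:  scrapy runspider actor_spider.py -o actors.csv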
No. I'm using Scrapy (http://scrapy.org/). The next version of the site will have short excerpts and photos. Right now it's just an aggregator, but the real plan is to gather all of the links over extended periods of time and analyze the content using natural language processing to be able to determine trends in local crime coverage.
Just some random thoughts:
there is Scrapy, a Python framework for scraping
always scrape into an intermediate format like XML or JSON, and only have your site consume either the XML or JSON, or have an intermediate process populate a database with your scraped data. This is so that you can separate the scraping from your site's functionality.
always try to use official APIs first; if none exists, send an email and ask.
make sure that you understand the legal consequences of your actions.
> Why not run a script that downloads the page using a browser like FF, and then have the script copy out image files.
I'm not sure I understand the difference between that and Scrapy, which is what I'm using.
> Or for that matter, why didn't you just grab the images off the servers directly, and ignore the HTML completely?
Because the servers are junked up with other files, and the images may be named something useless like a20b9sss.jpg. On the web pages, they are surrounded by crucial context, like the name of the product. They also link to the product page which contains the larger image, an association I'd be hard-pressed to make on a file system (given the bad names).
They're also on our 200GB Dropbox in no particular directories and no particular order, so I wouldn't even technically have to go out on the web ;)
[EDIT]: For bonus points, this way I also get the store associations for free. (Which storefronts had which products.)
It really depends on what your goals are; but if you're looking for industrial strength web scraping you could do worse than getting to know the scrapy web scraping framework.
It focuses mostly on using XPath expressions (you can call BeautifulSoup or other parsers if you want).
I've used it for extracting prices from ecommerce websites for an aggregator and for pulling location data from forum postings for mapping projects.
It's fast, python, and focused on getting useful data in a form that you can feed into a database.
Yes, exactly. If you are working with python, you may want to take a look at scrapy. It's a screen-scraping framework that will make what you want to do easier/more efficient. Then you will want to incorporate that into a script that can put the extracted data into a database of your choice.
Since this app will be all about searching, designing the database schema will be the most important part of this project.
If you've got experience programming, it should be easy enough to learn how to scrape data from that site.
Example: The page for the Blue Line stop at Clark and Lake is this page.
http://www.transitchicago.com/mobile/traintrackerarrivals.aspx?sid=40380
It's all displayed as text in a table, so it should be easy enough to use the structure of the page to define how data is scraped from it. And since it's text, there's no special technique needed to read it.
A quick Google search gave me this example: http://www.drewconway.com/zia/?p=1037
And in the comments on that page this was suggested: http://scrapy.org/
Use scrapy --help to see all commands (and the version). I use version 1.0.5, and maybe they have changed something.
See example on main page of Scrapy (http://scrapy.org). They create spider without project.
Personally, I have fallen in love with the Scrapy framework. While there is a slight learning curve when starting off, if your project involves scraping more than a couple of webpages, Scrapy is definitely worth learning.
Once you learn the basics, you can just focus on the core logistics of getting the stuff you need instead of wasting time on making your scraper robust. Scrapy already has built-in features for handling various situations such as redirection and 404 errors, which save you a lot of time for your real project.
Have you tried using Scrapy? http://scrapy.org It has good defaults + tons of settings to configure crawling (including politeness, retries, cookies). For crawling, it's much better than requests + bsoup.
Disclaimer: I help maintain Scrapy and work for Scrapinghub (company which sponsors Scrapy).
Have you tried writing a simple spider with Scrapy, saving the main contents of the most recent crawl to a database/file? Then, each time your spider runs, it could check if the data downloaded has changed since the last crawl by comparing it with the database/file contents.
Btw, do you have programming experience? If you don't, Portia might be a good alternative. It's a visual scraping tool and you can use it for free on Scrapy Cloud.
Hey cool chart! Web scraping can be very fun and rewarding. For future projects, I recommend Scrapy. I've used a few different tools for scraping and web automation: PhantomJS, requests+BS4, Selenium, and Scrapy, which by far takes the cake in terms of being a manageable scraping framework. The only drawback to Scrapy is that there is no browser context as you would get from PhantomJS or Selenium.
I used Python and the Scrapy Framework for a similar crawl job.
It's basically a program that visits a website, searches for certain HTML elements like divs or id attributes, and performs actions on them (like visiting a link or storing the text).
I guess there are similar libraries for Java but Python is pretty easy to learn so I would give it a try.
Here is a tutorial to get you started. It covers all aspects you need.
> The data I collect is often is completely different formats across projects so I can't store in one table.
What's the problem with using more tables in one database?
For your scaling problem you could look into scrapy, a more advanced scraping framework, which is useful for managing lots of different spiders/scrapers.
So, maybe. It largely depends on the county in which the addresses are found.
Most (maybe all?) counties keep public records on property ownership (it's one of the ways Zillow, et al, get their background data). So the specific task seems to be finding that data, then figuring out a way to query it with specific addresses.
Depending on the interface of the county/counties' pages, it could be as simple as downloading a database & using python/excel/whatever to lookup all of your addresses.
If the county has a clunkier interface, where you have to query one address at a time, your task gets a little more difficult. Here are some solutions that could help with that task, in descending order of technical expertise required:
Script-based web-scraping. You can use R, Python, or another language, and any one of various packages to carry this out, but the learning curve can be steep. Scrapy is a good Python package if that's your weapon of choice.
A GUI-based web scraping tool. I've only ever used import.io for this, but there are competitors. How useful they are depends on the exact technology the county's website uses.
Outsourcing to a service, like Amazon's Mechanical Turks. Essentially paying someone else to do it.
Calling the counties in question & seeing if they can help.
If you want to do it the right way, you would want to run the scraper off a server that fetches the data into a database that is stored on your server. Your app should then talk to the server to get any data. The scraper should NOT be running on the iOS app. Check out Scrapy: http://scrapy.org/
http://scrapy.org/ http://doc.scrapy.org/en/1.0/ I just embedded putlockers images for the videos/tv shows because I did not want to store anything on the server so if any legal issues did happen I would just receive a pile of DMCA takedown notices where I just have to take down the links. Also the entire site runs < 100MB of space.
I don't think it does since I could easily parse the HTML page using requests and beautifulsoup and get the data I want.
I used scrapy. It's a python framework for web crawling. The best part about scrapy is that the organisation which maintains it, Scrapinghub, has a service where you can upload your scrapy crawler and their servers do all the scraping work for you! Since I have a slow internet connection, I used this approach. All I had to do was download the data when the scraper had finished crawling.
A lot of people joking. :)
I scrape two local newspaper online classifieds. I store pretty much everything, up to 1GB of text (roughly 1 year of posts). I built this with Scrapy first, to try to learn that tool. I was initially thinking of building a better search engine for the public to use, but I just use it for myself, because I noticed that I could get to know about great deals on common items like bikes, cellphones, and rentals, or any other thing that would be hard to find, with a 30% profit if I sell it the same day again. I made some bucks, not a lot but enough to pay for a fancy dinner. I know I could get more money out of it, because it's easy to highlight good deals in real time, but I just can't muster the enthusiasm to do it.
What an odd question. If you think you don't need one, then don't use one. That's totally fine. I myself like to keep things as vanilla as reasonable.
Just move on, and maybe you'll get to a point where you discover that some of them are indeed useful.
Scrapy for example doesn't only provide helpers for parsing HTML. It provides a whole infrastructure to run your scrapers/crawlers and to post-process results.
I know you're probably interested in starting your own project, but thought I'd share this one which I've used in the past:
Might provide you some ideas on things to do/not do.
When it comes to the crawling, I wouldn't limit myself to Symfony/Laravel or even the PHP language. I would try to separate out that part of the application and implement the crawler using a mature library like http://scrapy.org/. Then use the actual PHP framework to do the rest of the logic, like submitting answers to the DB, fetching the details, etc.
Just my two cents, not sure if anyone will agree ;)
Pick one. Read the docs. If you understand them, you're good to go. If you don't, read them until you do.
What you do with the data once you pull it out of the web page is totally up to you however.
I agree with this post, and I'd like to add edx.org's MITx "Foundations of Computer Science" course series as a free, self-paced resource! The first course in the series is 6.00.1x "Introduction to Computer Science and Programming Using Python": https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x7
I learned to code by sticking near-amateurish web projects together with chewing gum and Wordpress plugins. This approach will teach you to use very specific configurations of tech & other tools, but that's not a good way to learn how to program.
The courses above gave me a solid foundation of recurring programming conventions, so now I regularly switch between Python, Node.js, Ruby, and C++ (via Unreal Engine) in my work and pet projects. My fulltime job is still as a web developer (at a medium-sized video game studio), but on a day-to-day basis I can knock out other developer tasks. For example, one of the sites for our oldest game uses a defunct, unsupported CMS with some serious database vulnerabilities. We put the site behind Incapsula as a temporary solution - but my permanent solution was to build a Scrapy spider (http://scrapy.org/), scrape the site into flat HTML, and serve each page with a small 20-line NodeJS app.
I never would have been able to do the above if I continued in the vein of stringing jQuery plugins together with duct tape. Check out the MITx courses!
You should not do it as part of your Django project.
Ideally you'd write an "internal" HTTP API (you don't necessarily need DRF or Tastypie for that!). Then you trigger your scraper periodically, for example in a cron job. Or check out Scrapy; it provides a whole infrastructure, including a server that lets you do this.
If this is too much overhead/effort (and I see that it could be) I'd implement it as a custom Django command.
It's neat, but I can't really see myself paying for a service like this, considering how dead simple Python is and how libraries like Scrapy basically do this for you already.
At that point the only other thing you're getting here is data/server space, which if you're going to pay for just get a server, which you can use for things like this in addition to anything else you ever wanted to run.
The only use I can see for this service is companies/etc who want to outsource all their data scraping/etc, but again any competent Python dev already working for said company can just read up on Scrapy and get this done with relative ease.
(And if your company doesn't have someone who knows Python or can at least learn enough to use Scrapy, you've gotta reevaluate your hiring practices, everyone and their mom can use or learn Python these days.)
It is absolutely possible to write such a script, but it might require a lot of time if you are an absolute beginner.
Unfortunately I'm unaware of premade solutions for web scraping, but I can suggest you look into Beautiful Soup, a Python library, for a somewhat simple solution.
First you would have to use the Requests library to fetch the page contents (in the form of a get request) to your computer. The example on the home page is all you need to do that.
Once that is done, you can use BeautifulSoup, with which you can extract the information from the webpage.
I find the documentation for both those libraries understandable, but you do have to know at least the basics of python programming.
A more advanced way is to use Scrapy which automatically crawls webpages based on predefined rules to extract information.
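To make the requests + BeautifulSoup workflow above concrete, here is a minimal sketch; the URL and the CSS selector are placeholders for whatever page you target:

    # Minimal sketch of the requests + BeautifulSoup workflow described above.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/page-with-data")
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    for row in soup.select("table.results tr"):
        # Collect the text of each cell in the row.
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)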
What data do you need? There are a few different python libraries for scraping websites (http://scrapy.org/) though when it's simple one time use I tend to go with regular expressions (https://docs.python.org/2/library/re.html) for finding my data.
This is called web scraping. It looks like there are some options for C/C++. I'd recommend Python because there are plenty of tools available that make it a lot easier like Scrapy. You could also do it without Scrapy with a combo of requests and beautiful soup.
See this list of common libs to get you started. For something simple, Beautiful Soup is very popular. Also consider using the requests module and parsing with regex or XPath.
May want to check out: scrapy
Have you looked into Scrapy?
What is Scrapy?
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Great.
What is the difference between this and Scrapy?
For date and time I suppose I'd use the built-in functions.
Where is a good place to relearn about pickling in an ELI5 manner? It was lightly covered in a course I was taking but I never used it.
To scrape the lyrics, I just used a program called Helium Scraper: http://www.heliumscraper.com
It's certainly not amazing, but it's quicker and easier than a package like scrapy (http://scrapy.org/) for something like this.
Otherwise, the text process was done with a bunch of little tools and doodads that I've programmed myself over the past few years, mostly in .NET, but some of it in Python as well.
The graphs themselves were all done in Excel, except for the really bad looking "Theme" graphs, which were done in a very "quick n' dirty" fashion in R. I basically hit the point of the night where I wasn't about to hand-craft that many graphs anymore :)
Check out Scrapy for a cool library that helps you scrape data from a webpage.
It looks like the play-by-play is formatted nice and static, so it should be pretty trivial to pull.
Scrapy is fantastic for web crawling and scraping. If you are rolling your own using BeautifulSoup (which is great for what it does, but not a full scraping framework) or whatever, make sure to check it out.
That site pretty much just calculates the time based off current local time and offsets to UTC. I don't see why you couldn't do something similar using this. According to this, the website you linked to uses the same db. There's also a python module called pytz that looks useful for this and appears to include the db too.
But if you really want to do it using the website, scrapy is probably the way to go.
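For example, a small pytz sketch for converting the current UTC time into a local zone; 'America/Chicago' is just an example zone name from the tz database:

    # Convert the current UTC time into a local timezone using pytz.
    from datetime import datetime
    import pytz

    utc_now = datetime.now(pytz.utc)
    local = utc_now.astimezone(pytz.timezone("America/Chicago"))
    print(local.strftime("%Y-%m-%d %H:%M %Z"))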
Try scrapy for scraping. I've not used it but I've heard good things about it.
No idea about the API, I'd have to see it first, but I'd imagine it uses HTTP requests if it exists. You send an HTTP request to a certain URL and it sends you the data.
Hi,
a nice little tutorial is located here: http://zevross.com/blog/2014/05/16/using-the-python-library-beautifulsoup-to-extract-data-from-a-webpage-applied-to-world-cup-rankings/
and here
http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html
another thing you can check is the scrapy framework: http://scrapy.org/
if you have further questions, just ask.
A quick look at the site and the network calls when submitting the form makes me think you'd be stuck scraping the page, and that's gonna suck trying to do on device. But if you have a server you could use something like Scrapy to write a scraper and convert the data into something that is usable by the device, and then pull that data from your server.
I scraped reddit post titles and top comments (along with a few other things) from a selection of ten subreddits (five of the top subreddits, and five of my personal favorites) over a period of a couple weeks with the Python library Scrapy, and then generated word clouds with wordcloud--another Python library. The size of the word corresponds to that word's frequency.
Edit: The code can be found here.
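As a rough sketch (not the linked code), the wordcloud side can be as simple as this, assuming the scraped titles and comments have already been collected into a text file:

    # Minimal word-cloud sketch; assumes the scraped text is already collected.
    from wordcloud import WordCloud

    text = open("scraped_titles.txt").read()  # placeholder input file
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    cloud.to_file("wordcloud.png")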
So you're trying to scrape the participant id values, i.e. huAA16BD? There are 4000+ records, so 4000+ ids. You're best off using something like http://scrapy.org/ and writing a little spider to load each page and pull out the URLs. It will be a whole fuck load easier.
FantasyPros should do the trick.
http://www.fantasypros.com/nfl/projections/rb.php
Writing a web scraper with Scrapy is pretty easy and will greatly reduce your time gathering these stats.
Any experienced data scientists care to comment on the content/quality/relevance of this course? (alternative sources to learn similar material)
Are lecture videos available (from past or future courses)?
I've previously read that scipy is valuable. Looking at the description, it also covers:
Web scraping with the scrapy Python library and parsing JSON data.
Data wrangling, spatial statistics, data science, and spatial analysis with Python and the pandas and geopandas libraries
QGIS: a powerful, free, and open-source alternative to ArcGIS. (this alone is interesting)
Anything can be scraped. It just happens that a provided API, if there is one, makes it a lot easier to retrieve data than scraping, but scraping is easy once you get it set up right anyway. Do rate-limit your scraping, because you can be IP-banned at some sites.
check out scrapy
For simple solutions I would recommend http://scrapy.org/ However, there are times when it didn't work (example: NTLM authentication we had in our company; now I see in Google search results that people managed to correctly use scrapy with NTLM but that wasn't the case when I was developing my crawlers). Also, keep in mind that there are more ways to do authentication - HTTP basic auth, cookies, sessions, etc. You can even encounter captcha :)
Trying to do it all yourself is a mistake for one ;) But I suspect you know that.
For scoping: do you want to get every PDF, or just every PDF that's on a single page? The second is far easier than the first.
I'd suggest looking into something like http://scrapy.org/ to handle the extraction and parsing of the web page for you.
No one has answered the basic question yet, only casually mentioned it in passing. Anyway, what you're looking for is called a scraper. A couple of links: http://scrapinghub.com (this has a pricing system) or http://scrapy.org/. Also, it's in a legal grey area: big websites like Amazon have an API which allows you access to their data, yet make it illegal to scrape their data any way besides through their API.
But I haven't gotten further than this research into it. I'm actually interested in scraping data myself and haven't actually scraped any yet.
Sure, depending on website you could simply use requests library + BeautifulSoup4 parsing, or check out a comprehensive scraping/ web crawling library Scrapy.
Regarding notification, you could configure simple console output, logging into a file, email, etc.
Ahem... not to go against the grain, but in every instance where I wrote a web scraper, Node has done a disservice to maintainability, due to the immense juggling of async calls. I write all of my scrapers in Python to keep them maintainable long term; sometimes a better tool is needed.
There are several great tools for doing this. Recently started using Scrapy http://scrapy.org/
If you need this data in JS land (because all of your business logic is already there), you could call the Python program: it reads stdin looking for some custom JSON, does the scraping/parsing needed, and outputs JSON to stdout; parse this result in Node.
Caveats to this would be variable passing: don't use command line arguments, which can be prone to shell exploits (the JSON std[in/out] rigmarole gets around this).
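On the Python side, that bridge can be as small as this sketch; the request/response fields ("url", "title") are made up for illustration:

    # Sketch of the stdin/stdout JSON bridge described above: Node writes one
    # JSON request per line to stdin, Python scrapes, and JSON goes to stdout.
    import sys
    import json
    import requests
    from bs4 import BeautifulSoup

    for line in sys.stdin:
        request = json.loads(line)
        html = requests.get(request["url"], timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        result = {"url": request["url"],
                  "title": soup.title.get_text(strip=True) if soup.title else None}
        sys.stdout.write(json.dumps(result) + "\n")
        sys.stdout.flush()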
If you guys are interested in webscraping.
I recommend looking into scrapy.
The tutorial is very straight forward.
If you need a browser client for dynamic JavaScript, you can use Scrapy with PhantomJS and Selenium, which is a bit harder but doable.
Hm. It's really weird. Like I said, I had no problems running your script as it was on Python 3.4, both names and dates were printed as they were on Wikipedia.
Yes, Python 3 has better unicode support. It's not perfect and there are still some quirks, but generally it handles it really well out of the box. If you decide to give Python 3 a go, remember there are some differences in syntax and not all modules work in 3.x (for example, a popular Python scraping framework, Scrapy, works only in Python 2). But like I said, I work mostly with Python 3 and I'm doing well, including in web scraping.
If you're new to encoding and unicode, here's a primer on it. Give it a read.
Post what exactly shows in place of dates, maybe I will be able to help.
Told by whom? This sounds like a Selenium talk and it might be relevant in writing automated UI tests. In scraping text from a website, not so much.
If you absolutely must use CSS selectors, consider using Scrapy. It has special pseudo-elements (which are not present in CSS 3) for selecting text and attribute name.
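For reference, this is the kind of thing meant here; the URL and selectors are illustrative placeholders:

    # Scrapy's ::text and ::attr() pseudo-elements in a tiny example spider.
    import scrapy

    class ArticleSpider(scrapy.Spider):
        name = "text_and_attrs"
        start_urls = ["https://example.com/articles"]

        def parse(self, response):
            for article in response.css("div.article"):
                yield {
                    "headline": article.css("h2::text").get(),
                    "link": article.css("a::attr(href)").get(),
                }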
If you're happy to do a little coding, you can use something like Scrapy to do pretty much anything with a site.
I have a script running that logs into a site (HTTPS form POST), grabs some data using XPath and returns JSON.
If you're just doing something quick and simple that you then want to throw away, regex is fine. But if you want to create something long term that can be extended as needs grow, you will outgrow what regex can do for you.
Also, if you are scraping whole websites, there is a great tool for that: http://scrapy.org/
Right, so you are looking at two problems.
Web scraping, which is concerned with downloading and parsing HTML (or XML, I believe) files, and web crawling.
Web crawling involves writing rules to follow certain links and download web pages. So looking at your problem, you can either find a way to programmatically create all of the links you require (all players), which would mean something like downloading all of their names, perhaps from another source, and then using their last name first with the first two letters of their first name to construct MOST of the pages you require. This method isn't pretty because, after poking around, I am unsure exactly how the page names are structured. It seems there are rules depending on how long the names are, hyphens, etc.
Your other option is to not only scrape the pages, but crawl the site looking for what you need. If you are crawling, I recommend Scrapy: http://scrapy.org/ It's awesome. With that being said, I don't know what your objectives, strengths, weaknesses, or timelines are.
If you are just after data, I would attempt to build all of the links using names. You might be able to cover a lot of ground quickly with that. If you are interested in learning and expanding as well as the data, I would recommend Python and Scrapy, because any and all web data will be relatively within your grasp after learning them, which is a liberating feeling :)
good luck
My reply (form parameters censored)
You are not passing the hidden form parameters. Take a look at the source code, you will find something like this:
<input type="hidden" name="[removed]" value="[removed]" />
<input type="hidden" name="[removed]" value="[some long ass text....]" />
You need to pass them too as parameters in your form response.
One way would be to use the BeautifulSoup library to parse the HTML and extract the hidden input fields.
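A rough sketch of that approach with requests + BeautifulSoup; the URL and the visible form field are placeholders:

    # Rough sketch: fetch the form page, collect the hidden inputs, and send
    # them back along with your own fields.
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    form_page = session.get("https://example.com/search-form")
    soup = BeautifulSoup(form_page.text, "html.parser")

    # Start the payload with every hidden field the server planted in the form.
    payload = {inp["name"]: inp.get("value", "")
               for inp in soup.select("input[type=hidden]") if inp.get("name")}
    payload["query"] = "whatever you are searching for"  # your own field(s)

    result = session.post("https://example.com/search-form", data=payload)
    print(result.status_code)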
If you don't need to solve the problem right now and want to learn something useful, I highly recommend you take a look at Scrapy a Python framework for web scraping. Solving your particular problem with this framework is really easy if you have some experience. And that's the catch: it has a learning curve. But if you know you will do some more advanced web scraping in the future, it's useful to learn it now instead of trying to solve every problem with requests and beautifulsoup.
If in fact you have to solve it right now, go the beautifulsoup+requests route.
Need more details. I've used BeautifulSoup to parse HTML pages and heard good things about scrapy. Without more details though it's difficult to help much.
For Python 2, look into Scrapy; for Python 3 it's a bit more work, but like TheKewlStore said, use BeautifulSoup and Requests.
If I were you I'd look into a library that can simplify this process. A good example is Scrapy; read the docs and try making the example spider. Once you've got this set up, it is really easy to change the code or generate a new spider.
As for your second question, it depends on the website. Are the different sizes just in a checkbox, i.e. are they present in the DOM? Then yes, it is possible with 1 request. Oh you had an example website, hehe, but yes, it can be done on that website.
Anyways, check out scrapy! It can do everything you want. If you have any more questions I can maybe help. Good luck.
One further thing to note about python is that the user community seems to be friendlier than average, a characteristic that shines through if you'll ever want to use (or get involved with) open source projects once you're a more experienced programmer. A good example of this is scrapy: http://scrapy.org/ which is a web crawling framework written entirely in python. Projects like these expand what you are capable of achieving as a programmer within the language.
Here's the framework I've been using BTW: http://scrapy.org/ You can see I was able to change the formatting a bit from the last scrape. I'm still trying to figure out the best way to include pictures in a list document.
1) Yes, absolutely. It's becoming one of the favourite languages for universities to start CS students on.
2) It sounds like you're mainly interested in what's known as scraping. Two good Python libraries designed to help with that are beautiful soup and scrapy
3) and 5) are pretty much the same - python is, I find, syntactically beautiful and lends itself well to readable code.
4)Once you've got your foot in the door with any language, learning another gets easier.
Not a sysadmin (just a developer that hangs out around here...), but maybe instead of automating something why not build a small app? One example that comes to mind is a Python web scraper that brings you all your latest sysadmin news from around the web. Try scrapy, there is a good tutorial here. This might help you start poking around Linux as well.
lxml has Windows binaries available; you should probably use those instead of trying to use pip to install it. Simcaster is right... Linux makes it substantially easier to install extension modules, since pip will have direct access to compilers.
Side note: Scrapy is super nice for that sort of thing. On Ubuntu and other Debians, use apt-get install build-essential python-dev libxml2-dev libxslt-dev to get everything pip needs to install Scrapy and its dependencies.
If you can get your hands on a cheap mobile phone subscription that can send SMS, you could use an Android app (one probably exists for other platforms) to send SMS by making requests to a server that runs on the phone (https://play.google.com/store/apps/details?id=eu.apksoft.android.smsgateway).
I saw someone mentioning Python here. While you have to learn a bit of Python before using it, Scrapy is a great framework for building web crawlers (those allow you to retrieve data from websites).
Just using print to see what you have so far is fine.
As for testing, one approach is to supply your own HTML to scrape for the test. That way you can make sure that the HTML structure you expect to see is getting scraped correctly.
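A small sketch of that testing idea, assuming the scraping logic is factored into a function that takes HTML; the function name and expected structure here are hypothetical examples:

    # Test a scraping function against hand-written fixture HTML.
    import unittest
    from bs4 import BeautifulSoup

    def extract_titles(html):
        soup = BeautifulSoup(html, "html.parser")
        return [h2.get_text(strip=True) for h2 in soup.select("h2.title")]

    FIXTURE = """
    <html><body>
      <h2 class="title">First post</h2>
      <h2 class="title">Second post</h2>
    </body></html>
    """

    class ExtractTitlesTest(unittest.TestCase):
        def test_titles_are_extracted_in_order(self):
            self.assertEqual(extract_titles(FIXTURE), ["First post", "Second post"])

    if __name__ == "__main__":
        unittest.main()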
And just in case you don't know about it already, the Scrapy project is very useful for writing Python scrapers. It has support for things like filtering which links to follow and not follow, and CSS syntax for selecting page regions.
Sorry for the lacking and unclear documentation; I didn't have time to improve it.
With these Scrapy components (scheduler, item pipeline), the crawling and parsing can be done using many processes, either on the same host or on different ones, and the scraped result is pushed to Redis in order to be consumed by multiple workers.
Scrapy initially seems like it has a bit of a learning curve, until you pick out the core pieces you really need. It uses lxml and Twisted, and it commits foot-based violence on the scrapers/spiders I used to write with BeautifulSoup.