How about Common Crawl? Their latest crawl is 145TB of 1.81 billion pages. I think you can get a subset if you want to work with a slightly smaller dataset :)
Hi, we had a course on big data last year. To break it down, one approach is the following.
Get Hadoop set up on a system. If you have a cluster available through your school/university, definitely request access, as it will massively increase what you can do for this project.
Once set up, build a MapReduce job. This is the most important part: when you work with large amounts of data, you need some way to quickly traverse it and filter out only the relevant results to be displayed. An example dataset can be found at https://commoncrawl.org. You can take an entire segment of the set if you can get a large cluster. NOTE: THIS IS MULTIPLE HUNDREDS OF TERABYTES. Otherwise, use their index to find a smaller sample dataset.
Now, how do you MapReduce? The idea is simple: you have several cycles in which each element is mapped, filtered and shuffled across the Hadoop cluster. These operations can often run in parallel and are trivial by themselves. The important thing is to bring the computation to the data, not the other way around, since network traffic will be a major bottleneck. Instructions on how to do this are available online.
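For a first job, the classic example is a word count. With Hadoop Streaming you can write the mapper and reducer as plain Python scripts that read stdin and write stdout; here's a minimal sketch (not production code):

```python
#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word seen on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum the counts; Hadoop sorts by key, so equal words arrive adjacent
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

You can test the pair locally with `cat sample.txt | python3 mapper.py | sort | python3 reducer.py`, then submit the same two scripts to the cluster via the Hadoop Streaming jar (its path varies by install).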
You could do many, many things now to optimise this process. MapReduce is by no means the end of big data, but it's a good start, especially for a small project.
If this is too much for a small project, consider doing just a part of it, or setting up a tiny Hadoop server with a toy example of the search engine!
Good luck
Why don't you crawl the web yourself? Or index datasets from Common Crawl? Google is very tough to crack/crawl.
You can buy access to Google's data (is that what DuckDuckGo is doing?), but I bet it costs a fortune.
It's highly likely if that site owner is blocking the commercial tools you mention, they're smart enough to block most others.
But, if money were no object, you could build your own crawler, crawl the entire internet, and discover them that way; the catch is it's a huge amount of data, and the storage and processing is most likely going to be cost-prohibitive.
Or, you could start with someone else who has already done that, like commoncrawl.org, and extract it from their dataset; the catch is it's a huge amount of data, and the storage and processing is most likely going to be cost-prohibitive (intentionally repeating myself).
OR, like u/TheRealWeedAtman said, invest the above time and effort into your own website, don't worry about the shady tactics of others, and you'll do better in the long term.
Not trading off of news, but I used to work at a venture capital firm. I built a pipeline that tracked the sentiment of different companies over time, based on news articles, which the investors/analysts then used to help inform their decisions.
If you're interested in historical news data, the dataset I used is the "news crawl" (CC-NEWS), which is a subset of Common Crawl. Pretty nice, because you can get a ton of data and it's relatively simple to access.
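To give a feel for the access pattern, here's a minimal sketch (not the original pipeline) that streams one downloaded CC-NEWS WARC file with warcio and scores each page with NLTK's VADER. The file name is a placeholder, and in practice you'd extract the article body and match company names before scoring, rather than scoring raw HTML:

```python
# Assumes: pip install warcio nltk, plus nltk.download("vader_lexicon") once.
from warcio.archiveiterator import ArchiveIterator
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

with open("CC-NEWS-sample.warc.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request/metadata records
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        score = sia.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
        print(f"{score:+.3f}  {url}")
```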
As far as I'm aware, Google doesn't/can't do that. There are, however, resources like https://commoncrawl.org/ that hold a large repository of data, including the raw crawl data, so it should be able to accomplish what you're looking for (though having not used it personally, I can't guarantee that).
I'm in disbelief right now. I started a prompt within the One Piece universe with this simple prompt:
You are Luffy, the main character in the anime called One Piece. You meet a man who is interested in hearing about your adventures as a pirate.
And it has a basic grasp of the fictional world: it knows that the currency in that universe is called "berries", it knows that the character Zoro fights with three swords, that the character Chopper likes "temperatures that are so cold that it would kill most people", it knows that Nami loves jewels, and that Chopper likes carrots (that was never highlighted in the show, but he is a reindeer). And I could go on and on... Next, I'm going to have to investigate their personalities!
edit: this was done with the Dragon model, with randomness set to 0.3
After more research,
" GPT-3 was trained on the Common Crawl dataset, a broad scrape of the 60 million domains on the internet along with a large subset of the sites to which they link. This means that GPT-3 ingested many of the internet’s more reputable outlets — think the BBC or The New York Times — along with the less reputable ones — think Reddit. Yet, Common Crawl makes up just 60% of GPT-3’s training data; "
And I found that GPT-2's training involved 8 million outbound internet links from Reddit (at some point in 2019).
I'm not sure to what degree the Dragon model leverages GPT-3 however.
So One Piece, a show that's been on air for over 20 years, must have enough popularity & internet presence for the models to know a thing or two. That's sick.
https://commoncrawl.org/ has downloadable (LARGE) compressed dumps of crawls, made using (I believe) the Internet Archive's Heritrix crawler. To my knowledge there's no searchable database; can anyone chime in otherwise?
From their site:
> Objectively measuring the relatedness of words is difficult, so as a proxy we look at how often the words are used together in similar contexts. We use the Common Crawl corpus, which contains thousands of different words across billions of webpages. Using an algorithm, we compute the distance (or relatedness) between the words; words such as “cat” and “dog” are often used close together and thus have smaller distances between them, while words such as “cat” and “book” would have greater distances. The total score is simply the average of these word distances: greater distances give a higher score.
I think the cheesy strategy here is going for words that are very specific and unlikely to be used near each other; words that are specific to a context quickly boost the score. Trying it out casually, it's pretty hard to get better than ~96 in my attempts. Also, they only use your first 7 valid words, so the last 3 can be whatever you want; they won't affect your score.
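If you want to play with the mechanic offline, here's a rough sketch of the scoring as they describe it. I'm using a small pretrained GloVe model from gensim-data as a stand-in; the site presumably uses its own Common Crawl-trained vectors, so the model name and the exact numbers here are assumptions:

```python
# pip install gensim; first call to api.load downloads the model (~130 MB)
import itertools
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # stand-in for their Common Crawl vectors

def score(words):
    first7 = [w for w in words if w in kv][:7]  # only the first 7 valid words count
    pairs = itertools.combinations(first7, 2)
    dists = [kv.distance(a, b) for a, b in pairs]  # cosine distance per pair
    return sum(dists) / len(dists)  # total score = average pairwise distance

print(score(["cat", "dog", "book", "quasar", "sonata", "tariff", "enzyme"]))
```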
The data GPT-3 was trained on is a filtered version of Common Crawl, which is openly available web crawl data.
I don't think the model is accessible to the public, but you can request API access from OpenAI here.
There are a lot of ways to implement your use case. Do you have any Natural Language Processing chops? Do you have access to a really beefy machine or a Spark cluster? Python's Natural Language Toolkit (NLTK) can do the word-frequency counting. But counting word frequency across all US news outlets for the past 10 years is going to be pretty computationally intensive (if you can even find that comprehensive a dataset).
I'd start with Common Crawl's news dataset, although I'm not sure it has 10 years of history: https://commoncrawl.org/2016/10/news-dataset-available/
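The counting step itself is the easy part; here's a toy sketch with NLTK and a Counter (the `articles` list is a placeholder for whatever text you extract from the dataset):

```python
# pip install nltk; run nltk.download("punkt") once for the tokenizer
from collections import Counter
from nltk.tokenize import word_tokenize

articles = ["Stocks rallied today ...", "The Fed held rates steady ..."]  # placeholder

counts = Counter()
for text in articles:
    counts.update(w.lower() for w in word_tokenize(text) if w.isalpha())

print(counts.most_common(10))
```

At real scale you'd run one Counter per worker over a shard of the data and merge them at the end, which is exactly the map/reduce shape of the problem.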
This is the original data source. Common Crawl is a non-profit that crawls the web and shares terabytes of data it acquires each month. Domains are ranked by harmonic centrality and a PageRank-style score (the file below contains both).
This is the original data source: https://commoncrawl.org/connect/blog/ (the 12.16 GB file cc-main-2020-feb-mar-may-host-ranks.txt.gz, which lists harmonic centrality and PageRank).
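If you grab that file, a quick way to see how it's laid out is to peek at the first few lines; the header names the columns (this assumes you've already downloaded the file locally):

```python
# Stream the gzipped ranks file without decompressing the whole 12 GB
import gzip
import itertools

with gzip.open("cc-main-2020-feb-mar-may-host-ranks.txt.gz", "rt") as f:
    for line in itertools.islice(f, 5):  # header line + first few hosts
        print(line.rstrip())
```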
Common Crawl, which can be accessed for free on AWS at s3://commoncrawl, has several million PDF files indexed from the websites they crawl.
https://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/
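Since the bucket is public, you can browse it anonymously with boto3. The crawl prefix below is just an example, and note this lists raw crawl files; it isn't a PDF-specific lookup:

```python
# pip install boto3; no AWS credentials needed for anonymous access
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2020-16/",  # example crawl, an assumption
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```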
> Put it this way: common crawl (like an open Google index) has a ton of web pages. We'll use 4,000,000,000 as our number. If it takes a second to download and parse each one, a typical 4 cpu computer running a Python script would burn through the whole thing in ~126 years.
>
> A 200 thread cluster running Python gets that down to ~6 months.
First off super cool project. Well done, seriously...
But there are some errors in what you just said. First, you download the Common Crawl data in batches, not one page at a time (that'd be literally the worst way for them to implement their API): https://commoncrawl.org/the-data/get-started/
Second, even if it takes 1 s to download a web page, 99% of that is network latency. This is why non-blocking coding practices have become a necessity over the past 10 years: so you can actually make some use of those 4 cores rather than sitting idle waiting on I/O.
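To make that concrete in Python terms, a single process with asyncio/aiohttp can keep hundreds of downloads in flight at once, because nearly all of that ~1 s per page is time spent waiting on the network. The URLs below are placeholders:

```python
# pip install aiohttp
import asyncio
import aiohttp

URLS = [f"https://example.com/page{i}" for i in range(500)]  # placeholders

async def fetch(session, sem, url):
    async with sem:  # cap concurrency, like the "200 thread cluster"
        async with session.get(url) as resp:
            return url, len(await resp.read())

async def main():
    sem = asyncio.Semaphore(200)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch(session, sem, u) for u in URLS), return_exceptions=True
        )
    ok = [r for r in results if not isinstance(r, Exception)]
    print(f"fetched {len(ok)} pages")

asyncio.run(main())
```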
You can really make those Pi cores do some serious work. I highly suggest learning some Java and taking a look at the Vert.x framework; it'll make that cluster a serious tool if done right.
Edit: I'm not saying that work can't be done in Python... But the JVM is where it's at for distributed computing.
There are web crawl data sets available - you don’t need to crawl yourself unless you need really up-to-date information.
Of course, you still need to search through the data to find the matches, and that takes a bit of compute power to do...
Also, it's not perfect. You won't have the deep web, and you shouldn't have anything disallowed by robots.txt... Plus, basic HREF links are easy to find (see the sketch after the edit below), but links generated by scripts could be hard to find... Still, your results can be "mostly correct".
Edit: Here is one such public web crawl data set. https://commoncrawl.org
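For the HREF case, the extraction step really is only a few lines once you've downloaded a WARC segment from such a dataset. Here's a sketch with warcio and BeautifulSoup; the file name is a placeholder, and script-generated links won't show up here:

```python
# pip install warcio beautifulsoup4
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

with open("segment.warc.gz", "rb") as stream:  # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        page = record.rec_headers.get_header("WARC-Target-URI")
        soup = BeautifulSoup(record.content_stream().read(), "html.parser")
        for a in soup.find_all("a", href=True):  # only plain HTML anchors
            print(page, "->", a["href"])
```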
Size may not be the point here. Meaningful results for a study which intends to 'say something about OA' would seem to require a balanced spread of OA titles. The DOAJ, for instance, is not balanced. It omits a swathe of titles, because it quickly excludes any journal which ceases or temporarily suspends publication, including wiping the journal's tables-of-contents as hosted on the DOAJ.
However, if the data mining is merely intended to 'learn how to do some data mining' then yes, I guess the DOAJ could be useful. But a better source might be Common Crawl, which as of September 2017 includes nearly all university domains. One might pick random PDFs from university repository domains, if they contain keywords indicating they are from a journal. One would then remove the articles from predatory journals, to provide a clean and balanced set of OA articles. https://commoncrawl.org/2017/09/september-2017-crawl-archive-now-available/
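One hedged way to sample such PDFs without downloading whole crawl segments: query the public Common Crawl URL index for a university domain and keep only the PDF hits. The index name below matches the September 2017 crawl linked above; the domain and limit are just examples:

```python
# Query the CDX index API and filter for PDFs client-side
import json
import requests

resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2017-39-index",
    params={"url": "mit.edu", "matchType": "domain",
            "output": "json", "limit": "200"},
    timeout=60,
)
for line in resp.text.splitlines():
    rec = json.loads(line)
    if rec.get("mime") == "application/pdf":
        print(rec["url"])
```

One could then keyword-filter the PDFs themselves for journal markers, as suggested above.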