It's not really even a secret: https://en.wikipedia.org/wiki/MapReduce
Google's secret is the weighting algorithms they use to order the search results, but they released whitepapers with details of how the indexing infrastructure works. There's even an open-source implementation of that infrastructure: Apache Hadoop.
Imagine you have a massive data set and you need to know how many times datum X appears in it. This thing is several TB long and it's unsorted, so it will take a long time before you can find what you're looking for using just your computer... What to do?
MapReduce is a software approach for cutting problems like this one into smaller problems, mapping each sub-problem to a different processor (usually different machines on a network) and then reducing each intermediate answer to the single final answer you're looking for. Not all problems can be distributed out to separate solvers, but for the ones that can be, you can gain a speedup of several orders of magnitude when compared to the classic, single-processor approach most of us are used to.
Going back to your problem, let's say you have 200 idle processors at your disposal. So in the mapping step, you cut the input data set into 200 pieces and send each piece out to a different machine. Each computer will now go through its respective chunk of data, count how many times datum X appears, and output its sub-count to a designated reducer, which adds all the counts together. The reduce step can have several reducers, all of which eventually send their answers to a single reducer that outputs the final answer you're looking for.
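In Hadoop's Java API, those two steps map directly onto a Mapper and a Reducer. Here's a minimal sketch for the count-datum-X problem (class names and the "datumX" key are made up for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each machine scans its own chunk of the input and
// emits a 1 for every occurrence of the datum we're counting.
class CountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final String TARGET = "datumX"; // hypothetical search key
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text(TARGET);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.equals(TARGET)) {
                ctx.write(outKey, ONE); // one sub-count per occurrence
            }
        }
    }
}

// Reduce step: the framework groups all the 1s under the same key,
// so the reducer just adds them up into the final answer.
class CountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable c : counts) {
            sum += c.get();
        }
        ctx.write(key, new LongWritable(sum));
    }
}
```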
The most common use of MapReduce is probably querying huge data sets - map the input to different machines and have each one search its (much smaller) subset for whatever key you're looking for and then reduce all outputs to a list of appearances of your key in the input as a whole.
As for Hadoop, it's just a FOSS implementation of MapReduce in Java (a Python-flavored route is also available via Hadoop Streaming, if I'm not mistaken). There are a bunch of excellent MapReduce/Hadoop tutorials online. Start here.
That's a seriously big question - depending on what you wanted to do with it, I'd start by looking at either Beowulf clusters or something like an Apache MapReduce cluster via Hadoop.
Just to give you a quick example: have you heard of the recent Big Data craze? The main service enabling it is Hadoop, the open-source counterpart of Google's filesystem and MapReduce papers, and it is entirely written in Java. Nearly every product in the Hadoop ecosystem (HBase, Hive, Zookeeper; Spark too, though it's mostly Scala on the JVM) is developed in Java as well. And these are quite recent applications and services, launched only a few years ago and used to build cutting-edge apps everywhere, not just some legacy enterprise software. Java is still used for performance-intensive stuff.
Natively, HDFS has shell commands very similar to Unix commands for moving files around locally and remotely. See here. There are other options depending on the application.
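For reference, the basic moves look like this (paths are just examples):

```sh
hdfs dfs -ls /user/me               # list a directory, like ls
hdfs dfs -put local.txt /user/me/   # copy from local disk into HDFS
hdfs dfs -get /user/me/big.log .    # copy from HDFS back to local disk
hdfs dfs -mv /user/me/a /user/me/b  # move/rename within HDFS
hdfs dfs -cp /user/me/a /backup/    # copy within HDFS
```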
You don't, in the sense you're talking about. You would set up passwordless SSH to localhost. Since pseudo-distributed Hadoop just runs a separate JVM for each Hadoop process, it fakes out having multiple nodes without using multiple hostnames, etc.
Source: Hadoop wiki
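If it helps, the usual incantation for that (straight from the single-node setup docs) is:

```sh
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should now log in without a password prompt
```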
If you're going to break out into multiple processes, it might be a good idea to use tools suitable for that purpose instead of coming up with an IPC strategy yourself. For example, Hadoop or Spring Batch. These will allow you to implement Producer/Consumer type configurations on a larger scale.
These options support multi-threaded, multi-process, and even clustered (multi-machine) setups.
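If you just want to see the shape of the pattern before pulling in a framework, here's Producer/Consumer in-process with a plain BlockingQueue (all names invented for illustration). Hadoop and Spring Batch essentially scale this same idea across processes and machines:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerDemo {
    private static final String POISON_PILL = "__DONE__"; // sentinel that tells the consumer to stop

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    queue.put("item-" + i); // blocks if the queue is full
                }
                queue.put(POISON_PILL);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String item = queue.take(); // blocks if the queue is empty
                    if (item.equals(POISON_PILL)) break;
                    System.out.println("processing " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```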
I have a bachelor's degree in Computer Science. I don't have any certs (yet), although they are very valuable. If you're looking for a great cert to get into for a technology that's going to be one of the biggest IT booms of the next few years, check out Hadoop (http://hadoop.apache.org/). It's the reason sites like Facebook can manage their scale of data, and it will be bigger than SQL Server in a few years.
Yes, I'm aware of the inconsistency there. The 'multiple of 10' thing comes from the source story, where a consultant-type guy said most contemporary clusters tend to max out at 15 PB.
I think there's also a difference between GFS (which is one file system spanning all 200,000 disks, I think) -- and Hadoop, which is more like a big distributed network of computers and storage devices: http://hadoop.apache.org/
i.e. Facebook might have more than 21 petabytes of storage in total, but it's unlikely that that's all in the same box, or even the same room. I could be wrong, tho'.
You may want to look into Hadoop (an Apache product) for high availability and scalability. It's meant more for processing, but you can use it to cluster cheap hardware together for mass storage. You may also want to look into EMC CLARiiON or the VNX series SANs for just pure storage (expensive, but worth it in my opinion). You may also want to look at Solaris; you could utilize ZFS for future expandability. I'm not sure if you can expand storage pools beyond the physical hardware, but within that hardware, expanding the storage pools is relatively easy. If you're feeling overwhelmed by all of this, I would highly suggest hiring a full-time sysadmin with SAN or mass-storage experience.
You can find the current documentation regarding configuring clusters at the following URL: http://hadoop.apache.org/common/docs/current/
Quickstart is at : http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html
The Forrest documentation is pretty much up to date. If you do find a problem with the documentation, do mail the user mailing list. It is very active and you can be sure that the documentation will be brought up to date.
In addition to Android, Java is mostly used for the server side of web apps. Security problems with "online" Java apps are pretty much exclusively related to applets (Java hosted in the browser via a plugin, kind of like Flash), which have been mostly dead since the early 2000s.
Java is also used a lot in "Big Data" projects - Hadoop was written in Java.
I was at a paid training session for Hadoop and the instructor kept referencing the Apache Hadoop documentation pages (http://hadoop.apache.org/). They seemed pretty helpful in getting the context of each project (Hive, Pig, Spark, etc.) and the strengths and weaknesses of each.
Throw Ubuntu on there and set up a Hadoop cluster. Then set up a single-node Hadoop cluster on the i7 and you can run the same job/data on both setups to compare.
It comes with a number of samples, so you wouldn't necessarily have to worry about programming anything to get a quick test.
and it's just awesome too... :)
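For example, the bundled examples jar gives you ready-made jobs to throw at both setups (the exact jar path and version vary by install):

```sh
# estimate pi with 16 map tasks
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 16 1000

# classic wordcount against data you've put in HDFS
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
```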
Find a local comp-sci major who wants to learn about distributed parallel computing, like Apache Hadoop.
Set up a cluster using whatever machines you have available, and then have fun with it. Write up some apps to stress test the cluster, then practice performance tuning and troubleshooting.
Just an idea, it could be fun.
Something to read up on, since it sounds like you might be interested, is Hadoop. It's software that enables distributed computing: the idea that a "job" can be spread out over many machines to improve performance.
There are many topologies, and you can easily fall down a rabbit hole of interesting topics: Open Compute, software-defined networking, software-defined (anything), and lots of others.
It's mind-blowing what can be done nowadays.
From an architecture perspective, the problem I gave you is to design Splunk.
Hadoop = http://hadoop.apache.org/ basically the foundation point of big data for most orgs
Storm = http://storm.apache.org/ is a real-time stream processing system that you'd couple with Flume to process the log files.
I can go farther into the high level design if you'd like. I just got through mapping out a similar architecture last week.
PS, you can do a Hadoop cluster with 3 Raspberry Pis.
But hadoop has had native support for Docker containers for quite some time?
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
edit: I see they mean the other way around. I may be short-sighted, but I don't see the need for that? What would it give you, to run hadoop in docker containers?
Kenny,
Following your steps, I am stuck at "Go to http://hadoop.apache.org/releases.html#Download". Needless to say, I need a ton of help. Even after I did it, or thought I did, I can't seem to access the internet. A ping to Google got me into an endless loop. I don't have a GUI. Can I access a website from CentOS without the GUI? I searched the internet for help but to no avail. I used ifup eth0 to get on the network; eth1 doesn't work. Please help, I don't want a failing grade for the group project.
All of the stuff we currently run on Hadoop started out as xargs and shell scripts. Hell, it's usually pretty easy to build your data processing around "map" and "reduce" scripts hooked up via command-line pipes, then dump them into Hadoop Streaming when your project starts to wear big-boy pants.
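The nice part is that the local prototype and the cluster job are the same two scripts. A sketch (script names and paths are placeholders):

```sh
# local prototype: the unix pipeline IS map/shuffle/reduce
cat input.txt | ./map.sh | sort | ./reduce.sh > output.txt

# same scripts, promoted to the cluster via Hadoop Streaming
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/input -output /data/output \
    -mapper map.sh -reducer reduce.sh \
    -file map.sh -file reduce.sh
```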
Hey, I did a little bit with Hadoop at an internship. I learned from a supervisor, but I found the Hortonworks Sandbox helpful for testing some queries before I submitted my work. I've since started as an assistant to a grad student doing work with Hadoop. I've learned from there mostly just by reading documentation, manuals, and Hadoop: The Definitive Guide by Tom White (O'Reilly book). My Java is hella weak, which is a real disadvantage. As others have said, Java is the most important thing.
Well... There really are some exceptions. But most of them will not help a beginner :)
Some big enterprises (Google, Facebook, Amazon, etc.) use Java SE to develop special platforms, because they sometimes are not satisfied with what Java EE provides.
Perhaps the best-known example is Hadoop with its minions like Pig, Hive, HBase, Cassandra, etc. - things related to Big Data.
And though you don't need Java EE knowledge for these, they may require even more experience to get a job with. For example, I regard myself as a successful Java EE engineer, but I still need a lot of exercise and practice to get employed as a Hadoop engineer at the same level. Though I hope to hit that target in the future... :)
So, concluding, it may be even harder for a novice to get hired onto an interesting project using only Java SE - for example, a high-load server platform for an online game.
You want to launch an external distributed application from within a restricted contained environment? This is a very unusual request. I'm not sure what exactly you intend that to mean, but I don't think you need it.
Based on (1), it sounds like you're concerned with separate Hadoop jobs, using different libraries and configurations, running at the same time -- but Hadoop supports that automatically. You supply your job jar with every job invocation, Hadoop handles putting that jar on all the machines that run tasks, and each jar is placed in an isolated location. So there's no problem with multiple Hadoop clients invoking jobs based on different jars at the same time: there's no cross-talk, one job can't see another job's jar or any of its classes, and it doesn't matter if they each have a class with the same name.
Why do you want to start from a war instead of a jar?
"Submit the job locally" is also confusing. Unless you're talking about hadoop local mode (only appropriate for testing), there's nothing local about submitting a hadoop job. It is submitted into the shared hadoop map/reduce cluster.
Apache Hadoop? From my understanding of Apache Hadoop, they're completely different areas. Lapis is a framework for web applications (think Rails, Sinatra, Flask, Revel, Zend, etc.). You write the web applications in MoonScript (which compiles to Lua, if you didn't already know...) to be run on OpenResty (which is Nginx + Lua). Hadoop, on the other hand, is a set of projects dealing with high performance, distribution, and high availability. To quote their site:

> a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
So, web frameworks aren't going to help with any of this. That being said, it should be possible to use something like Lapis to communicate with some Hadoop projects, such as ZooKeeper, Cassandra, Spark, etc., with the assumption that they have some form of network communication for clients (which I'm basically certain they do). However, the libraries to talk to said services may or may not exist yet.
Well it's for data mining / processing. I'm not sure if you're aware of Hadoop, but that's what we use.
Nearly all colocation providers over here (.nl) provide around 0.5 amp of power at 230 volts, so that makes 115 watts, so indeed, it's a pretty tight budget.