Natively, HDFS has shell commands very similar to Unix commands for moving files around locally and remotely; see the File System Shell guide in the Apache Hadoop documentation. There are other options depending on the application.
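If you'd rather do the same moves programmatically, here's a minimal sketch using the Hadoop FileSystem Java API; the paths and the NameNode address are placeholders I made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMoveExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS (core-site.xml) points at your NameNode,
        // e.g. hdfs://namenode:8020 -- placeholder address.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of `hdfs dfs -put data.csv /ingest/data.csv`
        fs.copyFromLocalFile(new Path("data.csv"), new Path("/ingest/data.csv"));

        // Equivalent of `hdfs dfs -mv /ingest/data.csv /archive/data.csv`
        fs.rename(new Path("/ingest/data.csv"), new Path("/archive/data.csv"));

        // Equivalent of `hdfs dfs -get /archive/data.csv data-copy.csv`
        fs.copyToLocalFile(new Path("/archive/data.csv"), new Path("data-copy.csv"));

        fs.close();
    }
}
```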
We move aggregates from Splunk into Hadoop using the Java SDK that Splunk provides. But for moving data in bulk, Splunk does provide a Hadoop connector, which is bidirectional I believe: https://www.splunk.com/en_us/solutions/solution-areas/big-data/splunk-hadoop-connect.html Also check out Hunk, which is Splunk on Hadoop: https://www.splunk.com/en_us/products/hunk.html
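For reference, pulling aggregates out of Splunk with the Java SDK and landing them in HDFS looks roughly like the sketch below; the host, credentials, search string, and HDFS path are all placeholders, not our actual setup:

```java
import com.splunk.Service;
import com.splunk.ServiceArgs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;

public class SplunkToHdfs {
    public static void main(String[] args) throws Exception {
        // Connect to Splunk's management port (placeholder host/credentials).
        ServiceArgs login = new ServiceArgs();
        login.setHost("splunk.example.com");
        login.setPort(8089);
        login.setUsername("admin");
        login.setPassword("changeme");
        Service splunk = Service.connect(login);

        // Run a blocking one-shot search that returns the aggregate results
        // (XML by default) as a stream.
        InputStream results =
                splunk.oneshotSearch("search index=web | stats count by status");

        // Stream the results straight into an HDFS file.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/aggregates/status_counts.xml"))) {
            IOUtils.copyBytes(results, out, conf, false);
        }
        results.close();
    }
}
```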
I was at a paid training session for Hadoop and the instructor kept referencing the Apache Hadoop documentation pages (http://hadoop.apache.org/). They seemed pretty helpful in getting the context of each project (Hive, Pig, Spark, etc.) and the strengths and weaknesses of each.
Hi bluu1,
If I were you, I would not ask whether Hadoop can obtain this information, but rather which product I would use to store the data, which is an inherently different question.
Hadoop is a grab bag of options for processing and storing data in a distributed manner.
Hadoop itself, while the plain open source variant still exists, is more of a concept by now (compare it to Linux), with many distributions combining a lot of the software surrounding it (e.g. MapR, Hortonworks, Cloudera).
So should you use Hadoop for storing the pricing information and processing it for forecasts later? Probably not. If all you have is a hammer, every problem looks like a nail, of course, but this tool probably does not fit your needs. I would possibly use Hadoop if I scraped whole websites into my storage and derived multiple computations from that aggregated data. If you are only interested in getting a few features out of the website, I would put a web scraper (which you need anyway) into some sort of Kubernetes cluster so you can distribute its workload and scale, have some sort of message queue (maybe Kafka) to provision its jobs, and store the scraped results in a document-based database, for example Elasticsearch or MongoDB.
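To make the queue part concrete, here is a minimal sketch, assuming a Kafka topic named scraped-prices and a broker at localhost:9092 (both placeholders); a separate consumer would then index whatever lands on the topic into Elasticsearch or MongoDB:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ScrapeResultPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // One JSON document per scraped item; a downstream consumer indexes
        // these into Elasticsearch or MongoDB.
        String doc = "{\"url\":\"http://example.com/item/42\",\"price\":19.99}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("scraped-prices", "item-42", doc));
        }
    }
}
```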
That being said, if the data is already available in a normalized format (like JSON), just get your Logstash instance ready and poll the data right out of there (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-http_poller.html).
At this point, because you don't know your use case and it sounds like you're still trying to figure it out, you're probably better off paying for a solution: https://logentries.com/ or https://www.sumologic.com/
I'm actually seeing something different: Mesos and containers. With VMs I see a lot of people go down the SR-IOV route, but you also give up some things. In a lot of cases VMs are unnecessary, IMO, with all the other tools that already exist. At the end of the day you have choices, and you can do something like Mesos with VMs and have containers, but at what point are there too many layers of abstraction?
@posix4e "Apache HBase (TM) is not an ACID compliant database." - first line on the Apache HBase semantics page
A great deal of work had to be done in order to provide ACID compliance in Hive on HBase; that work has only recently been done and is, even now, still in progress. Owen O'Malley is the architect at Hortonworks in charge of that project, and the source of my information.
Splice uses Derby and their own code to make their database ACID compliant. My source there is the Splice website.
Vote as you wish, but my information is accurate.
You want to launch an external distributed application from within a restricted contained environment? This is a very unusual request. I'm not sure what exactly you intend that to mean, but I don't think you need it.
Based on (1), it sounds like you're concerned with separate Hadoop jobs, using different libraries and configurations, running at the same time -- but Hadoop supports that automatically. You supply your job jar with every job invocation, Hadoop handles putting that jar on all machines that run tasks, and the jar is only placed in an isolated location, so there's no problem with having multiple Hadoop clients invoking jobs based on different jars at the same time. There's no cross-talk, so one job can't see the other job's jar or any of its classes, and there's no problem with them each having a class with the same name.
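For illustration, here's a bare-bones driver showing the mechanism, using the stock TokenCounterMapper/IntSumReducer classes that ship with Hadoop (the input and output paths are whatever you pass on the command line); the jar resolved by setJarByClass is shipped with the job and staged per job, which is why concurrent jobs never see each other's classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SubmitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The jar containing this class is uploaded with the job and staged
        // in the job's own location, so other jobs never see its classes.
        job.setJarByClass(SubmitJob.class);

        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```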
Why do you want to start from a war instead of a jar?
"Submit the job locally" is also confusing. Unless you're talking about hadoop local mode (only appropriate for testing), there's nothing local about submitting a hadoop job. It is submitted into the shared hadoop map/reduce cluster.
If you have an O'Reilly subscription, there is a pretty good video with Ben and a few other Cloudera guys that's probably a good quickstart on cluster security: A Practitioner's Guide to Securing Hadoop.
Hadoop Operations is a good book but getting a bit outdated. I'd say just start setting up a cluster and maybe read it if you have some free time.
I don't really want a plug & play solution; I'd like to mimic the real thing as closely as possible: starting with ~3 Linux boxes, configuring them, then installing and configuring Hortonworks on them.
End goal would be the Hortonworks HDP Administrator certification, so I need to know how to do everything myself.