Wow wow wow guys, you are reading too much into this. Hudi is just a way to store data so you can efficiently upsert records on HDFS. Say you have a table stored as partitioned Parquet: if you want to update one record in a partition, you have to rewrite the whole partition. Instead, a Hudi table maintains a delta log as Avro. The read-optimized view exposes only the Parquet base data; the real-time view combines the Parquet base data with the Avro delta log so you get up-to-date records. When Hudi decides it's worth compacting, it flushes the Avro log and rewrites the Parquet files. Just watch their presentation from Strata NYC 2018 on O'Reilly's learning platform: https://www.safaribooksonline.com/library/view/strata-data-conference/9781492025856/?ar&orpq. There are even better alternatives in the cloud, like Databricks Delta and Snowflake.
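To make that concrete, here is a rough PySpark sketch of an upsert into a merge-on-read Hudi table. The bucket paths, table name, and key/partition fields are hypothetical, and the option names assume a reasonably recent Hudi release (older versions use the format name "org.apache.hudi"):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

# Hypothetical incremental updates to apply to the table.
updates = spark.read.json("s3://my-bucket/incoming/updates.json")

hudi_options = {
    "hoodie.table.name": "events",
    # MERGE_ON_READ keeps a Parquet base plus an Avro delta log, as described above.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/events"))
```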
Firstly, let's clear one thing up: queries do not happen "instantaneously"; they take either a discernible or an indiscernible amount of time. Google queries may seem instantaneous, but those X milliseconds are simply indiscernible to a human in some cases.
Here are some options:
1) Pre-compute query results (effectively creating an inverted index) -- better if the data doesn't change frequently
2) Use Elasticsearch, which is designed for high-scale field-matching queries (see the sketch below)
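For option 2, here is a minimal sketch with the Elasticsearch Python client; the host, index name, and field are hypothetical, and the exact call signature varies a bit by client version:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Field-match query: find documents whose "title" field matches the text.
resp = es.search(index="articles", body={
    "query": {"match": {"title": "data engineering"}}
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```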
Going back to query time, you need to set expectations. Is it 100 or 500 concurrent queries? That's a very wide range. You should have a defined target:
With X concurrent queries we want the average query time to be Y milliseconds.
I took a data engineering course a few months ago which was absolutely fantastic. I highly recommend taking a course rather than reading a book. There's a Hadoop course on Coursera right now: https://www.coursera.org/learn/hadoop
Once you've done that, another great resource is the Hortonworks tutorials. That's what I'm working on right now. Figure out how to use Ambari to set up a cluster on Amazon EC2, then import data into Hive and run some queries. You might be able to get that done over the holidays.
Finally, when you want to dive deep O'Reilly has the best books on data engineering. I'd recommend Hadoop Application Architectures because it gives high-level summaries.
Good luck! Post again and tell us what your favorite learning resources are.
> i want to do an offline lookup - so i am looking for some information i can download, instead of API calls.
Stop and think about this. You're talking about downloading and storing the entire DNS namespace: every single registered domain and the IP(s) it routes to.
Do some more research on DNS to understand the protocol better before deciding on a solution.
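To illustrate the point: names are normally resolved with small queries at lookup time rather than from a local copy of the namespace. A tiny Python sketch (the hostname is just an example):

```python
import socket

# Resolve one name on demand via the system's DNS resolver,
# instead of trying to keep the entire namespace on disk.
addrinfo = socket.getaddrinfo("example.com", None)
ips = sorted({entry[4][0] for entry in addrinfo})
print(ips)
```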
You do understand what overfitting means, right? See: https://www.quora.com/What-is-an-intuitive-explanation-of-overfitting Is this what you want to achieve?
If you want regression that gives more weight to more recent data points, look at loess (https://en.wikipedia.org/wiki/Local_regression).
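A minimal sketch of loess/lowess smoothing with statsmodels, on a synthetic series just to show the call:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic series: x could be time, y the observed value.
x = np.arange(100)
y = np.sin(x / 10.0) + np.random.normal(scale=0.2, size=100)

# frac controls how local the fit is: smaller values weight nearby points more heavily.
smoothed = lowess(y, x, frac=0.3)  # returns an (n, 2) array of (x, fitted y)
print(smoothed[:5])
```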
You compare your model to a 50% baseline (coin toss), but how does it fare against fitting a simple linear trend over the explanatory variable for each of the 6 data points? Or against using some exponential smoothing time-series technique over all the data points?
What sense does a hypothesis test make for comparing model accuracy? You may want to construct a confusion matrix of true positives, false positives, true negatives, and false negatives instead.
If I understand what you are doing, you do not have a 75% prediction rate. Your 4 runs vary in "accuracy" from 70% to 75%, and these are only the results you've liked enough not to re-run the simulation. You run the neural network over and over again until it achieves a result you want ("so built in is a check to verify the overall error rate is < .01, if not the entire data set is reran until this is achieved."). Let me explain why this is an issue. Neural networks start from a random initialization of weights and can give slightly different results every time they are run. If I understood your description of the re-runs correctly, you are essentially running the simulation until you get a result you like. This means the good result you find may have more to do with the random initialization than with the signal the model should be picking up. A different random initialization may work better on new data, but you won't know that until after the fact (so you can't control it).
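One way to see (and avoid) this effect is to evaluate the same network under several random seeds with cross-validation and report the spread, rather than keeping the single best re-run. A sketch with scikit-learn on stand-in data (your own features and labels would replace make_classification):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Stand-in data; swap in your own feature matrix and labels here.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Evaluate the same architecture under several random initializations,
# instead of re-running until one initialization happens to look good.
scores = []
for seed in range(10):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    scores.append(cross_val_score(clf, X, y, cv=5).mean())

print("accuracy over seeds: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```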
Well, if you want to get involved with Big Data specifically, I would recommend learning these technologies:
* Python
* Java
* Apache Spark
* SciPy
* AWS
That being said, it doesn't really sound like Big Data specifically is what you're interested in. It sounds like you might want to just explore CS in general and see what parts catch your eye.
There are tons and tons of great resources out there for learning CS in your spare time. I haven't tried Lynda. Check out www.codeacademy.com, www.coursera.com, or http://learnpythonthehardway.org/
If you want a quick survey course check out Nick Parlante's CS 101 course on coursera. https://www.coursera.org/course/cs101
There's also a really great FAQ on /r/learnprogramming.
I'm not sure formal classes are necessary. I've never taken a programming class in my life, but YMMV.
Good Luck!
If your requirement is scale plus strict consistency, I'd think HBase would do the job for you. That said, if I were you, I'd review http://hbase.apache.org/book.html#quickstart and send any follow-up questions to the community. The HBase community is great.
Mongo could definitely be the solution, but you're going to need a lot of hardware because you'll want to shard all that data out (a billion records is a LOT to store, haha). But it's honestly one of the better solutions for what you're looking to do.
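For a rough idea of what the sharding setup looks like, here's a pymongo sketch run against a mongos router; the host, database, collection, and shard key are hypothetical:

```python
from pymongo import MongoClient

# Connect to a mongos router; sharding admin commands go through mongos.
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding on the database, then shard the collection on a hashed key
# so a billion documents spread evenly across shards.
client.admin.command("enableSharding", "analytics")
client.admin.command("shardCollection", "analytics.events", key={"_id": "hashed"})
```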
You could do this with Hadoop too; it won't exactly be the "fastest" route, but it's certainly not out of the question. Just don't expect real-time results (nor should you from Mongo). Hadoop is a bit harder to get off the ground and understand fully, but it scales pretty nicely.
Personally, if I were going into a project like this, I'd give Mongo a much deeper look, but also price out some hardware for the end-game scale you're hoping to reach.
Edit: You might want to look into some third-party solutions for hosting that, by the way. Full disclosure: I work at Rackspace, so my two suggestions here are a bit biased.
For Mongo, ObjectRocket is pretty awesome for the performance you get: http://objectrocket.com/
For Hadoop, Rackspace just launched their Cloud Big Data platform, which you can apply to for early access: http://www.rackspace.com/cloud/big-data/
Edit 2: And the only reason I suggest those is that they handle scaling in a MUCH easier fashion. I'm not just trying to angle for your business (:
https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage
You can take the whole course for free. If you're passionate about numbers, start a blog, crunch some data, write about it, and let the job find you :)
You should consider InfluxDB or other TSDBs. 225 events per second is pretty much nothing for those databases; however, they do have drawbacks, such as very poor update capabilities and somewhat limited query capabilities.
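For scale, a few hundred points per second is a single small batched call with the 1.x-era Python client; the host, credentials, database name, measurement, and tags below are hypothetical:

```python
import datetime
from influxdb import InfluxDBClient  # 1.x-era client

client = InfluxDBClient("localhost", 8086, "root", "root", "metrics")
client.create_database("metrics")

# One second's worth of the example workload (225 events), batched into one write.
now = datetime.datetime.utcnow()
points = [{
    "measurement": "events",
    "tags": {"source": "sensor-1"},
    "time": (now + datetime.timedelta(milliseconds=4 * i)).isoformat() + "Z",
    "fields": {"value": float(i)},
} for i in range(225)]

client.write_points(points)
```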
I'm not exactly an Influx fanboy, but I do recommend it, and it has things like Kapacitor, which is essentially a stream-processing framework.
I personally use Influx in my projects; however, we've also dedicated some time to setting up a ZeroMQ-based pipeline. I was first turned on to ZMQ when I read that CERN was using it, and it works exactly as advertised and has really good performance. The ZMQ pattern that is worth its weight in gold is PUBLISHER/SUBSCRIBER, which lets you create pipes of data that you can send in and out of your programming language of choice, or even use as a splitter: one branch to Influx for display in a dashboard, another to S3 buckets holding highly efficient HDF5 files or whatever else, with the latter used for offline data analytics.
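A minimal pyzmq PUB/SUB sketch of that pipe idea; the endpoint, topic, and payload are hypothetical, and the subscriber would normally live in a separate process (e.g. one feeding Influx, another batching files out to S3):

```python
import zmq

ctx = zmq.Context()

# Publisher side: push events onto a pipe any number of consumers can tap into.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
pub.send_string("metrics sensor=1 value=0.42")

# Subscriber side (normally a separate process, e.g. one writing to Influx,
# another batching files out to S3 for offline analytics).
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "metrics")
# msg = sub.recv_string()  # blocks until a matching message arrives
```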
I think it's going to take some time to go mainstream, because Kubernetes integration for Spark is very recent and only came out in March of this year. Regarding some of the other points: the Akka scheduler operates at the application level, so it wouldn't be affected by whether you run the application on Mesos or Kubernetes. Cassandra and Kafka are independent data stores, and you could potentially run them on Mesos/Kubernetes, though I wouldn't recommend it. More likely you'd want to run them on separate nodes to isolate the workloads of a DB/log store.
So in theory SMACK -> SKACK (much weirder sounding) without any other changes. But I don't know of any company that has made this change yet.
To be honest, not many books and courses use Scala as the language for Spark. I would try the Coursera course on functional programming in Scala: https://www.coursera.org/course/progfun.
I also build Spark applications in Scala, so if you need an idea of what it's like, I'd say: keep everything immutable and try to be as functional as you can.
The Johns Hopkins Data Science certificate on Coursera has many great courses in it. If you complete all of them, you'll be a long way toward learning the data science side of things.
The Johns Hopkins Data Science certificate classes are well done. To finish all 9 and get the certificate would cost something like $500 and you can take 8 of them for free if you want. I just finished them and I learned a lot.
I know very little about data science proper but I have studied questionnaire design.
What you're calling a "direct question" actually invites an open-ended response or a measured response on a multi-point scale. It's incredibly general and will give you data on general customer satisfaction but no other specific information about customer habits.
If you were trying to determine viewers' preferences for which sports they enjoy watching, I'd suggest offering more options, like so:
Which of these do you like watching the most? 1) Golf 2) Baseball 3) Hockey 4) Football 5) Basketball
With a dataset of 100,000 respondents, the answers to such a question will probably yield information that is pretty consistent with existing data on sports viewing habits. But another way to approach it is to ask each respondent to rank all 5 sports by preference.
With the multiple-choice version you only get one data point per respondent, but a ranking gives you far more information: a single choice has 5 possible answers, while a full ranking of 5 sports has 5! = 120 possible orderings. And assuming you have demographic information on your respondents, that could let you find valuable viewing trends that affect advertising revenue and how you market to the customer.
I hope this helps! If you want to really understand the depth of how to design social surveys, I highly recommend this course:
Elasticsearch is definitely good for search queries, especially full-text search. It also scales well.
But if you're looking for a highly scalable relational DB, check out YugabyteDB or CockroachDB (both PostgreSQL-compatible); they would also do the job well.
I think what he's saying is that snapshots (EBS and RDS) are region-specific by default. This is true, but they are easily copied across regions, so I don't think this is a fair point.
For ELB, pre-warming is a recognised thing, but only if you're going from something like 10 requests/s to 100,000 requests/s in a step function. It features in the AWS certification for sysadmins; see here: https://aws.amazon.com/articles/1636185810492479#pre-warming
> A simple example I'd use is snapshots and images and cross- region networking. Why is it all so region-centric? Why are snapshots region-centric? Why have different images for different regions?
Google has half as many zones as AWS, and AWS has much better international coverage (maybe I'm just bleating because I don't live in the US). Community open-source projects are far more available for AWS than for GCP. To be fair, I have sworn many times dealing with other-region S3 buckets.
What kind of snapshots are you referring to as region-centric in AWS? In my experience AWS supports cross-region copies for both EBS and RDS snapshots: https://aws.amazon.com/about-aws/whats-new/2013/06/11/amazon-announces-faster-cross-region-ebs-snapshot-copy/ https://aws.amazon.com/blogs/aws/cross-region-snapshot-copy-for-amazon-rds/
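For example, a cross-region EBS snapshot copy is a single API call with boto3; the region names and snapshot ID below are made up:

```python
import boto3

# Copy an EBS snapshot from us-east-1 into eu-west-1.
ec2 = boto3.client("ec2", region_name="eu-west-1")  # client in the *destination* region
resp = ec2.copy_snapshot(
    SourceRegion="us-east-1",
    SourceSnapshotId="snap-0123456789abcdef0",
    Description="cross-region copy of the data volume",
)
print(resp["SnapshotId"])
```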
If your EMR data is on S3, it's fairly easy to migrate that S3 data to another region.
> Another would be load balancing. Google's load balancer requires no phone call to pre-warm the load, and is entirely global through a single anycast IP. Seriously awesome tech :)
Phone call? Can you link me to this? I have never needed a phone call to set anything up in AWS (including load balancers). I'm not calling you out; I just have never experienced this, nor could I find any reference to it for ELB with 5 minutes of googling (maybe I need to up my Google game).
One thing I'm quite keen to check out is Pub/Sub on GCP. Kinesis sucks and is really only good for prototyping or smaller applications (money starts bleeding once you get over 3-4 streams). My company is a Kafka partner (Confluent.io), and we typically deploy Kafka instead of Kinesis on AWS.
Here's an example: https://www.coursera.org/specialization/jhudatascience/1
It's just another platform for online education. However, it has a more traditional format, i.e. sequential, per-week release of course materials like real university courses. That's great and all, but since I'm learning this stuff on weekends and I'm not looking for credentials, I find the compactness of Udacity courses more suitable.
Yes! I'm the author, and I called it a "large cluster" because that's what AWS chooses to call it (they even seem to call it 'extra large').
(Search for "ds2.xlarge" and "cluster" on their https://aws.amazon.com/redshift/pricing/ page)
I would call it something else, but who am I to fight their official nomenclature?
(Cloudera emp speaking)
What is your use case? If you're managing a multi-user BI-style environment, then Impala is purpose-built for it. In contrast, Spark SQL is intended for building broader, procedural-type Spark applications.
And oh, BTW, you can run your CDH/Impala cluster on AWS. (See the Quick Start at https://aws.amazon.com/quickstart/.)
Well, that’s an incredibly broad question that’s impossible to answer. What is your specific problem? Are you trying to analyze sensor data? Text mining? Streaming data? You can visit the big names of big data (Cloudera, Hortonworks, AWS, Databricks...) to get some ideas, or pickup an O’Reilly book like: https://www.safaribooksonline.com/library/view/architecting-modern-data/9781491969267/
I'm not sure which type of resources you're looking for.
If you're looking for talks on how companies are creating data pipelines and which technologies they're using, get a Safari Online account and watch the use case talks/tracks from a Strata Conference.
If you're looking for resources on how to learn to use the technologies to create Big Data pipelines, my Professional Data Engineering course is the one.
I'd say set it up as a BOINC node, using its computing power for non-profit research. However, it's likely your company wouldn't be a big fan of paying for the power to run it all day.
It may be cool to let it run for a day or two just to top off the leaderboards though :)
If you have some programming background, preferably Python, I think the Udacity course "Intro to Hadoop and MapReduce" is a great start. It's very digestible as opposed to those marathon Coursera courses.
I've also been looking at a lot of the online masters programs, but I'm starting to think I might do something similar to the link you posted. Udacity has a decent amount of "Data Science" classes that are worth checking out and they're adding a Data Analyst "nanodegree" that looks pretty interesting.
Sounds like a machine learning problem. Does your dataset contain info about pages that those users like?
You could apply supervised learning algorithms for multiclass classification. If your dataset is larger than your memory, I recommend having a look at http://spark.apache.org/ and http://spark.apache.org/docs/latest/mllib-guide.html
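As a rough sketch of what multiclass classification looks like with Spark's DataFrame-based ML API (the input path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("page-likes-classifier").getOrCreate()

# Hypothetical input: one row per user with numeric features and an integer "label"
# identifying the class (e.g. which page category they like most).
df = spark.read.parquet("hdfs:///data/user_page_likes")

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

# LogisticRegression in spark.ml handles multiclass labels via multinomial regression.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=50)
model = lr.fit(train)
model.transform(train).select("label", "prediction").show(5)
```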
Hope it helps.
But Hadoop has had native support for Docker containers for quite some time?
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html
edit: I see they mean the other way around. I may be short-sighted, but I don't see the need for that. What would it give you to run Hadoop in Docker containers?
For those interested, CockroachDB, whose first stable version was recently released, takes the same approach as Spanner and is open source. In fact, it was started by a few folks from Google who worked on Spanner and other related tech.
I know Spanner uses atomic clocks to get tight time bounds on transactions. Since CockroachDB doesn't use atomic clocks, it has to take a slightly different approach:
"While Spanner provides linearizability, CockroachDB’s external consistency guarantee is by default only serializability, though with some features that can help bridge the gap in practice."
"A simple statement of the contrast between Spanner and CockroachDB would be: Spanner always waits on writes for a short interval, whereas CockroachDB sometimes waits on reads for a longer interval."
https://www.cockroachlabs.com/blog/living-without-atomic-clocks/