Biggest issue I see is that you are using format("memory"), which is traditionally only for debugging purposes because it stores the output in memory on the driver node. So I would imagine that after running for a long time, memory fills up and causes expensive JVM garbage collection. Check out the output sinks: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
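For example, here's a minimal sketch of writing to a file sink instead (the source, schema, and paths are assumptions on my part), so results land on disk rather than accumulating on the driver:

```python
# Rough sketch: write to a durable file sink instead of format("memory").
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("stream-to-files").getOrCreate()

schema = StructType([StructField("id", IntegerType()), StructField("value", StringType())])

# any streaming source works; a directory of incoming JSON files is assumed here
events = spark.readStream.schema(schema).json("/data/incoming")

query = (events.writeStream
               .format("parquet")                          # file sink, not the driver's heap
               .option("path", "/data/stream-output")      # assumed output path
               .option("checkpointLocation", "/data/checkpoints")
               .outputMode("append")
               .start())

query.awaitTermination()
```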
Also note that since Spark 2, a DataFrame is a Dataset[Row].
Here's a nice presentation on how Spark DataFrame/SQL API optimizes your processing before eventually computing with RDDs: https://www.slideshare.net/Sigmasoftware/the-internals-of-spark-sql-joins-dmytro-popovich
"DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row"
Can someone explain what the difference is between a DataFrame and a Dataset?
Edit: Found something, but it's still unclear to me: https://stackoverflow.com/questions/35424854/what-is-the-difference-between-spark-dataset-and-rdd
Theoretically yes, though I have not done it. Spark allows connections to databases using JDBC: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
That being said, I think there's a lot to consider here. If you're doing this for things like the word counts you mentioned above, then Spark could be useful. If there are only a few columns that you're querying over, maybe it's worth doing all of your work on the database side, for example adding indices to those columns to help speed up your queries.
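If you do end up pulling the data into Spark, the JDBC read itself is straightforward. A rough sketch (the connection details are placeholders, and the JDBC driver jar needs to be on the classpath):

```python
# Hypothetical example: read one table from a Postgres database over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder URL
      .option("dbtable", "articles")                        # placeholder table
      .option("user", "spark_user")
      .option("password", "secret")
      .load())

# once loaded, do the analysis in Spark
df.groupBy("author").count().show()
```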
What do you mean by this?
>I can't dump my data into text files because I need to be able to retrieve and update individual rows daily
Spark is for analysis. I wouldn't recommend updating your database from Spark; that can get very messy. If you're using Spark for its intended purpose (data analysis), then I don't think you should shy away from dumping your data out daily to text. You should play around with what you get from the JDBC connection. If it meets your requirements, great. If not, I think you should refine your objective and remove some of your constraints (e.g. no text-file dump).
Are you using spark-shell or spark-submit?
If you're using spark-shell you can add
--driver-memory 8g --executor-memory 8g
to your call. For spark-submit, you can edit the conf file and set those properties there, or you can pass in a --conf
argument. There's also the ability to set it in the code itself. There are a few examples here: http://spark.apache.org/docs/latest/configuration.html. I could be wrong, but I think the default is only about 256m if you don't specify it.
EDIT: added further explanation
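For the "set it in the code" option, here's a sketch. One caveat I'd flag: spark.driver.memory generally has to be set before the driver JVM starts (spark-submit flag or conf file), so setting it in code mainly matters for executor-side properties.

```python
# Sketch of setting memory-related properties programmatically (values are examples).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-config-example")
         .config("spark.executor.memory", "8g")
         .config("spark.driver.memory", "8g")  # only takes effect if the driver JVM hasn't started yet
         .getOrCreate())
```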
The classic tutorial for big data stuff is "WordCount", where you read a large stream of text (get some from Project Gutenberg) and then output a stream of tuples of (word, count), where the count is the total number of occurrences in the whole collection.
It can be expressed very succinctly with Spark. As a learning experience, it's really worth taking the example code and building a job that performs it, then running it either locally or on EMR or something so you get a feel for how it works.
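For reference, a PySpark version of WordCount is only a few lines (input and output paths are made up):

```python
# Classic WordCount: read text files, split into words, count occurrences.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (sc.textFile("gutenberg/*.txt")                # assumed input directory
            .flatMap(lambda line: line.split())
            .map(lambda word: (word.lower(), 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("wordcount-output")               # assumed output directory
```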
Deep Learning and Kubernetes. See here: https://www.slideshare.net/Lightbend/a-glimpse-at-the-future-of-apache-spark-30-with-deep-learning-and-kubernetes
Could you explain your question better?
What's the goal here?
Does this component do what you want? elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark.
Elasticsearch may be a good candidate.
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html#spark-pushdown
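A rough PySpark sketch of reading through that connector (the host, index name, and field are assumptions on my part, and the es-hadoop jar has to be on the classpath):

```python
# Hypothetical example: load an Elasticsearch index as a DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read").getOrCreate()

logs = (spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")   # placeholder ES host
        .option("es.port", "9200")
        .load("logs"))                     # placeholder index name

# simple filters like this one can be pushed down to Elasticsearch (see the pushdown docs above)
logs.filter(logs["status"] == 500).show()
```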
I meant to paste this one also: http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.option.html#pyspark.sql.DataFrameReader.option :)
Please help, I'm dying here lol, been stuck for 2 days.
To go from rows to columns, you can groupBy on user_id and game_id, then use collect_list (pyspark docs) in the aggregation to collect card_face, card_suit, etc. into arrays. Then use the column getItem method (docs) to create a column from the first/second element of each array.
import pyspark.sql.functions as F

cols_to_rowify = ["card_face", "card_suit"]
array_cols = [F.collect_list(x).alias(x + "_array") for x in cols_to_rowify]
item_cols = [F.col(x + "_array").getItem(0).alias(x + "_1") for x in cols_to_rowify]

table\
    .groupBy("user_id", "game_id")\
    .agg(*array_cols)\
    .select(
        "user_id",
        "game_id",
        *item_cols)

will pull out the first item in card_face and card_suit; getItem(1) would get the second. If you have cases where the arrays could be of variable length, then you'd probably want something like

F.when(F.size(F.col(array_col)) >= 4, F.col(array_col).getItem(3)).otherwise("missing value")

inside the select, where array_col is one of the "_array" column names.
So it’s not really a connection between IntelliJ and HDFS. It’s a connection between the Spark client on my machine and HDFS. My setup is a little unique since our cluster is a MapR distro of Hadoop. This doc from Apache explains how you can submit a local JAR to a remote cluster, I think.
Essentially the process is:
1. Write code in IntelliJ
2. Build the JAR with SBT or Maven
3. Submit the JAR to YARN with the Spark client on my machine
"Several external tools can be used to help profile the performance of Spark jobs:
Cluster-wide monitoring tools, such as Ganglia, can provide insight into overall cluster utilization and resource bottlenecks. For instance, a Ganglia dashboard can quickly reveal whether a particular workload is disk bound, network bound, or CPU bound. OS profiling tools such as dstat, iostat, and iotop can provide fine-grained profiling on individual nodes. JVM utilities such as jstack for providing stack traces, jmap for creating heap-dumps, jstat for reporting time-series statistics and jconsole for visually exploring various JVM properties are useful for those comfortable with JVM internals." [Ref. Apache Spark - Official Documentation - Advanced Instrumentation http://spark.apache.org/docs/latest/monitoring.html#advanced-instrumentation]
Okay, yeah, it seems my sbt file was totally wrong; I was running Scala 2.11 all along. I tried updating the dependencies to use 2.11 builds of everything, but now I just get an error with spark-csv:
name := "Averager" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0" libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.5.0"
But now I get:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark-csv. Please find packages at http://spark.apache.org/third-party-projects.html at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
There's a lot we could write here, but here's some initial guidance. (I'm assuming you are literally working on actual continuous, ongoing streams of video, as opposed to, e.g., analysis of large video files.)
For your situation, I recommend you use either Python or Scala, whichever you are most comfortable with right now. Scala lets you do a few more things, and with better performance, but the key for you personally is avoiding spending time on anything not absolutely necessary, at least right now.
Focus first on learning RDDs. Your best friend is the Spark Programming guide:
http://spark.apache.org/docs/latest/programming-guide.html
Invest particular effort in the section "Resilient Distributed Datasets (RDDs)". (You can skip over Shared Variables for now, for example.) That done, start learning about Spark Streaming:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
You can put off learning about DataFrames for now. Generally speaking, I advise the opposite: most people should start with DataFrames and then go down to RDDs only if they need to. However, Spark Streaming is currently built to operate on RDDs, so for your goal it's better to focus there. Just learning RDDs + streaming will keep you busy :) Eventually I do advise moving to DataFrames (or maybe their generalization, Datasets, if you're using Scala), but that is a low priority given you mainly need to determine whether using Spark at all is going to be viable for what you need.
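To give a flavor of what RDD-based streaming code looks like, here's a minimal DStream sketch (the socket source and word-count logic are just placeholders for whatever your video pipeline feeds in):

```python
# Each micro-batch of a DStream arrives as an RDD, so RDD skills carry over directly.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-word-count")
ssc = StreamingContext(sc, 5)                        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()                                      # print a few results per batch

ssc.start()
ssc.awaitTermination()
```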
By the way: In Spark 2.0, which is just around the corner, there is an initial release of something called "Structured Streaming". It will still be a bit experimental and incomplete compared to standard streaming, though, and you won't be able to find online answers to the questions you'll have until it's been out a while. So stick with standard streaming for now.
Good luck!
[For posterity: advice above is current for Spark 1.6, and very likely, Spark 2.0 as well]
Perhaps you're looking for this: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
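Enabling it is mostly configuration. A rough sketch (the values are placeholders, and the external shuffle service also has to be running on the cluster nodes):

```python
# Example configuration for dynamic executor allocation (numbers are arbitrary).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-allocation-example")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")      # required by dynamic allocation
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .getOrCreate())
```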
I'm a total noob at Hadoop, Spark, and big data in general. However, I have been doing research in Spark and Hadoop for a while now.
From what I see, Hadoop is harder to program in. This is an official example: it simply searches for a pattern in all the files inside a folder and, if it finds matches, writes the results to a file in another folder. https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/Grep.java
However, in the Spark programming guide, all the code looks much, much shorter: http://spark.apache.org/docs/latest/programming-guide.html
...even when it's still Java (there are also Scala, Python, and some R APIs).
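For comparison, here's a rough PySpark sketch of the same grep-style job (the paths and regex are made up); it fits in a handful of lines:

```python
# Count occurrences of a regex pattern across all files in a folder,
# then write the results, sorted by frequency, to another folder.
import re
from pyspark import SparkContext

sc = SparkContext(appName="Grep")
pattern = re.compile(r"dfs[a-z.]+")        # placeholder pattern

(sc.textFile("input/*")                    # placeholder input folder
   .flatMap(lambda line: pattern.findall(line))
   .map(lambda match: (match, 1))
   .reduceByKey(lambda a, b: a + b)
   .sortBy(lambda kv: kv[1], ascending=False)
   .saveAsTextFile("grep-output"))         # placeholder output folder
```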
Thanks for the suggestion. After hours of research yesterday, I found that there are two major ways to build a recommendation system:
1) Collaborative filtering
2) Content-based recommender system
What I'm looking for is the content-based recommender system. Matrix factorization is usually useful for the former. I'm not looking for empty attributes to be filled in; I'm looking for recommendations on a new set of (complete) data, based on an old set of data.
I found that Naive Bayes classifiers are a great way to do the latter. Luckily, MLlib has an implementation.
I'm passing each of my songs' attributes ("danceability", "energy", "key", "loudness", "tempo", "time_signature") as a dense vector. Training the NaiveBayes model on the new set of data and testing it with the old set did the trick for me.
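For anyone else trying this, here's a minimal sketch of the idea (the label, toy values, and variable names are mine, not from the original post; note that MLlib's NaiveBayes expects non-negative feature values):

```python
# Train MLlib NaiveBayes on labeled song-attribute vectors, then score a new song.
from pyspark.sql import SparkSession
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

spark = SparkSession.builder.appName("song-recommender-sketch").getOrCreate()

feature_cols = ["danceability", "energy", "key", "loudness", "tempo", "time_signature"]

# toy labeled data: the last column is a 0/1 label (e.g. whether the user liked the song)
labeled_songs = spark.createDataFrame(
    [(0.7, 0.8, 5.0, 0.3, 120.0, 4.0, 1.0),
     (0.2, 0.1, 2.0, 0.1, 80.0, 3.0, 0.0)],
    feature_cols + ["liked"])

def to_labeled_point(row):
    return LabeledPoint(row["liked"], Vectors.dense([row[c] for c in feature_cols]))

model = NaiveBayes.train(labeled_songs.rdd.map(to_labeled_point))

# score a previously unseen song
print(model.predict(Vectors.dense([0.6, 0.7, 7.0, 0.2, 110.0, 4.0])))
```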
I would recommend using the Big Data Interview Guide app, available on the Play store. It has 1000+ interview questions across the big data space: programming, scenario-based, fundamentals, performance tuning.
https://play.google.com/store/apps/details?id=com.software.navnath.bigdatainterviewquestionbank