This Edureka Apache Spark Interview Questions and Answers tutorial helps you understand how to tackle questions in a Spark interview and gives you an idea of the questions that can be asked. The questions cover a wide range of Spark components. Below are the topics covered in this tutorial:
1. Basic Questions
2. Spark Core Questions
3. Spark Streaming Questions
4. Spark GraphX Questions
5. Spark MLlib Questions
6. Spark SQL Questions
Spark Interview Questions and Answers | Apache Spark Interview Questions | Spark Tutorial | Edureka
EDUREKA SPARK CERTIFICATION TRAINING: www.edureka.co/apache-spark-scala-training
1. What is Apache Spark?
• Apache Spark is an open-source cluster computing framework for real-time processing
• It has a thriving open-source community and is among the most active Apache projects
2. Compare MapReduce and Spark.
Property | Spark | MapReduce
Difficulty | Simpler to program; doesn't require any abstractions | Difficult to program; needs abstractions
Interactivity | Provides an interactive mode | No built-in interactive mode, except through Pig & Hive
Streaming | Allows real-time streaming and processing of data | Performs batch processing on historical data
Latency | Ensures lower-latency computations by caching partial results across its distributed memory | Completely disk-oriented
Speed | Up to 100 times faster than Hadoop MapReduce, as it keeps the data in memory in RDDs | Slower than Spark
3. Explain key features of Spark.
01 Speed & Performance
02 Polyglot
03 Multiple Formats
04 Lazy Evaluation
4. Explain key features of Spark (continued).
05 Hadoop Integration
06 Real Time Computation
07 Machine Learning
08 Spark GraphX
5. What is YARN? Do you need to install Spark on all nodes of YARN cluster?
[Diagram: input data in CSV, Sequence File, Avro and Parquet formats is stored in HDFS; YARN handles resource allocation; Spark (including Spark Streaming) and, optionally, MapReduce handle processing and produce the output data]
• YARN provides a central resource management platform to deliver scalable operations across the cluster
• YARN is a distributed container manager, whereas Spark is a data processing tool
• Spark runs on top of YARN, so it does not need to be installed on every node of the YARN cluster
6. What file systems does Spark support?
The following three file systems are supported by Spark:
• HDFS
• Amazon S3
• Local File System
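As a minimal sketch of the list above: the storage system is normally selected by the URI scheme of the path passed to `textFile`. The object name, the `namenode` host, and the `example-bucket` bucket are hypothetical placeholders, and the `s3a://` scheme assumes the hadoop-aws connector and credentials are configured. Only the local-file case is actually executed here.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object FileSystemSketch {
  // Hypothetical remote locations, shown only to illustrate the URI schemes.
  val hdfsExample = "hdfs://namenode:8020/data/input.txt"
  val s3Example   = "s3a://example-bucket/data/input.txt"

  // Writes a small temporary file, reads it back through a file:// URI,
  // and returns the number of lines Spark sees.
  def countLocalLines(): Long = {
    val tmp = Files.createTempFile("spark-fs-sketch", ".txt")
    Files.write(tmp, "a\nb\nc".getBytes("UTF-8"))
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("fs-sketch"))
    try sc.textFile(tmp.toUri.toString).count()
    finally {
      sc.stop()
      Files.deleteIfExists(tmp)
    }
  }

  def main(args: Array[String]): Unit = println(countLocalLines())
}
```

Swapping the path for `hdfsExample` or `s3Example` would be the only change needed to read from the other two systems, provided those locations exist and are reachable.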
7. Illustrate some limitations of using Spark.
• Spark utilizes more storage space compared to Hadoop
• Developers need to be careful while running their applications in Spark
• Work must be distributed over multiple clusters
• Spark's "in-memory" capability can become a bottleneck when it comes to cost-efficient processing of big data
• Spark consumes a huge amount of memory when compared to Hadoop
8. List some use cases where Spark outperforms Hadoop in processing.
• Real Time Processing: Spark is preferred over Hadoop for real-time querying of data.
• Stream Processing: For processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
• Big Data Processing: Spark runs up to 100 times faster than Hadoop for processing medium and large-sized datasets.
9. How does Spark use Akka?
• Spark uses Akka for scheduling (this applies to Spark releases before 2.0, which removed the Akka dependency)
• After registering, all the workers request a task from the master
• The master just assigns the task
• Spark then uses Akka for messaging between the workers and the master
10. Name the components of Spark Ecosystem.
• Spark Core Engine
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• SparkR (R on Spark)
11. How can Spark be used alongside Hadoop?
Using Spark and Hadoop together lets us leverage Spark's processing while utilizing the best of Hadoop's HDFS & YARN. Hadoop components can be used alongside Spark:
▪ HDFS
▪ MapReduce
▪ YARN
▪ Batch & Real Time Processing
Spark Core
12. Define RDD.
• RDD stands for Resilient Distributed Dataset
• An RDD is a fault-tolerant collection of operational elements that run in parallel
• Partitioned data in an RDD is immutable and distributed in nature
There are two types of RDD:
• Parallelized Collections: here, the existing RDDs run in parallel with one another
• Hadoop Datasets: they perform functions on each file record in HDFS or other storage systems
13. How do we create RDDs in Spark?
• By parallelizing a collection in your driver program, using SparkContext's 'parallelize' method:
    val DataArray = Array(2,4,6,8,10)
    val DataRDD = sc.parallelize(DataArray)
• By loading an external dataset from external storage such as HDFS, HBase, or a shared file system:
    scala> val distFile = sc.textFile("data.txt")
    distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
14. What is Executor Memory in a Spark application?
• A Spark application has a fixed heap size and a fixed number of cores for a Spark executor
• The heap size is the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag
• Every Spark application will have one executor on each worker node
• The executor memory is basically a measure of how much memory of the worker node the application will utilize
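A minimal sketch of setting the property programmatically through `SparkConf`. The object name, the application name, and the value `2g` are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.SparkConf

object ExecutorMemorySketch {
  // "2g" and the app name are illustrative values only.
  val conf: SparkConf = new SparkConf()
    .setAppName("executor-memory-sketch")
    .set("spark.executor.memory", "2g")

  def main(args: Array[String]): Unit =
    // Reads back the value that executors would be launched with.
    println(conf.get("spark.executor.memory"))
}
```

The same setting can be passed on the command line as `spark-submit --executor-memory 2g`, which keeps the value out of the application code.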
15. Define Partitions in Apache Spark.
• A partition is a smaller, logical division of a large distributed data set
• Partitioning is the process of deriving logical units of data to speed up processing
• By default, Spark tries to read data into an RDD from the nodes that are close to it
• Everything in Spark is a partitioned RDD
• Partitions help parallelize distributed data processing with minimal network traffic
[Diagram: file.xml split into four 128 MB partitions]
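To make the idea of partitions concrete, here is a minimal sketch that asks Spark to split a small range into four partitions and reports how many elements each one holds. The object name, the local master setting, and the choice of four partitions are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns the number of elements held by each of the four partitions.
  def run(): Array[Int] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partition-sketch"))
    try {
      sc.parallelize(1 to 100, numSlices = 4)
        .glom()          // turns each partition into a single array
        .map(_.length)   // size of each partition
        .collect()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints "25, 25, 25, 25"
}
```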
16. What operations does RDD support?
An RDD is a distributed collection of objects. RDDs are immutable (read-only) data structures. RDD operations fall into two groups:
• Transformations: create a new RDD from an existing RDD, like map, reduceByKey and filter. Transformations are executed on demand
• Actions: return the final results of RDD computations. An action triggers execution, carries out all intermediate transformations, and returns the final result
17. What do you understand by Transformations in Spark?
• Transformations are functions applied on an RDD, resulting in another RDD
• A transformation is lazily evaluated: it does not execute until an action occurs
    val rawData = sc.textFile("path to/movies.txt")
    val moviesData = rawData.map(x => x.split("\t"))
Here the rawData RDD is transformed into the moviesData RDD.
Example: map() and filter(). map() applies the function passed to it to each element of the RDD and results in another RDD. filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument.
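The lazy behaviour described above can be observed with a small sketch. It assumes Spark 2.0 or later (for `longAccumulator`) and a local master; the object and variable names are hypothetical. The accumulator counts how many elements `map()` has processed, and it stays at zero until the `collect()` action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyTransformationSketch {
  // Returns (elements mapped before any action, result of the action).
  def run(): (Long, Array[Int]) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lazy-sketch"))
    try {
      val mapped = sc.longAccumulator("elements mapped")
      val doubled = sc.parallelize(1 to 10).map { x => mapped.add(1); x * 2 }
      val multiplesOfFour = doubled.filter(_ % 4 == 0)
      val before = mapped.sum                 // no action yet, so nothing has run
      val result = multiplesOfFour.collect()  // the action triggers map and filter
      (before, result)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (before, result) = run()
    println(before)                 // prints 0
    println(result.mkString(", "))  // prints "4, 8, 12, 16, 20"
  }
}
```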
18. Define functions of Spark Core.
▪ Spark Core is the distributed execution engine for large-scale parallel and distributed data processing
▪ The Java, Scala, and Python APIs offer a platform for distributed ETL application development
▪ Additional libraries, built atop the core, allow diverse workloads for streaming, SQL, & machine learning
Responsibilities:
▪ Memory management and fault recovery
▪ Scheduling, distributing and monitoring jobs on a cluster
▪ Interacting with storage systems
19. What do you understand by Pair RDD?
• Special operations can be performed on RDDs in Spark using key/value pairs, and such RDDs are referred to as Pair RDDs
• Pair RDDs allow users to access each key in parallel
• Apache defines the PairRDDFunctions class as:
    class PairRDDFunctions[K, V] extends Logging with HadoopMapReduceUtil with Serializable
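A minimal sketch of a key/value operation on a Pair RDD, using `reduceByKey` to sum the values of each key. The sample data, object name, and local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddSketch {
  // Sums the quantities per fruit and returns the totals as a map.
  def run(): scala.collection.Map[String, Int] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pair-rdd-sketch"))
    try {
      // An RDD of tuples is treated as a Pair RDD: key = fruit, value = quantity.
      val sales = sc.parallelize(
        Seq(("apple", 3), ("pear", 2), ("apple", 5), ("pear", 1)))
      sales.reduceByKey(_ + _).collectAsMap()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    run().toSeq.sorted.foreach(println) // prints (apple,8) then (pear,3)
}
```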
20. What is RDD Lineage?
• Spark does not replicate data in memory by default, so if any data is lost, it is rebuilt using RDD lineage
• RDD lineage is the process that reconstructs lost data partitions
• The best part is that an RDD always remembers how to build itself from other datasets
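The lineage an RDD remembers can be inspected with `toDebugString`. The sketch below is an assumption-laden illustration (hypothetical object name, local master); the exact text of the output varies between Spark versions, so only the RDD type names are checked.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds a small chain of transformations and returns its lineage description.
  def run(): String = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))
    try {
      val base = sc.parallelize(1 to 10)
      val derived = base.map(_ * 2).filter(_ > 5)
      // Lists the chain of parent RDDs that Spark would replay to rebuild
      // a lost partition of `derived`.
      derived.toDebugString
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```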
21. What is Spark Driver?
• Spark Driver is the program that runs on the master node and declares transformations and actions on data RDDs
• The driver creates the SparkContext, connected to a given Spark Master
• The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs
22. Name types of Cluster Managers in Spark.
• YARN: Responsible for resource management in Hadoop.
• Standalone: A basic manager to set up a cluster.
• Apache Mesos: Generalized, commonly used cluster manager that also runs Hadoop MapReduce and other applications.
23. What do you understand by worker node?
• Worker node (slave) refers to any node that can run the application code in a cluster
• The master node assigns work, and the worker node actually performs the assigned tasks
• Worker nodes process the data stored on the node and report their resources to the master
• Based on resource availability, the master schedules tasks
24. What is a Sparse Vector?
• A sparse vector has two parallel arrays, one for indices and the other for values
• These vectors store only the non-zero entries to save space
    Vectors.sparse(7,Array(0,1,2,3,4,5,6),Array(1650d,50000d,800d,3.0,3.0,2009,95054))
The above sparse vector can be used instead of dense vectors.
    val myHouse = Vectors.dense(4450d,2600000d,4000d,4.0,4.0,1978.0,95070d,1.0,1.0,1.0,0.0)
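The space saving is easiest to see when most entries are zero. The following sketch uses made-up values and a hypothetical object name, and assumes the `org.apache.spark.mllib.linalg` package is on the classpath. It builds the same seven-element vector in sparse and dense form.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SparseVectorSketch {
  // Size 7, with non-zero entries only at indices 1 and 4.
  val sparse: Vector = Vectors.sparse(7, Array(1, 4), Array(3.0, 9.5))

  // The same logical vector with every zero stored explicitly.
  val dense: Vector = Vectors.dense(0.0, 3.0, 0.0, 0.0, 9.5, 0.0, 0.0)

  def main(args: Array[String]): Unit = {
    println(sparse.toArray.sameElements(dense.toArray)) // prints true
    println(sparse.numNonzeros)                         // prints 2
  }
}
```

The sparse form stores two indices and two values instead of seven values; the gain grows with the proportion of zero entries.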