
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka

This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:

1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark





Presentation Transcript


  1. Apache Spark Tutorial EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  2. What to expect? 1 Why Apache Spark? 2 Spark Features 3 Spark Ecosystem 4 Use Case 5 Hands-On Examples EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  3. Big Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  4. Data Generated Every Minute! EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  5. IoT: 50 Billion Smart Objects By 2020. Roughly 6 things online per person (sensors, smart objects, clustered systems, tablets, laptops, phones). The adoption rate of digital infrastructure is about 5x faster than that of electricity and telephony, an inflection point. World population (billions): 6.307 (2003), 6.721 (2008), 6.894 (2010), 7.347 (2015), 7.83 (2020). EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  6. Big Data Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  7. Big Data Analytics ➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. ➢ Big Data Analytics is of two types: 1. Batch Analytics 2. Real-Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  8. Batch vs Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  9. Batch vs Real Time Analytics Batch Analytics: analytics based on data that clients have stored (via ETL) over a period of time. Real-Time (Stream) Analytics: analytics based on immediate data from clients for instant results, with millisecond latency. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  10. Use Cases of Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  11. Use Cases of Real Time Analytics Banking Government Healthcare Telecommunications Stock Market EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  12. Why Spark When Hadoop Is Already There? EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  13. Batch Processing In Hadoop Hadoop processes data using MapReduce on input collected over a period of time (Day 1, Day 2, Day 3, Day 4 ... Day N), so there is a time lag between when the data arrives and when it is processed. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  14. Real Time Processing In Spark Spark processes input data (Day 1, Day 2, Day 3, Day 4 ... Day N) as it arrives, overcoming the time lag issue of batch processing. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  15. Spark Vs Hadoop Hadoop implements batch processing on Big Data, so it cannot meet real-time use case needs. Our requirements: process data in real time, handle input from multiple sources, be easy to use, and provide faster processing. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  16. Spark Success Story EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  17. Spark Success Story Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience, and sentiment analysis helps in crisis management, service adjusting and target marketing. NYSE: Real Time Analysis of Stock Market Data. Banking: Credit Card Fraud Detection. Genomic Sequencing. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  18. Spark Overview EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  19. What Is Spark?  Apache Spark is an open-source cluster-computing framework for real-time processing developed within the Apache Software Foundation (Figure: Real Time Processing In Spark)  Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, turning serial work into parallel work for a reduction in time (Figure: Data Parallelism In Spark)  It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
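To make the data parallelism above concrete, here is a minimal, self-contained sketch in Scala (the language this training uses) that distributes a collection across cores and reduces it in parallel. The application name and the local master are illustrative choices, not taken from the slides.

```scala
// Minimal sketch, assuming Spark 2.x+ on the classpath; local[*] is for illustration only.
import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkHello")
      .master("local[*]")          // all local cores; replace with a cluster manager URL in production
      .getOrCreate()

    val sc = spark.sparkContext

    // Distribute a collection across the cluster and compute on the partitions in parallel
    val numbers = sc.parallelize(1 to 1000000)
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)

    println(s"Sum of squares: $sumOfSquares")
    spark.stop()
  }
}
```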

  20. Why Spark? Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing. Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities. Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager. Polyglot: can be programmed in Scala, Java, Python and R. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
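As a sketch of the caching point above: the first action on a cached RDD materialises it in memory, and later actions reuse it instead of re-reading the input. The log path and the ERROR/WARN filters are hypothetical.

```scala
// Hedged caching sketch; the HDFS log path is an assumption.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CachingSketch").getOrCreate()
val sc = spark.sparkContext

val lines  = sc.textFile("hdfs:///data/logs/*.log")       // hypothetical input path
val errors = lines.filter(_.contains("ERROR")).cache()    // keep in memory after first computation

println(errors.count())                                   // first action materialises and caches the RDD
println(errors.filter(_.contains("timeout")).count())     // reuses the cached partitions

// persist() lets you pick a storage level explicitly, e.g. spill to disk when memory is tight
val warnings = lines.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)
println(warnings.count())
```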

  21. Using Hadoop Through Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  22. Spark And Hadoop Spark applications can also be run on YARN (Hadoop NextGen). Spark can run on top of the Hadoop Distributed File System (HDFS) to leverage its distributed, replicated storage. Spark can be used along with MapReduce in the same Hadoop cluster, or on its own as a processing framework. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
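A hedged sketch of Spark running over Hadoop storage: the job reads from and writes to HDFS, and the cluster manager (YARN) is supplied at submit time rather than in code. The HDFS paths and the class name are assumptions for illustration.

```scala
// Sketch of a word count over HDFS, submitted to YARN; paths are hypothetical.
import org.apache.spark.sql.SparkSession

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsWordCount")
      .getOrCreate()                 // master is provided by spark-submit (e.g. --master yarn)

    val lines = spark.sparkContext.textFile("hdfs:///user/edureka/input/words.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///user/edureka/output/wordcounts")
    spark.stop()
  }
}

// Submitted to a Hadoop cluster with something like:
//   spark-submit --master yarn --deploy-mode cluster --class HdfsWordCount app.jar
```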

  23. Spark And Hadoop Spark is not intended to replace Hadoop; it can be regarded as an extension to it. MapReduce and Spark are used together, where MapReduce handles batch processing and Spark handles real-time processing. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  24. Spark Features Speed Polyglot Advanced Analytics In-Memory Computation Hadoop Integration Machine Learning EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  25. Spark Features 1 Speed: Spark runs up to 100x faster than MapReduce 2 Polyglot: programming in Scala, Python, Java and R EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  26. Spark Features 3 Lazy Evaluation: Delays evaluation till needed 4 Real time computation & low latency because of in-memory computation EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
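A small sketch of lazy evaluation (feature 3 above): the filter and map below only build up a lineage; nothing runs until an action such as collect() is called.

```scala
// Lazy evaluation sketch; local[*] master is for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyEval").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 10)

// Nothing executes yet: filter and map are lazy transformations that only describe the computation
val evensDoubled = data.filter(_ % 2 == 0).map(_ * 2)

// The action below triggers evaluation of the whole chain
println(evensDoubled.collect().mkString(", "))   // 4, 8, 12, 16, 20
```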

  27. Spark Features 5 Hadoop Integration 6 Machine Learning for iterative tasks EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  28. Spark Ecosystem Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  29. Spark Ecosystem Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components. Spark SQL (SQL): used for structured data; can run unmodified Hive queries on an existing Hadoop deployment. Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data. MLlib (Machine Learning): machine learning libraries built on top of Spark. GraphX (Graph Computation): graph computation engine (similar to Giraph) that combines data-parallel and graph-parallel concepts. SparkR (R on Spark): package for the R language enabling R users to leverage Spark power from the R shell. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  30. Spark Ecosystem DataFrames: the tabular data abstraction introduced by Spark SQL. ML Pipelines: make it easier to combine multiple algorithms or workflows on top of MLlib. Components: Spark Core Engine, Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine Learning), GraphX (Graph Computation), SparkR (R on Spark). EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
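As a sketch of the ML Pipelines idea, the snippet below chains a tokenizer, a feature hasher and a logistic regression into a single pipeline over a DataFrame. The tiny labelled training set is made up purely for illustration.

```scala
// ML Pipeline sketch (spark.ml API); the training rows are invented.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PipelineSketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "hadoop mapreduce batch job", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

// The Pipeline combines the three stages into a single estimator
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
model.transform(training).select("id", "prediction").show()
```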

  31. Spark Core Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  32. Spark Core Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:  Memory management and fault recovery  Scheduling, distributing and monitoring jobs on a cluster  Interacting with storage systems (Figure: Spark Core Job Cluster) EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  33. Spark Architecture Figure: Components of a Spark cluster. A Driver Program hosting the Spark Context talks to a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a cache and one or more tasks. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  34. Spark Streaming Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  35. Spark Streaming  Spark Streaming is used for processing real-time streaming data  It is a useful addition to the core Spark API  Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams  The fundamental stream unit is DStream which is basically a series of RDDs to process the real-time data Figure: Streams In Spark Streaming EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  36. Spark Streaming Figure: Overview Of Spark Streaming. Streaming and static data sources feed Spark Streaming, which works alongside Spark SQL (SQL + DataFrames) and MLlib (machine learning) before writing to data storage systems. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  37. Spark Streaming Figure: Data from a variety of sources (Kafka, Flume, Kinesis, Twitter, HDFS/S3) flows through the streaming engine to various storage systems (HDFS, databases, dashboards). Figure: The incoming input data stream is divided into batches of input data, which the engine turns into batches of processed data. Figure: A DStream is a series of RDDs, one per time interval (RDD @ Time 1, RDD @ Time 2, ...). Figure: A flatMap operation on a lines DStream extracts a words DStream from the input stream. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
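Putting the DStream and flatMap figures above into code, here is a hedged word-count sketch over a socket source. It assumes text is being pushed to localhost:9999 (for example with `nc -lk 9999`); the host, port and batch interval are illustrative.

```scala
// DStream word-count sketch; the socket source is an assumption for local testing.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))        // micro-batches of 1 second

val lines  = ssc.socketTextStream("localhost", 9999)     // DStream of raw lines
val words  = lines.flatMap(_.split(" "))                 // DStream of words, as in the figure above
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // per-batch word counts

counts.print()
ssc.start()              // start receiving and processing
ssc.awaitTermination()
```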

  38. Spark SQL Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  39. Spark SQL Features 1 Spark SQL integrates relational processing with Spark’s functional programming 2 Spark SQL is used for structured/semi-structured data analysis in Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  40. Spark SQL Features 3 Support for various data formats 4 SQL queries can be converted into RDDs for transformations (Figure: invoking RDD 2 computes all partitions of RDD 1 across a shuffle transform) EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
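A brief Spark SQL sketch tying these features together: a DataFrame is loaded from a structured file, queried with SQL, and its underlying RDD lineage inspected. The people.json file and its columns are assumptions, not part of the slides.

```scala
// Spark SQL sketch over a hypothetical JSON file with name/age fields.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()

// Spark SQL reads several formats out of the box: JSON, Parquet, ORC, CSV, JDBC, Hive tables
val people = spark.read.json("people.json")
people.printSchema()

people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

// Under the hood, the optimized plan still executes as RDD transformations
println(adults.rdd.toDebugString)
```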

  41. Spark SQL Features 5 Performance And Scalability EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  42. Spark SQL Features 6 Standard JDBC/ODBC Connectivity 7 User Defined Functions let users define new column-based functions to extend the Spark vocabulary EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
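As a sketch of feature 7, the snippet below defines a column-based UDF and also registers it for use inside SQL strings. The DataFrame contents and column names are made up for illustration.

```scala
// User defined function (UDF) sketch; the sample rows are invented.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("UdfSketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")

// A column-based function that capitalises names
val capitalize = udf((s: String) => s.capitalize)
df.select(capitalize($"name").as("name"), $"age").show()

// UDFs can also be registered for use inside SQL queries
spark.udf.register("capitalize", (s: String) => s.capitalize)
df.createOrReplaceTempView("users")
spark.sql("SELECT capitalize(name) AS name FROM users").show()
```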

  43. Spark SQL Flow Diagram  Spark SQL has the following libraries: 1. Data Source API 2. DataFrame API 3. Interpreter & Optimizer 4. SQL Service  The flow diagram represents a Spark SQL process using all four libraries in sequence: the Data Source API reads the data, the DataFrame API exposes it as named columns, the Interpreter & Optimizer plans the query, and the SQL Service executes it over Resilient Distributed Datasets EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  44. MLlib Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  45. MLlib Machine learning may be broken down into two classes of algorithms: 1. Supervised algorithms use labelled data in which both the input and output are provided to the algorithm, e.g. Classification (Naïve Bayes, SVM) and Regression (Linear, Logistic). 2. Unsupervised algorithms do not have the outputs in advance and are left to make sense of the data without labels, e.g. Clustering (K-Means) and Dimensionality Reduction (Principal Component Analysis, SVD). EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  46. MLlib Techniques There are 3 common categories of machine learning techniques: 1. Classification 2. Clustering 3. Collaborative Filtering EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  47. MLlib - Techniques 1. Classification: a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes. Common use cases include: i) Credit card fraud detection ii) Email spam detection 2. Clustering: an algorithm groups objects into categories by analyzing similarities between input examples (see the clustering sketch below). EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
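For the clustering technique above, here is a minimal K-Means sketch using spark.ml. The two-dimensional points are invented purely to show cluster assignment.

```scala
// K-Means clustering sketch; the feature vectors are made up.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KMeansSketch").master("local[*]").getOrCreate()
import spark.implicits._

val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(points)
model.clusterCenters.foreach(println)
model.transform(points).show()    // assigns each point to one of the two clusters
```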

  48. MLlib - Techniques 3. Collaborative Filtering: collaborative filtering algorithms recommend items (the filtering part) based on preference information from many users (the collaborative part). EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
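A hedged collaborative-filtering sketch with ALS from spark.ml follows; the (user, item, rating) triples are fabricated for illustration, and the rank and regularisation values are arbitrary.

```scala
// ALS recommendation sketch; the ratings data is invented.
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AlsSketch").master("local[*]").getOrCreate()
import spark.implicits._

// (userId, movieId, rating) triples, made up for illustration
val ratings = Seq(
  (0, 10, 4.0f), (0, 11, 1.0f),
  (1, 10, 5.0f), (1, 12, 2.0f),
  (2, 11, 4.5f), (2, 12, 5.0f)
).toDF("userId", "movieId", "rating")

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setRank(5).setMaxIter(5).setRegParam(0.1)

val model = als.fit(ratings)
model.setColdStartStrategy("drop")
model.recommendForAllUsers(2).show(truncate = false)   // top-2 item recommendations per user
```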

  49. GraphX Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

  50. GraphX Graph Concepts A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them: the vertices are the objects and the edges are the relationships between them (e.g. vertices Bob and Carol joined by a "Friends" edge). A directed graph is a graph where the edges have a direction associated with them, e.g. user Bob follows Carol on Twitter. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
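To ground the graph concepts above, here is a small GraphX sketch that builds the Bob-follows-Carol graph from vertex and edge RDDs and prints its triplets. The vertex ids and attribute strings are illustrative choices.

```scala
// Tiny directed graph in GraphX, mirroring the "Bob follows Carol" example.
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GraphXSketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Vertices: (id, user name); Edges: (srcId, dstId, relationship)
val users   = sc.parallelize(Seq((1L, "Bob"), (2L, "Carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows")))

val graph = Graph(users, follows)

graph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}")   // Bob follows Carol
}
println(s"Vertices: ${graph.numVertices}, Edges: ${graph.numEdges}")
```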
