Introduction to Spark

Presentation Transcript


  1. Introduction to Spark. Matei Zaharia, spark-project.org

  2. What is Spark? • Fast and expressive cluster computing system interoperable with Apache Hadoop • Improves efficiency through in-memory computing primitives and general computation graphs (up to 100× faster in memory, 2-10× on disk) • Improves usability through rich APIs in Scala, Java, and Python, and an interactive shell (often 5× less code)

  3. Project History • Started in 2009, open sourced in 2010 • 25 companies now contributing code • Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, … • Entered the Apache incubator in June 2013

  4. A General Stack • Spark is the basis of a wide set of projects in the Berkeley Data Analytics Stack (BDAS): Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning), and more, all running on Spark • More details: amplab.berkeley.edu

  5. This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification

  6. Why a New Programming Model? • MapReduce greatly simplified big data analysis • But as soon as it got popular, users wanted more: • More complex, multi-pass analytics (e.g. ML, graph) • More interactive ad-hoc queries • More real-time stream processing • All 3 need faster data sharing across parallel jobs

  7. Data Sharing in MapReduce • Iterative jobs: HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → … • Interactive queries: each of query 1, 2, and 3 re-reads the input from HDFS to produce its result • Slow due to replication, serialization, and disk I/O

  8. Data Sharing in Spark • Iterative jobs: Input → iter. 1 → iter. 2 → …, with intermediate data held in distributed memory • Interactive queries: one-time processing loads the input into distributed memory; queries 1, 2, and 3 all read from there • 10-100× faster than network and disk

  9. Spark Programming Model • Key idea: resilient distributed datasets (RDDs) • Distributed collections of objects that can be cached in memory across cluster • Manipulated through parallel operators • Automatically recomputed on failure • Programming interface • Functional APIs in Scala, Java, Python • Interactive use from Scala shell

  10. Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns:

      lines = spark.textFile("hdfs://...")                    # base RDD
      errors = lines.filter(lambda x: x.startswith("ERROR"))  # transformed RDD
      messages = errors.map(lambda x: x.split('\t')[2])
      messages.cache()                                        # keep in memory across queries

      messages.filter(lambda x: "foo" in x).count()           # action: driver ships tasks to workers
      messages.filter(lambda x: "bar" in x).count()           # second query reuses the cache

  • (Diagram: the driver sends tasks to workers; each worker reads an HDFS block, builds its cache partition, and returns results) • Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data) • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

  11. Fault Tolerance • RDDs track lineage info to rebuild lost data:

      file.map(lambda rec: (rec.type, 1)) \
          .reduceByKey(lambda x, y: x + y) \
          .filter(lambda pair: pair[1] > 10)

  • (Diagram: Input file → map → reduce → filter; any lost partition can be recomputed by rerunning this lineage on the input)


  13. Demo

  14. Example: Logistic Regression • (Chart: Hadoop takes 110 s / iteration; Spark takes 80 s for the first iteration, then 1 s for further iterations once the data is cached)
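
  To show where those numbers come from, here is a minimal PySpark sketch of the iterative computation, modeled on the classic Spark logistic regression example; train, points, and dim are illustrative names, not from the slide:

      import numpy as np

      def train(points, dim, iterations=10):
          # points: RDD of (features, label) pairs, with label in {-1.0, +1.0}
          points.cache()              # caching is what makes iterations after the first ~1 s
          w = np.zeros(dim)           # initial weight vector
          for _ in range(iterations):
              # sum of per-point logistic gradients, computed in parallel
              grad = points.map(
                  lambda p: (1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0]))) - 1.0) * p[1] * p[0]
              ).reduce(lambda a, b: a + b)
              w -= grad
          return w

  Because only the small weight vector changes between iterations, each pass after the first reads the cached dataset from memory instead of HDFS.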

  15. Spark in Scala and Java

      // Scala:
      val lines = sc.textFile(...)
      lines.filter(x => x.contains("ERROR")).count()

      // Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        Boolean call(String s) { return s.contains("ERROR"); }
      }).count();

  16. Supported Operators • map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin • reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip • sample, take, first, partitionBy, mapWith, pipe, save, …
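
  As a hedged illustration, a few of these operators chained together in PySpark (the data and the results in comments are made up for the example):

      pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
      ages  = sc.parallelize([("a", 30), ("b", 25)])

      pairs.reduceByKey(lambda x, y: x + y).collect()          # e.g. [("a", 4), ("b", 2)]
      pairs.join(ages).collect()                               # key-matched pairs, e.g. ("a", (1, 30))
      pairs.map(lambda kv: kv[1]).reduce(lambda x, y: x + y)   # 6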

  17. Spark Community • 1200+ meetup members • 90+ devs contributing • 25 companies contributing

  18. This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification

  19. Software Components • Spark client is a library in the user program (1 instance per app) • Runs tasks locally or on a cluster: Mesos, YARN, or standalone mode • Accesses storage systems via the Hadoop InputFormat API • Can use HBase, HDFS, S3, … • (Diagram: your application holds a SparkContext, which talks to a cluster manager or local threads; workers run Spark executors that read from HDFS or other storage)
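
  A minimal sketch of how an application creates this client from PySpark; the master URL and path are placeholders, and which cluster managers are available depends on the deployment:

      from pyspark import SparkContext

      sc = SparkContext(master="local[4]", appName="MyApp")  # run tasks on 4 local threads
      # master="spark://host:7077" would target a standalone cluster; Mesos and YARN use their own URLs
      lines = sc.textFile("hdfs://...")   # storage access goes through the Hadoop InputFormat API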

  20. Task Scheduler • General task graphs • Automatically pipelines functions • Data locality aware • Partitioning aware to avoid shuffles • (Diagram: a task graph of RDDs A-F cut into three stages at shuffle boundaries such as groupBy and join; map and filter are pipelined within a stage, and already-cached partitions let their stages be skipped)

  21. Advanced Features • Controllable partitioning • Speed up joins against a dataset • Controllable storage formats • Keep data serialized for efficiency, replicate to multiple nodes, cache on disk • Shared variables: broadcasts, accumulators • See online docs for details!
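
  A hedged sketch of these knobs in the classic PySpark RDD API (dataset and values are illustrative):

      from pyspark import StorageLevel

      pairs = sc.parallelize([("a", 1), ("b", 2)]).partitionBy(8)  # pre-partition to speed up later joins
      pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)              # keep serialized, spill to disk if needed

      lookup = sc.broadcast({"a": 10})   # broadcast: read-only value shipped once to each worker
      counter = sc.accumulator(0)        # accumulator: write-only counter aggregated on the driver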

  22. This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification

  23. Shark: Hive on Spark • Columnar SQL analytics engine for Spark • Supports both SQL and complex analytics • Up to 100× faster than Apache Hive • Compatible with Apache Hive: HiveQL, UDF/UDAF, SerDes, scripts • Runs on existing Hive warehouses • In use at Yahoo! for fast in-memory OLAP

  24. Hive Architecture • (Diagram: clients (CLI, JDBC) talk to the Driver, which runs the SQL Parser, Query Optimizer, and Physical Plan Execution on MapReduce over HDFS, consulting the Metastore for table metadata)

  25. Shark Architecture • (Diagram: the same structure as Hive (clients, Metastore, Driver with SQL Parser, Query Optimizer, and Physical Plan Execution) plus a Cache Manager, executing on Spark instead of MapReduce, over HDFS) [Engle et al, SIGMOD 2012]

  26. Performance • (Chart: benchmark on 1.7 TB of warehouse data on 100 EC2 nodes)

  27. Spark Integration • Unified system for SQL, graph processing, machine learning • All share the same set of workers and caches

  28. Other Stack Projects • Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

      sc.twitterStream(...)
        .flatMap(_.getText.split(" "))
        .map(word => (word, 1))
        .reduceByWindow("5s", _ + _)

  • MLlib: library of high-quality machine learning algorithms (out since 0.8)

  29. This Talk • Spark introduction & use cases • Implementation • Other stack projects • The power of unification

  30. The Power of Unification • Spark's speed and programmability are great, but what makes the project unique? • Unification: multiple programming models (SQL, stream, graph, …) on the same engine • This has powerful benefits, for the engine and for users • (Diagram: Shark, Streaming, GraphX, MLbase, and more, all layered on Spark)

  31-34. Code Size • (Chart, built up across four slides: non-test, non-example source lines, adding Shark, Streaming, and GraphX in turn alongside Spark)

  35. Performance • (Charts: performance comparisons for SQL, Streaming, and Graph workloads)

  36. Performance • One benefit of unification is that optimizations we make for one project affect all the others! • E.g. scheduler optimizations in Spark Streaming (0.6) resulted in 2x faster Shark queries

  37. What it Means for Users • Separate frameworks: each step reads from and writes back to HDFS (HDFS read → ETL → HDFS write → HDFS read → train → HDFS write → HDFS read → query → HDFS write → …) • Spark: one HDFS read, then ETL, train, and query share data in memory (see the sketch below)
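
  A hedged PySpark sketch of that unified pipeline; parse_point and train are hypothetical helpers standing in for the ML step:

      raw = sc.textFile("hdfs://...")                           # the single HDFS read
      cleaned = raw.filter(lambda l: l.strip() != "").cache()   # ETL output stays in memory
      model = train(cleaned.map(parse_point))                   # training reuses the cached data
      errors = cleaned.filter(lambda l: "ERROR" in l).count()   # ad-hoc query without re-reading HDFS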

  38. Getting Started • Visit spark-project.org for videos & tutorials • Easy to run in local mode, on private clusters, or on EC2 • Built-in ML library since Spark 0.8 • Conference December 2nd: www.spark-summit.org

  39. Conclusion • Big data analytics is evolving to include: • More complex analytics (e.g. machine learning) • More interactive ad-hoc queries • More real-time stream processing • Spark is a fast platform that unifies these apps • More info: spark-project.org
