
CS 294-110: Project Suggestions

Presentation Transcript


  1. CS 294-110: Project Suggestions September 14, 2015 Ion Stoica and Ali Ghodsi (http://www.cs.berkeley.edu/~istoica/classes/cs294/15/)

  2. Projects
  • This is a project-oriented class
  • Reading papers should be a means to a great project, not a goal in itself!
  • Strongly prefer groups of two students
  • Today, I’ll present some suggestions
  • But, you are free to come up with your own proposal
  • Main goal: just do a great project

  3. Projects
  • Many projects around Spark
  • Local expertise
  • Great platform to disseminate your work
  • A short review of Spark, based on a log-mining example, to provide context

  4. Spark Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns.

  5. Setup: a Driver program coordinates three Workers. [Diagram: Driver connected to three Workers]

  6. lines = spark.textFile("hdfs://...")

  7. lines is the base RDD, read from HDFS.

  8. errors = lines.filter(lambda s: s.startswith("ERROR"))

  9. errors is a transformed RDD, derived from lines.

  10. messages = errors.map(lambda s: s.split("\t")[2]); messages.cache(). First query: messages.filter(lambda s: "mysql" in s).count()

  11. count() is an action; it triggers execution of the transformations built up so far.

  12. Each Worker holds one HDFS block of the log (Block 1, Block 2, Block 3).

  13. The Driver sends tasks to the Workers.

  14. Each Worker reads its HDFS block.

  15. Each Worker processes its block and caches the resulting messages partition in memory (Cache 1, Cache 2, Cache 3).

  16. Each Worker returns its partial result to the Driver.

  17. Second query: messages.filter(lambda s: "php" in s).count()

  18. The Driver again sends tasks to the Workers.

  19. This time each Worker processes data directly from its cache; no HDFS reads are needed.

  20. The Workers return their results to the Driver.

  21. Cache your data → faster results. Full-text search of Wikipedia: 60 GB on 20 EC2 machines; ~0.5 sec from memory vs. ~20 sec from disk.
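Put together, the whole example fits in a short PySpark script. This is a minimal sketch: the HDFS path is a placeholder (as on the slides) and the log is assumed to be tab-separated with the message in the third field.

    # Minimal end-to-end version of the log-mining example.
    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")

    lines = sc.textFile("hdfs://...")                         # base RDD (placeholder path)
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformation
    messages = errors.map(lambda s: s.split("\t")[2])         # transformation
    messages.cache()                                          # keep the partitions in memory

    # First action: triggers reading the HDFS blocks, filtering, mapping, and caching.
    print(messages.filter(lambda s: "mysql" in s).count())

    # Second action: answered from the cached partitions, with no further HDFS reads.
    print(messages.filter(lambda s: "php" in s).count())

    sc.stop()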

  22. The Spark stack. [Diagram: Spark Core at the base, drawing on external Data Sources ({JSON}, …); Spark SQL, Spark Streaming, MLlib, and GraphX sit on top, with DataFrames and ML Pipelines layered above, and a "?" marking room for new components.]

  23. Pipeline Shuffle
  • Problem
    • Today, shuffle senders write their output to local storage, and only then is the data shuffled to the receivers (see the sketch below for where a shuffle arises)
    • Shuffle is often the most expensive communication pattern and sometimes dominates job completion time
  • Project
    • Start sending shuffle data as it is being produced
  • Challenge
    • How do you do recovery and speculation?
    • You could also persist the data as it is being sent, but that is still not easy…
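For context, a minimal sketch of a job with a shuffle stage, assuming a placeholder input path and tab-separated records: reduceByKey forces a shuffle, and today the map-side output is materialized before the receivers fetch it; the project would start streaming that data as it is produced.

    # Word-count-style job whose reduceByKey introduces a shuffle boundary.
    from pyspark import SparkContext

    sc = SparkContext(appName="ShuffleBoundary")

    pairs = (sc.textFile("hdfs://...")                       # placeholder path
               .map(lambda line: (line.split("\t")[0], 1)))  # (key, 1) pairs
    counts = pairs.reduceByKey(lambda a, b: a + b)           # shuffle happens here
    print(counts.count())

    sc.stop()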

  24. Fault Tolerance & Performance Tradeoffs
  • Problem
    • Maintaining lineage in Spark provides fault recovery, but comes at a performance cost
    • E.g., it is hard to support very small tasks due to the lineage overhead
  • Project
    • Evaluate how much you can speed up Spark by ignoring fault tolerance
    • Can generalize to other cluster computing engines
  • Challenge
    • What do you do for large jobs, and how do you treat stragglers?
    • Maybe a hybrid method, i.e., just don’t do lineage for small jobs? Need to figure out when a job is small…
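The lineage being paid for is visible from the RDD API itself: toDebugString() prints the chain of parent RDDs that Spark keeps for recovery. The tiny dataset and transformations below are just for illustration.

    # Inspect the lineage Spark records for fault recovery.
    from pyspark import SparkContext

    sc = SparkContext(appName="LineageDemo")

    rdd = (sc.parallelize(range(1000))
             .map(lambda x: (x % 10, x))
             .reduceByKey(lambda a, b: a + b))

    print(rdd.toDebugString())   # shows the chain of parent RDDs (the lineage)

    sc.stop()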

  25. (Eliminating) Scheduling Overhead
  • Problem: with Spark, the driver schedules every task
    • Latency is 100s of ms or higher; cannot run millisecond-scale queries
    • The driver can become a bottleneck
  • Project
    • Have workers perform scheduling
  • Challenge
    • How do you handle faults?
    • Maybe some hybrid solution across driver and workers?

  26. Cost-based Optimization in Spark SQL
  • Problem
    • Spark employs a rule-based query planner (Catalyst)
    • This limits optimization opportunities, especially when operator performance varies widely with the input data
    • E.g., join and selection on skewed data
  • Project: build a cost-based optimizer
    • Estimate operators’ costs, and use these costs to compute the query plan (a hand-made example of such a decision is sketched below)
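As an example of the kind of decision a cost-based optimizer would automate: choosing a broadcast join for a small table instead of shuffling both sides. The broadcast() hint is a real Spark SQL function; the SparkSession API (which post-dates this deck), the table paths, and the column names are assumptions for illustration.

    # Today the user supplies the hint; a cost-based planner would estimate the
    # size of `dims` and pick the broadcast plan (or not) by itself.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("JoinChoice").getOrCreate()

    facts = spark.read.parquet("hdfs://.../facts")   # large, possibly skewed (hypothetical path)
    dims = spark.read.parquet("hdfs://.../dims")     # small lookup table (hypothetical path)

    joined = facts.join(broadcast(dims), "key")      # avoid shuffling the large side
    print(joined.where(joined["value"] > 0).count())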

  27. Streaming Graph Processing
  • Problem
    • With GraphX, queries can be fast, but updates are typically applied in batches (slow)
  • Project
    • Incrementally update graphs
    • Support window-based graph queries
  • Note
    • Discuss with Anand Iyer and Ankur Dave if interested

  28. Streaming ML
  • Problem
    • Today, ML algorithms are typically run on static data
    • The model cannot be updated in real time
  • Project
    • Develop on-line ML algorithms that update the model continuously as new data is streamed (a minimal single-machine sketch of such an update follows)
  • Notes
    • Also contact Joey Gonzalez if interested
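A minimal single-machine sketch of the kind of update such an algorithm performs: one stochastic-gradient step per incoming example, so the model is always current. The data points are made up; distributing these updates over streamed mini-batches (roughly what MLlib's streaming linear regression does for the linear case) is the interesting part of the project.

    # On-line (streaming) update for linear regression: one SGD step per example.

    def sgd_update(weights, x, y, lr=0.01):
        """Update the model with a single (features, label) pair."""
        prediction = sum(w * xi for w, xi in zip(weights, x))
        error = prediction - y
        return [w - lr * error * xi for w, xi in zip(weights, x)]

    weights = [0.0, 0.0]
    stream = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]  # stand-in for streamed data
    for x, y in stream:
        weights = sgd_update(weights, x, y)
    print(weights)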

  29. Beyond the JVM: Using Non-Java Libraries
  • Problem
    • Spark tasks are executed within JVMs
    • This limits performance and the use of popular non-Java libraries
  • Project
    • A general way to add support for non-Java libraries
    • Example: use JNI to call arbitrary libraries
  • Challenges
    • Define the interface, shared data formats, etc.
  • Notes
    • Contact Guanhua and Shivaram, if needed

  30. Beyond the JVM: Dynamic Code Generation
  • Problem
    • Spark tasks are executed within JVMs
    • This limits performance and the use of popular non-Java libraries
  • Project
    • Generate non-Java code, e.g., C++ or CUDA for GPUs
  • Challenges
    • API and shared data format
  • Notes
    • Contact Guanhua and Shivaram, if needed

  31. Beyond the JVM: Resource Management and Scheduling
  • Problem
    • Need to schedule the processes hosting non-Java code
    • A GPU cannot be invoked by more than one process at a time
  • Project
    • Develop scheduling and resource management algorithms
  • Challenge
    • Preserve fault tolerance and straggler mitigation
  • Notes
    • Contact Guanhua and Shivaram, if needed

  32. Time Series for DataFrames
  • Inspired by Pandas and R data frames, Spark recently introduced DataFrames
  • Problem
    • Spark DataFrames don’t support time series
  • Project
    • Develop and contribute distributed time-series operations for DataFrames (a possible starting point is sketched below)
  • Challenge
    • Spark doesn’t have indexes
  • http://pandas.pydata.org/pandas-docs/stable/timeseries.html
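One possible first primitive is window-based aggregation over an event-time column. The sketch below uses Spark SQL's window() function, which landed after this deck (Spark 2.0); the input path and the column names ts and value are hypothetical.

    # Bucket events into fixed one-hour windows and aggregate per window.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, avg

    spark = SparkSession.builder.appName("TimeSeriesSketch").getOrCreate()

    events = spark.read.parquet("hdfs://.../events")   # hypothetical path; needs columns `ts`, `value`

    hourly = (events
              .groupBy(window(events["ts"], "1 hour"))
              .agg(avg("value").alias("hourly_avg")))
    hourly.show()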

  33. ACID Transactions for Spark SQL
  • Problem
    • Spark SQL is used for analytics and doesn’t support ACID transactions
  • Project
    • Develop and add row-level ACID transactions on top of Spark SQL
  • Challenge
    • It is challenging to provide transactions and analytics in one system
  • https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

  34. Typed DataFrames
  • Problem
    • DataFrames in Spark, unlike Spark RDDs, do not provide type safety
  • Project
    • Develop a typed DataFrame framework for Spark
  • Challenge
    • SQL-like operations are inherently dynamic (e.g., filter(“col”)), which makes static typing hard unless fancy reflection mechanisms are used (the runtime-only failure this causes is illustrated below)
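This is the runtime-only failure the project wants to catch earlier, shown here in the deck's PySpark register (the slides have Scala's compile-time checking in mind): a misspelled column name is only detected when the query is analyzed. Table contents and column names are made up.

    # A column typo in a DataFrame expression is not caught until runtime.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TypedDFDemo").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    df.filter("id > 1").show()            # fine
    try:
        df.filter("idd > 1").show()       # typo: no compile-time error, fails here
    except Exception as e:
        print("caught only at runtime:", type(e).__name__)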

  35. General Pipelines for Spark
  • Problem
    • Spark.ml provides a pipeline abstraction for ML; generalize it to cover all of Spark
  • Project
    • Develop a pipeline abstraction (similar to ML pipelines) that spans all of Spark, allowing users to perform SQL operations, GraphX operations, etc.

  36. Beyond BSP
  • Problem
    • With BSP, every worker executes the same code
  • Project
    • Can we extend Spark (or another cluster computing framework) to support non-BSP computation?
    • How much better is this than emulating everything with BSP?
  • Challenge
    • Maintain simple APIs
    • More complex scheduling and communication patterns

  37. Project idea: cryptography & big data (Alessandro Chiesa). As data and computations scale up to larger sizes, can cryptography follow? One direction: zero-knowledge proofs for big data.

  38. Classical setting: zero-knowledge proofs on one machine. [Diagram: client and server. The server returns “Here is the result of your computation”; the client replies “I don’t believe you”; the data owner does not want to hand over its private data, so the client asks for a ZK proof of correctness. With crypto magic added, the server generates a ZK proof and sends the result & ZK proof, and the client verifies the ZK proof.]

  39. New setting for big data: zero-knowledge proofs on clusters
  • Problem: the ZK proof can no longer be generated on one machine (as before)
  • Challenge: generate the ZK proof over a cluster (e.g., using Spark); the cluster generates the proof and sends the result & ZK proof to the client, which verifies it
  • End goal: “scaling up” ZK proofs to computations on big data & exploring security applications!

  40. Succinct (quick overview)
  • Queries on compressed data
  • Basic operations:
    • Search: given a substring “s”, return the offsets of all occurrences of “s” within the input
    • Extract: given an offset “o” and a length “l”, uncompress and return “l” bytes of the original file starting at “o”
    • Count: given a substring “s”, return the number of occurrences of “s” within the input
  • A key-value store can be implemented on top of it
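To pin down the semantics of the three primitives, here is a plain, uncompressed reference implementation; Succinct itself answers the same queries directly on a compressed representation, which is the whole point. The sample log is made up.

    # Reference semantics of Succinct's query primitives, on uncompressed bytes.

    def search(data, s):
        """Offsets of all occurrences of substring s in the input."""
        offsets, i = [], data.find(s)
        while i != -1:
            offsets.append(i)
            i = data.find(s, i + 1)
        return offsets

    def extract(data, o, l):
        """l bytes of the original input starting at offset o."""
        return data[o:o + l]

    def count(data, s):
        """Number of occurrences of substring s in the input."""
        return len(search(data, s))

    log = b"ERROR mysql down\nINFO ok\nERROR php timeout\n"
    print(search(log, b"ERROR"), count(log, b"ERROR"), extract(log, 6, 5))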

  41. Succinct: Efficient Point Query Support
  • Problem
    • The current Spark implementation is expensive, as it always queries all workers
  • Project
    • Implement Succinct on top of Tachyon (the storage layer)
    • Provide efficient key-value store lookups, i.e., look up only the single worker that holds the key
  • Note
    • Contact Anurag and Rachit, if interested

  42. Succinct: External Memory Support
  • Problem
    • Some datasets grow faster than main memory
    • Need to execute queries on external storage (e.g., SSDs)
  • Project
    • Design & implement compressed data structures for efficient external-memory execution
    • There is a lot of work in the theory community that could be exploited
  • Note
    • Contact Anurag and Rachit, if interested

  43. Succinct: Updates
  • Problem
    • Current systems use a multi-store architecture
    • It is expensive to update the compressed representation
  • Project
    • Develop a low-overhead update solution with minimal impact on memory overhead and query performance
    • Start from the multi-store architecture (see the NSDI paper)
  • Note
    • Contact Anurag and Rachit, if interested

  44. Succinct: SQL
  • Problem
    • Arbitrary substring search is powerful, but relatively few workloads need it
  • Project
    • Support SQL on top of Succinct
    • Start from Spark SQL and the Succinct Spark package?
  • Note
    • Contact Anurag and Rachit, if interested

  45. Succinct: Genomics
  • Problem
    • Genomics pipelines are still expensive
  • Project
    • Genome processing on a single machine (using compressed data)
    • Enable queries on compressed genomes
  • Challenges
    • Domain-specific query optimizations
  • Note
    • Contact Anurag and Rachit, if interested
