
Enhancing Spark Debugging with Arthur Interactive Replay

Debugging large parallel jobs in Spark presents significant challenges. Current approaches, repeatedly modifying and re-running the program or running isolated code in the Spark shell, often fall short. The Arthur Interactive Replay Debugger addresses these difficulties by allowing users to reconstruct and query intermediate datasets, visualize data flow, and rerun tasks within a single-process debugger. With features like tracing records across transformations and aggregating exceptions at the master, Arthur makes debugging Spark programs both more efficient and more accurate.




Presentation Transcript


  1. Arthur: The Spark Debugger. Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica. UC Berkeley

  2. Motivation Debugging large parallel jobs is hard Current approaches to debugging: • Repeatedly modify and rerun the program • Run isolated code in Spark shell

  3. Introducing Arthur Interactive replay debugger for Spark programs • Reconstruct and query intermediate datasets • Visualize the program’s data flow • Rerun any task in a single-process debugger • Trace records across transformations • Aggregate exceptions at the master

  4. Spark Programming Model Example: find how many Wikipedia articles match a search term. Data flow: HDFS file → map(_.split('\t')(3)) → articles → filter(_.contains("Berkeley")) → matches → count() → 10,000. The intermediate datasets (articles, matches) are Resilient Distributed Datasets (RDDs), produced by deterministic transformations.
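The slide's pipeline can be sketched in a few lines of Scala. This is a hedged illustration, not code from the talk: it simulates the RDD operations on a plain Scala List instead of a real Spark RDD, and the sample TSV lines are invented.

```scala
// Sketch of the slide's pipeline, simulated on a plain Scala List
// instead of a real RDD; the sample lines below are invented.
val lines = List(                       // stands in for the HDFS file
  "1\t2012\ten\tBerkeley is a city in California",
  "2\t2012\ten\tSpark is a cluster computing framework",
  "3\t2012\ten\tUC Berkeley hosts the AMPLab"
)
val articles = lines.map(_.split('\t')(3))             // map(_.split('\t')(3))
val matches  = articles.filter(_.contains("Berkeley")) // filter(_.contains("Berkeley"))
println(matches.size)                                  // count(); prints 2 here
```

With a real SparkContext the same chain would start from `sc.textFile(...)` and end with `count()`, but the transformation logic is identical.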

  5. Approach (original run): the Master sends tasks to Workers and receives results; lineage, checksums, and events are recorded to a log.

  6. Approach (replay): the Master reads lineage from the log and, driven by user input, reruns tasks on Workers, which return results and checksums.

  7. Detecting Nondeterministic Transformations Re-running a nondeterministic transformation may yield different results Arthur checksums RDD contents and alerts the user if necessary
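The checksum idea can be sketched as follows. This is not Arthur's actual implementation, just a minimal illustration of the technique: hash each record and combine the hashes order-insensitively, so a replayed partition with the same records in a different order still matches, while changed contents do not.

```scala
// Hedged sketch of checksum-based nondeterminism detection (not Arthur's
// real code): combine per-record hashes with a sum so the checksum is
// insensitive to record order within a partition.
def checksum(records: Seq[String]): Long =
  records.map(_.hashCode.toLong).sum

// On replay, compare against the checksum logged during the original run
// and alert the user on mismatch.
def matchesOriginal(logged: Long, replayed: Seq[String]): Boolean =
  checksum(replayed) == logged
```

A hash-sum trades collision resistance for order-insensitivity; a production implementation would likely use a stronger combining function.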

  8. Demo Example dataset: 1 GB partial Wikipedia dump • Reconstruct and query intermediate datasets • Visualize the program’s data flow • Rerun any task in a single-process debugger

  9. Record Tracing Example: query a database of users and groups. Data flow: HDFS file A → map(_.split('\t')) → users; HDFS file B → map(_.split('\t')) → groups; users join() groups → groupCounts.
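The users/groups query traced on this slide can be sketched in plain Scala. Again this is a hedged simulation on ordinary collections rather than RDDs, and the sample records and field layout (id, name) are invented for illustration.

```scala
// Simulation of the slide's join query on plain Scala collections;
// the records and the (id, name) field layout are invented.
val usersLines  = List("1\talice", "2\tbob")      // stands in for HDFS file A
val groupsLines = List("1\tadmins", "1\tdevs", "2\tdevs") // HDFS file B

// map(_.split('\t')), then key each record by its first field
val users  = usersLines.map(_.split('\t')).map(a => a(0) -> a(1))
val groups = groupsLines.map(_.split('\t')).map(a => a(0) -> a(1))

// join() on the user id, then count groups per user (groupCounts)
val joined = for ((uid, name) <- users; (gid, group) <- groups if uid == gid)
  yield (name, group)
val groupCounts = joined.groupBy(_._1).map { case (name, gs) => name -> gs.size }
// groupCounts == Map("alice" -> 2, "bob" -> 1)
```

Record tracing then answers questions like "which input lines produced this groupCounts entry?" by following records backward through the map and join steps.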

  10. Performance Event logging introduces minimal overhead

  11. Future Plans • More analyses like backward tracing and culprit detection • Profiling tools for GC and memory • Real bugs

  12. Arthur is in development at https://github.com/mesos/spark, branch arthur. Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger. Ankur Dave, ankurd@eecs.berkeley.edu, http://ankurdave.com
