Debugging large parallel jobs in Spark presents significant challenges. Current approaches involve re-running code or operating in isolated environments, but they often fall short. The Arthur Interactive Replay Debugger addresses these difficulties by allowing users to reconstruct and query intermediate datasets, visualize data flow, and rerun tasks within a single-process debugger. With features like trace records across transformations and aggregation of exceptions at the master, Arthur facilitates the debugging of Spark programs, enhancing both efficiency and accuracy.
Arthur: The Spark Debugger
Ankur Dave, Matei Zaharia, Murphy McCauley, Scott Shenker, Ion Stoica
UC Berkeley

Motivation
Debugging large parallel jobs is hard.
Current approaches to debugging:
• Repeatedly modify and rerun the program
• Run isolated code in the Spark shell

Introducing Arthur
Interactive replay debugger for Spark programs
• Reconstruct and query intermediate datasets
• Visualize the program's data flow
• Rerun any task in a single-process debugger
• Trace records across transformations
• Aggregate exceptions at the master

Spark Programming Model
Example: Find how many Wikipedia articles match a search term
HDFS file -> map(_.split('\t')(3)) -> articles -> filter(_.contains("Berkeley")) -> matches -> count() -> 10,000
Diagram labels: Resilient Distributed Datasets (RDDs), such as articles and matches; deterministic transformations, such as map and filter

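The pipeline on this slide can be sketched without a cluster. The block below is a minimal sketch that uses plain Scala collections in place of RDDs, so it is not the Spark API; the object name `ArticleSearch`, the sample rows, and the assumption that the article text sits in tab-separated field 3 (taken from the slide's `split('\t')(3)`) are illustrative only.

```scala
// Plain Scala collections stand in for RDDs here; this is not the Spark API.
object ArticleSearch {
  // Hypothetical sample rows: tab-separated, with the article text in field 3.
  val lines: Seq[String] = Seq(
    "1\tBerkeley\t2011\tUC Berkeley is a public university",
    "2\tStanford\t2011\tStanford is a private university",
    "3\tOakland\t2011\tOakland is south of Berkeley"
  )

  // Same shape as the slide: map to field 3, filter on the term, count.
  def countMatches(input: Seq[String], term: String): Int = {
    val articles = input.map(line => line.split('\t')(3))
    val matches = articles.filter(text => text.contains(term))
    matches.size
  }

  def main(args: Array[String]): Unit =
    println(countMatches(lines, "Berkeley"))
}
```

Each intermediate value (`articles`, `matches`) corresponds to one of the named datasets in the diagram, which is what makes it possible to inspect them individually later.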
Approach
[Diagram: Master, Workers, and Log, with arrows labeled "tasks"; "results, checksums, events"; and "lineage, checksums, events"]

Approach
[Diagram: Master, Workers, Log, and user input, with arrows labeled "lineage"; "tasks"; and "results, checksums"]

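One way to picture how logged lineage could support reconstructing an intermediate dataset is the sketch below. It is an illustration under stated assumptions, not Arthur's implementation: the `Step` type, the `reconstruct` helper, and the in-memory `lineage` list are all hypothetical, and plain Scala collections stand in for RDDs and for the log.

```scala
// Hypothetical model of lineage replay; not Arthur's actual data structures.
object LineageReplay {
  // A logged lineage step: the dataset it produces and the deterministic
  // function that produces it from the previous dataset.
  final case class Step(name: String, f: Seq[String] => Seq[String])

  // Assumed lineage for the Wikipedia example from the earlier slide.
  val lineage: Seq[Step] = Seq(
    Step("articles", rows => rows.map(line => line.split('\t')(3))),
    Step("matches", rows => rows.filter(text => text.contains("Berkeley")))
  )

  // Rebuild the dataset called `target` by replaying steps from the source.
  // Returns None when no step with that name was logged.
  def reconstruct(source: Seq[String], steps: Seq[Step], target: String): Option[Seq[String]] = {
    val idx = steps.indexWhere(_.name == target)
    if (idx < 0) None
    else Some(steps.take(idx + 1).foldLeft(source)((data, step) => step.f(data)))
  }
}
```

Replay of this kind only reproduces the original run when every step is deterministic, which is why the checksums in the diagram matter.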
Detecting Nondeterministic Transformations
Re-running a nondeterministic transformation may yield different results.
Arthur checksums RDD contents and alerts the user when a re-run produces a different checksum.

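The checksum comparison can be sketched as follows. This is a minimal sketch, assuming CRC32 over the records in order; the slides do not say which checksum Arthur uses, and the names `checksum` and `looksNondeterministic` are hypothetical.

```scala
import java.util.zip.CRC32

// Hypothetical checksum check; CRC32 is an assumption, not Arthur's choice.
object ChecksumCheck {
  // Checksum the records in order, with a zero byte after each record so
  // that Seq("ab", "c") and Seq("a", "bc") do not hash identically.
  def checksum(records: Seq[String]): Long = {
    val crc = new CRC32()
    records.foreach { r =>
      crc.update(r.getBytes("UTF-8"))
      crc.update(0)
    }
    crc.getValue
  }

  // Apply the transformation twice to the same input and report whether
  // the two outputs have different checksums.
  def looksNondeterministic(input: Seq[String])(f: Seq[String] => Seq[String]): Boolean =
    checksum(f(input)) != checksum(f(input))
}
```

Because the checksum here depends on record order, a transformation that returns the same records in a different order would also be flagged; whether that is desirable depends on the dataset's semantics.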
Demo
Example dataset: 1 GB partial Wikipedia dump
• Reconstruct and query intermediate datasets
• Visualize the program's data flow
• Rerun any task in a single-process debugger

Record Tracing
Example: query a database of users and groups
HDFS file A -> map(_.split('\t')) -> users
HDFS file B -> map(_.split('\t')) -> groups
users, groups -> join() -> groupCounts

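Tracing a record across a join can be illustrated by carrying provenance alongside each value. The sketch below is one possible approach, not necessarily how Arthur traces records: the `Traced` wrapper, the `file:line` source tags, and the row layouts (users as `name<TAB>groupId`, groups as `groupId<TAB>groupName`) are all assumptions made for the example, and plain Scala collections stand in for RDDs.

```scala
// Hypothetical provenance tracking; not Arthur's actual mechanism.
object RecordTracing {
  // A value paired with the tags of the source lines it was derived from.
  final case class Traced[A](value: A, sources: Set[String])

  // Split each line on tabs and tag it with "<file>:<line index>".
  def load(file: String, lines: Seq[String]): Seq[Traced[Vector[String]]] =
    lines.zipWithIndex.map { case (line, i) =>
      Traced(line.split('\t').toVector, Set(s"$file:$i"))
    }

  // Join users (name, groupId) with groups (groupId, groupName) on groupId,
  // merging the source tags of both inputs into each output record.
  def join(
      users: Seq[Traced[Vector[String]]],
      groups: Seq[Traced[Vector[String]]]
  ): Seq[Traced[(String, String)]] =
    for {
      u <- users
      g <- groups
      if u.value(1) == g.value(0)
    } yield Traced((u.value(0), g.value(1)), u.sources ++ g.sources)
}
```

Given a suspicious output record, its `sources` set points back to the exact input lines in file A and file B that produced it.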
Performance
Event logging introduces minimal overhead

Future Plans
• More analyses like backward tracing and culprit detection
• Profiling tools for GC and memory
• Real bugs

Arthur is in development at https://github.com/mesos/spark, branch arthur
Documentation: https://github.com/mesos/spark/wiki/Spark-Debugger
Ankur Dave
ankurd@eecs.berkeley.edu
http://ankurdave.com