This presentation by Chen He explores the challenges of reliability in large-scale clusters, particularly for MapReduce jobs, where worker-node deaths and disk failures are common. The CARDIO model is introduced as a solution, focusing on replication dynamics and regeneration mechanisms. Key components include analyzing performance costs, implementing dynamic replication, and modeling resource utilization. The presentation outlines the system architecture (CardioSense, CardioSolve, CardioAct) and evaluates the factors that affect performance in order to minimize cost, ultimately improving the reliability of data-intensive workflows.
CARDIO: Cost-Aware Replication for Data-Intensive workflOws Presented by Chen He
Motivation • Are large-scale clusters reliable? • On average, 5 worker deaths per MapReduce job • At least 1 disk failure in every run of a 6-hour MapReduce job on a 4000-node cluster
Motivation • How can node failures be kept from hurting performance? • Replication • Capacity constraint • Replication time, etc. • Regeneration through re-execution • Delays program progress • Cascaded re-execution
Motivation • The trade-off: COST vs. AVAILABILITY • All pictures are from the Internet
Outline • Problem Exploration • CARDIO Model • Hadoop CARDIO System • Evaluation • Discussion
Problem Exploration • Performance Costs • Replication cost (R) • Regeneration cost (G) • Reliability cost (Z) • Execution cost (A) • Total cost (T) • Disk cost (Y) • T = A + Z • Z = R + G
Problem Exploration • Experiment Environment • Hadoop 0.20.2 • 25 VMs • Workloads: Tagger->Join->Grep->RecordCounter
Problem Exploration Summary • Replication Factor for MR Stages
Problem Exploration Summary • Detailed Execution Time of 3 Cases
CARDIO Model • Block Failure Model • Output of stage i • Replication factor x • Total block number • Single block failure probability p • Failure probability in stage i (see the sketch below)
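The formulas on this slide did not survive the text export. A minimal reconstruction, assuming stage i writes n_i blocks, each stored with replication factor x_i, and each replica fails independently with probability p (the names n_i and P_i^fail are assumed here, not taken from the slides):

```latex
% A block is lost only if all x_i of its replicas fail:
%   Pr[block lost] = p^{x_i}
% Stage i must be regenerated if at least one of its n_i blocks is lost:
P^{\mathrm{fail}}_i \;=\; 1 - \bigl(1 - p^{x_i}\bigr)^{n_i}
```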
CARDIO Model • Cost Computation Model • Total time of stage i • Replication cost of stage i • Expected regeneration time of stage i • Reliability cost for all stages • Storage constraint C over all stages • Choose x to minimize Z (see the sketch below)
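The cost expressions are likewise missing from the export. One plausible reading, with δ as the time to replicate one data unit, D_i as the data output of stage i, and A_i as its execution time (all assumed symbols; cascaded re-execution is ignored for simplicity):

```latex
% Replication cost: time spent writing the extra (x_i - 1) copies of D_i
R_i = \delta\,(x_i - 1)\, D_i
% Expected regeneration cost: re-execution time weighted by the
% probability that stage i's output is lost
G_i = P^{\mathrm{fail}}_i \cdot A_i
% Reliability cost over all stages, and the storage constraint C
Z = \sum_i \bigl(R_i + G_i\bigr), \qquad \sum_i x_i D_i \;\le\; C
```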
CARDIO Model • Dynamic Replication • The replication number x may vary as the program progresses • When the job is at step k, the replication factor for that step is recomputed from the model
CARDIO Model • Model for Reliability • Minimize the reliability cost Z • Based on the block-failure and cost models above • Subject to the storage constraint C (see the sketch below)
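Putting the two previous sketches together, the reliability-only problem reads as the following constrained minimization (same assumed notation):

```latex
\min_{x_1,\dots,x_N \,\ge\, 1}\; Z \;=\; \sum_{i=1}^{N}\Bigl(R_i + P^{\mathrm{fail}}_i\, A_i\Bigr)
\qquad \text{subject to} \qquad \sum_{i=1}^{N} x_i D_i \;\le\; C
```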
CARDIO Model • Resource Utilization Model • Model cost = resources utilized • Resource types Q • CPU, network, disk, and storage resources, etc. • Utilization of resource q in stage i • Normalized usage • Relative cost weights
CARDIO Model • Resource Utilization Model • The cost for A • Total cost T • Optimization target • Choose x to minimize T (see the sketch below)
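The resource-utilization expressions are also missing from the export. A plausible sketch, with u_{i,q} for stage i's use of resource q, max-normalization, and w_q for the relative cost weight of resource q (the normalization and all symbol names are assumptions):

```latex
% Normalized usage of resource q in stage i
\hat{u}_{i,q} = \frac{u_{i,q}}{\max_j u_{j,q}}
% Execution cost A as weighted, normalized resource usage; the total
% cost adds the reliability cost Z, and the replication factors are
% chosen to minimize T
A = \sum_{i}\sum_{q \in Q} w_q\, \hat{u}_{i,q},
\qquad T = A + Z,
\qquad \min_{x_1,\dots,x_N} T
```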
CARDIO Model • Optimization Problem • Job optimality (JO): choose the replication factors for all stages of the job at once • Stage optimality (SO): choose the replication factors stage by stage as the job progresses
Hadoop CARDIO System • CardioSense • Obtain progress from the JobTracker (JT) periodically • Triggered by a pre-configured threshold value • Collect resource-usage statistics for running stages • Rely on HMon on each worker node • HMon is based on Atop and has low overhead
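A minimal sketch of the kind of periodic progress polling CardioSense performs against the JobTracker. Only the JobClient/RunningJob calls are standard Hadoop 0.20 API; the threshold value, the polling interval, and the hand-off to HMon/CardioSolve are illustrative assumptions, not the authors' code.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class ProgressPoller {
    private static final float THRESHOLD = 0.5f;   // assumed trigger point

    public static void pollUntilTriggered(String jobIdStr) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(jobIdStr));

        while (job != null && !job.isComplete()) {
            // overall progress as the mean of map and reduce progress
            float progress = (job.mapProgress() + job.reduceProgress()) / 2;
            if (progress >= THRESHOLD) {
                // here CardioSense would collect HMon resource statistics
                // and pass them to CardioSolve (omitted in this sketch)
                break;
            }
            Thread.sleep(10000);    // poll every 10 seconds
        }
    }
}
```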
Hadoop CARDIO System • CardioSolve • Receive data from CardioSense • Solve SO problem • Decide the replication factors for current and previous stages
Hadoop CARDIO System • CardioAct • Implement the command from CardioSolve • Use HDFS API setReplication(file, replicaNumber)
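The setReplication call named on the slide is the standard HDFS FileSystem API. A minimal sketch of how CardioAct might apply a replication factor chosen by CardioSolve (the path, the factor, and the directory walk are illustrative assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSetter {
    /** Apply the chosen replication factor to every file under a stage's output directory. */
    public static void applyFactor(FileSystem fs, String dir, short factor) throws Exception {
        for (FileStatus status : fs.listStatus(new Path(dir))) {
            if (status.isDir()) {
                continue;                                // only plain files carry block replicas
            }
            fs.setReplication(status.getPath(), factor); // the HDFS API named on the slide
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // e.g. drop one stage's intermediate output to 2 replicas
        applyFactor(fs, "/cardio/stage2/output", (short) 2);
    }
}
```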
Evaluation • Several Important Parameters • p is the failure rate, 0.2 if not specified • The time to replicate a data unit is 0.2 as well • The computation resource of stage i follows a uniform distribution U(1, Cmax), with Cmax = 100 in general • The output of stage i is drawn from a uniform distribution U(1, Dmax), where Dmax varies within [1, Cmax] • C is the storage constraint for the whole process; default value is
Evaluation • Effect of Dmax
Evaluation • Effect of Failure rate p
Evaluation • Effect of block size
Evaluation • Effect of different resource constraints • "++" means over-utilized, and that type of resource is regarded as expensive • P = 0.08, C = 204 GB, delta = 0.6 • S3 is CPU-intensive • DSK shows a performance pattern similar to NET • CPU 0010, NET 0011, DSKIO 0011, STG 0011
Evaluation • S2 re-executes more frequently under failure injection because it has a large data output • P = 0.02, 0.08, and 0.1 • 1, 3, 21 • API reason
Discussion • Problems • Typos and misleading symbols • HDFS API setReplication() • Any other ideas?