
Fault Tolerant Parallel Data-Intensive Algorithms

Mucahid Kutlu, Gagan Agrawal, Oguz Kurt (Ohio State University)

Introduction and Motivation
◆ The Mean-Time-To-Failure (MTTF) of systems is decreasing as the number of cores grows.
◆ For future exascale systems, it is argued that checkpointing and recovery time (with current methods) will even exceed the MTTF.
◆ Algorithm-based fault tolerance can be an alternative method.

Our Goal
◆ We focus only on fail-stop failures.
◆ We do not use any backup node; after a failure, the computation continues with the remaining nodes.
◆ We have two main goals for faster recovery:
  ◆ minimize data loss, since lost data must be re-read from the storage cluster;
  ◆ minimize re-processing of the lost data.

Our Approach
- Intelligent Replication
◆ Minimum data intersection: the data is divided into blocks that are distributed across different processors so that any two processors hold as few blocks in common as possible (one possible placement rule is sketched in the first code sketch below).
◆ Passive replicas: replica copies are stored but not processed unless the processor holding the primary copy fails.
- Summarization
◆ After processing a block, a worker generates a summary for that block and sends it to the master node (an illustrative summary format is given in the second sketch below).
◆ Blocks whose summaries reached the master before the failure do not need to be re-processed.

[Figure: distribution of data blocks 1-14 across processors P1-P7, showing primary copies and replicas.]

Recovery Scenario
◆ P1 and P2 fail at the beginning of the iteration.
◆ The master node notifies P3 to process D2 and D3, and P4 to process D4.
◆ Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.
◆ The master's bookkeeping for this scenario is sketched in the third code sketch below.

[Figure: recovery example with the master, the file system, and processors P1-P4 holding primary blocks D1-D8 and their replicas.]

Experimental Setup
◆ Implemented the k-means and apriori algorithms in the C programming language using the MPI library.
◆ Used 2.5 GHz Opteron processors and 24 GB of memory.
◆ The number of processors is 8.
◆ In the experiments with Hadoop: replication factor (R): 3; summarization frequency (S): 4.

Experimental Results
[Chart: impact of summary exchange frequency in apriori, varying the number of failures.]
[Chart: total execution time as the number of failures changes.]
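Sketch 1: block placement. The poster names the placement goal (minimum data intersection) but not the exact mapping of blocks to processors, so the rule below is an illustrative assumption consistent with that goal, using parameters taken from the figure (14 blocks, 7 processors, one passive replica per block): primaries go round-robin, and the replica shift changes with each distribution round.

#include <stdio.h>

#define NUM_BLOCKS 14   /* assumed: the 14 blocks in the poster figure  */
#define NUM_PROCS   7   /* assumed: processors P1-P7 in the figure      */

/* Primary copies are placed round-robin over the processors. */
int primary_owner(int block)
{
    return block % NUM_PROCS;
}

/* The replica shift grows with the distribution round, so a
 * processor's primaries end up replicated on different neighbors and
 * any two processors have at most one block in common; if one
 * processor fails, every one of its blocks survives somewhere and the
 * recovery work is not dumped on a single survivor. */
int replica_owner(int block)
{
    int shift = 1 + block / NUM_PROCS;   /* 1 in round 0, 2 in round 1 */
    return (block + shift) % NUM_PROCS;
}

int main(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++)
        printf("block %2d: primary P%d, replica P%d\n",
               b + 1, primary_owner(b) + 1, replica_owner(b) + 1);
    return 0;
}

With this rule, losing any single processor costs at most one shared copy per surviving processor, which is one way to read the poster's "minimum data intersection" requirement.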
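Sketch 2: block summaries. The poster says each worker sends a per-block summary to the master but does not give the format. For the k-means implementation it describes, a natural summary is the per-cluster coordinate sums and point counts of the block, since these let the master rebuild the block's contribution to the next centroids without re-reading the data. Everything below (the struct layout, K, DIM, the message tag, and the toy data in main) is an illustrative assumption, not the authors' code; run with at least 2 MPI ranks.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define K    2          /* number of clusters (assumed, for the demo) */
#define DIM  2          /* point dimensionality (assumed)             */
#define TAG_SUMMARY 100 /* assumed message tag                        */

typedef struct {
    double sums[K][DIM];   /* per-cluster coordinate sums  */
    long   counts[K];      /* per-cluster point counts     */
    int    block_id;       /* which data block this covers */
} BlockSummary;

/* Worker side: fold one block into a summary and ship it to the
 * master. Sending raw bytes keeps the sketch short; a real code would
 * register an MPI derived datatype for portability. */
static void summarize_and_send(const double *points, int n,
                               double centroids[K][DIM],
                               int block_id, int master)
{
    BlockSummary s;
    memset(&s, 0, sizeof s);
    s.block_id = block_id;

    for (int i = 0; i < n; i++) {
        const double *p = &points[i * DIM];
        int best = 0;
        double best_d = 1e300;
        for (int k = 0; k < K; k++) {            /* nearest centroid */
            double d = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = p[j] - centroids[k][j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = k; }
        }
        for (int j = 0; j < DIM; j++)
            s.sums[best][j] += p[j];
        s.counts[best]++;
    }
    MPI_Send(&s, (int)sizeof s, MPI_BYTE, master, TAG_SUMMARY,
             MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double centroids[K][DIM] = { {0.0, 0.0}, {10.0, 10.0} };

    if (rank == 1) {                  /* one worker, one toy block */
        double block[] = { 1.0, 1.0, 9.0, 11.0, 0.5, 0.0 };
        summarize_and_send(block, 3, centroids, 7, 0);
    } else if (rank == 0) {           /* master folds the summary in */
        BlockSummary s;
        MPI_Recv(&s, (int)sizeof s, MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_SUMMARY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("block %d: cluster 0 has %ld points, cluster 1 has %ld\n",
               s.block_id, s.counts[0], s.counts[1]);
    }
    MPI_Finalize();
    return 0;
}

Once such a summary is at the master, the block's work survives the worker's failure, which is exactly why summarized blocks need no re-processing.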
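Sketch 3: recovery bookkeeping. This reproduces the master's decisions in the recovery scenario, with the block layout read off the poster figure (primaries P1:{D1,D2}, P2:{D3,D4}, P3:{D5,D6}, P4:{D7,D8}; replicas P1:{D6,D7}, P2:{D1,D8}, P3:{D2,D3}, P4:{D4,D5}). Failure detection and the actual notification messages are omitted: plain MPI has no portable fail-stop recovery and the poster does not describe that machinery, so the sketch only decides what each block assignment should be.

#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 8   /* D1..D8, as in the scenario */
#define NUM_PROCS  4   /* P1..P4                     */
#define COPIES     2   /* one primary + one replica  */

int  owner[NUM_BLOCKS][COPIES];  /* processors holding each block      */
bool alive[NUM_PROCS];           /* fail-stop status of each processor */
bool summarized[NUM_BLOCKS];     /* summary already at the master?     */

/* Decide, for every block, who should process it after a failure.
 * Summarized blocks are skipped; a block with a surviving copy goes to
 * its first live holder (the primary, when it is alive); a block with
 * no surviving copy must be re-read from storage. */
void recover(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (summarized[b])
            continue;                      /* no re-processing needed */
        int target = -1;
        for (int c = 0; c < COPIES; c++)
            if (alive[owner[b][c]]) { target = owner[b][c]; break; }
        if (target >= 0)
            printf("notify P%d: process D%d\n", target + 1, b + 1);
        else
            printf("all copies of D%d lost: re-read from the storage "
                   "cluster and assign to a surviving node\n", b + 1);
    }
}

int main(void)
{
    /* Layout from the poster figure: {primary, replica} per block. */
    int layout[NUM_BLOCKS][COPIES] = {
        {0, 1}, {0, 2}, {1, 2}, {1, 3},   /* D1..D4 */
        {2, 3}, {2, 0}, {3, 0}, {3, 1}    /* D5..D8 */
    };
    for (int b = 0; b < NUM_BLOCKS; b++)
        for (int c = 0; c < COPIES; c++)
            owner[b][c] = layout[b][c];

    for (int p = 0; p < NUM_PROCS; p++)
        alive[p] = true;
    alive[0] = alive[1] = false;          /* P1 and P2 fail */

    recover();
    return 0;
}

Run as-is, this yields the scenario's assignments: D2 and D3 go to P3, D4 goes to P4, and D1 must be re-read because both of its copies sat on the failed processors.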
