
Fault Tolerant Parallel Data-Intensive Algorithms

Mucahid Kutlu, Gagan Agrawal, Oguz Kurt (Ohio State University)

Introduction and Motivation
◆ The Mean-Time-To-Failure (MTTF) of systems is decreasing as the number of cores grows.
◆ For future exascale systems, it is argued that checkpointing and recovery time (with current methods) will even exceed the MTTF.
◆ Algorithm-based fault tolerance can be an alternative method.

Our Goal
◆ We focus only on fail-stop failures.
◆ We do not use any backup node; after a failure, the computation continues with the remaining nodes.
◆ We have two main goals for faster recovery:
  ◆ minimize data loss, since lost data must be re-read from the storage cluster;
  ◆ minimize re-processing of the lost data.

Our Approach
- Intelligent Replication
◆ Minimum data intersection: the data is divided into blocks that are distributed across different processors so that any two processors hold as few blocks in common as possible (one possible placement rule is sketched in the first code sketch below).
◆ Passive replicas: replica copies are stored but not processed unless the processor holding the primary copy fails.
- Summarization
◆ After processing a block, a worker generates a summary for that block and sends it to the master node (an illustrative summary format is given in the second sketch below).
◆ Blocks whose summaries reached the master before the failure do not need to be re-processed.

[Figure: distribution of data blocks 1-14 across processors P1-P7, showing primary copies and replicas.]

Recovery Scenario
◆ P1 and P2 fail at the beginning of the iteration.
◆ The master node notifies P3 to process D2 and D3, and P4 to process D4.
◆ Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.
◆ The master's bookkeeping for this scenario is sketched in the third code sketch below.

[Figure: recovery example with the master, the file system, and processors P1-P4 holding primary blocks D1-D8 and their replicas.]

Experimental Setup
◆ Implemented the k-means and apriori algorithms in the C programming language using the MPI library.
◆ Used 2.5 GHz Opteron processors and 24 GB of memory.
◆ The number of processors is 8.
◆ In the experiments with Hadoop: replication factor (R): 3; summarization frequency (S): 4.

Experimental Results
[Chart: impact of summary exchange frequency in apriori, varying the number of failures.]
[Chart: total execution time as the number of failures changes.]
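Sketch 1: block placement. The poster names the placement goal (minimum data intersection) but not the exact mapping of blocks to processors, so the rule below is an illustrative assumption consistent with that goal, using parameters taken from the figure (14 blocks, 7 processors, one passive replica per block): primaries go round-robin, and the replica shift changes with each distribution round.

#include <stdio.h>

#define NUM_BLOCKS 14   /* assumed: the 14 blocks in the poster figure  */
#define NUM_PROCS   7   /* assumed: processors P1-P7 in the figure      */

/* Primary copies are placed round-robin over the processors. */
int primary_owner(int block)
{
    return block % NUM_PROCS;
}

/* The replica shift grows with the distribution round, so a
 * processor's primaries end up replicated on different neighbors and
 * any two processors have at most one block in common; if one
 * processor fails, every one of its blocks survives somewhere and the
 * recovery work is not dumped on a single survivor. */
int replica_owner(int block)
{
    int shift = 1 + block / NUM_PROCS;   /* 1 in round 0, 2 in round 1 */
    return (block + shift) % NUM_PROCS;
}

int main(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++)
        printf("block %2d: primary P%d, replica P%d\n",
               b + 1, primary_owner(b) + 1, replica_owner(b) + 1);
    return 0;
}

With this rule, losing any single processor costs at most one shared copy per surviving processor, which is one way to read the poster's "minimum data intersection" requirement.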
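Sketch 2: block summaries. The poster says each worker sends a per-block summary to the master but does not give the format. For the k-means implementation it describes, a natural summary is the per-cluster coordinate sums and point counts of the block, since these let the master rebuild the block's contribution to the next centroids without re-reading the data. Everything below (the struct layout, K, DIM, the message tag, and the toy data in main) is an illustrative assumption, not the authors' code; run with at least 2 MPI ranks.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define K    2          /* number of clusters (assumed, for the demo) */
#define DIM  2          /* point dimensionality (assumed)             */
#define TAG_SUMMARY 100 /* assumed message tag                        */

typedef struct {
    double sums[K][DIM];   /* per-cluster coordinate sums  */
    long   counts[K];      /* per-cluster point counts     */
    int    block_id;       /* which data block this covers */
} BlockSummary;

/* Worker side: fold one block into a summary and ship it to the
 * master. Sending raw bytes keeps the sketch short; a real code would
 * register an MPI derived datatype for portability. */
static void summarize_and_send(const double *points, int n,
                               double centroids[K][DIM],
                               int block_id, int master)
{
    BlockSummary s;
    memset(&s, 0, sizeof s);
    s.block_id = block_id;

    for (int i = 0; i < n; i++) {
        const double *p = &points[i * DIM];
        int best = 0;
        double best_d = 1e300;
        for (int k = 0; k < K; k++) {            /* nearest centroid */
            double d = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = p[j] - centroids[k][j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = k; }
        }
        for (int j = 0; j < DIM; j++)
            s.sums[best][j] += p[j];
        s.counts[best]++;
    }
    MPI_Send(&s, (int)sizeof s, MPI_BYTE, master, TAG_SUMMARY,
             MPI_COMM_WORLD);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double centroids[K][DIM] = { {0.0, 0.0}, {10.0, 10.0} };

    if (rank == 1) {                  /* one worker, one toy block */
        double block[] = { 1.0, 1.0, 9.0, 11.0, 0.5, 0.0 };
        summarize_and_send(block, 3, centroids, 7, 0);
    } else if (rank == 0) {           /* master folds the summary in */
        BlockSummary s;
        MPI_Recv(&s, (int)sizeof s, MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_SUMMARY, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("block %d: cluster 0 has %ld points, cluster 1 has %ld\n",
               s.block_id, s.counts[0], s.counts[1]);
    }
    MPI_Finalize();
    return 0;
}

Once such a summary is at the master, the block's work survives the worker's failure, which is exactly why summarized blocks need no re-processing.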
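Sketch 3: recovery bookkeeping. This reproduces the master's decisions in the recovery scenario, with the block layout read off the poster figure (primaries P1:{D1,D2}, P2:{D3,D4}, P3:{D5,D6}, P4:{D7,D8}; replicas P1:{D6,D7}, P2:{D1,D8}, P3:{D2,D3}, P4:{D4,D5}). Failure detection and the actual notification messages are omitted: plain MPI has no portable fail-stop recovery and the poster does not describe that machinery, so the sketch only decides what each block assignment should be.

#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 8   /* D1..D8, as in the scenario */
#define NUM_PROCS  4   /* P1..P4                     */
#define COPIES     2   /* one primary + one replica  */

int  owner[NUM_BLOCKS][COPIES];  /* processors holding each block      */
bool alive[NUM_PROCS];           /* fail-stop status of each processor */
bool summarized[NUM_BLOCKS];     /* summary already at the master?     */

/* Decide, for every block, who should process it after a failure.
 * Summarized blocks are skipped; a block with a surviving copy goes to
 * its first live holder (the primary, when it is alive); a block with
 * no surviving copy must be re-read from storage. */
void recover(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (summarized[b])
            continue;                      /* no re-processing needed */
        int target = -1;
        for (int c = 0; c < COPIES; c++)
            if (alive[owner[b][c]]) { target = owner[b][c]; break; }
        if (target >= 0)
            printf("notify P%d: process D%d\n", target + 1, b + 1);
        else
            printf("all copies of D%d lost: re-read from the storage "
                   "cluster and assign to a surviving node\n", b + 1);
    }
}

int main(void)
{
    /* Layout from the poster figure: {primary, replica} per block. */
    int layout[NUM_BLOCKS][COPIES] = {
        {0, 1}, {0, 2}, {1, 2}, {1, 3},   /* D1..D4 */
        {2, 3}, {2, 0}, {3, 0}, {3, 1}    /* D5..D8 */
    };
    for (int b = 0; b < NUM_BLOCKS; b++)
        for (int c = 0; c < COPIES; c++)
            owner[b][c] = layout[b][c];

    for (int p = 0; p < NUM_PROCS; p++)
        alive[p] = true;
    alive[0] = alive[1] = false;          /* P1 and P2 fail */

    recover();
    return 0;
}

Run as-is, this yields the scenario's assignments: D2 and D3 go to P3, D4 goes to P4, and D1 must be re-read because both of its copies sat on the failed processors.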
