This study explores fault tolerance mechanisms in multi-processor systems-on-chip (MPSoCs), focusing on dynamic run-time recovery. As transistor feature sizes shrink below 90 nm and billions of devices are integrated on a chip, delivering both high performance and reliability becomes challenging. We propose a dual model that combines hardware and software strategies for fault recovery. The hardware model dynamically allocates checker processors to detect and correct transient faults, while the software model uses an SPMD (Single Program Multiple Data) approach to recover from permanent faults. Our goal is better resource utilization and graceful degradation at a negligible performance cost.
Dynamic Run-time Fault Tolerance in Multi-Processor System on Chips Prem Kumar Ramesh Department of Electrical and Computer Engineering
Deep Sub Micron Era • Shrinking Transistors • Feature Size < 90 nm • Billion Device Processors • High Performance ICs • Multi-Processor System on Chip
Multi-Processor System on Chip • Tens of processors on a single chip • Much harder than a single-processor system • Processor Configuration • Communication and Synchronization • Poses a challenge to reliability!
Background – Fault Model • Duration • Transient • Permanent • Location • Processing Element • Network on Chip • Time to Failure • Before-Shelf • After-Shelf • Graceful Degradation
Previous Works • Static Redundancy Approach • N copies of the same program run on different PEs • Majority Voting • Not very efficient! • Run-time Recovery Approach • A checker processor is assigned to each PE • The checker ‘commits’ only when its result matches the PE's • If not, the task gets re-assigned to some other PE
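The static redundancy approach above can be sketched as a majority voter over the results of N redundant copies. This is a minimal illustration, not code from the cited work: the function name `majority_vote` and the use of plain `int` results are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Static redundancy (illustrative): each entry of `results` is the output of
// one copy of the same program running on a different PE.
// Returns the value produced by a strict majority of the copies, or
// std::nullopt when no value reaches a majority.
std::optional<int> majority_vote(const std::vector<int>& results) {
    assert(!results.empty() && "at least one redundant copy is required");
    std::map<int, std::size_t> counts;
    for (int r : results) {
        ++counts[r];
    }
    for (const auto& [value, count] : counts) {
        if (count * 2 > results.size()) {
            return value;  // more than half of the copies agree
        }
    }
    return std::nullopt;  // no majority: the fault cannot be masked
}
```

Every task occupies N PEs for its whole duration even when no fault occurs, which is the inefficiency the slide points out.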
Proposed Work • Extends the run-time recovery approach • Dynamic • Resource Utilization • Graceful Degradation • Combines two models • Hardware model • Software model
Hardware Model • Dynamically allocate checkers to PEs • Commits only when both PEs agree • Detects and Corrects Transient Faults • If one PE of a pair fails, the other can be re-allocated to some other PE, allowing graceful degradation
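One way to picture the hardware model is the software sketch below. It is only an illustration under stated assumptions: `ProcessingElement`, `execute_checked`, the "first two live PEs" pairing policy, and the retry limit are hypothetical and do not come from the slides, which describe a hardware mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical PE model: `run` executes a task on an input value.
struct ProcessingElement {
    int id;
    bool alive = true;             // false once the PE has failed permanently
    std::function<int(int)> run;
};

// Pair the first two live PEs as primary and checker (assumed policy), run
// the task on both, and commit only when the two results agree. A mismatch
// is treated as a transient fault and the task is re-executed, up to
// `max_attempts` times. Returns std::nullopt if fewer than two PEs are alive
// or the results never agree.
std::optional<int> execute_checked(std::vector<ProcessingElement>& pes,
                                   int input, int max_attempts) {
    assert(max_attempts > 0);
    std::vector<std::size_t> live;
    for (std::size_t i = 0; i < pes.size() && live.size() < 2; ++i) {
        if (pes[i].alive) live.push_back(i);
    }
    if (live.size() < 2) return std::nullopt;  // no checker can be allocated

    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const int primary = pes[live[0]].run(input);
        const int checker = pes[live[1]].run(input);
        if (primary == checker) return primary;  // both agree: commit
    }
    return std::nullopt;
}
```

Because the pairing is recomputed on every call, marking one PE as not alive leaves its former partner free to be paired with another live PE, which is the graceful degradation described above.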
Software Model • Addresses Permanent Faults • SPMD (Single Program Multiple Data) suits the situation • MPI-based approach • Splitter-Parallel Tasks-Joiner • In case of a permanent fault, only the data associated with that task needs to be migrated, as all PEs work on the same program
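The splitter, parallel tasks, and joiner structure can be sketched as follows. To stay self-contained, the sketch simulates the PEs sequentially instead of using MPI; the names `kernel`, `split`, `run_spmd`, and `SpmdResult`, the sum-of-squares workload, and the "next healthy PE" migration policy are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// The single program that every PE runs (SPMD); here, a sum of squares.
long kernel(const std::vector<int>& chunk) {
    long acc = 0;
    for (int v : chunk) acc += static_cast<long>(v) * v;
    return acc;
}

// Splitter: cut `data` into `parts` contiguous, nearly equal chunks.
std::vector<std::vector<int>> split(const std::vector<int>& data,
                                    std::size_t parts) {
    assert(parts > 0);
    std::vector<std::vector<int>> chunks(parts);
    for (std::size_t i = 0; i < data.size(); ++i) {
        chunks[i * parts / data.size()].push_back(data[i]);
    }
    return chunks;
}

struct SpmdResult {
    long total;                       // joined result
    std::vector<std::size_t> ran_on;  // ran_on[i] = PE that processed chunk i
};

// Chunk i is initially owned by PE i. If that PE is permanently faulty, only
// the chunk's data migrates to the next healthy PE; the program is unchanged.
// Returns std::nullopt when no healthy PE is left.
std::optional<SpmdResult> run_spmd(const std::vector<int>& data,
                                   const std::vector<bool>& healthy) {
    const std::size_t n = healthy.size();
    assert(n > 0);
    const auto chunks = split(data, n);
    SpmdResult out{0, {}};
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pe = i;
        for (std::size_t probed = 0; !healthy[pe] && probed < n; ++probed) {
            pe = (pe + 1) % n;  // look for the next healthy PE
        }
        if (!healthy[pe]) return std::nullopt;
        out.total += kernel(chunks[i]);  // joiner accumulates partial results
        out.ran_on.push_back(pe);
    }
    return out;
}
```

In an actual MPI program the splitter and joiner would typically map to scatter and gather style collectives, but that mapping is not specified in the slides.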
Things to Explore Further • MPSoC with Heterogeneous Processors • Simultaneous Multiple Application Processing • Recovering from Control Faults
Simulation Framework • SystemC to model the Framework • C/C++ for the Application to be mapped
Expected Result • Achieve Run-time Dynamic Fault-Recovery with negligible performance (speed-up) cost • Better Resource Utilization • Achieve graceful degradation
Time Line • First and Second Week • Literature Survey • Third Week • Design of the models • Fourth and Fifth Week • Implementation, Coding and Debugging
References [1] Xinping Zhu and Wei Qin, “Prototyping a Fault-Tolerant Multiprocessor SoC with Run-time Fault Recovery,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [2] Grant Martin, “Overview of the MPSoC Design Challenge,” DAC 2006, July 24-28, 2006, San Francisco, California, USA. [3] Peter Flake, Simon Davidmann, and Frank Schirrmeister, “System-Level Exploration Tools for MPSoC Designs,” DAC 2006, July 24-28, 2006, San Francisco, California, USA.