
Chapter 3



Presentation Transcript


  1. Chapter 3 Fault-Tolerant Design

  2. What is this chapter about? • Gives Overview of Fault-Tolerant Design • Focus on • Basic Concepts in Fault-Tolerant Design • Metrics Used to Specify and Evaluate Dependability • Review of Coding Theory • Fault-Tolerant Design Schemes • Hardware Redundancy • Information Redundancy • Time Redundancy • Examples of Fault-Tolerant Applications in Industry

  3. Fault-Tolerant Design • Introduction • Fundamentals of Fault Tolerance • Fundamentals of Coding Theory • Fault Tolerant Schemes • Industry Practices • Concluding Remarks

  4. Introduction • Fault Tolerance • Ability of system to continue error-free operation in presence of unexpected fault • Important in mission-critical applications • E.g., medical, aviation, banking, etc. • Errors very costly • Becoming important in mainstream applications • Technology scaling causing circuit behavior to become less predictable and more prone to failures • Needing fault tolerance to keep failure rate within acceptable levels

  5. Faults • Permanent Faults • Due to manufacturing defects, early life failures, wearout failures • Wearout failures due to various mechanisms • e.g., electromigration, hot carrier degradation, dielectric breakdown, etc. • Temporary Faults • Only present for short period of time • Caused by external disturbance or marginal design parameters

  6. Temporary Faults • Transient Errors (Non-recurring errors) • Caused by external disturbance • e.g., radiation, noise, power disturbance, etc. • Intermittent Errors (Recurring errors) • Caused by marginal design parameters • Timing problems • e.g., races, hazards, skew • Signal integrity problems • e.g., crosstalk, ground bounce, etc.

  7. Redundancy • Fault Tolerance requires some form of redundancy • Time Redundancy • Hardware Redundancy • Information Redundancy

  8. Time Redundancy • Perform Same Operation Twice • See if get same result both times • If not, then fault occurred • Can detect temporary faults • Cannot detect permanent faults • Would affect both computations • Advantage • Little to no hardware overhead • Disadvantage • Impacts system or circuit performance
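
The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not part of the original deck: `run_with_time_redundancy` and `make_glitchy_adder` are hypothetical names, and the "transient fault" is simulated by corrupting the first call only.

```python
def run_with_time_redundancy(operation, *args):
    """Run `operation` twice on the same inputs and compare the results."""
    first = operation(*args)
    second = operation(*args)
    if first != second:
        # A mismatch means a temporary fault disturbed one of the two runs.
        raise RuntimeError("time-redundancy check failed: results differ")
    return first


def make_glitchy_adder():
    """Hypothetical adder whose first call is hit by a transient fault."""
    calls = {"count": 0}

    def add(a, b):
        calls["count"] += 1
        result = a + b
        # Flip the low bit on the first call only (simulated transient).
        return result ^ 1 if calls["count"] == 1 else result

    return add


def permanently_faulty_add(a, b):
    """Hypothetical adder with a permanent fault: low bit always flipped."""
    return (a + b) ^ 1
```

The transient fault is caught because the two runs disagree. The permanently faulty adder returns the same wrong value both times, so the comparison passes, which is why time redundancy alone cannot detect permanent faults.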

  9. Hardware Redundancy • Replicate hardware and compare outputs • From two or more modules • Detects both permanent and temporary faults • Advantage • Little or no performance impact • Disadvantage • Area and power for redundant hardware

  10. Information Redundancy • Encode outputs with error detecting or correcting code • Code selected to minimize redundancy for class of faults • Advantage • Less hardware to generate redundant information than replicating module • Drawback • Added complexity in design

  11. Failure Rate • $\lambda(t)$ = Component failure rate • Measured in FITs (failures per $10^9$ hours)

  12. System Failure Rate • System constructed from components • No Fault Tolerance • Any component fails, whole system fails • System failure rate is sum of component failure rates: $\lambda_{sys} = \sum_i \lambda_i$

  13. Reliability • If component working at time 0 • R(t) = Probability still working at time t • Exponential Failure Law • If failure rate assumed constant, $R(t) = e^{-\lambda t}$ • Good approximation if past infant mortality period

  14. Reliability for Series System • Series System • All components need to work for system to work • $R_{sys}(t) = \prod_i R_i(t)$

  15. System Reliability with Redundancy • System reliability with component B in Parallel • Can tolerate one component B failing

  16. Mean-Time-to-Failure (MTTF) • Average time before system fails • Equal to area under reliability curve: $MTTF = \int_0^\infty R(t)\,dt$ • For Exponential Failure Law, $MTTF = 1/\lambda$

  17. Maintainability • If system failed at time 0 • M(t) = Probability repaired and operational at time t • System repair time divided into • Passive repair time • Time for service engineer to travel to site • Active repair time • Time to locate failing component, repair/replace, and verify system operational • Can be improved through designing system so easy to locate failed component and verify

  18. Repair Rate and MTTR • $\mu$ = rate at which system repaired • Analogous to failure rate $\lambda$ • Maintainability often modeled as $M(t) = 1 - e^{-\mu t}$ • Mean-Time-to-Repair (MTTR) = $1/\mu$

  19. Availability • System Availability • Fraction of time system is operational • [Figure: system state S (1 = normal system operation, 0 = failed) versus time t, with failures and repairs at t0 through t4]

  20. Availability • Telephone Systems • Required to have system availability of 0.9999 (“four nines”) • High-Reliability Systems • May require 7 or more nines • Fault-Tolerant Design • Needed to achieve such high availability from less reliable components
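
To make the "nines" concrete, here is a short sketch. It uses the standard steady-state relation availability = MTTF / (MTTF + MTTR), which is an assumption brought in here rather than something stated on the slides, and the function names are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per 365-day year for a given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - avail) * minutes_per_year


four_nines = downtime_minutes_per_year(0.9999)      # roughly 52.6 minutes
seven_nines = downtime_minutes_per_year(0.9999999)  # roughly 3 seconds
```

A system with an MTTF of 9999 hours and an MTTR of 1 hour just reaches four nines under this model.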

  21. Coding Theory • Coding • Using more bits than necessary to represent data • Provides way to detect errors • Errors occur when bits get flipped • Error Detecting Codes • Many types • Detect different classes of errors • Use different amounts of redundancy • Ease of encoding and decoding data varies

  22. Block Code • Message = Data Being Encoded • Block code • Encodes m messages with n-bit codeword • If no redundancy • m messages encoded with log2(m) bits • minimum possible

  23. Block Code • To detect errors, some redundancy needed • Space of distinct $2^n$ blocks partitioned into codewords and non-codewords • Can detect errors that cause codeword to become non-codeword • Cannot detect errors that cause codeword to become another codeword

  24. Separable Block Code • Separable • n-bit blocks partitioned into • k information bits directly representing message • (n-k) check bits • Denoted (n,k) Block Code • Advantage • k-bit message directly extracted without decoding • Rate of Separable Block Code = k/n

  25. Example of Separable Block Code • (4,3) Parity Code • Check bit is XOR of 3 message bits • message 101 → codeword 1010 • Single Bit Parity
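
A minimal sketch of the (4,3) parity code from this slide, with bits held in Python lists. The function names are illustrative only.

```python
def parity_encode(message_bits):
    """Append one check bit equal to the XOR of the message bits."""
    check = 0
    for bit in message_bits:
        check ^= bit
    return message_bits + [check]


def parity_ok(codeword):
    """A valid codeword has an even number of 1s (XOR of all bits is 0)."""
    total = 0
    for bit in codeword:
        total ^= bit
    return total == 0


codeword = parity_encode([1, 0, 1])   # message 101
```

Because the code is separable, the message is simply the first three bits of the codeword. Any single flipped bit makes `parity_ok` return `False`, while two flipped bits cancel and go undetected.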

  26. Example of Non-Separable Block Code • One-Hot Code • Each Codeword has single 1 • Example of 8-bit one-hot • 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001 • Redundancy = 1 - log2(8)/8 = 5/8

  27. Linear Block Codes • Special class • Modulo-2 sum of any 2 codewords also codeword • Null space of $(n-k) \times n$ Boolean matrix • Called Parity Check Matrix, H • For any n-bit codeword c • $cH^T = 0$ • All 0 codeword exists in any linear code

  28. Linear Block Codes • Generator Matrix, G • $k \times n$ Matrix • Codeword c for message m • $c = mG$ • $GH^T = 0$

  29. Systematic Block Code • First k-bits correspond to message • Last n-k bits correspond to check bits • For Systematic Code • $G = [I_{k \times k} : P_{k \times (n-k)}]$ • $H = [P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}]$, so that $GH^T = P + P = 0$ modulo 2 • Example
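
The matrix relations on the last three slides can be checked numerically. The sketch below builds G = [I : P] and H = [P^T : I] over GF(2) with plain lists; the 4x3 matrix `P` is an arbitrary choice made here for illustration, not one taken from the slides.

```python
def identity(size):
    return [[1 if i == j else 0 for j in range(size)] for i in range(size)]


def systematic_matrices(P):
    """Return (G, H) with G = [I_k : P] and H = [P^T : I_(n-k)]."""
    k, r = len(P), len(P[0])
    G = [identity(k)[i] + P[i] for i in range(k)]
    H = [[P[i][j] for i in range(k)] + identity(r)[j] for j in range(r)]
    return G, H


def encode(message, G):
    """c = mG, with arithmetic modulo 2."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(n)]


def syndrome(vector, H):
    """s = H v^T modulo 2; all zeros exactly when v is a codeword."""
    return [sum(h * v for h, v in zip(row, vector)) % 2 for row in H]


# Hypothetical P for a (7,4) code: four message bits, three check bits.
P = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
G, H = systematic_matrices(P)
codeword = encode([1, 0, 1, 1], G)
```

Every row of G is itself a codeword, so each has an all-zero syndrome, which is the statement GH^T = 0. The first four bits of `codeword` are the message itself, as expected for a systematic code.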

  30. Distance of Code • Distance between two codewords • Number of bits in which they differ • Distance of Code • Minimum distance between any two codewords in code • If n=k (no redundancy), distance = 1 • Single-bit parity, distance = 2 • Code with distance d • Detect up to d-1 errors • Correct up to $\lfloor (d-1)/2 \rfloor$ errors
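
The two distance claims on this slide (distance 1 with no redundancy, distance 2 for single-bit parity) can be confirmed by brute force. This is a small sketch with illustrative names; it enumerates all codeword pairs, so it is only practical for tiny codes.

```python
from itertools import combinations, product


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capabilities(d):
    """Errors detectable (d-1) and correctable (floor((d-1)/2))."""
    return {"detect": d - 1, "correct": (d - 1) // 2}


uncoded = list(product([0, 1], repeat=3))           # n = k = 3
parity = [m + (sum(m) % 2,) for m in uncoded]       # (4,3) parity code
```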

  31. Error Correcting Codes • Code with distance 3 • Called single error correcting (SEC) code • Code with distance 4 • Called single error correcting and double error detecting (SEC-DED) code • Procedure for constructing SEC code • Described in [Hamming 1950] • Any H-matrix with all columns distinct and no all-0 column is SEC

  32. Hamming Code • For any value of n • SEC code constructed by • setting each column in H equal to binary representation of column number (starting from 1) • Number of rows in H equal to $\lceil \log_2(n+1) \rceil$ • Example of SEC Hamming Code for n=7

  33. Error Correction in Hamming Code • Syndrome, s • $s = Hv^T$ for received vector v • If v is codeword • Syndrome = 0 • If v non-codeword and single-bit error • Syndrome will match one of columns of H • Will contain binary value of bit position in error

  34. Example of Error Correction • For (7,4) Hamming Code • Suppose codeword 0110011 has one-bit error changing it to 1110011
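
The correction in this example can be worked through in code. The sketch assumes the n=7 Hamming H-matrix described earlier (column j holds the binary representation of j) and assumes the leftmost bit of the written codeword is column 1; the slide's own matrix figure is not available, so that bit ordering is a guess. With this H, the syndrome read as a binary number is simply the XOR of the positions of all 1 bits.

```python
def syndrome(received):
    """Syndrome as an integer: XOR of the (1-based) positions holding a 1."""
    s = 0
    for position, bit in enumerate(received, start=1):
        if bit:
            s ^= position
    return s


def correct_single_error(received):
    """Flip the bit the syndrome points at (if any); return (word, syndrome)."""
    s = syndrome(received)
    corrected = list(received)
    if s != 0:
        corrected[s - 1] ^= 1
    return corrected, s


sent = [0, 1, 1, 0, 0, 1, 1]        # codeword 0110011
received = [1, 1, 1, 0, 0, 1, 1]    # one-bit error: 1110011
fixed, s = correct_single_error(received)
```

Under these assumptions the syndrome of the received word is 1, pointing at the first bit, and flipping it restores the original codeword.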

  35. SEC-DED Code • Make SEC Hamming Code SEC-DED • By adding parity check over all bits • Extra parity bit • 1 for single-bit error • 0 for double-bit error • Makes possible to detect double bit error • Avoid assuming single-bit error and miscorrecting it
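
A sketch of how the extra parity bit separates single from double errors. It assumes the same n=7 Hamming H-matrix (column j is the binary value of j, leftmost bit is column 1) and appends the overall parity bit as an eighth bit; that layout and the function names are choices made here, not taken from the slides.

```python
def hamming_syndrome(bits7):
    """XOR of the 1-based positions that hold a 1 in the 7 Hamming bits."""
    s = 0
    for position, bit in enumerate(bits7, start=1):
        if bit:
            s ^= position
    return s


def extend(codeword7):
    """Append an overall parity bit so the 8-bit word has even weight."""
    return codeword7 + [sum(codeword7) % 2]


def classify(received8):
    """Decide between no error, single error and double error."""
    s = hamming_syndrome(received8[:7])
    overall = sum(received8) % 2      # 1 if an odd number of bits flipped
    if s == 0 and overall == 0:
        return "no error"
    if overall == 1:
        return "single error"         # correctable; s == 0 means parity bit
    return "double error"             # detected, must not be "corrected"


word = extend([0, 1, 1, 0, 0, 1, 1])
single = [1, 1, 1, 0, 0, 1, 1, 0]     # first bit flipped
double = [1, 0, 1, 0, 0, 1, 1, 0]     # first and second bits flipped
```

Without the extra bit, the double error above would yield a nonzero syndrome and be miscorrected as if a single bit had flipped.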

  36. Example of Error Correction • For (7,4) SEC-DED Hamming Code • Suppose codeword 0110011 has two-bit error changing it to 1010011 • Doesn’t match any column in H

  37. Hsiao Code • Weight of column • Number of 1’s in column • Constructing n-bit SEC-DED Hsiao Code • First use all possible weight-1 columns • Then all possible weight-3 columns • Then weight-5 columns, etc. • Until n columns formed • Number check bits is log2(n+1) • Minimizes number of 1’s in H-matrix • Less hardware and delay for computing syndrome • Disadvantage: Correction logic more complex

  38. Example of Hsiao Code • (7,3) Hsiao Code • Uses weight-1 and weight-3 columns
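
The column-selection rule for Hsiao codes can be sketched as below. This is a simplified illustration: it picks odd-weight columns in increasing weight, but the order within each weight class is arbitrary here, and it does not attempt the row-weight balancing that a careful Hsiao construction would also consider. `hsiao_columns` is a hypothetical helper name.

```python
from itertools import combinations


def hsiao_columns(n, check_bits):
    """Pick n distinct odd-weight columns, lowest weights first."""
    columns = []
    for weight in range(1, check_bits + 1, 2):      # weights 1, 3, 5, ...
        for ones in combinations(range(check_bits), weight):
            if len(columns) == n:
                return columns
            columns.append([1 if i in ones else 0 for i in range(check_bits)])
    if len(columns) < n:
        raise ValueError("not enough odd-weight columns for this n")
    return columns


def xor_columns(a, b):
    return [x ^ y for x, y in zip(a, b)]


# (7,3) code: 7 columns, 4 check bits.
columns = hsiao_columns(7, 4)
```

Because every column has odd weight, a single-bit error produces an odd-weight syndrome, while a double-bit error produces the XOR of two columns, which is nonzero with even weight. That parity of the syndrome weight is what separates the two cases.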

  39. Unidirectional Errors • Errors in block of data which only cause 0→1 or 1→0, but not both • Any number of bits in error in one direction • Example • Correct codeword 111000 • Unidirectional errors could cause • 001000, 000000, 101000 (only 1→0 errors) • Non-unidirectional errors • 101001, 011001, 011011 (both 1→0 and 0→1)
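
The examples on this slide can be checked mechanically. The sketch below compares a sent and a received word bit by bit; `classify_error` is an illustrative name.

```python
def classify_error(sent, received):
    """Classify an error pattern by the directions of its bit flips."""
    ones_to_zeros = sum(s == "1" and r == "0" for s, r in zip(sent, received))
    zeros_to_ones = sum(s == "0" and r == "1" for s, r in zip(sent, received))
    if ones_to_zeros == 0 and zeros_to_ones == 0:
        return "no error"
    if ones_to_zeros and zeros_to_ones:
        return "non-unidirectional"
    return "unidirectional"


sent = "111000"
unidirectional = ["001000", "000000", "101000"]
mixed = ["101001", "011001", "011011"]
```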

  40. Unidirectional Error Detecting Codes • All unidirectional error detecting (AUED) Codes • Detect all unidirectional errors in codeword • Single-bit parity is not AUED • Cannot detect even number of errors • No linear code is AUED • All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

  41. Two-Rail Code • Two-Rail Code • One check bit for each information bit • Equal to complement of information bit • Two-Rail Code is AUED • 50% Redundancy • Example of (6,3) Two-Rail Code • Message 101 has Codeword 101010 • Set of all codewords • 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000
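
A minimal sketch of two-rail encoding and checking, using bit strings with the check bits placed after the information bits as in the slide's example. Function names are illustrative.

```python
from itertools import product


def two_rail_encode(info):
    """Append the bitwise complement of the information bits."""
    complement = "".join("1" if bit == "0" else "0" for bit in info)
    return info + complement


def two_rail_valid(word):
    """Valid when each check bit is the complement of its information bit."""
    k = len(word) // 2
    return all(word[i] != word[i + k] for i in range(k))


all_codewords = [two_rail_encode("".join(bits)) for bits in product("01", repeat=3)]
```

A unidirectional error can only push an information/check pair toward 00 or toward 11, never swap it to the other valid pair, so at least one pair always becomes invalid.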

  42. Berger Codes • Lowest redundancy of separable AUED codes • For k information bits, $\lceil \log_2(k+1) \rceil$ check bits • Check bits equal to binary representation of number of 0’s in information bits • Example • Information bits 1000101 • $\lceil \log_2(7+1) \rceil = 3$ check bits • Check bits equal to 100 (4 zeros)

  43. Berger Codes • Codewords for (5,3) Berger Code • 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100 • If unidirectional errors • Contain 1→0 errors • increase 0’s in information bits • can only decrease binary number in check bits • Contain 0→1 errors • decrease 0’s in information bits • can only increase binary number in check bits
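
Berger encoding as described on the last two slides can be sketched as follows, using bit strings. The helper names are illustrative; the number of check bits, ceil(log2(k+1)), is computed as `k.bit_length()`, which gives the same value for k >= 1 without floating point.

```python
from itertools import product


def berger_encode(info):
    """Append the count of 0s in the information bits, written in binary."""
    k = len(info)
    check_bits = k.bit_length()          # equals ceil(log2(k + 1)) for k >= 1
    zeros = info.count("0")
    return info + format(zeros, "0{}b".format(check_bits))


def berger_valid(word, k):
    """Check that the check field matches the number of 0s in the info bits."""
    info, check = word[:k], word[k:]
    return int(check, 2) == info.count("0")


codewords_5_3 = [berger_encode("".join(bits)) for bits in product("01", repeat=3)]
```

For example, a 1→0 error pattern turns codeword 11100 into 10000: the information bits now hold two 0s, but the check field still reads 0, so the mismatch is detected.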

  44. Berger Codes • If 8 information bits • Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits • (16,8) Two-Rail Code • Requires 50% redundancy • Redundancy advantage of Berger Code • Increases as k increased

  45. Constant Weight Codes • Constant Weight Codes • Non-separable, but lower redundancy than Berger • Each codeword has same number of 1’s • Example 2-out-of-3 constant weight code • 110, 011, 101 • AUED code • Unidirectional errors always change number of 1’s

  46. Constant Weight Codes • Number codewords in m-out-of-n code • $\binom{n}{m}$ • Codewords maximized when m close to n/2 as possible • n/2-out-of-n when n even • (n/2-0.5 or n/2+0.5)-out-of-n when n odd • Minimizes redundancy of code

  47. Example • 6-out-of-12 constant weight code • $\binom{12}{6} = 924$ codewords • 12-bit Berger Code • Only $2^8 = 256$ codewords
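
The codeword counts in this comparison can be reproduced with a few lines. The sketch assumes an m-out-of-n code has "n choose m" codewords and that an n-bit Berger code uses the largest k with k + ceil(log2(k+1)) <= n; the function names are illustrative.

```python
import math


def constant_weight_count(m, n):
    """Number of n-bit words containing exactly m ones."""
    return math.comb(n, m)


def best_weight(n):
    """A weight m that maximizes the number of m-out-of-n codewords."""
    return max(range(n + 1), key=lambda m: math.comb(n, m))


def berger_count(n):
    """Codewords in the largest Berger code that fits in n bits (n >= 2)."""
    k = max(k for k in range(1, n) if k + k.bit_length() <= n)
    return 2 ** k


cw_12 = constant_weight_count(6, 12)
berger_12 = berger_count(12)
```

With 12 bits, the constant weight code has well over three times as many codewords as the Berger code, which is the redundancy advantage this example illustrates.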

  48. Constant Weight Codes • Advantage • Less redundancy than Berger codes • Disadvantage • Non-separable • Need decoding logic • to convert codeword back to binary message

  49. Burst Error • Burst Error • Common, multi-bit errors tend to be clustered • Noise source affects contiguous set of bus lines • Length of burst error • number of bits between first and last error • Wrap around from last to first bit of codeword • Example: Original codeword 00000000 • 00111100 is burst error length 4 • 00110100 is burst error length 4 • Any number of errors between first and last error
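
One way to compute burst length with wrap-around is to find the shortest cyclic window that covers every erroneous bit. The sketch below does this by locating the largest error-free gap around the cycle; this particular formulation is an interpretation of the slide's definition, and `burst_length` is an illustrative name.

```python
def burst_length(sent, received):
    """Length of the shortest cyclic window containing all bit errors."""
    n = len(sent)
    errors = [i for i in range(n) if sent[i] != received[i]]
    if not errors:
        return 0
    if len(errors) == 1:
        return 1
    # Cyclic distance from each error position to the next one.
    gaps = [(errors[(i + 1) % len(errors)] - errors[i]) % n
            for i in range(len(errors))]
    # The window starts right after the largest gap and spans the rest.
    return n - max(gaps) + 1


original = "00000000"
```

Both slide examples, 00111100 and 00110100, give length 4. A pattern such as 10000001 gives length 2 because the burst wraps from the last bit to the first.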

  50. Cyclic Codes • Special class of linear code • Any codeword shifted cyclically is another codeword • Used to detect burst errors • Less redundancy required to detect burst error than general multi-bit errors • Some distance 2 codes can detect all burst errors of length 4 • detecting all possible 4-bit errors requires distance 5 code
