Fault-Tolerant Platforms for Automotive Safety-Critical Applications

Baver Şahin 2006701344

Agenda • Introduction • Fault-Tolerance in SOC • Fault-Tolerant Multi-Processor Architectures • SOC Fault-Tolerant Architecture Implementation • Implementation Issues & Comparisons • Concluding Remarks

Introduction • Electronics in the car - In the late 70’s - digitally controlled combustion engines - digitally controlled anti-lock brake systems (ABS) - Synergy between mechanics and electronics - better fuel economy - better vehicle performance - driver-assisting functions (ABS, TCS, ESP, BA & safety features)

Introduction - X-by-wire systems: To design cars with better performance and a higher level of safety, engineers must replace the mechanical interfaces between the driver and the vehicle with electronic systems. - throttle pedal, brake pedal, gear selector, steering wheel - the electrical output is processed by micro-controllers that manage the power-train, braking and steering activities via electrical actuators.

Introduction - An example of a brake-by-wire system: It consists of several computer nodes controlling various sensors and actuators that communicate through a fault-tolerant real-time network and together form a distributed real-time computer system.

Introduction • Fault-Tolerance Requirements: Because drive-by-wire systems have no mechanical backup, they are assigned a high Safety Integrity Level. This means that their design must incorporate all the necessary techniques for achieving fault tolerance. • Fault-Tolerant Design Approaches - hardware redundancy: 1) Static redundancy is based on voting on the outputs of a number of modules to mask the effects of a fault within these units. The simplest form of this arrangement consists of three modules and a voter and is termed a triple modular redundant (TMR) system. 2) Dynamic redundancy, on the other hand, is based on fault detection rather than fault masking. This is achieved by using two modules and some sort of comparison of their outputs that can detect possible faults. This method has a lower component count but is not suitable for real-time applications. 3) Hybrid redundancy uses a combination of voting, fault detection and module switching, thus combining static and dynamic redundancy.
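
The difference between static and dynamic redundancy can be sketched in a few lines of Python. This is only an illustration under simple assumptions: module outputs are plain values, and the function names `tmr_vote` and `duplex_check` are hypothetical, not taken from the presentation.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Static redundancy: mask a single faulty module by majority vote."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # More than one module disagrees, so no majority exists.
        raise RuntimeError("no majority among the three modules")
    return value


def duplex_check(a, b):
    """Dynamic redundancy: detect a disagreement, but do not mask it."""
    return a == b


print(tmr_vote(42, 42, 7))   # 42: the faulty third output is masked
print(duplex_check(42, 7))   # False: a fault is detected, not located
```

With two modules the comparison only says that something is wrong; the system still has to react (for example by switching to a spare), which is the step that hybrid redundancy adds.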

Fault-Tolerance in SOC • New trends in the automotive industry, like the development of drive-by-wire systems, have generated the need for computer systems with high levels of fault tolerance and also low cost. This can be achieved by using system-on-chip (SoC) design methods. • Common mode failures - clock tree - power supply - silicon substrate

Fault-Tolerance in SOC • Experienced Faults - hard fails: permanent failures that are caused by an irreversible physical change; the name derives from the term ‘hardware failure’. - soft fails (single event upsets, SEU): a spontaneous error or change in stored information which cannot be reproduced. - external electronic noise - nuclear particles that come, for example, from the decay of radioactive atoms

Fault-Tolerance in SOC • While the occurrence of a permanent fault may impair or even stop the correct functionality of the system, soft errors caused by transient faults often drastically reduce the system availability. As a result, soft error avoidance is often strongly required to maintain the system availability at an acceptable level. - static temporal redundancy - triple execution and majority voting - masks any single soft error - dynamic technique - duplication and comparison - deploys error detection

Fault-Tolerance in SOC • While error detection drastically simplifies system roll-back and restart, error masking eliminates (or at least reduces) this need, thus maintaining the provided availability at an acceptable level.
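
The two temporal-redundancy techniques can be contrasted with a small sketch. It assumes a deterministic task that is re-executed on the same inputs; the helper `make_flaky_adder`, which flips one bit on a chosen run to imitate a soft error, is purely hypothetical.

```python
def make_flaky_adder(faulty_run):
    """Hypothetical task: adds two ints, with one bit flipped on a chosen run."""
    calls = {"n": 0}

    def add(x, y):
        calls["n"] += 1
        result = x + y
        if calls["n"] == faulty_run:
            result ^= 0x4  # inject a single-bit soft error
        return result

    return add


def triple_execute(task, *args):
    """Static temporal redundancy: run three times, return the majority."""
    results = [task(*args) for _ in range(3)]
    for candidate in results:
        if results.count(candidate) >= 2:
            return candidate
    raise RuntimeError("no two runs agree")


def duplicate_and_compare(task, *args):
    """Dynamic technique: run twice, only detect a mismatch."""
    first, second = task(*args), task(*args)
    if first != second:
        raise RuntimeError("soft error detected: roll back and restart")
    return first


masked = triple_execute(make_flaky_adder(faulty_run=2), 2, 3)
print(masked)  # 5: the corrupted second run is outvoted
```

Triple execution keeps the system running through the upset, while duplication and comparison costs one run less but leaves recovery to a roll-back.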

Fault-Tolerant Multi-Processor Architectures • Lock-Step Dual Processor Architecture:

Fault-Tolerant Multi-Processor Architectures • Lock-Step Dual Processor Architecture: - two processors (master & checker): execute the same code while strictly synchronized. - master: has access to the system memory and drives all system outputs. - checker: continuously executes the instructions moving on the bus (i.e. those fetched by the master processor).

Fault-Tolerant Multi-Processor Architectures - compare logic (monitor): consists of a comparator circuit at the master’s and checker’s bus interfaces that checks the consistency of their data, address and control lines. A disagreement on the value of any pair of duplicated bus lines reveals the presence of a fault on one of the CPUs, but does not identify which CPU is faulty. - source of common-mode failure: bus and memory errors - error detection (correction) techniques - parity bits

Fault-Tolerant Multi-Processor Architectures • The lock-step architecture can be employed as a fail-silent node able to detect any single error (100% coverage), permanent or transient, occurring in the CPU, memory or communication sub-system. Error correcting codes are required when errors on buses and memories turn out to be relatively frequent due to transient faults.
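
A minimal software model of the compare logic and of a parity check on the shared bus might look as follows. The `BusCycle` record, the even-parity convention and all function names are assumptions made for illustration; the real monitor is a hardware comparator.

```python
from typing import NamedTuple, Optional


class BusCycle(NamedTuple):
    """Values a CPU drives on its bus interface in one clock cycle."""
    address: int
    data: int
    control: int


def commit_or_silence(master: BusCycle, checker: BusCycle) -> Optional[int]:
    """Fail-silent behaviour: output the data only if every line agrees."""
    if master != checker:
        return None  # fault on one CPU; which one is unknown
    return master.data


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Detect a single-bit error on the (non-duplicated) bus or memory."""
    return even_parity(word) == stored_parity


good = BusCycle(address=0x1000, data=0xAB, control=0b01)
bad = good._replace(data=0xAA)        # checker disagrees on one data bit
print(commit_or_silence(good, good))  # 171
print(commit_or_silence(good, bad))   # None: the node stays silent

p = even_parity(0xAB)
print(parity_ok(0xAB ^ 0x01, p))      # False: bit flip on the bus detected
```

The comparator covers the duplicated CPUs, while the parity bit covers the bus and memory that both CPUs share and that would otherwise be a common-mode failure source.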

Fault-Tolerant Multi-Processor Architectures • Loosely-Synchronized Dual Processor Architecture:

Fault-Tolerant Multi-Processor Architectures • Loosely-Synchronized Dual Processor Architecture: - two CPUs: run independently, with access to distinct memory sub-systems. - A real-time operating system running on both CPUs provides: - interprocessor communication - synchronization - error detection (e.g. by means of cross-checks), correction and containment (e.g. memory protection) - A subset of the tasks executed by the processors is defined as critical. The image of critical tasks is duplicated in both memories. Critical tasks are executed in parallel as software replicas and their outputs are exchanged after each run on a time-triggered basis. Both processors are responsible for checking their consistency.

Fault-Tolerant Multi-Processor Architectures - A mismatch: indicates a fault on the CPU, memory or communication sub-system and prevents outputs from being committed. - cross-check mismatch - sanity-check - self-testing - commitment of agreed outputs - First technique: to prevent outputs from being committed before being cross-checked, time guardians can restrict CPU access to system outputs to a predefined time window.

Fault-Tolerant Multi-Processor Architectures - Second technique: each processor adds its own signature to the outputs of critical tasks and the receiver checks for both signatures before accepting the data. - Depending on the subset of critical tasks, the architecture can appear in several different configurations. At one end, fully critical applications must be entirely replicated, thus requiring twice as much memory while providing the same performance as a single-processor architecture.

Fault-Tolerant Multi-Processor Architectures • The execution of a function on both CPUs guarantees the detection of any error (100% coverage) occurring in one of the CPUs, buses or memories. Since buses and memories (at least for critical tasks) are replicated, no other form of redundancy (e.g. parity bits) is needed to detect errors on these components. Nevertheless, ECCs may be employed in the case of a high memory (or bus) failure rate.
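
The cross-check followed by dual-signature commitment can be sketched as below. The presentation does not say how signatures are computed; using an HMAC from the Python standard library, the two keys, and the function names are all assumptions chosen only to make the idea concrete.

```python
import hashlib
import hmac

# Hypothetical per-CPU keys; a real system would provision these securely.
KEY_CPU_A = b"hypothetical-key-a"
KEY_CPU_B = b"hypothetical-key-b"


def sign(key: bytes, payload: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()


def cpu_sign_if_agreed(own_key, own_output, peer_output):
    """Each CPU cross-checks the peer's replica output before signing."""
    if own_output != peer_output:
        return None  # mismatch: refuse to sign, output is not committed
    return sign(own_key, own_output)


def receiver_accepts(payload, sig_a, sig_b):
    """The receiver commits the data only if both signatures are valid."""
    if sig_a is None or sig_b is None:
        return False
    return (hmac.compare_digest(sig_a, sign(KEY_CPU_A, payload))
            and hmac.compare_digest(sig_b, sign(KEY_CPU_B, payload)))


out_a = out_b = b"brake_torque=120"
sig_a = cpu_sign_if_agreed(KEY_CPU_A, out_a, out_b)
sig_b = cpu_sign_if_agreed(KEY_CPU_B, out_b, out_a)
print(receiver_accepts(out_a, sig_a, sig_b))  # True

corrupted = b"brake_torque=124"  # replica B computed a wrong value
sig_a = cpu_sign_if_agreed(KEY_CPU_A, out_a, corrupted)
print(receiver_accepts(out_a, sig_a, None))   # False
```

Because a faulty CPU cannot produce the other CPU's signature, a single fault cannot push an unchecked output through to the receiver.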

Fault-Tolerant Multi-Processor Architectures • Triple Modular Redundant (TMR) Architecture:

Fault-Tolerant Multi-Processor Architectures • Triple Modular Redundant (TMR) Architecture: - three identical CPUs: execute the same code in lock-step. - majority voter: a majority vote on the outputs masks any possible single CPU fault. • Memory and communication sub-system faults can be masked by employing ECC (Error Correcting Code) techniques.
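
A hardware voter typically works bit by bit on the bus lines. The sketch below models this with integer bus words; it assumes a single faulty CPU, and the names `bitwise_majority` and `vote` are illustrative only.

```python
def bitwise_majority(a: int, b: int, c: int) -> int:
    """Each output bit takes the value held by at least two of three inputs."""
    return (a & b) | (a & c) | (b & c)


def vote(outputs):
    """Return the voted word and the indices of CPUs that disagree with it."""
    a, b, c = outputs
    voted = bitwise_majority(a, b, c)
    dissenters = [i for i, out in enumerate(outputs) if out != voted]
    return voted, dissenters


voted, dissenters = vote([0b1010, 0b1010, 0b0010])
print(bin(voted), dissenters)  # 0b1010 [2]
```

Unlike the two-CPU lock-step comparator, the voter both masks the error and points at the CPU that produced it.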

Fault-Tolerant Multi-Processor Architectures • Dual Lock-Step Architecture:

Fault-Tolerant Multi-Processor Architectures • Dual Lock-Step Architecture: A configuration widely employed in multi-chip fault-tolerant systems combines two fail-silent channels, each consisting of a lock-step architecture like the one presented in Lock-Step Dual Processor Architecture, to build a single fail-operational unit. In this case, the architecture provides fault tolerance only for the replicated tasks, whose outputs are checked before being committed. • Software design errors can be prevented as well.

Fault-Tolerant Multi-Processor Architectures • In contrast to the solution presented in Loosely-Synchronized Dual Processor Architecture, the execution of sanity checks is no longer required, since self-checking capabilities are already provided in hardware by means of duplication and yield 100% fault coverage.

SOC Fault-Tolerant Architecture Implementation • Cost: Due to the costs associated with the higher integration level, single-chip implementations should have enough flexibility to support a wide range of applications, in order to share the silicon development cost across a set of different final electronic systems. • Flexibility: the capability of a silicon solution to adapt correctly to the performance, cost and fault-tolerance requirements of a set of applications after silicon production.

SOC Fault-Tolerant Architecture Implementation • In contrast to multi-chip solutions, in a single-chip dual-processor architecture the memory sub-system can be shared between the processors at much lower cost. Since the two cores can run independently, the memory and communication sub-systems are likely to become a major performance bottleneck. For this reason the memory sub-system is split into 4 banks (2 each for code and data) and the traditional bus is replaced by a higher-performance crossbar switch, which guarantees sufficient bandwidth between the processor and memory sub-systems.

SOC Fault-Tolerant Architecture Implementation • The single-chip loosely-synchronized dual processor architecture is called the Shared-Memory (SM) Loosely-Synchronized Dual-Processor: Since the memory sub-system is shared between the processors, the duplication of critical code becomes a trade-off between system integrity, memory size and performance: while duplicated critical code takes up costly memory space, non-duplicated critical code, which must be executed on both cores, runs at half the speed of a single processor.

SOC Fault-Tolerant Architecture Implementation • Shared-Memory (SM) Loosely-Synchronized Dual-Processor Architecture:

SOC Fault-Tolerant Architecture Implementation • SM Dual Lock-Step architecture: The two fail-silent channels share the same memory sub-system. This solution greatly enhances flexibility, since it covers the TMR solution (same fault-tolerance properties) while implementing the dual lock-step architecture. • Lock-Step mode: When fail-operational capability is required, the two channels can be arranged in lock-step mode, in which case the architecture masks CPU faults as in the TMR solution.

SOC Fault-Tolerant Architecture Implementation • Parallel mode: The two channels can be used as two completely parallel fail-silent channels, providing double performance. • Memories and buses are protected using ECCs in order to retain error masking capabilities on these components when operating in lock-step mode.
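
The two operating modes of the SM dual lock-step architecture can be modelled with a short sketch. Each channel is represented by the pair of outputs of its master and checker CPUs, and `None` stands for a silent channel; these conventions and the function names are assumptions for illustration.

```python
def failsilent_channel(master_out, checker_out):
    """A lock-step pair delivers its output, or stays silent (None) on mismatch."""
    return master_out if master_out == checker_out else None


def lockstep_mode(channel0, channel1):
    """Both channels run the same task; the first non-silent output is used."""
    for out in (failsilent_channel(*channel0), failsilent_channel(*channel1)):
        if out is not None:
            return out
    return None  # both channels failed: beyond the single-fault assumption


def parallel_mode(channel0, channel1):
    """Channels run different tasks; each one is only fail-silent."""
    return failsilent_channel(*channel0), failsilent_channel(*channel1)


# A CPU fault in channel 0 is masked by channel 1 in lock-step mode.
print(lockstep_mode((10, 11), (10, 10)))  # 10
# In parallel mode the same kind of fault silences only the affected task.
print(parallel_mode((1, 1), (2, 3)))      # (1, None)
```

Lock-step mode trades the second channel's throughput for fail-operational behaviour, while parallel mode keeps the throughput and settles for fail-silence per task.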

SOC Fault-Tolerant Architecture Implementation • SM Dual Lock-Step architecture:

Implementation Issues & Comparisons • The performance and the fault-tolerance features of the different solutions are compared and their costs are evaluated on the basis of area estimates. • The table summarizes the area of memory components (both RAM and FLASH) and buses, normalized to the CPU footprint. [Table: Area of embedded memory components normalized to CPU footprint]

Implementation Issues & Comparisons • [Table: Cost of different architectures for low-/mid-range X-by-wire systems]

Implementation Issues & Comparisons • The single-CPU architecture can be considered a reference design that satisfies the computational and memory requirements but provides no fault-tolerance capability. • Lock-step architecture: The lock-step architecture cannot provide any performance boost over the single-processor solution, since the two cores are bound to execute the same code cycle by cycle. Rather, the clock rate may have to be decreased because the compare logic and the ECC coders/decoders are introduced in the critical path.

Implementation Issues & Comparisons • However, with a relatively low area overhead, this solution provides 100% fault coverage with an error detection time on the order of the clock period. • Because both processors execute the same code, the lock-step configuration does not provide any protection against software design errors.

Implementation Issues & Comparisons • SM loosely-synchronized dual-processor architecture: In the SM loosely-synchronized dual-processor architecture the two CPUs can run independently, with full access to the memory sub-system and system I/O. Only critical tasks must be duplicated to meet safety requirements. • Like the lock-step configuration, the SM loosely-synchronized architecture provides 100% error detection when running fully critical applications. However, this requires roughly twice as much memory space to accommodate the duplicated code. The memory footprint is mostly responsible for the large area overhead, as shown in the table.

Implementation Issues & Comparisons • Moreover, fault diagnosis is complicated by the longer error detection time, which is proportional to the check execution period, and by the fact that error detection is only performed on selected outputs. Nonetheless, in contrast to the lock-step solution, the SM loosely-synchronized architecture can support both hardware and software design diversity and provides a degraded mode of operation. • Both configurations presented above provide no fault masking mechanism, except for the possible implementation of ECCs on buses and memories. This may be a major drawback, especially in the case of a high transient fault rate.

Implementation Issues & Comparisons • Triple modular redundant architecture: The TMR configuration represents a “low-cost” solution. In fact, the area overhead over the lock-step architecture is as low as 9% and 1.5% for low- and mid-range systems respectively. However, it also inherits almost all of the features and flaws of the lock-step architecture. Besides its unique capability of masking any single fault, gained at the cost of an additional CPU, it offers 100% error detection coverage within a single clock period.

Implementation Issues & Comparisons • SM dual lock-step architecture: The SM dual lock-step architecture combines the advantages of the SM loosely-synchronized solution in terms of flexibility with the fault masking capabilities provided by the TMR architecture. • When the two channels execute the same code in lock-step, they provide fault-tolerance capabilities. On the other hand, if the fail-silence property suffices for the application at hand, the two channels can operate completely independently and the architecture behaves like a “traditional” dual-processor solution.

Implementation Issues & Comparisons • This great deal of flexibility comes at a relatively low price. Compared with the fault-tolerant TMR architecture, the introduction of the 4th CPU yields a 10% overhead for low-range applications, while the overhead falls to just 2-3% for more memory-demanding applications. Note that to cover software design faults via design diversity, the memory footprint must be doubled, as done for the SM loosely-synchronized architecture. In this case too, comparing the two alternatives gives a modest increase in area, on the order of 8% and 2% for low- and mid-range applications respectively.

Implementation Issues & Comparisons • Trade-off analysis: - SM loosely-synchronized architecture - most area-demanding solution - lock-step and TMR architectures - cannot provide any performance improvement over the single-processor solution, while representing “low-cost” solutions - SM dual lock-step architecture - 100% single fault tolerance - wider range of applications - reduced engineering costs - best alternative among the four architectures

Concluding Remarks • A single-chip solution is proposed, devised for fault-tolerant automotive applications, which is based on the use of two lock-step channels (4 CPUs overall), a crossbar communication architecture and embedded memories.


THANKS Q&A
