Fault-tolerant Typed Assembly Language Q Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker Princeton University Transient Faults TALFT: An Idealized Fault-tolerant System Formalizing the TALFT System Performance Results • Operational semantics !’ specifies how a program executes. • is a machine state (R,C,M,Q,ir). • Special operational rules randomly introduce faults by arbitrarily modifying a value in the register file R or the queue Q • Judgment `Z states that machine state is well-typed under zap tag Z. • Z is either empty or a color c, representing the color of the computation that has been corrupted. • If Z is empty then all standard typing invariants must hold. • If Z is a color c, then values colored c do not need to conform to their given types. • We simulated the TALFT hardware using the Velocity Research Compiler. • Despite essentially doubling the number of instructions, TALFT is only 34% slower than the equivalent fault-intolerant code. • Goal: Detect all faults that change a program’s observable behavior and prove that well-typed programs always detect a single fault. • General Strategy: Two redundant and independent copies of each computation (labeled green and blue) that are compared before stores and control flow transfers. • Transient faults (also known as soft fails) occur when an energetic particle strikes a transistor, causing it to change state. • Does not permanently damage hardware. • May corrupt computation by altering stored values and signal transfers. 2014 (estimated) 2006 1 + 1 = 18 • In 2004, a typical laptop with 1GB RAM had 1 soft failure per year. • Fault rates increase with altitude. • Fault rates on a trans-pacific flight increase 300x to nearly 1 fault per roundtrip. The TALFT Machine • A machine state consists of a code memory C, a value memory M, a register file R, a current instruction ir, and a store queue Q. Conclusion • Sun, HP, and Cypress Semiconductor have admitted that transient faults caused crashes and problems at eBay, AOL, Los Alamos, and other major sites. • Faster clock rates, increasing transistor density, decreasing voltages, and smaller feature size all contribute to increasing fault rates of approximately 8% per generation. • Transient faults are already a significant concern, and their impact is increasing. • TALFT formalizes idealized hardware for a hybrid hardware-software fault-tolerance scheme. • TALFT’s type system captures the invariants necessary to reason about the system. • We formally proved that all well-typed programs are fault-tolerant relative to the fault model. . . . 0x0393 mov r1 4 0x0394 mov r3 4 0x0395 stG r2 r1 0x0396 stB r4 r3 . . . C The Fault-tolerance Theorem • A program is fault tolerant when all the faulty executions simulate fault-free executions of the program. • The relationship 1simc2 says that 1 and 2 are identical modulo values colored c. • (Simplified) Fault-Tolerance Theorem: Either a faulty execution is indistinguishable from the equivalent fault-free execution, or it terminates in the fault state. • If ` and simc 1 and ! ’ • then either 1! 2 and ’ simc 2 • or 1!fault. • A value is observed when it is stored to memory. • Q buffers a store from the green computation until the corresponding blue store is performed. • Special hardware is used by memory and control flow instructions to detect mismatches between the computations and trigger the special fault state. For More Information… • Transient faults are a well-known problem in the architecture community. • There are many existing solutions, but do these solutions actually work? • Solutions generally consist of algorithms stated in English, and it is left to the audience to judge correctness. • See our upcoming paper in PLDI ’07 • http://www.cs.princeton.edu/~frances/tal_ft-pldi07.pdf • Visit the Project ap Homepage • http://www.cs.princeton.edu/sip/proj/zap Frances Perry. CRA-W/CDC Programming Languages Summer School. May 9-11, 2007.