Explore memory faults in Linux, understand fault injection methods, delve into fault propagation, and implement solutions to protect the system. Learn about Software-Implemented Fault Injection (SWIFI), data structuring, and Error Correcting Codes (ECC). Dive into Hamming Codes and Majority Vote for safeguarding memory. Discover implementation design goals and challenges faced in protecting Linux from corruption. Follow the journey from diagnosis to solution with Jeffrey Freschl and Di Xue.
Memory Faults: Injection & Solutions Jeffrey Freschl, Di Xue
The Problem “Memory meets corruption, it happens every day, it could happen to you…” • --famous quote modified from the People Store Commercial • Can Linux handle cheap memory? • Can we protect ourselves from memory faults?
Talk Outline • Some Preparation (The How) • Actual Corruption and Results • A Solution (Methods and Implementation)
Software Fault Injection • SWIFI (Software-Implemented Fault Injection) is a common way to validate system design. • SWIFI gives the freedom we need.
What Do We Inject? task_struct • Process – an instance of a program in execution. • The kernel must know a process’s state to manage it properly. • task_struct contains information about a process.
Data Members • prio: process’s priority • run_list: address of the entry in the runqueue, which contains the list of TASK_RUNNING processes • time_slice: amount of time to run • lock_depth: locking for simultaneous access • policy: FIFO, round robin, etc. • mmap_base: below the stack's low limit (the base) • vm_start: start address of the VM area
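To make the injection step concrete, here is a minimal user-space sketch of a single-bit flip in one data member. It does not touch a real kernel: `FakeTask` is a hypothetical stand-in holding a few of the members listed above, with illustrative field types and layout, and `flip_bit` is an assumed helper name, not the authors' injector.

```python
import ctypes


class FakeTask(ctypes.Structure):
    # Hypothetical stand-in for a few task_struct members.
    # Field types and order are illustrative, not the kernel's layout.
    _fields_ = [
        ("prio", ctypes.c_int),
        ("time_slice", ctypes.c_uint),
        ("lock_depth", ctypes.c_int),
        ("policy", ctypes.c_ulong),
    ]


def flip_bit(obj, field, bit):
    """Flip one bit inside the named field, in place.

    Returns the byte offset (within the structure) that was modified.
    """
    desc = getattr(type(obj), field)  # field descriptor with .offset and .size
    if not 0 <= bit < desc.size * 8:
        raise ValueError("bit index is outside the field")
    # View the structure's memory as raw bytes so the fault bypasses the
    # normal typed accessors, as an injected memory fault would.
    raw = (ctypes.c_ubyte * ctypes.sizeof(obj)).from_buffer(obj)
    byte_index = desc.offset + bit // 8
    raw[byte_index] ^= 1 << (bit % 8)
    return byte_index
```

Calling `flip_bit(task, "prio", 3)` on a `FakeTask` changes exactly one bit of `prio` and leaves the other members alone; calling it a second time restores the original value.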
Fault Propagation • EIP locates fault point • Call Trace illustrates path to fault
Part III – A Solution Protecting Linux from Di’s Corruption
Methods (Update & Access) • Error Correcting Codes (ECC) • Majority Vote What are the tradeoffs? Time? Space? Recoverability?
Intro to Hamming Code (Magic) • Hamming Rule: d + p + 1 ≤ 2^p (d is # of input bits, p is # of parity bits) • Generator Matrix G: G = [I : A] • A is a (d × p) matrix • A must have unique rows and columns
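As a quick check of the Hamming rule, the sketch below searches for the smallest number of parity bits p that satisfies d + p + 1 ≤ 2^p for a given number of input bits d. The function name `min_parity_bits` is an assumption made for illustration.

```python
def min_parity_bits(d):
    """Smallest p with d + p + 1 <= 2**p, for d input (data) bits."""
    if d < 1:
        raise ValueError("need at least one data bit")
    p = 1
    while d + p + 1 > 2 ** p:
        p += 1
    return p
```

For example, 4 data bits need 3 parity bits, and a 32-bit member needs 6, so the parity overhead grows slowly as the protected member gets larger.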
Hamming cont. (More Magic) • To encode input string: codeword = input × G • To check if input string is corrupt: H = [A^T : I], syndrome = H × codeword • if (syndrome == 0) then no corruption; otherwise, match syndrome to a column in H
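The encode and check steps can be sketched for the small case d = 4, p = 3. The particular matrix `A` below is one valid choice picked for illustration (its rows are distinct, non-zero, and differ from the identity columns); the slides do not say which matrix was actually used. All arithmetic is modulo 2, and the function names are assumptions.

```python
D, P = 4, 3  # data bits, parity bits

# One possible A (d x p); an illustrative choice, not taken from the talk.
A = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]

# Columns of H = [A^T : I]: the rows of A, followed by the identity columns.
H_COLUMNS = [tuple(row) for row in A] + [
    tuple(int(i == j) for j in range(P)) for i in range(P)
]


def encode(data):
    """codeword = data x G with G = [I : A]: the data bits, then parity bits."""
    parity = [sum(data[i] * A[i][j] for i in range(D)) % 2 for j in range(P)]
    return list(data) + parity


def syndrome(word):
    """H x word (mod 2); all zeros means no corruption was detected."""
    return tuple(
        (sum(A[i][j] * word[i] for i in range(D)) + word[D + j]) % 2
        for j in range(P)
    )


def correct(word):
    """Return a copy of word with a single flipped bit (if any) repaired."""
    fixed = list(word)
    s = syndrome(word)
    if any(s):
        # The syndrome equals the column of H at the corrupted position.
        fixed[H_COLUMNS.index(s)] ^= 1
    return fixed
```

With this `A`, `encode([1, 0, 1, 1])` yields `[1, 0, 1, 1, 0, 1, 0]`, and flipping any single bit of that codeword produces a non-zero syndrome that `correct` maps back to the flipped position.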
Hamming (Back to Reality) • Redundancy • Can only recover from 1 bit corruption • Space • Almost constant (optimal # of parity bits) • Time • Lots of bitwise XORs and ANDs
Majority Vote • Time to update very fast! • Space Overhead! • Simple Implementation!! if (copy1 != copy2) use copy3 (assuming at most one copy is corrupt) else everything is OK
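A minimal sketch of majority vote over three copies follows. Unlike the one-line rule on the slide, it also repairs the outvoted copy and reports the case where all three copies disagree; both behaviours are assumptions about what a fuller implementation might do, and `vote` is a hypothetical name.

```python
def vote(copies):
    """Return the majority value of three copies and repair the odd one out.

    copies is a list of three values; it is overwritten in place so that
    all three entries hold the winning value.
    """
    a, b, c = copies
    if a == b or a == c:
        winner = a
    elif b == c:
        winner = b
    else:
        raise ValueError("all three copies disagree; cannot recover")
    copies[:] = [winner, winner, winner]
    return winner
```

The tradeoff is visible here: an update is three plain stores and a read is at most three comparisons, but every protected member costs three times its own size.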
Design Goals • Want a “redundancy repository” for entire kernel • Minimize Programmer’s Pain! • On demand backup • Scalability
“Just give me a location and I’ll take care of you!” - Redundancy Repository
Redundancy Repository • Redundancy HashTable • Member Entry: int size, long id, char parity
How to Protect? Redundancy API • checkParity( addressOfMember, size ): add before a read access • updateParity( addressOfMember, addressOfNewValue, size ): add before an update
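The two API calls can be sketched in user space as follows. This is only an illustration under stated assumptions: a `bytearray` stands in for kernel memory, the address doubles as the entry id, and the parity is a single XOR-folded byte that detects corruption but cannot repair it. The talk's actual parity scheme and recovery path are not shown in the slides, and the Python names are hypothetical counterparts of `checkParity` and `updateParity`.

```python
class RedundancyRepository:
    """Toy redundancy repository: a hash table of member entries."""

    def __init__(self, memory):
        self.memory = memory  # bytearray standing in for kernel memory
        self.entries = {}     # id (here: the address) -> (size, parity)

    @staticmethod
    def _parity(data):
        # Assumed scheme: XOR of all bytes folded into one byte (detect only).
        p = 0
        for byte in data:
            p ^= byte
        return p

    def update_parity(self, address, new_value, size):
        """Call before writing new_value (bytes) to memory at address."""
        if len(new_value) != size:
            raise ValueError("new value does not match the member size")
        self.entries[address] = (size, self._parity(new_value))

    def check_parity(self, address, size):
        """Call before reading; False means the member looks corrupted."""
        entry = self.entries.get(address)
        if entry is None:
            return True  # on-demand backup: unregistered members are skipped
        stored_size, parity = entry
        if stored_size != size:
            raise ValueError("size differs from the registered member size")
        return self._parity(self.memory[address:address + size]) == parity
```

A caller would invoke `update_parity` immediately before each store to a protected member and `check_parity` immediately before each load, which matches the placement described on the slide.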
Some Challenges • Dealing with different sized data members. • Originally focused on protecting address • Solution: Need to know size of data • What about recursive redundancy? • User Registration • Manual Integration
Updated Results • Di + Kernel + Solution = Harmony
Summary • 20% of the critical data members we tested caused a crash. • Finding every location that updates memory is difficult. • The system no longer crashed with our redundancy solution.
Thank You • Jeffrey Freschl jfreschl@cs.wisc.edu • Di Xue goldenspaceship@gmail.com