
Fault Analysis Using Pin

Fault Analysis Using Pin. Srilatha (Bobbie) Manne Intel. What are we trying to do?. Purpose: Simulate the occurrence of transient (or persistent) faults and analyze their impact on applications. Why Pin? Easy to model faults and measure their impact.


Presentation Transcript


  1. Fault Analysis Using Pin Srilatha (Bobbie) Manne Intel

  2. What are we trying to do? • Purpose: Simulate the occurrence of transient (or persistent) faults and analyze their impact on applications. • Why Pin? • Easy to model faults and measure their impact. • Relatively fast (5-10 minutes per fault injection) • Provides full program analysis

  3. Pros & Cons [Figure comparing four approaches (Software Instrumentation, Architectural Simulator, RTL, Silicon) in terms of Accuracy and Ease of Use]

  4. Pin’s View of the World [Figure labels: uArch State, Arch Reg, Memory]

  5. Modeling Microarchitectural Faults in Pin • Accuracy of fault methodology depends on the complexity of the underlying microarchitecture • Easier to model faults in an in-order, single issue machine • Build a microarchitectural model into Pin • A low fidelity model may suffice • Adds complexity and slows down simulation time • Mimic certain types of microarchitectural faults in Pin

  6. Example: Destination Register Transmission Fault • Fault occurs in latches when forwarding instruction output • Change the architectural value of the destination register at the instruction where the fault occurs • NOTE: This is different from inserting a fault into the register file, because the destination is selected based on the instruction where the fault occurs [Figure labels: Exec Unit, Bypass Logic, ROB, RS, Latches]

  7. Example: Load Data Transmission Faults • Fault occurs when loading data from the memory system • Before load instruction, insert fault into memory • Execute load instruction • After load instruction, remove fault from memory (Cleanup) • NOTE: This models a fault occurring in the transmission of data from the STB or L1 Cache [Figure labels: STB, Load Buffer, Latches, DCache]
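The insert, load, cleanup sequence on this slide can be sketched without Pin. This is a minimal standalone illustration, not the author's tool: `Memory` and `faultyLoad` are hypothetical names, memory is modeled as a byte vector, and the "load" is a plain read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the application's memory.
using Memory = std::vector<uint8_t>;

// Models a fault on the path from memory to the load: the loaded value
// is corrupted, but memory itself is left unchanged afterwards.
uint8_t faultyLoad(Memory& mem, std::size_t addr, unsigned bit) {
    assert(addr < mem.size() && bit < 8);
    const uint8_t saved = mem[addr];                        // remember the clean value
    mem[addr] = static_cast<uint8_t>(saved ^ (1u << bit));  // before the load: insert fault
    const uint8_t loaded = mem[addr];                       // the load executes
    mem[addr] = saved;                                      // after the load: cleanup
    return loaded;
}
```

Only the consumer of the load sees the flipped bit; a later load from the same address reads the original value again, which is what distinguishes this from a fault in the memory array itself.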

  8. Five Step Program for Fault Analysis • Determine ‘when’ the fault occurs • Determine ‘where’ the fault occurs • Inject Fault • Cleanup (Optional) • Determine Outcome

  9. Step 1: WHEN • Reality: • Assuming that environmental conditions stay the same, transient faults can occur with equal probability at any time during the run of the application. • Approximation: • Transient faults occur on any dynamic instruction with equal probability

  10. Step 1: WHEN • Sample Pin Tool: InstCount.C • Purpose: Efficiently determines the number of dynamic instances of each static instruction. • Output: For each static instruction • Function name • Dynamic instructions per static instruction
      IP: 135000941 Count: 492714322 Func: propagate_block.104
      IP: 135000939 Count: 492714322 Func: propagate_block.104
      IP: 135000961 Count: 492701800 Func: propagate_block.104
      IP: 135000959 Count: 492701800 Func: propagate_block.104
      IP: 135000956 Count: 492701800 Func: propagate_block.104
      IP: 135000950 Count: 492701800 Func: propagate_block.104
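One way to use such a profile, sketched under assumptions: if every dynamic instruction is equally likely to be hit, a static instruction should be chosen with probability proportional to its dynamic count. The names `StaticInst`, `totalCount`, `pickStatic`, and `sampleStatic` are hypothetical and not part of InstCount.C. Note that this selects which static instruction is hit, not the point in time of the dynamic instance.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

struct StaticInst {
    uint64_t ip;     // instruction address from the profile
    uint64_t count;  // dynamic instances of this static instruction
};

uint64_t totalCount(const std::vector<StaticInst>& profile) {
    uint64_t total = 0;
    for (const auto& s : profile) total += s.count;
    return total;
}

// Maps r in [0, totalCount) to a static instruction by walking the
// cumulative counts. Returns 0 if r is out of range.
uint64_t pickStatic(const std::vector<StaticInst>& profile, uint64_t r) {
    for (const auto& s : profile) {
        if (r < s.count) return s.ip;
        r -= s.count;
    }
    return 0;
}

// Draws a static instruction weighted by its dynamic count.
uint64_t sampleStatic(const std::vector<StaticInst>& profile, std::mt19937_64& rng) {
    const uint64_t total = totalCount(profile);
    assert(total > 0);
    std::uniform_int_distribution<uint64_t> dist(0, total - 1);
    return pickStatic(profile, dist(rng));
}
```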

  11. Step 2: WHERE • Reality: • Where the transient fault occurs is a function of the size of the structure on the chip. • Faults can occur in both architectural and microarchitectural state. • Approximation: • Pin only provides architectural state, not microarchitectural state (no uops, for instance) • Either inject faults only into architectural state, or build an approximation for some microarchitectural state
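For the architectural-state-only option, choosing "where" can be as simple as picking one bit uniformly across the registers of interest. The sketch below assumes the eight 32-bit IA-32 general-purpose registers as the fault space; that restriction, and the names `FaultSite`, `decodeFaultSite`, and `randomFaultSite`, are illustrative assumptions rather than the tool's actual selection logic.

```cpp
#include <array>
#include <cassert>
#include <random>
#include <string>

struct FaultSite {
    std::string reg;  // register name
    unsigned bit;     // bit position within the register
};

// Decodes a flat bit index in [0, 256) into (register, bit).
// All registers have equal width, so each is equally likely to be hit.
FaultSite decodeFaultSite(unsigned flatBit) {
    static const std::array<const char*, 8> kRegs = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"};
    assert(flatBit < kRegs.size() * 32);
    return FaultSite{kRegs[flatBit / 32], flatBit % 32};
}

// Draws a fault site uniformly over all 256 register bits.
FaultSite randomFaultSite(std::mt19937& rng) {
    std::uniform_int_distribution<unsigned> dist(0, 8 * 32 - 1);
    return decodeFaultSite(dist(rng));
}
```

Structures of different sizes would need a weighted choice instead, so that a larger structure is proportionally more likely to be selected.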

  12. Step 3: Injecting Fault • Pass context and other relevant information to analysis routine to modify the architectural state • Inject fault • Flush code cache to force immediate reinstrumentation • Force execution at a particular point using the context

  13. Step 4: Cleanup • Cleanup is an optional step and is only necessary for modeling microarchitectural faults, not architectural faults • Example: modeling a fault in the transmission of data to a load op

  14. Step 5: Determining Outcome • Outcomes that can be tracked: • Did the program complete? • Did the program complete and have the correct IO result? • If the program crashed, how many instructions were executed after fault injection before program crashed? • If the program crashed, why did it crash (trapping signals)?
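The questions above suggest a small outcome classifier. The following is a sketch with hypothetical names (`RunResult`, `Outcome`, `classify`); the three categories borrow the labels "Fault Masked", "Program Failure", and "IO error" used elsewhere in the deck, but how the original tool records them is not shown in the slides.

```cpp
#include <cassert>
#include <cstdint>

enum class Outcome { FaultMasked, IOError, ProgramFailure };

struct RunResult {
    bool completed;           // did the program run to completion?
    bool outputMatches;       // does its IO match a fault-free run?
    int signal;               // trapped signal number, 0 if none
    uint64_t postFaultInsts;  // instructions executed after injection
};

Outcome classify(const RunResult& r) {
    // A trapped signal or an incomplete run counts as a failure.
    if (r.signal != 0 || !r.completed) return Outcome::ProgramFailure;
    // Completed, so the only remaining question is IO correctness.
    return r.outputMatches ? Outcome::FaultMasked : Outcome::IOError;
}
```

For a failure, `signal` answers why the program crashed and `postFaultInsts` answers how long it survived after the fault.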

  15. Fault Insertion State Diagram [Flowchart. Phases: Pre-Fault, Fault, Post Fault. Nodes: START; Count By Basic Block; Reached Threshold?; Clear Code Cache; Count Every Instruction; Found Inst?; Insert Fault; Restart Using Context; Cleanup?; Cleanup Fault; Count Insts After Fault; Reached CheckPoint?; Print HB & Update Checkpoint Counter; Reached Max HB?; Detach From Pin & Run to Completion]

  16. Register Fault Pin Tool: RegFault.C (MAIN)
      int main(int argc, char * argv[])
      {
          if (PIN_Init(argc, argv)) { return Usage(); }

          out_file.open(KnobOutputFile.Value().c_str());
          faultInst = KnobFaultInst.Value();   // dynamic instruction at which to inject

          // Per-trace (coarse) and per-instruction (fine) instrumentation
          TRACE_AddInstrumentFunction(Trace, 0);
          INS_AddInstrumentFunction(Instruction, 0);
          PIN_AddFiniFunction(Fini, 0);

          // Trap the signals that indicate a crash after the fault
          PIN_AddSignalInterceptFunction(SIGSEGV, SigFunc, 0);
          PIN_AddSignalInterceptFunction(SIGFPE, SigFunc, 0);
          PIN_AddSignalInterceptFunction(SIGILL, SigFunc, 0);
          PIN_AddSignalInterceptFunction(SIGSYS, SigFunc, 0);

          PIN_StartProgram();
          return 0;
      }

  17. Fault Insertion State Diagram [Flowchart repeated from slide 15]

  18. TRACE Instrumentation
      if (fineGrainCount == false) {
          for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
              // Cheap check per basic block; the "then" call runs only when it returns nonzero
              BBL_InsertIfCall(bbl, IPOINT_BEFORE, (AFUNPTR)FindFineGrainThreshold,
                               IARG_UINT32, BBL_NumIns(bbl), IARG_END);
              BBL_InsertThenCall(bbl, IPOINT_BEFORE, (AFUNPTR)SwitchToFineGrainCounting,
                                 IARG_END);
          }
      }

      TRACE Analysis
      UINT32 FindFineGrainThreshold(UINT32 i)
      {
          curDynInst += i;
          return ( curDynInst >= (faultInst - fineGrainTrigger) );
      }

      VOID SwitchToFineGrainCounting()
      {
          if (fineGrainCount == false) {
              fineGrainCount = true;
              PIN_RemoveInstrumentation();   // flush so code is reinstrumented per instruction
          }
      }
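The two-level counting scheme can be simulated outside Pin to see how few per-instruction calls it needs. This is a standalone sketch: `Located` and `locateFaultInst` are hypothetical names, and execution is modeled as a flat list of basic-block sizes, which ignores how Pin re-instruments the trace that is running when the switch happens.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Located {
    uint64_t coarseCalls;  // per-basic-block analysis calls
    uint64_t fineCalls;    // per-instruction analysis calls
    uint64_t faultAt;      // dynamic count at injection, 0 if never reached
};

Located locateFaultInst(const std::vector<uint32_t>& blockSizes,
                        uint64_t faultInst, uint64_t fineGrainTrigger) {
    // Guard against unsigned underflow when the trigger exceeds faultInst.
    const uint64_t threshold =
        faultInst > fineGrainTrigger ? faultInst - fineGrainTrigger : 0;
    Located r{0, 0, 0};
    uint64_t curDynInst = 0;
    bool fineGrainCount = false;
    for (uint32_t size : blockSizes) {
        if (!fineGrainCount) {
            curDynInst += size;  // count the whole block at once
            ++r.coarseCalls;
            if (curDynInst >= threshold) fineGrainCount = true;
            continue;
        }
        for (uint32_t i = 0; i < size; ++i) {
            ++curDynInst;        // count one instruction at a time
            ++r.fineCalls;
            if (curDynInst >= faultInst) {
                r.faultAt = curDynInst;
                return r;
            }
        }
    }
    return r;
}
```

In this model the trigger distance has to be at least as large as the biggest basic block; otherwise the coarse counter can step past `faultInst` before the switch and the fault lands late.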

  19. Fault Insertion State Diagram [Flowchart repeated from slide 15]

  20. Instruction Instrumentation
      VOID Instruction(INS ins, VOID *v)
      {
          if (fineGrainCount == true) {
              if (faultDone == 0) {
                  // Count every instruction; inject when the target is reached
                  INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)FindFaultInst, IARG_END);
                  INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)InsertFault,
                                     IARG_CONTEXT, IARG_END);
              }
              if (faultDone == 1) {
                  ….

      Instruction Analysis
      INT32 FindFaultInst()
      {
          curDynInst++;
          return ( curDynInst >= faultInst );
      }

  21. Fault Insertion State Diagram [Flowchart repeated from slide 15]

  22. Fault Insertion Analysis Routine
      VOID InsertFault(CONTEXT* _ctxt)
      {
          srand(curDynInst);
          GetFaultyBit(_ctxt, &faultReg, &faultBit);   // choose register and bit

          UINT32 old_val;
          UINT32 new_val;
          old_val = PIN_GetContextReg(_ctxt, faultReg);
          faultMask = (1 << faultBit);
          new_val = old_val ^ faultMask;               // flip the chosen bit
          PIN_SetContextReg(_ctxt, faultReg, new_val);

          PIN_RemoveInstrumentation();                 // force reinstrumentation for post-fault phase
          faultDone = 1;
          PIN_ExecuteAt(_ctxt);                        // resume from the modified context
      }
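The core of the routine is a single-bit XOR. Stripped of the Pin calls, it reduces to the sketch below; `flipBit` is a hypothetical helper name, and the values in the usage note are taken from the OUTPUT slide of this deck.

```cpp
#include <cassert>
#include <cstdint>

// Flips exactly one bit of a 32-bit register value.
uint32_t flipBit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    const uint32_t mask = UINT32_C(1) << bit;  // e.g. bit 24 -> 0x1000000
    return value ^ mask;
}
```

With the values from the sample output, `flipBit(0xbffeca90, 24)` gives `0xbefeca90` and `flipBit(0, 20)` gives `0x100000`. Because XOR is its own inverse, applying the same flip again restores the original value, which is how a cleanup step can undo an injected fault.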

  23. Fault Insertion State Diagram [Flowchart repeated from slide 15]

  24. Post Fault Instruction Instrumentation
      VOID Instruction(INS ins, VOID *v)
      {
          if (fineGrainCount == true) {
              if (faultDone == 0) {
                  ….
              }
              if (faultDone == 1) {
                  // Count each instruction after it completes, on both exits
                  if (INS_HasFallThrough(ins)) {
                      INS_InsertCall(ins, IPOINT_AFTER, (AFUNPTR)PrintHeartbeat, IARG_END);
                  }
                  if (INS_IsBranchOrCall(ins)) {
                      INS_InsertCall(ins, IPOINT_TAKEN_BRANCH, (AFUNPTR)PrintHeartbeat, IARG_END);
                  }
              }
          }
      }

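  25. Post Fault Analysis
      VOID PrintHeartbeat()
      {
          postFaultInsts++;
          // Print a heartbeat at 1, 2, 4, 8, ... instructions after the fault
          if (postFaultInsts & dumpMask) {
              out_file << "H: " << dec << dumpMask << endl;
              out_file.flush();
              dumpMask = dumpMask << 1;
          }
          if (dumpMask > maxHB) {
              PIN_Detach();   // fault survived long enough; run natively to completion
          }
      }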
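The heartbeat logic prints at power-of-two instruction counts, so the log stays short even when the program runs for millions of instructions after the fault. The standalone sketch below replays that logic; `heartbeats` is a hypothetical function that returns the values the tool would print, and the early `break` stands in for `PIN_Detach()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the heartbeat values printed while executing `instsAfterFault`
// instructions, stopping once the mask exceeds `maxHB`.
std::vector<uint64_t> heartbeats(uint64_t instsAfterFault, uint64_t maxHB) {
    std::vector<uint64_t> printed;
    uint64_t postFaultInsts = 0;
    uint64_t dumpMask = 1;
    for (uint64_t i = 0; i < instsAfterFault; ++i) {
        ++postFaultInsts;
        if (postFaultInsts & dumpMask) {  // first count with this bit set
            printed.push_back(dumpMask);
            dumpMask <<= 1;
        }
        if (dumpMask > maxHB) break;      // the tool would detach here
    }
    return printed;
}
```

A run that crashes 38 instructions after the fault yields heartbeats 1, 2, 4, 8, 16, 32, matching the "Program Failure" sample in the deck's output slide. The last heartbeat seen therefore bounds the crash point to within a factor of two.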

  26. OUTPUT
      Fault Masked:
      IP: 8192fcf COUNT: 937440391 REG: esi FBIT: 24 MASK: 1000000 OLD: bffeca90 NEW: befeca90
      H: 1
      H: 2
      H: 4
      H: 8
      . . .
      H: 8388608

      Program Failure:
      IP: 80babc0 COUNT: 92958481 REG: ebp FBIT: 20 MASK: 100000 OLD: 0 NEW: 100000
      H: 1
      H: 2
      H: 4
      H: 8
      H: 16
      H: 32
      Signal: 11 PostFaultInsts: 38

  27. Sample Results

  28. Step 5: Determining Outcome, Extreme Edition • In the InjectFault step (STEP 3) • Fork a process and inject fault into one process (parent process) • Communicate information between processes (mkfifo) • After fault injection, keep track of all writes to memory • At each checkpoint, compare architectural state and stores • What if there’s a control deviation? • For every control operation, compare the next IP between processes • If the control flow deviates, then wait until both routines return from the function where the deviation occurred before checking state.

  29. Step 5: Extreme Edition • Adding this fork and compare feature takes time but it can be done. • What does it buy? • Does the fault propagate? • How far does it propagate? • How many registers, bytes of memory does it impact? • What happens when there is a control deviation? • Is there a higher incidence of program failure or IO error in the presence of a control deviation?
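The "how many registers, bytes of memory" question comes down to diffing the two processes' state at a checkpoint. A minimal sketch, assuming each process keeps eight 32-bit register values and a log of the bytes it has written since the fault; `ArchState`, `StateDiff`, and `compareStates` are hypothetical names, and the mkfifo transport between the processes is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

struct ArchState {
    std::array<uint32_t, 8> regs;        // architectural register values
    std::map<uint32_t, uint8_t> stores;  // address -> last byte written since the fault
};

struct StateDiff {
    std::size_t regs;   // registers that differ
    std::size_t bytes;  // written bytes that differ
};

StateDiff compareStates(const ArchState& parent, const ArchState& child) {
    StateDiff d{0, 0};
    for (std::size_t i = 0; i < parent.regs.size(); ++i) {
        if (parent.regs[i] != child.regs[i]) ++d.regs;
    }
    // Bytes written by the parent that the child wrote differently or not at all.
    for (const auto& kv : parent.stores) {
        auto it = child.stores.find(kv.first);
        if (it == child.stores.end() || it->second != kv.second) ++d.bytes;
    }
    // Bytes written only by the child.
    for (const auto& kv : child.stores) {
        if (parent.stores.find(kv.first) == parent.stores.end()) ++d.bytes;
    }
    return d;
}
```

A diff of zero registers and zero bytes at a checkpoint means the fault has not propagated into visible state; growing counts across checkpoints show how far it has spread.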

  30. Pin Based Fault Checker [Flowchart: the Fault Insertion State Diagram from slide 15, annotated "No Change"]

  31. Fault Checker: Fault Insertion [Flowchart. Legend: Parent, Child, Both. Nodes: Fault Insertion; Fork Process & Setup Communication Links; Parent Process?; Insert Fault; Restart Using Context; Parent Process?; Cleanup Required?; Cleanup Fault; Post Fault]

  32. Fault Checker: Post Fault [Flowchart. Legend: Parent, Child, Both. Nodes: Post Fault; Get Next Inst & Count Insts; Store OP?; Old Data != New Data?; Save Data; Ctrl OP?; Parent IP != Child IP?; Ctrl Deviation; CheckPoint?; Checkpoint Comparison; Parent?; Read Info From Child & Compare State; Parent State == Child State; Send Continue Signal to Child; Send Done Signal to Child & Detach; Communicate Reg & Store Data to Parent; Done Or Cont?; Done?; Detach & Exit]

  33. Fault Checker: Ctrl Deviation [Flowchart. Legend: Parent, Child, Both. Nodes: Ctrl Deviation; Call Counter = 0; Get Next Inst; Store OP?; Old Data != New Data?; Save Data; Function Call (Call Counter ++); Function Return? (Call Counter --); Call Counter < 0?; Checkpoint Comparison]
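The call counter in this flowchart tracks nesting depth relative to the function where control deviated: calls increment it, returns decrement it, and a negative value means that function has returned. A standalone sketch with hypothetical names (`Op`, `instsUntilRejoin`), modeling the instruction stream as a list of operation kinds:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Call, Return, Other };

// Returns how many instructions execute before the function in which
// the deviation occurred returns, or -1 if it never does in this trace.
long instsUntilRejoin(const std::vector<Op>& trace) {
    long callCounter = 0;  // depth relative to the deviating function
    for (std::size_t i = 0; i < trace.size(); ++i) {
        if (trace[i] == Op::Call) {
            ++callCounter;
        } else if (trace[i] == Op::Return) {
            --callCounter;
        }
        if (callCounter < 0) {
            return static_cast<long>(i + 1);  // safe to compare state again
        }
    }
    return -1;
}
```

Returns from nested calls bring the counter back to zero without triggering the comparison; only the return that leaves the deviating function itself drives it below zero.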

  34. Fault Checker: Additional Info • Cannot check faults beyond a system call • Kill child process and detach parent process from Pin • Run parent/faulty process to completion • Although not shown in flow chart, the Pin tool detaches after reaching a max number of check points • Providing tighter bounds on ctrl deviation: • May take a long time before returning from function call • On a control deviation • For both parent and child processes, save each store address and data • For the parent process, tag the store with the number of instructions executed since control deviation occurred. • After control merges and if architectural state is the same between the two processes, walk the list of stores from oldest to youngest and determine where the two processes matched.

  35. Conclusion • Fault insertion using Pin is a great way to determine the impacts faults have within an application • Easy to use • Enables full program analysis • Accurately describes fault behavior once it has reached architectural state
