Day 3: Using Process Virtualization Systems

Day 3: Using Process Virtualization Systems Kim Hazelwood ACACES Summer School July 2009

Research Applications Computer Architecture • Trace Generation • Fault Tolerance Studies • Emulating New Instructions Program Analysis • Code coverage • Call-graph generation • Memory-leak detection • Instruction profiling Multicore • Thread analysis • Thread profiling • Race detection • Cache simulations Compilers • Compare programs from competing compilers Security • Add security checks and features

Specific Applications • Sample tools in the Pin distribution: • Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler • Some tools developed and used inside Intel: • Opcodemix (analyze code generated by compilers) • PinPoints (find representative regions in programs to simulate) • A tool for detecting memory bugs • Companies are writing their own Pintools • Universities use Pin in teaching and research

Compiler Bug Detection • Opcodemix uncovered a compiler bug for crafty

Thread Checker Basics • Detect common parallel programming bugs: • Data races, deadlocks, thread stalls, threading API usage violations • Instrumentation used: • Memory operations • Synchronization operations (via function replacement) • Call stack • Pin-based prototype • Runs on Linux, x86 and x86_64 • A Pintool ~2500 C++ lines

Thread Checker Results Potential errors in SPECOMP01 reported by Thread Checker (4 threads were used)

a documented data race in the artbenchmark is detected

Instrumentation-Driven Simulation • Fast exploratory studies • Instrumentation ~= native execution • Simulation speeds at MIPS • Characterize complex applications • E.g. Oracle, Java, parallel data-mining apps • Simple to build instrumentation tools • Tools can feed simulation models in real time • Tools can gather instruction traces for later use

Performance Models • Branch Predictor Models: • PC of conditional instructions • Direction Predictor: Taken/not-taken information • Target Predictor: PC of target instruction if taken • Cache Models: • Thread ID (if multi-threaded workload) • Memory address • Size of memory operation • Type of memory operation (Read/Write) • Simple Timing Models: • Latency information

Branch Predictor Model API data Branch instr info Pin BPSim Pin Tool BP Model API() Instrumentation Tool Instrumentation Routines Analysis Routines • BPSim Pin Tool • Instruments all branches • Uses API to set up call backs to analysis routines • Branch Predictor Model: • Detailed branch predictor simulator

BP Implementation BranchPredictor myBPU; VOID ProcessBranch(ADDRINT PC, ADDRINT targetPC, bool BrTaken) { BP_Info pred = myBPU.GetPrediction( PC ); if( pred.Taken != BrTaken ) { // Direction Mispredicted } if( pred.predTarget != targetPC ) { // Target Mispredicted } myBPU.Update( PC, BrTaken, targetPC); } VOID Instruction(INS ins, VOID *v) { if( INS_IsDirectBranchOrCall(ins) || INS_HasFallThrough(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) ProcessBranch, ADDRINT, INS_Address(ins), IARG_UINT32, INS_DirectBranchOrCallTargetAddress(ins), IARG_BRANCH_TAKEN, IARG_END); } int main() { PIN_Init(); INS_AddInstrumentationFunction(Instruction, 0); PIN_StartProgram(); } ANALYSIS INSTRUMENT MAIN

Performance Model Inputs • Branch Predictor Models: • PC of conditional instructions • Direction Predictor: Taken/not-taken information • Target Predictor: PC of target instruction if taken • Cache Models: • Thread ID (if multi-threaded workload) • Memory address • Size of memory operation • Type of memory operation (Read/Write) • Simple Timing Models: • Latency information

Cache Simulators API data Mem Addr info Pin Cache Pin Tool Cache Model API() Instrumentation Tool Instrumentation Routines Analysis Routines • Cache Pin Tool • Instruments all instructions that reference memory • Use API to set up call backs to analysis routines • Cache Model: • Detailed cache simulator

Cache Implementation CACHE_t CacheHierarchy[MAX_NUM_THREADS][MAX_NUM_LEVELS]; VOID MemRef(int tid, ADDRINT addrStart, int size, int type) { for(addr=addrStart; addr<(addrStart+size); addr+=LINE_SIZE) LookupHierarchy( tid, FIRST_LEVEL_CACHE, addr, type); } VOID LookupHierarchy(int tid, int level, ADDRINT addr, int accessType){ result = cacheHier[tid][cacheLevel]->Lookup(addr, accessType ); if( result == CACHE_MISS ) { if( level == LAST_LEVEL_CACHE ) return; LookupHierarchy(tid, level+1, addr, accessType); } } VOID Instruction(INS ins, VOID *v) { if( INS_IsMemoryRead(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) MemRef, IARG_THREAD_ID, IARG_MEMORYREAD_EA, IARG_MEMORYREAD_SIZE, IARG_UINT32, ACCESS_TYPE_LOAD, IARG_END); if( INS_IsMemoryWrite(ins) ) INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) MemRef, IARG_THREAD_ID, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE, IARG_UINT32, ACCESS_TYPE_STORE, IARG_END); } int main() { PIN_Init(); INS_AddInstrumentationFunction(Instruction, 0); PIN_StartProgram(); } ANALYSIS INSTRUMENT MAIN

Moving from 32-bit to 64-bit Applications How to identify the reasons for these performance results? • Profiling! Ye06, IISWC2006

Main Observations • In 64-bit mode: • Code size increases (10%) • Dynamic instruction count decreases • Code density increases • L1 icache request rate increases • L1 dcache request rate decreases significantly • Data cache miss rate increases

Changing Program Code (Probe-mode) • PIN_ReplaceProbed (RTN, AFUNPTR) • Redirect control flow to new functions in the Pin Tool • PIN_ReplaceSignatureProbed (RTN, AFUNPTR, …) • (1) Redirect control flow (2) Rewrite function prototypes(3) Use Pin arguments (IARG’s) foo() Original Stream foo foo’ (Original) (Replacement)

Replacing malloc() in an Application • typedef VOID * (*FUNCPTR_MALLOC)(size_t); • VOID * MyMalloc(FUNCPTR_MALLOC orgMalloc, UINT32 size, ADDRINT returnIp) { • FUNCPTR_MALLOC poolMalloc = LookupMallocPool(returnIp, size); • return (poolMalloc) ? poolMalloc(size) : orgMalloc(size); • } • VOID ImageLoad(IMG img, VOID *v) { • RTN mallocRTN = RTN_FindByName(img, "malloc"); • if (RTN_Valid(rtn)) { • PROTO prototype = PROTO_Allocate(PIN_PARG(void *), CALLINGSTD_CDECL, • "malloc", PIN_PARG(int), PIN_PARG_END()); • RTN_ReplaceSignatureProbed(mallocRTN, (AFUNPTR) MyMalloc, • IARG_PROTOTYPE, prototype, /* Function prototype */ • IARG_ORIG_FUNCPTR, /* Handle to application’s malloc */ • IARG_FUNCARG_ENTRYPOINT_VALUE, 0, /* First argument to malloc */ • IARG_RETURN_IP, /* IP of caller */ • IARG_END); PROTO_Free( proto_malloc ); • } • }

Reducing Overhead in Your Tools Total Overhead = Pin Overhead + Pintool Overhead ~5% for SPECfp and ~50% for SPECint Pin team’s job is to minimize this Usually much larger than pin overhead Pintool writers can help minimize this!

Adding User Instrumentation

Reducing the Pintool’s Overhead Pintool’s Overhead Analysis Routines Overhead Instrumentation Routines Overhead + Work required in the Analysis Routine Frequency of calling an Analysis Routine x Work required for transiting to Analysis Routine Work done inside Analysis Routine

Reducing Work in Analysis Routines • Key: Shift computation from analysis routines to instrumentation routines whenever possible • This usually has the largest speedup

Edge Counting: Slower Version • ... • void docount2(ADDRINT src, ADDRINT dst, INT32 taken) • { • COUNTER *pedg = Lookup(src, dst); • pedg->count += taken; • } • void Instruction(INS ins, void *v) { • if (INS_IsBranchOrCall(ins)) • { • INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount2, • IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, • IARG_BRANCH_TAKEN, IARG_END); • } • } • ...

Edge Counting: Faster Version • void docount(COUNTER* pedge, INT32 taken) { • pedg->count += taken; • } • void docount2(ADDRINT src, ADDRINT dst, INT32 taken) { • COUNTER *pedg = Lookup(src, dst); • pedg->count += taken; • } • void Instruction(INS ins, void *v) { • if (INS_IsDirectBranchOrCall(ins)) { • COUNTER *pedg = Lookup(INS_Address(ins), • INS_DirectBranchOrCallTargetAddress(ins)); • INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount, • IARG_ADDRINT, pedg, IARG_BRANCH_TAKEN, IARG_END); • } else • INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR) docount2, • IARG_INST_PTR, IARG_BRANCH_TARGET_ADDR, • IARG_BRANCH_TAKEN, IARG_END); • } • …

Analysis Routines: Reduce Call Frequency • Key: Instrument at the largest granularity whenever possible • Instead of inserting one call per instruction • Insert one call per basic block or trace

counter++; counter++; counter++; counter++; counter++; Slower Instruction Counting Recall our standard example: sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax

Faster Instruction Counting Counting at BBL level Counting at Trace level counter += 3 sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax counter += 5 sub $0xff, %edx cmp %esi, %edx jle <L1> mov $0x1, %edi add $0x10, %eax counter += 2 counter+=3 L1

Reducing Work for Analysis Transitions • Reduce number of arguments to analysis routines • Inline analysis routines • Pass arguments in registers • Instrumentation scheduling

Reduce Number of Arguments • Eliminate arguments only used for debugging • Instead of passing TRUE/FALSE, create 2 analysis functions • Instead of inserting a call to: Analysis(BOOL val) • Insert a call to one of these: AnalysisTrue() AnalysisFalse() • IARG_CONTEXT is very expensive (> 10 arguments)

Inlining Not-inlinable Inlinable int docount1(int i) { if (i == 1000) x[i]++; return x[i]; } int docount0(int i) { x[i]++ return x[i]; } Not-inlinable Not-inlinable int docount2(int i) { x[i]++; printf(“%d”, i); return x[i]; } void docount3() { for(i=0;i<100;i++) x[i]++; } Pin will inline analysis functions into application code

Inlining • Inlining decisions are recorded in pin.log with log_inline • pin –xyzzy –mesgon log_inline –t mytool – app • Analysis function at 0x2a9651854c CAN be inlined • Analysis function at 0x2a9651858a is not inlinable because the last instruction of the first bbl fetched is not a ret instruction. The first bbl fetched: • ============================================================================ • bbl[5:UNKN]: [p: ? ,n: ? ] [____] rtn[ ? ] • ---------------------------------------------------------------------------- • 31 0x000000000 0x0000002a9651858a push rbp • 32 0x000000000 0x0000002a9651858b mov rbp, rsp • 33 0x000000000 0x0000002a9651858e mov rax, qword ptr [rip+0x3ce2b3] • 34 0x000000000 0x0000002a96518595 inc dword ptr [rax] • 35 0x000000000 0x0000002a96518597 mov rax, qword ptr [rip+0x3ce2aa] • 36 0x000000000 0x0000002a9651859e cmp dword ptr [rax], 0xf4240 • 37 0x000000000 0x0000002a965185a4 jnz 0x11

Passing Arguments in Registers • 32 bit platforms pass arguments on stack • Passing arguments in registers helps small inlined functions • VOID PIN_FAST_ANALYSIS_CALL docount(ADDRINT c) { icount += c; } • BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount), IARG_FAST_ANALYSIS_CALL, IARG_UINT32, BBL_NumIns(bbl), IARG_END);

Conditional Inlining • Inline a common scenario where the analysis routine has a single “if-then” • The “If” part is always executed • The “then” part is rarely executed • Useful cases: • “If” can be inlined, “Then” is not • “If” has small number of arguments, “then” has many arguments (or IARG_CONTEXT) • Pintool writer breaks analysis routine into two: • INS_InsertIfCall(ins, …, (AFUNPTR)doif, …) • INS_InsertThenCall(ins, …, (AFUNPTR)dothen, …)

IP-Sampling (a Slower Version) const INT32 N = 10000; const INT32 M = 5000; INT32 icount = N; VOID IpSample(VOID* ip) { --icount; if (icount == 0) { fprintf(trace, “%p\n”, ip); icount = N + rand()%M; //icount is between <N, N+M> } } VOID Instruction(INS ins, VOID *v) { INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)IpSample, IARG_INST_PTR, IARG_END); }

IP-Sampling (a Faster Version) INT32 CountDown() { --icount; return (icount==0); } VOID PrintIp(VOID *ip) { fprintf(trace, “%p\n”, ip); icount = N + rand()%M; //icount is between <N, N+M> } inlined not inlined VOID Instruction(INS ins, VOID *v) { // CountDown() is always called before an inst is executed INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)CountDown, IARG_END); // PrintIp() is called only if the last call to CountDown() // returns a non-zero value INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)PrintIp, IARG_INST_PTR, IARG_END); }

Instrumentation Scheduling • If an instrumentation can be inserted anywhere in a basic block: • Let Pin know via IPOINT_ANYWHERE • Pin will find the best point to insert the instrumentation to minimize register spilling

ManualExamples/inscount1.cpp • #include <stdio.h> • #include "pin.H“ • UINT64 icount = 0; • void docount(INT32 c) { icount += c; } • void Trace(TRACE trace, void *v) { • for (BBL bbl = TRACE_BblHead(trace); • BBL_Valid(bbl); bbl = BBL_Next(bbl)) { • BBL_InsertCall(bbl,IPOINT_ANYWHERE,(AFUNPTR)docount, • IARG_UINT32, BBL_NumIns(bbl), IARG_END); • } • } • void Fini(INT32 code, void *v) { • fprintf(stderr, "Count %lld\n", icount); • } • int main(int argc, char * argv[]) { • PIN_Init(argc, argv); • TRACE_AddInstrumentFunction(Trace, 0); • PIN_AddFiniFunction(Fini, 0); • PIN_StartProgram(); • return 0; • } analysis routine instrumentation routine

Optimizing Your Tools - Summary • Baseline system has fairly low overhead (~5-20%) • Adding instrumentation can increase overhead significantly, but you can help! • Move work from analysis to instrumentation routines • Explore larger granularity instrumentation • Explore conditional instrumentation • Understand when inlining will occur

What Did We Learn Today? • How use of process VMs for: • Program Analysis • Bug detection, thread analysis • Computer architecture • Branch predictors, cache simulators, timing models, architecture width • Moving from 32-bit to 64-bit • How to use the tools “well” • Simple tricks for good performance

Want More Info? • Talk to your friends! (Lots of people writing tools with Pin, DynamoRIO, Valgrind) • Each tool has a mailing list • Pin www.pintool.org • DynamoRIO code.google.com/p/dynamorio • Valgrind www.valgrind.org Day 1 – What is Process Virtualization? Day 2 – Building Process Virtualization Systems Day 3 – Using Process Virtualization Systems Day 4 – Symbiotic Optimization

Day 3: Using Process Virtualization Systems

Day 3: Using Process Virtualization Systems

Presentation Transcript

Using Virtualization in the Classroom

Distributed (Operating) Systems - Virtualization - -Server Design Issues- - Process and Code Migration -

Course: Government Process Re-engineering Day 3

Using Virtualization to Improve Testing

Course: Government Process Re-engineering Day 3

INTELLIGENT CONTROL SYSTEMS Day 1/3

Live Updating Operating Systems Using Virtualization

Embedded Systems Day-3

Addressing Threats Against DoD Networked Systems Using Virtualization

Virtualization in Mobile Systems

Day 2: Building Process Virtualization Systems

Day 3: NWS Warning Process

Kernel Development using Virtualization

Hanscom Virtualization Day CITS PMO

Tuning our Agile Process Using Systems Engineering

Networking Virtualization Using FPGAs

COMP25212: Virtualization 3

Unit 3 Virtualization