Profiling, Instrumentation, and Profile Based Optimization
Robert Cohn (Robert.Cohn@compaq.com)
Mark T. Vandevoorde

Introduction
Understanding the dynamic interaction between programs and processors
• What do programs do?
• How do processors perform?
• How can we make it faster?
Profiling Tutorial

What to do?
Build tools!
• Profiling
• Instrumentation
• Profile based optimization

The Big Picture
[Figure: diagram with the labels Sampling, Instrumentation, Profiling, Profile Based Optimization, Analysis, Modeling]

Instrumentation
• User level view
• Executable editing

Code Instrumentation
[Figure: a TOOL box feeding into the application]
Trojan Horse
• Application appears unchanged
• Data collected as a side effect of execution

Instrumentation Example
    if (b > c) t = 1; else b = 3;
• Add extra code (instrumentation)
    if (b > c) { bb[0]++; t = 1; } else { bb[1]++; b = 3; }

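The counting idea on this slide can be tried out on an ordinary host. The following is a minimal sketch, not ATOM output: the function names `instrumented` and `run_demo` and the chosen inputs are illustrative assumptions.

```c
#include <stdio.h>

/* One counter per basic block, as on the slide. */
long bb[2];

/* The slide's fragment with a counter bump at the top of each block.
 * The values left in *t and *b are the same as in the original code. */
void instrumented(int *b, int c, int *t)
{
    if (*b > c) {
        bb[0]++;          /* inserted: count the "then" block */
        *t = 1;
    } else {
        bb[1]++;          /* inserted: count the "else" block */
        *b = 3;
    }
}

/* Hypothetical driver: run the fragment for b = 0..9 with c fixed at 6
 * and return how often the "then" block executed. */
long run_demo(void)
{
    int t = 0;
    bb[0] = bb[1] = 0;
    for (int i = 0; i < 10; i++) {
        int b = i;
        instrumented(&b, 6, &t);
    }
    printf("then: %ld, else: %ld\n", bb[0], bb[1]);
    return bb[0];
}
```

The counts are collected purely as a side effect; the caller sees the same `t` and `b` as it would without the inserted increments.
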
Instrumentation Uses
• Profiles
• Model new hardware
  • What will this new branch predictor do?
  • What is the miss rate of this new cache?
• Optimization opportunities
  • find unnecessary loads and stores
  • find divides by 1

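As an illustration of "model new hardware", here is a sketch of an analysis routine that models a table of 2-bit saturating counters. The predictor design, the table size, and the routine names (`Branch`, `Branches`, `Mispredicts`) are assumptions made for this example; the slides do not specify a predictor.

```c
#include <stdio.h>

#define TABLE_BITS 10                      /* assumed: 1024-entry table */
#define TABLE_SIZE (1 << TABLE_BITS)

static unsigned char counter[TABLE_SIZE];  /* 2-bit saturating counters, 0..3 */
static long branches, mispredicts;

/* Hypothetical analysis routine: called before every conditional branch
 * with the branch PC and whether the branch is taken (1) or not (0). */
void Branch(unsigned long pc, int taken)
{
    /* Index by the PC, dropping the low 2 bits (4-byte instructions). */
    unsigned idx = (unsigned)(pc >> 2) & (TABLE_SIZE - 1);
    int predict_taken = counter[idx] >= 2;

    if (predict_taken != (taken != 0))
        mispredicts++;

    /* Move the counter toward the actual outcome, saturating at 0 and 3. */
    if (taken) {
        if (counter[idx] < 3) counter[idx]++;
    } else {
        if (counter[idx] > 0) counter[idx]--;
    }
    branches++;
}

long Branches(void)    { return branches; }
long Mispredicts(void) { return mispredicts; }
```

An instrumentation routine would insert a call to `Branch` before each conditional branch, passing the branch PC as a constant and the taken predicate as a runtime value.
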
What Tool Does Instrumentation?
• Compiler
  • Compiler inserts extra operations
  • Requires recompile, access to source code
• Executable editor
  • Post-link tool inserts instrumentation code
  • No rebuild, source code not required
  • More difficult to relate back to source

Instrumentation Tools for Alpha
• All executable based
• General instrumentation:
  • Atom on Digital Unix
    • Distributed with Digital Unix
  • Ntatom on Windows NT
    • New! Download from web
• Specialized tools based on above
  • hiprof, pixie, 3rd, ...

ATOM
• Tool for customized instrumentation
• User writes program that describes how to instrument application
• Instrumentation program applied to application, generates instrumented application
• Instrumented application is run
• Data is collected

User Supplies
• Instrumentation routines: user written program that inserts instrumentation
  • calls to analysis routines
• Analysis routines: do the instrumentation work at runtime (e.g. count a basic block)

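Atom Programming Model
[Figure: nested hierarchy that the instrumentation program iterates over]
• Objects: spice, libc.so, libm.so
• Procedures: main(), Compute(), _exit()
• Basic blocks: block1, block2, block3, block4, block5
• Instructions:
    ldq r1, 8(sp)
    addq r1, 0x1, r2
    stq r2, 8(sp)
    bne r1, 0x1ffc40
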
ATOM Instrumentation API: Navigation
• Objects (binary, shared library)
  • GetFirstObj, GetNextObj
• Procedures
  • GetFirstProc, GetNextProc
• Basic blocks
  • GetFirstBlock, GetNextBlock
• Instructions
  • GetFirstInst, GetNextInst

ATOM Instrumentation API: Interrogation
• GetObjInfo, GetProcInfo, GetBlockInfo, GetInstInfo
• IsBranchTarget
• GetInstRegUsage
• InstPC
• InstLineNo
• ...

ATOM Instrumentation API: Definition
• AddCallProto
  • tells atom the types of the arguments for calls to analysis routines

ATOM Instrumentation API: Instrumentation
• AddCallProgram, AddCallObj, AddCallProc, AddCallBlock, AddCallInst, ReplaceProcedure
• Insert before or after

Arguments to analysis routines
• Constants
  • variables in instrumentation program, but constant at instrumentation point
  • e.g. uninstrumented PC, function name
• VALUE
  • computed at runtime
  • effective address, branch taken predicate
• Register
  • r3, arguments, return value

Sample #1: Cache Simulator
Write a tool that computes the miss rate of the application running in a 64KB, direct mapped data cache with 32 byte lines.
    > atom spice cache.inst.o cache.anal.o -o spice.cache
    > spice.cache < ref.in > ref.out
    > more cache.out
    5,387,822,402 620,855,884 11.523%

Cache Tool Implementation
    Application                 Instrumentation
    main:  clr  t0
    loop:  ldl  t2,0(a0)        Reference(0(a0));    (VALUE)
           addl t0,4,t0
           addl t2,0x10,t2
           stl  t2,0(a0)        Reference(0(a0));
           bne  t3,loop
           ret                  PrintResults();

Cache Analysis File
    #include <stdio.h>

    #define CACHE_SIZE  65536   /* 64KB cache */
    #define BLOCK_SHIFT 5       /* 32 byte lines */

    long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

    /* Called before every load and store with its effective address. */
    void Reference(long address) {
        /* The mask must be parenthesized: >> binds tighter than &. */
        int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
        long tag = address >> BLOCK_SHIFT;
        if (cache[index] != tag) {
            misses++;
            cache[index] = tag;
        }
        refs++;
    }

    /* Called once when the program exits. */
    void Print() {
        FILE *file = fopen("cache.out","w");
        fprintf(file,"%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
        fclose(file);
    }

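The cache model can be sanity-checked without ATOM by feeding it a synthetic address stream. This is a sketch under assumptions: the driver `SweepMisses` and the use of -1 as an "invalid" tag are additions for the example, not part of the slide's analysis file.

```c
#include <stdio.h>

#define CACHE_SIZE  65536                        /* 64KB */
#define BLOCK_SHIFT 5                            /* 32 byte lines */
#define NUM_LINES   (CACHE_SIZE >> BLOCK_SHIFT)  /* 2048 lines */

static long cache[NUM_LINES];
static long refs, misses;

/* Direct mapped lookup: bits 5..15 select the line, the block number is the tag. */
void Reference(long address)
{
    int index = (int)((address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT);
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) {
        misses++;
        cache[index] = tag;
    }
    refs++;
}

/* Hypothetical driver: read `bytes` bytes starting at `base` in 4-byte
 * steps, `passes` times over, and return the number of misses. */
long SweepMisses(long base, long bytes, int passes)
{
    refs = misses = 0;
    for (int i = 0; i < NUM_LINES; i++)
        cache[i] = -1;                           /* assumed "invalid" marker */
    for (int p = 0; p < passes; p++)
        for (long a = base; a < base + bytes; a += 4)
            Reference(a);
    return misses;
}
```

A 32KB array fits in the cache, so only the first pass misses (one miss per 32-byte line). A 128KB array is twice the cache size, so in a direct mapped cache every line is evicted before it is reused and every pass misses on every line.
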
Cache Instrumentation File
    #include <stdio.h>
    #include <cmplrs/atom.inst.h>

    unsigned Instrument(int argc, char **argv, Obj *o) {
        Inst *i; Block *b; Proc *p;

        /* Declare the analysis routines and their argument types. */
        AddCallProto("Reference(VALUE)");
        AddCallProto("Print()");

        /* Write the results when the program exits. */
        AddCallProgram(ProgramAfter,"Print");

        /* Call Reference with the effective address before every load and store. */
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i,InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        return 0;
    }

Sample #2: Profiler
Write a tool that outputs the address of each basic block and the number of times it is executed.
    vssad-27> atom a.out prof.inst.c prof.anal.c
    vssad-28> a.out.atom
    Hello world
    vssad-29> head prof.out
    120001030 1
    120001038 1
    12000103c 1
    120001058 33
    120001064 1

Profiler Tool Implementation
    Application                 Instrumentation
                                Init(3);             (Constant)
    main:  clr  t0              Count(0);
    loop:  ldl  t2,0(a0)        Count(1);
           addl t0,4,t0
           addl t2,0x10,t2
           stl  t2,0(a0)
           bne  t3,loop
           ret                  Count(2);
                                PrintResults(addresses,3);

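Profiler: prof.anal.c
    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <string.h>   /* memset */

    long *counts;

    /* Called once at program start: allocate one zeroed counter per block. */
    void Init(int nblocks) {
        counts = (long *)malloc(nblocks * sizeof(long));
        memset(counts, 0, nblocks * sizeof(long));
    }

    /* Called at the top of each basic block. */
    void Count(int index) {
        counts[index]++;
    }

    /* Called once at program exit: write "address count" for each block. */
    void Print(long *blocks, int nblocks) {
        int i;
        FILE *file = fopen("prof.out","w");
        for (i = 0; i < nblocks; i++)
            fprintf(file,"%lx %ld\n", blocks[i], counts[i]);
        fclose(file);
    }
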
Profiler: prof.inst.c
    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <cmplrs/atom.inst.h>

    void CallInitPrint(long *addresses, int nblocks);

    void Instrument(int argc, char **argv, Obj *o) {
        Block *b; Proc *p; int index = 0;
        int nblocks = GetObjInfo(o,ObjNumberBlocks);
        long *addresses = (long *)malloc(nblocks * sizeof(long));

        CallInitPrint(addresses,nblocks);

        /* Record each block's address and count it at its first instruction. */
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b)) {
                addresses[index] = InstPC(GetFirstInst(b));
                AddCallInst(GetFirstInst(b), InstBefore, "Count", index++);
            }
    }

Profiler: prof.inst.c (continued)
    void CallInitPrint(long *addresses, int nblocks) {
        char buffer[100];

        AddCallProto("Count(int)");
        AddCallProto("Init(int)");
        AddCallProgram(ProgramBefore,"Init",nblocks);

        /* The array length in the prototype comes from nblocks. */
        sprintf(buffer,"Print(const stable int[%d],int)",nblocks);
        AddCallProto(buffer);
        AddCallProgram(ProgramAfter,"Print",addresses,nblocks);
    }

Executable editors
• Input: executable, output: executable
• Instrument, optimize, translate
• Executable = image = binary = shared library = shared object = dynamically linked library (DLL)
• Executable editor, executable optimizer, binary rewriter, binary translator, post link optimizer

Executable Editing
• Insert/delete/reorder instructions and data
• Obstacle to modification
  • Addresses are bound
  • Registers are bound

Obstacles
    if (a) a = b;
        beq r1,+2
        lda a0,0x1000       <- inserted
        bsr Reference       <- inserted
        ldl r1,0x1000
• Is a0 free?
• Adjust branch offsets
• Adjust literal addresses

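The "adjust branch offsets" step can be sketched as follows. This is an illustrative model, not the tool's implementation: instructions are identified by index, displacements are counted in instructions relative to the instruction following the branch (the Alpha convention; the slide's "+2" notation may count differently), and the names `layout` and `adjust_branch` are hypothetical. Branch targets are assumed to lie inside the instruction array.

```c
#include <stddef.h>

/* Compute the new layout after inserting instrumentation.
 * inserted_before[i]: number of instructions inserted before original
 *                     instruction i.
 * new_start[i]:       new index of the first instruction inserted before i
 *                     (equals new_pos[i] when nothing was inserted).
 * new_pos[i]:         new index of original instruction i itself. */
void layout(const int *inserted_before, size_t n, int *new_start, int *new_pos)
{
    int next = 0;
    for (size_t i = 0; i < n; i++) {
        new_start[i] = next;
        new_pos[i] = next + inserted_before[i];
        next = new_pos[i] + 1;
    }
}

/* Recompute a branch displacement. The branch is retargeted to the start
 * of any code inserted before its target, so the instrumentation runs
 * whenever the target instruction is reached via the branch. */
int adjust_branch(int branch, int disp, const int *new_start, const int *new_pos)
{
    int target = branch + 1 + disp;        /* original target index */
    return new_start[target] - (new_pos[branch] + 1);
}
```

For the slide's example (a branch that skips one load, with two instructions inserted before the load), a displacement of 1 becomes 3.
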
Phases
1. Decompose
2. Build IR
3. Insert instrumentation
4. Convert IR to executable

1. Decompose Executable
[Figure: layout of an executable]
• Header
• Program code & data: Text (code), Data, Rdata
• Meta data: Exception Info, Relocations, Debug

Decompose
• Break executable into units
  • unit: minimum data that must be kept together
  • code: unit is instruction
  • data: unit is data section
    • alternative: unit is data item

2. Build Internal Representation
[Figure: the executable as lists]
• Instruction list: add, load, beq
• Data sections: Data, Sdata
• MetaData: Exception Info, Relocations

Intermediate Representation
• Similar to compiler
  • except unstructured, untyped data
• 1 to 1 mapping for IR and machine instructions
• Base representation should be compact
  • fit in physical memory
  • initial/final phases do multiple passes
• Representations built/thrown away for procedures

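One compact way to hold a 1 to 1 instruction IR that supports insertion is a doubly linked list of raw instruction words. The sketch below is an assumption about a possible layout, not the representation used by any particular tool; `IrInst`, `IrList`, and the `ir_*` functions are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>

/* One IR node per machine instruction (the 1 to 1 mapping). */
typedef struct IrInst {
    uint32_t word;                 /* raw 32-bit machine instruction */
    struct IrInst *prev, *next;
} IrInst;

typedef struct {
    IrInst *head, *tail;
    size_t count;
} IrList;

/* Append an instruction at the end of the list (used while decomposing). */
IrInst *ir_append(IrList *l, uint32_t word)
{
    IrInst *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->word = word;
    n->next = NULL;
    n->prev = l->tail;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
    l->count++;
    return n;
}

/* Insert an instruction before `at` (used when adding instrumentation). */
IrInst *ir_insert_before(IrList *l, IrInst *at, uint32_t word)
{
    IrInst *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->word = word;
    n->next = at;
    n->prev = at->prev;
    if (at->prev) at->prev->next = n; else l->head = n;
    at->prev = n;
    l->count++;
    return n;
}

/* Release all nodes and reset the list. */
void ir_free(IrList *l)
{
    IrInst *n = l->head;
    while (n) {
        IrInst *next = n->next;
        free(n);
        n = next;
    }
    l->head = l->tail = NULL;
    l->count = 0;
}
```

Because references between instructions are pointers to nodes rather than addresses, inserting a node does not invalidate them; final addresses are assigned only when the IR is converted back to an executable.
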
Bound addresses
• Data:
    1
    2
    0x12345678
    3
• Code:
    br +4
    ldah r0,0x1234
    lda r0,0x5678(r0)
• Metadata:
    Begin: 0x12345678
    End: 0x12345680

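The ldah/lda pair above builds a 32-bit address from two 16-bit literals. A sketch of how a tool might decode and re-encode such a pair, assuming the Alpha semantics that lda adds a sign-extended 16-bit displacement and ldah adds that displacement times 65536; the function names are hypothetical.

```c
#include <stdint.h>

/* Address produced by "ldah r,hi(zero); lda r,lo(r)". */
int64_t combine_ldah_lda(int16_t hi, int16_t lo)
{
    return (int64_t)hi * 65536 + lo;
}

/* Split an address into the two literals. Because lda sign-extends its
 * displacement, a low half of 0x8000 or more is negative and the high
 * half must be bumped by one to compensate.
 * Assumes the address is within reach of a single ldah/lda pair. */
void split_ldah_lda(int64_t addr, int16_t *hi, int16_t *lo)
{
    int64_t l = addr & 0xffff;
    if (l >= 0x8000)
        l -= 0x10000;                       /* value lda will actually add */
    *lo = (int16_t)l;
    *hi = (int16_t)((addr - l) / 65536);    /* exact: addr - l is a multiple of 65536 */
}
```

For the slide's pair, `combine_ldah_lda(0x1234, 0x5678)` yields 0x12345678. An address such as 0x1234a678 splits into a high literal of 0x1235 and a negative low literal.
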
Adjusting addresses
• No translation
• Dynamic translation
• Static translation

No translation
• Leave code and data at same address
    Original:
        beq r1,L2
        ldl r1,0x1234
    L2:
    Instrumented:
        beq r1,L2
        br L1
    L2: ...
        ...
    L1: lda a0,0x1234
        bsr Reference
        ldl r1,0x1234
        br L2

Dynamic translation
• Address computation is unchanged
• Image has map of old->new address
• Code inserted to map old->new address at runtime for load/store/branch
• Better:
  • Do PC relative branches statically
  • Keep data section at original address
• Still: indirect calls and jumps (not returns)

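The old->new map consulted at runtime could be a sorted table searched by binary search. The following is a sketch under assumptions: the table layout (`MapEntry`, one entry per contiguous range that moved by the same amount) and the function `translate` are illustrative, not a documented format.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per range: addresses from old_addr up to the next entry's
 * old_addr all moved by the same distance. */
typedef struct {
    uint64_t old_addr;
    uint64_t new_addr;
} MapEntry;

/* Translate an original address to its address in the edited image.
 * `map` must be sorted by old_addr. Addresses below the first entry
 * are assumed not to have moved and are returned unchanged. */
uint64_t translate(const MapEntry *map, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;                 /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].old_addr <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return addr;
    /* map[lo-1] is the last entry whose old_addr is <= addr. */
    return map[lo - 1].new_addr + (addr - map[lo - 1].old_addr);
}
```

The inserted runtime code would call something like this before each indirect call or jump, which is why handling PC relative branches statically and leaving data in place reduces the cost.
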
Static translation
• Address computation is altered for new layout
• Find addresses
• Determine what they point to:
  • unit, offset
• Insert instrumentation
• Adjust literals or offsets to compute new address of unit

Other tools that change addresses
• Linker
  • combine separately compiled objects
  • adjust addresses based on assigned load address
  • unit is section of object (data, text)
• Loader
  • Load address != link address for DLL
  • unit is entire image
• Use relocations

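The loader case, where the unit is the entire image, reduces to adding one delta to every relocated address. A minimal sketch, assuming a simplified relocation record that only marks the offset of a 64-bit absolute address; real relocation formats carry a type and more, and `Reloc` and `rebase` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified relocation: the byte offset, within the image, of a
 * 64-bit absolute address that must be adjusted. */
typedef struct {
    size_t offset;
} Reloc;

/* Adjust every relocated address when the image is loaded at
 * load_base instead of the link_base it was linked for. */
void rebase(uint8_t *image, const Reloc *relocs, size_t n,
            uint64_t link_base, uint64_t load_base)
{
    uint64_t delta = load_base - link_base;   /* wraps correctly if negative */
    for (size_t i = 0; i < n; i++) {
        uint64_t value;
        /* memcpy avoids alignment assumptions about the image buffer. */
        memcpy(&value, image + relocs[i].offset, sizeof value);
        value += delta;
        memcpy(image + relocs[i].offset, &value, sizeof value);
    }
}
```

Only words named by a relocation change; plain data such as the 1, 2, and 3 in the "Bound addresses" example are left alone, which is exactly the information an executable editor needs to tell addresses from ordinary integers.
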
Relocations
• Data:
    1
    2
    0x12345678
    3
• Code:
    br +4
    ldl r1,10(gp)
    ldah r0,0x1234
    lda r0,0x5678(r0)
[Figure labels: No relocation required; May require relocation; Requires relocation]
• Relocation example:
    address: 0x200
    type: ldah literal
    object: 0x12345670
    external:

How to recognize addresses?
• Metadata
  • example: procedure begin, procedure end
  • implicit in structure of data
• Absolute addresses
  • example: literal address in data section
  • use relocations
• Relative addresses: address offset
  • example: pc relative branch, offset for base pointer
  • may not need adjustment, usually no relocation

Relative Addresses
• Address computed as offset of another address
• Address and Address + Offset point to same unit: ok, unit moved as a unit
• Example:
    a->field1       ldl r0,field1(a)
    ar[4]           ldl r0,16(ar)

Relative Addresses
• Offset spans multiple units
• example: Jump table:
        ad = base + i
        jmp ad
    base: br l1
        br l2
        br l3
        br l4
• PC relative branch
        br +4
• Must be 1 unit

Map address to unit and offset
Reference -> address
• in code: interpret instructions
    br +4
    ldah r0,0x1234
    lda r0,0x5678(r0)
• in data: data is address
    .data 0x12345678

Map address to unit and offset
(relocation,address) -> (unit,offset)
• to code: pointer to instruction
• to data: data section and offset
  • alternative: data item and offset
• offset = address - unit address

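The (unit, offset) mapping and its use after the layout changes can be sketched as below. The `Unit` and `UnitRef` types and both functions are hypothetical, and a linear scan stands in for whatever lookup a real tool would use.

```c
#include <stdint.h>

/* A unit as it appeared in the original executable. */
typedef struct {
    uint64_t addr;    /* original start address */
    uint64_t size;    /* size in bytes */
} Unit;

/* An address expressed relative to the unit it points into. */
typedef struct {
    int unit;         /* index into the unit table, or -1 if none */
    uint64_t offset;  /* offset = address - unit address */
} UnitRef;

/* Find the unit containing `address` and the offset within it. */
UnitRef find_unit(const Unit *units, int n, uint64_t address)
{
    UnitRef ref = { -1, 0 };
    for (int i = 0; i < n; i++) {
        if (address >= units[i].addr && address - units[i].addr < units[i].size) {
            ref.unit = i;
            ref.offset = address - units[i].addr;
            break;
        }
    }
    return ref;
}

/* After the new layout is known, recompute the address from the
 * unit's new start address (new_addr is indexed like the unit table). */
uint64_t new_address(const uint64_t *new_addr, UnitRef ref)
{
    return new_addr[ref.unit] + ref.offset;
}
```

Storing references as (unit, offset) pairs is what lets instrumentation be inserted freely: the pair stays valid while units move, and literals are regenerated from it at the end.
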
3. Insert Instrumentation
[Figure: the IR after insertion]
• Instruction list: add, load, load, beq
• Data sections: Data, Sdata, Ndata
• MetaData: Exception Info, Relocations

Adding instrumentation code
• Instrumentation requires free registers
  • wrapper routine saves and restores registers
    Application:
        beq r1,+2
        save registers
        lda a0,0x1000
        bsr ra,wrapper
        restore registers
        ldl r1,0(r2)
    wrapper:
        Save registers on stack
        bsr ra,Reference
        Restore registers
        return
    Reference
• Local/global/interprocedural analysis finds free registers

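The local (single basic block) form of the free register analysis is a backward liveness pass. This is a sketch under assumptions: registers are modeled as bits in a 32-bit mask, the bit assignments follow the usual Alpha names (t0 = r1, t2 = r3, a0 = r16), and the set of registers live at the block exit is supplied by the caller rather than computed by a global analysis.

```c
#include <stdint.h>

#define REG_T0 (1u << 1)     /* t0 = r1  */
#define REG_T2 (1u << 3)     /* t2 = r3  */
#define REG_A0 (1u << 16)    /* a0 = r16 */

/* Registers read (use) and written (def) by one instruction. */
typedef struct {
    uint32_t use;
    uint32_t def;
} InstRegs;

/* Backward pass over a basic block:
 *   live_before[i] = use[i] | (live_after[i] & ~def[i])
 * live_out is the set of registers live at the end of the block. */
void live_before(const InstRegs *insts, int n, uint32_t live_out, uint32_t *live)
{
    uint32_t l = live_out;
    for (int i = n - 1; i >= 0; i--) {
        l = insts[i].use | (l & ~insts[i].def);
        live[i] = l;
    }
}

/* Registers that hold no needed value just before an instruction and
 * can therefore be used by inserted code without saving them. */
uint32_t free_before(uint32_t live)
{
    return ~live;
}
```

Applied to the loop body from the tool implementation slides (ldl t2,0(a0); addl t0,4,t0; addl t2,0x10,t2; stl t2,0(a0)), t2 is free before the ldl because the ldl overwrites it, while a0 is live there, so inserted code that needs a0 must save it or go through the wrapper.
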
4. Convert IR to Executable
[Figure: layout of the new executable]
• Header
• Program code & data: Text, Data, Rdata, Ndata
• Meta data: Exception Info, Relocations, Debug