Other Applications of Dependence Allen and Kennedy, Chapter 12
Overview • So far, we’ve discussed dependence analysis in Fortran • Dependence analysis can be applied to any language and translation context where arrays and loops are useful • Application to C and C++ • Application to hardware design
Problems of C • C as “typed assembly language” versus Fortran as “high performance language” • C focuses more on ease of use and hardware operations • Post-increments, pre-increments, register variables • Fortran’s focus is on ease of optimization
Problems of C • In many cases, optimization is not desired: while (!(t=*p)); • An optimizer would move the load of *p outside the loop • C++ as well as other new languages focus more on simplified software development, at the expense of optimizability • Use of new languages has expanded into areas where optimization is required
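The busy-wait loop on this slide only works if the load really happens on every iteration. A minimal sketch of how standard C expresses that with `volatile`; the names `status_reg`, `wait_for_nonzero`, `signal_ready` and `status_address` are hypothetical, and the "device" is just a static variable standing in for a location that something else changes.

```c
#include <stdint.h>

/* Hypothetical status word. In a real system this would be a device register
 * or a location written by an interrupt handler. */
static volatile uint32_t status_reg = 0;

/* Spin until the watched location becomes nonzero and return the value seen.
 * Because *p is volatile, the compiler must reload it on every iteration
 * and may not hoist the load out of the loop. */
uint32_t wait_for_nonzero(volatile uint32_t *p)
{
    uint32_t t;
    while (!(t = *p))
        ;  /* busy-wait */
    return t;
}

/* Stand-in for the external agent that changes the location. */
void signal_ready(uint32_t value)
{
    status_reg = value;
}

/* Accessor so callers can pass the hypothetical register's address. */
volatile uint32_t *status_address(void)
{
    return &status_reg;
}
```

Without the `volatile` qualifier, a compiler that sees no store to `*p` inside the loop is free to load it once, which is exactly the unwanted optimization.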
Problems of C • Pointers • The memory locations accessed through pointers are not clear • Aliasing • C does not guarantee that arrays passed into a subroutine do not overlap • Side-effect operators • Operators such as pre- and post-increment encourage a style where array operations are strength-reduced by the programmer
Problems of C • Loops • Fortran loops provide values and restrictions that simplify optimization
Pointers • Two fundamental problems • A pointer variable can point to different memory locations during its use • A memory location can be accessed by more than one pointer variable at any given time, which produces aliases for the location • The result is much more difficult and expensive dependence testing
Pointers • Without knowledge of all possible references to an array, compilers must assume dependence • Analyzing the entire program to determine dependence is feasible, but still unsatisfactory • Leads to the use of compiler options / pragmas • Safe parameters • All pointer parameters to a function point to independent storage • Safe pointers • All pointer variables (parameter, local, global) point to independent storage
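The "safe parameters" idea has a close analog in standard C99: the `restrict` qualifier, a programmer promise that pointer parameters refer to independent storage. The sketch below is illustrative only; the function names are hypothetical, and `restrict` is swapped in here for the compiler options and pragmas the slide mentions.

```c
#include <stddef.h>

/* Without further information the compiler must assume dst and src may
 * overlap, so a store to dst[i] could change a later src element. */
void scale_may_alias(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* restrict promises that dst and src point to independent storage, which
 * removes the assumed dependence. Calling this with overlapping arguments
 * is undefined behavior. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Shows why the conservative assumption matters: with dst = buf + 1 and
 * src = buf, iteration i reads the value iteration i - 1 just wrote,
 * so the loop carries a true dependence. Requires n >= 1. */
void demo_overlap(float *buf, size_t n)
{
    scale_may_alias(buf + 1, buf, n - 1, 2.0f);
}
```

Starting from four elements equal to 1, `demo_overlap` produces 1, 2, 4, 8 rather than 1, 2, 2, 2, so the iterations cannot be run independently.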
Naming and Structures • In Fortran, a block of storage can be uniquely identified by a single name • Consider these constructs: p; *p; **p; *(p+4); *(&p+4);
Naming and Structures • Troublesome structures, such as unions • Naming problem • What is the name of ‘a.b’ ? • Unions allow different-sized objects to overlap the same storage • Reduce references to a common unit of the smallest possible storage
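A small sketch of the union problem and of the "smallest common unit" remedy, under the assumption that the common unit is the byte. The type `word_or_bytes` and both functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Two differently sized views of the same four bytes of storage. */
union word_or_bytes {
    uint32_t word;
    uint8_t  bytes[4];
};

/* A store through one member changes what the other member reads, so
 * u.word and u.bytes[0] are different names for overlapping storage. */
uint32_t write_byte_read_word(void)
{
    union word_or_bytes u;
    u.word = 0;
    u.bytes[0] = 1;   /* also modifies part of u.word */
    return u.word;    /* nonzero; the exact value depends on byte order */
}

/* Reducing both references to byte ranges [offset, offset + length) makes
 * the overlap question a simple interval test. */
int byte_ranges_overlap(size_t off_a, size_t len_a,
                        size_t off_b, size_t len_b)
{
    return off_a < off_b + len_b && off_b < off_a + len_a;
}
```

Here `word` covers bytes 0 to 3 and `bytes[2]` covers byte 2, so `byte_ranges_overlap(0, 4, 2, 1)` reports a possible dependence.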
Loops • Lack of constraints in C • Jumping into the loop body is permitted • The induction variable (if there is one) can be modified in the body of the loop • The loop increment value may also be changed • Conditions controlling the initiation, increment, and termination of the loop have no constraints on their form
Loops • Rewrite as a DO loop • It must have one induction variable • That variable must be initialized with the same value on all paths into the loop • The variable must have one and only one increment within the loop • The increment must be executed on every iteration • The termination condition must match the form a DO loop expects • No jumps into the loop body from outside
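The conditions above can be illustrated with a C `while` loop that satisfies them, its counted equivalent, and a loop that does not qualify. This is a sketch with hypothetical function names, not a description of any particular compiler's recognizer.

```c
/* Qualifies: one induction variable, one initialization, one increment
 * executed on every iteration, and a termination test of the form i < n. */
long sum_while(const int *a, int n)
{
    long s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* The same loop in counted (DO-like) form. */
long sum_counted(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Does not qualify: the induction variable has a second, conditional
 * increment, so the trip count depends on the data. */
long sum_irregular(const int *a, int n)
{
    long s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        if (a[i] < 0)
            i++;   /* extra increment taken only on some iterations */
        i++;
    }
    return s;
}
```

On the array {1, -2, 3, 4} the first two functions agree, while the irregular loop skips an element and cannot be treated as a DO loop.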
Scoping and Statics • Create unique symbols for variables with the same name but different scopes • Static variables • Which procedures have access to the variable can be determined from the scope information • If it contains an address, then the content of that address can be modified by any other procedure
Problematic C Dialects • Use of pointers rather than arrays • Use of side effect operators • Complicates the work of optimizers • Need to be removed • Use of address and dereference operators
Problematic C Dialects • Requires enhancements in some transformations • Constant propagation • Treat address operators as constants and propagate them where essential • Replace a generic pointer inside a dereference with the actual address • Expression simplification and recognition • Need stronger recognition within an expression of which variable is actually the ‘base variable’
Problematic C Dialects • Conversion into array references • Useful to convert pointers into array references • Induction variable substitution • Problem with strength reduction of array references • Expanding side-effect operators also requires changes
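As a sketch of what conversion into array references aims for, the two functions below perform the same copy. The first uses the pointer and side-effect style discussed above; the second is the subscripted form that dependence testing can work with. Both function names are hypothetical.

```c
#include <stddef.h>

/* Pointer style: the programmer has already strength-reduced the subscripts
 * into pointer increments hidden inside side-effect operators. */
void copy_pointer_style(int *dst, const int *src, size_t n)
{
    while (n-- > 0)
        *dst++ = *src++;
}

/* After expanding the side-effect operators and substituting the induction
 * variable, every access is an explicit subscript of a base variable. */
void copy_array_style(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```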
C Miscellaneous • Volatile variables • Functions with these variables are best left without optimization • setjmp and longjmp • Commonly used for error handling • Storing and loading the current state of computation, which is complex when optimization is performed and variables are allocated to registers • No optimization
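A minimal sketch of the `setjmp`/`longjmp` error-handling pattern and of why register allocation is a concern. The functions and the `recovery_point` buffer are hypothetical; the C standard only guarantees the value of a local modified between `setjmp` and `longjmp` if it is declared `volatile`.

```c
#include <setjmp.h>

static jmp_buf recovery_point;

/* Hypothetical failing operation: jumps back to the recovery point. */
static void fail_if_negative(int x)
{
    if (x < 0)
        longjmp(recovery_point, 1);
}

/* Returns 2 on success, or minus the stage reached when an error occurred. */
int run_with_recovery(int input)
{
    /* volatile: modified after setjmp and read after longjmp, so it must
     * live in memory rather than in a register that longjmp may restore. */
    volatile int stage = 0;

    if (setjmp(recovery_point) != 0)
        return -stage;            /* reached via longjmp */

    stage = 1;
    fail_if_negative(input);
    stage = 2;
    return stage;
}
```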
C Miscellaneous • varargs and stdarg • Variable number of arguments • No optimization
Hardware Design: Overview • Today, most hardware design is language-based • Textual description of hardware in languages that are similar to those used to develop software • Level of abstraction moving from low level detailed implementation towards high level behavioral specification • Key factor: compiler technology
Hardware Design: Overview • Four levels of abstraction • Circuit / Physical level • Diagrams of electronic components • Logic level • Boolean equations • Register transfer level (RTL) • Control state transitions and data transfers, timing • Synthesis: conversion from RTL to its implementation • System level • Concentrate on behavior • Behavioral synthesis
Hardware Design • Behavior Synthesis is really a compilation problem • Two fundamental tasks • Verification • Implementation • Simulation of hardware is slow
Hardware Description Languages • Verilog and VHDL • Extensions in Verilog • Multi-valued logic: 0, 1, x, z • x = unknown state, z = high impedance (undriven) • E.g. division by zero produces x state • Operations with x will result in x state -> can’t be executed directly • Reactivity • Propagation of changes automatically • “always” statement -> continuous execution • “@” operator -> blocks execution until one of the operands changes in value
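Because the x state has no direct machine representation, a software simulator has to carry it explicitly. The sketch below models only the x state for integer arithmetic, using a hypothetical `sim_int` type with a separate "unknown" flag; it is one possible encoding, not the one any particular simulator uses.

```c
#include <stdbool.h>

/* A simulated integer that may be in the unknown (x) state. */
typedef struct {
    int  value;    /* meaningful only when unknown is false */
    bool unknown;  /* true models the x state */
} sim_int;

sim_int sim_known(int v)
{
    sim_int r = { v, false };
    return r;
}

sim_int sim_x(void)
{
    sim_int r = { 0, true };
    return r;
}

/* Arithmetic on x yields x. */
sim_int sim_add(sim_int a, sim_int b)
{
    if (a.unknown || b.unknown)
        return sim_x();
    return sim_known(a.value + b.value);
}

/* Division by zero produces the x state instead of trapping. */
sim_int sim_div(sim_int a, sim_int b)
{
    if (a.unknown || b.unknown || b.value == 0)
        return sim_x();
    return sim_known(a.value / b.value);
}
```

Every operation pays for the extra checks, which is the overhead that motivates two-state execution where it is safe.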
Verilog • Reactivity always @(b or c) a = b + c; • Objects • Specific area of silicon • Completely separate area on the chip • Connectivity • Continuous passing of information • Input port and output port
Verilog • Connectivity
    module add(a,b,c);
      output a;
      input b, c;
      integer a, b, c;
      always @(b or c) a = b + c;
    endmodule
Verilog • Instantiation • Verilog only allows static instantiation: integer x, y, z; add adder1(x,y,z); • Vector operations • Viewing other data structures as vectors of scalars
Verilog • Advantages • No aliasing • Restriction of form of subscripts • Entire hardware design given to compilers at one time
Verilog • Disadvantages • Non-procedural continuation semantics • Lack of loops • Loops are implicitly represented by always blocks and the scheduler • Size
Optimizing simulation • Philosophy • Increases level of abstraction • Opts for fewer details • Inlining modules • HDLs have two properties that make module inlining simpler • Whole design is reachable at one time • Recursion is not permitted
Optimizing simulation • Execution ordering • The order in which statements are executed can have a dramatic effect on efficiency • Fast in hardware, but not in software • Grouping increases performance • Execute blocks in topological order based on the dependence graph of individual array elements • No memory overhead
Dynamic versus Static Scheduling • Dynamic scheduling • Dynamically track changes in values and propagate them • Mimics hardware • Overhead of change checks • Static scheduling • Blindly sweeps through all values for all objects regardless of any changes • No need for change checks
Dynamic versus Static Scheduling • If the circuit is highly active, static scheduling is more suitable • In general, using dynamic scheduling guided by static analysis provides the best results
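The trade-off between the two schedulers can be sketched on a tiny netlist. Everything here is an assumption made for illustration: three hypothetical AND gates over six signals, listed in topological order, with a `changed` array playing the role of the change checks.

```c
#include <stdbool.h>

enum { NUM_SIGNALS = 6, NUM_GATES = 3 };

typedef struct { int in0, in1, out; } and_gate;

/* Gates in topological order: each gate reads primary inputs
 * (signals 0..2) or outputs of earlier gates. */
static const and_gate gates[NUM_GATES] = {
    { 0, 1, 3 },
    { 3, 2, 4 },
    { 4, 0, 5 },
};

/* Static scheduling: evaluate every gate on every sweep, with no change
 * checks. Returns the number of gate evaluations performed. */
int sweep_static(bool sig[NUM_SIGNALS])
{
    for (int g = 0; g < NUM_GATES; g++)
        sig[gates[g].out] = sig[gates[g].in0] && sig[gates[g].in1];
    return NUM_GATES;
}

/* Dynamic scheduling: evaluate a gate only when one of its inputs is marked
 * as changed, and propagate the mark when its output changes. */
int sweep_dynamic(bool sig[NUM_SIGNALS], bool changed[NUM_SIGNALS])
{
    int evaluations = 0;
    for (int g = 0; g < NUM_GATES; g++) {
        if (!changed[gates[g].in0] && !changed[gates[g].in1])
            continue;                      /* the change check */
        bool v = sig[gates[g].in0] && sig[gates[g].in1];
        evaluations++;
        if (v != sig[gates[g].out]) {
            sig[gates[g].out] = v;
            changed[gates[g].out] = true;
        }
    }
    return evaluations;
}
```

When only signal 2 changes, the dynamic sweep evaluates two gates instead of three but pays for a check per gate; in a highly active circuit nearly every check succeeds, so the checks become pure overhead.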
Fusing always blocks • High cost of change checks motivates fusing always blocks • Fusing may change the output of a design
Vectorizing always block • Regrouping low level operations back together to recover higher level abstractions • Vectorizing the bit operations
Two state versus four state • Extra overhead in four state hardware • Few people like hardware that enters unknown states • Two state logic can be 3-5x faster • Use two-valued logic wherever possible • Finding the parts that can execute in two state logic is difficult • Use interprocedural analysis
Two state versus four state • Test for detecting unknown is low cost, 2-3 instructions • Check for unknowns but default quickly to two state execution
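One way to "check for unknowns but default quickly to two state execution" is to keep a mask of unknown bits beside the value bits. The sketch below assumes a hypothetical 32-bit `vec4` encoding and treats any unknown operand bit as making the whole sum unknown.

```c
#include <stdint.h>

/* A 32-bit simulated vector: xmask marks the bits that are unknown (x). */
typedef struct {
    uint32_t bits;   /* value bits, meaningful where xmask is 0 */
    uint32_t xmask;  /* 1 = this bit is unknown */
} vec4;

vec4 vec_add(vec4 a, vec4 b)
{
    vec4 r;
    if ((a.xmask | b.xmask) == 0) {
        /* Fast path: the test costs an OR and a branch, after which the
         * operation is plain two-state machine arithmetic. */
        r.bits = a.bits + b.bits;
        r.xmask = 0;
    } else {
        /* Slow path: an unknown operand makes every result bit unknown. */
        r.bits = 0;
        r.xmask = UINT32_MAX;
    }
    return r;
}
```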
Rewriting block conditions
    always @(posedge(clk)) begin
      sum = op1 ^ op2 ^ c_in;
      c_out = (op1 & op2) | (op2 & c_in) | (c_in & op1);
    end

    always @(op1 or op2 or c_in) begin
      t_sum = op1 ^ op2 ^ c_in;
      t_c_out = (op1 & op2) | …
    end

    always @(posedge(clk)) begin
      sum = t_sum;
      c_out = t_c_out;
    end
Basic Optimizations • Raise level of abstraction • Constant propagation and dead code elimination • Common subexpression elimination
Synthesis Optimization • Goal is to insert the details • Analogous to standard compilers • Harder than standard compilers • Not targeted towards a fixed target • No single goal. Minimize cycle time, area, power consumption
Basic Framework • Selection outweighs scheduling • Analogous to CISC • Body of tree matching algorithms • Needs constraints
Loop Transformations
    for(i=0; i<100; i++) {
      t[i] = 0;
      for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j]>>2);
    }
    for(i=0; i<100; i++) {
      o[i] = 0;
      for(j=0; j<100; j++)
        o[i] = o[i] + m[i][j] * t[j];
    }
Loop Transformations
    for(i=0; i<100; i++) t[i] = 0;
    for(i=0; i<100; i++) o[i] = 0;
    for(i=0; i<100; i++)
      for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
    for(i=0; i<100; i++)
      for(j=0; j<100; j++)
        o[i] = o[i] + m[i][j] * t[j];
Loop Transformations
    for(i=0; i<100; i++) o[i] = 0;
    for(i=0; i<100; i++) {
      t[i] = 0;
      for(j=0; j<3; j++)
        t[i] = t[i] + (a[i-j] >> 2);
      for(j=0; j<100; j++)
        o[j] = o[j] + m[j][i] * t[i];
    }
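A quick way to gain confidence in the distribution, interchange and fusion steps is to run the original nest and the fused form on the same data and compare the outputs. The harness below is a sketch: the function names, the test data, and the convention that `a` points into a larger buffer (so that `a[-1]` and `a[-2]` are valid) are all assumptions.

```c
enum { N = 100, TAPS = 3 };

/* Original form: compute all of t, then the matrix-vector product. */
void filter_original(int o[N], const int *a, int m[N][N])
{
    int t[N];
    for (int i = 0; i < N; i++) {
        t[i] = 0;
        for (int j = 0; j < TAPS; j++)
            t[i] = t[i] + (a[i - j] >> 2);
    }
    for (int i = 0; i < N; i++) {
        o[i] = 0;
        for (int j = 0; j < N; j++)
            o[i] = o[i] + m[i][j] * t[j];
    }
}

/* Fused form: each t[i] is consumed as soon as it is produced. */
void filter_fused(int o[N], const int *a, int m[N][N])
{
    int t[N];
    for (int i = 0; i < N; i++)
        o[i] = 0;
    for (int i = 0; i < N; i++) {
        t[i] = 0;
        for (int j = 0; j < TAPS; j++)
            t[i] = t[i] + (a[i - j] >> 2);
        for (int j = 0; j < N; j++)
            o[j] = o[j] + m[j][i] * t[i];
    }
}

/* Runs both forms on small nonnegative test data; returns 1 if they agree. */
int filters_agree(void)
{
    static int a_buf[N + TAPS - 1];
    static int m[N][N];
    int o_ref[N], o_new[N];

    for (int k = 0; k < N + TAPS - 1; k++)
        a_buf[k] = (k * 7) % 64;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = (i + 2 * j) % 5;

    const int *a = a_buf + (TAPS - 1);   /* a[-2] is a_buf[0] */
    filter_original(o_ref, a, m);
    filter_fused(o_new, a, m);
    for (int i = 0; i < N; i++)
        if (o_ref[i] != o_new[i])
            return 0;
    return 1;
}
```

In the fused form only `t[i]` is live at any time, which is what allows the array `t` to be replaced by a scalar afterwards.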
Loop Transformation
    for(i=0; i<100; i++) o[i] = 0;
    a0 = a[0]; a1 = a[-1]; a2 = a[-2]; a3 = a[-3];
    for(i=0; i<100; i++) {
      t = 0;
      t = t + (a0>>2) + (a1>>2) + (a2>>2) + (a3>>2);
      a3 = a2; a2 = a1; a1 = a0; a0 = a[i+1];
      for(j=0; j<100; j++)
        o[j] = o[j] + m[j][i] * t;
    }
Control and Data Flow • Von Neumann architecture • Data movement among memory and registers • Control flow encapsulated in the program counter and effected with branches • Synthesized hardware • Data movement among functional units • Control flow is which functional unit should be active on what data at which time steps
Control and Data Flow • Wires • Immediate transfer • Latches • Values hold throughout one clock cycle • Registers • Static variables in C • Held for one or more clock cycles • Memories
Memory Reduction • Memory access is slow compared to unit access • Application of techniques • Loop interchange • Loop fusion • Scalar replacement • Strip mining • Unroll and jam • Prefetching
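Of the techniques listed, scalar replacement is the easiest to show in a few lines. In this sketch (hypothetical names, a small 4x4 problem) the accumulator that the first version keeps in the array `o` is held in a scalar in the second, which a synthesis tool can map to a register instead of a memory port.

```c
enum { ROWS = 4, COLS = 4 };

/* Every inner iteration reads and writes o[i] in memory. */
void matvec_memory(int o[ROWS], int m[ROWS][COLS], const int t[COLS])
{
    for (int i = 0; i < ROWS; i++) {
        o[i] = 0;
        for (int j = 0; j < COLS; j++)
            o[i] = o[i] + m[i][j] * t[j];
    }
}

/* After scalar replacement: o[i] is written once per row, and the running
 * sum lives in acc. */
void matvec_scalar(int o[ROWS], int m[ROWS][COLS], const int t[COLS])
{
    for (int i = 0; i < ROWS; i++) {
        int acc = 0;
        for (int j = 0; j < COLS; j++)
            acc += m[i][j] * t[j];
        o[i] = acc;
    }
}
```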
Summary • Not limited to Fortran • Have other applications • Early stage of research