*partially supported by ST Microelectronics

A Retargetable Preprocessor for Multimedia Instructions* (work in progress)
INRIA
F. Bodin, G. Pokam, J. Simonnet

Multimedia Instructions
• Instruction set extension to achieve high performance
• Many different ones
• Crucial for embedded systems
• Difficult to use
• Retargetability is an issue

Multimedia Instructions
• Exploit subword parallelism

An Example (Trimedia)

Scalar version:

    char *back, *forward, *idct, *destination;
    for (i = 0; i < 64; i += 1) {
        destination[i] = ((back[i] + forward[i] + 1) >> 1) + idct[i];
    }

Version using Trimedia operations on packed words (16 iterations instead of 64):

    int *i_back = (int *) back;
    int *i_forward = (int *) forward;
    int *i_idct = (int *) idct;
    int *i_dest = (int *) destination;
    for (i = 0; i < 16; i += 1) {
        temp = QUADAVG(i_back[i], i_forward[i]);
        i_dest[i] = DSPUQUADADDUI(temp, i_idct[i]);
    }

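MMI Automatic Exploitation
[Figure: compilation flow from source code to code generation. Stage labels: Src code, Preprocessing, Vectorization [Bik01] [Krall00] [...], Alignment, Loop Unrolling, Idioms/MMI Recognition, Code Generation. Citations [Larsen00] and [Leupers2000] also appear. The flow is split into a machine-independent part and a machine-dependent part.]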

The MMI Recognition Phase
• Find the instructions available on the machine
    • after vectorization
    • after unrolling
• User interaction
    • fast retargetability
    • not only for the compiler writer
    • no compiler recompilation needed

An MMI Example

    temp[i] = (back[i] + forward[i] + 1) >> 1;

rather than

    t1[i] = (back[i] + forward[i]);
    t2[i] = t1[i] + 1;
    temp[i] = t2[i] >> 1;

SWARecog: a Retargetable Engine for MMI
• Front-end independent
    • CoSy
    • Sage++
• Retargetable
    • configurable intermediate form
• Uses a rewriting system based on U. Assmann’s work [Assmann96]

An Overview of SWARecog
[Figure: overview of SWARecog. Labels: CoSy, Sage++, IR description based.]

The Intermediate Format
• Identical for code and rules
• Attributes declaration
• Node declaration
• Edge declaration

    [NODES] Operator:ENUM = {cast, mul, add, sright, assg,...};
    [NODES] VariableName:STRING = {};
    [EDGES] distance:INTEGER = {} DEFAULT 0;

    NODELabel Assign:
        NodeType = {operator}
        Operator = {assg}
        ValueType = {int}

    (flowdep ObjectAddr:14 Assign:8 1)
    (flowdep Plus:9 Assign:8 2)
    [negated] (same ObjectAddr:14 ObjectAddr:11 0)

A Rule Description Example

Rewrite b = a+a as b = a<<1.

    NODELabel v:
        NodeType = {scalar}
        Operator = {obj}
        Aliased = {0}
        ValueType = {int}
        VariableName = {*}
        LoopSector = {body}

[Figure: graph form of the two sides of the rule. Node labels: v, v, +, * and v, 1, <<, *.]

    RULE [1] MulToShift:
        (flowdep v:1 Plus:6 0)
        (flowdep v:2 Plus:6 0)
        (flowdep Plus:6 Exp:0 *) 1
        (same v:1 v:2 0)
        (same v:2 v:1 0)

        (flowdep v:1 Shift:7 1)
        (flowdep IntConst1:8 Shift:7 2)
        (flowdep Shift:7 Exp:0 *) 1

Example-1

Input:

    /*$pragma[VectorLoop("NoAlias")]*/
    for (i = xa; i < xb; i = i+4){
        sum = sum + (s[i] * om[i]);
        sum = sum + (s[i+1] * om[i+1]);
        sum = sum + (s[i+2] * om[i+2]);
        sum = sum + (s[i+3] * om[i+3]);
    }

Output:

    for (i = xa ; i < xb ; i = i + 4){
        sum = sum + dualDotProd(packCont(s[i], s[i + 1]),
                                packCont(om[i], om[i + 1]));
        sum = sum + dualDotProd(packCont(s[i + 2], s[i + 3]),
                                packCont(om[i + 2], om[i + 3]));
    }

Example-2

Input:

    for (i = xa; i < xb; i = i+4) /*$pragma[VectorLoop("NoAlias")]*/{
        d[i] = j + 4 + (s[i] + om[i]);
        d[i+1] = j + 4 + (s[i+1] + om[i+1]);
        d[i+2] = j + 4 + (s[i+2] + om[i+2]);
        d[i+3] = j + 4 + (s[i+3] + om[i+3]);
    }

Output (the numeric suffix of each NEWVAR_ name is the instance number, $INSTANCE):

    for (i = xa ; i < xb ; i = i + 4){
        NEWVAR_temp2_1 = dualAdd(packCont(s[i + 2], s[i + 3]),
                                 packCont(om[i + 2], om[i + 3]));
        NEWVAR_temp2_2 = dualAdd(packCont(s[i], s[i + 1]),
                                 packCont(om[i], om[i + 1]));
        NEWVAR_temp1_1 = unpackCont(NEWVAR_temp2_1, 0);
        d[i + 2] = j + 4 + (NEWVAR_temp1_1);
        NEWVAR_temp3_1 = unpackCont(NEWVAR_temp2_1, 1);
        d[i + 3] = j + 4 + (NEWVAR_temp3_1);
        NEWVAR_temp1_2 = unpackCont(NEWVAR_temp2_2, 0);
        d[i] = j + 4 + (NEWVAR_temp1_2);
        NEWVAR_temp3_2 = unpackCont(NEWVAR_temp2_2, 1);
        d[i + 1] = j + 4 + (NEWVAR_temp3_2);
    }

Combining the Rules
• Strata or alternative based
• Normalization based
[Figure: rewriting engines, each with its own rule description, applied in sequence between C code and successive IR forms. Labels: Rule Desc., Rewriting Engine, C code, IR Form.]

Rule Generation
• C rules description
[Figure: rule generation flow. Labels: C code, Front-end, LHS Generator, RHS Generator, Rule description, SWARecog.]

A Rule Description Example

Annotations on the slide:
• the engine generates same_address_+1
• defines the properties of the leaf expressions to match

Left-hand side:

    for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[LHS()]*/ {
        ROOT_1(LEAF_3(tab1[i]) + LEAF_4(tab2[i]));
        ROOT_2(LEAF_5(tab1[i+1]) + LEAF_6(tab2[i+1]));
    }

Right-hand side:

    for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[RHS()]*/ {
        NEWVAR_temp2 = dualAdd(packCont(LEAF_3(tab1[i]), LEAF_5(tab1[i+1])),
                               packCont(LEAF_4(tab2[i]), LEAF_6(tab2[i+1])));
        NEWVAR_temp1 = unpackCont(NEWVAR_temp2, 0);
        NEWVAR_temp3 = unpackCont(NEWVAR_temp2, 1);
        ROOT_1(NEWVAR_temp1);
        ROOT_2(NEWVAR_temp3);
    }

A Rule Description Example

Left-hand side:

    for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[LHS()]*/ {
        ROOT_1(LEAF_7(sum) = LEAF_8(sum) + (LEAF_3(tab1[i]) * LEAF_4(tab2[i])));
        ROOT_2(LEAF_9(sum) = LEAF_10(sum) + (LEAF_5(tab1[i+1]) * LEAF_6(tab2[i+1])));
    }

Right-hand side:

    for (i = 0; i < LOOP_BOUND1 -1; i = i+2) /*$pragma[RHS()]*/ {
        ROOT_2(LEAF_9(sum) = LEAF_10(sum) +
               dualDotProd(packCont(LEAF_3(tab1[i]), LEAF_5(tab1[i+1])),
                           packCont(LEAF_4(tab2[i]), LEAF_6(tab2[i+1]))));
    }

Conclusion and Perspectives
• The prototype is running
• Vectorization and alignment phases are under development
• Next step: study the tradeoff between unrolling and vectorization

Bibliography
• [Assmann96] Graph Rewrite Systems for Program Optimization, U. Assmann, Technical Report RR2955, INRIA Rocquencourt, 1996
• [Bik01] Experiments with Automatic Vectorization for the Pentium 4 Processor, A. Bik, M. Girkar, P. Grey and X. Tian, CPC, 2001
• [Cheong97] An Optimizer for Multimedia Instruction Sets, G. Cheong and M. Lam, Proceedings of the Second SUIF Compiler Workshop, 1997

Bibliography (cont.)
• [Krall00] Compilation Techniques for Multimedia Processors, A. Krall and S. Lelait, IJPP, vol. 28, No 4, 2000
• [Larsen00] Exploiting Superword Level Parallelism with Multimedia Instruction Sets, S. Larsen and S. Amarasinghe, PLDI 2000
• [Leupers2000] Code Selection for Media Processors with SIMD Instructions, R. Leupers, DATE 2000
• [Sreraman00] A Vectorizing Compiler for Multimedia Extensions, N. Sreraman and R. Govindarajan, IJPP, vol. 28, No 4, 2000
