Advancing Compiler Performance with Open64: Strategies & Innovations for Multi-Core Architectures

Open64: A Framework for High-Performance Compilers (March 2007)

Outline • Open64 History • Osprey Project • Research Activities • Retargetability

Open64-Based Research Activities at the University of Delaware • Open64 Code Porting for Large-Scale Multi-Core Architectures • Code Optimization for Large-Scale Multi-Core Architectures • Research on a points-to alias analysis under an SSA framework • Landing Software Pipelining on Large-Scale Multi-Core Architectures

Porting Open64 to Cyclops64 (C64), based on PathScale 2.2.1/x86-64
• Front End: starting from the gcc 3.2.1/MIPS C front end, we changed the MD so that it generates an AST compatible with the Cyclops64 ABI.
• IPA
• Machine model (ISA, uArch, ABI, etc.): rewritten from scratch for C64.
• LNO: only the dependence tests are enabled; the loop transformations are disabled because the original loop transformations are not readily applicable to an architecture without a cache.
• WOPT
• CG: heavily changed (CGIR lowering, scheduling, EBO, etc.).
• Tool chain (as, ld, simulator, etc.): provided by ETI.

Some research on Open64/C64
• Scratch pad utilization
• Divide the scratch pad memory into three areas: second-level general-purpose registers (L2 GPR), software rotating registers (SRR), and a free area.
• L2 GPR: further divided into caller-save and callee-save; live ranges are colored with L2 GPRs when the register allocator (RA) runs out of real registers.
• SRR: used for prefetching and for improving temporal locality.
• E.g. 1, prefetch 5 iterations ahead: for (…) { = x; } => for (…) { rrx = x; = rrx+5 }
• E.g. 2, improve temporal locality: for (i=0; i<10000; i++) { a[i] = b[i] + a[i-5]; } => for (i=0; i<10000; i++) { rrx = b[i] + rrx-5; a[i] = rrx; }
• Use LDM (load multiple words) to reduce the bandwidth requirement.
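The temporal-locality transformation in the second example above can be illustrated with a small Python emulation. This is a minimal sketch, not C64 code: the function names and the use of a `deque` to stand in for the software rotating registers are assumptions made for illustration. The loops start at i = 5 so that a[i-5] is always in bounds.

```python
from collections import deque


def recurrence_via_memory(a, b, dist=5):
    """Reference loop: a[i] = b[i] + a[i - dist], re-reading a[i - dist] from memory."""
    out = list(a)
    for i in range(dist, len(out)):
        out[i] = b[i] + out[i - dist]
    return out


def recurrence_via_rotating_window(a, b, dist=5):
    """Same recurrence, but the last `dist` results stay in a rotating window.

    The window is a stand-in (an assumption of this sketch) for the rotating
    registers: each iteration reads the value produced `dist` iterations
    earlier from the window instead of reloading a[i - dist].
    """
    out = list(a)
    # Preload the window with the initial values a[0..dist-1].
    window = deque(out[:dist], maxlen=dist)
    for i in range(dist, len(out)):
        value = b[i] + window[0]  # window[0] is the result from iteration i - dist
        window.append(value)      # rotate: the oldest entry drops out
        out[i] = value            # the store to a[i] is still needed
    return out
```

Both functions compute the same array; the second one performs one load (b[i]) and one store (a[i]) per iteration, which is the saving the slide describes.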

Unification-based points-to analysis using SSA
• Motivations
• Incremental change to the existing Steensgaard points-to (PT) analysis, with better precision
• Retain almost-linear time
• Limited flow sensitivity: improve the precision of the analysis of *p and *q where p and q are global variables/pointers, or where they may be modified by callees.
• Reduce the imprecision due to unification
• Limited flow sensitivity through SSA form:
• Build a preliminary SSA form for all variables (including global variables and local variables whose address is taken), without taking aliases into account.
• Perform the points-to analysis on the preliminary SSA form, updating the SSA form during the analysis.
Example: p3 initially points to n; after statement 4 is analyzed, p3 points to both n and z.
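For readers unfamiliar with the baseline, the following is a minimal sketch of a textbook Steensgaard-style unification analysis built on union-find. It is not the SSA-based variant described on this slide, and it is not Open64 code; the class name, method names, and the `$tmp` placeholder nodes are all assumptions made for illustration.

```python
class Steensgaard:
    """Flow-insensitive, unification-based points-to analysis (illustrative sketch)."""

    def __init__(self):
        self.parent = {}  # union-find parent links between abstract locations
        self.target = {}  # class representative -> the single class it points to
        self._fresh = 0   # counter for placeholder pointee nodes

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def pointee(self, x):
        """Return the class x points to, creating a placeholder if none exists yet."""
        r = self.find(x)
        if r not in self.target:
            self._fresh += 1
            self.target[r] = self.find(f"$tmp{self._fresh}")
        return self.find(self.target[r])

    def join(self, x, y):
        """Unify two classes and, recursively, the classes they point to."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        tx, ty = self.target.pop(rx, None), self.target.pop(ry, None)
        self.parent[rx] = ry
        if tx is not None and ty is not None:
            self.target[ry] = tx
            self.join(tx, ty)
        elif tx is not None or ty is not None:
            self.target[ry] = tx if tx is not None else ty

    def address_of(self, p, x):  # p = &x
        self.join(self.pointee(p), x)

    def copy(self, p, q):        # p = q
        self.join(self.pointee(p), self.pointee(q))

    def load(self, p, q):        # p = *q
        self.join(self.pointee(p), self.pointee(self.pointee(q)))

    def store(self, p, q):       # *p = q
        self.join(self.pointee(self.pointee(p)), self.pointee(q))

    def points_to(self, p, variables):
        """Named variables that share a class with p's pointee."""
        r = self.find(p)
        if r not in self.target:
            return set()
        t = self.find(self.target[r])
        return {v for v in variables if self.find(v) == t}
```

After `p = &n; q = &z; p = q`, both p and q report {n, z}: the copy unifies the two pointee classes in both directions. This is the kind of imprecision due to unification that the slide aims to reduce.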

Unification-based points-to analysis using SSA (cont.)
• Differentiate flat unification from updating unification
• Flat unification: let s1 = points_to(p1) and s2 = points_to(p2). The statement p = cond ? p1 : p2 unifies s1 and s2 simply because p may point to either set; s1 and s2 themselves do not need to be updated at the moment the unification happens.
• Incremental update: given points_to(p1) = {a, b}, the statement "*q1 = some_ptr" may change p1's value, so points_to(p1) should be updated to {a, b} ∪ points_to(some_ptr).
• The final unified set encodes the type of unification of its smaller subsets. Flat-unified subsets remain disjoint.
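One possible reading of the two kinds of unification is sketched below. The slides do not show the actual data structures, so the `UnifiedSet` class and both function names are hypothetical; the sketch only mirrors the two rules stated above.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedSet:
    """A unified class that remembers the subsets joined by flat unification."""
    flat_parts: list = field(default_factory=list)

    def members(self):
        out = set()
        for part in self.flat_parts:
            out |= part
        return out


def flat_unify(s1, s2):
    """p = cond ? p1 : p2: p may point into either set; s1 and s2 are left unchanged."""
    return UnifiedSet(flat_parts=[s1, s2])


def updating_unify(points_to, p1, some_ptr):
    """'*q1 = some_ptr' may overwrite p1: grow points_to(p1) in place."""
    points_to[p1] |= points_to[some_ptr]


pts = {"p1": {"a", "b"}, "p2": {"c"}, "some_ptr": {"z"}}
unified = flat_unify(pts["p1"], pts["p2"])  # pts["p1"] and pts["p2"] stay disjoint
updating_unify(pts, "p1", "some_ptr")       # pts["p1"] becomes {a, b} U {z}
```

In this sketch the unified class holds references to its parts, so it reflects the later update to points_to(p1) while points_to(p2) is untouched.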

Software Pipelining of Multi-Core Architectures – A Brief Introduction • Problem description • Software toolchain • Where Open64 helped • Some results.

Problem Description • Software-pipelining on multi-threaded architectures • Single-dimension Software-Pipelining (SSP) • Workload distribution • Data communication • Data synchronization

Software Pipelining Toolchain Based on Open64

Implementation

What Open64 features are used in multi-core software pipelining • Multi-dimensional dependence analysis • WHIRL clean interface • Machine model • Reservation tables • Register allocation • Modulo-scheduler • Code generator • No need to implement everything to test • Clean code despite lack of documentation!
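To make the role of reservation tables in modulo scheduling concrete, here is a generic sketch of a modulo reservation table. It does not reproduce Open64's machine model or scheduler interfaces; the class, its methods, and the resource names are assumptions made for illustration. An operation's reservation table is given as (cycle offset, resource) pairs, and every cycle is folded modulo the initiation interval (II).

```python
class ModuloReservationTable:
    """Tracks resource usage per cycle modulo the initiation interval (II)."""

    def __init__(self, ii):
        self.ii = ii
        self.busy = {}  # (cycle mod ii, resource) -> name of the op holding it

    def place(self, name, reservation, start):
        """Try to schedule `name` at cycle `start`; return True on success.

        `reservation` lists (cycle_offset, resource) pairs taken from the
        operation's reservation table.
        """
        slots = [((start + off) % self.ii, res) for off, res in reservation]
        # An op can collide with its own later instances, or with placed ops.
        if len(set(slots)) != len(slots) or any(s in self.busy for s in slots):
            return False
        for s in slots:
            self.busy[s] = name
        return True


mrt = ModuloReservationTable(ii=2)
mrt.place("load", [(0, "mem")], start=0)   # takes the mem unit in slot 0
mrt.place("store", [(0, "mem")], start=2)  # rejected: cycle 2 folds onto slot 0
mrt.place("store", [(0, "mem")], start=3)  # accepted: cycle 3 folds onto slot 1
```

A modulo scheduler would typically retry a rejected operation at later cycles and raise II when no slot fits.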

Cyclops64 architecture

Some Results
