Advancing Compiler Performance with Open64: Strategies & Innovations for Multi-Core Architectures
This document outlines pivotal advancements in high-performance compilation through the Open64 framework, emphasizing its historical context, recent research initiatives, and code optimization techniques tailored for large-scale multi-core architectures. Key topics include scratch pad memory utilization, enhanced points-to analysis under the SSA framework, and effective software pipelining strategies. The text also discusses the significant role of Open64 in facilitating the transition to Cyclops64 architecture and highlights specific optimization methodologies that improve computational efficiency and resource allocation.
Presentation Transcript
Open64: A Framework for High-Performance Compilers (March 2007)
Outline • Open64 History • Osprey Project • Research Activities • Retargetability
Open64-Based Research Activities at University of Delaware • Open64 Code Porting for Large-Scale Multi-Core Architectures • Code Optimization for Large-Scale Multi-Core Architectures • Research on a points-to alias analysis under an SSA framework • Landing Software Pipelining on Large-Scale Multi-Core Architectures
Port Open64 to Cyclops64 based on PathScale 2.2.1/x86-64 • Components: Front End, IPA, Machine model (ISA, uArch, ABI, etc.), LNO, WOPT, CG, Tool chains • Front End: begin with the gcc 3.2.1/MIPS C FE; the MD is changed so that it generates an AST compatible with Cyclops64's ABI • Machine model: rewritten from scratch for C64 • LNO: only the dependence tests are enabled; the loop transformations are not enabled because the original loop transformations are not readily applicable to an architecture without a cache • CG: changed heavily (CGIR lowering, scheduling, EBO, etc.) • Tool chains: as, ld, simulator, etc. are provided by ETI
Some research on Open64/C64 • Scratch pad utilization • Divide scratch pad memory into 3 areas: 2nd-level general-purpose registers (L2 GPR), software rotating registers (SRR), and a free area • L2 GPR: further divided into caller-save and callee-save; live ranges are colored with L2 GPRs when the register allocator (RA) runs out of real registers • SRR: used for prefetching and for improving temporal locality • E.g. 1, prefetch 5 iterations ahead: for (…) { = x; } => for (…) { rrx = x; = rrx+5; } • E.g. 2, improve temporal locality: for (i=0; i<10000; i++) { a[i] = b[i] + a[i-5]; } => for (i=0; i<10000; i++) { rrx = b[i] + rrx-5; a[i] = rrx; } • Use LDM (load multiple words) to reduce the bandwidth requirement
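The temporal-locality example (E.g. 2) can be sketched in plain C. This is only an illustration under stated assumptions: a small local array `rr` stands in for the software rotating registers (the slides do not describe how SRR is actually addressed), the function names are hypothetical, and the loop starts at i = 5 so that a[i-5] stays in bounds.

```c
#include <assert.h>
#include <stddef.h>

#define DIST 5  /* dependence distance taken from the slide's a[i-5] */

/* Reference loop: every iteration reloads a[i-DIST] from memory. */
void recurrence_naive(long *a, const long *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = DIST; i < n; i++)
        a[i] = b[i] + a[i - DIST];
}

/* Same recurrence, but the last DIST results live in a small rotating
 * window (rr), so a[i-DIST] is never reloaded from the array. */
void recurrence_rotating(long *a, const long *b, size_t n) {
    long rr[DIST];
    size_t slot = 0;

    assert(a != NULL && b != NULL);
    if (n <= DIST)
        return;
    for (size_t k = 0; k < DIST; k++)
        rr[k] = a[k];                 /* seed the window with a[0..DIST-1] */

    for (size_t i = DIST; i < n; i++) {
        long v = b[i] + rr[slot];     /* rr[slot] currently holds a[i-DIST] */
        a[i] = v;
        rr[slot] = v;                 /* this slot is read again DIST iterations later */
        slot = (slot + 1 == DIST) ? 0 : slot + 1;
    }
}
```

Both functions compute the same values; the second only changes where a[i-DIST] is read from, which is the effect the slide attributes to SRR.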
Unification-based points-to analysis using SSA • Motivations • Incremental change to the existing Steensgaard points-to (PT) analysis, with better precision • Retain almost-linear time • Limited flow sensitivity: improve the precision of the analysis of *p and *q where p and q are global variables/pointers or may be modified by callees • Reduce the imprecision due to unification • Limited flow sensitivity by SSA form: • Build a (preliminary) SSA form for all variables (incl. global variables and local variables whose address is taken), without taking aliases into account • Perform points-to analysis on the preliminary SSA form, and update the SSA form during the PT analysis • Example: p3 initially points to n; after analyzing stmt 4, p3 points to both n and z
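For context, the baseline that the slide proposes to refine, a Steensgaard-style unification analysis, can be sketched with a union-find structure. This is a minimal illustration, not the Open64 implementation: it handles only `p = &x` and `p = q`, has no SSA or flow sensitivity, and all names (`PointsTo`, `pt_unify`, and so on) are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 16
#define NONE (-1)

/* One node per variable, plus placeholder nodes allocated on demand. */
typedef struct {
    int parent[MAX_NODES];   /* union-find parent */
    int pointee[MAX_NODES];  /* node this class points to, or NONE (read at roots) */
    int n;                   /* nodes in use */
} PointsTo;

void pt_init(PointsTo *pt, int nvars) {
    assert(nvars > 0 && nvars <= MAX_NODES);
    pt->n = nvars;
    for (int i = 0; i < nvars; i++) { pt->parent[i] = i; pt->pointee[i] = NONE; }
}

int pt_find(PointsTo *pt, int x) {
    assert(x >= 0 && x < pt->n);
    while (pt->parent[x] != x) {
        pt->parent[x] = pt->parent[pt->parent[x]];  /* path halving */
        x = pt->parent[x];
    }
    return x;
}

/* Merge two classes and, recursively, the classes they point to. */
void pt_unify(PointsTo *pt, int a, int b) {
    a = pt_find(pt, a);
    b = pt_find(pt, b);
    if (a == b) return;
    int pa = pt->pointee[a], pb = pt->pointee[b];
    pt->parent[b] = a;
    if (pa == NONE) pt->pointee[a] = pb;
    else if (pb != NONE) pt_unify(pt, pa, pb);
}

/* Class that x points to; allocates an empty placeholder if there is none. */
int pt_deref(PointsTo *pt, int x) {
    int r = pt_find(pt, x);
    if (pt->pointee[r] == NONE) {
        assert(pt->n < MAX_NODES);
        int fresh = pt->n++;
        pt->parent[fresh] = fresh;
        pt->pointee[fresh] = NONE;
        pt->pointee[r] = fresh;
    }
    return pt_find(pt, pt->pointee[r]);
}

/* p = &x */
void pt_address_of(PointsTo *pt, int p, int x) { pt_unify(pt, pt_deref(pt, p), x); }

/* p = q : both must end up pointing to the same class */
void pt_copy(PointsTo *pt, int p, int q) {
    int tp = pt_deref(pt, p);
    int tq = pt_deref(pt, q);
    pt_unify(pt, tp, tq);
}

/* May *p and *q refer to the same location? */
int pt_may_alias(PointsTo *pt, int p, int q) {
    int tp = pt->pointee[pt_find(pt, p)], tq = pt->pointee[pt_find(pt, q)];
    return tp != NONE && tq != NONE && pt_find(pt, tp) == pt_find(pt, tq);
}
```

After `p = &x; q = &y; p = q;` this sketch places x and y in one class even though q never points to x. That merging is the kind of imprecision due to unification that the slide aims to reduce.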
Unification-based points-to analysis using SSA (cont.) • Differentiate flat unification from updating unification • Flat unification: let s1 = points_to(p1) and s2 = points_to(p2); the statement p = cond ? p1 : p2 unifies s1 and s2 simply because p may point to both sets. s1 and s2 themselves do not need to be updated at the moment unification happens. • Incremental update: given points_to(p1) = {a, b}, the statement “*q1 = some_ptr” may change p1’s value, hence points_to(p1) should be updated to {a, b} U points_to(some_ptr). • The final unified set encodes the type of unification of its smaller subsets. Flat-unified subsets remain disjoint.
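The two cases differ only in which sets change, which a small bitset sketch makes concrete. This illustrates the set arithmetic only; it does not reproduce the slide's encoding of unification types inside a unified set, and the names `VarSet`, `flat_join`, and `store_through` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NVARS 8
typedef uint32_t VarSet;   /* bit v set => variable v is in the set */

/* Flat case, p = cond ? p1 : p2: p may point into either set.
 * The inputs s1 and s2 are read but never modified. */
VarSet flat_join(VarSet s1, VarSet s2) {
    return s1 | s2;
}

/* Updating case, *q = src: every variable that q may point to can now
 * also hold anything src points to, so those sets grow in place. */
void store_through(VarSet pts[NVARS], int q, int src) {
    assert(q >= 0 && q < NVARS && src >= 0 && src < NVARS);
    VarSet targets = pts[q];
    VarSet incoming = pts[src];
    for (int v = 0; v < NVARS; v++)
        if (targets & ((VarSet)1 << v))
            pts[v] |= incoming;
}
```

With points_to(p1) = {a, b} and q1 pointing to p1, calling `store_through` for “*q1 = some_ptr” grows points_to(p1) to {a, b} U points_to(some_ptr), while `flat_join` leaves its inputs untouched.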
Software Pipelining on Multi-Core Architectures – A Brief Introduction • Problem description • Software toolchain • Where Open64 helped • Some results
Problem Description • Software-pipelining on multi-threaded architectures • Single-dimension Software-Pipelining (SSP) • Workload distribution • Data communication • Data synchronization
What Open64 features are used in multi-core software pipelining • Multi-dimensional dependence analysis • WHIRL clean interface • Machine model • Reservation tables • Register allocation • Modulo-scheduler • Code generator • No need to implement everything to test • Clean code despite lack of documentation!
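To make the reservation-table and modulo-scheduler items concrete, here is a minimal sketch of a modulo reservation table and the resource-constrained lower bound on the initiation interval (II). It is a generic textbook construction, not Open64's code: the two resource classes, the table layout, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_II 16
#define NUM_RES 2   /* hypothetical resource classes: 0 = ALU, 1 = memory port */

typedef struct {
    int ii;                      /* initiation interval of the schedule */
    int used[MAX_II][NUM_RES];   /* units already reserved per modulo slot */
    int capacity[NUM_RES];       /* units available per cycle */
} ModuloTable;

void mrt_init(ModuloTable *t, int ii, const int capacity[NUM_RES]) {
    assert(ii > 0 && ii <= MAX_II);
    memset(t, 0, sizeof *t);
    t->ii = ii;
    for (int r = 0; r < NUM_RES; r++)
        t->capacity[r] = capacity[r];
}

/* Try to reserve one unit of resource res at the given cycle.
 * Cycles that are congruent modulo II share a slot, because iterations
 * start every II cycles. Returns 1 on success, 0 on a conflict. */
int mrt_reserve(ModuloTable *t, int cycle, int res) {
    assert(cycle >= 0 && res >= 0 && res < NUM_RES);
    int slot = cycle % t->ii;
    if (t->used[slot][res] >= t->capacity[res])
        return 0;
    t->used[slot][res]++;
    return 1;
}

/* Resource-constrained lower bound on II: for each resource class,
 * ceil(uses per iteration / units available), maximized over classes. */
int res_mii(const int uses[NUM_RES], const int capacity[NUM_RES]) {
    int mii = 1;
    for (int r = 0; r < NUM_RES; r++) {
        assert(capacity[r] > 0);
        int need = (uses[r] + capacity[r] - 1) / capacity[r];
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

A scheduler would typically start at the lower bound and retry with a larger II whenever `mrt_reserve` reports a conflict that cannot be resolved by moving the operation to another cycle.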