
An IMplicitly PArallel Compiler Technology Based on Phoenix

For thousand-core microprocessors. Wen-mei Hwu, with Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro, Lumetta, Frank, Patel. University of Illinois, Urbana-Champaign.


Presentation Transcript


  1. An IMplicitly PArallel Compiler Technology Based on Phoenix, for thousand-core microprocessors. Wen-mei Hwu, with Ryoo, Ueng, Rodrigues, Lathara, Kelm, Gelado, Stone, Yi, Kidd, Barghsorkhi, Mahesri, Tsao, Stratton, Navarro, Lumetta, Frank, Patel. University of Illinois, Urbana-Champaign

  2. Background
  • Academic compiler research infrastructure is a tough business
    • IMPACT, Trimaran, and ORC for VLIW and Itanium processors
    • Polaris and SUIF for multiprocessors
    • LLVM for portability and safety
  • In 2001, the IMPACT team moved into many-core compilation with MARCO FCRC funding
    • A new implicitly parallel programming model that balances the burden on programmers and the compiler in parallel programming
    • Infrastructure work has slowed down ground-breaking work
  • Timely visit by the Phoenix team in January 2007
    • Rapid progress has since been taking place
    • Future IMPACT research will be built on Phoenix

  3. Big Picture: The Next Software Challenge
  • Today, multi-core chips make more effective use of area and power than large ILP CPUs
  • Scaling from 4-core to 1000-core chips could happen in the next 15 years
  • All semiconductor market domains are converging on concurrent system platforms: PCs, game consoles, mobile handsets, servers, supercomputers, networking, etc.
  • We need to make these systems effectively execute valuable, demanding apps.

  4. The Compiler Challenge
  To meet this challenge, the compiler must:
  • Allow simple, effective control by programmers
  • Discover and verify parallelism
  • Eliminate tedious efforts in performance tuning
  • Reduce the testing and support cost of parallel programs
  "Compilers and tools must extend the human's ability to manage parallelism by doing the heavy lifting."

  5. An Initial Experimental Platform
  [Chart: GFLOPS over time. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
  • A quiet revolution and potential build-up
  • Calculation: 450 GFLOPS vs. 32 GFLOPS
  • Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  • Until last year, programmed through graphics APIs
  • A GPU in every PC and workstation: massive volume and potential impact

  6. GeForce 8800
  [Block diagram: Host, Input Assembler, Thread Execution Manager, an array of streaming multiprocessors with Parallel Data Caches and Texture units, Load/Store units, Global Memory]
  16 highly threaded SMs, >128 FPUs, 450 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU

  7. Some Hand-Coded Results [HKR HotChips-2007]

  8. Computing Q: Performance
  • CPU (V6): 230 MFLOPS
  • GPU (V8): 96 GFLOPS
  • Speedup: 446x

  9. Lessons Learned
  • Parallelism extraction requires global understanding: most programmers only understand parts of an application
  • Algorithms need to be re-designed: programmers benefit from a clear view of the algorithmic effect on parallelism
  • Real but rare dependences often need to be ignored: because of error-checking code and the like, parallel code is often not equivalent to the sequential code
  • Getting more than a small speedup over sequential code is very tricky: ~20 versions were typically tried per application to move away from architecture bottlenecks
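To make the rare-dependence point concrete, here is a minimal C++ sketch (invented for this writeup, not taken from the applications above): the error-checking early exit creates a real but rarely exercised dependence, so an equivalence-preserving compiler cannot parallelize the loop even though the common case is fully parallel.

```cpp
#include <cstddef>

// Each iteration is independent except for the error-handling path,
// which exits early. That rare 'return' makes the loop formally
// sequential even though almost every run never takes it.
bool scale_samples(float* data, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0.0f) {   // rare, input-validation dependence
            return false;       // early exit serializes the loop
        }
        data[i] *= gain;        // the actual, fully parallel work
    }
    return true;
}
```

Ignoring the early exit and recovering in the rare case where it fires is exactly the kind of transformation the speculative support discussed on the later slides is meant to make safe.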

  10. Implicitly Parallel Programming Flow
  [Flow diagram]
  • Stylized C/C++ or DSL with assertions
  • Deep analysis with feedback assistance from the human: concurrency discovery
  • Visualizable concurrent form (for increased composability)
  • Systematic search for the best/correct code generation: code-gen space exploration
  • Visualizable sequential assembly code with parallel annotations (for increased scalability)
  • Parallel execution with sequential semantics: parallel HW with sequential state generation, plus a debugger (for increased supportability)
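As an illustration of what "stylized C/C++ with assertions" could look like, here is a hedged sketch; ASSERT_NO_ALIAS and saxpy are names invented for this example, not an actual IMPACT or Phoenix interface. The assertion conveys programmer knowledge (the buffers do not overlap) that the deep analysis would otherwise have to prove, making the loop's iterations verifiably independent.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical programmer assertion: the two buffers never overlap.
// In an implicitly parallel flow this claim would be verified or
// speculated on; here it is modeled as a runtime check for illustration.
#define ASSERT_NO_ALIAS(a, b, n) assert((a) + (n) <= (b) || (b) + (n) <= (a))

void saxpy(float* y, const float* x, std::size_t n, float a) {
    ASSERT_NO_ALIAS(y, x, n);             // programmer-supplied knowledge
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];                 // iterations are now independent
    }
}
```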

  11. Key Ideas
  • Deep program analyses that extend programmer and DSE knowledge for parallelism discovery: key to reduced programmer parallelization effort
  • Exclusion of infrequent but real dependences using HW STU (Speculative Threading with Undo) support: key to successful parallelization of many real applications
  • Rich program information maintained in the IR for access by tools and HW: key to integrating multiple programming models and tools
  • Intuitive, visual presentation to programmers: key to good programmer understanding of algorithm effects
  • Managed search of the parallel execution arrangement space: key to reduced programmer performance-tuning effort
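A rough software-eye view of the STU idea, continuing the earlier error-checking example; stu_begin/stu_commit/stu_abort are placeholder names invented for this sketch (stubbed so it compiles and stays correct), not a real ISA or Phoenix interface. With real hardware undo support, the speculative path would run the loop in parallel and roll back its stores in the rare case where the excluded dependence actually fires.

```cpp
#include <cstddef>

// Placeholder STU primitives. The stub never speculates (stu_begin returns
// false) because without hardware undo there is nothing to roll back;
// real support would buffer speculative stores and discard them on abort.
static bool stu_begin()  { return false; }
static void stu_commit() {}
static void stu_abort()  {}

bool scale_samples_spec(float* data, std::size_t n, float gain) {
    if (stu_begin()) {
        // Speculative fast path: assume the rare error case never fires,
        // so the iterations are independent and can run across threads.
        bool ok = true;
        for (std::size_t i = 0; i < n && ok; ++i) {
            if (data[i] < 0.0f) ok = false;   // excluded dependence fired
            else                data[i] *= gain;
        }
        if (ok) { stu_commit(); return true; }
        stu_abort();   // undo speculative stores, fall back below
    }
    // Non-speculative path with the original, fully ordered semantics.
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] < 0.0f) return false;
        data[i] *= gain;
    }
    return true;
}
```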

  12. Parallelism in Algorithms (H.263 motion estimation example)
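The slide itself is a figure; as a hedged illustration of where the parallelism lies in motion estimation, here is a simplified full-search SAD kernel (the 16x16 block size, +/-16 search range, and all names are assumptions made for this sketch, not the encoder's actual code). Every macroblock's search reads only the current and reference frames, so the searches for different macroblocks, and the candidate vectors within one search, are independent of one another.

```cpp
#include <cstdlib>
#include <limits>

// Sum of absolute differences for one 16x16 macroblock.
// Frame-boundary handling is omitted to keep the sketch short.
int sad_16x16(const unsigned char* cur, const unsigned char* ref, int stride) {
    int sad = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sad += std::abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

// Full search over a +/-16 pixel window for the best motion vector.
// Each candidate (dx, dy) is evaluated independently, and each macroblock
// calls this routine independently: both loops are sources of parallelism.
void best_match(const unsigned char* cur, const unsigned char* ref,
                int stride, int& best_dx, int& best_dy) {
    int best = std::numeric_limits<int>::max();
    for (int dy = -16; dy <= 16; ++dy) {
        for (int dx = -16; dx <= 16; ++dx) {
            int s = sad_16x16(cur, ref + dy * stride + dx, stride);
            if (s < best) { best = s; best_dx = dx; best_dy = dy; }
        }
    }
}
```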

  13. MPEG-4 H.263 Encoder: Parallelism Rediscovery [figure panels (a)–(e)]

  14. Code Gen Space Exploration

  15. Moving an Accurate Interprocedural Analysis into Phoenix [figure: unification-based analysis vs. Fulcra]

  16. Getting Started with Phoenix
  • Meetings with the Phoenix team in January 2007
  • Determined the set of Phoenix API routines necessary to support IMPACT analyses and transformations
  • Received a custom build of Phoenix that supports full type information

  17. Fulcra to Phoenix – Action!
  A four-step process:
  1. Convert IMPACT's data structures to Phoenix equivalents, and from C to C++/CLI.
  2. Create the initial constraint graph using Phoenix's IR instead of IMPACT's IR (a sketch follows below).
  3. Convert the solver (the pointer analysis itself). This consists of porting from C to C++/CLI and dealing with any changes to the ported Fulcra data structures.
  4. Annotate the points-to information back into Phoenix's alias representation.
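As a rough illustration of what the constraint graph and solver steps involve, here is a self-contained C++ sketch of inclusion-style points-to constraints for a few toy statements; the representation and names are invented for this example and do not reflect Fulcra's or Phoenix's actual data structures or algorithms.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy points-to constraints built from straight-line statements:
//   p = &x   ->  seed: x is in pts(p)
//   p = q    ->  copy edge: pts(q) flows into pts(p)
using PointsTo = std::map<std::string, std::set<std::string>>;
struct CopyEdge { std::string src, dst; };

int main() {
    PointsTo pts;
    std::vector<CopyEdge> copies;

    pts["p"].insert("x");              // p = &x
    pts["q"].insert("y");              // q = &y
    copies.push_back({"q", "p"});      // p = q

    // Solver: propagate copy edges to a fixed point.
    bool changed = true;
    while (changed) {
        changed = false;
        for (const CopyEdge& e : copies)
            for (const std::string& t : pts[e.src])
                changed |= pts[e.dst].insert(t).second;
    }

    // Final points-to sets; a real port would annotate these back
    // into the compiler's alias representation instead of printing.
    for (const auto& [var, targets] : pts) {
        std::cout << var << " ->";
        for (const std::string& t : targets) std::cout << ' ' << t;
        std::cout << '\n';
    }
    return 0;
}
```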

  18. Phoenix Support Wish List
  • Access to code across file boundaries
    • LTCG
    • Access to multiple files within a pass
  • Full (source-code-level) type information
  • Feeding results from Fulcra back to Phoenix
    • Need more information on the Phoenix alias representation
  • In the long run, a highly extensible IR and API for Phoenix

  19. Conclusion
  • Compiler research for many-cores will require a very high-quality infrastructure with strong engineering support: new language extensions, new user models, new functionalities, new analyses, new transformations
  • We chose Phoenix based on its robustness, features, and engineering support
  • Our current industry partners are also moving to Phoenix
  • We also plan to share our advanced extensions with other academic Phoenix users
