Enhancing Parallel Code Generation with Domain-Specific High-Level Runtime Support

Exploiting Domain-Specific High-level Runtime Support for Parallel Code Generation
Xiaogang Li, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Motivation
• Languages, compilers, and runtime systems for high-end computing
  • Typically focus on scientific applications
  • Can commercial applications benefit?
• A majority of top 500 parallel configurations are used as database servers
• Is there a role for parallel systems research?
  • Parallel relational databases: probably not
  • Data mining, OLAP, decision support: quite likely

Data Mining
• Extracting useful models or patterns from large datasets
• Includes a variety of tasks (mining associations and sequences, clustering data, building decision trees and predictive models), with several algorithms proposed for each
• Both compute and data intensive
• Algorithms are well suited for parallel execution
• High-level interfaces can be useful for application development

Project Overview

Project Components
• A middleware system called FREERIDE (Framework for Rapid Implementation of Datamining Engines) (SDM 01, SDM 02)
• Performance modeling and prediction for parallelization strategy selection (SIGMETRICS 2002)
• Runtime and compiler support for shared memory parallelization (LCPC 02)
• Translation from mining operators (not yet)
• This talk focuses on language and compiler support for distributed memory parallelization

Common Processing Structure
• Structure of common data mining algorithms
    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }
• Applies to major association mining, clustering, and decision tree construction algorithms
• Parallelization approach
  • Compute local copy of reduction objects
  • Perform global reduction
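The reduction loop and the two-step parallelization approach above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `ReductionSketch`, the choice of `process(e)` (bucket index = e mod number of buckets, value = 1), and the use of `+` as `op` are all hypothetical and not taken from the talk. Nodes are simulated sequentially as separate chunks.

```java
import java.util.Arrays;

public class ReductionSketch {
    // Local reduction: each node updates its own private copy of the reduction object.
    // Assumed process(e): index i = e mod buckets, val = 1 (a simple histogram).
    public static int[] localReduce(int[] chunk, int buckets) {
        int[] reduc = new int[buckets];
        for (int e : chunk) {
            int i = Math.floorMod(e, buckets);
            reduc[i] = reduc[i] + 1;          // Reduc(i) = Reduc(i) op val, with op = +
        }
        return reduc;
    }

    // Global reduction: combine the local copies element by element with the same op.
    public static int[] globalReduce(int[][] locals, int buckets) {
        int[] result = new int[buckets];
        for (int[] local : locals) {
            for (int i = 0; i < buckets; i++) {
                result[i] += local[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three simulated nodes, each holding one chunk of the input.
        int[][] chunks = { {0, 1, 2, 3}, {4, 5, 6}, {7, 8} };
        int[][] locals = new int[chunks.length][];
        for (int n = 0; n < chunks.length; n++) {
            locals[n] = localReduce(chunks[n], 3);
        }
        System.out.println(Arrays.toString(globalReduce(locals, 3))); // prints [3, 3, 3]
    }
}
```

Because `op` is associative and commutative, the combined result does not depend on how the elements are split across nodes or on the order in which local copies are merged.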

Middleware Support for Distributed Memory Parallelization
• Interface requires:
  • Specification of an iterator and termination condition
  • Local reduction for each parallel loop
  • Global reduction for each loop
• Functionality
  • Fetch data elements chunk by chunk, apply local reduction
  • Broadcast the reduction object after finishing one pass on data
  • Perform global reduction, broadcast the results
  • Check termination condition, move to next iteration
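One way to picture the interface and the per-iteration functionality listed above is the following single-process Java sketch. It is not FREERIDE's actual API: the interface `ReductionTask`, its method names, the driver `run`, and the example task `TwoMeans` (a 1-D clustering with two centers) are all hypothetical names introduced here. Nodes are simulated by looping over arrays, so the broadcast steps are implicit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MiddlewareSketch {
    /** Hypothetical interface; an assumption for illustration, not FREERIDE's API. */
    public interface ReductionTask<R> {
        R newReductionObject();
        void localReduce(R obj, double[] chunk);   // local reduction on one chunk
        void globalReduce(R into, R other);        // combine two reduction objects
        boolean finishIteration(R global);         // termination check; true when done
    }

    /** Driver: one pass over the data per iteration; returns the number of passes. */
    public static <R> int run(ReductionTask<R> task, double[][] nodeData, int chunkSize) {
        int passes = 0;
        boolean done = false;
        while (!done) {
            List<R> locals = new ArrayList<>();
            for (double[] data : nodeData) {       // each simulated node
                R local = task.newReductionObject();
                for (int s = 0; s < data.length; s += chunkSize) {   // chunk by chunk
                    int end = Math.min(s + chunkSize, data.length);
                    task.localReduce(local, Arrays.copyOfRange(data, s, end));
                }
                locals.add(local);                 // stands in for the broadcast
            }
            R global = task.newReductionObject();
            for (R local : locals) task.globalReduce(global, local);
            done = task.finishIteration(global);
            passes++;
        }
        return passes;
    }

    /** Example task: reduction object is {sum0, count0, sum1, count1}. */
    public static class TwoMeans implements ReductionTask<double[]> {
        public double c0, c1;
        public TwoMeans(double c0, double c1) { this.c0 = c0; this.c1 = c1; }
        public double[] newReductionObject() { return new double[4]; }
        public void localReduce(double[] obj, double[] chunk) {
            for (double x : chunk) {
                int k = Math.abs(x - c0) <= Math.abs(x - c1) ? 0 : 1;  // nearest center
                obj[2 * k] += x;
                obj[2 * k + 1] += 1;
            }
        }
        public void globalReduce(double[] into, double[] other) {
            for (int i = 0; i < into.length; i++) into[i] += other[i];
        }
        public boolean finishIteration(double[] g) {
            double n0 = g[1] > 0 ? g[0] / g[1] : c0;
            double n1 = g[3] > 0 ? g[2] / g[3] : c1;
            boolean done = (n0 == c0) && (n1 == c1);   // centers stopped moving
            c0 = n0;
            c1 = n1;
            return done;
        }
    }

    public static void main(String[] args) {
        TwoMeans task = new TwoMeans(0.0, 5.0);
        int passes = run(task, new double[][] {{1, 2, 3}, {10, 11, 12}}, 2);
        System.out.println(passes + " passes, centers " + task.c0 + " and " + task.c1);
    }
}
```

The point of the split is that the application supplies only the four callbacks, while chunked data access, combination of reduction objects, and the iteration loop stay inside the driver.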

Compilation Approach
• Support a general high-level language
• Use middleware functionality in compilation
• Exploit the domain-specific common structure
  • Reduction loop with associative and commutative operations
  • Disk-resident input datasets, smaller output

Language Support
• A data parallel dialect of Java that gives the compiler information about independent collections of objects, parallel loops, and reduction operations
  • domain and rectdomain
  • foreach loop
  • reduction variables:
    - can only be updated inside a foreach loop, by operations that are associative and commutative
    - intermediate values of the reduction variables may not be used within the loop, except for self-updates

Example code
public class kNN {
    // kbuffer is the reduction variable
    static buffer kbuffer;
    public static void main(String[] args) {
        double dis;
        Point<3> lowend = …
        Point<3> hiend = …
        Point<3> p;
        RectDomain<3> InputDomain = [lowend:hiend];
        kPoint[3d] Input = new kPoint[InputDomain];
        // data parallel loop over all input points
        foreach (p in InputDomain) {
            if (Input[p].inRange(R)) {
                dis = Input[p].distance(W);
                // the only update to the reduction variable inside the loop
                kbuffer.insert(Input[p], dis);
            }
        }
    }
}

Compilation Task
• Extract local reduction function
  • Simple: taken from the body of the data parallel loop
• Extract an iterator and termination condition
  • Simple: taken from the overall code
• Extract a global reduction function
  • Can be quite challenging in the presence of complex control flow and data structures
  • A new algorithm developed

Extracting Global Reduction from Local Reduction: Motivating Example
for (j = 0; j < k; j++) {
    I = k - 1;
    while ((buf.dis[j] < distance) && I >= 0) {
        if (I > 0) {
            x1[I] = x1[I-1];
            x2[I] = x2[I-1];
            …
        }
        I = I - 1;
    }
    if (I < k-1) {
        x1[I+1] = buf.x1[j];
        x2[I+1] = buf.x2[j];
        …
    }
}

I = k - 1;
while ((newdis < distance) && I >= 0) {
    if (I > 0) {
        x1[I] = x1[I-1];
        x2[I] = x2[I-1];
        …
    }
    I = I - 1;
}
if (I < k-1) {
    x1[I+1] = kpoint.x1;
    x2[I+1] = kpoint.x2;
    …
}

I = k - 1;
while ((kpoint.dis < distance) && I >= 0) {
    if (I > 0) {
        x1[I] = x1[I-1];
        x2[I] = x2[I-1];
        …
    }
    I = I - 1;
}
if (I < k-1) {
    x1[I+1] = kpoint.x1;
    x2[I+1] = kpoint.x2;
    …
}
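The relationship between the fragments in this example can be sketched in plain Java: the local reduction inserts one point into a sorted buffer of the k nearest, and the global reduction replays that same insertion for every entry of another node's buffer. The class `KnnSketch`, the `KBuffer` type, and the methods `insert`, `merge`, `distance`, and `localReduce` are hypothetical names for illustration, not the compiler's generated code. The sketch assumes k is at least 1.

```java
import java.util.Arrays;

public class KnnSketch {
    /** Hypothetical stand-in for the slide's buffer: the k smallest distances, kept sorted. */
    public static class KBuffer {
        public final double[] dis;     // distances in ascending order
        public final double[][] pts;   // points, in the same order as dis
        public int size = 0;

        public KBuffer(int k) {
            dis = new double[k];
            pts = new double[k][];
        }

        /** Local reduction step: insert one candidate, shifting larger entries right. */
        public void insert(double[] p, double d) {
            if (size == dis.length && d >= dis[size - 1]) return;  // not among the k nearest
            int i = (size < dis.length) ? size++ : size - 1;       // free slot, or evict farthest
            while (i > 0 && dis[i - 1] > d) {
                dis[i] = dis[i - 1];
                pts[i] = pts[i - 1];
                i--;
            }
            dis[i] = d;
            pts[i] = p;
        }

        /** Global reduction: apply the same insertion to each entry of another buffer. */
        public void merge(KBuffer buf) {
            for (int j = 0; j < buf.size; j++) {
                insert(buf.pts[j], buf.dis[j]);
            }
        }
    }

    /** Euclidean distance between two points of equal dimension. */
    public static double distance(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    /** One node's pass over its own points. */
    public static KBuffer localReduce(double[][] points, double[] query, int k) {
        KBuffer buf = new KBuffer(k);
        for (double[] p : points) buf.insert(p, distance(p, query));
        return buf;
    }

    public static void main(String[] args) {
        double[] w = {0, 0, 0};
        KBuffer a = localReduce(new double[][] {{0, 0, 0}, {3, 0, 0}}, w, 2);
        KBuffer b = localReduce(new double[][] {{1, 0, 0}, {10, 0, 0}}, w, 2);
        a.merge(b);
        System.out.println(Arrays.toString(a.dis)); // prints [0.0, 1.0]
    }
}
```

Here the merge is written by hand; the talk's contribution is deriving such a function automatically from the local reduction code.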

Overall Approach
• Classify each assignment to a data member of the reduction object into one of the following types:
  • O.x = g(e), where e is the input element
  • O.x = O.x op g(e), where op is an associative and commutative operator
  • An expression involving loop constants and other members of the reduction object
• Classify control dependence on any of the above assignment statements as:
  • Loop constant
  • Non-loop constant

Code Generation: Handling Different Types of Assignment Statements
• Three types of assignment statements:
  • O.x = g(e) (Type a)
    If x can represent many fields, iterate over all of them
  • O.x = O.x op g(e) (Type b)
    Replace by O.x = O.x op O1.x
    If x can represent many fields, iterate over all of them
  • Expression involving loop constants and other data members (Type c)
    Keep as it is
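A small Java sketch may help relate the three assignment types to the global reduction they lead to. The class `AssignmentTypesSketch`, the `Stats` reduction object, and its fields are hypothetical. Two assumptions are made that the slide does not spell out: O1 is taken to be the other node's reduction object, and a Type (a) assignment is assumed to become a copy from O1. The sketch also assumes every node has processed at least one element.

```java
public class AssignmentTypesSketch {
    /** Hypothetical reduction object O. */
    public static class Stats {
        public int dim;          // set by a Type (a) assignment
        public double[] sum;     // Type (b), x stands for many fields (one per dimension)
        public long count;       // Type (b)
        public double[] mean;    // Type (c), computed from other members

        public Stats(int d) {
            sum = new double[d];
            mean = new double[d];
        }
    }

    /** Local reduction for one input element e. */
    public static void localReduce(Stats o, double[] e) {
        o.dim = e.length;                                        // Type (a): O.x = g(e)
        for (int d = 0; d < e.length; d++) o.sum[d] += e[d];     // Type (b): O.x = O.x op g(e)
        o.count += 1;                                            // Type (b), with g(e) = 1
        for (int d = 0; d < e.length; d++) {
            o.mean[d] = o.sum[d] / o.count;                      // Type (c)
        }
    }

    /** Global reduction written by following the rules above; o1 plays the role of O1. */
    public static void globalReduce(Stats o, Stats o1) {
        o.dim = o1.dim;                                          // Type (a): assumed copy from O1
        for (int d = 0; d < o.sum.length; d++) {
            o.sum[d] += o1.sum[d];                               // Type (b): O.x = O.x op O1.x
        }
        o.count += o1.count;                                     // Type (b)
        for (int d = 0; d < o.sum.length; d++) {
            o.mean[d] = o.sum[d] / o.count;                      // Type (c): kept as it is
        }
    }

    public static void main(String[] args) {
        Stats a = new Stats(2);
        localReduce(a, new double[] {1, 2});
        localReduce(a, new double[] {3, 4});
        Stats b = new Stats(2);
        localReduce(b, new double[] {5, 6});
        globalReduce(a, b);
        System.out.println(a.count + " elements, mean " + a.mean[0] + ", " + a.mean[1]);
    }
}
```

Note how the Type (c) statement is identical in both functions, while the Type (b) statements differ only in reading from `o1` instead of the input element.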

Handling Control Flow
• Control predicates for Type (b) assignments:
  • Remove non-loop constant control predicates
  • Keep loop constant control predicates
• Control predicates for Type (a) and Type (c) statements:
  • Keep loop constant control predicates
  • Classify non-loop constant predicates into two types:
    • Predicate involves a value that is assigned to a data member: replace that value by the data member
    • Other predicates: simply remove
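The predicate rules above can be illustrated with a hand-written pair of functions in Java. Everything here is a hypothetical example, not output of the compiler: the class `ControlFlowSketch`, the reduction object `Red`, and the parameters `query`, `radius`, and `trackNearest` are assumed names. As before, `o1` is assumed to be the other node's reduction object.

```java
public class ControlFlowSketch {
    /** Hypothetical reduction object. */
    public static class Red {
        public double minDis = Double.POSITIVE_INFINITY;  // smallest distance seen so far
        public double nearest = Double.NaN;               // element at that distance
        public long inRange = 0;                          // elements within the radius
    }

    /** Local reduction for one 1-D element e; query, radius, trackNearest are loop constants. */
    public static void localReduce(Red o, double e, double query, double radius,
                                   boolean trackNearest) {
        double dis = Math.abs(e - query);
        if (dis < radius) {            // non-loop constant predicate (depends on e)
            o.inRange += 1;            // Type (b)
        }
        if (trackNearest) {            // loop constant predicate
            if (dis < o.minDis) {      // non-loop constant; dis is assigned to o.minDis
                o.minDis = dis;        // Type (a)
                o.nearest = e;         // Type (a)
            }
        }
    }

    /** Global reduction obtained by applying the rules by hand. */
    public static void globalReduce(Red o, Red o1, boolean trackNearest) {
        o.inRange += o1.inRange;           // Type (b): non-loop constant predicate removed
        if (trackNearest) {                // loop constant predicate kept
            if (o1.minDis < o.minDis) {    // dis replaced by the data member it is assigned to
                o.minDis = o1.minDis;
                o.nearest = o1.nearest;
            }
        }
    }

    public static void main(String[] args) {
        Red a = new Red();
        Red b = new Red();
        for (double e : new double[] {1, 8}) localReduce(a, e, 5, 3.5, true);
        for (double e : new double[] {4.5, 20}) localReduce(b, e, 5, 3.5, true);
        globalReduce(a, b, true);
        System.out.println(a.inRange + " in range, nearest " + a.nearest);
    }
}
```

In this sketch the merged object matches what a single pass over all four elements would produce, which is the property the transformation is meant to preserve.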

Experimental Platform
Cluster of Workstations
• Sun Ultra Enterprise 450
• 250 MHz Ultra-II processors
• 1 GB of 4-way interleaved main memory
• Myrinet as the interconnect

Results from k-means clustering
1 GB dataset with 3-dimensional points, k = 3

Results from Apriori Association Mining
3 GB dataset

Results from k-nearest neighbors
1 GB dataset with 3-dimensional points, k = 100

Summary
• Focus on a new class of applications
• Exploit the common structure within the class
• Develop a runtime system supporting this structure
• Use it as a compiler target
• Very simple compiler implementation (< 1000 lines of code)
• A new algorithm for synthesizing global reduction functions
• Performance of compiler-generated code is very competitive
