Compiling R for Performance in Bioinformatics Applications

John Garvin, John Mellor-Crummey, Bradley Broom, Ken Kennedy
{garvin,johnmc,broom,ken}@cs.rice.edu

The R Language
• For statistical computations
• Widely used in bioinformatics
• Variants: S, S-PLUS
• Open source (GPL)
• Similar to Matlab, Mathematica, Octave, Ellpack
• Interpreted, high-level

R Advantages
• Intuitive programming
• Quick turnaround time
• Convenient domain-specific libraries
• A few lines of R code can replace a page of C code
• No CS degree required!

R Disadvantage
• Poor performance
• Big reason: interpreted, not compiled
• No whole program to optimize
• Researchers must painstakingly rewrite in C or Fortran for performance

Goal
• Turn the R interpreter into an optimizing R compiler
• Implement full language features while achieving good performance

Problem
• R code from M.D. Anderson Cancer Center
• Experiment design problem
• 1000-patient trial
• Discover when results are meaningful
• In R interpreter: matter of minutes
• Hand coding in C: matter of seconds

Approach
• First: translate R to C
• Next: compiler optimizations on C code

Part 1: R to C Compiler
• Related to multi-staging (Taha), partial evaluation
• Generate C code that performs the same actions as the R interpreter
• Reverts to interpreting parse tree when necessary
• Goal: implement full R language
• Integrate with interpreter infrastructure
• Interpreted R code can call compiled code and vice versa
• With code, optimizations are possible

Part 2: Optimization
• Telescoping Languages (Kennedy)
• Specialization
• Domain-specific libraries
• Open64 infrastructure
• Useful for specifying transformations
• Advanced profiling (Froyd)

Status
As of 10/9/2003:
• Plain compilation into C: done
• Same speed as interpreter
• Next: optimizations
• Preliminary tests show potential

Future Optimizations
• Improve allocation
• LISP-like lists
• Reduce vector allocation
• Type specialization (McCosh)
• Matrix size, shape analysis
• Slice hoisting (Chauhan)
• Combine allocations
• Control flow
• Interpreter: returns are jumps
• More detailed control flow
• Variable definition and lookup
• Explicit environments
• Lookup especially expensive
• Use target language

Conclusion
• Optimizing compiler for R is possible
• Compiled R can enable large productivity gains

Acknowledgements
• Special thanks to Arun Chauhan, Nathan Froyd, Cheryl McCosh, the people at M.D. Anderson, and Walid Taha
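
The "Variable definition and lookup" item under Future Optimizations can be illustrated with a small C sketch. It is not taken from the compiler described in the slides: the `Binding` and `Env` types and the functions `env_lookup`, `sum_via_env`, and `sum_via_locals` are hypothetical names for a toy model. The first function mimics code that reproduces the interpreter's behavior, resolving every variable by name through a chain of explicit environments; the second shows what "use target language" might look like, with the same variables held in plain C locals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model (an assumption, not R's real data structures): an environment
   is a linked list of name/value bindings plus a pointer to its parent. */
typedef struct Binding {
    const char *name;
    double value;
    struct Binding *next;
} Binding;

typedef struct Env {
    Binding *bindings;
    struct Env *parent;
} Env;

/* Interpreter-style lookup: walk each frame outward, comparing names as
   strings. Returns NULL when the name is not bound anywhere. */
double *env_lookup(Env *env, const char *name) {
    for (Env *e = env; e != NULL; e = e->parent) {
        for (Binding *b = e->bindings; b != NULL; b = b->next) {
            if (strcmp(b->name, name) == 0) {
                return &b->value;
            }
        }
    }
    return NULL;
}

/* Sums 1..n with every read and write of `total` and `i` going through
   the environment, as plain translated code might do. Both names must
   already be bound somewhere in the chain. */
double sum_via_env(Env *env, int n) {
    assert(env_lookup(env, "total") != NULL);
    assert(env_lookup(env, "i") != NULL);
    *env_lookup(env, "total") = 0.0;
    for (int k = 1; k <= n; k++) {
        *env_lookup(env, "i") = (double)k;
        *env_lookup(env, "total") += *env_lookup(env, "i");
    }
    return *env_lookup(env, "total");
}

/* The same computation with the variables mapped onto C locals, so no
   name-based lookup happens inside the loop. */
double sum_via_locals(int n) {
    double total = 0.0;
    for (int k = 1; k <= n; k++) {
        double i = (double)k;
        total += i;
    }
    return total;
}
```

Both functions compute the same sum; the difference is that `sum_via_env` performs several string-comparing list walks per iteration, while `sum_via_locals` leaves the variables where a C compiler can keep them in registers. A real compiler could only make this substitution where it can show that no other code observes the environment, which this sketch does not attempt to model.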

