High Performance Computing Research at Berkeley

Presentation Transcript

  1. High Performance Computing Research at Berkeley Katherine Yelick U. C. Berkeley, EECS Dept. and Lawrence Berkeley National Laboratory November 2004

  2. Major Research Areas in HPC at Berkeley
  • Programming Languages & Compilers
  • Performance Analysis, Modeling, Tuning
  • Algorithms & Libraries
  • Reconfigurable Hardware
  • Applications
  • Many of these are collaborations
    • With Lawrence Berkeley Lab scientists
    • With application scientists across campus, the lab, and elsewhere
  High Performance Computing at Berkeley

  3. Challenges to Performance
  • Parallel machines are too hard to program
    • Users are "left behind" with each new major generation
    • The drop in market size also affects those left in it
  • Efficiency is too low and dropping
    • Single-digit efficiency numbers are common
    • Even in the Top500, fewer than 15% of systems achieve more than 80% efficiency
  • Two trends in high-end computing
    • Increasingly complicated systems
    • Increasingly sophisticated algorithms
  • A deep understanding of performance at all levels is important

  4. Global Address Space Programming
  • Best of shared memory and message passing
    • Ease of shared memory
    • Performance of message passing (or better)
  • Examples are UPC, Titanium, CAF, and Split-C
  [Figure: the global address space. Object heaps, reached through global pointers (g), are shared; program stacks and local pointers (l) are private to each thread]
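The partitioned global address space idea above can be sketched as follows. This is an illustration, not UPC itself: every element of a shared array has affinity to one thread, so a global index maps to an (owner thread, local offset) pair. The block distribution and names here are assumptions for the example.

```python
# Minimal sketch of PGAS addressing: a global index into a shared array
# resolves to an owning thread plus a local offset.  Block distribution
# is assumed; real languages offer several distributions.

def owner(global_idx, n_elems, n_threads):
    """Map a global index to (owning thread, local offset) under a
    block distribution of n_elems across n_threads."""
    block = (n_elems + n_threads - 1) // n_threads  # ceiling division
    return global_idx // block, global_idx % block
```

A remote access is then a read or write whose owning thread differs from the accessing thread; local accesses need no communication.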

  5. Three GAS Languages
  Parallel extensions depend on the base language:
  • UPC (parallel extension of C)
    • Consistent with C design
    • Mapping to hardware is explicit
    • Widely used in DoD
  • Titanium (based on JavaTM)
    • Consistent with Java
    • Programmability and safety are primary concerns
      • Bounds checking, exception handling, barrier checking
    • Attractive to recently-trained programmers
  • Co-Array Fortran
    • Array-oriented, builds on Fortran 90

  6. Goals of the Berkeley UPC Project
  • Make UPC ubiquitous on
    • Parallel machines
    • Workstations and PCs for development
    • A portable compiler: for future machines too
  • Components of the research agenda:
    • Language development (ongoing)
    • Compiler optimizations for parallel languages
    • Runtime work for Partitioned Global Address Space (PGAS) languages in general
    • Application demonstrations of UPC
  • An LBNL/UCB collaboration

  7. Where Does Berkeley UPC Run?
  • Runs on SMPs, clusters & supercomputers
  • Supported operating systems:
    • Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS Windows (Cygwin), Mac OS X, Unicos, SuperUX
  • Supported CPUs:
    • x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
  • GASNet communication:
    • Myrinet GM, Quadrics Elan, Mellanox Infiniband VAPI, IBM LAPI, Cray X1, SGI Altix, SHMEM
  • Specific supercomputer platforms:
    • Cray T3E, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000

  8. The UPC Language
  • UPC was developed by researchers from IDA, Berkeley, and LLNL
  • A consortium led by GWU and IDA sets the language standard
  • Ongoing effort to understand application needs
    • Berkeley has been a key player in I/O, collectives, the memory model, and spec issues in general
  • As standard as MPI-2, and more standard than SHMEM
  • Several commercial (HP, Cray, IBM) and open (Berkeley, MTU/HP, Intrepid) compilers
  • Not just a language for Cray or SGI machines

  9. Compiling Explicitly Parallel Code
  • Most compilers are designed for languages with serial semantics
  • Code motion is a critical optimization
    • Compilers move code around
      • Register re-use, instruction scheduling, loop transforms, overlapping communication
    • Hardware dynamically moves operations around
      • Out-of-order processors, network reordering, etc.
  • When is reordering correct?
    • Because the programs are parallel, there are more restrictions, not fewer
    • Have to preserve the semantics of what may be viewed by other processors

  10. Compiler Analysis Overview
  • When compiling sequential programs, compute dependencies:
      x = expr1;          y = expr2;
      y = expr2;   vs.    x = expr1;
    Reordering is valid if y is not in expr1 and x is not in expr2 (roughly)
  • When compiling parallel code, we must also consider accesses by other processors:
      Initially flag = data = 0
      Proc A               Proc B
      data = 1;            while (flag == 0);
      flag = 1;            ... = data;
  Work by Yelick, Krishnamurthy, and Chen
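To make the flag/data example concrete, this sketch (an illustration, not the compiler analysis itself) models shared memory as a dictionary and enumerates the states Proc B could observe under the original write order and a reordered one:

```python
# Simulate why the compiler may NOT swap Proc A's two writes.
# We replay A's writes up to every possible interleaving point and
# record what B's "... = data" would read whenever its spin loop
# (while (flag == 0);) has exited.

def run(writes):
    observed = []
    for cut in range(len(writes) + 1):
        mem = {"data": 0, "flag": 0}
        for var, val in writes[:cut]:
            mem[var] = val
        if mem["flag"] == 1:              # B's spin loop has exited
            observed.append(mem["data"])  # ... = data;
    return observed

original  = [("data", 1), ("flag", 1)]   # program order
reordered = [("flag", 1), ("data", 1)]   # "optimized" order

print(run(original))   # B only ever sees data == 1
print(run(reordered))  # B can see data == 0: the reordering is invalid
```

In a sequential program the two writes are independent and could legally be swapped; with another processor watching, the swap changes observable behavior, which is exactly the extra restriction the slide describes.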

  11. Fast Runtime Support for GAS Languages
  • Layered design: compiler-generated code → compiler-specific runtime → GASNet Extended API → GASNet Core API → network hardware
  • Many networks provide native RDMA support: Infiniband, Quadrics, Cray, Myrinet
  • Technical problems:
    • Some networks require pinning → can read/write only into a pinned area → we use a "firehose" approach to virtualize this
    • Each platform provides different primitives:
      • We use a layered approach for portability
      • A small core is the only requirement for functionality
    • One-sided read/write semantics are a good match, better than send/receive
  Work by Bonachea, Bell, Hargrove, Welcome

  12. Small Message Performance
  [Chart: small-message performance, MPI vs. best GASNet conduit; lower is better]

  13. GASNet vs. MPI on Infiniband
  [Chart: bandwidth for MPI (MVAPI), GASNet (prepinned), and GASNet (not prepinned); higher is better]
  • GASNet significantly outperforms MPI at mid-range sizes
    • The cost is MPI tag matching, inherent in the two-sided model
  • The yellow line shows naïve bounce-buffer pipelining

  14. Applications in UPC
  [Figure: adaptive mesh refinement grids G1-G4 distributed across processors P1 and P2]
  • Problems that are hard (or tedious) in message passing:
    • Fine-grained, asynchronous communication
    • Dynamic load balancing required
  • Three applications:
    • Parallel mesh generation (Husbands, using Shewchuk's Triangle)
    • Adaptive mesh refinement (shown, Welcome)
    • Sparse matrix factorization (Demmel, Husbands, and Li)

  15. Titanium Overview
  An object-oriented language based on Java:
  • Same high-performance parallelism model as UPC: SPMD parallelism in a global address space
  • Emphasis on domain-specific extensions
    • Block-structured grid-based computation
      • Multidimensional arrays
      • Contiguous storage, domain calculus for index operations
    • Sparse matrices and unstructured grids
      • Dynamic communication optimizations
    • Support for small objects
      • General mechanism for examples like complex numbers
    • Semi-automatic memory management
      • Create named "regions" for new and delete
  Joint project with Graham, Hilfinger, and Colella
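The "domain calculus for index operations" can be sketched as rectangular index domains that support intersection, e.g. to compute ghost-cell overlaps between neighboring blocks of a structured grid. The class and method names below are illustrative, not Titanium's actual API:

```python
# Sketch of a rectangular index domain with intersection, the core of a
# domain calculus for block-structured grids.

class RectDomain:
    def __init__(self, lo, hi):
        # lo inclusive, hi exclusive, one entry per dimension
        self.lo, self.hi = tuple(lo), tuple(hi)

    def intersect(self, other):
        """The overlap of two rectangles is itself a rectangle."""
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        return RectDomain(lo, hi)

    def size(self):
        n = 1
        for a, b in zip(self.lo, self.hi):
            n *= max(0, b - a)   # empty if the rectangles miss
        return n
```

Because intersections, shifts, and accretions of rectangles stay rectangular, a compiler or runtime can reason about exactly which indices two blocks share.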

  16. Research in Titanium
  • Some problems common to UPC
    • Analysis of parallel code
    • Lightweight runtime support
    • Memory hierarchy optimizations
  • Automatic deadlock detection for bulk-synchronous code
  • Dynamic communication optimizations
  • Tools for debugging, performance analysis, and program development

  17. Runtime Optimizations
  • Results for sparse matrix-vector multiply
  • Two matrices: random and finite element
  • Titanium versions use:
    • Packing of remote data
    • Sending the entire bounding box
    • A model to select between the two
  • Compared to Fortran MPI (Aztec library)

  18. Heart Simulation in Titanium
  • Large application effort
    • Joint with Peskin & McQueen at NYU
    • Yelick, Givelberg at UCB
    • Part of NSF NPACI
  • Generic framework:
    • Simulation of fluids with immersed elastic structures
    • Many applications in biology and engineering
    • Well-known hard parallelism (locality/load balance)

  19. Berkeley Institute for Performance Studies (BIPS)
  • Newly created joint institute between the lab and campus
  • Goals:
    • Bring together researchers on all aspects of performance engineering
    • Use performance understanding to:
      • Improve application performance
      • Compare architectures for application suitability
      • Influence the design of processors, networks, and compilers
      • Identify algorithmic needs
  National Science Foundation

  20. BIPS Approaches
  • Benchmarking and analysis
    • Measure performance
    • Identify opportunities for improvements in software, hardware, and algorithms
  • Modeling
    • Predict performance on future machines
    • Understand performance limits
  • Tuning
    • Improve performance, by hand or with automatic self-tuning tools

  21. Multi-Level Analysis
  [Diagram: spectrum from microbenchmarks through compact applications to full and next-generation applications, along increasing system size and complexity]
  • Full applications
    • What users want
    • Do not reveal the impact of individual features
  • Compact applications
    • Can be ported with modest effort
    • Easily matched to phases of full applications
  • Microbenchmarks
    • Isolate architectural features
    • Hard to tie to real applications

  22. Projects Within BIPS
  • APEX: Application Performance Characterization Benchmarking (Strohmaier, Shan)
  • BeBOP: Berkeley Benchmarking and Optimization Group (Yelick, Demmel)
  • LAPACK: Linear Algebra Package (Demmel*)
  • LDRD: Architectural Alternatives (Yelick, Hargrove)
  • Modern Vector Architecture (Oliker*)
  • PERC: Performance Engineering Research Center (Bailey, Shan)
  • Top500: Linpack (Strohmaier*)
  • ViVA: Virtual Vector Architectures (Oliker*)
  * with many other collaborators

  23. Vector System Evaluation
  • The US HPC market has been dominated by:
    • Superscalar cache-based architectures
    • Clusters of commodity SMPs, used for cost effectiveness
  • Two architectures offer vector alternatives:
    • The Japanese Earth Simulator
    • The Cray X1
  • Ongoing study of DOE applications on these systems
  Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, S.

  24. Architectural Comparison
  • Custom vector architectures have high memory bandwidth relative to peak
  • Tightly integrated networks result in lower latency (Altix)
  • Bisection bandwidth depends on topology
    • The Earth Simulator also dominates here
  • A key 'balance point' for vector systems is the scalar:vector ratio

  25. Applications Studied
  Chosen with the potential to run at ultrascale:
  • CACTUS (Astrophysics, 100,000 lines, grid based): solves Einstein's equations of general relativity
  • PARATEC (Material Science, 50,000 lines, Fourier space/grid): Density Functional Theory electronic structure code
  • LBMHD (Plasma Physics, 1,500 lines, grid based): Lattice Boltzmann approach for magneto-hydrodynamics
  • GTC (Magnetic Fusion, 5,000 lines, particle based): particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
  • MADCAP (Cosmology, 5,000 lines, dense linear algebra): extracts key data from the Cosmic Microwave Background radiation

  26. Summary of Results
  • Tremendous potential of vector architectures:
    • 4 codes running faster than ever before
    • Vector systems allow resolution not possible with scalar systems (at any number of processors)
    • Advantage of having larger/faster nodes
  • The ES shows much higher sustained performance than the X1
    • Limited X1-specific optimization so far; more may be possible (CAF, etc.)
  • Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
  • Vectors are potentially at odds with emerging methods (sparse, irregular, adaptive)
    • GTC is an example of a code at odds with data-parallelism
    • Social barriers to evaluation of these hard-to-vectorize codes

  27. PERC Performance Tools
  • Flexible instrumentation systems to capture:
    • Hardware phenomena
    • Instruction execution frequencies
    • Memory reference behavior
    • Execution overheads
  • An advanced data management infrastructure to:
    • Track performance experiments
    • Collect data across time and space
  • User-friendly tools to tie performance data to the user's source code
  • Application-level analysis in Berkeley PERC
  Work by D. Bailey, H. Shan, and E. Strohmaier

  28. EVH1 Astrophysics Analysis High Performance Computing at Berkeley

  29. MicroBenchmarks
  • Using adaptable probes to understand microarchitecture limits
    • Tunable to "match" application kernels
    • The ability to collect continuous data sets over parameters reveals performance cliffs
  • Three examples:
    • Sqmat
    • APEX-Map
    • SPMV (for HPCS)

  30. Sqmat Overview
  • A Java code generator produces unrolled C code
  • Stream of matrices
    • Square each matrix M times
    • M controls computational intensity (CI), the ratio between flops and memory accesses
  • Each matrix is of size NxN
    • N controls working set size: 2N^2 registers are required per matrix; N is varied to cover the observable register set size
  • Two storage formats:
    • Direct storage: Sqmat's matrix entries are stored contiguously in memory
    • Indirect: entries are accessed through an indirection vector; "stanza length" S controls the degree of indirection (S entries in a row)
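The role of M can be sketched with a rough cost model. This is an assumption for illustration, not the original Sqmat code: each NxN squaring costs about (2N-1)*N^2 flops, while the matrix is loaded and stored only once, so repeated squaring raises the flop:memory ratio linearly in M.

```python
# Rough model of Sqmat's computational intensity (CI): flops per
# memory access for one matrix in the stream.

def sqmat_ci(N, M):
    flops = M * (2 * N - 1) * N * N   # M repeated dense NxN squarings
    mem_accesses = 2 * N * N          # one load + one store per entry
    return flops / mem_accesses
```

Doubling M doubles the modeled CI, which is how the probe sweeps from memory-bound to compute-bound behavior.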

  31. Tolerating Irregularity
  • S50
    • Start with some M at S = ∞ (indirect unit stride)
    • For a given M, how large must S be to achieve at least 50% of the original performance?
  • M50
    • Start with M = 1, S = ∞
    • At S = 1 (every access random), how large must M be to achieve 50% of the original performance?

  32. Tolerating Irregularity High Performance Computing at Berkeley

  33. Emerging Architectures
  • General-purpose processors are badly suited for data-intensive operations
    • Large caches are not useful if re-use is low
    • Low memory bandwidth, especially for irregular patterns
    • Superscalar methods of increasing ILP are inefficient
    • Power consumption
  • Three research processors designed as part of a DARPA effort
    • IRAM: processor-in-memory system with vectors (UCB)
      • Lots of memory bandwidth (on-chip DRAM)
    • DIVA: processor-in-memory system designed for multiprocessor systems (ISI)
      • Scalability
    • Imagine: stream-based processor (Stanford)
      • Lots of processing power (64 FPUs/chip)

  34. Sqmat on Future Machines
  • Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)
  • Imagine is much faster for long streams, slower for short ones

  35. HPCC Benchmarks and Apex-MAP High Performance Computing at Berkeley

  36. APEX Execution Model
  • Use an array of size M
  • Access data in vectors of length L
  • Random mode:
    • Pick the start address of each vector randomly
    • Use the properties of the random numbers to achieve a re-use number k
  • Regular mode:
    • Walk over consecutive (strided) vectors through memory
    • Re-access each vector k times
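The execution model above can be sketched as an address-stream generator. This is an illustration of the model, not the actual Apex-Map source; the function name and signature are assumptions:

```python
import random

# Generate an Apex-Map-like access stream over an array of size M:
# vectors of length L, each re-used k times, with either random or
# regular (consecutive) start addresses.

def apex_stream(M, L, k, mode, n_vectors, seed=0):
    rng = random.Random(seed)
    addrs, start = [], 0
    for _ in range(n_vectors):
        if mode == "random":
            start = rng.randrange(0, M - L + 1)
        for _ in range(k):                  # re-use the vector k times
            addrs.extend(range(start, start + L))
        if mode == "regular":
            start = (start + L) % (M - L + 1)
    return addrs
```

Sweeping L controls spatial locality and sweeping k controls temporal locality, which is what produces the two-parameter performance surfaces on the following slides.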

  37. Apex-Map Sequential
  [Chart: sequential results over the spatial and temporal locality parameters]

  38. Apex-Map Sequential
  [Chart: sequential results over the spatial and temporal locality parameters, continued]

  39. Apex-Map Sequential
  [Chart: sequential results over the spatial and temporal locality parameters, continued]

  40. Parallel Version
  • Same design principle as the sequential code
  • Data is evenly distributed among processes
  • L contiguous addresses are accessed together
  • Each remote access is a communication message of length L
  • Random access
  • MPI version first; SHMEM and UPC versions planned

  41. SPMV Benchmark
  • Microbenchmark for sparse matrix-vector multiply
    • Less "tunable"
    • Closer to a real application
  • Strategy
    • Use either a random matrix with dense blocks, or a dense matrix in sparse format
    • Register-block the matrix
      • Store blocks contiguously, unroll
      • Only one index per block
  • Developed for the HPCS benchmarks
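The kernel this microbenchmark measures is a compressed sparse row (CSR) matrix-vector multiply. A minimal pure-Python sketch for clarity (the benchmark itself is tuned C):

```python
# CSR sparse matrix-vector multiply: vals holds the nonzeros row by
# row, cols their column indices, and rowptr[r]..rowptr[r+1] delimits
# row r's entries.

def spmv_csr(vals, cols, rowptr, x):
    y = []
    for r in range(len(rowptr) - 1):
        acc = 0.0
        for i in range(rowptr[r], rowptr[r + 1]):
            acc += vals[i] * x[cols[i]]   # one index lookup per nonzero
        y.append(acc)
    return y
```

Register blocking replaces the per-nonzero index lookup with one index per small dense block, trading extra stored zeros for better register re-use and fewer loads.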

  42. [Charts: SpMV performance across platforms. Ultra 2i: 9% of peak, 63 vs. 35 Mflop/s; Ultra 3: 6%, 109 vs. 53 Mflop/s; Pentium III: 19%, 96 vs. 42 Mflop/s; Pentium III-M: 15%, 120 vs. 58 Mflop/s]

  43. [Charts: SpMV performance, continued. Power3: 13% of peak, 195 vs. 100 Mflop/s; Power4: 14%, 703 vs. 469 Mflop/s; Itanium 1: 7%, 225 vs. 103 Mflop/s; Itanium 2: 31%, 1.1 Gflop/s vs. 276 Mflop/s]

  44. Automatic Tuning

  45. Motivation for Automatic Performance Tuning
  • Historical trends
    • Sparse matrix-vector multiply (SpMV): 10% of peak or less
    • 2x faster than CSR with "hand-tuning"
    • Tuning is becoming more difficult over time
  • Performance depends on the machine, kernel, and matrix
    • The matrix may be known only at run-time
    • The best data structure + implementation can be surprising
  • Our approach: empirical modeling and search
    • Up to 4x speedups and 31% of peak for SpMV
    • Many optimization techniques for SpMV
    • Several other kernels: triangular solve, ATA*x, Ak*x
    • Proof-of-concept: integration with Omega3P
    • Release of the OSKI library, integration into PETSc
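The "empirical search" half of the approach can be sketched as timing candidate implementations on the actual machine and data and keeping the fastest. This is a generic illustration of search-based tuning, not OSKI's internals:

```python
import time

# Empirical search: rather than predicting the best variant from a
# static model, run each candidate and measure it in place.

def pick_best(candidates, run):
    """Return the candidate whose run() call is fastest here and now."""
    best, best_t = None, float("inf")
    for c in candidates:
        t0 = time.perf_counter()
        run(c)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = c, elapsed
    return best
```

In a real tuner the candidates would be generated code variants (e.g. different register block sizes) and the measurement would be combined with heuristic models to keep run-time search cost low.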

  46. Extra Work Can Improve Efficiency!
  • Non-zero structure is more complicated in general
  • Example: 3x3 blocking
    • Logical grid of 3x3 cells
    • Fill in explicit zeros
    • Unroll 3x3 block multiplies
    • "Fill ratio" = 1.5
  • On a Pentium III: 1.5x speedup!
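The fill-ratio arithmetic on this slide can be sketched directly: with r x c register blocking, every logical block that contains any nonzero is stored densely, so explicit zeros pad partially full blocks, and the fill ratio is stored entries divided by true nonzeros.

```python
# Compute the fill ratio of r x c register blocking for a matrix given
# as a list of (row, col) nonzero coordinates.

def fill_ratio(nnz_coords, r, c):
    blocks = {(i // r, j // c) for i, j in nnz_coords}  # occupied blocks
    stored = len(blocks) * r * c                        # dense storage
    return stored / len(nnz_coords)
```

With the slide's example, a fill ratio of 1.5 means 50% extra flops on explicit zeros, yet the unrolled 3x3 block multiplies still gave a 1.5x speedup on the Pentium III.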

  47. Summary of Performance Optimizations
  • Optimizations for SpMV
    • Register blocking (RB): up to 4x over CSR
    • Variable block splitting: 2.1x over CSR, 1.8x over RB
    • Diagonals: 2x over CSR
    • Reordering to create dense structure + splitting: 2x over CSR
    • Symmetry: 2.8x over CSR, 2.6x over RB
    • Cache blocking: 2.2x over CSR
    • Multiple vectors (SpMM): 7x over CSR
    • And combinations…
  • Sparse triangular solve
    • Hybrid sparse/dense data structure: 1.8x over CSR
  • Higher-level kernels
    • AAT*x, ATA*x: 4x over CSR, 1.8x over RB
    • A2*x: 2x over CSR, 1.5x over RB

  48. Optimized Sparse Kernel Interface (OSKI)
  • Provides sparse kernels automatically tuned for the user's matrix & machine
    • BLAS-style functionality: SpMV, TrSV, …
    • Hides the complexity of run-time tuning
    • Includes new, faster locality-aware kernels: ATA*x, …
  • Faster than standard implementations
    • Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x
  • For "advanced" users & solver library writers
    • Available as a stand-alone library (Dec '04)
    • Available as a PETSc extension (Feb '05)

  49. How OSKI Tunes (Overview)
  • Install-time (offline, in the library):
    1. Build for the target architecture
    2. Benchmark, producing generated code variants and benchmark data
  • Run-time (with the application):
    1. Evaluate heuristic models against the user's matrix, the benchmark data, and the workload history from program monitoring
    2. Select the data structure & code; return a matrix handle to the user for kernel calls
  • Extensibility: advanced users may write & dynamically add "code variants" and "heuristic models" to the system

  50. High Performance Software for Numerical Linear Algebra
  • James Demmel, Jack Dongarra, Xiaoye Li, …
  • LAPACK and ScaLAPACK
    • Widely used libraries for dense linear algebra
    • IBM, Intel, HP, SGI, Cray, NEC, Fujitsu, Matlab, …
  • New release planned
    • NSF support, seeking more
    • New, faster, more accurate numerical algorithms
    • More parallel versions into ScaLAPACK
    • Extending functionality
    • Improving ease of use
    • Performance tuning
    • Reliability and support