


  1. Parallelism and Locality in Matrix Computations: Introduction • www.cs.berkeley.edu/~demmel/cs267_Spr09 • Jim Demmel, EECS & Math Departments, UC Berkeley • demmel@cs.berkeley.edu

  2. Outline (of all lectures) • Why all computers must be parallel processors • Arithmetic is cheap, what costs is moving data • Recurring computational patterns • Dense Linear Algebra • Sparse Linear Algebra • How do I know I get the right answer?

  3. Units of Measure • High Performance Computing (HPC) units are: • Flop: floating point operation • Flops/s: floating point operations per second • Bytes: size of data (a double precision floating point number is 8 bytes) • Typical sizes are millions, billions, trillions…
      Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
      Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
      Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
      Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
      Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
      Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
      Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
      • Current fastest (public) machine ~ 1.5 Pflop/s • Up-to-date list at www.top500.org

  4. Outline (of all lectures) • Why all computers must be parallel processors • Arithmetic is cheap, what costs is moving data • Recurring computational patterns • Dense Linear Algebra • Sparse Linear Algebra • How do I know I get the right answer?

  5. Technology Trends: Microprocessor Capacity • Moore's Law: 2X transistors/chip every 1.5 years • Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months • Microprocessors have become smaller, denser, and more powerful • Slide source: Jack Dongarra

  6. Performance Development • [Top500 chart, www.top500.org: the total performance of all 500 systems (SUM) has reached 22.9 PFlop/s, the #1 system (N=1) 1.1 PFlop/s, and the #500 system (N=500) 17.08 TFlop/s, up from earlier values of 1.17 TFlop/s, 59.7 GFlop/s, and 400 MFlop/s respectively; y-axis spans 100 Mflop/s to 100 Pflop/s]

  7. Parallelism Revolution is Happening Now • Chip density is continuing to increase ~2x every 2 years • Clock speed is not • Number of processor cores may double instead • There is little or no more hidden parallelism (ILP) to be found • Parallelism must be exposed to and managed by software • Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

  8. Outline (of all four lectures) • Why all computers must be parallel processors • Arithmetic is cheap, what costs is moving data • Recurring computational patterns • Dense Linear Algebra • Sparse Linear Algebra • How do I know I get the right answer?

  9. Motivation • Most applications run at < 10% of the "peak" performance of a system • Peak is the maximum the hardware can physically execute • Much of this performance is lost on a single processor, i.e., the code running on one processor often runs at only 10-20% of the processor peak • Most of the single processor performance loss is in the memory system • Moving data takes much longer than arithmetic and logic • To understand this, we need to look under the hood of modern processors • We will first look at only a single "core" processor • These issues will exist on processors within any parallel computer • For parallel computers, moving data is also the bottleneck

  10. Outline • Arithmetic is cheap, what costs is moving data • Idealized and actual costs in modern processors • Parallelism within single processors • Memory hierarchies • What this means for designing algorithms and software

  11. Outline • Arithmetic is cheap, what costs is moving data • Idealized and actual costs in modern processors • Parallelism within single processors • Memory hierarchies • Temporal and spatial locality • Basics of caches • Use of microbenchmarks to characterize performance • What this means for designing algorithms and software

  12. Memory Hierarchy • Most programs have a high degree of locality in their accesses • spatial locality: accessing things nearby previous accesses • temporal locality: reusing an item that was previously accessed • Memory hierarchy tries to exploit locality • [Diagram: processor (control, datapath, registers, on-chip cache) → second level cache (SRAM) → main memory (DRAM) → secondary storage (Disk) → tertiary storage (Disk/Tape)]
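To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array layout and loop structure are illustrative assumptions). Summing a row-major matrix row by row touches consecutive addresses (spatial locality) and keeps the accumulator in a register (temporal locality); sweeping the same matrix column by column puts successive accesses n*8 bytes apart and loses the spatial locality.

```c
#include <stddef.h>

/* Row-major traversal: consecutive A[i][j], A[i][j+1] share cache lines
 * (spatial locality); "sum" stays in a register (temporal locality). */
double sum_rowwise(size_t n, const double A[n][n]) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += A[i][j];
    return sum;
}

/* Column-major traversal of the same row-major array: successive accesses
 * are n*8 bytes apart, so nearly every access can miss once n is large. */
double sum_colwise(size_t n, const double A[n][n]) {
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += A[i][j];
    return sum;
}
```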

  13. Processor-DRAM Gap (latency) • Memory hierarchies are getting deeper • Processors get faster more quickly than memory • [Chart, 1980-2000: CPU performance ("Moore's Law") improves ~60%/yr while DRAM latency improves only ~7%/yr, so the processor-memory performance gap grows ~50% per year]

  14. Approaches to Handling Memory Latency • Approach to address the memory latency problem • Eliminate memory operations by saving values in small, fast memory (cache) and reusing them • need temporal locality in program • Take advantage of better bandwidth by getting a chunk of memory and saving it in small fast memory (cache) and using whole chunk • need spatial locality in program • Take advantage of better bandwidth by allowing processor to issue multiple reads to the memory system at once • concurrency in the instruction stream, e.g. load whole array, as in vector processors; or prefetching • Overlap computation & memory operations • Prefetching • Bandwidth has improved more than latency • 23% per year vs 7% per year • Bandwidth still getting slower compared to arithmetic (at 60% per year)
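One of the bullets above (issuing multiple reads early / prefetching) can be made concrete in code. A minimal sketch, assuming the GCC/Clang __builtin_prefetch extension is available; the prefetch distance is an illustrative guess that would need tuning per machine, and this is not code from the lecture.

```c
/* Hedged sketch: software prefetching with the GCC/Clang builtin
 * __builtin_prefetch.  The prefetch of A[i + DIST] is overlapped with
 * the work on A[i], hiding some of the memory latency. */
#define DIST 16   /* prefetch distance in elements; an illustrative guess */

double sum_with_prefetch(const double *A, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&A[i + DIST], 0, 1);  /* read, low temporal locality */
        s += A[i];
    }
    return s;
}
```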

  15. Cache Basics • Cache is fast (expensive) memory which keeps a copy of data in main memory; it is hidden from software • Simplest example: data at memory address xxxxx1101 is stored at cache location 1101 • Cache hit: in-cache memory access, cheap • Cache miss: non-cached memory access, expensive • Need to access next, slower level of cache • Cache line length: # of bytes loaded together in one entry • Ex: If either xxxxx1100 or xxxxx1101 is loaded, both are • Associativity • direct-mapped: only 1 address (line) in a given range in cache • Ex: Data stored at address xxxxx1101 stored at cache location 1101, in a 16 word cache • n-way: n ≥ 2 lines with different addresses can be stored • Ex: Up to 16 words with addresses xxxxx1101 can be stored at cache location 1101
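To connect the address-bit picture above to code, here is a small hedged sketch (not from the slides) showing how a direct-mapped cache with illustrative parameters (16 lines of 16 bytes each) splits an address into offset, index, and tag.

```c
#include <stdint.h>
#include <stdio.h>

/* Hedged sketch of the direct-mapped mapping: the low bits select the byte
 * within a line, the next bits select which cache slot, and the rest form
 * the tag.  The sizes (16 lines, 16-byte lines) are illustrative only. */
#define LINE_BYTES 16u
#define NUM_LINES  16u

int main(void) {
    uint32_t addr   = 0xABCD1234u;                      /* arbitrary example address */
    uint32_t offset = addr % LINE_BYTES;                 /* byte within the cache line */
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;   /* which cache slot it maps to */
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);   /* identifies which block is resident */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}
```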

  16. Why Have Multiple Levels of Cache? • On-chip vs. off-chip • On-chip caches are faster, but limited in size • A large cache has delays • Hardware to check longer addresses in cache takes more time • Associativity, which gives a more general set of data in cache, also takes more time • Some examples: • Cray T3E eliminated one cache to speed up misses • IBM uses a level of cache as a “victim cache” which is cheaper • There are other levels of the memory hierarchy • Register, pages (TLB, virtual memory), … • And it isn’t always a hierarchy

  17. Experimental Study of Memory (Membench) • Microbenchmark for memory system performance
      1 experiment: time the following loop (repeat many times and average)
        for i from 0 to L
          load A[i] from memory (4 Bytes)
      Full sweep:
        for array A of length L from 4KB to 8MB by 2x
          for stride s from 4 Bytes (1 word) to L/2 by 2x
            time the following loop (repeat many times and average)
              for i from 0 to L by s
                load A[i] from memory (4 Bytes)
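A hedged C sketch of such a microbenchmark follows (not the actual Membench code; the array sizes, the clock_gettime timing call, and the volatile sink used to keep the loads from being optimized away are all assumptions).

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hedged Membench-style sketch: time strided 4-byte loads through an array
 * and report the average nanoseconds per access. */
static volatile int sink;   /* keeps the compiler from deleting the loads */

static double time_strided(const int *A, size_t len, size_t stride, int reps) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < len; i += stride)
            sink = A[i];                                /* one 4-byte load */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)reps * (len / stride));        /* average ns per access */
}

int main(void) {
    for (size_t bytes = 4u << 10; bytes <= 8u << 20; bytes *= 2) {  /* 4KB .. 8MB */
        size_t len = bytes / sizeof(int);
        int *A = malloc(bytes);
        for (size_t i = 0; i < len; i++) A[i] = (int)i;
        for (size_t s = 1; s <= len / 2; s *= 2)                    /* stride sweep */
            printf("size=%zuB stride=%zuB avg=%.2f ns\n",
                   bytes, s * sizeof(int), time_strided(A, len, s, 10));
        free(A);
    }
    return 0;
}
```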

  18. Membench: What to Expect • Consider the average cost per load • Plot one line for each array length, time vs. stride • Small stride is best: if cache line holds 4 words, at most ¼ miss • If array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs) • Picture assumes only one level of cache • Values have gotten more difficult to measure on modern procs • [Sketch: average cost per access vs. stride s; curves for total size < L1 stay at the cache hit time, curves for size > L1 rise toward the memory time]

  19. Memory Hierarchy on a Sun Ultra-2i • Sun Ultra-2i, 333 MHz • [Membench plot, one curve per array length: L1: 16 KB, 16 B line, 2 cycles (6 ns); L2: 2 MB, 64 byte line, 12 cycles (36 ns); Mem: 396 ns (132 cycles); 8 K pages, 32 TLB entries] • See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details

  20. Memory Hierarchy on a Pentium III • Katmai processor on Millennium, 550 MHz • [Membench plot vs. array size: L1: 64K, 5 ns, 4-way?, 32 byte line?; L2: 512 KB, 60 ns]

  21. Memory Hierarchy on a Power3 (Seaborg) • Power3, 375 MHz • [Membench plot vs. array size: L1: 32 KB, 128 B line, 0.5-2 cycles; L2: 8 MB, 128 B line, 9 cycles; Mem: 396 ns (132 cycles)]

  22. Stanza Triad – to measure prefetching • Even smaller benchmark for prefetching • Derived from STREAM Triad • Stanza (L) is the length of a unit-stride run:
        while i < arraylength
          for each L-element stanza
            A[i] = scalar * X[i] + Y[i]
          skip k elements
      • Access pattern: 1) do L triads, 2) skip k elements, 3) do L triads, … (stanza, skip, stanza, …) • Source: Kamil et al, MSP05
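A minimal hedged C rendering of the kernel described above (array names and the handling of the final partial stanza are assumptions; timing and verification are omitted):

```c
#include <stddef.h>

/* Hedged sketch of the Stanza Triad kernel: do a unit-stride "stanza" of
 * L triads, skip k elements, and repeat until the end of the arrays. */
void stanza_triad(double *A, const double *X, const double *Y,
                  size_t n, size_t L, size_t k, double scalar) {
    size_t i = 0;
    while (i < n) {
        size_t end = (i + L < n) ? i + L : n;   /* one stanza of length <= L */
        for (; i < end; i++)
            A[i] = scalar * X[i] + Y[i];        /* STREAM Triad operation */
        i += k;                                  /* skip k elements */
    }
}
```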

  23. Stanza Triad Results • The x-axis of this graph starts at one cache line size (16 Bytes) • If cache locality were the only thing that mattered, we would expect flat lines equal to the measured memory peak bandwidth (STREAM), as on the Pentium III • Prefetching gets the next cache line (pipelining) while the current one is being used • This does not "kick in" immediately, so performance depends on L

  24. Outline • Arithmetic is cheap, what costs is moving data • Idealized and actual costs in modern processors • Parallelism within single processors • Memory hierarchies • What this means for designing algorithms and software • This is the main topic of these lectures

  25. What this means for designing algorithms and software • Design goal should be to minimize the most expensive operation • Minimize communication = moving data, either between levels of a memory hierarchy or between processors over a network • An algorithm that is good enough today may not be tomorrow • Communication cost is increasing relative to arithmetic • Sometimes it helps to do more arithmetic in order to do less communication • Rest of lectures address impact on linear algebra • Many new algorithms, designed to minimize communication • Proofs that communication is minimized • Actual performance of a simple program can be a complicated function of the architecture • Slight changes in the architecture or program change the performance significantly • We would like simple models to help us design efficient algorithms and prove their optimality • Can we automate algorithm design?
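As a concrete illustration of reorganizing a computation so it moves less data between levels of the memory hierarchy, here is a minimal sketch of blocked matrix multiplication. The blocking idea is standard, but this particular code (row-major layout, square blocks) is illustrative rather than taken from the lectures.

```c
#include <stddef.h>

/* Hedged sketch: blocked n x n matrix multiply, C += A*B, row-major.
 * Choosing the block size b so that three b x b tiles fit in fast memory
 * means each element brought into cache is reused ~b times, reducing
 * slow-memory traffic from ~n^3 words to ~n^3/b words. */
void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t jj = 0; jj < n; jj += b)
            for (size_t kk = 0; kk < n; kk += b)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                for (size_t i = ii; i < ii + b && i < n; i++)
                    for (size_t j = jj; j < jj + b && j < n; j++) {
                        double cij = C[i*n + j];
                        for (size_t k = kk; k < kk + b && k < n; k++)
                            cij += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = cij;
                    }
}
```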

  26. Outline (of all four lectures) • Why all computers must be parallel processors • Arithmetic is cheap, what costs is moving data • Recurring computational patterns • Dense Linear Algebra • Sparse Linear Algebra • How do I know I get the right answer?

  27. The "7 Dwarfs" of High Performance Computing • Phil Colella (LBL) identified 7 kernels out of which most large scale simulation and data-analysis programs are composed: • Dense Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a dense matrix • Sparse Linear Algebra • Ex: Solve Ax=b or Ax = λx where A is a sparse matrix (mostly zero) • Operations on Structured Grids • Ex: Anew(i,j) = 4*A(i,j) – A(i-1,j) – A(i+1,j) – A(i,j-1) – A(i,j+1) • Operations on Unstructured Grids • Ex: Similar, but list of neighbors varies from entry to entry • Spectral Methods • Ex: Fast Fourier Transform (FFT) • Particle Methods • Ex: Compute electrostatic forces using Fast Multipole Method • Monte Carlo • Ex: Many independent simulations using different inputs
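To show what the structured-grid example above looks like in code, here is a minimal hedged C sketch (the row-major layout and the decision to leave boundary points untouched are assumptions, not part of the slide):

```c
#include <stddef.h>

/* Hedged sketch of the structured-grid update from the slide:
 * Anew(i,j) = 4*A(i,j) - A(i-1,j) - A(i+1,j) - A(i,j-1) - A(i,j+1),
 * applied at every interior point of an n x n grid. */
void stencil_sweep(size_t n, const double *A, double *Anew) {
    for (size_t i = 1; i + 1 < n; i++)
        for (size_t j = 1; j + 1 < n; j++)
            Anew[i*n + j] = 4.0 * A[i*n + j]
                          - A[(i-1)*n + j] - A[(i+1)*n + j]
                          - A[i*n + (j-1)] - A[i*n + (j+1)];
}
```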

  28. Motif/Dwarf: Common Computational Methods (Red = Hot, Blue = Cool) • [Chart: motifs vs. application areas, colored by how heavily each area uses each motif]

  29. Programming Pattern Language 1.0 (Keutzer & Mattson)
      Applications
      • Choose your high level architecture – Guided decomposition: Task Decomposition ↔ Data Decomposition, Group Tasks, Order Groups, Data Sharing, Data Access
      • Identify the key computational patterns (what are my key computations?) – Guided instantiation: Graph Algorithms, Dynamic Programming, Dense Linear Algebra, Sparse Linear Algebra, Unstructured Grids, Structured Grids, Graphical Models, Finite State Machines, Backtrack Branch and Bound, N-Body Methods, Circuits, Spectral Methods
      • Choose your high level structure (what is the structure of my application?) – Guided expansion: Model-View-Controller, Iterator, Map Reduce, Layered Systems, Arbitrary Static Task Graph, Pipe-and-Filter, Agent and Repository, Process Control, Event Based / Implicit Invocation
      Productivity Layer
      • Refine the structure (what concurrent approach do I use?) – Guided re-organization: Digital Circuits, Task Parallelism, Graph Algorithms, Event Based, Divide and Conquer, Data Parallelism, Geometric Decomposition, Pipeline, Discrete Event
      • Utilize supporting structures (how do I implement my concurrency?) – Guided mapping: Master/Worker, Loop Parallelism, BSP, Distributed Array, Shared Data, Fork/Join, CSP, Shared Queue, Shared Hash Table
      Efficiency Layer
      • Implementation methods (what are the building blocks of parallel programming?) – Guided implementation: Thread Creation/Destruction, Process Creation/Destruction, Message Passing, Collective Communication, Speculation, Transactional Memory, Barriers, Mutex, Semaphores

  30. Algorithms for N x N Linear System Ax=b
      Algorithm         Serial      PRAM             Memory     #Procs
      Dense LU          N^3         N                N^2        N^2
      Band LU           N^2         N                N^(3/2)    N
      Jacobi            N^2         N                N          N
      Explicit Inv.     N^2         log N            N^2        N^2
      Conj. Gradients   N^(3/2)     N^(1/2) * log N  N          N
      Red/Black SOR     N^(3/2)     N^(1/2)          N          N
      Sparse LU         N^(3/2)     N^(1/2)          N*log N    N
      FFT               N*log N     log N            N          N
      Multigrid         N           log^2 N          N          N
      Lower bound       N           log N            N
      PRAM is an idealized parallel model with zero cost communication

  31. Algorithms for 2D Poisson Equation (N = n^2 vars)
      Algorithm         Serial      PRAM             Memory     #Procs
      Dense LU          N^3         N                N^2        N^2
      Band LU           N^2         N                N^(3/2)    N
      Jacobi            N^2         N                N          N
      Explicit Inv.     N^2         log N            N^2        N^2
      Conj. Gradients   N^(3/2)     N^(1/2) * log N  N          N
      Red/Black SOR     N^(3/2)     N^(1/2)          N          N
      Sparse LU         N^(3/2)     N^(1/2)          N*log N    N
      FFT               N*log N     log N            N          N
      Multigrid         N           log^2 N          N          N
      Lower bound       N           log N            N
      PRAM is an idealized parallel model with zero cost communication
      Reference: J.D., Applied Numerical Linear Algebra, SIAM, 1997.

  32. Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)
      Algorithm         Serial               PRAM                       Memory              #Procs
      Dense LU          N^3                  N                          N^2                 N^2
      Band LU           N^2 (N^(7/3))        N                          N^(3/2) (N^(5/3))   N (N^(4/3))
      Jacobi            N^2 (N^(5/3))        N (N^(2/3))                N                   N
      Explicit Inv.     N^2                  log N                      N^2                 N^2
      Conj. Gradients   N^(3/2) (N^(4/3))    N^(1/2) (N^(1/3)) * log N  N                   N
      Red/Black SOR     N^(3/2) (N^(4/3))    N^(1/2) (N^(1/3))          N                   N
      Sparse LU         N^(3/2) (N^2)        N^(1/2)                    N*log N (N^(4/3))   N
      FFT               N*log N              log N                      N                   N
      Multigrid         N                    log^2 N                    N                   N
      Lower bound       N                    log N                      N
      PRAM is an idealized parallel model with zero cost communication
      Reference: J.D., Applied Numerical Linear Algebra, SIAM, 1997.
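For concreteness, here is a minimal hedged C sketch (not from the slides) of one Jacobi sweep for the 2D Poisson equation; each sweep is O(N) work, and with the roughly O(N) sweeps Jacobi needs to converge this gives the N^2 serial entry in the table above.

```c
#include <stddef.h>

/* Hedged sketch: one Jacobi sweep for -Laplace(u) = f on an n x n grid with
 * mesh spacing h; Dirichlet boundary values are held in u's boundary entries
 * and are not updated.  Row-major layout is an assumption. */
void jacobi_sweep(size_t n, double h,
                  const double *u, const double *f, double *unew) {
    for (size_t i = 1; i + 1 < n; i++)
        for (size_t j = 1; j + 1 < n; j++)
            unew[i*n + j] = 0.25 * ( u[(i-1)*n + j] + u[(i+1)*n + j]
                                   + u[i*n + (j-1)] + u[i*n + (j+1)]
                                   + h*h * f[i*n + j] );
}
```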

  33. Algorithms and Motifs
      Algorithm         Motifs
      Dense LU          Dense linear algebra
      Band LU           Dense linear algebra
      Jacobi            (Un)structured meshes, Sparse Linear Algebra
      Explicit Inv.     Dense linear algebra
      Conj. Gradients   (Un)structured meshes, Sparse Linear Algebra
      Red/Black SOR     (Un)structured meshes, Sparse Linear Algebra
      Sparse LU         Sparse Linear Algebra
      FFT               Spectral
      Multigrid         (Un)structured meshes, Sparse Linear Algebra

  34. Outline (of all four lectures) • Why all computers must be parallel processors • Arithmetic is cheap, what costs is moving data • Recurring computational patterns • Dense Linear Algebra • Sparse Linear Algebra • How do I know I get the right answer?

  35. For more information • CS267 • Annual one semester course on parallel computing at UC Berkeley • All slides and video archived from Spring 2009 offering • www.cs.berkeley.edu/~demmel/cs267_Spr09 • Google “parallel computing course” • www.cs.berkeley.edu/~demmel/cs267 • 1996 version, but extensive on-line algorithmic notes • Parallelism “boot camp” • Second annual 3 day course at UC Berkeley • parlab.eecs.berkeley.edu/2009bootcamp • ParLab • Parallel Computing Research Lab at UC Berkeley • parlab.eecs.berkeley.edu

  36. Extra slides from CS267 Lecture 1

  37. Computational Science- Recent News “An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing … to the integration of computer science concepts, tools, and theorems into the very fabric of science.” -Science 2020 Report, March 2006 Nature, March 23, 2006

  38. Drivers for Change • Continued exponential increase in computational power → simulation is becoming the third pillar of science, complementing theory and experiment • Continued exponential increase in experimental data → techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications

  39. Simulation: The Third Pillar of Science (alongside theory and experiment) • Traditional scientific and engineering method: (1) Do theory or paper design (2) Perform experiments or build system • Limitations: too difficult (build large wind tunnels), too expensive (build a throw-away passenger jet), too slow (wait for climate or galactic evolution), too dangerous (weapons, drug design, climate experimentation) • Computational science and engineering paradigm: (3) Use high performance computer systems to simulate and analyze the phenomenon • Based on known physical laws and efficient numerical methods • Analyze simulation results with computational tools and methods beyond what is used traditionally for experimental data analysis

  40. Computational Science and Engineering (CSE) • CSE is a widely accepted label for an evolving field concerned with the science of and the engineering of systems and methodologies to solve computational problems arising throughout science and engineering • CSE is characterized by • Multi-disciplinary • Multi-institutional • Requiring high-end resources • Large teams • Focus on community software • CSE is not "just programming" (and not CS) • Fast computers necessary but not sufficient • New graduate program in CSE at UC Berkeley (more later) • Reference: Petzold, L., et al., Graduate Education in CSE, SIAM Rev., 43 (2001), 163-177

  41. SciDAC - First Federal Program to Implement CSE • SciDAC (Scientific Discovery through Advanced Computing) program created in 2001 • About $50M annual funding • Berkeley (LBNL+UCB) largest recipient of SciDAC funding • [Images: Global Climate, Nanoscience, Biology, Combustion, Astrophysics]

  42. Some Particularly Challenging Computations • Science • Global climate modeling • Biology: genomics; protein folding; drug design • Astrophysical modeling • Computational Chemistry • Computational Material Sciences and Nanosciences • Engineering • Semiconductor design • Earthquake and structural modeling • Computational fluid dynamics (airplane design) • Combustion (engine design) • Crash simulation • Business • Financial and economic modeling • Transaction processing, web services and search engines • Defense • Nuclear weapons -- test by simulations • Cryptography

  43. Economic Impact of HPC • Airlines: • System-wide logistics optimization systems on parallel systems. • Savings: approx. $100 million per airline per year. • Automotive design: • Major automotive companies use large systems (500+ CPUs) for: • CAD-CAM, crash testing, structural integrity and aerodynamics. • One company has 500+ CPU parallel system. • Savings: approx. $1 billion per company per year. • Semiconductor industry: • Semiconductor firms use large systems (500+ CPUs) for • device electronics simulation and logic validation • Savings: approx. $1 billion per company per year. • Securities industry (note: old data …) • Savings: approx. $15 billion per year for U.S. home mortgages.

  44. $5B World Market in Technical Computing Source: IDC 2004, from NRC Future of Supercomputing Report

  45. What Supercomputers Do: Introducing Computational Science and Engineering • Two examples: • simulation replacing an experiment that is too dangerous • analyzing massive amounts of data with new tools

  46. Global Climate Modeling Problem • Problem is to compute: f(latitude, longitude, elevation, time) → "weather" = (temperature, pressure, humidity, wind velocity) • Approach: • Discretize the domain, e.g., a measurement point every 10 km • Devise an algorithm to predict weather at time t+dt given t • Uses: • Predict major events, e.g., El Niño • Use in setting air emissions standards • Evaluate global warming scenarios • Source: http://www.epm.ornl.gov/chammp/chammp.html

  47. Global Climate Modeling Computation • One piece is modeling the fluid flow in the atmosphere • Solve Navier-Stokes equations • Roughly 100 Flops per grid point with 1 minute timestep • Computational requirements: • To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s • Weather prediction (7 days in 24 hours) → 56 Gflop/s • Climate prediction (50 years in 30 days) → 4.8 Tflop/s • To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s • To double the grid resolution, computation is 8x to 16x • State of the art models require integration of atmosphere, clouds, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more • Current models are coarser than this
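As a check on these rates (a worked calculation, not on the slide): 5 x 10^11 flops / 60 s ≈ 8.3 x 10^9 flop/s, i.e. roughly 8 Gflop/s to keep up with real time. Simulating 7 days in 24 hours is 7x real time, so about 7 x 8 ≈ 56 Gflop/s; 50 years in 30 days is roughly 600x real time, or about 4.8 Tflop/s; and 50 years in 12 hours is about 36,500x real time, or roughly 290 Tflop/s, consistent with the 288 Tflop/s on the slide.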

  48. High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

  49. U.S.A. Hurricane Source: M.Wehner, LBNL

  50. NERSC User George Smoot wins 2006 Nobel Prize in Physics • Smoot and Mather's 1992 COBE experiment showed the anisotropy of the CMB • Cosmic Microwave Background Radiation (CMB): an image of the universe at an age of 400,000 years
