
18.337 Introduction




  1. 18.337 Introduction

  2. News you can use
  • Hardware
  • Multicore chips (2009: mostly 2 cores and 4 cores; 2010: hexacores, octocores; 2011: twelve cores)
  • Servers (often many multicores sharing memory)
  • Clusters (often several, to tens, and many more servers not sharing memory)

  3. Performance
  • Single-processor speeds are, for now, no longer growing.
  • Moore’s law still allows more real estate per core (transistor counts nearly double every two years).
  • http://www.intel.com/technology/mooreslaw/index.htm
  • People want performance, but it is hard to get.
  • Slowdowns are often seen before speedups.
  • Flops (floating point ops / second): Gigaflops (10^9), Teraflops (10^12), Petaflops (10^15).
  • Compare matmul with matadd. What’s the difference? (See the sketch below.)
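The matmul vs. matadd question can be made concrete with a small timing experiment. The sketch below is my own illustration (not from the slides), assuming NumPy is available; the size n = 2000 is an arbitrary choice. Matadd performs about n^2 flops while touching roughly 3n^2 numbers, whereas matmul performs about 2n^3 flops on the same amount of data, so matmul can reuse operands and approach peak flop rates while matadd is limited by time to memory.

# Rough flop-rate comparison of matrix addition vs. matrix multiplication.
# Illustrative sketch only; assumes NumPy and an arbitrary size n = 2000.
import time
import numpy as np

n = 2000
A = np.random.rand(n, n)
B = np.random.rand(n, n)

t0 = time.perf_counter()
C = A + B                      # matadd: ~n^2 flops, touches ~3n^2 numbers
t_add = time.perf_counter() - t0

t0 = time.perf_counter()
D = A @ B                      # matmul: ~2n^3 flops on the same 3n^2 numbers
t_mul = time.perf_counter() - t0

print(f"matadd: {n**2 / t_add / 1e9:.2f} Gflop/s")
print(f"matmul: {2 * n**3 / t_mul / 1e9:.2f} Gflop/s")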

  4. Some historical machines

  5. Earth Simulator was #1

  6. Some interesting hardware
  • Nvidia
  • Cell Processor
  • SiCortex – “Teraflops from Milliwatts”
  • http://www.sicortex.com/products/sc648
  • http://www.gizmag.com/mit-cycling-human-powered-computation/8503/

  7. Programming
  • MPI: the Message Passing Interface.
  • A low-level, “lowest common denominator” language that the world has stuck with for nearly 20 years.
  • Can get performance, but can be a hindrance as well.
  • Some say there are those who will pay for a 2x speedup, just make it easy.
  • The reality is that many want at least 10x, and more, for a qualitative difference in results.
  • People forget that serial performance can depend on many bottlenecks, including time to memory.
  • Performance (and large problems) are the reason for parallel computing, but it is difficult to get the “ease of use” vs. “performance” trade-off right. (A minimal MPI sketch follows below.)
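Since the slide does not tie MPI to a particular language, here is a minimal point-to-point sketch using the mpi4py bindings (an assumption on my part, not something the slide specifies). It only illustrates the low-level send/receive style the slide describes.

# Minimal MPI point-to-point example (sketch; assumes mpi4py is installed).
# Run with, e.g.:  mpiexec -n 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id, 0 .. size-1
size = comm.Get_size()          # total number of processes

if rank == 0:
    # Rank 0 explicitly sends a message to every other rank.
    for dest in range(1, size):
        comm.send("hello from rank 0", dest=dest, tag=0)
else:
    msg = comm.recv(source=0, tag=0)
    print(f"rank {rank} of {size} received: {msg}")

Even this tiny example shows the trade-off: every data movement is spelled out by hand, which is where both the control over performance and the programming burden come from.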

  8. Places to Look
  • Best current news: http://www.hpcwire.com/
  • Huge conference: http://sc11.supercomputing.org/

  9. Architecture Diagrams from Sam Williams (formerly) @ Berkeley
  • Bottom up performance engineering: understanding hardware’s implications on performance, up to software.
  • Top down: measuring software and tweaking, sometimes aware and sometimes unaware of hardware.

  10. http://www.cs.berkeley.edu/~samw/research/talks/sc07.pdf

  11. Want to delve into hard numerical algorithms
  • Examples: FFTs and Sparse Linear Algebra
  • At the MIT level:
  • Potential “not quite right” question: how do you parallelize these operations?
  • Rather, what issues arise and why is getting performance hard?
  • Why is n×n matmul easy? Almost cliché? (See the sketch below.)
  • Comfort level in this class to delve in?
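To illustrate why n×n matmul is the “easy” case, here is a sketch of my own (NumPy and the standard-library multiprocessing module are assumptions, not anything named on the slide): the row blocks of C = A·B are independent, so workers need no communication once they have their operands. FFTs and sparse matrix-vector products, by contrast, have global or irregular data dependencies, which is where the interesting performance issues arise.

# Why matmul parallelizes easily: row blocks of C = A @ B are independent.
# Sketch only; assumes NumPy, and n, p are arbitrary choices.
from multiprocessing import Pool
import numpy as np

def row_block(args):
    A_block, B = args
    return A_block @ B          # each worker computes its own rows of C

if __name__ == "__main__":
    n, p = 1024, 4
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    blocks = np.array_split(A, p, axis=0)          # p independent row blocks
    with Pool(p) as pool:
        C = np.vstack(pool.map(row_block, [(blk, B) for blk in blocks]))
    assert np.allclose(C, A @ B)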

  12. Old Homework (emphasized for effect)
  • Download a parallel program from somewhere.
  • Make it work.
  • Download another parallel program.
  • Now, …, make them work together!

  13. SIMD
  • SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector.)
  • The term was coined with one element per processor in mind, but with today’s deep memories and hefty processors, large chunks of the vectors would be added on one processor.
  • The term was coined with the broadcasting of an instruction in mind, hence the “single instruction,” but today’s machines are usually more flexible.
  • The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or FFT is SIMD or not, but these operations can certainly be built from SIMD operations.
  • Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with data parallel operations, though this feels wrong to me) when the software appears to run “lock-step” with every processor executing the same instruction.
  • Usage: “I hear that machine is particularly fast when the program primarily consists of SIMD operations.”
  • Graphics processors such as NVIDIA’s seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD. (A small data-parallel sketch follows below.)
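A small data-parallel sketch of the idea (my own illustration, assuming NumPy): the same add is applied to every element of a vector in a single operation, which is the “lock-step” behavior described above, as opposed to an explicit scalar loop.

# One "instruction" (an elementwise add) applied to multiple data at once,
# vs. an explicit scalar loop. Sketch only; assumes NumPy.
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Scalar style: one add at a time.
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] + b[i]

# Data-parallel / SIMD style: a single vector add over all elements.
c_vec = a + b

assert np.allclose(c_loop, c_vec)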

  14. SIMD summary
  • SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data.
  • One may also refer to a SIMD operation (sometimes, but not always, historically synonymous with a data parallel operation) when the software appears to run “lock-step” with every processor executing the same instructions.

  15. The Cloud
  • Problems with HPC systems are not what you think:
  • Users wrote codes that nobody could use.
  • Systems were hard to install.
  • The Interactive Supercomputing Experience.
  • What the cloud could do.
  • What are the limitations?
