
18.337 Parallel Computing’s Challenges


Presentation Transcript


  1. 18.337 Parallel Computing’s Challenges

  2. Old Homework (emphasized for effect) • Download a parallel program from somewhere. • Make it work • Download another parallel program • Now, …, make them work together!

  3. SIMD • SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vectors.) • The term was coined with one element per processor in mind, but with today’s deep memories and hefty processors, large chunks of the vectors would be added on one processor. • The term was coined with a broadcasting of an instruction in mind, hence the single instruction, but today’s machines are usually more flexible. • The term was coined with A+B and elementwise AxB in mind, and so nobody really knows for sure if matmul or fft is SIMD or not, but these operations can certainly be built from SIMD operations. • Today, it is not unusual to refer to a SIMD operation (sometimes but not always historically synonymous with Data Parallel Operations, though this feels wrong to me) when the software appears to run “lock-step” with every processor executing the same instruction. • Usage: “I hear that machine is particularly fast when the program primarily consists of SIMD operations.” • Graphics processors such as NVIDIA’s seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
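To make the idea concrete, here is a tiny vectorized example in MATLAB (chosen since the slides quote MATLAB elsewhere); it only illustrates one operation applied across many data elements, not any particular SIMD hardware:

```matlab
% One conceptual "instruction" applied to every element of the data:
n = 8;
a = (1:n)';            % column vector of length n
b = 2*ones(n, 1);
c = a + b;             % vector add: the canonical SIMD operation
d = a .* b;            % elementwise multiply, also SIMD-friendly
```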

  4. Natural Question may not be the most important • How do I parallelize x? • First question many students ask • Answer often either one of • Fairly obvious • Very difficult • Can miss the true issues of high performance • These days people are often good at exploiting locality for performance • People are not very good about hiding communication and anticipating data movement to avoid bottlenecks • People are not very good about interweaving multiple functions to make the best use of resources • Usually misses the issue of interoperability • Will my program play nicely with your program? • Will my program really run on your machine?

  5. Class Notation • Vectors: small roman letters x, y, … • Vectors have length n if possible • Matrices: large roman (sometimes Greek) letters A, B, X, Λ, Σ • Matrices are n x n, or maybe m x n, but almost never n x m. Could be p x q. • Scalars may be small Greek letters or small roman letters – may not be as consistent

  6. Algorithm Example: FFTs • For now think of an FFT as a “black box”: y = FFT(x) takes as input a vector x of length n and returns an output vector y of length n, defined (but not computed) as a matrix times vector, y = F_n x, where (F_n)_jk = e^(2πijk/n) for j,k = 0, …, n−1. • Important Use Cases • Column fft: fft(X), fft(X,[],1) (MATLAB) • Row fft: fft(X,[],2) (MATLAB) • 2d fft (do a row and a column fft): fft2(X) • fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X))
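A minimal MATLAB sketch of these use cases; note that MATLAB’s built-in fft happens to use the e^(−2πijk/n) sign convention, so the explicit matrix below is written with the minus sign so that the check comes out small:

```matlab
% Column, row, and 2d FFTs of an m-by-n matrix X.
m = 4;  n = 8;
X = rand(m, n);
Yc = fft(X, [], 1);     % column fft (same as fft(X))
Yr = fft(X, [], 2);     % row fft
Y2 = fft2(X);           % 2d fft

% The "black box" view: fft(x) computes y = Fn*x without ever forming Fn.
x  = rand(n, 1);
Fn = exp(-2i*pi*(0:n-1)'*(0:n-1)/n);   % explicit DFT matrix (MATLAB's sign convention)
norm(Fn*x - fft(x))                    % should be around machine precision
```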

  7. How to implement a column FFT? • Put block columns on each processor • Do local column FFTs • The local column FFTs may be “column at a time” or “pipelined” • In the case of the FFT a fast local package is probably available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages. (Diagram: the matrix split into block columns owned by processors P0, P1, P2.)
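A serial MATLAB sketch of the block-column layout: the loop index plays the role of processors P0, P1, P2, and each “processor” FFTs only its own columns with no communication. A real implementation would distribute the blocks with MPI, spmd, or a similar framework.

```matlab
% Block-column parallel column FFT, simulated serially.
n = 12;  p = 3;                       % p plays the role of P0, P1, P2
X = rand(n, n);
Y = complex(zeros(n, n));
edges = round(linspace(0, n, p+1));   % column ranges owned by each "processor"
for proc = 1:p
    cols = edges(proc)+1 : edges(proc+1);
    Y(:, cols) = fft(X(:, cols), [], 1);   % purely local column FFTs
end
norm(Y - fft(X, [], 1))               % agrees with the global column fft
```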

  8. A closer look at column fft • Put block columns on each processor • Where were the columns? Where are they going? • The data movement above can be very expensive in terms of performance. Can we hide it somewhere? (Diagram: block columns on processors P0, P1, P2.)

  9. What about row fft • Suppose block columns are on each processor • Many would transpose, apply a column FFT, and then transpose back • This thinking is simple and doable • Not only simple, but it encourages the paradigm of 1) do whatever, 2) get good parallelism, 3) do whatever • It is harder to decide whether to do rows in parallel or to interweave transposing of pieces and start computation • There may be more performance available, but nobody to my knowledge has done a good job of this yet. You could be the first. (Diagram: block columns on processors P0, P1, P2.)
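The simple “transpose, column fft, transpose back” recipe, sketched in MATLAB; on a distributed machine the two transposes are exactly where the communication (and the opportunity to hide it) would live.

```matlab
% Row FFT computed via transposes plus a column FFT.
n = 12;
X  = rand(n, n);
Yr = fft(X.', [], 1).';      % transpose, local column FFTs, transpose back
norm(Yr - fft(X, [], 2))     % same answer as a direct row fft
```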

  10. Not load balanced column fft? • Suppose block columns on each processor • To load balance or not to load balance, that is the question • Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost of moving the data which may or may not be worth the gain in load balancing (Diagram: block columns on processors P0, P1, P2.)

  11. 2d fft • Suppose block columns on each processor • Can do columns, transpose, rows, transpose • Can do transpose, rows, transpose, columns • Can we be fancier? (Diagram: block columns on processors P0, P1, P2.)
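Both orderings give the same 2d fft, as the MATLAB check below confirms, so a distributed implementation is free to pick whichever order moves less data:

```matlab
% The two orderings of slide 11 agree with fft2.
n = 8;
X = rand(n, n);
A = fft(fft(X, [], 1), [], 2);          % columns, then rows
B = fft(fft(X, [], 2), [], 1);          % rows, then columns
norm(A - fft2(X)) + norm(B - fft2(X))   % both differences are tiny
```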

  12. So much has to do with access to memory and data movement • The conventional wisdom is that it’s all about locality. This remains partially true and partially not quite as true as it used to be.

  13. http://www.cs.berkeley.edu/~samw/research/talks/sc07.pdf

  14. A peek inside an FFT (more later in the semester) • Time wasted on the telephone

  15. Tracing Back the data dependency

  16. New term for the day: MIMD • MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions. The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing. • One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD.)
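A small illustration of the MIMD flavor, assuming MATLAB’s Parallel Computing Toolbox is available (an assumption; the functions below belong to that toolbox): two workers execute entirely different code at the same time, rather than running lock-step.

```matlab
% MIMD flavor: each worker runs its own, different instruction stream.
pool = parpool(2);                                  % requires Parallel Computing Toolbox
f1 = parfeval(pool, @() fft(rand(1024, 1)), 1);     % worker 1: an FFT
f2 = parfeval(pool, @() sort(rand(1e6, 1)), 1);     % worker 2: a sort
y1 = fetchOutputs(f1);
y2 = fetchOutputs(f2);
delete(pool);
```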

  17. Importance of Abstractions • Ease of use requires that the very notion of a processor really be buried underneath the user • Some think that the very requirements of performance require the opposite • I am fairly sure the above bullet is more false than true – you can be the ones to figure this all out!
