
ECE 669 Parallel Computer Architecture Lecture 12 Interconnection Network Performance


Presentation Transcript


  1. ECE 669 Parallel Computer Architecture, Lecture 12: Interconnection Network Performance

  2. Performance Analysis of Interconnection Networks
• Bandwidth
• Latency
  • Proportional to diameter
  • Latency with contention
• Processor utilization
• + Physical constraints

  3. Bandwidth & Latency
Latency
• Average latency is proportional to the average distance traveled, n·kd hops, where kd is the average distance traveled in one dimension:
  • kd = (k − 1)/2 for an n-dimensional torus with unidirectional channels
  • kd ≈ k/4 for an n-dimensional torus with bidirectional channels
  • kd ≈ k/3 for an n-dimensional mesh (no end connections, bidirectional channels)
Bandwidth
• Bandwidth per node is more complex
  • Easy if all messages are near-neighbor
  • If the average travel distance is nk/3, then what?
  • Take the worst case for a direct network: average distance = nk/3 hops, message size = B flits (the per-node bandwidth this implies is derived on the next two slides)
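As a quick check on these averages, here is a minimal Python sketch (mine, not part of the original slides) that computes the per-dimension average distance kd by enumerating all source/destination offsets; the function and topology names are illustrative only.

```python
def avg_dist_per_dim(k, topology):
    """Average hops a message travels within a single dimension of radix k."""
    offsets = range(k)  # destination offset relative to the source, 0..k-1
    if topology == "torus_unidir":        # wraparound, one direction only
        dists = list(offsets)             # mean = (k - 1) / 2
    elif topology == "torus_bidir":       # wraparound, both directions
        dists = [min(d, k - d) for d in offsets]                  # mean ~ k / 4
    elif topology == "mesh_bidir":        # no end connections
        dists = [abs(a - b) for a in range(k) for b in range(k)]  # mean ~ k / 3
    else:
        raise ValueError(topology)
    return sum(dists) / len(dists)

k, n = 16, 3
for topo in ("torus_unidir", "torus_bidir", "mesh_bidir"):
    kd = avg_dist_per_dim(k, topo)
    print(f"{topo:13s}: kd = {kd:5.2f}, average hops n*kd = {n * kd:5.1f}")
```

For k = 16 this prints kd = 7.50, 4.00, and 5.31, matching (k − 1)/2, k/4, and approximately k/3.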

  4. Analogy
• If each student takes 8 years to graduate, and a Prof. can support 10 students max at any time, how many new students can the Prof. take on in a year? Each year take on 10/8 students.
• Similarly:
  • The network has Nn channels
  • # of flits it can sustain = Nn
  • # of msgs it can concurrently sustain = Nn/B
  • Each msg uses nk/3 channels on the way to its destination
  • So (Nn/B) · (3/(nk)) = 3N/(Bk) msgs per cycle network-wide, or 3/(kB) per node = BW per node
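To make the analogy concrete, with hypothetical numbers of my own: a professor supporting 10 students who each take 8 years can admit 10/8 = 1.25 new students per year. The same steady-state counting applied to a network with N = 256 nodes, n = 2, k = 16, and B = 8 flits: the network holds Nn = 512 flits, i.e. Nn/B = 64 messages in flight, each occupying its channels for about nk/3 ≈ 10.7 cycles, so roughly 6 new messages can enter per cycle network-wide, or 6/256 ≈ 0.023 messages per node per cycle, which is exactly 3/(kB) = 3/128.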

  5. Another way of getting BW
• Max # of msgs in the net at any time = Nn/B
• These take nk/3 cycles to get delivered, during which time no new msgs can get in
• I.e., we can inject Nn/B msgs every nk/3 cycles, or # injected per node per cycle = (Nn/B) · (3/(nk)) · (1/N) = 3/(Bk)
• Note: we have not considered contention thus far
• In practice, latency shoots up well before we reach the theoretical bandwidth, due to contention
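The same counting argument as a minimal Python sketch; the parameter names are mine, but the formula is exactly the one derived on this and the previous slide.

```python
def per_node_bandwidth(N, n, k, B):
    """Peak messages injected per node per cycle, ignoring contention."""
    flits_in_flight = N * n              # one flit per channel
    msgs_in_flight = flits_in_flight / B
    delivery_cycles = n * k / 3          # average hops per message
    network_rate = msgs_in_flight / delivery_cycles   # msgs entering per cycle
    return network_rate / N              # per node; simplifies to 3 / (k * B)

print(per_node_bandwidth(N=256, n=2, k=16, B=8))   # 0.0234375 = 3/128
```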

  6. What's a good performance metric to analyze networks?
• Possibilities:
  • Bandwidth
  • Latency
• Discuss
• Bandwidth: good when latency is not an issue, i.e., if we can overlap all communication with computation; an estimate of how much information we can transport
• Latency: good when not much parallelism exists (critical sections, e.g.), or when communication cannot be overlapped with computation
• Effective processor utilization is the best metric, but not always easy to derive: U = 1/(1 + req rate · L(U))
[The slide also plots latency L against bandwidth BW.]
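A small sketch (example numbers are mine) of how this utilization expression penalizes latency more heavily as the request rate grows, which is why neither bandwidth nor latency alone tells the whole story:

```python
def utilization(req_rate, latency):
    """Fraction of cycles doing useful work if every request stalls the processor."""
    return 1.0 / (1.0 + req_rate * latency)

# Low request rates tolerate high latency; high request rates do not.
for req_rate in (0.001, 0.01, 0.1):
    print(req_rate, [round(utilization(req_rate, L), 3) for L in (20, 50, 100)])
```

At a request rate of 0.001 even a 100-cycle latency keeps U above 0.9, while at 0.1 the same latency drops U below 0.1.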

  7. Performance Analysis
• First, analysis based on latency alone: taking contention into account, but ignoring technological constraints
• Such analysis is OK as long as we do not compare workloads with different traffic rates
• With no contention: latency T = hops + M + B
• With contention: latency T = (1 + c) hops + M + B, where c is the average contention delay (in cycles) at each switch
[Figure: a request travels from the processor through the network to the memory; the labeled delay components are the network hops, M, and the message length B.]
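A worked example with numbers of my own: with 10 hops, M = 20 cycles, and B = 8 flits, the uncontended latency is 10 + 20 + 8 = 38 cycles; if contention adds c = 0.5 cycles per switch, it becomes 1.5 × 10 + 20 + 8 = 43 cycles.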

  8. Direct k-ary n-cube latency (Agarwal paper)
• hops = n·kd, where kd = distance traveled in a dimension; for a unidirectional torus, kd = (k − 1)/2
• m = traffic rate [probability of a request per cycle]
• ρ = fraction of channel bandwidth consumed = m·B·kd
• c = contention delay at each switch, giving the k-ary n-cube latency
  T = n·kd · [1 + (ρ·B·(kd − 1)) / ((1 − ρ)·kd²) · (1 + 1/n)] + M + B
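Here is that latency model as a small Python function. The contention term is my reconstruction of the garbled slide equation, so it should be checked against the Agarwal paper; the remaining definitions follow the slide directly.

```python
def kary_ncube_latency(n, k, m, B, M):
    """Average message latency in a unidirectional k-ary n-cube, with contention.

    n, k : dimensions and radix      m : request probability per cycle
    B    : message length in flits   M : memory access delay in cycles
    """
    kd = (k - 1) / 2          # average distance per dimension, unidirectional torus
    rho = m * B * kd          # fraction of channel bandwidth consumed
    assert rho < 1, "offered traffic exceeds channel capacity"
    # Per-switch contention factor as reconstructed above (verify against the paper).
    c = (rho * B * (kd - 1)) / ((1 - rho) * kd ** 2) * (1 + 1 / n)
    return n * kd * (1 + c) + M + B

print(kary_ncube_latency(n=3, k=8, m=0.005, B=8, M=20))
```

With n = 3, k = 8, B = 8, M = 20 and m = 0.005, this predicts roughly 42 cycles versus an unloaded latency of n·kd + M + B = 38.5 cycles.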

  9. A better metric: processor utilization U
• Or: how network delay impacts overall performance
• U = fraction of time spent doing useful work
• Ideally, each instruction takes 1 cycle
• If the probability of a message (remote memory request) on each useful cycle is m, and the corresponding latency is T, then each instruction takes 1 + mT cycles, so U = 1/(1 + mT)
[Figure: processor P connected through the network to memory M; a remote request takes T cycles.]
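A worked example with numbers of my own: if m = 0.01 (one remote request per 100 useful cycles) and T = 50 cycles, each instruction effectively costs 1 + 0.01 × 50 = 1.5 cycles, so U = 1/1.5 ≈ 0.67.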

  10. Deriving U is not easy
• Notice: m = probability of a message on a useful processor cycle
• meff = m · U = probability of a msg on any cycle
• T = f(meff) = network delay as a function of the effective message rate
• The # of useful processor cycles depends on T, and meff depends on U: cyclic dependence!
[Figure: latency T vs. message rate m, rising steeply toward the BW limit; the iteration m0 → T0 → U0 → m1 = U0·m0 → T1 → U1 → m2 → … converges to consistent T and U.]
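In practice the cyclic dependence is resolved by iterating to a fixed point, as the figure's m0 → T0 → U0 → m1 → … sequence suggests. A minimal sketch; the delay model passed in at the bottom is a stand-in of my own, not the network model from these slides:

```python
def solve_utilization(m, T_of_meff, tol=1e-9, max_iters=1000):
    """Iterate U -> meff -> T -> U until the processor utilization converges."""
    U = 1.0
    for _ in range(max_iters):
        m_eff = m * U                 # probability of a message on ANY cycle
        T = T_of_meff(m_eff)          # network delay at that offered traffic
        U_new = 1.0 / (1.0 + m * T)   # utilization implied by that delay
        if abs(U_new - U) < tol:
            break
        U = U_new
    return U_new, T

# Stand-in delay model: 40-cycle unloaded latency that grows with offered traffic.
U, T = solve_utilization(m=0.02, T_of_meff=lambda meff: 40 / (1 - 25 * meff))
print(round(U, 3), round(T, 1))
```

Any monotonically increasing T(meff), including the k-ary n-cube latency from slide 8, can be plugged in for the stand-in model.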

  11. Processor utilization: k-ary n-cube
• m = probability of a msg on a useful cycle
• Channel utilization: ρ = m·B·kd·U
• Latency [1]: T = n·kd · [1 + (ρ·B·(kd − 1)) / ((1 − ρ)·kd²) · (1 + 1/n)] + M + B
• Processor utilization [2]: U = 1/(1 + mT)
• Two equations, [1] and [2]; two unknowns, T and U
• Solving the pair gives T in closed form (the root of a quadratic in T; see the paper for the full expression), where T0 = n·kd + M + B is the unloaded network latency
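Numerically, the closed form is not required: plugging the k-ary n-cube latency sketch from slide 8 into the fixed-point iteration from slide 10, with the channel utilization evaluated as ρ = m·B·kd·U on each pass, converges to a simultaneous solution of [1] and [2] for T and U.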

  12. Technology Constraints
• How do technological constraints impact network design and performance?
• Constraints:
  • Wire lengths limit the maximum speed of the clock
  • # of pins limits the size of nodes
  • Bisection width limits the # of wires per channel
• Software:
  • Can we compile to the architecture?
  • Can users specify programs?
  • Is it scalable?

  13. The performance equation becomes
• Latency: recall T = [(1 + c)·hops + B] × cycle time
• Now both the cycle time and the channel width depend on n:
  T = (1 + c)·hops·(Tswitch + Twire(n)) + B / W(n)
  where Tswitch is the switch delay, Twire(n) is the wire delay (a function of n, since wires lengthen as dimensionality grows), W(n) is the channel width (a function of the bisection width and of n), B is the msg length, and k ∝ N^(1/n)
• See paper for details.
• Summary: constraints favor small n.
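To see why these constraint terms favor small n, here is a toy comparison in Python. The scaling assumptions in it (channel width shrinking under a fixed bisection budget, wire delay growing when a high-dimensional network is laid out in two dimensions) are illustrative choices of my own, not figures from the paper:

```python
def latency_under_constraints(N, n, c=0.5, msg_bits=128, t_switch=1.0):
    """Toy model: hop term scaled by wire delay, serialization term by channel width."""
    k = N ** (1.0 / n)                    # radix needed to reach N nodes in n dimensions
    hops = n * k / 3                      # average hop count (mesh-style average distance)
    t_wire = max(1.0, k ** (n / 2 - 1))   # assumed wire-delay growth in a 2-D layout
    W = max(1.0, k / 2)                   # assumed channel width under a fixed bisection
    return (1 + c) * hops * (t_switch + t_wire) + msg_bits / W

for n in (2, 3, 4, 6, 8):
    print(n, round(latency_under_constraints(N=4096, n=n), 1))
```

With these assumptions the total latency for N = 4096 nodes is lowest around n = 2 or 3 and grows steadily at higher dimensionality, in line with the slide's summary.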
