CIS669 Distributed and Parallel Processing
Lecture 2: Parallel System Architectures and Performance Evaluation
Yuan Shi, Spring 2002
Parallel System Architectures
“Lacking the dignity of a proper discipline, [it] was an orphan in the world of knowledge. The subject became a rag-bag filled with odds and ends of knowledge and pseudo-knowledge, of Biblical dogmas, traveler's tales, and mythical imaginings.” [Boo83, p.100; textbook p.15]
Where are the flies?
• Computers can be built in many different ways for many different applications. Finding a common criterion for comparing architectures is VERY difficult.
• Once built, computational performance (speed) varies greatly from application to application. It is equally difficult to define a common criterion for measuring the “goodness” of a given architecture.
• Finally, programming difficulty varies from architecture to architecture. Each “parallel programming” environment dictates a specific programming style that is typically more complex than the serial programming interface.
Most Recent Examples
• Cilk (http://supertech.lcs.mit.edu/cilk/) MIT
• Passages (http://www.cis.udel.edu/~hiper/hiperspace/projects/gary.htm) Udel
• EARTH (http://www.capsl.udel.edu/CURRENTPROJ/EARTH/) Udel
First Battle: Fine vs. Coarse Grain Parallelism
• Fine grain Pros:
  • Large degree of parallelism (do many things at one time)
• Fine grain Cons:
  • Large communication overhead
  • Difficult programming model
  • Less reliable
Coarse Grain Parallelism
• Coarse Grain Pros:
  • Ease of programming
  • More reliable
  • Less communication overhead
• Coarse Grain Cons:
  • Lower degree of parallelism (do fewer things at the same time)
Question: How to determine the best degree of parallelism?
• Timing Models
• A timing model is a method for estimating the performance of a program running on any hardware architecture.
• A timing model can also be used to calculate the scalability of the program on that architecture, because scalability depends on both the hardware and the application.
Introduction to Timing Models
• Time Complexity: T(n) = O(f(n)) ==> the time to run the program on an input of size n is bounded above by f(n); it will take no more than (a constant multiple of) f(n) steps to process n inputs.
• Timing Models:
• Single Processor: Ts(n) = f(n)/W ==> the running time is approximately the estimated number of algorithmic steps f(n) divided by the single-processor power W (in algorithmic steps per second).
Timing Model for Multiprocessors
• T(n,p) = TCompute + TCommunication + TIO = f(n)/(pW) + g(n,p)/μ + k(n,p)/B
• T(n,p) = estimated running time for input size n and p processors
• g(n,p) = estimated communication volume (bytes)
• k(n,p) = estimated IO volume (bytes)
• W = single-processor power in algorithmic steps/second
• μ = interconnection network speed in bytes/second
• B = IO speed in bytes/second
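As a concrete illustration (not from the slides), the model can be written as a small function; all names and the example values below are assumptions:

```python
# Sketch of the timing model T(n,p) = f(n)/(p*W) + g(n,p)/mu + k(n,p)/B.
# f, g, k are the application's step/volume estimates; W, mu, B are the
# calibrated hardware parameters. Everything here is illustrative.
def t_parallel(n, p, f, g, k, W, mu, B):
    t_compute = f(n) / (p * W)   # computation, split across p processors
    t_comm = g(n, p) / mu        # traffic over the interconnection network
    t_io = k(n, p) / B           # input/output
    return t_compute + t_comm + t_io

# Example: f(n) = n^3, g(n,p) = 16*n*p, no IO, on assumed hardware speeds.
t = t_parallel(1000, 16, lambda n: n**3, lambda n, p: 16 * n * p,
               lambda n, p: 0, W=1e9, mu=1e7, B=1e8)
```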
How to obtain values for W, μ, and B?
• Each parameter represents a RANGE of values.
• Each parameter can be calibrated using computational experiments.
• Ts(n) = f(n)/W can be used to derive W = f(n)/Ts(n).
• Setting p = 2 and removing the IO part (easily done), T(n,p) can be used to derive μ = g(n,2)/(T(n,2) − f(n)/(2W)).
• Instrumenting the sequential source code yields k(n,p) and B easily.
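A hedged sketch of this calibration procedure; measured_Ts and measured_T2 stand for assumed wall-clock measurements (with the IO part already removed):

```python
# Calibration from computational experiments, following the formulas above.
def calibrate_W(f, n, measured_Ts):
    """W = f(n)/Ts(n), from a measured single-processor run."""
    return f(n) / measured_Ts

def calibrate_mu(f, g, n, W, measured_T2):
    """mu = g(n,2)/(T(n,2) - f(n)/(2W)), from a measured two-processor run."""
    return g(n, 2) / (measured_T2 - f(n) / (2 * W))
```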
Practical Example
• Matrix Multiplication: A × B ==> C.
• Assumptions:
  • Each matrix is n × n elements.
  • Each element is double precision (8 bytes).
• Timing Model (using the O(n³) algorithm):
  • Tp(n) = n³/(pW) + g(n,p)/μ + k(n,p)/B
  • Ignoring IO to simplify: Tp(n) = n³/(pW) + g(n,p)/μ
• Observation: If p = n², the system could be VERY FAST, since each dot-product is computed on an independent processor in parallel with the others. Degree of parallelism = n² (fine grain).
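A minimal sketch of this matrix-multiplication model as code; the values of W and μ are illustrative assumptions, not calibrated figures from the lecture:

```python
# Matrix-multiplication timing model, IO ignored: Tp(n) = n^3/(p*W) + g/mu.
W = 1e9    # assumed processor power: 1e9 algorithmic steps/second
mu = 1e7   # assumed network speed: 10 MB/second

def Tp(n, p, g):
    return n ** 3 / (p * W) + g / mu
```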
Quantitative Arguments for Coarse Grain Parallelism
• What about g(n,p)? A dot-product requires one row of A and one column of B ==> a minimum of 2 × 8 × n = 16n bytes per processor, or 16n³ bytes transmitted across the network when p = n².
• Comparing g(n,p)/μ with f(n)/(pW): the compute term f(n)/(pW) should be smaller (faster), since W (in GHz) is typically >> μ (in MBps); at fine grain, communication dominates the running time.
[Figure: processors P1 … Px connected through an interconnection network]
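Plugging the 16n-bytes-per-processor estimate into the model makes the argument concrete. A self-contained sketch with the same assumed W and μ as above:

```python
# Compare coarse-grain vs. fine-grain matrix multiplication under the
# timing model. W and mu are illustrative assumptions.
W, mu, n = 1e9, 1e7, 1000

def Tp(p):
    g = 16 * n * p                    # 16n bytes sent to each of p processors
    return n ** 3 / (p * W) + g / mu

print(f"coarse grain, p = 16:  {Tp(16):.3f} s")     # compute-dominated
print(f"fine grain, p = n^2:   {Tp(n * n):.1f} s")  # communication-dominated
```

With these assumed speeds, the fine-grain version spends orders of magnitude more time on communication than on computation, which is the quantitative case for coarse grain.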
Example II: Massively Parallel Potentials
• Fractal calculation involves solving massively many equations in the complex plane to produce the color indices (the number of iterations before a point diverges outside a pre-defined box: http://aleph0.clarku.edu/~djoyce/julia/explorer.html) that make a striking-looking image.
• Ref: http://www.cis.temple.edu/~shi.
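A minimal sketch of the escape-time iteration behind such images (here a Julia set, z ← z² + c); the constant c, the bound, and the iteration cap are illustrative choices. Every pixel is independent work, which is what makes the problem massively parallel:

```python
# Escape-time iteration for a Julia set. Each pixel's count is independent,
# so all pixels can be computed in parallel. Parameters are illustrative.
def color_index(z, c=complex(-0.8, 0.156), max_iter=256, bound=2.0):
    """Number of iterations until z escapes |z| > bound (the color index)."""
    for i in range(max_iter):
        if abs(z) > bound:
            return i
        z = z * z + c
    return max_iter

# One row of a 400-pixel-wide image; each call is independent work.
row = [color_index(complex(-2 + 4 * x / 400, 0.0)) for x in range(400)]
```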
Conclusions
• We need to calculate the proper degree of parallelism BEFORE implementing a software/hardware solution.
• Hardware technologies are advancing rapidly; we need a generic architecture platform that leverages hardware advances without sacrificing programmability.
A Few Finer Points
• The ring must be slotted and unidirectional. This allows multiple stations to transmit at the same time.
• The ring must be redundant, so that the loss of a single processor cannot break it.
• The result: