
Tuesday, September 19, 2006





Presentation Transcript


  1. Tuesday, September 19, 2006 The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around. - Numerical Recipes, C Edition

  2. Reference Material
     • Lectures 1 & 2
       • “Parallel Computer Architecture” by David Culler et al., Chapter 1.
       • “Sourcebook of Parallel Computing” by Jack Dongarra et al., Chapters 1 and 2.
       • “Introduction to Parallel Computing” by Grama et al., Chapter 1 and Chapter 2 §2.4.
       • www.top500.org
     • Lecture 3
       • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.3.
       • Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/
     • Lectures 4 & 5
       • “Techniques for Optimizing Applications” by Garg et al., Chapter 9.
       • “Software Optimizations for High Performance Computing” by Wadleigh et al., Chapter 5.
       • “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.1–2.2.

  3. Software Optimizations • Optimize serial code before parallelizing it.

  4. Loop Unrolling
     do i=1,n
       A(i)=B(i)
     enddo

     Unrolled by 4 (assumption: n is divisible by 4; a C sketch with a cleanup loop for the general case follows this slide):
     do i=1,n,4
       A(i)=B(i)
       A(i+1)=B(i+1)
       A(i+2)=B(i+2)
       A(i+3)=B(i+3)
     enddo

     • Some compilers allow users to specify the unrolling depth.
     • Avoid excessive unrolling: register pressure and spills can hurt performance.
     • Enables pipelining to hide instruction latencies.
     • Reduces the overhead of index increment and the conditional check.
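The same idea as a minimal C sketch that drops the divisibility assumption by adding a scalar cleanup loop (the function name copy_unrolled and the element type double are illustrative, not from the slides):

     #include <stddef.h>

     /* Copy b into a, unrolled by 4, with a cleanup loop for the
        n % 4 leftover elements, so n need not be divisible by 4. */
     void copy_unrolled(double *a, const double *b, size_t n)
     {
         size_t i;
         size_t nend = n - (n % 4);   /* largest multiple of 4 <= n */

         for (i = 0; i < nend; i += 4) {
             a[i]     = b[i];
             a[i + 1] = b[i + 1];
             a[i + 2] = b[i + 2];
             a[i + 3] = b[i + 3];
         }
         for (; i < n; i++)           /* remainder, handled one element at a time */
             a[i] = b[i];
     }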

  5. Loop Unrolling
     do j=1 to N
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
       enddo
     enddo

     Unroll the outer loop by 2.

  6. Loop Unrolling
     do j=1 to N
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
       enddo
     enddo

     do j=1 to N step 2
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
         Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
       enddo
     enddo

  7. Loop Unrolling
     do j=1 to N
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
       enddo
     enddo

     do j=1 to N step 2
       do i = 1 to N
         Z[i,j]=Z[i,j]+X[i]*Y[j]
         Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
       enddo
     enddo

     The number of load operations is reduced, e.g. half as many loads of X.

  8. Loop Fusion • Beneficial in loop-intensive programs. • Decreases index calculation overhead. • Can also help with instruction-level parallelism. • Beneficial if the same data structures are used in different loops.

  9. Loop Fusion
     for (i=0; i<n; i++)
       temp[i] = x[i]*y[i];
     for (i=0; i<n; i++)
       z[i] = w[i]+temp[i];

  10. Loop Fusion
      for (i=0; i<n; i++)
        temp[i] = x[i]*y[i];
      for (i=0; i<n; i++)
        z[i] = w[i]+temp[i];

      for (i=0; i<n; i++)
        z[i] = x[i]*y[i]+w[i];

      Check for register pressure before fusing.

  11. Loop Fission • Conditional statements can hurt pipelining. • Split the loop into two: one with the conditional statements and one without. • The compiler can then apply optimizations such as unrolling to the condition-free loop. • Also beneficial for fat loops that may lead to register spills.

  12. Loop Fission
      for (i=0; i<nodes; i++) {
        a[i] = a[i]*small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime*ratinpmt);
        temp1[i] = dtime*relaxn;
        if (temp1[i] > hgreat) {
          temp1[i] = 1;
        }
      }

  13. Loop Fission
      for (i=0; i<nodes; i++) {
        a[i] = a[i]*small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime*ratinpmt);
        temp1[i] = dtime*relaxn;
        if (temp1[i] > hgreat) {
          temp1[i] = 1;
        }
      }

      for (i=0; i<nodes; i++) {
        a[i] = a[i]*small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime*ratinpmt);
        temp1[i] = dtime*relaxn;
      }
      for (i=0; i<nodes; i++) {
        if (temp1[i] > hgreat) {
          temp1[i] = 1;
        }
      }

  14. Reductions
      for (i=0; i<n; i++) {
        sum += x[i];
      }

      Normally a single register would be used for the reduction variable. How can we hide the floating point instruction latency?

  15. Reductions
      for (i=0; i<n; i++) {
        sum += x[i];
      }

      sum1 = sum2 = sum3 = sum4 = 0.0;
      nend = (n>>2)<<2;               /* largest multiple of 4 <= n */
      for (i=0; i<nend; i+=4) {
        sum1 += x[i];
        sum2 += x[i+1];
        sum3 += x[i+2];
        sum4 += x[i+3];
      }
      sumx = sum1 + sum2 + sum3 + sum4;
      for (i=nend; i<n; i++)
        sumx += x[i];

  16. a**0.5 vs sqrt(a)

  17. a**0.5 vs sqrt(a) • Appropriate include files, e.g. math.h, can help the compiler generate faster code.
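A minimal C sketch of the comparison (a**0.5 in Fortran corresponds to pow(a, 0.5) in C). The exact speed difference depends on the compiler and math library, but sqrt() usually maps to a hardware square-root instruction while pow(a, 0.5) typically goes through the general power routine; including math.h gives the compiler the correct prototypes so it can substitute its built-in versions:

     #include <math.h>    /* declares sqrt() and pow(); enables compiler built-ins */
     #include <stdio.h>

     int main(void)
     {
         double a = 2.0;

         double s1 = pow(a, 0.5);   /* general power routine */
         double s2 = sqrt(a);       /* usually a single hardware instruction */

         printf("%f %f\n", s1, s2); /* same value, typically very different cost */
         return 0;
     }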

  18. The time to access memory has not kept pace with CPU clock speeds. • A program's performance can be suboptimal because the data needed for an operation has not been delivered from memory to the registers by the time the processor is ready to use it. • Wasted CPU cycles: CPU starvation.

  19. The ability of the memory system to feed data to the processor is characterized by: • Memory latency • Memory bandwidth

  20. Effect of Memory Latency • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns (no caches) • Memory block: 1 word • Peak processor rating?

  21. Effect of Memory Latency • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns (no caches) • Memory block: 1 word • Peak processor rating: 4 GFLOPS

  22. Effect of Memory Latency • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns (no caches) • Memory block: 1 word • Peak processor rating: 4 GFLOPS • Dot product of two vectors • Peak speed of computation?

  23. Effect of Memory Latency • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns (no caches) • Memory block: 1 word • Peak processor rating: 4 GFLOPS • Dot product of two vectors • Peak speed of computation? One floating point operation every 100 ns, i.e. a speed of 10 MFLOPS.

  24. Effect of Memory Latency: Introduce Cache • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns • Memory block: 1 word • Cache: 32 KB with 1 ns latency • Multiply two matrices A and B of 32x32 words with the result in C. (Note: the previous example had no data reuse.) • Assume ideal cache placement and enough capacity to hold A, B and C.

  25. Effect of Memory Latency: Introduce Cache • Multiply two matrices A and B of 32x32 words with result in C • 32x32 = 1K words • Total operations and total time taken?

  26. Effect of Memory Latency: Introduce Cache • Multiply two matrices A and B of 32x32 words with the result in C • 32x32 = 1K words • Total operations and total time taken? • Two matrices = 2K words to fetch • Multiplying two matrices requires 2n³ operations

  27. Effect of Memory Latency: Introduce Cache • Multiply two matrices A and B of 32x32 words with the result in C • 32x32 = 1K words • Fetching the two input matrices = 2K words, requiring 2K × 100 ns = 200 µs • Multiplying two matrices requires 2n³ operations = 2 × 32³ = 64K operations • At 4 operations per cycle we need 64K/4 cycles = 16 µs • Total time = 200 + 16 µs • Computation rate = 64K operations / (200 + 16 µs) ≈ 303 MFLOPS

  28. Effect of Memory Bandwidth • 1 GHz processor (1 ns clock) • Capable of executing 4 instructions in each cycle of 1 ns • DRAM with latency 100 ns • Memory block: 4 words • Cache: 32 KB with 1 ns latency • Dot product example again • Bandwidth increased four-fold (see the back-of-the-envelope check below)
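A rough back-of-the-envelope check, assuming each 100 ns access now delivers a full 4-word block and consecutive elements of each vector lie in the same block:

      8 words of data (4 of x, 4 of y)  ->  2 accesses × 100 ns = 200 ns
      4 multiply-add pairs              ->  8 floating point operations
      8 flops / 200 ns                  ≈  40 MFLOPS

So the four-fold increase in effective bandwidth lifts the achievable dot-product rate from about 10 MFLOPS to roughly 40 MFLOPS.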

  29. Reduce cache misses. • Spatial locality • Temporal locality

  30. Impact of strided access
      for (i=0; i<1000; i++) {
        column_sum[i] = 0.0;
        for (j=0; j<1000; j++)
          column_sum[i] += b[j][i];
      }

  31. Eliminating strided access
      for (i=0; i<1000; i++)
        column_sum[i] = 0.0;
      for (j=0; j<1000; j++)
        for (i=0; i<1000; i++)
          column_sum[i] += b[j][i];

      Assumption: the vector column_sum is retained in the cache.

  32. do i = 1, N
        do j = 1, N
          A[i] = A[i] + B[j]
        enddo
      enddo

      N is large, so B[j] cannot remain in cache until it is used again in another iteration of the outer loop: little reuse between touches.
      How many cache misses for A and B? (A rough count follows below.)
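One way to count the misses, assuming the simple one-word-per-block memory model of the earlier slides (so spatial locality is ignored):

      A: A[i] stays resident for the whole inner loop
         ->  about 1 miss per outer iteration  =  about N misses in total
      B: the whole of B is streamed through in every inner loop and evicted in between
         ->  about N misses per outer iteration =  about N² misses in total

With a cache line of L words these counts become roughly N/L and N²/L respectively.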
