Implementing a program to print the prime numbers from 1 to 10^10 on a ten-processor multiprocessor, aiming for a ten-fold speedup. The challenge lies in load balancing and distributing work across threads, which motivates dynamic load balancing. Several versions of the per-thread procedure and of a shared counter are explored, focusing on thread synchronization and mutual exclusion. The discussion then turns to Amdahl's Law and how the sequential fraction of a program limits speedup. Finally, hardware support for read-modify-write operations is highlighted, along with the importance of fine-grained parallelism for multicore scaling.
Concurrency idea
Challenge: print the primes from 1 to 10^10
Given: a ten-processor multiprocessor, one thread per processor
Goal: get ten-fold speedup (or close)
Load Balancing
Split the work evenly: each thread tests a range of 10^9 numbers
[Figure: the range 1..10^10 divided into ten blocks of 10^9, one per processor P0 … P9]
Procedure for Thread i

void primePrint() {
  int i = ThreadID.get();                                        // IDs in {0..9}
  for (long j = i * 1_000_000_000L + 1; j < (i + 1) * 1_000_000_000L; j++) {  // a range of 10^9
    if (isPrime(j)) print(j);
  }
}
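As a concrete illustration, a minimal runnable sketch of this static split, assuming the thread ID is passed in explicitly rather than obtained from the slides' ThreadID helper, and assuming a simple trial-division isPrime:

public class RangeSplitPrimes {
    static final long RANGE = 1_000_000_000L;   // 10^9 numbers per thread

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[10];
        for (int t = 0; t < 10; t++) {
            final int i = t;                     // this thread's ID, 0..9
            workers[t] = new Thread(() -> {
                for (long j = i * RANGE + 1; j < (i + 1) * RANGE; j++) {
                    if (isPrime(j)) System.out.println(j);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }

    // Simple trial-division test; adequate for illustration, not tuned.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long d = 3; d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }
}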
Issues
Higher ranges have fewer primes, yet larger numbers are harder to test
Thread workloads are uneven and hard to predict
Issues
Higher ranges have fewer primes, yet larger numbers are harder to test
Thread workloads are uneven and hard to predict
We need dynamic load balancing; the static split is rejected
Shared Counter
Each thread takes the next number from a shared counter
[Figure: a pad of numbers 17, 18, 19, … handed out one at a time]
Procedure for Thread i

Counter counter = new Counter(1);   // shared counter object

void primePrint() {
  long j = 0;
  while (j < 10_000_000_000L) {     // 10^10
    j = counter.getAndIncrement();
    if (isPrime(j)) print(j);
  }
}
Where Things Reside
[Figure: processors with private caches connected by a bus to shared memory; each thread's local variables and code are private, while the single shared counter lives in shared memory]
Procedure for Thread i

Counter counter = new Counter(1);

void primePrint() {
  long j = 0;
  while (j < 10_000_000_000L) {     // stop when every value has been taken
    j = counter.getAndIncrement();  // increment & return each new value
    if (isPrime(j)) print(j);
  }
}
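A minimal runnable sketch of the same idea, assuming java.util.concurrent.atomic.AtomicLong stands in for the slides' Counter class and reusing the trial-division isPrime from before:

import java.util.concurrent.atomic.AtomicLong;

// Dynamic load balancing via a shared atomic counter.
public class PrimePrinter {
    static final long LIMIT = 10_000_000_000L;            // 10^10
    static final AtomicLong counter = new AtomicLong(1);  // next candidate number

    static void primePrint() {
        long j = 0;
        while (j < LIMIT) {
            j = counter.getAndIncrement();  // each thread takes a fresh number
            if (isPrime(j)) System.out.println(j);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[10];
        for (int t = 0; t < 10; t++) {
            workers[t] = new Thread(PrimePrinter::primePrint);
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }

    // Same trial-division test as in the earlier sketch.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long d = 3; d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }
}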
Counter Implementation

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

OK for a single thread, not for concurrent threads
What It Means

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

value++ is really three steps:
  temp = value;
  value = temp + 1;
  return temp;
Not so good…
[Timeline: the counter starts at 1; one thread reads 1 and writes 2, then reads 2 and writes 3; meanwhile the other thread, which also read 1, writes 2 — after three calls the value is 2 instead of 3, so an increment is lost]
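A small harness (not from the slides) makes the lost update visible: two threads each call the unsynchronized getAndIncrement a million times, and the final value usually falls short of the expected 2,000,000:

// Demonstrating lost updates with the unsynchronized Counter.
public class LostUpdateDemo {
    static class Counter {
        private long value;
        public long getAndIncrement() { return value++; }  // not atomic
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) c.getAndIncrement();
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 2000000; typically prints less, because reads and
        // writes from the two threads interleave as in the timeline above.
        System.out.println(c.value);
    }
}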
Is this problem inherent?
[Figure: one thread's write falls between another thread's read and write, and vice versa]
If we could only glue reads and writes together…
Challenge

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps atomic (indivisible)
Hardware Solution

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp = value;
    value = temp + 1;
    return temp;
  }
}

A ReadModifyWrite() instruction
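In Java, such hardware read-modify-write operations surface through java.util.concurrent.atomic; a minimal sketch of the counter on top of AtomicLong (an addition here, not the slides' code):

import java.util.concurrent.atomic.AtomicLong;

// The counter built on an atomic read-modify-write primitive.
public class AtomicCounter {
    private final AtomicLong value;

    public AtomicCounter(long start) {
        value = new AtomicLong(start);
    }

    public long getAndIncrement() {
        return value.getAndIncrement();  // a single atomic fetch-and-add
    }
}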
An Aside: Java™

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp;
    synchronized (this) {      // synchronized block: mutual exclusion
      temp = value;
      value = temp + 1;
    }
    return temp;
  }
}
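For comparison with the lost-update run above, the same two-thread experiment against the synchronized counter (a sketch under the same assumptions) now reaches the expected total:

// The synchronized counter under the same two-thread load.
public class SynchronizedCounterDemo {
    static class Counter {
        private long value;
        public long getAndIncrement() {
            long temp;
            synchronized (this) {   // mutual exclusion around the three steps
                temp = value;
                value = temp + 1;
            }
            return temp;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) c.getAndIncrement();
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(c.getAndIncrement());  // prints 2000000
    }
}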
Why do we care?
We want as much of the code as possible to execute concurrently (in parallel)
A larger sequential part implies reduced performance
Amdahl's Law: this relation is not linear…
Amdahl's Law

Speedup = 1 / ((1 - p) + p/n)

the speedup of a computation given n CPUs instead of 1, where p is the parallel fraction, 1 - p the sequential fraction, and n the number of processors.
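A tiny helper (an assumption added here, not part of the slides) that evaluates the formula and reproduces the numbers in the examples that follow:

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
public class Amdahl {
    // p = parallel fraction, n = number of processors
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.60, 10));  // ≈ 2.17
        System.out.println(speedup(0.80, 10));  // ≈ 3.57
        System.out.println(speedup(0.90, 10));  // ≈ 5.26
        System.out.println(speedup(0.99, 10));  // ≈ 9.17
    }
}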
Examples
Ten processors: how close to 10-fold speedup?
60% concurrent, 40% sequential: Speedup = 1 / (0.4 + 0.6/10) = 2.17
80% concurrent, 20% sequential: Speedup = 1 / (0.2 + 0.8/10) = 3.57
90% concurrent, 10% sequential: Speedup = 1 / (0.1 + 0.9/10) = 5.26
99% concurrent, 1% sequential: Speedup = 1 / (0.01 + 0.99/10) = 9.17
Back to Real-World Multicore Scaling
[Chart: user-code speedup of roughly 1.8x, 2x, and 2.9x as cores are added]
Speedup flattens because the sequential % of the code is not being reduced
Fine-grained parallelism has a huge performance benefit
The reason we get only 2.9x speedup
[Figure: many cores accessing shared data structures; execution is 25% shared, 75% unshared, and the shared part can be synchronized in a coarse-grained or a fine-grained way]
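To make the coarse- vs fine-grained contrast concrete, a sketch (not from the slides) of the same shared data protected either by one global lock or by one lock per slot:

import java.util.concurrent.locks.ReentrantLock;

// Coarse-grained vs fine-grained locking over the same shared data.
public class Slots {
    private final long[] slots = new long[1024];

    // Coarse-grained: every update serializes on a single lock.
    private final ReentrantLock global = new ReentrantLock();
    public void incrementCoarse(int i) {
        global.lock();
        try { slots[i]++; } finally { global.unlock(); }
    }

    // Fine-grained: updates to different slots can proceed in parallel.
    private final ReentrantLock[] perSlot = new ReentrantLock[1024];
    { for (int i = 0; i < perSlot.length; i++) perSlot[i] = new ReentrantLock(); }
    public void incrementFine(int i) {
        perSlot[i].lock();
        try { slots[i]++; } finally { perSlot[i].unlock(); }
    }
}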
Multiprocessor Programming
This is what this course is about…
The % that is not easy to make concurrent, yet may have a large impact on overall speedup