Implementing a program to print the prime numbers from 1 to 10^10 on a ten-processor multiprocessor, aiming for a ten-fold speedup. The challenge lies in load balancing and distributing work across threads, which motivates dynamic load balancing. Several versions of the per-thread procedure and of a shared counter are explored, focusing on thread synchronization and mutual exclusion. The discussion then turns to Amdahl's Law and how the sequential fraction of a program limits speedup. Finally, hardware support for read-modify-write operations is highlighted, along with the importance of fine-grained parallelism for multicore scaling.
Concurrency idea
Challenge: print the primes from 1 to 10^10
Given: a ten-processor multiprocessor, one thread per processor
Goal: get ten-fold speedup (or close)
Load Balancing
Split the work evenly: each thread tests a range of 10^9 numbers
[Figure: the range 1..10^10 divided into ten blocks of 10^9, one per processor P0 … P9]
Procedure for Thread i

void primePrint() {
  int i = ThreadID.get();                                        // IDs in {0..9}
  for (long j = i * 1_000_000_000L + 1; j < (i + 1) * 1_000_000_000L; j++) {  // a range of 10^9
    if (isPrime(j)) print(j);
  }
}
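As a concrete illustration, a minimal runnable sketch of this static split, assuming the thread ID is passed in explicitly rather than obtained from the slides' ThreadID helper, and assuming a simple trial-division isPrime:

public class RangeSplitPrimes {
    static final long RANGE = 1_000_000_000L;   // 10^9 numbers per thread

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[10];
        for (int t = 0; t < 10; t++) {
            final int i = t;                     // this thread's ID, 0..9
            workers[t] = new Thread(() -> {
                for (long j = i * RANGE + 1; j < (i + 1) * RANGE; j++) {
                    if (isPrime(j)) System.out.println(j);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }

    // Simple trial-division test; adequate for illustration, not tuned.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long d = 3; d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }
}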
Issues
Higher ranges have fewer primes, yet larger numbers are harder to test
Thread workloads are uneven and hard to predict
Issues
Higher ranges have fewer primes, yet larger numbers are harder to test
Thread workloads are uneven and hard to predict
We need dynamic load balancing; the static split is rejected
Shared Counter
Each thread takes the next number from a shared counter
[Figure: a pad of numbers 17, 18, 19, … handed out one at a time]
Procedure for Thread i

Counter counter = new Counter(1);   // shared counter object

void primePrint() {
  long j = 0;
  while (j < 10_000_000_000L) {     // 10^10
    j = counter.getAndIncrement();
    if (isPrime(j)) print(j);
  }
}
Where Things Reside
[Figure: processors with private caches connected by a bus to shared memory; each thread's local variables and code are private, while the single shared counter lives in shared memory]
Procedure for Thread i

Counter counter = new Counter(1);

void primePrint() {
  long j = 0;
  while (j < 10_000_000_000L) {     // stop when every value has been taken
    j = counter.getAndIncrement();  // increment & return each new value
    if (isPrime(j)) print(j);
  }
}
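A minimal runnable sketch of the same idea, assuming java.util.concurrent.atomic.AtomicLong stands in for the slides' Counter class and reusing the trial-division isPrime from before:

import java.util.concurrent.atomic.AtomicLong;

// Dynamic load balancing via a shared atomic counter.
public class PrimePrinter {
    static final long LIMIT = 10_000_000_000L;            // 10^10
    static final AtomicLong counter = new AtomicLong(1);  // next candidate number

    static void primePrint() {
        long j = 0;
        while (j < LIMIT) {
            j = counter.getAndIncrement();  // each thread takes a fresh number
            if (isPrime(j)) System.out.println(j);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[10];
        for (int t = 0; t < 10; t++) {
            workers[t] = new Thread(PrimePrinter::primePrint);
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }

    // Same trial-division test as in the earlier sketch.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long d = 3; d * d <= n; d += 2)
            if (n % d == 0) return false;
        return true;
    }
}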
Counter Implementation

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

OK for a single thread, not for concurrent threads
What It Means

public class Counter {
  private long value;
  public long getAndIncrement() {
    return value++;
  }
}

value++ is really three steps:
  temp = value;
  value = temp + 1;
  return temp;
Not so good…
[Timeline: the counter starts at 1; one thread reads 1 and writes 2, then reads 2 and writes 3; meanwhile the other thread, which also read 1, writes 2 — after three calls the value is 2 instead of 3, so an increment is lost]
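A small harness (not from the slides) makes the lost update visible: two threads each call the unsynchronized getAndIncrement a million times, and the final value usually falls short of the expected 2,000,000:

// Demonstrating lost updates with the unsynchronized Counter.
public class LostUpdateDemo {
    static class Counter {
        private long value;
        public long getAndIncrement() { return value++; }  // not atomic
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) c.getAndIncrement();
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // Expected 2000000; typically prints less, because reads and
        // writes from the two threads interleave as in the timeline above.
        System.out.println(c.value);
    }
}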
Is this problem inherent?
[Figure: one thread's write falls between another thread's read and write, and vice versa]
If we could only glue reads and writes together…
Challenge

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp = value;
    value = temp + 1;
    return temp;
  }
}

Make these steps atomic (indivisible)
Hardware Solution

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp = value;
    value = temp + 1;
    return temp;
  }
}

A ReadModifyWrite() instruction
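In Java, such hardware read-modify-write operations surface through java.util.concurrent.atomic; a minimal sketch of the counter on top of AtomicLong (an addition here, not the slides' code):

import java.util.concurrent.atomic.AtomicLong;

// The counter built on an atomic read-modify-write primitive.
public class AtomicCounter {
    private final AtomicLong value;

    public AtomicCounter(long start) {
        value = new AtomicLong(start);
    }

    public long getAndIncrement() {
        return value.getAndIncrement();  // a single atomic fetch-and-add
    }
}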
An Aside: Java™

public class Counter {
  private long value;
  public long getAndIncrement() {
    long temp;
    synchronized (this) {      // synchronized block: mutual exclusion
      temp = value;
      value = temp + 1;
    }
    return temp;
  }
}
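For comparison with the lost-update run above, the same two-thread experiment against the synchronized counter (a sketch under the same assumptions) now reaches the expected total:

// The synchronized counter under the same two-thread load.
public class SynchronizedCounterDemo {
    static class Counter {
        private long value;
        public long getAndIncrement() {
            long temp;
            synchronized (this) {   // mutual exclusion around the three steps
                temp = value;
                value = temp + 1;
            }
            return temp;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) c.getAndIncrement();
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(c.getAndIncrement());  // prints 2000000
    }
}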
Why do we care?
We want as much of the code as possible to execute concurrently (in parallel)
A larger sequential part implies reduced performance
Amdahl's Law: this relation is not linear…
Amdahl's Law

Speedup = 1 / ((1 - p) + p/n)

the speedup of a computation given n CPUs instead of 1, where p is the parallel fraction, 1 - p the sequential fraction, and n the number of processors.
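A tiny helper (an assumption added here, not part of the slides) that evaluates the formula and reproduces the numbers in the examples that follow:

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
public class Amdahl {
    // p = parallel fraction, n = number of processors
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.60, 10));  // ≈ 2.17
        System.out.println(speedup(0.80, 10));  // ≈ 3.57
        System.out.println(speedup(0.90, 10));  // ≈ 5.26
        System.out.println(speedup(0.99, 10));  // ≈ 9.17
    }
}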
Examples
Ten processors: how close to 10-fold speedup?
60% concurrent, 40% sequential: Speedup = 1 / (0.4 + 0.6/10) = 2.17
80% concurrent, 20% sequential: Speedup = 1 / (0.2 + 0.8/10) = 3.57
90% concurrent, 10% sequential: Speedup = 1 / (0.1 + 0.9/10) = 5.26
99% concurrent, 1% sequential: Speedup = 1 / (0.01 + 0.99/10) = 9.17
Back to Real-World Multicore Scaling
[Chart: user-code speedup of roughly 1.8x, 2x, and 2.9x as cores are added]
Speedup flattens because the sequential % of the code is not being reduced
Fine-grained parallelism has a huge performance benefit
The reason we get only 2.9x speedup
[Figure: many cores accessing shared data structures; execution is 25% shared, 75% unshared, and the shared part can be synchronized in a coarse-grained or a fine-grained way]
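To make the coarse- vs fine-grained contrast concrete, a sketch (not from the slides) of the same shared data protected either by one global lock or by one lock per slot:

import java.util.concurrent.locks.ReentrantLock;

// Coarse-grained vs fine-grained locking over the same shared data.
public class Slots {
    private final long[] slots = new long[1024];

    // Coarse-grained: every update serializes on a single lock.
    private final ReentrantLock global = new ReentrantLock();
    public void incrementCoarse(int i) {
        global.lock();
        try { slots[i]++; } finally { global.unlock(); }
    }

    // Fine-grained: updates to different slots can proceed in parallel.
    private final ReentrantLock[] perSlot = new ReentrantLock[1024];
    { for (int i = 0; i < perSlot.length; i++) perSlot[i] = new ReentrantLock(); }
    public void incrementFine(int i) {
        perSlot[i].lock();
        try { slots[i]++; } finally { perSlot[i].unlock(); }
    }
}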
Multiprocessor Programming
This is what this course is about…
The % that is not easy to make concurrent, yet may have a large impact on overall speedup