Big Data Technologies Lecture 3: Algorithm Parallelization

# Big Data Technologies Lecture 3: Algorithm Parallelization



##### Presentation Transcript

1. Big Data Technologies Lecture 3: Algorithm Parallelization Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro

2. Conceptually • Master-slave model • Master process • Starts a number of client processes (slaves) on other cores/CPUs/machines • Communicates with them • Sends data for processing and receives answers • Ensures all data are received and continues execution

3. Can we parallelize the algorithm? • Code • Understand the sequential code (if it exists) • Identify critical points • Where are the computationally heavy code lines? • Profiling • Parallelize code only where intense computation is found (Amdahl's law) • Where are the bottlenecks? • Are there areas of slow code? • I/O • Use parallel optimized libraries • IBM ESSL, Intel MKL, AMD ACML, etc. • If several parallel versions of the same algorithm exist, ALL must be studied! • Data • Any data dependencies? • Can they be removed? • Can data be partitioned?
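Amdahl's law, referenced on this slide, bounds the speedup that parallelizing only part of a program can give. A minimal sketch, assuming a fraction `parallel_fraction` of the sequential runtime can be spread perfectly over `workers` processors (the function name is illustrative, not from the lecture):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the runtime can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; only the parallel part shrinks with more workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

Under this model a program that is 90% parallelizable can never run more than 10 times faster, however many workers are added, which is why profiling for the heavy code lines comes first.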

4. Examples • Potential energy of each molecule. Find the minimum energy configuration • Each computation can be done in parallel • Search for minimum energy can also be parallelized • Parallel search • Fibonacci • F(n) depends on F(n-1) and F(n-2) • Cannot be parallelized as written: each term needs the two previous ones
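The minimum-energy example can be sketched as independent evaluations followed by a reduction over local minima. The `energy` function below is a hypothetical stand-in, and threads are used only to keep the sketch short (in CPython, CPU-bound work would normally use processes instead):

```python
from concurrent.futures import ThreadPoolExecutor


def energy(configuration: float) -> float:
    """Hypothetical stand-in for a potential-energy computation."""
    return (configuration - 3.0) ** 2 + 1.0


def chunk_minimum(chunk):
    """Each worker evaluates its own chunk and returns (energy, configuration)."""
    return min((energy(c), c) for c in chunk)


def parallel_minimum(configurations, workers=4):
    """Split the inputs, find local minima in parallel, then reduce them."""
    chunks = [configurations[i::workers] for i in range(workers)]
    chunks = [c for c in chunks if c]  # drop empty chunks
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_minima = list(pool.map(chunk_minimum, chunks))
    return min(local_minima)  # final reduction step


if __name__ == "__main__":
    print(parallel_minimum([x * 0.5 for x in range(20)]))
```

No worker needs a result from another until the final `min`, which is what makes both the evaluation and the search parallelizable.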

5. Automatic vs. manual parallelization • The process of parallelizing code is complex, iterative and error prone, requiring time until an efficient solution is found • Some compilers can parallelize code (pre-processing) • Automated • Compiler identifies parallel code sections as well as bottleneck areas • Offers an analysis of the benefits of parallelization • The targets are the iterative statements (do, for) • Programmer oriented • Use compilation directives and execution flags • The programmer is in charge of finding the parallel sections • They are mostly used on shared memory devices • OpenMP • BUT • Performance can degrade • Less flexible than manual parallelization • Limited to certain code sections (loops) • Not all sequential code is parallelizable as is

6. Designing parallel algorithms • Partitioning • Communication • Synchronization • Data dependencies • Load balancing • Granularity • I/O • Debugging • Performance analysis and optimization

7. Problem and data partitioning • The first step in any parallel problem is to partition it so that it can be handled by multiple parallel processes • Data partitioning (domain) • Algorithm partitioning (functional)

8. Domain partitioning
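Domain partitioning often starts by splitting an index range into contiguous blocks, one per process. A minimal sketch, assuming a one-dimensional domain and a block distribution (the helper name `block_partition` is illustrative):

```python
def block_partition(n_items: int, n_parts: int):
    """Return (start, stop) index ranges that split n_items as evenly as possible."""
    if n_parts < 1:
        raise ValueError("n_parts must be >= 1")
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts take one additional item each.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    print(block_partition(10, 3))
```

Block sizes differ by at most one item, so no process gets noticeably more data than another.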

9. Functional partitioning

10. Interprocess communication • Do we need communication? • Embarrassingly parallel problems • Example: image negative • Problems where we need information from neighbors • Example: 2D heat dissipation

11. Interprocess communication • Communication overhead • Resources are used to store and send data • Communication requires synchronization • A process will wait for the other to send data • Network limitation • Latency vs. bandwidth • Latency: time needed to send information from point A to point B • Microseconds • Bandwidth: quantity of information sent per time unit • MB/s, GB/s • Many small messages make latency dominant → pack them together into a single message • Communication visibility • In MPI messages are visible and under the programmer's control • The Data Parallel model hides communication details (MapReduce)
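The advice to pack small messages can be checked with a simple cost model: time per message = latency + size / bandwidth. The sketch below assumes that linear model and a hypothetical network (2 microseconds latency, 1 GB/s bandwidth). Neither figure comes from the lecture:

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to deliver one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def total_time(n_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Time to deliver n equally sized messages one after another."""
    return n_messages * transfer_time(bytes_per_message, latency_s, bandwidth_bytes_per_s)


# Hypothetical network parameters (assumptions for illustration only).
LATENCY_S = 2e-6
BANDWIDTH = 1e9

many_small = total_time(1000, 100, LATENCY_S, BANDWIDTH)   # 1000 messages of 100 bytes
one_packed = total_time(1, 100_000, LATENCY_S, BANDWIDTH)  # same payload, one message
print(f"many small: {many_small:.6f} s, packed: {one_packed:.6f} s")
```

With these assumed numbers the 1000 small messages take about 2.1 ms, almost all of it latency, while the single packed message takes about 0.1 ms.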

12. Interprocess communication • Synchronous vs. asynchronous • Synchronous communication blocks the code, as one task must wait for the other to finish and send its data • Asynchronous execution assumes that a task executes independently of the others (non-blocking communication)

13. Interprocess communication • Communication scope • Knowing how communication takes place is vital in parallel algorithm design • Point-to-point: task to task • Collective: task to many tasks

14. Interprocess communication • Efficiency of communication • An MPI implementation may give different results depending on the hardware architecture used • Asynchronous communication may improve the execution of the parallel algorithm • Type and properties of the network • Overhead and complexity

15. Synchronization • Barrier • Each task executes until it reaches a barrier • Synchronization takes place when all tasks reach the barrier • The Bulk Synchronous Parallel model • Semaphore • Used to serialize access to data • Only one task accesses the data at a time • When multiple tasks request access simultaneously, access is granted randomly or based on priorities • Java threads, MPI • Synchronous operations • Involve only the tasks engaged in active communication • Before sending data a task must receive the OK from the other • MPI
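The barrier behaviour described on this slide can be sketched with Python's standard `threading.Barrier`. This is an illustration of the idea only, not the MPI or BSP API. Each task writes a partial result, waits at the barrier, and only then reads what the others wrote:

```python
import threading


def run_supersteps(n_tasks: int = 4):
    """Two-phase computation separated by a barrier."""
    barrier = threading.Barrier(n_tasks)
    partial = [0] * n_tasks
    totals = [0] * n_tasks

    def task(rank: int) -> None:
        partial[rank] = rank + 1     # phase 1: local computation
        barrier.wait()               # nobody continues until all tasks arrive
        totals[rank] = sum(partial)  # phase 2: every partial value is now final

    threads = [threading.Thread(target=task, args=(r,)) for r in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


if __name__ == "__main__":
    print(run_supersteps(4))
```

Without the `barrier.wait()` call a fast task could sum the list before slower tasks had written their entries.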

16. Data dependencies • There is a dependency between two tasks if the order of their execution affects the outcome of the program • Dependencies are the main inhibitors of parallelism • Occur when a certain data address is accessed by multiple tasks • How do we solve them? • Synchronization • Exclusive access to the shared memory • Examples • Value of Y depends on X (which X?) • Execution of A(j) depends on A(j-1)
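The A(j) / A(j-1) case can be made concrete by running two loops in independent chunks, as separate workers would. In this sketch (all function names are illustrative) `scale` has no dependency between iterations, while `running_sum` carries a value from one iteration to the next:

```python
def scale(values):
    """Independent: out[j] depends only on values[j]."""
    return [2 * v for v in values]


def running_sum(values):
    """Dependent: out[j] depends on out[j-1]."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out


def apply_in_chunks(func, values, n_chunks):
    """Apply func to each chunk separately, with no communication between chunks."""
    if not values:
        return []
    size = -(-len(values) // n_chunks)  # ceiling division
    result = []
    for start in range(0, len(values), size):
        result.extend(func(values[start:start + size]))
    return result


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    print(apply_in_chunks(scale, data, 2) == scale(data))
    print(apply_in_chunks(running_sum, data, 2) == running_sum(data))
```

The chunked `scale` matches the sequential result, while the chunked `running_sum` does not, because the second chunk never sees the total accumulated by the first.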

17. Load balancing • Uniform distribution of workload per task • All tasks perform some work, none is idle • Optimize usage as well as number of tasks • How do we achieve it? • Partitioning the workload (data) • Matrix operations • Distribute data uniformly across machines • Loops • Distribute iterations uniformly across machines • For heterogeneous machines • Profile code to find load imbalances • Dynamic allocation of the workload • When a worker finishes one task it receives the next one
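The difference between a fixed distribution and dynamic allocation can be simulated without real threads. The sketch below compares the finishing time (makespan) of a round-robin assignment with one where each idle worker pulls the next task. The task durations are made up for illustration:

```python
import heapq


def static_makespan(durations, n_workers):
    """Round-robin assignment fixed in advance."""
    loads = [0.0] * n_workers
    for i, d in enumerate(durations):
        loads[i % n_workers] += d
    return max(loads)


def dynamic_makespan(durations, n_workers):
    """Each worker receives the next task as soon as it becomes idle."""
    free_at = [0.0] * n_workers  # time at which each worker is next free
    heapq.heapify(free_at)
    for d in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + d)
    return max(free_at)


if __name__ == "__main__":
    tasks = [8, 1, 8, 1, 8, 1, 8, 1]  # hypothetical, uneven task durations
    print(static_makespan(tasks, 2), dynamic_makespan(tasks, 2))
```

For this input the static schedule finishes at 32 time units because one worker receives every long task, while the dynamic schedule finishes at 18.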

18. Granularity • Computation/communication ratio • Periods of computation separated by periods of communication • Fine grained parallelism • Less computation between communication events • Low computation/communication ratio • Helps load balancing • High overhead with few opportunities for optimization • Communication/synchronization can take longer than computation • Coarse grained parallelism • Large computation/communication ratio • More opportunity to optimize • Hard to load balance • How do we choose? • Depends on algorithm and hardware • Usually better to have coarse grained parallelism • Communication/synchronization overhead is usually too high compared to computation cost • Fine grained parallelism can help in load balancing

19. I/O • Parallelism inhibitor • Requires lots of time • Parallel I/O systems are not widely available • HDFS (Hadoop Distributed File System) • Lustre (Linux servers) • IBM Spectrum Scale • ... • In a shared environment I/O can lead to files being overwritten • Read operations can be affected by the capability of the server to handle multiple requests • I/O over the network (NFS) can lead to congestion and file server outages

20. Debugging • Can be costly if code complexity is high • Many tools exist • TotalView • DDT • Inspector (Intel)

21. Performance analysis and optimization • Much more complex than for the sequential code • Valgrind (http://valgrind.org/) • Vampir (http://vampir.eu/) • Mpitrace (https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace)

22. Practical example • Compute PI using Monte Carlo • Idea • Use random numbers to cover the surface • (x, y): random between -1 and 1 • Circle radius = 1 • Total no. of points must be large enough
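The estimate behind this idea follows from an area ratio. A short derivation, assuming the points are uniformly distributed over the square $[-1, 1] \times [-1, 1]$ and writing $N_{\text{in}}$ for the number of points that land inside the unit circle (the symbols $N$ and $N_{\text{in}}$ are introduced here for illustration):

```latex
\[
P\bigl(x^2 + y^2 \le 1\bigr)
  = \frac{\text{area of circle}}{\text{area of square}}
  = \frac{\pi \cdot 1^2}{2 \cdot 2}
  = \frac{\pi}{4},
\qquad
\frac{N_{\text{in}}}{N} \approx \frac{\pi}{4}
\;\Longrightarrow\;
\pi \approx 4\,\frac{N_{\text{in}}}{N}.
\]
```

The approximation improves as $N$ grows, which is why the total number of points must be large.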

23. Practical example • Problem analysis • Any data dependencies? • Can we generate points in parallel? • Communication? • Do we need communication to generate random numbers? • Can we compute the percentage of points generated inside the circle on each processor without communication? • Load balancing? • Can we generate the same amount of points on each processor? • Strategy • Divide et impera (divide and conquer) • G = N/P points per processor (N = total no. of points, P = no. of processors) • On each processor, check if the generated point is inside the circle • SPMD model • Master-slave • A parent process (master) will gather results from child processes (slaves)

24. Pseudocode

        npoints = 10000
        circle_count = 0
        p = number of tasks
        num = npoints/p
        // find out if I am MASTER or WORKER
        do j = 1,num
          generate 2 random numbers between 0 and 1
          xcoordinate = random1
          ycoordinate = random2
          if (xcoordinate, ycoordinate) inside circle then
            circle_count = circle_count + 1
        end do
        if I am MASTER then
          receive from WORKERS their circle_counts
          compute PI (use MASTER and WORKER calculations)
        else if I am WORKER then
          send to MASTER circle_count
        endif
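The pseudocode can be turned into a small runnable sketch. The version below is one possible reading, not the lecture's implementation. It assumes a thread pool stands in for the workers, the calling thread plays the master that gathers the counts, and each worker gets its own seeded generator so that no random state is shared:

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_inside(num_points: int, seed: int) -> int:
    """Worker: count random points of the unit square that fall inside the circle."""
    rng = random.Random(seed)  # private generator, no state shared between workers
    inside = 0
    for _ in range(num_points):
        x = rng.random()  # random number in [0, 1)
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(npoints: int = 10000, tasks: int = 4) -> float:
    """Master: hand out npoints/tasks points per worker and combine the counts."""
    num = npoints // tasks
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        counts = list(pool.map(count_inside, [num] * tasks, range(tasks)))
    return 4.0 * sum(counts) / (num * tasks)


if __name__ == "__main__":
    print(estimate_pi(40000, 4))
```

Points drawn between 0 and 1 cover only a quarter of the circle, but the ratio of areas is still π/4, so the same factor of 4 applies.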

25. APIs for parallel shared and distributed memory algorithms • Shared memory • OpenMP • Distributed memory • Unified Parallel C • MPI • GPUs • CUDA • Distributed computing • MapReduce • Data flows: Storm, Spark • Graphs: Giraph, GraphX

26. OpenMP • Shared memory model • Requires minimal modifications of the sequential code • Specification is implemented in the compiler • g++ program.cpp -fopenmp -o program • fork-join model • When execution reaches an OpenMP parallel construct, a team of threads is created and they execute in parallel (fork) • The threads merge at the end of the parallel region (join)

28. PI in OpenMP • reduction(+:count): sums up the values in count • firstprivate(x, y, z, i): sets variables private to the thread • num_threads(numthreads): sets the number of threads • Seed initialization: the srand48() call • We cannot use the library functions as they are: THEY ARE NOT THREAD SAFE! • Replace drand48() with a custom random function for x and y • See https://www.bnl.gov/bnlhpc2013/files/pdf/OpenMPTutorial.pdf for a correct implementation!

        #include <omp.h>
        ...
        #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
        {
            // give random() a seed value
            srand48((int)time(NULL) ^ omp_get_thread_num());
            for (i = 0; i < niter; ++i) // main loop
            {
                x = (double)drand48(); // gets a random x coordinate
                y = (double)drand48(); // gets a random y coordinate
                z = ((x*x) + (y*y));   // checks to see if number is inside unit circle
                if (z <= 1)
                {
                    ++count; // if it is, consider it a valid random point
                }
            }
        }
        pi = ((double)count / (double)(niter * numthreads)) * 4.0;
        printf("Pi: %f\n", pi);

29. MPI • Message Passing Interface • API for running message-based parallel applications • Distributed memory • Many implementations • MPICH2 • Function names start with MPI_ • MPI_Init(); • MPI_Finalize(); • MPI_Comm_size(MPI_COMM_WORLD, &world_size); • MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); • MPI_Get_processor_name(processor_name, &name_len); • MPI_Send(&offset, 1, MPI_INT, dest, BEGIN, MPI_COMM_WORLD); • MPI_Recv(&offset, 1, MPI_INT, source, msgtype, MPI_COMM_WORLD, &status);

30. PI in MPI

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        printf("MPI task %d has started...\n", taskid);
        // set seed for random number generator equal to task ID
        srandom(taskid);
        avepi = 0;
        for (i = 0; i < ROUNDS; i++)
        {
            // call the function which generates random numbers and counts how many are in the circle
            // see the OpenMP code for details
            // this method is called in MASTER and all SLAVES
            count = dboard(DARTS);
            // sum the per-task counts on MASTER (count and pisum are doubles)
            rc = MPI_Reduce(&count, &pisum, 1, MPI_DOUBLE, MPI_SUM, MASTER, MPI_COMM_WORLD);
        }
        if (taskid == MASTER)
        {
            // each task throws DARTS points per round, so divide by DARTS * numtasks
            pi = (pisum / ((double)DARTS * numtasks)) * 4.0;
            printf("\nReal value of PI: 3.1415926535897 \n");
        }
        MPI_Finalize();

31. Unified Parallel C (UPC) • Extension of C for parallel computing • Distributed memory • Shared memory • SPMD model • Own compiler • upcc -o program program.upc • upcrun -n 4 program • Variables • shared int x • Own constructs • upc_forall • Constants • THREADS • MYTHREAD • Function names start with upc_ • Custom versions of existing standard C functions • upc_memcpy, ...

32. PI in UPC

        shared int count[THREADS];
        upc_forall (j = 0; j < THREADS; ++j; j) // main loop
        {
            for (i = 0; i < niter; i++)
            {
                x = (double)drand48(); // gets a random x coordinate
                y = (double)drand48(); // gets a random y coordinate
                z = ((x*x) + (y*y));   // checks to see if number is inside unit circle
                if (z <= 1)
                {
                    ++count[MYTHREAD]; // if it is, consider it a valid random point
                }
            }
        }
        upc_barrier; // ensure all is done
        if (MYTHREAD == 0)
        {
            for (j = 0; j < THREADS; ++j)
                countHit += count[j];
            pi = ((double)countHit / (double)(niter * THREADS)) * 4.0;
            printf("Pi: %f\n", pi);
        }

33. CUDA • NVIDIA • API and platform for GPUs • Direct access to accelerator instructions • Each core runs a kernel function (thread) • Threads are grouped in blocks • Can communicate through shared memory, synchronization primitives, and barriers • Parallel or sequential execution • Run on the same thread • ThreadID: • Is computed based on data • Blocks form a grid • Communication between blocks is not possible

34. CUDA architecture

35. PI in cuda

36. MapReduce • Terms borrowed from functional programming (e.g., Lisp) • Integrated with Hadoop • map: (map square '(1 2 3 4)) • Output: (1 4 9 16) [processes each element independently] • reduce: (reduce + '(1 4 9 16)) • (+ 16 (+ 9 (+ 4 1))) • Output: 30 [processes the whole dataset together] • Divide et impera • Divides data in chunks for parallel execution • 64 MB per block • Aggregates the result • Programmer only implements the map and reduce functions • Platform takes care of communication
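The two Lisp expressions on this slide have direct counterparts in Python's built-in `map` and `functools.reduce`. A minimal sketch of the same computation (these are the functional primitives only, not the Hadoop API):

```python
from functools import reduce
from operator import add

# map: apply the function to each element independently
squares = list(map(lambda v: v * v, [1, 2, 3, 4]))

# reduce: combine all elements into a single value
total = reduce(add, squares)

print(squares, total)
```

Because each call inside `map` touches one element only, the elements can be processed on different machines. Only the `reduce` step has to bring the results together.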

37. MapReduce model • Input → Map → Shuffle (MR system) → Reduce → Output • User-defined functions • Map(k, v) -> (k', v') • Reduce(k', v'[]) -> (k'', v'')

        void map(String key, String value)
        {
            // do work
            // emit (key, value) pairs to reducers
        }

        void reduce(String key, Iterator values)
        {
            // for each key, iterate through all values
            // aggregate results
            // emit final result
        }
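The division of labour in this model (user-written map and reduce, system-provided shuffle) can be mimicked in a few lines on a single machine. The sketch below is a toy driver, not Hadoop. The word-count job and all function names are hypothetical examples chosen for illustration:

```python
from collections import defaultdict


def map_fn(key, value):
    """User-defined map: emit (word, 1) for every word in the line."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """User-defined reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group by key (shuffle), then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:  # map phase
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))  # reduce phase


if __name__ == "__main__":
    lines = [("line1", "x y x"), ("line2", "y z")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

In a real deployment the grouping step is where the platform moves data between machines, which is the communication the programmer does not have to write.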

38. PI in MapReduce

        void map(LongWritable size, Context context)
        {
            int count = 0;
            for (long i = 0; i < size.get(); i++)
            {
                // generate random points in unit square
                final double x = …;
                final double y = …;
                if (x*x + y*y <= 1) // point inside circle
                {
                    ++count;
                }
            }
            // output map results
            context.write(new BooleanWritable(true), new LongWritable(count));
            context.write(new BooleanWritable(false), new LongWritable(size.get() - count));
        }

        void reduce(BooleanWritable isInside, Iterable<LongWritable> values, Context context)
        {
            if (isInside.get())
            {
                for (LongWritable val : values)
                {
                    numInside += val.get();
                }
            }
            else
            {
                for (LongWritable val : values)
                {
                    numOutside += val.get();
                }
            }
        }

        // reduce done. Store results in HDFS
        void cleanup(Context context) throws IOException
        {
            …
            writer.append(new LongWritable(numInside), new LongWritable(numOutside));
            writer.close();
        }

39. PI in MapReduce

        public static BigDecimal estimatePi(int numMaps, long numPoints, Path tmpDir, Configuration conf)
            throws IOException, ClassNotFoundException, InterruptedException
        {
            …
            Job job = new Job(conf);
            job.setInputFormatClass(SequenceFileInputFormat.class);
            job.setOutputKeyClass(BooleanWritable.class);
            job.setOutputValueClass(LongWritable.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setMapperClass(QmcMapper.class);
            job.setReducerClass(QmcReducer.class);
            job.setNumReduceTasks(1);
            // generate an input file for each map task
            …
            // start MapReduce job
            job.waitForCompletion(true);
            // read outputs
            …
            reader.next(numInside, numOutside);
            // evaluate Pi
            pi = ((double)numInside / (double)(numMaps * numPoints)) * 4.0;
        }