This presentation explores MapReduce as a language for parallel computing, addressing its role in tackling the challenges posed by multi-core CPUs and GPUs. It highlights benefits such as simple sequential programming and effective problem-solving in areas like indexing. It discusses the limitations of current MapReduce implementations, including divergent target architectures, synchronization difficulties, and memory-allocation challenges, and suggests compiler and runtime advances for portability and performance. The impact of MapReduce in Internet companies and in future computing paradigms is also emphasized.
MapReduce As A Language for Parallel Computing
Wenguang CHEN, Dehao CHEN
Tsinghua University
Future Architecture
• Many alternatives
  • A few powerful cores (Intel/AMD: 2, 3, 4, 6 …)
  • Many simple cores (nVidia, ATI, Larrabee: 32, 128, 196, 256 …)
  • Heterogeneous (CELL: 1 PPE + 8 SPEs; FPGA speedups …)
• But programming them is not easy
  • They all use different programming models; some are (relatively) easy, some are extremely difficult
    • OpenMP, MPI, MapReduce
    • CUDA, Brook
    • Verilog, SystemC
What Makes Parallel Computing So Difficult
• Parallelism identification and expression
  • Automatic parallelization has failed so far
• Complex synchronization may be required
  • Data races and deadlocks, which are difficult to debug (see the sketch below)
• Load balance …
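A minimal CUDA sketch (ours, not from the talk) of how easy a data race is to write: both kernels below look plausible, but only the atomic version counts correctly, while the racy one fails nondeterministically, which is exactly what makes such bugs hard to debug.

#include <cstdio>

__global__ void racy_count(int *counter) {
    // Unsynchronized read-modify-write: threads overwrite each other's
    // updates, so the final count is nondeterministically low.
    *counter = *counter + 1;
}

__global__ void safe_count(int *counter) {
    // atomicAdd serializes the update, giving the expected total.
    atomicAdd(counter, 1);
}

int main() {
    int *d, h;
    cudaMalloc(&d, sizeof(int));

    cudaMemset(d, 0, sizeof(int));
    racy_count<<<256, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy:   %d (expected 65536)\n", h);   // usually far less than 65536

    cudaMemset(d, 0, sizeof(int));
    safe_count<<<256, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d\n", h);                    // always 65536

    cudaFree(d);
    return 0;
}

Note that atomics of this kind were largely unavailable on the GPUs of the Mars era, which motivates the lock-free designs discussed later.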
Map-Reduce Is Promising
• Can only solve a subset of problems
  • But an important and fast-growing subset, such as indexing
• Easy to use
  • Programmers only need to write sequential code
  • The simplest practical parallel programming paradigm?
• The dominant programming paradigm in Internet companies
• Originally built for distributed systems, now ported to GPUs, CELL, and multicore
  • But many dialects, which hurts portability
Limitations on GPUs
• Rely on the CPU to allocate memory
  • How to support variable-length data?
    • Combine size and offset information with each key/value pair (see the sketch below)
  • How to allocate the output buffer on the GPU?
    • Two-pass scan: get the counts first, then do the real execution
• Lack of lock support
  • How to synchronize to avoid write conflicts?
    • Memory is pre-allocated, so every thread knows where it should write
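A sketch of the size-and-offset idea under assumed names (KeyValIndex is not Mars's actual struct): variable-length keys and values live in flat byte buffers, and each emitted pair is described by a fixed-size index record, so nothing needs per-pair allocation and every thread can write into its own pre-computed region without locks. Host code is shown for clarity; the same layout is what the GPU kernels write into.

#include <cstdio>
#include <cstring>

struct KeyValIndex {
    unsigned key_offset, key_size;   // where the key lives in key_buf
    unsigned val_offset, val_size;   // where the value lives in val_buf
};

int main() {
    char key_buf[64];
    int  val_buf[8];
    KeyValIndex idx[8];
    unsigned n = 0, key_off = 0;

    const char *words[] = { "map", "reduce", "map" };
    for (int i = 0; i < 3; ++i) {
        unsigned len = (unsigned)strlen(words[i]);
        memcpy(key_buf + key_off, words[i], len);   // append key bytes
        val_buf[n] = 1;                             // word count emits (word, 1)
        idx[n] = { key_off, len, n * (unsigned)sizeof(int), sizeof(int) };
        key_off += len;
        ++n;
    }
    for (unsigned i = 0; i < n; ++i)
        printf("pair %u: key=%.*s val=%d\n", i, (int)idx[i].key_size,
               key_buf + idx[i].key_offset,
               val_buf[idx[i].val_offset / sizeof(int)]);
    return 0;
}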
MapReduce on Multi-core CPUs (Phoenix [HPCA'07])
Input → Split → Map → Partition → Reduce → Merge → Output
MapReduce on GPUs (Mars [PACT'08])
Input → MapCount → Prefix sum → Allocate intermediate buffer on GPU → Map → Sort and Group → ReduceCount → Prefix sum → Allocate output buffer on GPU → Reduce → Output
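A runnable toy version (our sketch, with assumed kernel names and a host-side scan) of the MapCount / prefix-sum / Map sequence above: the first pass only counts each thread's output size, an exclusive prefix sum turns counts into disjoint write offsets, the output buffer is allocated once at exactly the right size, and the second pass then writes lock-free.

#include <cstdio>

__global__ void map_count(const int *in, int n, int *count) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) count[t] = in[t] % 2 ? 2 : 1;   // toy map: odd inputs emit twice
}

__global__ void map_emit(const int *in, int n, const int *offset, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int o = offset[t];                          // private region: no conflicts
    out[o] = in[t];
    if (in[t] % 2) out[o + 1] = in[t];
}

int main() {
    const int n = 8;
    int h_in[n] = {1, 2, 3, 4, 5, 6, 7, 8};
    int *d_in, *d_cnt, *d_off, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_cnt, n * sizeof(int));
    cudaMalloc(&d_off, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    map_count<<<1, n>>>(d_in, n, d_cnt);

    // Exclusive prefix sum on the host for brevity; Mars runs it on the GPU.
    int h_cnt[n], h_off[n], total = 0;
    cudaMemcpy(h_cnt, d_cnt, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) { h_off[i] = total; total += h_cnt[i]; }
    cudaMemcpy(d_off, h_off, n * sizeof(int), cudaMemcpyHostToDevice);

    cudaMalloc(&d_out, total * sizeof(int));    // sized exactly by the scan
    map_emit<<<1, n>>>(d_in, n, d_off, d_out);

    int h_out[32];
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < total; ++i) printf("%d ", h_out[i]);
    printf("\n");
    return 0;
}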
Program Example
• Word Count (Phoenix implementation)

…
for (i = 0; i < args->length; i++) {
    curr_ltr = toupper(data[i]);
    switch (state) {
    case IN_WORD:
        data[i] = curr_ltr;
        /* A word ends at the first non-letter (apostrophes allowed):
           terminate it and emit (word, 1) to the runtime. */
        if ((curr_ltr < 'A' || curr_ltr > 'Z') && curr_ltr != '\'') {
            data[i] = 0;
            emit_intermediate(curr_start, (void *)1, &data[i] - curr_start + 1);
            state = NOT_IN_WORD;
        }
        break;
…
Program Example
• Word Count (Mars implementation): note that the map logic must be written twice, once to count output sizes and once to actually emit

__device__ void GPU_MAP_FUNC // (void *key, void *val, int keySize, int valSize)
{ …
    do { …
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_FUNC(word, &wordSize, wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
        }
    } while (*line != '\n'); …
}

__device__ void GPU_MAP_COUNT_FUNC // (void *key, void *val, int keySize, int valSize)
{ …
    do { …
        if (*line != ' ') line++;
        else {
            line++;
            GPU_EMIT_INTER_COUNT_FUNC(wordSize - 1, sizeof(int));
            while (*line == ' ') line++;
            wordSize = 0;
        }
    } while (*line != '\n'); …
}
Pros and Cons
• Load balance
  • Phoenix: static + dynamic
  • Mars: static; assigns the same amount of map/reduce work to each thread
• Pre-allocation
  • Lock-free
  • But requires the two-pass scan, which is not an efficient solution
• Sorting: the bottleneck of Mars (see the sketch below)
  • Phoenix uses insertion sort incrementally while emitting
  • Mars uses bitonic sort, which costs O(n log² n)
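A textbook bitonic sort in CUDA (not Mars's actual kernel) showing where the O(n log² n) comes from: log n phases of up to log n compare-exchange steps, each step a full pass over the n keys.

#include <cstdio>

__global__ void bitonic_step(int *a, int j, int k) {
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned ixj = i ^ j;                       // this thread's partner element
    if (ixj > i) {
        bool ascending = (i & k) == 0;          // direction of this subsequence
        if (ascending == (a[i] > a[ixj])) {     // compare-exchange
            int t = a[i]; a[i] = a[ixj]; a[ixj] = t;
        }
    }
}

int main() {
    const int n = 16;                           // must be a power of two
    int h[n] = {9, 4, 15, 0, 7, 12, 3, 8, 1, 14, 6, 11, 2, 13, 5, 10};
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    for (int k = 2; k <= n; k <<= 1)            // log n phases ...
        for (int j = k >> 1; j > 0; j >>= 1)    // ... of up to log n steps each
            bitonic_step<<<1, n>>>(d, j, k);    // launch boundary = global sync

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h[i]);
    printf("\n");
    cudaFree(d);
    return 0;
}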
Map-Reduce as a Language, Not a Library
• Can we have a portable Map-Reduce that runs efficiently across different architectures?
• Promising
  • Map-Reduce already specifies the parallelism well
  • No complex synchronization in user code
• But still difficult
  • Different architectures provide different features
  • Leading to either portability or performance problems
• Use the compiler and runtime to hide architectural differences, as we do when supporting high-level languages such as C
Compiler, Library & Runtime
• Just as C source is compiled, with per-target libraries and runtimes, to x86, Power, SPARC, …
• … a single Map-Reduce program could be mapped onto:
  • Clusters (cluster library & runtime)
  • General multicore (multicore library & runtime)
  • GPUs (GPU library & runtime)
Case Study on nVidia GPUs
• Portability
  • Host function support: annotate libc functions and inline them
  • Dynamic memory allocation
    • A big problem; forbid it in user code?
• Performance
  • Memory-hierarchy optimization (identifying global, shared, and read-only memory)
  • A typed language is preferable (e.g., int4 type acceleration; see the sketch below)
  • Dynamic memory allocation (again!)
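An illustrative kernel (our sketch) of the int4 point: declaring the input through the built-in int4 vector type lets each thread load 16 bytes in a single transaction instead of four 4-byte ones, the kind of optimization a compiler for a typed Map-Reduce language could apply automatically.

#include <cstdio>

__global__ void sum4(const int4 *in, int n4, int *out) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n4) {
        int4 v = in[t];                 // one vectorized 16-byte load
        out[t] = v.x + v.y + v.z + v.w;
    }
}

int main() {
    const int n = 16, n4 = n / 4;       // n must be a multiple of 4
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;
    int4 *d_in; int *d_out, h_out[n4];
    cudaMalloc(&d_in,  n  * sizeof(int));
    cudaMalloc(&d_out, n4 * sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);
    sum4<<<1, n4>>>(d_in, n4, d_out);
    cudaMemcpy(h_out, d_out, n4 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n4; ++i) printf("%d ", h_out[i]);   // 6 22 38 54
    printf("\n");
    return 0;
}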
More to explore • …