The Harp library introduces data and communication abstractions to improve the expressiveness and performance of big data processing. Unlike traditional MapReduce, Harp provides hierarchical data abstractions for arrays, key-values, and graphs, along with a collective communication model that supports various communication operations on those abstractions. It manages memory allocation through caching and provides fault tolerance via checkpointing. Harp runs as a plugin within the Hadoop ecosystem and follows a BSP-style parallelism model.
Motivation
• Contemporary big data tools such as MapReduce and graph processing tools have fixed data abstractions and support only a limited set of communication operations.
• MPI offers abundant, highly optimized collective communication operations but provides limited data abstractions.
• To improve expressiveness and performance in big data processing, we introduce the Harp library, which provides data abstractions with related communication abstractions and transforms the map-reduce programming model into a map-collective model.
Features
• Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
• Hierarchical data abstraction over arrays, key-values, and graphs for expressive programming
• Collective communication model supporting various communication operations on the data abstractions
• Caching with buffer management for the memory allocation required by computation and communication
• BSP-style parallelism
• Fault tolerance with checkpointing
Architecture
• Application: MapReduce Applications, Map-Collective Applications
• Framework: MapReduce V2, Harp
• Resource Manager: YARN
Parallelism Model
• MapReduce Model: map tasks (M) → Shuffle → reduce tasks (R)
• Map-Collective Model: map tasks (M) exchange data through Collective Communication
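The difference between the two models can be illustrated with a small simulation. The sketch below is not Harp's API: it uses plain Java threads as stand-ins for map tasks and a `CyclicBarrier` as the synchronization point, and the class and method names (`MapCollectiveSketch`, `globalMeanPerWorker`) are hypothetical. Each task computes a local partial result, then an allreduce-style step leaves every task holding the same global result, with no separate reduce stage.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative simulation only: threads stand in for map tasks; this is not Harp's API.
class MapCollectiveSketch {

    // Each simulated map task reduces its own split to {sum, count}; an
    // allreduce-style step then gives every task the same global {sum, count}.
    // Assumes at least one split and at least one value overall.
    static double[] globalMeanPerWorker(double[][] splits) {
        int numWorkers = splits.length;
        double[][] locals = new double[numWorkers][];   // per-task partial results
        double[] global = new double[2];                // combined {sum, count}
        double[] means = new double[numWorkers];        // result held by each task

        // The barrier action runs once, after all tasks have arrived.
        CyclicBarrier barrier = new CyclicBarrier(numWorkers, () -> {
            for (double[] local : locals) {
                global[0] += local[0];
                global[1] += local[1];
            }
        });

        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                double sum = 0;
                for (double x : splits[id]) sum += x;       // local "map" computation
                locals[id] = new double[] {sum, splits[id].length};
                try {
                    barrier.await();                        // BSP synchronization point
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                means[id] = global[0] / global[1];          // every task sees the global result
            });
            workers[w].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return means;
    }

    public static void main(String[] args) {
        double[][] splits = { {1, 2}, {3}, {4, 5, 6} };
        for (double m : globalMeanPerWorker(splits)) System.out.println(m);
    }
}
```

In an iterative algorithm, the tasks would loop over compute and synchronize steps like this one, which is the BSP pattern; a MapReduce job would instead write intermediate data out through the shuffle to separate reduce tasks on every iteration.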
Hierarchical Data Abstraction and Collective Communication
• Table: Array Table <Array Type>, Edge Table, Message Table, Vertex Table, Key-Value Table. Operations: Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex
• Partition: Array Partition <Array Type>, Edge Partition, Message Partition, Vertex Partition, Key-Value Partition. Operations: Broadcast, Send
• Basic Types: Byte Array, Int Array, Long Array, Double Array, Array, Struct Object; Vertices, Edges, Messages, Key-Values; Commutable. Operations: Broadcast, Send, Gather
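The table, partition, and basic-type layering can be sketched in a few lines. The code below is a minimal illustration, not Harp's actual classes: `ArrayPartition`, `ArrayTable`, and `combine` are hypothetical names, and the rule that partitions sharing an ID are summed element-wise is an assumption made here to show what a combine step over tables could look like.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: hypothetical classes, not Harp's API.
class TableSketch {

    // Partition level: an ID plus a basic-type payload (here a double array).
    static final class ArrayPartition {
        final int id;
        final double[] data;

        ArrayPartition(int id, double[] data) {
            this.id = id;
            this.data = data;
        }
    }

    // Table level: a collection of partitions keyed by partition ID.
    static final class ArrayTable {
        private final Map<Integer, ArrayPartition> partitions = new TreeMap<>();

        // Assumed combine rule: a partition whose ID already exists is summed
        // element-wise into the stored one; otherwise a copy is stored.
        void addPartition(ArrayPartition p) {
            ArrayPartition existing = partitions.get(p.id);
            if (existing == null) {
                partitions.put(p.id, new ArrayPartition(p.id, p.data.clone()));
                return;
            }
            if (existing.data.length != p.data.length) {
                throw new IllegalArgumentException("partition " + p.id + " length mismatch");
            }
            for (int i = 0; i < p.data.length; i++) existing.data[i] += p.data[i];
        }

        double[] get(int id) {
            ArrayPartition p = partitions.get(id);
            return p == null ? null : p.data;
        }

        int size() {
            return partitions.size();
        }
    }

    // Merges the tables held by several workers into one combined table.
    static ArrayTable combine(ArrayTable... workerTables) {
        ArrayTable merged = new ArrayTable();
        for (ArrayTable t : workerTables) {
            for (ArrayPartition p : t.partitions.values()) merged.addPartition(p);
        }
        return merged;
    }

    public static void main(String[] args) {
        ArrayTable a = new ArrayTable();
        a.addPartition(new ArrayPartition(0, new double[] {1, 2}));
        a.addPartition(new ArrayPartition(1, new double[] {3, 4}));
        ArrayTable b = new ArrayTable();
        b.addPartition(new ArrayPartition(1, new double[] {10, 20}));
        ArrayTable merged = combine(a, b);
        System.out.println(merged.size() + " partitions, partition 1 starts with " + merged.get(1)[0]);
    }
}
```

Keying partitions by ID is what lets a table-level operation decide which pieces from different workers belong together, while the payload stays a plain array that partition-level and basic-type operations can send as is.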