Apache Giraph on Yarn Chuan Lei and Mohammad Islam
Fast Scalable Graph Processing • What is Apache Giraph • Why do I need it • Giraph + MapReduce • Giraph + Yarn
What is Apache Giraph • Giraph is a framework for performing offline batch processing of semi-structured graph data at massive scale
What is Apache Giraph • Giraph performs iterative calculation on top of an existing Hadoop cluster
What is Apache Giraph • Giraph uses Apache ZooKeeper to enforce atomic barrier waits and perform leader election
Why do I need it? • Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model • In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation
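To make the vertex-centric view concrete, below is a minimal plain-Java sketch of the contract a BSP computation is written against. It is illustrative only; the real Giraph API uses different class names, generics bounded by Hadoop Writable types, and additional methods.

```java
import java.util.List;

// Illustrative BSP contract (not the actual Giraph API): the whole algorithm
// is expressed as what ONE vertex does in ONE superstep.
interface VertexComputation<V, M> {
    // Invoked once per active vertex per superstep, with the messages
    // that were sent to this vertex during the previous superstep.
    void compute(Vertex<V, M> vertex, List<M> incomingMessages);
}

interface Vertex<V, M> {
    V getValue();
    void setValue(V value);
    void sendMessageToAllNeighbors(M message); // delivered in the NEXT superstep
    void voteToHalt();                         // go inactive until a new message arrives
}
```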
Why do I need it? • Giraph makes iterative data processing more practical for Hadoop users • Giraph can avoid costly disk and network operations that are mandatory in MR • No concept of message passing in MR
Why do I need it? • Each cycle of an iterative calculation on Hadoop means running a full MapReduce job
PageRank example • PageRank measures the relative importance of a document within a set of documents • 1. All vertices start with the same PageRank (1.0)
PageRank example • 2. Each vertex distributes an equal portion of its PageRank to all of its neighbors (e.g., a vertex with PageRank 1.0 and two out-edges sends 0.5 along each)
PageRank example • 3. Each vertex sums the incoming values, multiplies by a weight factor, and adds in a small adjustment: 1/(# vertices in graph) (e.g., 1.5·1 + 1/3, 1·1 + 1/3, 0.5·1 + 1/3)
PageRank example • 4. This value becomes the vertex's PageRank for the next iteration (1.33, 1.83, 0.83)
PageRank example • 5. Repeat until convergence (change in PR per iteration < epsilon)
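Putting the five steps together, here is a small, self-contained plain-Java simulation of the iteration over a hypothetical three-vertex graph (it is not a Giraph job). One assumption to note: it uses the conventional damping factor 0.85 as the weight factor and (1 - 0.85)/N as the adjustment so the loop provably converges; the worked numbers on the slides use a weight of 1 purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the PageRank iteration described above (illustrative only).
public class PageRankSketch {
    public static void main(String[] args) {
        // Hypothetical tiny directed graph: vertex id -> out-neighbors.
        Map<Integer, List<Integer>> outEdges = Map.of(
                0, List.of(1),
                1, List.of(0, 2),
                2, List.of(0));
        int n = outEdges.size();
        double damping = 0.85;   // assumed weight factor (see lead-in)
        double epsilon = 1e-6;

        // Step 1: all vertices start with the same PageRank.
        Map<Integer, Double> pr = new HashMap<>();
        outEdges.keySet().forEach(v -> pr.put(v, 1.0));

        double maxChange = Double.MAX_VALUE;
        while (maxChange > epsilon) {            // Step 5: repeat until convergence
            // Step 2: each vertex sends an equal share of its PR to every neighbor.
            Map<Integer, Double> incoming = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : outEdges.entrySet()) {
                double share = pr.get(e.getKey()) / e.getValue().size();
                for (int target : e.getValue()) {
                    incoming.merge(target, share, Double::sum);
                }
            }
            // Steps 3-4: sum incoming shares, apply the weight factor, add the adjustment.
            maxChange = 0.0;
            for (int v : outEdges.keySet()) {
                double next = (1.0 - damping) / n
                        + damping * incoming.getOrDefault(v, 0.0);
                maxChange = Math.max(maxChange, Math.abs(next - pr.get(v)));
                pr.put(v, next);
            }
        }
        System.out.println(pr);
    }
}
```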
PageRank on MapReduce • 1. Load the complete input graph from disk as [K = vertex ID, V = out-edges and PR]
PageRank on MapReduce • 2. Map over all input records (the full graph state), emitting [K = edge target, V = share of PR]
PageRank on MapReduce • 3. Sort and Shuffle this entire mess.
PageRank on MapReduce • 4. Sum the incoming PR shares for each vertex and update the PR values in the graph state records
PageRank on MapReduce • 5. Emit the full graph state to disk…
PageRank on MapReduce • 6. … and START OVER!
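For comparison with the Giraph version, here is a compressed sketch of what one such iteration might look like as a Hadoop MapReduce job. The class names and the input line format ("vertexId <tab> pageRank <tab> comma-separated neighbors") are hypothetical, and for brevity the sketch only shuffles PR shares; a real job would also have to re-emit each vertex's adjacency list so the next iteration can rebuild the graph state.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One PageRank iteration as a single MapReduce pass (illustrative sketch).
public class PageRankIteration {

  public static class ShareMapper
      extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      double pageRank = Double.parseDouble(fields[1]);
      String[] neighbors = fields[2].split(",");
      double share = pageRank / neighbors.length;            // step 2: equal share per out-edge
      for (String neighbor : neighbors) {
        context.write(new LongWritable(Long.parseLong(neighbor)),
                      new DoubleWritable(share));
      }
    }
  }

  public static class SumReducer
      extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable vertex, Iterable<DoubleWritable> shares, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;                                      // step 4: sum incoming shares
      for (DoubleWritable share : shares) {
        sum += share.get();
      }
      context.write(vertex, new DoubleWritable(sum));        // step 5: back to disk, then start over
    }
  }
}
```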
PageRank on MapReduce • Awkward to reason about • I/O bound despite simple core business logic
PageRank on Giraph • 1. Hadoop Mappers are “hijacked” to host Giraph master and worker tasks
PageRank on Giraph • 2. Input graph is loaded once, maintaining code-data locality when possible
PageRank on Giraph • 3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based
PageRank on Giraph • 4. Output is written from the Mappers hosting the calculation, and the job run ends
PageRank on Giraph • This is all well and good, but must we manipulate Hadoop this way? • Heap and other resources are set once, globally, for all Mappers in the computation • No control over which cluster nodes host which tasks • No control over how Mappers are scheduled • The Mapper and Reducer slot abstraction is meaningless for Giraph
Overview of Yarn • YARN (Yet Another Resource Negotiator) is Hadoop’s next-generation resource management platform • A general-purpose framework that is not tied to the MapReduce paradigm • Offers fine-grained control over each task’s resource allocation
Giraph on Yarn • It’s a natural fit!
Giraph on Yarn • Client • Resource Manager • Application Master • Node Managers hosting the Giraph master and worker tasks • ZooKeeper (architecture diagram)
Giraph Architecture • Master / Workers • ZooKeeper (diagram: a single master coordinating many workers)
Metrics • Performance • Processing time • Scalability • Graph size (number of vertices and number of edges)
Optimization Factors (JVM-, Giraph-, and application-level knobs) • Memory Size • Number of Workers • Combiner • Out-of-Core • GC control • Parallel GC • Concurrent GC • Young Generation • Object Reuse
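For reference, the GC knobs above correspond to standard HotSpot JVM options. The snippet below is a hedged illustration of the kind of options that would be passed to the worker JVMs (for example through Hadoop's per-task JVM settings such as mapred.child.java.opts); the values are examples, not the settings used in these experiments.

```
# Illustrative HotSpot options (example values only):
-Xmx20g                   # max heap per worker JVM
-Xmn4g                    # young-generation size
-XX:+UseParallelGC        # parallel (throughput) collector
-XX:+UseConcMarkSweepGC   # concurrent mark-sweep collector (choose one collector, not both)
```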
Experimental Settings • Cluster: 43 nodes, ~800 GB of memory • Hadoop-2.0.3-alpha (non-secure) • Giraph-1.0.0-release • Data: LinkedIn social network graph • Approx. 205 million vertices • Approx. 11 billion edges • Application: PageRank algorithm
Baseline Result • 10 vs. 20 GB per worker, max memory 800 GB • Processing time: 10 GB per worker gives better performance • Scalability: 20 GB per worker gives higher scalability (charts: runs with 5 to 40 workers at 400 GB and 800 GB total)
Heap Dump w/o Concurrent GC • Heap dumps taken at iteration 3 and iteration 27 (sizes in GB) • A large portion of the unreachable objects are messages created at each superstep
Concurrent GC (20 GB per worker) • Significantly improves scalability, by 3x • Suffers a 16% performance degradation
Using Combiner (20 GB per worker) • Scales up 2x without any other optimizations • Speeds up performance by 50%
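What the combiner buys is easy to see in a sketch: partial PR shares addressed to the same target vertex are summed into one message before being buffered or sent, cutting both memory footprint and network traffic. The class below is a plain-Java illustration of that idea, not the actual Giraph combiner API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative PageRank message combiner (not the Giraph API): instead of
// keeping one message per incoming edge, shares for the same target vertex
// are folded into a single running sum.
public class PageRankShareCombiner {
    private final Map<Long, Double> combined = new HashMap<>();

    // Called for every outgoing PR share before it is buffered/sent.
    public void combine(long targetVertexId, double prShare) {
        combined.merge(targetVertexId, prShare, Double::sum);
    }

    // One combined message per target vertex is all that remains.
    public Map<Long, Double> combinedMessages() {
        return combined;
    }
}
```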
Memory Distribution • More workers achieve better performance • Larger memory size per worker provides higher scalability
Application – Object Reuse (20 GB per worker) • Improves scalability 5x • Improves performance 4x • Requires skill from application developers • 650 GB, 29 mins
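Object reuse here refers to the common Hadoop/Giraph pattern of resetting a single Writable instance instead of allocating a fresh object per value or message, which is why it needs care from the developer: it is only safe if the receiver copies or serializes the value before the next call. A minimal sketch (the MessageSink interface is hypothetical, standing in for Giraph's message-sending machinery):

```java
import org.apache.hadoop.io.DoubleWritable;

// Illustrative object-reuse pattern for per-message values.
public class ObjectReuseSketch {

    // Hypothetical sink standing in for the framework's message machinery.
    interface MessageSink {
        void send(DoubleWritable value);
    }

    // Without reuse: millions of short-lived objects per superstep -> GC pressure.
    static void sendSharesAllocating(double[] shares, MessageSink sink) {
        for (double share : shares) {
            sink.send(new DoubleWritable(share)); // new object per message
        }
    }

    // With reuse: one instance is reset and handed over each time.
    // Safe only if the sink serializes/copies the value before the next call.
    static void sendSharesReusing(double[] shares, MessageSink sink) {
        DoubleWritable reusable = new DoubleWritable();
        for (double share : shares) {
            reusable.set(share);
            sink.send(reusable);
        }
    }
}
```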
Problems of Giraph on Yarn • Many knobs to tune to make Giraph applications run efficiently • Highly dependent on skilled application developers • Performance penalties when scaling up
Future Direction • C++ provides direct control over memory management • No need to rewrite all of Giraph • Only the master and worker in C++
Conclusion • LinkedIn is the first adopter of Giraph on Yarn • Improvements and bug fixes • Provided patches to Apache Giraph • Ran the full LinkedIn graph on a 40-node cluster with 650 GB of memory • Evaluated various performance and scalability options