Apache Giraph on Yarn Chuan Lei and Mohammad Islam
Fast Scalable Graph Processing • What is Apache Giraph • Why do I need it • Giraph + MapReduce • Giraph + Yarn
What is Apache Giraph • Giraph is a framework for performing offline batch processing of semi-structured graph data at massive scale
What is Apache Giraph • Giraph performs iterative calculation on top of an existing Hadoop cluster
What is Apache Giraph • Giraph uses Apache ZooKeeper to enforce atomic barrier waits and perform leader election
Why do I need it? • Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model • In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation
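To make the vertex-centric view concrete, below is a minimal plain-Java sketch of the contract a BSP computation is written against. It is illustrative only; the real Giraph API uses different class names, generics bounded by Hadoop Writable types, and additional methods.

```java
import java.util.List;

// Illustrative BSP contract (not the actual Giraph API): the whole algorithm
// is expressed as what ONE vertex does in ONE superstep.
interface VertexComputation<V, M> {
    // Invoked once per active vertex per superstep, with the messages
    // that were sent to this vertex during the previous superstep.
    void compute(Vertex<V, M> vertex, List<M> incomingMessages);
}

interface Vertex<V, M> {
    V getValue();
    void setValue(V value);
    void sendMessageToAllNeighbors(M message); // delivered in the NEXT superstep
    void voteToHalt();                         // go inactive until a new message arrives
}
```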
Why do I need it? • Giraph makes iterative data processing more practical for Hadoop users • Giraph can avoid costly disk and network operations that are mandatory in MR • No concept of message passing in MR
Why do I need it? • Each cycle of an iterative calculation on Hadoop means running a full MapReduce job
PageRank example • PageRank measures the relative importance of a document within a set of documents • 1. All vertices start with the same PageRank (1.0)
PageRank example • 2. Each vertex distributes an equal portion of its PageRank to all of its neighbors (e.g., a vertex with PageRank 1.0 and two out-edges sends 0.5 along each)
PageRank example • 3. Each vertex sums the incoming values, multiplies by a weight factor, and adds in a small adjustment: 1/(# vertices in graph) (e.g., 1.5·1 + 1/3, 1·1 + 1/3, 0.5·1 + 1/3)
PageRank example • 4. This value becomes the vertex's PageRank for the next iteration (1.33, 1.83, 0.83)
PageRank example • 5. Repeat until convergence (change in PR per iteration < epsilon)
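Putting the five steps together, here is a small, self-contained plain-Java simulation of the iteration over a hypothetical three-vertex graph (it is not a Giraph job). One assumption to note: it uses the conventional damping factor 0.85 as the weight factor and (1 - 0.85)/N as the adjustment so the loop provably converges; the worked numbers on the slides use a weight of 1 purely for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the PageRank iteration described above (illustrative only).
public class PageRankSketch {
    public static void main(String[] args) {
        // Hypothetical tiny directed graph: vertex id -> out-neighbors.
        Map<Integer, List<Integer>> outEdges = Map.of(
                0, List.of(1),
                1, List.of(0, 2),
                2, List.of(0));
        int n = outEdges.size();
        double damping = 0.85;   // assumed weight factor (see lead-in)
        double epsilon = 1e-6;

        // Step 1: all vertices start with the same PageRank.
        Map<Integer, Double> pr = new HashMap<>();
        outEdges.keySet().forEach(v -> pr.put(v, 1.0));

        double maxChange = Double.MAX_VALUE;
        while (maxChange > epsilon) {            // Step 5: repeat until convergence
            // Step 2: each vertex sends an equal share of its PR to every neighbor.
            Map<Integer, Double> incoming = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : outEdges.entrySet()) {
                double share = pr.get(e.getKey()) / e.getValue().size();
                for (int target : e.getValue()) {
                    incoming.merge(target, share, Double::sum);
                }
            }
            // Steps 3-4: sum incoming shares, apply the weight factor, add the adjustment.
            maxChange = 0.0;
            for (int v : outEdges.keySet()) {
                double next = (1.0 - damping) / n
                        + damping * incoming.getOrDefault(v, 0.0);
                maxChange = Math.max(maxChange, Math.abs(next - pr.get(v)));
                pr.put(v, next);
            }
        }
        System.out.println(pr);
    }
}
```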
PageRank on MapReduce • 1. Load the complete input graph from disk as [K = vertex ID, V = out-edges and PR]
PageRank on MapReduce • 2. Map over all input records (the full graph state), emitting [K = edge target, V = share of PR]
PageRank on MapReduce • 3. Sort and Shuffle this entire mess.
PageRank on MapReduce • 4. Sum the incoming PR shares for each vertex and update the PR values in the graph state records
PageRank on MapReduce • 5. Emit the full graph state to disk…
PageRank on MapReduce • 6. … and START OVER!
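For comparison with the Giraph version, here is a compressed sketch of what one such iteration might look like as a Hadoop MapReduce job. The class names and the input line format ("vertexId <tab> pageRank <tab> comma-separated neighbors") are hypothetical, and for brevity the sketch only shuffles PR shares; a real job would also have to re-emit each vertex's adjacency list so the next iteration can rebuild the graph state.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One PageRank iteration as a single MapReduce pass (illustrative sketch).
public class PageRankIteration {

  public static class ShareMapper
      extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      double pageRank = Double.parseDouble(fields[1]);
      String[] neighbors = fields[2].split(",");
      double share = pageRank / neighbors.length;            // step 2: equal share per out-edge
      for (String neighbor : neighbors) {
        context.write(new LongWritable(Long.parseLong(neighbor)),
                      new DoubleWritable(share));
      }
    }
  }

  public static class SumReducer
      extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    @Override
    protected void reduce(LongWritable vertex, Iterable<DoubleWritable> shares, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;                                      // step 4: sum incoming shares
      for (DoubleWritable share : shares) {
        sum += share.get();
      }
      context.write(vertex, new DoubleWritable(sum));        // step 5: back to disk, then start over
    }
  }
}
```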
PageRank on MapReduce • Awkward to reason about • I/O bound despite simple core business logic
PageRank on Giraph • 1. Hadoop Mappers are “hijacked” to host Giraph master and worker tasks
PageRank on Giraph • 2. Input graph is loaded once, maintaining code-data locality when possible
PageRank on Giraph • 3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based
PageRank on Giraph • 4. Output is written from the Mappers hosting the calculation, and the job run ends
PageRank on Giraph • This is all well and good, but must we manipulate Hadoop this way? • Heap and other resources are set once, globally, for all Mappers in the computation • No control over which cluster nodes host which tasks • No control over how Mappers are scheduled • The Mapper and Reducer slot abstraction is meaningless for Giraph
Overview of Yarn • YARN (Yet Another Resource Negotiator) is Hadoop’s next-generation resource management platform • A general-purpose framework that is not tied to the MapReduce paradigm • Offers fine-grained control over each task’s resource allocation
Giraph on Yarn • It’s a natural fit!
Giraph on Yarn • Client • Resource Manager • Application Master • Node Managers hosting the Giraph master and worker tasks • ZooKeeper (architecture diagram)
Giraph Architecture • Master / Workers • ZooKeeper (diagram: a single master coordinating many workers)
Metrics • Performance • Processing time • Scalability • Graph size (number of vertices and number of edges)
Optimization Factors (JVM-, Giraph-, and application-level knobs) • Memory Size • Number of Workers • Combiner • Out-of-Core • GC control • Parallel GC • Concurrent GC • Young Generation • Object Reuse
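For reference, the GC knobs above correspond to standard HotSpot JVM options. The snippet below is a hedged illustration of the kind of options that would be passed to the worker JVMs (for example through Hadoop's per-task JVM settings such as mapred.child.java.opts); the values are examples, not the settings used in these experiments.

```
# Illustrative HotSpot options (example values only):
-Xmx20g                   # max heap per worker JVM
-Xmn4g                    # young-generation size
-XX:+UseParallelGC        # parallel (throughput) collector
-XX:+UseConcMarkSweepGC   # concurrent mark-sweep collector (choose one collector, not both)
```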
Experimental Settings • Cluster: 43 nodes, ~800 GB of memory • Hadoop-2.0.3-alpha (non-secure) • Giraph-1.0.0-release • Data: LinkedIn social network graph • Approx. 205 million vertices • Approx. 11 billion edges • Application: PageRank algorithm
Baseline Result • 10 vs. 20 GB per worker, max memory 800 GB • Processing time: 10 GB per worker gives better performance • Scalability: 20 GB per worker gives higher scalability (charts: runs with 5 to 40 workers at 400 GB and 800 GB total)
Heap Dump w/o Concurrent GC • Heap dumps taken at iteration 3 and iteration 27 (sizes in GB) • A large portion of the unreachable objects are messages created at each superstep
Concurrent GC (20 GB per worker) • Significantly improves scalability, by 3x • Suffers a 16% performance degradation
Using Combiner (20 GB per worker) • Scales up 2x without any other optimizations • Speeds up performance by 50%
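What the combiner buys is easy to see in a sketch: partial PR shares addressed to the same target vertex are summed into one message before being buffered or sent, cutting both memory footprint and network traffic. The class below is a plain-Java illustration of that idea, not the actual Giraph combiner API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative PageRank message combiner (not the Giraph API): instead of
// keeping one message per incoming edge, shares for the same target vertex
// are folded into a single running sum.
public class PageRankShareCombiner {
    private final Map<Long, Double> combined = new HashMap<>();

    // Called for every outgoing PR share before it is buffered/sent.
    public void combine(long targetVertexId, double prShare) {
        combined.merge(targetVertexId, prShare, Double::sum);
    }

    // One combined message per target vertex is all that remains.
    public Map<Long, Double> combinedMessages() {
        return combined;
    }
}
```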
Memory Distribution • More workers achieve better performance • Larger memory size per worker provides higher scalability
Application – Object Reuse (20 GB per worker) • Improves scalability 5x • Improves performance 4x • Requires skill from application developers • 650 GB, 29 mins
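Object reuse here refers to the common Hadoop/Giraph pattern of resetting a single Writable instance instead of allocating a fresh object per value or message, which is why it needs care from the developer: it is only safe if the receiver copies or serializes the value before the next call. A minimal sketch (the MessageSink interface is hypothetical, standing in for Giraph's message-sending machinery):

```java
import org.apache.hadoop.io.DoubleWritable;

// Illustrative object-reuse pattern for per-message values.
public class ObjectReuseSketch {

    // Hypothetical sink standing in for the framework's message machinery.
    interface MessageSink {
        void send(DoubleWritable value);
    }

    // Without reuse: millions of short-lived objects per superstep -> GC pressure.
    static void sendSharesAllocating(double[] shares, MessageSink sink) {
        for (double share : shares) {
            sink.send(new DoubleWritable(share)); // new object per message
        }
    }

    // With reuse: one instance is reset and handed over each time.
    // Safe only if the sink serializes/copies the value before the next call.
    static void sendSharesReusing(double[] shares, MessageSink sink) {
        DoubleWritable reusable = new DoubleWritable();
        for (double share : shares) {
            reusable.set(share);
            sink.send(reusable);
        }
    }
}
```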
Problems of Giraph on Yarn • Many knobs to tune to make Giraph applications run efficiently • Highly dependent on skilled application developers • Performance penalties when scaling up
Future Direction • C++ provides direct control over memory management • No need to rewrite all of Giraph • Only the master and worker in C++
Conclusion • LinkedIn is the first adopter of Giraph on Yarn • Improvements and bug fixes • Provided patches to Apache Giraph • Ran the full LinkedIn graph on a 40-node cluster with 650 GB of memory • Evaluated various performance and scalability options