
Phoenix Rebirth: Scalable MapReduce on a NUMA System

Presentation Transcript


  1. Phoenix Rebirth: Scalable MapReduce on a NUMA System
     Richard Yoo, Anthony Romano

  2. MapReduce and Phoenix
     • MapReduce
       • A functional-style parallel programming framework and runtime for large clusters
       • Users only provide map / reduce functions (see the sketch below)
       • Map: processes input data and generates a set of intermediate key / value pairs
       • Reduce: merges the intermediate pairs that share the same key
       • Runtime: automatically parallelizes computation and manages data distribution / result collection
     • Phoenix
       • Shared-memory implementation of MapReduce
       • Shown to be an efficient programming model for SMP / CMP
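
As a concrete illustration of what the user supplies, here is a minimal word-count sketch in C, written in the spirit of the Phoenix interface. The emit_intermediate() / emit() helpers and the exact argument types are assumptions made for this sketch, not the actual Phoenix signatures.

    #include <ctype.h>
    #include <stdint.h>

    /* Provided by the MapReduce runtime; these declarations are hypothetical
     * stand-ins, not the exact Phoenix API. */
    void emit_intermediate(void *key, void *val, int key_size);
    void emit(void *key, void *val);

    /* Map: scan a chunk of text and emit a (word, 1) pair for every word. */
    void wordcount_map(char *data, int len)
    {
        int i = 0;
        while (i < len) {
            while (i < len && !isalpha((unsigned char)data[i]))
                i++;                                   /* skip separators */
            int start = i;
            while (i < len && isalpha((unsigned char)data[i]))
                i++;                                   /* consume one word */
            if (i > start)
                emit_intermediate(&data[start], (void *)(intptr_t)1, i - start);
        }
    }

    /* Reduce: all values emitted under one key arrive together; sum the counts. */
    void wordcount_reduce(void *key, void **vals, int num_vals)
    {
        intptr_t sum = 0;
        for (int i = 0; i < num_vals; i++)
            sum += (intptr_t)vals[i];
        emit(key, (void *)sum);
    }

Everything else (splitting the input, shuffling intermediate pairs, merging the output) is the runtime's job, which is exactly where the NUMA traffic discussed in the following slides comes from.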

  3. Project Goal
     • Improve the scalability of Phoenix on a NUMA system
       • 4-socket UltraSPARC T2+ machine
       • 256 hardware contexts (8 cores per chip, 8 contexts per core)
     • NUMA
       • 300 cycles to access local memory
       • 100-cycle overhead to access remote memory
       • All off-chip memory traffic is potentially an issue
     [Figure: 4-socket system diagram, chips 0-3 each with local memory mem 0-3, connected through a hub]

  4. Motivating Example
     • Baseline Phoenix shows moderate scalability on a single-socket machine
     • Performance plummets on a NUMA machine with a larger number of threads
       • One chip supports 64 threads
       • Utilizing off-chip threads destroys scalability
     [Figures: Speedup on a single-socket UltraSPARC T2; Speedup on a 4-socket UltraSPARC T2+ (NUMA)]

  5. MapReduce on a NUMA System
     • What's happening? It's about locality, locality, locality
       • Reducing off-chip memory traffic would be the key to scalability
     • Why is locality a problem in (the invincible) MapReduce?
       • The original Google MapReduce was meant for clusters; communication is implemented via GFS (a distributed file system)
       • Because Phoenix is a shared-memory implementation, communication takes place through shared memory
       • How the user-provided map / reduce functions access memory, and how they interact, has a significant impact on overall system performance

  6. Reducing Off-Chip Traffic
     • Where is this traffic generated?
     • MapReduce is actually Split-Map-Reduce-Merge
       • Each phase boundary introduces off-chip memory traffic
       • The location of map data is determined by the split phase
       • A global data shuffle occurs from the map to the reduce phase
       • The merge phase entails a global gather

  7. Split-to-Map Phase Off-Chip Traffic
     • Split phase
       • User-supplied data is distributed over the system
     • Map phase
       • Workers pull the data they need to work on a map task
     [Figure: 4-socket system diagram, chips 0-3 each with local memory mem 0-3, connected through a hub]

  8. Map-to-Reduce Phase Off-Chip Traffic
     • Map phase
       • Map results reside in local memory
     • Reduce phase
       • Has to be performed at a global scope
       • Global gathering of data (see the partitioning sketch below)
       • A similar gathering occurs at the reduce-to-merge boundary
     [Figure: 4-socket system diagram, chips 0-3 each with local memory mem 0-3, connected through a hub]
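
One way to see why this boundary crosses chips is the usual hash partitioning of intermediate keys, sketched below. The shared bucket array, the names, and the locking are illustrative assumptions, not the actual Phoenix internals; the point is that a key's bucket depends only on the key bytes, so a reduce worker ends up reading pairs produced by mappers on every chip.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define NUM_REDUCE_BUCKETS 256          /* the "number of key hash buckets" knob */

    typedef struct kv_node {
        const char     *key;
        void           *val;
        struct kv_node *next;
    } kv_node_t;

    typedef struct {
        pthread_mutex_t lock;               /* assumed initialized at startup */
        kv_node_t      *head;               /* entries written by mappers on every chip */
    } bucket_t;

    static bucket_t reduce_buckets[NUM_REDUCE_BUCKETS];   /* one array, global scope */

    /* The bucket index depends only on the key, never on the chip that produced it. */
    static unsigned hash_key(const char *key, size_t len)
    {
        unsigned h = 5381;
        for (size_t i = 0; i < len; i++)
            h = h * 33 + (unsigned char)key[i];
        return h % NUM_REDUCE_BUCKETS;
    }

    /* Called from map workers on any chip: both this write and the reduce
     * worker's later reads can land in remote memory. */
    static void shuffle_emit(const char *key, size_t len, void *val)
    {
        bucket_t *b = &reduce_buckets[hash_key(key, len)];
        kv_node_t *n = malloc(sizeof(*n));
        if (!n)
            return;                         /* sketch: no allocation-failure handling */
        n->key = key;
        n->val = val;
        pthread_mutex_lock(&b->lock);
        n->next = b->head;
        b->head = n;
        pthread_mutex_unlock(&b->lock);
    }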

  9. Applied Optimizations
     • Implement optimization / scheduling logic at each phase boundary to minimize off-chip traffic
     • Split-to-Map Phase Traffic
       • Data-locality-aware map task distribution
       • Installed per-locality-group (per-chip) task queues
       • Distribute map tasks according to the location of the map data
       • Worker threads work on local map tasks first, then perform task stealing across locality groups (see the sketch below)
     • Map-to-Reduce Phase Traffic
       • Combiners: perform a local reduction before shipping map results to the reduce workers
       • Reduces the amount of off-chip data traffic
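
Below is a minimal sketch of the locality-aware task distribution with stealing, assuming a simple locked queue per locality group; the data structures and the group lookup are illustrative, not the actual Phoenix Rebirth code.

    #include <pthread.h>
    #include <stdbool.h>

    #define NUM_GROUPS 4                 /* one locality group per chip on the T2+ box */

    typedef struct {
        void *data;                      /* input chunk, resident in this group's memory */
        int   len;
    } map_task_t;

    typedef struct {
        pthread_mutex_t lock;
        map_task_t     *tasks;
        int             head, tail;      /* tasks[head .. tail) are still pending */
    } task_queue_t;

    static task_queue_t queues[NUM_GROUPS];   /* assumed filled in by the scheduler */

    static bool try_dequeue(task_queue_t *q, map_task_t *out)
    {
        bool ok = false;
        pthread_mutex_lock(&q->lock);
        if (q->head < q->tail) {
            *out = q->tasks[q->head++];
            ok = true;
        }
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

    /* Each map worker first drains its own group's queue (the map data is
     * local), and only then steals from other groups, paying the off-chip cost. */
    static bool next_map_task(int my_group, map_task_t *out)
    {
        if (try_dequeue(&queues[my_group], out))
            return true;
        for (int g = 1; g < NUM_GROUPS; g++)
            if (try_dequeue(&queues[(my_group + g) % NUM_GROUPS], out))
                return true;
        return false;                    /* no map tasks left anywhere */
    }

Stealing keeps the load balanced when one group's local tasks finish early, at the price of exactly the off-chip accesses the local-first policy otherwise avoids.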

  10. Applied Optimizations (contd.)
     • Reduce-to-Merge Phase Traffic
       • Per-locality-group merge workers
       • Original Phoenix performs a global-scale merge
       • Perform a localized merge first, then merge the partial results in the final phase (see the sketch below)
       • Reduces chip crossings during the merge phase
     • And, of course, a lot of tuning / optimization for the single-chip case
       • Improved buffer management
       • Fine-tuned performance knobs: size of map tasks, number of key hash buckets
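
The two-level merge can be sketched as follows; the kv_t layout and the merge helpers are assumptions for illustration. Each locality group first merges its own reduce output with the same routine, using only local memory, and only the final pass over the per-group arrays has to cross chips.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char *key;
        long  val;
    } kv_t;

    typedef struct {
        kv_t  *elems;
        size_t n;
    } kv_array_t;

    /* Merge two key-sorted arrays into a freshly allocated sorted array.
     * Sketch: no allocation-failure handling, inputs are not freed. */
    static kv_array_t merge_two(kv_array_t a, kv_array_t b)
    {
        kv_array_t out = { malloc((a.n + b.n) * sizeof(kv_t)), 0 };
        size_t i = 0, j = 0;
        while (i < a.n || j < b.n) {
            if (j == b.n || (i < a.n && strcmp(a.elems[i].key, b.elems[j].key) <= 0))
                out.elems[out.n++] = a.elems[i++];
            else
                out.elems[out.n++] = b.elems[j++];
        }
        return out;
    }

    /* Phase 1 (per group, on the group's own chip): merge that group's reduce
     * partitions with merge_two(); all inputs live in local memory.
     * Phase 2 (global, below): merge the pre-merged per-group arrays; this is
     * the only step that must read remote memory. */
    static kv_array_t merge_all(kv_array_t per_group[], int num_groups)
    {
        kv_array_t result = per_group[0];
        for (int g = 1; g < num_groups; g++)
            result = merge_two(result, per_group[g]);
        return result;
    }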

  11. Preliminary Results
     [Figures: Execution time on kmeans; Speedup on kmeans]

  12. Preliminary Results (contd.)
     [Figures: Execution time on pca; Speedup on pca]

  13. Summary of Results
     • Significantly improved scalability without sacrificing execution time
     • Utilizing off-chip threads is still an issue
       • Memory bandwidth seems to be a problem
     [Figures: Original Phoenix speedup; Current speedup]

  14. In Progress
     • Measure memory bandwidth
       • Check whether or not the memory subsystem is the bottleneck
     • MapReducing-MapReduce (see the sketch below)
       • Execute multiple MapReduce instances, one per locality group
       • Globally merge the final results
       • Minimizes off-chip memory accesses as much as possible
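
A rough sketch of that idea, assuming hypothetical mr_run_instance(), bind_to_group(), and mr_merge_results() entry points (binding could be done, for example, with Solaris processor sets): one full MapReduce instance runs per locality group over its local slice of the input, and only the final merge is global.

    #include <pthread.h>
    #include <stddef.h>

    #define NUM_GROUPS 4

    typedef struct { void *out; } mr_result_t;

    /* Hypothetical runtime hooks: run a whole Map-Reduce-Merge pipeline over
     * one input slice using only the threads and memory of one locality
     * group, pin the calling thread to that group, and merge partial results. */
    extern mr_result_t mr_run_instance(int group, void *input, size_t len);
    extern void        bind_to_group(int group);
    extern mr_result_t mr_merge_results(mr_result_t *parts, int n);

    typedef struct {
        int         group;
        void       *input;
        size_t      len;
        mr_result_t result;
    } instance_arg_t;

    static void *run_one(void *p)
    {
        instance_arg_t *a = p;
        bind_to_group(a->group);             /* keep this instance on its own chip */
        a->result = mr_run_instance(a->group, a->input, a->len);
        return NULL;
    }

    /* Launch one instance per group, wait, then do the single global merge. */
    static mr_result_t mapreduce_of_mapreduce(instance_arg_t args[NUM_GROUPS])
    {
        pthread_t   tid[NUM_GROUPS];
        mr_result_t parts[NUM_GROUPS];

        for (int g = 0; g < NUM_GROUPS; g++)
            pthread_create(&tid[g], NULL, run_one, &args[g]);
        for (int g = 0; g < NUM_GROUPS; g++) {
            pthread_join(tid[g], NULL);
            parts[g] = args[g].result;
        }
        return mr_merge_results(parts, NUM_GROUPS);   /* only step that crosses chips */
    }

Keeping each instance's threads and data on one chip turns the entire map-to-reduce shuffle into local accesses; only the per-group results ever cross the hub.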

  15. Questions?

  16. Memory Bandwidth Bottleneck
     • For word_count, realized L2 throughput caps at 64 threads
     [Figures: Speedup on word_count; Systemwide L2D load misses per microsecond]
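
For reference, the measured counter can be turned into an approximate off-chip read bandwidth; the 64-byte L2 line size below is an assumption, and the miss rate is a placeholder to be replaced with the measured value.

    #include <stdio.h>

    int main(void)
    {
        double misses_per_usec = 100.0;   /* placeholder: plug in the measured L2D miss rate */
        double line_bytes      = 64.0;    /* assumed L2 cache line size */

        /* 1 byte per microsecond equals 1 MB/s, so: */
        double mb_per_sec = misses_per_usec * line_bytes;
        printf("approx. off-chip read bandwidth: %.1f MB/s (%.2f GB/s)\n",
               mb_per_sec, mb_per_sec / 1000.0);
        return 0;
    }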
