Distributed Systems CS 15-440

Distributed Systems CS 15-440: Hadoop
Lecture 13, October 25, 2017
Mohammad Hammoud

Today
• Last Session:
  • MPI (Concluded)
• Today’s Session:
  • Hadoop Distributed File System and MapReduce
• Announcements:
  • P2 grades are out
  • PS4 is out. It is due on Nov 1st by midnight
  • P3 is due on Nov 12th by midnight

We Live in a World of Data…

What Do We Do With Big Data?
• Store
• Share
• Access
• Process
• Encrypt
• … and more!
We want to do all of these seamlessly...

Where to Store Big Data?
• The underlying storage system is a key component for enabling Big Data querying/mining/analytics
• Typically, the storage system would “partition” and “distribute” Big Data, using striping (or partitioning) and placement techniques
  • This allows for concurrent accesses to data
  • It also improves fault-tolerance

[Figure: a logical file of 16 striping units (0 to 15) striped across four servers, with the striping unit and the stripe size marked]
Server 1: 0 4 8 12
Server 2: 1 5 9 13
Server 3: 2 6 10 14
Server 4: 3 7 11 15
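The round-robin layout shown in the figure can be sketched as follows. This is a minimal illustration, assuming one striping unit per placement step and servers numbered from 1 as in the figure; the function name `stripe` is hypothetical.

```python
def stripe(num_units, num_servers):
    """Place striping units on servers in round-robin order.

    Unit i goes to server (i mod num_servers) + 1, so server numbering
    starts at 1 as in the figure. Returns {server: [unit, ...]}.
    """
    placement = {s: [] for s in range(1, num_servers + 1)}
    for unit in range(num_units):
        placement[unit % num_servers + 1].append(unit)
    return placement


layout = stripe(16, 4)
for server, units in layout.items():
    print(f"Server {server}: {units}")
```

Because consecutive units land on different servers, a client reading units 0 to 3 can fetch them from all four servers at once.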

Example: The Google File System
• GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines

[Figure: a large file divided into blocks Blk 0 to Blk 6 at offsets 0M, 64M, 128M, 192M, 256M, 320M, and 384M; Server 0 is the writer, and the blocks are spread across Server 0 to Server 3]
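A rough sketch of fixed-size partitioning with random placement is given below. The 64 MB block size follows the offsets in the figure; the 448 MB file size, the function names, and the choice of one random server per block are assumptions for illustration only. The sketch ignores replication and is not how GFS itself is implemented.

```python
import random

BLOCK_SIZE_MB = 64  # block size implied by the offsets in the figure


def partition(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the starting offset (in MB) of each fixed-size block."""
    return list(range(0, file_size_mb, block_size_mb))


def place_randomly(num_blocks, servers, seed=None):
    """Assign each block to one randomly chosen server (illustrative only)."""
    rng = random.Random(seed)
    return {blk: rng.choice(servers) for blk in range(num_blocks)}


servers = ["Server 0", "Server 1", "Server 2", "Server 3"]
offsets = partition(448)  # an assumed 448 MB file yields Blk 0 to Blk 6
placement = place_randomly(len(offsets), servers, seed=0)
```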

Example: The Google File System
• GFS adopts a master-slave architecture

[Figure: the GFS client sends a file name to the master and receives a contact address; it then sends a chunk ID and range to a chunk server and receives chunk data. Each chunk server runs on top of a Linux file system]

How to Process Big Data?
• One alternative: create a custom distributed system (or program) for each new algorithm
  • Cumbersome!
• Another alternative: utilize modern distributed analytics frameworks, which:
  • Relieve programmers from concerns with many of the difficult aspects of developing distributed programs
  • Allow programmers to focus on ONLY the sequential parts of their programs
• Examples:
  • Hadoop MapReduce
  • Google’s Pregel
  • CMU’s Distributed GraphLab

Distributed Analytics Frameworks: Hadoop MapReduce
• Introduction
• Programming Model
• Execution Model
• Architectural & Scheduling Models

Hadoop
• Hadoop is one of the most successful realizations of large-scale “data-parallel” distributed analytics frameworks
• Hadoop MapReduce is an open source implementation of Google’s MapReduce
• Hadoop uses Hadoop Distributed File System (HDFS) as a storage layer
  • HDFS is an open source implementation of GFS

Hadoop MapReduce: A Bird’s Eye View
• Hadoop MapReduce incorporates two phases, the Map and Reduce phases, which encompass multiple Map and Reduce tasks

[Figure: a dataset in HDFS is divided into splits 0 to 3, one HDFS block each. In the Map phase, each Map task reads one split and writes partitions. In the Reduce phase, each Reduce task goes through a Shuffle stage, a Merge & Sort stage, and a Reduce stage, then writes its output to HDFS]
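The path from Map output partitions to Reduce input can be sketched roughly as below. This is a single-process illustration, not Hadoop's code: the CRC32-based partitioning scheme and the names `partition_of` and `shuffle_and_sort` are assumptions chosen so that the example is deterministic.

```python
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    """Pick a Reduce task for a key with a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Route every (key, value) pair to its Reduce task, then sort by key.

    map_outputs: one list of (key, value) pairs per Map task.
    Returns one key-sorted list of (key, [values]) per Reduce task.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:  # output of each Map task
        for key, value in pairs:
            buckets[partition_of(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=3)
```

The property that matters is that all values for a given key reach the same Reduce task, whichever Map task produced them.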

The Programming Model
• Hadoop MapReduce employs a shared-memory-based programming model, which entails that:
  • Tasks can interact (if needed) by reading from and writing to a shared space
    • HDFS provides the shared space for all Map and Reduce tasks
  • Programmers write only sequential code, without defining functions that send/receive messages between tasks

[Figure: Map tasks MT1 to MT6 and Reduce tasks RT1 to RT3, each group attached to a shared address space provided by HDFS. Communication between the Map and Reduce tasks is “implicit” (provided by the MapReduce engine): programmers do not write or call any communication routines]

Example: Word Count

[Figure: a text file is divided into two chunks. One Map function parses and counts the words in the chunk “Mohammad is delivering a lecture to the 15-440 class”; another Map function does the same for the chunk “The course name of 15-440 is Distributed Systems”. A Reduce function iterates over the Map outputs and sums the counts]
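The word count example can be sketched in plain Python as follows. This is a minimal single-process illustration rather than Hadoop API code: `map_fn`, `reduce_fn`, and the `word_count` driver are hypothetical names, the driver stands in for the MapReduce engine, and splitting on whitespace with case-sensitive words is an assumption.

```python
from collections import defaultdict


def map_fn(chunk):
    """Parse a chunk of text and emit a (word, 1) pair per word."""
    return [(word, 1) for word in chunk.split()]


def reduce_fn(word, counts):
    """Iterate over the counts emitted for one word and sum them."""
    return word, sum(counts)


def word_count(chunks):
    """Stand-in for the engine: group Map output by word, then reduce."""
    grouped = defaultdict(list)
    for chunk in chunks:  # one Map task per chunk
        for word, one in map_fn(chunk):
            grouped[word].append(one)
    return dict(reduce_fn(w, c) for w, c in grouped.items())


chunks = [
    "Mohammad is delivering a lecture to the 15-440 class",
    "The course name of 15-440 is Distributed Systems",
]
counts = word_count(chunks)
```

Note that `map_fn` and `reduce_fn` contain only sequential code; the grouping between them is the part the framework would provide.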

The Execution Model
• Hadoop MapReduce adopts a synchronous execution model
• A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in a lock-step mode
  • No two tasks can run concurrently under two different iterations
• In MapReduce:
  • Each iteration is treated as a MapReduce job
  • A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
  • Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs
  • When all Reduce tasks within job i are committed, a new job i + 1 is started (if any)
  • Hence, two different tasks cannot run in parallel under two different jobs (or iterations)
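The chaining of jobs can be sketched as a simple loop. This is an illustration under simplifying assumptions: tasks run one after another in a single process, and `run_job` and `run_iterative` are hypothetical names, not Hadoop functions. What it does capture is the ordering: job i + 1 cannot start until job i has returned.

```python
def run_job(map_fn, reduce_fn, splits):
    """Run one MapReduce job to completion and return its committed output."""
    intermediate = [map_fn(split) for split in splits]  # all Map tasks
    return reduce_fn(intermediate)  # all Reduce tasks


def run_iterative(map_fn, reduce_fn, splits, num_iterations):
    """Chain jobs: job i + 1 starts only after job i has fully finished."""
    for _ in range(num_iterations):
        splits = run_job(map_fn, reduce_fn, splits)
    return splits


def double(split):
    """Toy Map function: double every number in a split."""
    return [2 * x for x in split]


def identity(parts):
    """Toy Reduce function: pass the Map output through unchanged."""
    return parts


result = run_iterative(double, identity, [[1, 2], [3]], num_iterations=3)
```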

The Architectural and Scheduling Models
• Hadoop MapReduce employs a master-slave architecture
• A pull-based task scheduling strategy is used, whereby:
  • Map tasks are scheduled in proximity of HDFS blocks
  • Reduce tasks are scheduled anywhere

[Figure: a core switch connects Rack Switch 1 and Rack Switch 2. The master runs the JobTracker; the slaves run TaskTracker1 to TaskTracker5, with labels MT1, MT2, and MT3 on the slaves. A TaskTracker requests a Map task, and the JobTracker schedules a Map task at an empty Map slot on TaskTracker1]
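A pull-based, locality-aware choice of Map task might look roughly like the sketch below. The function `pick_map_task`, the `pending` table, and the block locations in it are all hypothetical; the sketch models node-level locality only and is not the JobTracker's actual algorithm.

```python
def pick_map_task(pending, tracker):
    """Choose a pending Map task for a TaskTracker that asked for work.

    pending: {task_id: set of TaskTrackers holding the task's HDFS block}
    Prefers a task whose block is stored on the requesting tracker and
    falls back to any pending task otherwise. Returns None if no work is left.
    """
    if not pending:
        return None
    for task_id, hosts in pending.items():
        if tracker in hosts:  # data-local choice
            return task_id
    return next(iter(pending))  # no local block: take any pending task


# Assumed block locations, for illustration only.
pending = {
    "MT1": {"TaskTracker1"},
    "MT2": {"TaskTracker2"},
    "MT3": {"TaskTracker3"},
}
task = pick_map_task(pending, "TaskTracker2")  # TaskTracker2 pulls work
del pending[task]  # the task is now scheduled
```

The scheduler never pushes work: a task is assigned only when a TaskTracker with an empty slot asks for one.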

The Architectural and Scheduling Models
• Hadoop MapReduce employs a master-slave architecture
• With the above setup, how many Map tasks can run in parallel?
  • Each TaskTracker has by default two Map slots, and thus can run two Map tasks concurrently
  • With 4 TaskTrackers and 2 Map slots on each TaskTracker, 4 × 2 = 8 Map tasks can be executed in parallel
  • The maximum number of Map tasks that can run in parallel is referred to as a Map wave

[Figure: the same cluster diagram as on the previous slide]

The Architectural and Scheduling Models
• Hadoop MapReduce employs a master-slave architecture
• For a dataset with a size of 1024MB, how many Map waves are needed?
  • The size of each HDFS block is by default 64MB and each split encompasses by default 1 HDFS block
  • Hence, there will be a total of 1024/64 = 16 HDFS blocks or 16 splits
  • The input to each Map task is a single split, thus there will be a total of 16 Map tasks
  • Therefore, 16 tasks/8 slots = 2 Map waves will be needed

[Figure: the same cluster diagram as on the previous slide]
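The Map wave arithmetic can be checked with a few lines of Python. The defaults below are the values quoted on the slides (64MB blocks, 4 TaskTrackers, 2 Map slots each); the function name `map_waves` is hypothetical, and rounding up is an assumption covering datasets that do not divide evenly, since a partial block still needs a Map task and a partial wave still takes a wave.

```python
import math


def map_waves(dataset_mb, block_mb=64, task_trackers=4, slots_per_tracker=2):
    """Return (number of Map tasks, parallel Map slots, Map waves)."""
    num_tasks = math.ceil(dataset_mb / block_mb)  # one split per block
    num_slots = task_trackers * slots_per_tracker  # tasks per wave
    return num_tasks, num_slots, math.ceil(num_tasks / num_slots)


tasks, slots, waves = map_waves(1024)
print(tasks, slots, waves)  # 16 8 2
```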

Hadoop MapReduce: Summary

Next Class • Pregel and GraphLab
