Introduction to Hadoop: A Framework for Distributed Computing

INTRODUCTION TO HADOOP
Dr. G Sudha Sadhasivam, Professor, CSE, PSG College of Technology, Coimbatore

Contents
• Distributed System
• DFS
• Hadoop
• Why is it needed?
• Issues
• Mutate / lease

Operating systems
• Operating system: software that supervises and controls tasks on a computer.
Individual OS types:
• Batch processing: jobs are collected and placed in a queue; there is no interaction with a job during processing.
• Time-shared: computing resources are provided to different users, who can interact with a program during execution.
• Real-time (RT) systems: fast response; can be interrupted.

Distributed Systems
• Consist of a number of computers that are connected and managed so that they automatically share the job processing load among the constituent computers.
• A distributed operating system is one that appears to its users as a traditional uniprocessor system, even though it is actually composed of multiple processors.
• It gives a single system view to its users and provides a single service.
• The location of files is transparent to users. It provides a virtual computing environment.
E.g. the Internet, ATM banking networks, mobile computing networks, Global Positioning Systems and air traffic control.
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE COHERENT SYSTEM

[Figure: Distributed OS (applications running over shared distributed operating system services) vs. Networked OS (each application running over its own network OS)]
Network Operating System
• In a network operating system the users are aware of the existence of multiple computers.
• The operating system of each computer must provide facilities for communication with the other machines.
• Each machine runs its own OS and has its own users.
• Remote login and remote file access are supported.
• Less transparent than a distributed OS, but the machines are more independent.

DFS
• Resource sharing is the motivation behind distributed systems. To share files, a file system is needed.
• The file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files.
• The file system is responsible for controlling access to the data and for performing low-level operations such as buffering frequently used data and issuing disk I/O requests.
• The goal is to allow users of physically distributed computers to share data and storage resources by using a common file system.

Hadoop
What is Hadoop?
• A framework for running applications that produce and process huge amounts of data on large clusters of commodity hardware
• Apache Software Foundation project
• Open source
• Amazon’s EC2
• alpha (0.18) release available for download
Hadoop includes
• HDFS: a distributed filesystem
• Map/Reduce: Hadoop implements this programming model. It is an offline computing engine.
Concept
Moving computation is more efficient than moving large data.

Data-intensive applications with petabytes of data
• Web pages: 20+ billion web pages x 20 KB = 400+ terabytes
• One computer can read 30-35 MB/sec from disk: ~four months to read the web
• Same problem with 1000 machines: about 3 hours
• Difficulty with a large number of machines:
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
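The arithmetic on this slide can be checked with a few lines of Python. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes), perfectly parallel reads, and the upper read rate of 35 MB/sec; the function name is illustrative only.

```python
def read_time_seconds(total_bytes, mb_per_sec, machines=1):
    """Time to stream total_bytes when every machine reads at mb_per_sec."""
    return total_bytes / (mb_per_sec * 1e6 * machines)

pages = 20e9          # 20 billion web pages (figure from the slide)
page_size = 20e3      # 20 KB per page, decimal units assumed
web_bytes = pages * page_size   # 4e14 bytes, i.e. 400 TB

# One machine versus an idealised 1000-machine cluster, both at 35 MB/sec.
single_days = read_time_seconds(web_bytes, 35) / 86400
cluster_hours = read_time_seconds(web_bytes, 35, machines=1000) / 3600

print(f"web size: {web_bytes / 1e12:.0f} TB")
print(f"one machine: {single_days:.0f} days")
print(f"1000 machines: {cluster_hours:.1f} hours")
```

The single-machine figure lands at a little over four months, and the cluster figure at a few hours, before any coordination overhead is counted.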

FACTS
Single-thread performance doesn’t matter
• We have large problems, and total throughput/price is more important than peak performance.
Stuff breaks, so more reliability is needed
• If you have one server, it may stay up three years (1,000 days).
• If you have 10,000 servers, expect to lose ten a day.
“Ultra-reliable” hardware doesn’t really help
• At large scales, super-fancy reliable hardware still fails, albeit less often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better perf/price
DECISION: COMMODITY HARDWARE. DFS: HADOOP. REASONS? WHAT SOFTWARE MODEL?
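The "ten a day" figure follows from simple division. The sketch below assumes that failures are independent and that each server fails on average once every 1,000 days; the function name is hypothetical.

```python
def expected_failures_per_day(servers, mean_days_between_failures):
    """Average number of servers lost per day across the whole fleet."""
    return servers / mean_days_between_failures

one_server = expected_failures_per_day(1, 1000)
large_fleet = expected_failures_per_day(10_000, 1000)

print(one_server)    # a failure is rare for a single machine
print(large_fleet)   # but routine across ten thousand machines
```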

HDFS
Why? Seek vs Transfer
• CPU & transfer speed, RAM & disk size double every 18-24 months
• Seek time is nearly constant (improves only ~5%/year)
• Time to read an entire drive is growing relative to the transfer rate
• Moral: scalable computing must go at transfer rate
• B-tree (relational DBs): operates at seek rate, log(N) seeks per access; memory/stream based
• Sort/merge of flat files (MapReduce): operates at transfer rate, log(N) transfers per sort; batch based

Characteristics
• Fault-tolerant, scalable, efficient, reliable distributed storage system
• Moves computation to the place where the data is
• Single cluster with computation and data
• Processes huge amounts of data
• Scalable: stores and processes petabytes of data
• Economical:
– It distributes the data and processing across clusters of commonly available computers.
– It clusters PCs into a storage and computing platform.
– It minimises the CPU cycles, RAM, etc. needed on individual machines.
• Efficient:
– By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
– Computation is moved to the place where the data is present.
• Reliable:
– Hadoop automatically maintains multiple copies of data.
– It automatically redeploys computing tasks after failures.

Cluster node runs both DFS and MR

• Data Model
– Data is organized into files and directories
– Files are divided into uniform-sized blocks and distributed across cluster nodes
– Blocks are replicated to handle hardware failure
– Checksums of data for corruption detection and recovery
– Block placement is exposed so that computation can be migrated to data
• Large streaming reads and small random reads
• Facility for multiple clients to append to a file

Assumes commodity hardware that fails
• Files are replicated to handle hardware failure
• Checksums for corruption detection and recovery
• Continues operation as nodes / racks are added / removed
• Optimized for fast batch processing
• Data location is exposed to allow computation to move to data
• Stores data in chunks/blocks on every node in the cluster
• Provides VERY high aggregate bandwidth

Files are broken into large blocks.
– Typically 128 MB block size
– Blocks are replicated for reliability
• One replica on the local node, another replica on a remote rack, a third replica on the local rack; additional replicas are randomly placed
• Understands rack locality
– Data placement is exposed so that computation can be migrated to data
• Client talks to both NameNode and DataNodes
– Data is not sent through the NameNode; clients access data directly from the DataNodes
– Throughput of the file system scales nearly linearly with the number of nodes.
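The block splitting and replica placement described on this slide can be sketched as follows. This is a toy model, not HDFS code: the rack topology, node names and function names are all hypothetical, and the placement order simply mirrors the slide (local node, remote rack, local rack, then random).

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

def place_replicas(writer, topology, replication=3, rng=random):
    """Choose nodes for one block; topology maps rack name -> node names."""
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    chosen = [writer]                                    # 1st: local node
    remote = [n for r, nodes in topology.items()
              if r != local_rack for n in nodes]
    if remote and len(chosen) < replication:
        chosen.append(rng.choice(remote))                # 2nd: remote rack
    local_others = [n for n in topology[local_rack] if n not in chosen]
    if local_others and len(chosen) < replication:
        chosen.append(rng.choice(local_others))          # 3rd: local rack
    rest = [n for nodes in topology.values()
            for n in nodes if n not in chosen]
    rng.shuffle(rest)                                    # extras: random nodes
    chosen.extend(rest[:replication - len(chosen)])
    return chosen

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}  # hypothetical cluster
blocks = split_into_blocks(300 * 1024 * 1024)              # a 300 MB file
replicas = place_replicas("n1", topology)
print(len(blocks), replicas)
```

A 300 MB file yields two full blocks and one partial block; each block would get its own replica list.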

Block Placement

Hadoop Cluster Architecture:

Components
• DFS Master “Namenode”
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints the namespace and journals namespace changes for reliability
Metadata of the NameNode in memory
– The entire metadata is in main memory
– No demand paging of FS metadata
Types of metadata: list of files, file and chunk namespaces; list of blocks, location of replicas; file attributes etc.

DFS SLAVES or DATA NODES
• Serve read/write requests from clients
• Perform replication tasks upon instruction by the NameNode
Data nodes act as:
1) A block server
– Stores data in the local file system
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to clients
2) Block report: periodically sends a report of all existing blocks to the NameNode
3) Heartbeat: periodically sends a heartbeat to the NameNode (to detect node failures)
4) Pipelining: facilitates pipelining of data to other specified DataNodes

Map/Reduce Master “Jobtracker”
• Accepts MR jobs submitted by users
• Assigns Map and Reduce tasks to Tasktrackers
• Monitors task and Tasktracker status, re-executes tasks upon failure
Map/Reduce Slaves “Tasktrackers”
• Run Map and Reduce tasks upon instruction from the Jobtracker
• Manage storage and transmission of intermediate output

SECONDARY NAME NODE
• Copies FsImage and Transaction Log from the NameNode to a temporary directory
• Merges FsImage and Transaction Log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– The Transaction Log on the NameNode is purged
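The checkpoint cycle on this slide amounts to replaying a log over a snapshot. The sketch below is a toy model under stated assumptions: the image is a plain dictionary and the log a list of tuples, whereas the real FsImage and Transaction Log are on-disk files with their own formats. All names and operation types here are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged namespace change to an in-memory image."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "set_replication":
        namespace[path]["replication"] = edit[2]
    else:
        raise ValueError(f"unknown op: {op}")

def checkpoint(fsimage, edit_log):
    """Merge the log into a copy of the image; return (new image, purged log)."""
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []   # an empty log stands in for the purge step

fsimage = {"/a": {"replication": 3}}
edit_log = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b")]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image, new_log)
```

Working on a copy mirrors the slide's use of a temporary directory: the original image stays untouched until the merged one is uploaded.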

HDFS Architecture
• NameNode: maps (filename, offset) -> block id, and block -> DataNode
• DataNode: maps block -> local disk
• Secondary NameNode: periodically merges edit logs
A block is also called a chunk.
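The two NameNode mappings can be pictured as a pair of lookup tables. This is a minimal sketch with made-up file names, block ids and DataNode names; it assumes fixed-size 128 MB blocks so that a byte offset maps to a block index by integer division.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed uniform block size

# Toy NameNode tables (all entries hypothetical).
file_blocks = {"/logs/day1": ["blk_1", "blk_2"]}       # filename -> ordered block ids
block_locations = {"blk_1": ["dn1", "dn3", "dn4"],      # block id -> DataNodes
                   "blk_2": ["dn2", "dn3", "dn5"]}

def locate(filename, offset):
    """Resolve (filename, offset) to a block id and the nodes holding it."""
    block_id = file_blocks[filename][offset // BLOCK_SIZE]
    return block_id, block_locations[block_id]

first = locate("/logs/day1", 0)
second = locate("/logs/day1", 200 * 1024 * 1024)
print(first, second)
```

A client would use the returned node list to read the block directly from a DataNode, so no file data passes through the NameNode.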

JOBTRACKER, TASKTRACKER AND JOBCLIENT

HDFS API
• Most common file and directory operations are supported:
– Create, open, close, read, write, seek, list, delete etc.
• Files are write-once and have exactly one writer
• Some operations peculiar to HDFS:
– set replication, get block locations
• Support for owners and permissions

DATA CORRECTNESS
• Use checksums to validate data
– Use CRC32
• File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
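The checksum scheme on this slide can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the idea only: the helper names are hypothetical, replicas are modelled as in-memory byte strings, and nothing here reflects how HDFS actually lays out its checksum files.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # one CRC32 per 512 bytes, as on the slide

def chunk_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """Compute a CRC32 for each 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify(data, checksums, chunk=BYTES_PER_CHECKSUM):
    """True if data matches the checksums recorded at file creation."""
    return chunk_checksums(data, chunk) == checksums

def read_with_fallback(replicas, checksums):
    """Return the first replica that validates, trying others on failure."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")

original = bytes(range(256)) * 5          # 1280 bytes, so three chunks
sums = chunk_checksums(original)          # computed by the client on write
corrupted = b"X" + original[1:]           # first byte damaged on one replica
result = read_with_fallback([corrupted, original], sums)
print(len(sums), result == original)
```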

MUTATION ORDER AND LEASES
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to maintain a consistent mutation order across replicas
• The master grants a chunk lease to one replica (the primary)
• The primary picks the serial order for all mutations to the chunk
• All replicas follow this order (consistency)
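The role of the primary can be sketched with two small classes. This is an illustrative model, assuming synchronous in-process calls and no failures; the class and method names are hypothetical, and lease granting and expiry by the master are left out.

```python
class Replica:
    """Holds one copy of a chunk as an ordered list of applied mutations."""
    def __init__(self):
        self.data = []

    def apply(self, serial, mutation):
        # A replica only accepts the mutation that is next in serial order.
        if serial != len(self.data):
            raise ValueError("mutation arrived out of order")
        self.data.append(mutation)

class Primary(Replica):
    """The replica holding the chunk lease; it picks the serial order."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)           # apply locally first
        for replica in self.secondaries:       # then forward in the same order
            replica.apply(serial, mutation)
        return serial

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.mutate("append A")
primary.mutate("append B")
print(primary.data, [r.data for r in secondaries])
```

Because every replica applies mutations in the order the primary chose, all copies end up identical.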

Software Model?
• Parallel programming improves performance and efficiency.
• In a parallel program, the processing is broken up into parts, each of which can be executed concurrently.
• Identify whether the problem can be parallelised (e.g. Fibonacci cannot easily be, since each term depends on the previous ones)
• Matrix operations on independent elements can be parallelised

Master/Worker
• The MASTER:
– initializes the array and splits it up according to the number of available WORKERS
– sends each WORKER its subarray
– receives the results from each WORKER
• The WORKER:
– receives the subarray from the MASTER
– performs processing on the subarray
– returns results to the MASTER
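The master/worker steps above can be sketched with Python's standard `concurrent.futures`. The per-element work (squaring) and the function names are placeholders chosen for illustration. Threads are used here only to keep the example short; CPU-bound work in CPython would normally use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(subarray):
    """WORKER: process one subarray and return the result."""
    return [x * x for x in subarray]

def split(array, parts):
    """Split array into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(array), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(array[start:end])
        start = end
    return chunks

def master(array, workers=4):
    """MASTER: split the array, hand out subarrays, gather the results."""
    chunks = split(array, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(worker, chunks))  # results keep chunk order
    return [y for part in results for y in part]

squares = master(list(range(10)), workers=3)
print(squares)
```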

CALCULATING PI
Consider a circle of radius r inscribed in a square of side 2r.
The area of the square, denoted As, is (2r)^2 = 4r^2. The area of the circle, denoted Ac, is pi * r^2.
• pi = Ac / r^2
• As = 4r^2
• r^2 = As / 4
• pi = 4 * Ac / As
• pi = 4 * (number of points in the circle) / (number of points in the square)

• Randomly generate points in the square
• Count the number of generated points that are both in the circle and in the square -> MAP (find ra = number of points in the circle / number of points in the square)
• ra = the number of points in the circle divided by the number of points in the square -> gather all ra
• PI = 4 * ra -> REDUCE
The counting of points in the circle is parallelised (MAP); the partial results are then merged to find PI (REDUCE).
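The MAP and REDUCE steps for estimating PI can be sketched as below. This is a single-process illustration with hypothetical function names; it assumes a unit circle inside the square [-1, 1] x [-1, 1], and each map task gets its own seed so that runs are repeatable. The reduce step sums raw counts rather than averaging the ratios ra, which gives the same answer when every map task generates the same number of points.

```python
import random

def map_count(args):
    """MAP: generate n random points in the square, count hits in the circle."""
    seed, n = args
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, n

def reduce_pi(partials):
    """REDUCE: merge the partial counts and apply pi = 4 * circle / square."""
    hits = sum(h for h, _ in partials)
    total = sum(n for _, n in partials)
    return 4.0 * hits / total

# Four independent map tasks; on a cluster these would run on different nodes.
partials = [map_count((seed, 50_000)) for seed in range(4)]
pi_estimate = reduce_pi(partials)
print(pi_estimate)
```

The estimate improves slowly with the number of points, which is why spreading the point generation across many machines is attractive.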


WHAT IS MAP REDUCE PROGRAMMING
• Restricted parallel programming model meant for large clusters
• User implements Map() and Reduce()
• Parallel computing framework (Hadoop libraries)
• Libraries take care of EVERYTHING else (abstraction):
– Parallelization
– Fault tolerance
– Data distribution
– Load balancing
• Useful model for many practical tasks
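The division of labour between user code and library can be sketched in a few lines. This is not the Hadoop API: `run_mapreduce` is a hypothetical single-process stand-in for the framework, and word counting is used only as a familiar example of a user-supplied Map() and Reduce().

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy 'library' side: run map, group values by key, then run reduce."""
    groups = defaultdict(list)
    for record in records:                    # map phase
        for key, value in map_fn(record):
            groups[key].append(value)         # shuffle: group values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(line):
    """User-supplied Map(): emit (word, 1) for each word in a line."""
    for word in line.split():
        yield word.lower(), 1

def word_reduce(word, counts):
    """User-supplied Reduce(): add up the counts for one word."""
    return sum(counts)

counts = run_mapreduce(["Hadoop runs Map", "map and reduce"],
                       word_map, word_reduce)
print(counts)
```

In a real cluster the framework would also handle parallelization, fault tolerance, data distribution and load balancing, none of which appear in this sketch.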

Conclusion
• Why commodity hardware? Because it is cheaper, and the software is designed to tolerate faults.
• Why HDFS? Network bandwidth vs seek latency.
• Why the Map/Reduce programming model?
– parallel programming
– large data sets
– moving computation to data
– single compute + data cluster
