
Hadoop & Map Reduce



Presentation Transcript


  1. Hadoop & Map Reduce Presentation by Yoni Nesher NonSQL database Techforum

  2. Hadoop & Map Reduce Forum Agenda: Big data problem domain Hadoop ecosystem Hadoop Distributed File System (HDFS) Diving into MapReduce MapReduce case studies MapReduce vs. parallel DB systems – comparison and analysis

  3. Hadoop & Map Reduce Main topics: HDFS – Hadoop distributed file system – manages the storage across a network of machines; designed for storing very large files, optimized for streaming data access patterns. MapReduce – A distributed data processing model and execution environment that runs on large clusters of commodity machines.

  4. Introduction • It has been said that “More data usually beats better algorithms” • For some problems, however sophisticated your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm) • So the good news is that Big Data is here! • The bad news is that we are struggling to store and analyze it..

  5. Introduction • A possible (and only) solution – read and write data in parallel • This approach introduces new problems in the data I/O domain: • Hardware failure: • As soon as you start using many pieces of hardware, the chance that one will fail is fairly high. • A common way of avoiding data loss is through replication • Combining data: • Most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with data from any of the other disks • The MapReduce programming model abstracts the problem away from disk reads and writes (coming up..)

  6. Introduction • What is Hadoop? • Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by MapReduce. • There are other parts to Hadoop, but these capabilities are its kernel • History • Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. • Hadoop has its origins in Apache Nutch, an open source web search engine, also a part of the Lucene project. • In January 2008, Hadoop was made its own top-level project at Apache • Using Hadoop: Yahoo!, Last.fm, Facebook, the New York Times (more examples later on..)

  7. The Hadoop ecosystem: • Common • A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures). • Avro • A serialization system for efficient, cross-language RPC, and persistent data storage. • MapReduce • A distributed data processing model and execution environment that runs on large clusters of commodity machines. • HDFS • A distributed file system that runs on large clusters of commodity machines. • Pig • A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters. • Hive • A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data. • HBase • A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads). • ZooKeeper • A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications. • Sqoop • A tool for efficiently moving data between relational databases and HDFS.

  8. Hadoop HDFS What is HDFS? • A distributed filesystem – manages the storage across a network of machines • Designed for storing very large files • Files that are hundreds of megabytes, gigabytes, or terabytes in size. • There are Hadoop clusters running today that store petabytes of data in single files • Streaming data access patterns • Optimized for write-once, read-many-times • Not optimized for low-latency seek operations, lots of small files, multiple writers, or arbitrary file modifications

  9. HDFS concepts • Blocks • The minimum amount of data that a file system can read or write. • A disk file system's blocks are typically a few kilobytes in size • HDFS blocks are 64 MB by default • Files in HDFS are broken into block-sized chunks, which are stored as independent units. • Unlike a file system for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.

  10. HDFS concepts • Blocks (cont.) • Blocks are just chunks of data to be stored—file metadata such as hierarchies and permissions does not need to be stored with the blocks • Each block is replicated to a small number of physically separate machines (typically three). • If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client
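
To make the block abstraction concrete, here is a minimal sketch (not from the slides) that lists a file's blocks and the replica locations through the Hadoop FileSystem API; the class name ListBlocks and the example path are illustrative assumptions:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: prints each block of a file and the datanodes holding its replicas.
    public class ListBlocks {
      public static void main(String[] args) throws Exception {
        String uri = args[0];                       // an HDFS path, e.g. hdfs://namenode/user/foo/big.log (assumed)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus status = fs.getFileStatus(new Path(uri));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block reports the hosts holding one of its replicas.
        for (BlockLocation block : blocks) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(),
              String.join(",", block.getHosts()));
        }
      }
    }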

  11. HDFS concepts • Namenodes and Datanodes • Master-worker pattern: a namenode (the master) and a number of datanodes (workers) • NameNode: • Manages the filesystem namespace and maintains the filesystem tree and the metadata for all the files and directories in the tree. • This information is stored persistently on the local disk • Knows the datanodes on which all the blocks for a given file are located • It does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

  12. HDFS concepts • Namenodes and Datanodes (cont.) • DataNodes: • Datanodes are the workhorses of the file system. • They store and retrieve blocks when they are told to (by clients or the namenode) • They report back to the namenode periodically with lists of the blocks that they are storing • Without the namenode, the filesystem cannot be used: • If the machine running the namenode crashes, all the files on the file system would be lost • There would be no way of knowing how to reconstruct the files from the blocks on the datanodes.

  13. HDFS concepts • Namenodes and Datanodes (cont.) • For this reason, it is important to make the namenode resilient to failure • Possible solution - back up the files that make up the persistent state of the file system metadata • Hadoop can be configured so that the namenode writes its persistent state to multiple file systems. These writes are synchronous and atomic. • The usual configuration choice is to write to local disk as well as a remote NFS mount.

  14. HDFS concepts • Linux CLI examples • Add a file from the local FS: hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt • Copy a file back to the local FS: hadoop fs -copyToLocal quangle.txt quangle.copy.txt • Create a directory: hadoop fs -mkdir books • List the current directory: hadoop fs -ls .

  15. HDFS concepts Anatomy of a file write • Create a file • Write data • Close file
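The write-path diagram from this slide is not reproduced in the transcript. As a rough sketch (assumed, not taken from the slides), the client-side sequence (create a file, write data, close the file) looks like this with the FileSystem API:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical example: the destination path is an assumption.
    public class WriteFile {
      public static void main(String[] args) throws Exception {
        String dst = args[0];                         // e.g. hdfs://namenode/user/foo/out.txt (assumed)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);

        // Create a file: the namenode records the new file, then the client
        // streams data block by block to a pipeline of datanodes.
        FSDataOutputStream out = fs.create(new Path(dst));
        out.writeUTF("hello, HDFS");                  // write data
        out.close();                                  // close file: remaining packets are flushed
      }
    }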

  16. HDFS concepts Anatomy of a file read • Open a file • Read data • Close file
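The read-path diagram is likewise missing; a matching sketch (assumed) of the client-side sequence (open a file, read data, close the file) could be:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical example: the source path is an assumption.
    public class ReadFile {
      public static void main(String[] args) throws Exception {
        String src = args[0];                         // e.g. hdfs://namenode/user/foo/quangle.txt (assumed)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(src), conf);

        // Open a file: the namenode returns the block locations, and the
        // stream then reads each block from the closest datanode.
        FSDataInputStream in = fs.open(new Path(src));
        IOUtils.copyBytes(in, System.out, 4096, false);  // read data to stdout
        IOUtils.closeStream(in);                         // close file
      }
    }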

  17. HDFS concepts Network topology and Hadoop • Hadoop represents the network as a tree of data centers, racks, and nodes, and measures the distance between two nodes by their distance to their closest common ancestor. • For example, consider a node n1 on rack r1 in data center d1, represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios: • distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node) • distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack) • distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center) • distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)

  18. MapReduce What is it? • A distributed data processing model and execution environment that runs on large clusters of commodity machines. • Can be used with Java, Ruby, Python, C++ and more • Inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal • MapReduce process flow: HDFS data → Formatting → <key, value> collection → Map → <key, value> collection → MR framework processing → <key, values> collection → Reduce → <key, value> collection → output → HDFS data

  19. MapReduce Problem example: Weather Dataset Create a program that mines weather data • Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data. Source: NCDC • The data is stored using a line-oriented ASCII format, in which each line is a record • Mission – calculate the max temperature for each year around the world • Problem – millions of temperature measurement records

  20. MapReduce Example: Weather Dataset Brute-force approach – Bash (each year's logs are compressed to a single yearXXXX.gz file) • The complete run for the century took 42 minutes on a single EC2 High-CPU Extra Large Instance.

  21. MapReduce Weather Dataset with MapReduce Input formatting phase (flow: HDFS data → Formatting → <key, value> collection) • The input to the MR job is the raw NCDC data • Input format: we use the Hadoop text formatter class • When given a directory (HDFS URL), it outputs a Hadoop <key, value> collection: • The key is the offset of the beginning of the line from the beginning of the file • The value is the line text

  22. MapReduce Map phase (flow: <key, value> collection → Map → <key, value> collection) • The input to our map phase is the lines, as <offset, line_text> pairs • The map function pulls out the year and the air temperature, since these are the only fields we are interested in • The map function also drops bad records – it filters out temperatures that are missing, suspect, or erroneous. • Map output: <year, temp> pairs

  23. MapReduce MR framework processing phase (flow: <key, value> collection → MR framework processing → <key, values> collection) • The output from the map function is processed by the MR framework before being sent to the reduce function • This processing sorts and groups the key-value pairs by key • MR framework processing output: <year, temperatures> pairs

  24. MapReduce Reduce phase (flow: <key, values> collection → Reduce → <key, value> collection) • The input to our reduce phase is the <year, temperatures> pairs • All the reduce function has to do now is iterate through the list and pick the maximum reading • Reduce output: <year, max temperature> pairs

  25. MapReduce Data output phase (flow: <key, value> collection → output → HDFS data) • The input to the data output class is the <year, max temperature> pairs from the reduce function • When using the default Hadoop output formatter, the output is written to a pre-defined directory, which contains one output file per reducer.

  26. MapReduce Process summary • Question: How could this process be further optimized in the NCDC case? • Flow: Textual logs in HDFS → Formatting → <offset, line> collection → Map → <year, temp> collection → MR framework processing → <year, temp values> collection → Reduce → <year, max temp> collection → output → Textual result in HDFS

  27. MapReduce Some code.. Map function
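
The code image from this slide is not reproduced in the transcript. A minimal sketch of such a map function, written against the old (org.apache.hadoop.mapred) API that the rest of the deck assumes, might look like the following; the class name and the NCDC field offsets (columns 15–19 for the year, 87–92 for the temperature, 92 for the quality code) are assumptions based on the standard fixed-width NCDC record layout:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of a mapper that emits <year, temperature> for each valid record.
    public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final int MISSING = 9999;   // NCDC code for a missing reading

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {

        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {            // parseInt doesn't like a leading plus sign
          airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
          airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Drop missing, suspect, or erroneous readings.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
          output.collect(new Text(year), new IntWritable(airTemperature));
        }
      }
    }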

  28. MapReduce Some code.. Reduce function
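
Again, the slide's code is an image; a matching reduce-function sketch (assumed, not taken verbatim from the slide) that scans each year's temperatures and keeps the maximum could be:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch of a reducer that emits <year, max temperature>.
    public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {

        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
          maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
      }
    }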

  29. MapReduce Some code.. Putting it all together And running: hadoop MaxTemperature input/ncdc/sample.txt output
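
The driver shown on this slide is also an image; a sketch of a driver that wires the two classes together through a JobConf (class and job names are assumptions) could be:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch of a driver for the weather example.
    public class MaxTemperature {
      public static void main(String[] args) throws IOException {
        if (args.length != 2) {
          System.err.println("Usage: MaxTemperature <input path> <output path>");
          System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);   // submit the job and wait for it to finish
      }
    }

With the job classes on the Hadoop classpath, the command from the slide (hadoop MaxTemperature input/ncdc/sample.txt output) would submit this job and write one part file per reducer under the output directory.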

  30. MapReduce Going deep.. Definitions: • MR Job – a unit of work that the client wants to be performed. Consists of: • The input data • The MapReduce program • Configuration information • Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

  31. MapReduce • There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. • Jobtracker – coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. • Tasktrackers – run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.

  32. MapReduce Scaling out! • Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. • Splits normally correspond to (one or more) file blocks • Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. • Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.

  33. MapReduce

  34. MapReduce • When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. • The framework ensures that the records for any given key are all in a single partition. • The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well • What would be a good partition function in our case? (see the sketch below)
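
As a hedged illustration (not from the slides) of what a user-defined partitioning function looks like in the old API, the following partitioner buckets the weather job's intermediate <year, temperature> pairs by a hash of the year, which is essentially what the default HashPartitioner already does (and why the default is usually good enough here); the class name is an assumption:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: all records for a given year land in the same partition,
    // so one reducer sees every temperature for that year.
    public class YearPartitioner implements Partitioner<Text, IntWritable> {

      public void configure(JobConf job) {
        // no configuration needed
      }

      public int getPartition(Text year, IntWritable temp, int numPartitions) {
        // mask off the sign bit so the result is non-negative
        return (year.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

It would be plugged in with conf.setPartitionerClass(YearPartitioner.class) in the driver.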

  35. MapReduce

  36. MapReduce • Overall MR system flow • The 4 entities in a MapReduce application: • The client, which submits the MapReduce job • The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker. • The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker. • The distributed filesystem (HDFS), which is used for sharing job files between the other entities.

  37. MapReduce • Overall MR system flow • Job submission: • The MR program creates a new JobClient instance and calls submitJob() on it (step 1) • Having submitted the job, runJob() polls the job's progress once a second and reports the progress to the console if it has changed since the last report. • When the job is complete, if it was successful, the job counters are displayed. • Otherwise, the error that caused the job to fail is logged to the console. • The job submission process implemented by JobClient's submitJob() method does the following: • Asks the jobtracker for a new job ID by calling getNewJobId() on JobTracker (step 2)

  38. MapReduce • Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program. • Computes the input splits for the job. If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program. • Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID (step 3). • Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).

  39. MapReduce • When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it. • Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5). • To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). • It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, and the scheduler simply creates this number of reduce tasks to be run. • Tasks are given IDs at this point
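
A small aside (assumed, not from the slides): unlike the map-task count, which follows from the splits, the reduce-task count is purely a configuration choice, for example:

    import org.apache.hadoop.mapred.JobConf;

    // Minimal sketch: the reduce-task count comes from configuration, not from the input size.
    public class ReduceTaskCount {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(2);                   // same effect as setting mapred.reduce.tasks=2
        System.out.println(conf.getNumReduceTasks());
      }
    }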

  40. MapReduce • Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. • Heartbeats tell the jobtracker that a tasktracker is alive • The jobtracker will allocate a task to the tasktracker using the heartbeat return value (step 7). • Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from. • Tasktrackers have a fixed number of slots for map tasks and for reduce tasks (the precise number depends on the number of cores and the amount of memory on the tasktracker) • The default scheduler fills empty map task slots before reduce task slots

  41. MapReduce • Data locality • For a map task, the jobtracker takes account of the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker. • In the optimal case, the task is data-local – running on the same node that the split resides on. • Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. • Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are running on.

  42. MapReduce • Running the task: • First, the tasktracker localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem • Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory. • Third, it creates an instance of TaskRunner to run the task. • TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). • It is possible to reuse the JVM between tasks • The child process communicates with its parent in order to inform the parent of the task's progress every few seconds until the task is complete.

  43. MapReduce • Job completion • When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to “successful.” • When the JobClient polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the runJob() method. • Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example)

  44. MapReduce Back to the Weather Dataset • The same program will run, without alteration, on a full cluster. • This is the point of MapReduce: it scales to the size of your data and the size of your hardware. • On a 10-node EC2 cluster running High-CPU Extra Large Instances, the program took six minutes to run

  45. MapReduce Hadoop implementations around: • eBay • 532-node cluster (8 × 532 cores, 5.3 PB). • Heavy usage of Java MapReduce, Pig, Hive, HBase • Using it for search optimization and research. • Facebook • Uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. • Current major clusters: • 1100-machine cluster with 8800 cores and about 12 PB raw storage. • 300-machine cluster with 2400 cores and about 3 PB raw storage. • Each (commodity) node has 8 cores and 12 TB of storage.

  46. MapReduce • LinkedIn • Multiple grids divided up based upon purpose. • 120 Nehalem-based Sun x4275, with 2×4 cores, 24 GB RAM, 8×1 TB SATA • 580 Westmere-based HP SL 170x, with 2×4 cores, 24 GB RAM, 6×2 TB SATA • 1200 Westmere-based SuperMicro X8DTT-H, with 2×6 cores, 24 GB RAM, 6×2 TB SATA • Software: • CentOS 5.5 -> RHEL 6.1 • Apache Hadoop 0.20.2+patches -> Apache Hadoop 0.20.204+patches • Pig 0.9 heavily customized • Hive, Avro, Kafka, and other bits and pieces... • Used for discovering People You May Know and other fun facts. • Yahoo! • More than 100,000 CPUs in >40,000 computers running Hadoop • Biggest cluster: 4500 nodes (2×4-CPU boxes with 4×1 TB disk & 16 GB RAM) • Used to support research for Ad Systems and Web Search • Also used to do scaling tests to support development of Hadoop on larger clusters

  47. MapReduce and Parallel DBMS systems Parallel DBMS systems • In the mid-1980s the Teradata and Gamma projects pioneered a new architectural paradigm for parallel database systems based on a cluster of commodity computer nodes • These were called “shared-nothing” nodes (i.e., separate CPU, memory, and disks), connected only through a high-speed interconnect • Every parallel database system built since then essentially uses the techniques first pioneered by these two projects: • Horizontal partitioning of relational tables – distribute the rows of a relational table across the nodes of the cluster so they can be processed in parallel. • Partitioned execution of SQL queries – selection, aggregation, join, projection, and update queries are distributed among the nodes, and results are sent back to a “master” node for merging.

  48. MapReduce and Parallel DBMS systems • Many commercial implementations are available, including Teradata, Netezza, DATAllegro (Microsoft), ParAccel, Greenplum, Aster, Vertica, and DB2. • All run on shared-nothing clusters of nodes, with tables horizontally partitioned over them. MapReduce • An attractive quality of the MR programming model is simplicity: an MR program consists of only two functions • Map and Reduce—written by a user to process key/value data pairs. • The input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. • The program is then injected into a distributed-processing framework and executed

  49. MapReduce and Parallel DBMS systems MR – Parallel DBMS comparison • Filtering and transformation of individual data items (tuples in tables) can be executed by a modern parallel DBMS using SQL. • For Map operations not easily expressed in SQL, many DBMSs support user-defined function (UDF) extensibility, which provides the equivalent functionality of a Map operation. • SQL aggregates augmented with UDFs and user-defined aggregates provide DBMS users the same MR-style Reduce functionality. • Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL. • Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL).

  50. MapReduce and Parallel DBMS systems MR – Parallel DBMS comparison • As for scalability – several production databases in the multi-petabyte range are run by very large customers, operating on clusters of around 100 nodes. • The people who manage these systems do not report the need for additional parallelism. • Thus, parallel DBMSs offer great scalability over the range of nodes that customers desire. So why use MapReduce? Why is it used so widely?
