
About Hadoop



Presentation Transcript


  1. About Hadoop Hadoop was one of the first popular open-source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. Internal components: HDFS and YARN with MapReduce.

  2. What is HDFS? HDFS is a file system that stores data in a reliable manner. It consists of two types of nodes, the NameNode and DataNodes, which store metadata and the actual data, respectively. HDFS is a block-structured file system: just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
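
As a rough arithmetic illustration of the block structure described above, the sketch below computes how many default-size blocks a file occupies; the 1 GB file size is a made-up example.

```python
# Minimal sketch: how many HDFS blocks a file of a given size occupies,
# assuming the default 128 MB block size. The 1 GB file size is made up.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB in bytes

file_size = 1 * 1024 ** 3        # example: a 1 GB file
num_blocks = math.ceil(file_size / BLOCK_SIZE)

print(f"{file_size} bytes -> {num_blocks} blocks")  # -> 8 blocks
```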

  3. YARN YARN is a distributed OS, also called a cluster manager, that processes huge amounts of data quickly and in parallel. It can process different types of workloads at the same time, such as batch, streaming, and iterative jobs. It is a unified stack.

  4. What is MapReduce? MapReduce is a processing engine in Hadoop. It can process only batch data, meaning bounded data. Internally it processes disk to disk, so it is very slow. Everything must be optimized manually, and it allows different ecosystems such as Hive, Pig, and more to process the data.
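
To make the processing model concrete, here is a minimal word-count mapper and reducer in the Hadoop Streaming style, where each stage reads stdin and writes tab-separated key/value lines to stdout; the command-line dispatch at the bottom is an illustrative convention, not a fixed API.

```python
# Minimal word-count sketch in the Hadoop Streaming style: the mapper and
# reducer read stdin and write tab-separated key/value lines to stdout.
# Hadoop sorts the mapper output by key before it reaches the reducer.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as `python wordcount.py mapper` or `python wordcount.py reducer`.
    (mapper if sys.argv[1] == "mapper" else reducer)(sys.stdin)
```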

  5. Common data sources

  6. Processing too slow

  7. Data lost

  8. HDFS is No. 1 for storing data in parallel • There is no competitor for storing data reliably, in a scalable manner, at low cost. • But the problem is processing the data quickly. • How do we overcome this and process data quickly? • The problem with MapReduce is that it is very slow. • How do we resolve it?

  9. Speed and durability are the two key factors

  10. Problem - Solution Disk-to-disk processing is very slow, which is why MapReduce takes a lot of time, and stacking framework on framework creates new processing problems. In-memory processing keeps everything in RAM, which makes processing very fast.
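
A minimal PySpark sketch of the in-memory idea: cache() keeps the dataset in RAM so later actions avoid re-reading from disk. It assumes pyspark is installed, and "data.txt" is a made-up input path.

```python
# Minimal sketch of in-memory processing: cache() keeps the RDD in RAM,
# so repeated actions avoid re-reading from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

rdd = spark.sparkContext.textFile("data.txt").cache()  # keep in memory

print(rdd.count())   # first action reads from disk and populates the cache
print(rdd.count())   # second action is served from RAM

spark.stop()
```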

  11. Libraries

  12. Why only Spark, and why not others?

  13. 10 times less code, 10 times faster
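
As a rough sense of the "less code" claim, the same word count as the earlier mapper/reducer sketch takes only a few lines in PySpark; "data.txt" is again a made-up input path.

```python
# Word count in a few lines of PySpark, for comparison with the earlier
# Hadoop Streaming mapper/reducer sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("data.txt")   # made-up input path
    .flatMap(lambda line: line.split())       # one record per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # sum counts per word
)
print(counts.take(5))
spark.stop()
```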

  14. Why did I switch to Spark? The key features of Spark include the following: • Easy to use (programmer friendly) • Fast (in-memory) • General-purpose • Scalable, processing data in parallel • Optimized • Fault tolerant • Unified platform

  15. Different types of data Batch processing -- Hadoop; Streaming -- Storm; Iterative -- MLlib or GraphX; Interactive -- SQL/BI

  16. Key entities: 1) driver program, 2) cluster manager, 3) worker nodes, 4) executors, 5) tasks

  17. What is the Driver Program? The Spark driver is the program that declares/defines the transformations and actions on RDDs of data and submits such requests to the master. The node where the driver program runs is called the driver node; it may be either inside or outside the cluster.
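
A minimal driver-program sketch under local-mode assumptions: the transformations (filter, map) are declared lazily, and the action (collect) is what actually submits work for execution.

```python
# Minimal driver-program sketch: transformations (filter, map) are lazy;
# the action (collect) is what actually triggers job execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("driver-demo").getOrCreate()

nums = spark.sparkContext.parallelize(range(10))
evens_squared = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n)  # lazy

print(evens_squared.collect())  # action: [0, 4, 16, 36, 64]
spark.stop()
```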

  18. Cluster manager (YARN) It is a distributed OS. It schedules tasks and allocates resources in the cluster, allocating RAM and CPUs to executors based on NodeManager requests.
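
A hedged sketch of requesting executor resources from YARN via standard Spark configuration properties; the instance, memory, and core values are made-up examples and assume a working Hadoop/YARN setup.

```python
# Sketch of asking YARN for executor resources; the values below are
# made-up examples and assume a configured Hadoop/YARN cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                           # let YARN manage the cluster
    .appName("yarn-demo")
    .config("spark.executor.instances", "4")  # ask for 4 executors
    .config("spark.executor.memory", "2g")    # RAM per executor
    .config("spark.executor.cores", "2")      # CPUs per executor
    .getOrCreate()
)
```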

  19. Worker nodes / NodeManager In Hadoop terminology a worker node is also called a NodeManager. It manages the executors; if an executor crosses its resource limits, the NodeManager kills it.

  20. Tasks • A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to return a result either to the driver program or to S3/HDFS. • Spark creates a task per data partition. An executor runs one or more tasks concurrently. The amount of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
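
A small sketch of the partition-to-task relationship under local-mode assumptions: Spark schedules one task per partition, so repartitioning changes the available parallelism.

```python
# Sketch of the partition/task relationship: Spark creates one task per
# partition, so repartitioning changes how many tasks can run in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("tasks-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
print(rdd.getNumPartitions())   # 4 partitions -> 4 tasks

wider = rdd.repartition(8)      # 8 partitions -> 8 tasks
print(wider.getNumPartitions())
spark.stop()
```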
