
About Hadoop



Presentation Transcript


  1. About Hadoop Hadoop was one of the first popular open-source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. Internal components: HDFS and YARN with MapReduce.

  2. What is HDFS? HDFS is a file system that stores data in a reliable manner. It consists of two types of nodes, the NameNode and DataNodes, which store metadata and the actual data, respectively. HDFS is a block-structured file system: just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
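
As a rough arithmetic illustration of the block structure described above, the sketch below computes how many default-size blocks a file occupies; the 1 GB file size is a made-up example.

```python
# Minimal sketch: how many HDFS blocks a file of a given size occupies,
# assuming the default 128 MB block size. The 1 GB file size is made up.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB in bytes

file_size = 1 * 1024 ** 3        # example: a 1 GB file
num_blocks = math.ceil(file_size / BLOCK_SIZE)

print(f"{file_size} bytes -> {num_blocks} blocks")  # -> 8 blocks
```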

  3. YARN YARN is a distributed OS, also called a cluster manager, that processes huge amounts of data quickly and in parallel. It can process different types of workloads at the same time, such as batch, streaming, and iterative jobs. It is a unified stack.

  4. What is MapReduce? MapReduce is a processing engine in Hadoop. It can process only batch data, meaning bounded data. Internally it processes disk to disk, so it is very slow. Everything must be optimized manually, and it allows different ecosystems such as Hive, Pig, and more to process the data.
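
To make the processing model concrete, here is a minimal word-count mapper and reducer in the Hadoop Streaming style, where each stage reads stdin and writes tab-separated key/value lines to stdout; the command-line dispatch at the bottom is an illustrative convention, not a fixed API.

```python
# Minimal word-count sketch in the Hadoop Streaming style: the mapper and
# reducer read stdin and write tab-separated key/value lines to stdout.
# Hadoop sorts the mapper output by key before it reaches the reducer.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as `python wordcount.py mapper` or `python wordcount.py reducer`.
    (mapper if sys.argv[1] == "mapper" else reducer)(sys.stdin)
```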

  5. Common data sources

  6. Processing too slow

  7. Data lost

  8. HDFS is No. 1 for storing data in parallel • There is no competitor for storing data reliably, in a scalable manner, at low cost. • But the problem is processing the data quickly. • How do we overcome this and process data quickly? • The problem with MapReduce is that it is very slow. • How do we resolve it?

  9. Speed and durability are the two key factors

  10. Problem - Solution Disk-to-disk processing is very slow, which is why MapReduce takes a lot of time, and stacking framework on framework creates new processing problems. In-memory processing keeps everything in RAM, which makes processing very fast.
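
A minimal PySpark sketch of the in-memory idea: cache() keeps the dataset in RAM so later actions avoid re-reading from disk. It assumes pyspark is installed, and "data.txt" is a made-up input path.

```python
# Minimal sketch of in-memory processing: cache() keeps the RDD in RAM,
# so repeated actions avoid re-reading from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

rdd = spark.sparkContext.textFile("data.txt").cache()  # keep in memory

print(rdd.count())   # first action reads from disk and populates the cache
print(rdd.count())   # second action is served from RAM

spark.stop()
```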

  11. Libraries

  12. Why only Spark, and why not others?

  13. 10 times less code, 10 times faster
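
As a rough sense of the "less code" claim, the same word count as the earlier mapper/reducer sketch takes only a few lines in PySpark; "data.txt" is again a made-up input path.

```python
# Word count in a few lines of PySpark, for comparison with the earlier
# Hadoop Streaming mapper/reducer sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("data.txt")   # made-up input path
    .flatMap(lambda line: line.split())       # one record per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # sum counts per word
)
print(counts.take(5))
spark.stop()
```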

  14. Why did I switch to Spark? The key features of Spark include the following: • Easy to use (programmer friendly) • Fast (in-memory) • General-purpose • Scalable, processing data in parallel • Optimized • Fault tolerant • Unified platform

  15. Different types of data Batch processing -- Hadoop; Streaming -- Storm; Iterative -- MLlib or GraphX; Interactive -- SQL/BI

  16. Key entities: 1) driver program, 2) cluster manager, 3) worker nodes, 4) executors, 5) tasks

  17. What is the Driver Program? The Spark driver is the program that declares/defines the transformations and actions on RDDs of data and submits such requests to the master. The node where the driver program runs is called the driver node; it may be either inside or outside the cluster.
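
A minimal driver-program sketch under local-mode assumptions: the transformations (filter, map) are declared lazily, and the action (collect) is what actually submits work for execution.

```python
# Minimal driver-program sketch: transformations (filter, map) are lazy;
# the action (collect) is what actually triggers job execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("driver-demo").getOrCreate()

nums = spark.sparkContext.parallelize(range(10))
evens_squared = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n)  # lazy

print(evens_squared.collect())  # action: [0, 4, 16, 36, 64]
spark.stop()
```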

  18. Cluster manager (YARN) It is a distributed OS. It schedules tasks and allocates resources in the cluster, allocating RAM and CPUs to executors based on NodeManager requests.
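
A hedged sketch of requesting executor resources from YARN via standard Spark configuration properties; the instance, memory, and core values are made-up examples and assume a working Hadoop/YARN setup.

```python
# Sketch of asking YARN for executor resources; the values below are
# made-up examples and assume a configured Hadoop/YARN cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                           # let YARN manage the cluster
    .appName("yarn-demo")
    .config("spark.executor.instances", "4")  # ask for 4 executors
    .config("spark.executor.memory", "2g")    # RAM per executor
    .config("spark.executor.cores", "2")      # CPUs per executor
    .getOrCreate()
)
```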

  19. Worker nodes / NodeManager In Hadoop terminology a worker node is also called a NodeManager. It manages the executors; if an executor crosses its resource limits, the NodeManager kills it.

  20. Tasks • A task is the smallest unit of work that Spark sends to an executor. It is executed by a thread in an executor on a worker node. Each task performs some computation to return a result either to the driver program or to S3/HDFS. • Spark creates a task per data partition. An executor runs one or more tasks concurrently. The amount of parallelism is determined by the number of partitions: more partitions mean more tasks processing data in parallel.
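
A small sketch of the partition-to-task relationship under local-mode assumptions: Spark schedules one task per partition, so repartitioning changes the available parallelism.

```python
# Sketch of the partition/task relationship: Spark creates one task per
# partition, so repartitioning changes how many tasks can run in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("tasks-demo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
print(rdd.getNumPartitions())   # 4 partitions -> 4 tasks

wider = rdd.repartition(8)      # 8 partitions -> 8 tasks
print(wider.getNumPartitions())
spark.stop()
```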
