
Big Data & Hadoop








  1. Hannah Jones presents Big Data & Hadoop

  2. Agenda • What is Big Data? • What is Hadoop? • Hadoop vs. Traditional Database • Who uses Hadoop? • Brief History • How does Hadoop work? • Storage • Processing • Other Projects • Presentation Layer

  3. What is Big Data? • Collections of large and complex data sets • Difficult to process • Challenging to capture, store, search, share, transfer, analyze, and visualize • Volume, Velocity, Variety

  4. New types of Data • Structured • Unstructured • Tweets • Images • Audio • Text Messages • Click Trails

  5. Big Data = Big ROI • Healthcare: 20% decrease in patient mortality by analyzing streaming patient data • Telco: 92% decrease in processing time by analyzing networking and call data • Utilities: 99% improved accuracy in placing power generation resources by analyzing 2.8 petabytes of untapped data

  6. Software Available • Amazon DynamoDB • Couchbase 2.0 • Apache Cassandra • Apache Hadoop

  7. Hadoop

  8. HADOOP! The… Database…? • Massively scalable storage and batch data processing system • Replaces some functions of traditional databases • Simultaneously ingests, processes, and delivers/exports large volumes of data • Absorbs any type of data from any source

  9. Traditional DB • Databases (MySQL, SQL Server, Oracle, etc.) • Transactional systems, reporting, and archiving • Reads & writes for “reasonable” data sets (< 1B rows) • Real-time or batch processing

  10. Hadoop • Overcomes the traditional limitations of storage and computing • Provides linear scalability from 1 to 4,000 servers • Low-cost, open-source software • Leverages inexpensive commodity hardware as its platform

  11. Hadoop • Analysis of highly granular data • Structured and Unstructured data • Fast integration of multiple data sources • Responds well to rapidly changing business requirements

  12. Hadoop vs. Databases • Scale-Out instead of Scale-Up • Key/value pairs instead of relational tables • Functional programming (MapReduce) instead of declarative queries (SQL)  • Offline batch processing instead of online transactions • More complex security
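The "functional programming instead of declarative queries" contrast above can be sketched in a few lines of plain Python (an illustration of the idea only; real Hadoop jobs are written against the MapReduce API or Hadoop Streaming):

```python
# What SQL expresses declaratively -- SELECT word, COUNT(*) FROM words GROUP BY word;
# -- MapReduce expresses as explicit map and reduce functions over key/value pairs.

from collections import defaultdict

def map_phase(records):
    """Emit a (key, value) pair for every input record."""
    for word in records:
        yield (word, 1)

def shuffle(pairs):
    """Group all values by key (the framework does this in real Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine each key's list of values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

words = ["hadoop", "sql", "hadoop"]
print(reduce_phase(shuffle(map_phase(words))))  # {'hadoop': 2, 'sql': 1}
```

The database engine plans the GROUP BY for you; in MapReduce you supply the two functions and the framework handles distribution, sorting, and grouping.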

  13. Who uses Hadoop?

  14. History

  15. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  16. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  17.–22. HDFS – Gettysburg Address Example (six image-only walkthrough slides; no transcript text)
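Although the example slides are image-only, HDFS's storage model can be approximated with a toy simulation. The block size, node names, and round-robin placement below are assumptions for illustration; classic HDFS uses 64 MB (later 128 MB) blocks and a default replication factor of 3, with the NameNode choosing replica locations:

```python
# Toy simulation of HDFS storage: split a file into fixed-size blocks,
# then place each block on several DataNodes.

BLOCK_SIZE = 16   # bytes here; ~64 MB in classic HDFS
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file contents into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin here;
    the real NameNode also considers rack topology)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

data = b"Four score and seven years ago our fathers brought forth"
blocks = split_into_blocks(data)
print(len(blocks), place_replicas(blocks, ["node1", "node2", "node3", "node4"]))
```

The NameNode keeps only this block-to-node mapping in memory; the blocks themselves live on the DataNodes, which is why losing every replica of a block loses that data.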

  23. NameNode Limitations • Lose all 3 machines => lose all of the data • Very rare • Single point of failure • Doesn't fail very often • Secondary NameNode • A backup • Failover is not automatic

  24. Hadoop in Detail • 2 Main pieces • HDFS • Storage • Files and directories • Provides high bandwidth access to the data • MapReduce • Processing – Manages Jobs

  25. Map Reduce • Break a massive task into smaller chunks • Process in parallel • Key/Value Pairs vs. Columns/Rows -> map() function

  26. Steps to understanding MapReduce • Think in terms of keys and values • Write a Mapper • Write a Reducer
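The three steps above map directly onto a Hadoop Streaming word-count job, which pipes text through a mapper script and a reducer script. The sketch below writes them as functions over iterables so the logic is testable without a cluster; in a real job each would read stdin and write stdout:

```python
# Step 1: think in keys and values -- the mapper emits "word\t1" per word.
# Step 2/3: Hadoop sorts mapper output by key before it reaches the reducer,
# so the reducer can sum each key's run of values with groupby.

from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word in key-sorted mapper output."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Locally we simulate Hadoop's sort-by-key shuffle with sorted().
    mapped = sorted(mapper(["four score and seven", "and four"]))
    print(list(reducer(mapped)))
```

On a cluster, the same two scripts would be handed to the streaming jar (roughly `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`); the framework supplies the splitting, sorting, and distribution.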

  27. Job and Task Trackers • JobTracker • MapReduce coordinator • 1 JobTracker for an entire cluster • Accepts users' jobs • Divides each job into tasks • Assigns tasks to individual TaskTrackers • Each TaskTracker reports status as it runs • Notices if a TaskTracker disappears due to failure • Reassigns that TaskTracker's tasks to another TaskTracker
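The JobTracker's bookkeeping can be sketched as a small simulation (hypothetical class and method names; the real JobTracker/TaskTracker protocol uses periodic heartbeats over RPC, and failed tasks are rescheduled individually):

```python
# Toy model of the coordinator: divide a job into tasks, assign them,
# and reassign everything from a TaskTracker that stops responding.

class JobTracker:
    def __init__(self, task_trackers):
        self.assignments = {tt: [] for tt in task_trackers}

    def submit_job(self, tasks):
        """Divide a job into tasks and assign them round-robin."""
        trackers = list(self.assignments)
        for i, task in enumerate(tasks):
            self.assignments[trackers[i % len(trackers)]].append(task)

    def on_tracker_lost(self, dead, replacement):
        """A TaskTracker stopped heartbeating: move its tasks elsewhere."""
        self.assignments[replacement].extend(self.assignments.pop(dead))

jt = JobTracker(["tt1", "tt2"])
jt.submit_job(["map-0", "map-1", "map-2"])
jt.on_tracker_lost("tt1", "tt2")
print(jt.assignments)  # all three tasks end up on tt2
```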

  28. MapReduce – Gettysburg Address Example • Hello World :: Regular Programming • Word Count :: MapReduce Programming

  29. Word Count Example Map

  30. Word Count Example

  31. Word Count Example Shuffle <Four, [1, 1]> <Score, [1, 1, 1]> <and, [1, 1, 1, 1, 1, 1, 1]> <seven, [1, 1, 1, 1]> <years, [1]> <ago, [1, 1, 1]> <our, [1, 1, 1, 1, 1]> <fathers, [1]> <brought, [1]>

  32. Word Count Example Reduce <Four, 2> <Score, 3> <and, 7> <seven, 4> <years, 1> <ago, 3> <our, 5> <fathers, 1> <brought, 1>
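The map → shuffle → reduce flow on slides 29–32 can be replayed end to end in plain Python (a local simulation, not Hadoop; the sample sentence below is made up, so its counts differ from the full-address counts on the slides):

```python
from collections import defaultdict

text = "four score and seven years ago four and score and"

# Map: one (word, 1) pair per word
mapped = [(word, 1) for word in text.split()]

# Shuffle: group the 1s by key, producing lists like <four, [1, 1]> (slide 31)
shuffled = defaultdict(list)
for word, one in mapped:
    shuffled[word].append(one)

# Reduce: sum each list, producing pairs like <four, 2> (slide 32)
reduced = {word: sum(ones) for word, ones in shuffled.items()}
print(reduced)
```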

  33. MapReduce jobs can be tedious to write…

  34. MapReduce jobs can be tedious to write…

  35. But wait… there’s more

  36. ZooKeeper

  37. New Projects • HCatalog • Stores metadata • Oozie • Scheduling system • Sqoop • Transfers data between Hadoop and relational databases • Mahout • Machine learning w/ MapReduce • BigTop • Packages and integrates all of these projects

  38. Getting Data Out • Databases • Datameer • Tableau

  39. Summary • Big Data is the new trend in businesses • Hadoop is a way to store and process that data • Hadoop is made up of many projects • Presentation tools are a great way to visualize the data
