
ETL with Hadoop and MapReduce

ETL with Hadoop and MapReduce. Jan Pieter Posthuma – SQL Zaterdag, 17 november 2012. Agenda: Big Data; Why do we need it?; Hadoop; MapReduce; Pig and Hive; Demos. Expectations. Big Data: too much data to transform into insight in a traditional BI way.



Presentation Transcript


  1. ETL with Hadoop and MapReduce

    Jan Pieter Posthuma – SQL Zaterdag, 17 november 2012
  2. Agenda: Big Data; Why do we need it?; Hadoop; MapReduce; Pig and Hive; Demos
  3. Expectations
  4. Big Data: Too much data to transform into insight in a traditional BI way
  5. Why do we need it? Classical BI solution (with filtering at each step): Source (±10 GB) → Stage (±10 GB) → DWH (±10 GB) → Datamart (±100 MB) → Report (±10 KB); total storage Σ ±30 GB. Big Data is about reducing time to insight: no ETL, no cleansing, no load. 'Analyze data when it arrives.'
  6. Hadoop: Replaces the need for additional Staging, DWH, and ETL layers. Additional storage is needed for highly unstructured data. Easy retrieval of (structured) data via Pig, Hive, Sqoop, ODBC for Hive, and PolyBase (HDFS)
  7. Big Data ecosystem (diagram): Reports, BI tools, Excel, and Dashboards on top; (virtual) datamarts fed via Sqoop and Hive & Pig; Map/Reduce and HDFS (Hadoop) at the core, alongside Relational Databases
  8. MapReduce. Map function: var map = function (key, value, context) {}. Reduce function: var reduce = function (key, values, context) {}. The map function (MapReduce.js) is distributed and scheduled multiple times across all nodes; each instance processes a data segment and emits (key, value) pairs via context
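The two signatures above can be exercised outside Hadoop with a small in-process driver. A minimal word-count sketch; the `runJob` harness, its values iterator, and the sample input are assumptions for illustration, not part of any Hadoop API:

```javascript
// Word count using the same (key, value, context) signatures as the slide.
var map = function (key, value, context) {
    value.split(/\s+/).forEach(function (word) {
        if (word) context.write(word.toLowerCase(), 1);
    });
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += values.next();
    }
    context.write(key, sum);
};

// Hypothetical local driver: groups map output by key ("shuffle"),
// then feeds each group to reduce through a values iterator.
function runJob(lines) {
    var groups = {}, result = {};
    lines.forEach(function (line, i) {
        map(i, line, { write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        } });
    });
    Object.keys(groups).forEach(function (k) {
        var vals = groups[k], idx = 0;
        reduce(k,
            { hasNext: function () { return idx < vals.length; },
              next: function () { return vals[idx++]; } },
            { write: function (key, v) { result[key] = v; } });
    });
    return result;
}

var counts = runJob(['Big Data', 'big insight']);
// counts is { big: 2, data: 1, insight: 1 }
```

On a real cluster the framework performs the shuffle between map and reduce; here it is simulated in a single process.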
  9. Hive and Pig: The principle is the same: easy data retrieval. Both use MapReduce. Different founders: Facebook (Hive) and Yahoo (Pig). Different languages: SQL-like (Hive) and more procedural (Pig). 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.'
  10. Demos: Query RDW data with a Hive table; Pig and MapReduce to get data from KNMI
  11. KNMI.js 1/2

    var map = function (key, value, context) {
        if (value[0] != '#') {
            var allValues = value.split(',');
            if (allValues[7] != '') {
                context.write(allValues[0] + '-' + allValues[1],
                    allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
            }
        }
    };
  12. KNMI.js 2/2

    var reduce = function (key, values, context) {
        var mMax = -9999;
        var mMin = 9999;
        var mKey = key.split('-');
        while (values.hasNext()) {
            var mValues = values.next().split(',');
            // Compare numerically; comparing the raw strings would be
            // lexicographic after the first assignment.
            var t = parseFloat(mValues[2]);
            mMax = t > mMax ? t : mMax;
            mMin = t < mMin ? t : mMin;
        }
        context.write(key.trim(), mKey[0].toString() + '\t' + mKey[1].toString()
            + '\t' + mMax.toString() + '\t' + mMin.toString());
    };
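Both halves of KNMI.js can be dry-run locally with the same kind of in-process driver. A sketch; the `runLocal` harness and the sample CSV rows are assumptions that only mimic the KNMI export format (comment lines start with '#', field 7 holds the temperature), and temperatures are compared numerically via parseFloat:

```javascript
// KNMI map: emit (station-date, "station,date,temperature") per data row.
var map = function (key, value, context) {
    if (value[0] != '#') {                       // skip comment lines
        var allValues = value.split(',');
        if (allValues[7] != '') {                // field 7 = temperature
            context.write(allValues[0] + '-' + allValues[1],
                allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
        }
    }
};

// KNMI reduce: per station-date key, compute max and min temperature.
var reduce = function (key, values, context) {
    var mMax = -9999, mMin = 9999;
    var mKey = key.split('-');
    while (values.hasNext()) {
        var t = parseFloat(values.next().split(',')[2]);
        mMax = t > mMax ? t : mMax;
        mMin = t < mMin ? t : mMin;
    }
    context.write(key.trim(),
        mKey[0] + '\t' + mKey[1] + '\t' + mMax + '\t' + mMin);
};

// Hypothetical driver simulating the shuffle between map and reduce.
function runLocal(lines) {
    var groups = {}, out = {};
    lines.forEach(function (line, i) {
        map(i, line, { write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        } });
    });
    Object.keys(groups).forEach(function (k) {
        var vals = groups[k], idx = 0;
        reduce(k,
            { hasNext: function () { return idx < vals.length; },
              next: function () { return vals[idx++]; } },
            { write: function (key, v) { out[key] = v; } });
    });
    return out;
}

var out = runLocal([
    '# STN,YYYYMMDD,f2,f3,f4,f5,f6,T',
    '260,20121117,0,0,0,0,0,85',
    '260,20121117,0,0,0,0,0,42'
]);
// out['260-20121117'] is '260\t20121117\t85\t42'
```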