
ETL with Hadoop and MapReduce

ETL with Hadoop and MapReduce. Jan Pieter Posthuma – SQL Zaterdag, 17 november 2012. Agenda: Big Data; Why do we need it?; Hadoop; MapReduce; Pig and Hive; Demos. Expectations. Big Data: too much data to transform into insight in a traditional BI way.



Presentation Transcript


  1. ETL with Hadoop and MapReduce

    Jan Pieter Posthuma – SQL Zaterdag, 17 november 2012
  2. Agenda: Big Data; Why do we need it?; Hadoop; MapReduce; Pig and Hive; Demos
  3. Expectations
  4. Big Data: Too much data to transform into insight in a traditional BI way
  5. Why do we need it? Classical BI solution (with filtering at each step): Source (±10 GB) → Stage (±10 GB) → DWH (±10 GB) → Datamart (±100 MB) → Report (±10 KB); total storage Σ ±30 GB. Big Data is about reducing time to insight: no ETL, no cleansing, no load. 'Analyze data when it arrives.'
  6. Hadoop: Replaces the need for additional Staging, DWH, and ETL layers. Additional storage is needed for highly unstructured data. Easy retrieval of (structured) data via Pig, Hive, Sqoop, ODBC for Hive, and PolyBase (HDFS)
  7. Big Data ecosystem (diagram): Reports, BI tools, Excel, and Dashboards on top; (virtual) datamarts fed via Sqoop and Hive & Pig; Map/Reduce and HDFS (Hadoop) at the core, alongside Relational Databases
  8. MapReduce. Map function: var map = function (key, value, context) {}. Reduce function: var reduce = function (key, values, context) {}. The map function (MapReduce.js) is distributed and scheduled multiple times across all nodes; each instance processes a data segment and emits (key, value) pairs via context
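The two signatures above can be exercised outside Hadoop with a small in-process driver. A minimal word-count sketch; the `runJob` harness, its values iterator, and the sample input are assumptions for illustration, not part of any Hadoop API:

```javascript
// Word count using the same (key, value, context) signatures as the slide.
var map = function (key, value, context) {
    value.split(/\s+/).forEach(function (word) {
        if (word) context.write(word.toLowerCase(), 1);
    });
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += values.next();
    }
    context.write(key, sum);
};

// Hypothetical local driver: groups map output by key ("shuffle"),
// then feeds each group to reduce through a values iterator.
function runJob(lines) {
    var groups = {}, result = {};
    lines.forEach(function (line, i) {
        map(i, line, { write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        } });
    });
    Object.keys(groups).forEach(function (k) {
        var vals = groups[k], idx = 0;
        reduce(k,
            { hasNext: function () { return idx < vals.length; },
              next: function () { return vals[idx++]; } },
            { write: function (key, v) { result[key] = v; } });
    });
    return result;
}

var counts = runJob(['Big Data', 'big insight']);
// counts is { big: 2, data: 1, insight: 1 }
```

On a real cluster the framework performs the shuffle between map and reduce; here it is simulated in a single process.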
  9. Hive and Pig: The principle is the same: easy data retrieval. Both use MapReduce. Different founders: Facebook (Hive) and Yahoo (Pig). Different languages: SQL-like (Hive) and more procedural (Pig). 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.'
  10. Demos: Query RDW data with a Hive table; Pig and MapReduce to get data from KNMI
  11. KNMI.js 1/2

    var map = function (key, value, context) {
        if (value[0] != '#') {
            var allValues = value.split(',');
            if (allValues[7] != '') {
                context.write(allValues[0] + '-' + allValues[1],
                    allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
            }
        }
    };
  12. KNMI.js 2/2

    var reduce = function (key, values, context) {
        var mMax = -9999;
        var mMin = 9999;
        var mKey = key.split('-');
        while (values.hasNext()) {
            var mValues = values.next().split(',');
            // Compare numerically; comparing the raw strings would be
            // lexicographic after the first assignment.
            var t = parseFloat(mValues[2]);
            mMax = t > mMax ? t : mMax;
            mMin = t < mMin ? t : mMin;
        }
        context.write(key.trim(), mKey[0].toString() + '\t' + mKey[1].toString()
            + '\t' + mMax.toString() + '\t' + mMin.toString());
    };
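Both halves of KNMI.js can be dry-run locally with the same kind of in-process driver. A sketch; the `runLocal` harness and the sample CSV rows are assumptions that only mimic the KNMI export format (comment lines start with '#', field 7 holds the temperature), and temperatures are compared numerically via parseFloat:

```javascript
// KNMI map: emit (station-date, "station,date,temperature") per data row.
var map = function (key, value, context) {
    if (value[0] != '#') {                       // skip comment lines
        var allValues = value.split(',');
        if (allValues[7] != '') {                // field 7 = temperature
            context.write(allValues[0] + '-' + allValues[1],
                allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
        }
    }
};

// KNMI reduce: per station-date key, compute max and min temperature.
var reduce = function (key, values, context) {
    var mMax = -9999, mMin = 9999;
    var mKey = key.split('-');
    while (values.hasNext()) {
        var t = parseFloat(values.next().split(',')[2]);
        mMax = t > mMax ? t : mMax;
        mMin = t < mMin ? t : mMin;
    }
    context.write(key.trim(),
        mKey[0] + '\t' + mKey[1] + '\t' + mMax + '\t' + mMin);
};

// Hypothetical driver simulating the shuffle between map and reduce.
function runLocal(lines) {
    var groups = {}, out = {};
    lines.forEach(function (line, i) {
        map(i, line, { write: function (k, v) {
            (groups[k] = groups[k] || []).push(v);
        } });
    });
    Object.keys(groups).forEach(function (k) {
        var vals = groups[k], idx = 0;
        reduce(k,
            { hasNext: function () { return idx < vals.length; },
              next: function () { return vals[idx++]; } },
            { write: function (key, v) { out[key] = v; } });
    });
    return out;
}

var out = runLocal([
    '# STN,YYYYMMDD,f2,f3,f4,f5,f6,T',
    '260,20121117,0,0,0,0,0,85',
    '260,20121117,0,0,0,0,0,42'
]);
// out['260-20121117'] is '260\t20121117\t85\t42'
```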