HDInsight on Azure and Map-Reduce

Richard Conway, Windows Azure MVP, Elastacloud Limited

Agenda
Introduction
Big Data with HDInsight

Introduction

Solving problems through distribution
Some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines.
These 24 machines require orchestration: jobs are divided into tasks, and tasks are distributed across a cluster.
Software systems are required to facilitate the distribution; examples are Hadoop and HPC Server.
We will now provision a Hadoop cluster on Windows Azure.

Big Data vs Big Compute

Hadoop, HPC Server, Open MPI

All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised.
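The job-to-task split can be sketched in a few lines of JavaScript. This is only an illustration: `splitIntoTasks` and `assignRoundRobin` are hypothetical helper names, and round-robin assignment is an assumed scheduling policy, not how Hadoop or HPC Server actually place tasks.

```javascript
// Hypothetical helper: break a JOB (an array of work items) into TASKS of a fixed size.
function splitIntoTasks(job, taskSize) {
    var tasks = [];
    for (var i = 0; i < job.length; i += taskSize) {
        tasks.push(job.slice(i, i + taskSize));
    }
    return tasks;
}

// Hypothetical scheduler: hand tasks to workers in round-robin order.
function assignRoundRobin(tasks, workerCount) {
    var assignments = [];
    for (var w = 0; w < workerCount; w++) {
        assignments.push([]);
    }
    tasks.forEach(function (task, index) {
        assignments[index % workerCount].push(task);
    });
    return assignments;
}

// A job of 10 work items, cut into tasks of up to 3 items, spread over 2 workers.
var job = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
var tasks = splitIntoTasks(job, 3);
var plan = assignRoundRobin(tasks, 2);
console.log(JSON.stringify(plan));
```

Each inner array of `plan` is the list of tasks one worker would run; a real cluster would also handle failures, data locality and uneven task durations.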

HPC: Head Node, Broker Node, Worker Nodes
Hadoop: Name Node, Name Node, Data Nodes

Understanding Big Data

KEY TRENDS
Device Explosion: >5.5 billion (70+% of global population)
Social Networks: >2 billion users
Cheap Storage: $100 gets you 3 million times more storage in 30 years
Ubiquitous Connection: web traffic of 130 exabytes (10^18) in 2010, 1.6 zettabytes (10^21) in 2015
Sensor Networks: >10 billion
Inexpensive Computing: 10 MIPS/$ in 1980, 10M MIPS/$ in 2005

What is Big Data?
Volume: Gigabytes (10^9), Terabytes (10^12), Petabytes (10^15), Exabytes (10^18)
Velocity, Variety, Variability
Source categories: ERP / CRM, WEB 2.0, Internet of things
Data types: Social Sentiment, Wikis / Blogs, Sensors / RFID / Devices, Click Stream, Audio / Video, Log Files, Mobile, Advertising, eCommerce, Collaboration, Spatial & GPS Coordinates, Digital Marketing, Data Market Feeds, Search Marketing, eGov Feeds, Contacts, Payables, Web Logs, Weather, Deal Tracking, Payroll, Sales Pipeline, Inventory, Recommendations, Text/Image
Storage cost per GB: 1980 $190,000; 1990 $9,000; 2000 $15; 2010 $0.07

Big Data, BIG OPPORTUNITY
Big Data is a top priority for institutions
49% of CEOs and CIOs are planning big data projects [1]
Software growth: 34% compound annual growth rate [2]
Services growth: 39% compound annual growth rate [2]
[1] McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012
[2] IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012

Devices: Internet and Internet of things
Internet
6+ billion people, 1.5 billion use the net
US: 4.3 devices per adult
Laptops / tablets / smartphones
High-bandwidth access (cable: 10 Mbps+, fiber: 50-100 Mbps)
Billions of networked devices
Global addressing
User-centric
Communication-focus
Internet of things
Trillions of computer-enabled devices which are part of the IoT
Low-bandwidth last-mile connection (100 kbit/sec)
Trillions of networked nodes
Invisible devices
Mostly addressed by local schemes
Machine-centric
Sensing-focus

Big Data Scenarios

Short History of Hadoop
Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale
Hadoop started as a part of the Nutch project
In Jan 2006 Doug Cutting started working on Hadoop at Yahoo
Factored out of Nutch in Feb 2006
First release of Apache Hadoop in September 2007
In Jan 2008 Hadoop became a top-level Apache project

Hadoop Distributed Architecture
MapReduce Layer: Job tracker, Task tracker, Task tracker
HDFS Layer: Name node, Data node, Data node
Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png

MapReduce: Move Code to the Data
FIRST, STORE THE DATA
[Diagram: files stored across four servers]

So How Does It Work?
SECOND, TAKE THE PROCESSING TO THE DATA
// Map Reduce function in JavaScript
var map = function (key, value, context) {
    // Split the input line into words on any non-letter character.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            // Emit each word in lower case with a count of 1.
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    // Sum all counts emitted for this word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
[Diagram: the runtime sends the code to four servers]
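To try the word-count functions without a cluster, a small local stand-in for the runtime can help. The sketch below is an assumption-laden illustration: `runLocally` is a hypothetical helper, not part of HDInsight, and it only mimics the `context.write`, `hasNext` and `next` calls that the slide's code uses.

```javascript
// Word-count map and reduce functions, as on the slide.
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

// Hypothetical local stand-in for the Hadoop runtime (single process, in memory).
function runLocally(lines) {
    var groups = Object.create(null);   // key -> values emitted by map
    var mapContext = {
        write: function (k, v) {
            if (!groups[k]) { groups[k] = []; }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, offset) {
        map(offset, line, mapContext);
    });

    var output = Object.create(null);   // key -> value emitted by reduce
    var reduceContext = {
        write: function (k, v) { output[k] = v; }
    };
    Object.keys(groups).sort().forEach(function (k) {
        var position = 0;
        var iterator = {
            hasNext: function () { return position < groups[k].length; },
            next: function () { return groups[k][position++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return output;
}

var counts = runLocally(["Move code to the data", "Store the data first"]);
console.log(JSON.stringify(counts));
```

On a real cluster the map calls run on the nodes that hold the data and the grouping happens across the network; here everything runs in one process purely to show the data flow.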

Traditional RDBMS vs. NoSQL
Reference: Tom White’s Hadoop: The Definitive Guide

Windows Azure HDInsight Service

Creating an HDInsight Cluster
Demo

HDINSIGHT / HADOOP Eco-System
Legend: Red = Core Hadoop; Blue = Data processing; Purple = Microsoft integration points and value adds; Orange = Data Movement; Green = Packages
JavaScript
C#, F#, .NET
Data Integration (ODBC / SQOOP / REST)
Relational (SQL Server)
Stats processing (RHadoop)
Machine Learning (Mahout)
Pipeline / workflow (Oozie)
Graph (Pegasus)
PDW Polybase
Metadata (HCatalog)
Event Driven Processing
Query (Hive)
Scripting (Pig)
NoSQL Database (HBase)
Event Pipeline (Flume)
Distributed Processing (MapReduce)
Business Intelligence (Excel, Power View, SSAS)
Distributed Storage (HDFS)
Active Directory (Security)
Monitoring & Deployment (System Center)
World's Data (Azure Data Marketplace)
Azure Storage Vault (ASV)

Storing Data with HDInsight

HDFS on Azure: Tale of two File Systems
HDFS API
DFS (1 Data Node per Worker Role) and Compute Cluster: Name Node, Data Node, Data Node, …
Azure Storage (ASV), Azure Blob Storage: Front end (x3), Partition Layer, Stream Layer

Azure Storage (ASV)
Default file system for HDInsight Service
Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
Azure storage itself does not provide compute
Fast access from compute nodes to data in the same data center
Several file systems, addressable via:
asv[s]://<container>@<account>.blob.core.windows.net/<path>
Requires storage key in core-site.xml:
<property>
  <name>fs.azure.account.key.accountname</name>
  <value>enterthekeyvaluehere</value>
</property>
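As a small illustration of the addressing scheme, the sketch below assembles an ASV URI from its parts, assuming the usual `scheme://authority/path` form. `buildAsvUri` is a hypothetical helper, and the container, account and file names are placeholders rather than real resources.

```javascript
// Hypothetical helper: build an asv:// or asvs:// URI for a blob path.
function buildAsvUri(container, account, path, secure) {
    var scheme = secure ? "asvs" : "asv";
    // Drop leading slashes so the path joins cleanly after the host name.
    var cleanPath = String(path || "").replace(/^\/+/, "");
    return scheme + "://" + container + "@" + account +
        ".blob.core.windows.net/" + cleanPath;
}

// Placeholder names, for illustration only.
var uri = buildAsvUri("data", "myaccount", "/AllSessions/part-0001.gz", false);
var secureUri = buildAsvUri("data", "myaccount", "AllSessions/part-0001.gz", true);
console.log(uri);
console.log(secureUri);
```

The `asvs` variant corresponds to the optional `[s]` in the pattern above.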

Map Reduce
Examples in C#

Map/Reduce
Map/Reduce is a programming model for efficient distributed computing
Input > Map > Shuffle & Sort > Reduce > Output
Efficiency comes from streaming through data, reducing seeks
A good fit for a lot of applications:
Log processing
Web index building
Data mining and machine learning
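The stages can be made explicit with a minimal in-memory sketch, here applied to log processing. Everything in it is assumed for illustration: the function names, the three sample log lines and their `METHOD PATH STATUS` layout are hypothetical, and a real framework would run each stage across many machines.

```javascript
// Map: turn each input record into zero or more [key, value] pairs.
function mapPhase(records, mapper) {
    var pairs = [];
    records.forEach(function (record) {
        mapper(record, function (key, value) { pairs.push([key, value]); });
    });
    return pairs;
}

// Shuffle & Sort: group values by key, then order the groups by key.
function shuffleAndSort(pairs) {
    var groups = Object.create(null);
    pairs.forEach(function (pair) {
        if (!groups[pair[0]]) { groups[pair[0]] = []; }
        groups[pair[0]].push(pair[1]);
    });
    return Object.keys(groups).sort().map(function (key) {
        return [key, groups[key]];
    });
}

// Reduce: collapse each group of values into a single result per key.
function reducePhase(grouped, reducer) {
    return grouped.map(function (group) {
        return [group[0], reducer(group[0], group[1])];
    });
}

// Hypothetical log lines in "METHOD PATH STATUS" form.
var logLines = [
    "GET /index.html 200",
    "GET /missing 404",
    "GET /about.html 200"
];

var pairs = mapPhase(logLines, function (line, emit) {
    emit(line.split(" ")[2], 1);            // key = status code
});
var grouped = shuffleAndSort(pairs);
var hitsPerStatus = reducePhase(grouped, function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
});
console.log(JSON.stringify(hitsPerStatus));
```

Keeping the stages as separate functions shows that only the mapper and reducer are application-specific; the shuffle and sort step is the same for every job.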

Hadoop SDK
C# integration
Remote Data & Jobs
Hive in C#
Serialization

http://hadoopsdk.codeplex.com

Jobs
public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Read every gzipped session file and write results to /FrenchSessions/.
        var config = new HadoopJobConfiguration()
        {
            InputPath = "\"/AllSessions/*.gz\"",
            OutputFolder = "/FrenchSessions/"
        };
        return config;
    }
}

Mapper
public class FrenchSessionsMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Emit one ("FR", "1") pair for each session line recorded in France.
        if (inputLine.Contains("Country=France"))
        {
            context.IncrementCounter("FrenchSession");
            context.EmitKeyValue("FR", "1");
        }
    }
}

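Reducer
using System.Collections.Generic;
using System.Linq; // needed for Count()

public class SessionsReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
    {
        // Emit the number of values seen for this key, as a string like the mapper's "1".
        context.EmitKeyValue(key, values.Count().ToString());
    }
}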

Navigating the HDInsight portal
Demo

C# and Map/Reduce
Demo

https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf

Questions?
