HDInsight on Azure and Map-Reduce
HDInsight on Azure and Map-Reduce. Richard Conway Windows Azure MVP Elastacloud Limited. Agenda. Introduction Big Data with HDInsight. Introduction. Solving problems through distribution.
Presentation Transcript
-
HDInsight on Azure and Map-Reduce
Richard Conway, Windows Azure MVP, Elastacloud Limited
- Agenda: Introduction; Big Data with HDInsight
- Introduction
- Solving problems through distribution: Some challenges become bound by hardware capacity; 24 hours on 1 machine can be 1 hour on 24 machines. These 24 machines require orchestration: jobs are divided into tasks, and tasks are distributed across a cluster. Software systems are required to facilitate the distribution; examples are Hadoop and HPC Server. We will now provision a Hadoop cluster on Windows Azure.
- Big Data vs Big Compute
- Hadoop, HPC Server, Open MPI
-
All distributed compute works on the basis of taking a large JOB and breaking it into many smaller TASKS, which are then parallelised
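The job-to-tasks idea can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: `splitIntoTasks`, `runTask` and `runJob` are hypothetical names, the "job" is summing a list of numbers, and the tasks run one after another in a single process, whereas a real cluster would hand each task to a different worker node.

```javascript
// Hypothetical helper: divide one large job (an array of work items)
// into at most taskCount smaller tasks of roughly equal size.
function splitIntoTasks(items, taskCount) {
  var size = Math.ceil(items.length / taskCount);
  var tasks = [];
  for (var start = 0; start < items.length; start += size) {
    tasks.push(items.slice(start, start + size));
  }
  return tasks;
}

// Hypothetical task body: each task sums only its own chunk.
// On a cluster this is the part that would run on a worker node.
function runTask(chunk) {
  return chunk.reduce(function (sum, n) { return sum + n; }, 0);
}

// Run every task (sequentially here) and combine the partial results.
function runJob(items, taskCount) {
  var partials = splitIntoTasks(items, taskCount).map(runTask);
  return partials.reduce(function (sum, n) { return sum + n; }, 0);
}
```

Calling `runJob([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)` splits the ten items into three tasks and combines their partial sums into the same total a single machine would compute.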
- HPC: Head Node, Broker Node, Worker Nodes. Hadoop: Name Node, Name Node, Data Nodes.
- Understanding Big Data
- Key trends
  Device explosion: >5.5 billion (70+% of global population)
  Social networks: >2 billion users
  Cheap storage: $100 gets you 3 million times more storage in 30 years
  Ubiquitous connection: web traffic 2010: 130 exabytes (10^18); 2015: 1.6 zettabytes (10^21)
  Sensor networks: >10 billion
  Inexpensive computing: 1980: 10 MIPS/$; 2005: 10M MIPS/$
- What is Big Data? Volume, velocity, variety, variability.
  Volume scale: gigabytes (10^9), terabytes (10^12), petabytes (10^15), exabytes (10^18)
  Eras: ERP / CRM, Web 2.0, Internet of things
  Data sources shown: Social Sentiment, Wikis / Blogs, Sensors / RFID / Devices, Click Stream, Audio / Video, Log Files, Mobile, Advertising, eCommerce, Collaboration, Spatial & GPS Coordinates, Digital Marketing, Data Market Feeds, Search Marketing, eGov Feeds, Contacts, Payables, Web Logs, Weather, Deal Tracking, Payroll, Sales Pipeline, Inventory, Recommendations, Text/Image
  Storage cost per GB: 1980: $190,000; 1990: $9,000; 2000: $15; 2010: $0.07
- Big Data, big opportunity: Big Data is a top priority for institutions.
  Software growth: 34% compound annual growth rate [2]
  Services growth: 39% compound annual growth rate [2]
  49% of CEOs and CIOs are planning big data projects
  Sources: McKinsey & Company, McKinsey Global Survey Results, Minding Your Digital Business, 2012; IDC Market Analysis, Worldwide Big Data Technology and Services 2012–2015 Forecast, 2012
- Devices: Internet and Internet of things
  Internet: 6+ billion people, 1.5 billion use the net; US: 4.3 devices per adult; laptops / tablets / smartphones; high-bandwidth access (cable: 10 Mb/s+, fiber: 50-100 Mb/s); billions of networked devices; global addressing; user-centric; communication focus
  Internet of things: trillions of computer-enabled devices that are part of the IoT; invisible devices; low-bandwidth last-mile connection (100 kbit/sec); trillions of networked nodes; mostly addressed by local schemes; machine-centric; sensing focus
- Big Data Scenarios
- Short History of Hadoop: Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale. Hadoop started as a part of the Nutch project. In Jan 2006 Doug Cutting started working on Hadoop at Yahoo. Factored out of Nutch in Feb 2006. First release of Apache Hadoop in September 2007. In Jan 2008 Hadoop became a top-level Apache project.
- Hadoop Distributed Architecture. MapReduce layer: job tracker, task trackers. HDFS layer: name node, data nodes. Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
- MapReduce: Move Code to the Data. First, store the data (diagram: files spread across four servers).
- So How Does It Work? Second, take the processing to the data (diagram: the runtime sends the code to each of four servers).
    // Map Reduce function in JavaScript
    // map: split each input line into words and emit (word, 1) for every non-empty word.
    var map = function (key, value, context) {
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "")
                context.write(words[i].toLowerCase(), 1);
        }
    };
    // reduce: add up the counts emitted for one word.
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    };
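To see what a runtime has to do around those two functions, the following is a minimal single-process sketch. The `map` and `reduce` bodies follow the slide; everything else is an assumption made for illustration: `runLocally` is a hypothetical driver, and the `context` and iterator objects only imitate the `write`, `hasNext` and `next` calls that the slide's code uses. It is not the HDInsight JavaScript runtime.

```javascript
// Mapper and reducer in the shape used on the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") context.write(words[i].toLowerCase(), 1);
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Hypothetical local driver: map every line, group values by key
// (the shuffle), visit keys in sorted order, then reduce each group.
function runLocally(lines) {
  var groups = new Map();
  var mapContext = {
    write: function (k, v) {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    }
  };
  lines.forEach(function (line, offset) {
    map(offset, line, mapContext);
  });

  var output = {};
  var reduceContext = {
    write: function (k, v) { output[k] = v; }
  };
  Array.from(groups.keys()).sort().forEach(function (k) {
    var vals = groups.get(k);
    var i = 0;
    // Minimal iterator exposing only what the reducer calls.
    var iterator = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(k, iterator, reduceContext);
  });
  return output;
}
```

For example, `runLocally(["The quick fox", "the lazy dog, the end"])` returns an object of word counts in which `the` maps to 3. On a cluster the map calls run next to the stored data on each server and only the grouped key/value pairs move across the network.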
- Traditional RDBMS vs. NoSQL Reference: Tom White’s Hadoop: The Definitive Guide
- Windows Azure HDInsight Service
- Creating an HDInsight Cluster (demo)
- HDInsight / Hadoop Eco-System
  Legend: Red = core Hadoop; Blue = data processing; Purple = Microsoft integration points and value adds; Orange = data movement; Green = packages
  Components: JavaScript; C#, F#, .NET; Data Integration (ODBC / Sqoop / REST); Relational (SQL Server); Stats processing (RHadoop); Machine Learning (Mahout); Pipeline / workflow (Oozie); Graph (Pegasus); PDW Polybase; Metadata (HCatalog); Event Driven Processing; Query (Hive); Scripting (Pig); NoSQL Database (HBase); Event Pipeline (Flume); Distributed Processing (MapReduce); Business Intelligence (Excel, Power View, SSAS); Distributed Storage (HDFS); Active Directory (Security); Monitoring & Deployment (System Center); World's Data (Azure Data Marketplace); Azure Storage Vault (ASV)
-
Storing Data with HDInsight
- HDFS on Azure: Tale of two File Systems. Both sit behind the HDFS API.
  DFS on the compute cluster: name node and data nodes (1 data node per worker role)
  Azure Storage (ASV), backed by Azure Blob Storage: front ends, partition layer, stream layer
- Azure Storage (ASV): the default file system for the HDInsight Service
  Provides sharable, persistent, highly scalable storage with high availability (Azure Blob Store)
  Azure storage itself does not provide compute
  Fast access from compute nodes to data in the same data center
  Several file systems, addressable via: asv[s]://<container>@<account>.blob.core.windows.net/<path>
  Requires the storage key in core-site.xml:
    <property>
      <name>fs.azure.account.key.accountname</name>
      <value>enterthekeyvaluehere</value>
    </property>
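The addressing scheme and the key property can be assembled mechanically. The sketch below is a hedged illustration only: `asvUri` and `accountKeyProperty` are hypothetical helper names, the container, account and key values are placeholders, and the `scheme://` separator is assumed to follow the usual URI form.

```javascript
// Hypothetical helper: build an ASV address for a blob path.
// secure = true selects the asvs scheme, otherwise asv is used.
function asvUri(container, account, path, secure) {
  var scheme = secure ? "asvs" : "asv";
  // Drop leading slashes so the host and path join with a single "/".
  var cleanPath = String(path || "").replace(/^\/+/, "");
  return scheme + "://" + container + "@" + account +
    ".blob.core.windows.net/" + cleanPath;
}

// Hypothetical helper: render the core-site.xml property from the slide,
// with the account name substituted into the property name.
function accountKeyProperty(account, key) {
  return "<property>\n" +
    "  <name>fs.azure.account.key." + account + "</name>\n" +
    "  <value>" + key + "</value>\n" +
    "</property>";
}
```

With the placeholder values `asvUri("data", "myaccount", "/AllSessions/a.gz", false)` the helper produces an address under `myaccount.blob.core.windows.net`, and passing `true` as the last argument switches the scheme to `asvs`.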
-
Map Reduce Examples in C#
- Map/Reduce: a programming model for efficient distributed computing
  Pipeline: Input > Map > Shuffle & Sort > Reduce > Output
  Efficiency comes from streaming through data, reducing seeks
  A good fit for a lot of applications: log processing, web index building, data mining and machine learning
- Hadoop SDK: C# integration; remote data & jobs; Hive in C#; serialization
- http://hadoopsdk.codeplex.com
- Jobs
    // Job definition: pairs the mapper and reducer and sets input and output locations.
    public class FrenchSessionsJob : HadoopJob<FrenchSessionsMapper, SessionsReducer>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            var config = new HadoopJobConfiguration()
            {
                InputPath = "\"/AllSessions/*.gz\"",
                OutputFolder = "/FrenchSessions/"
            };
            return config;
        }
    }
- Mapper
    // Emits ("FR", "1") for every input line that records a French session.
    public class FrenchSessionsMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            if (inputLine.Contains("Country=France"))
            {
                context.IncrementCounter("FrenchSession");
                context.EmitKeyValue("FR", "1");
            }
        }
    }
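- Reducer
    // Counts the values received for each key. Count() needs System.Linq;
    // the count is converted to a string because values are emitted as strings.
    public class SessionsReducer : ReducerCombinerBase
    {
        public override void Reduce(string key, IEnumerable<string> values, ReducerContext context)
        {
            context.EmitKeyValue(key, values.Count().ToString());
        }
    }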
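The C# classes above need a cluster and the Hadoop SDK to run, so the following JavaScript sketch mirrors only their logic in a single process. It is an illustration under stated assumptions: `frenchSessionsMapper`, `sessionsReducer` and `runFrenchSessions` are hypothetical names, the sample log lines are invented, and the `emit` callbacks stand in for the SDK's context objects.

```javascript
// Mirrors the mapper: emit ("FR", "1") for each line recording a French session.
function frenchSessionsMapper(inputLine, emit) {
  if (inputLine.indexOf("Country=France") !== -1) {
    emit("FR", "1");
  }
}

// Mirrors the reducer: emit the number of values seen for the key, as a string.
function sessionsReducer(key, values, emit) {
  emit(key, String(values.length));
}

// Hypothetical local driver: map all lines, group by key, reduce each group.
function runFrenchSessions(lines) {
  var grouped = new Map();
  lines.forEach(function (line) {
    frenchSessionsMapper(line, function (k, v) {
      if (!grouped.has(k)) grouped.set(k, []);
      grouped.get(k).push(v);
    });
  });

  var results = {};
  grouped.forEach(function (values, key) {
    sessionsReducer(key, values, function (k, v) {
      results[k] = v;
    });
  });
  return results;
}
```

Given three invented lines of which two contain `Country=France`, `runFrenchSessions` returns a single `FR` entry whose value is the string count of matching lines; input with no French sessions yields an empty object because the reducer is never called.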
- Navigating the HDInsight portal (demo)
- C# and Map/Reduce (demo)
- https://elastastorage.blob.core.windows.net/hdinsight/Map-Reduce HDInsight Lab.pdf
-
Questions?