Introduction to Hadoop on Azure: Map-Reduce Demos

Introducing Hadoop on Azure: “hello Map-Reduce!”
Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html
Email: joe@joehummel.net

Agenda
• A little history…
• Why Hadoop?
• How it works
• Demos
• Summary
Hadoop on Azure

A little history…
• Map-Reduce is from functional programming

    // function returns 1 if i is prime, 0 if not:
    let isPrime(i) = ...

    // sums 2 numbers:
    let sum(x, y) = return x + y

    // count the number of primes in 1..N:
    let countPrimes(N) =
        let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
        let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
        let count = reduce sum T    // 42
        return count
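The pseudocode above translates almost directly into JavaScript, whose arrays already provide map and reduce. This is a minimal sketch: the trial-division body of isPrime is an assumption (the slide elides it), and the function names simply mirror the slide.

```javascript
// Returns 1 if i is prime, 0 if not (simple trial division; assumed body).
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the number of primes in 1..N using map and reduce.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map(isPrime);                             // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);                              // add up the 1s
}

console.log(countPrimes(100)); // 25
```

Each call to isPrime is independent of the others, which is the property that lets the map step be spread across many machines.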

A little more history…
• Created by Google to drive internet search
• BIG data: scalable to TBs and beyond
• Parallelism: to get the performance
• Data partitioning: to drive the parallelism
• Fault tolerance: at this scale, machines are going to crash, a lot…
[Figure: BIG Data → page hits]

Who’s using Hadoop
• Search engines: Google, Yahoo, Bing
• Facebook
• Twitter
• Financials
• Health industry
• Insurance
• Credit card companies
• Just about any company collecting user data…

Hadoop today
• Freely-available framework for big data
• http://hadoop.apache.org/
• Based on the concept of Map-Reduce: map a function over the data, then reduce the intermediate results
[Figure: BIG data → Map, Map, Map, Map, … → Reduce → R]

Massively-parallel
[Figure: a grid of Mapper nodes feeding a smaller number of Reducer nodes]

Workflow
• Data → Map, Map, Map: [ <key1,value>, <key4,value>, <key2,value>, … ]
• Sort, Sort, Sort: [ <key1,value>, <key1,value>, … ]
• Merge: [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
• Reduce → R: [ <key1, value>, <key2, value>, … ]
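The four stages of the workflow can be imitated in memory to see how the data changes shape at each step. This is only a sketch under stated assumptions: runMapReduce and its emit callback are hypothetical names, not part of Hadoop, and a real cluster runs the map and sort stages on many machines at once.

```javascript
// Hypothetical local runner: map -> sort -> merge -> reduce, all in memory.
function runMapReduce(records, mapFn, reduceFn) {
  // Map: each record may emit zero or more [key, value] pairs.
  const pairs = [];
  records.forEach((record) => mapFn(record, (k, v) => pairs.push([k, v])));

  // Sort: order the pairs by key so equal keys become adjacent.
  pairs.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));

  // Merge: collapse adjacent pairs with the same key into [key, [values...]].
  const groups = [];
  for (const [k, v] of pairs) {
    const last = groups[groups.length - 1];
    if (last && last[0] === k) last[1].push(v);
    else groups.push([k, [v]]);
  }

  // Reduce: turn each [key, [values...]] into a single [key, value].
  return groups.map(([k, vs]) => [k, reduceFn(k, vs)]);
}

// Usage: count how often each key occurs (illustrative input).
const records = ["key1", "key4", "key2", "key1"];
const result = runMapReduce(
  records,
  (record, emit) => emit(record, 1),
  (key, values) => values.reduce((s, v) => s + v, 0)
);
console.log(result); // key1 -> 2, key2 -> 1, key4 -> 1
```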

Example: Average rating…
• Netflix data-mining…
• Netflix Movie Reviews (.txt) → Netflix Data Mining App

    movieid,userid,rating,date
    1,2390087,3,2005-09-06
    217,5567801,5,2006-01-03
    42,1121098,3,2006-03-25
    1,8972234,5,2003-12-02
    . . .

Netflix Workflow
• Data → Map, Map, Map: [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
• Sort, Sort, Sort: [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
• Merge: [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
• Reduce → R: [ <1, 4>, <42, 2>, <134, ?>, … ]

Netflix map / reduce functions?
• To compute average rating for every movie:

    // JavaScript version:
    var map = function (key, value, context)
    {
        var values = value.split(",");
        // field 0 contains movieid, field 2 the rating:
        context.write(values[0], values[2]);
    };

    var reduce = function (key, values, context)
    {
        var sum = 0;
        var count = 0;
        while (values.hasNext())
        {
            count++;
            sum += parseInt(values.next(), 10);
        }
        context.write(key, sum / count);
    };
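To try these two functions without a cluster, they can be driven by small stand-ins for the objects Hadoop passes in. The sketch below is an assumption-laden harness: makeContext and makeIterator are hypothetical helpers that only imitate the context.write, hasNext, and next calls seen on the slide, and the row index stands in for the key that Hadoop would supply to map.

```javascript
// map and reduce as on the slide.
var map = function (key, value, context) {
  var values = value.split(",");
  context.write(values[0], values[2]); // movieid, rating
};

var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// Hypothetical stand-in for the context: collects written pairs.
function makeContext() {
  var pairs = [];
  return { pairs: pairs, write: function (k, v) { pairs.push([k, v]); } };
}

// Hypothetical stand-in for the values iterator.
function makeIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; }
  };
}

// Sample rows from the slide: movieid,userid,rating,date
var rows = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02"
];

// Map phase: the row index is a placeholder key.
var mapped = makeContext();
rows.forEach(function (row, i) { map(i, row, mapped); });

// Sort + merge phase: group the ratings by movie id.
var groups = {};
mapped.pairs.forEach(function (p) {
  (groups[p[0]] = groups[p[0]] || []).push(p[1]);
});

// Reduce phase: one call per movie id.
var reduced = makeContext();
Object.keys(groups).forEach(function (k) {
  reduce(k, makeIterator(groups[k]), reduced);
});

console.log(reduced.pairs); // movie 1 -> 4, movie 42 -> 3, movie 217 -> 5
```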

Traditional use of Hadoop
• Upload data to HDFS
    • Hadoop Distributed File System
• Write map / reduce functions
    • Default is to use Java
    • Most languages supported: C, C++, C#, JavaScript, Python, …
• Compile and upload code
    • For Java, you upload a .jar file
    • For others, an .exe or script
• Submit MapReduce job
• Wait for job to complete

When to use Hadoop?
• Queries against big datasets
• Embarrassingly-parallel problems
• Solution must fit into map-reduce framework
• Non-real-time demands
• Hadoop is not for:
    • Small datasets (< 1GB?)
    • Sub-second / real-time needs (though clearly Google makes it work)

Data set for demo
• We’ll be working with Chicago crime data…
• https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
• http://www.cityofchicago.org/city/en/narr/foia/CityData.html
• Size: 1 GB, 5M rows

Goal?
• Compute top-10 crimes…

    IUCR    Count
    0486    366903
    0820    308074
    . . .
    0890    166916

• IUCR = Illinois Uniform Crime Reporting codes
• https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

Demo
• Hadoop on Azure…
• Supports traditional Hadoop usage
    • Upload data
    • Write MapReduce program
    • Submit job
• Additional features:
    • Allows access to persistent data from Azure Storage Vault
    • Provides interactive JavaScript console
    • Built-in higher-level query languages (PIG, HIVE)

Demo: map / reduce functions

    // JavaScript version:
    var map = function (key, value, context)
    {
        var values = value.split(",");
        // field 4 contains the IUCR code; emit a count of 1 for it:
        context.write(values[4], 1);
    };

    var reduce = function (key, values, context)
    {
        var sum = 0;
        while (values.hasNext())
        {
            sum += parseInt(values.next(), 10);
        }
        context.write(key, sum);
    };

Output:

    0486    366903
    0820    308074
    . . .

Demo: PIG command

    // interactive PIG with explicit Map-Reduce functions:
    pig.from("asv://datafiles/CC-from-2001.txt").
        mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")

    // visualize the results:
    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)
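What the count, orderBy, and take steps compute can be checked on a handful of rows in plain JavaScript. This is a sketch, not the Hadoop on Azure console API: topCrimes is a hypothetical helper, and the sample rows are made up, with only field 4 (the IUCR code) carrying meaning.

```javascript
// Hypothetical helper: count rows per IUCR code, order by count descending,
// and keep the first n results.
function topCrimes(rows, n) {
  const counts = new Map();
  for (const row of rows) {
    const iucr = row.split(",")[4]; // field 4 = IUCR, as in the demo's map function
    counts.set(iucr, (counts.get(iucr) || 0) + 1);
  }
  return Array.from(counts, ([IUCR, Count]) => ({ IUCR, Count }))
    .sort((a, b) => b.Count - a.Count) // same idea as orderBy("Count DESC")
    .slice(0, n);                      // same idea as take(n)
}

// Made-up sample rows; fields 0..3 are placeholders.
const sample = [
  "a,b,c,d,0486",
  "a,b,c,d,0820",
  "a,b,c,d,0486",
  "a,b,c,d,0890",
  "a,b,c,d,0820",
  "a,b,c,d,0486"
];

console.log(topCrimes(sample, 2)); // 0486 with count 3, then 0820 with count 2
```

Splitting on every comma mirrors the demo code; it would misread rows whose fields contain quoted commas.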

Hadoop on Azure
• Microsoft is offering free access to Hadoop
    • Request invitation @ http://www.hadooponazure.com/
• Hadoop connector for Excel
    • Process data using Hadoop, analyze/visualize using Excel

That’s it!

Summary
• Hadoop is all about big data processing
    • Scalable, parallel, fault-tolerant
• Easy to understand programming model
    • Map-Reduce
    • But then solution must fit into this framework…
• Rich ecosystem developing around Hadoop
    • Technologies: PIG, HIVE, HBase, …
    • Companies: Cloudera, Hortonworks, MapR, …

Thank you for attending
• Presenter: Joe Hummel
• Email: joe@joehummel.net
• Materials: http://www.joehummel.net/downloads.html
• For more info:
    • http://www.hadooponazure.com/
    • http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
• Overview, including how to access via .NET API:
    • http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/
