“Introducing Hadoop on Azure: hello Map-Reduce!” Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials: http://www.joehummel.net/downloads.html Email: joe@joehummel.net
Agenda • A little history… • Why Hadoop? • How it works • Demos • Summary Hadoop on Azure
A little history…
• Map-Reduce is from functional programming
    // function returns 1 if i is prime, 0 if not:
    let isPrime(i) = ...
    // sums 2 numbers:
    let sum(x, y) = return x + y
    // count the number of primes in 1..N:
    let countPrimes(N) =
      let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
      let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
      let count = reduce sum T    // 42
      return count
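The functional pseudocode above translates almost directly to JavaScript's built-in `map` and `reduce`. This is a minimal sketch: the body of `isPrime` is elided on the slide, so the trial-division version here is an assumption, not the presenter's code.

```javascript
// Returns 1 if i is prime, 0 if not. Trial division is an assumed
// stand-in for the body the slide leaves out.
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the primes in 1..N: map every number to 0 or 1, then reduce by summing.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [ 1, 2, 3, ..., N ]
  const T = L.map(isPrime);                             // [ 0, 1, 1, 0, 1, 0, ... ]
  return T.reduce(sum, 0);
}

console.log(countPrimes(10)); // the primes up to 10 are 2, 3, 5, 7
```

Each call to `isPrime` is independent of the others, which is what lets a framework run the map step in parallel across machines.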
A little more history… • Created by Google to drive internet search • BIG data: scalable to TBs and beyond • Parallelism: to get the performance • Data partitioning: to drive the parallelism • Fault tolerance: at this scale, machines are going to crash, a lot… [Figure: BIG data, page hits]
Who’s using Hadoop • Search engines: Google, Yahoo, Bing • Facebook • Twitter • Financials • Health industry • Insurance • Credit card companies • Just about any company collecting user data…
Hadoop today • Freely-available framework for big data • http://hadoop.apache.org/ • Based on concept of Map-Reduce: map a function over the BIG data, then reduce the intermediate results [Diagram: BIG data → Map, Map, Map, Map, … → Reduce → R]
Massively-parallel [Diagram: many Mappers running in parallel, feeding a smaller number of Reducers]
Workflow • Data → Map, Map, Map: [ <key1,value>, <key4,value>, <key2,value>, … ] • Sort, Sort, Sort: [ <key1,value>, <key1,value>, … ] • Merge: [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ] • Reduce → R: [ <key1, value>, <key2, value>, … ]
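The workflow above can be imitated in memory to see what each stage produces. This is only a sketch under stated assumptions: `runMapReduce` is a hypothetical helper, not a Hadoop API, and it runs on one machine with none of Hadoop's partitioning or fault tolerance.

```javascript
// Runs map -> sort -> merge -> reduce over an in-memory array of records.
// mapFn(record) returns a list of [key, value] pairs;
// reduceFn(key, values) returns the single result value for that key.
function runMapReduce(records, mapFn, reduceFn) {
  // Map: every record yields zero or more <key, value> pairs.
  const pairs = records.flatMap(mapFn);

  // Sort: order the pairs by key so equal keys sit next to each other.
  pairs.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));

  // Merge: collapse runs of equal keys into <key, [value, value, ...]>.
  const merged = [];
  for (const [key, value] of pairs) {
    const last = merged[merged.length - 1];
    if (last && last[0] === key) last[1].push(value);
    else merged.push([key, [value]]);
  }

  // Reduce: one <key, value> result per key.
  return merged.map(([key, values]) => [key, reduceFn(key, values)]);
}

// Made-up records that already look like <key, value> pairs.
const records = [["key1", 10], ["key4", 7], ["key2", 1], ["key1", 5]];

const result = runMapReduce(
  records,
  (record) => [record],                              // identity map
  (key, values) => values.reduce((s, v) => s + v, 0) // sum per key
);

console.log(result); // [ [ 'key1', 15 ], [ 'key2', 1 ], [ 'key4', 7 ] ]
```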
Example: average rating…
• Netflix data-mining… [Diagram: Netflix Movie Reviews (.txt) → Netflix Data Mining App]
    movieid,userid,rating,date
    1,2390087,3,2005-09-06
    217,5567801,5,2006-01-03
    42,1121098,3,2006-03-25
    1,8972234,5,2003-12-02
    . . .
Netflix Workflow • Data → Map, Map, Map: [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ] • Sort, Sort, Sort: [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ] • Merge: [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ] • Reduce → R: [ <1, 4>, <42, 2>, <134, ?>, … ]
Netflix map / reduce functions?
• To compute average rating for every movie:
    // Javascript version:
    var map = function (key, value, context)
    {
      var values = value.split(",");
      // field 0 contains movieid, field 2 the rating:
      context.write(values[0], values[2]);
    };
    var reduce = function (key, values, context)
    {
      var sum = 0;
      var count = 0;
      while (values.hasNext())
      {
        count++;
        sum += parseInt(values.next());
      }
      context.write(key, sum/count);
    };
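To try these functions without a cluster, one option is a small local harness. The sketch below restates the slide's `map` and `reduce` so it runs on its own; `makeContext` and `makeIterator` are hypothetical stand-ins for the objects the framework passes in, assuming only what the slide shows: a `context.write(key, value)` method and a values iterator with `hasNext()` and `next()`.

```javascript
// The slide's map / reduce functions, restated so this block is self-contained.
function map(key, value, context) {
  const values = value.split(",");
  context.write(values[0], values[2]); // field 0: movieid, field 2: rating
}

function reduce(key, values, context) {
  let sum = 0;
  let count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next());
  }
  context.write(key, sum / count);
}

// Hypothetical stand-in for the framework's context: collects written pairs.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}

// Hypothetical stand-in for the framework's values iterator.
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Sample rows from the slide; the header line is left out on purpose,
// since parseInt("rating") would produce NaN.
const rows = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
];

// Map phase.
const mapped = makeContext();
rows.forEach((row, i) => map(i, row, mapped));

// Sort + merge phase, simplified to grouping ratings by movieid.
const groups = new Map();
for (const [movieId, rating] of mapped.pairs) {
  if (!groups.has(movieId)) groups.set(movieId, []);
  groups.get(movieId).push(rating);
}

// Reduce phase.
const reduced = makeContext();
for (const [movieId, ratings] of groups) {
  reduce(movieId, makeIterator(ratings), reduced);
}

console.log(reduced.pairs); // movie 1 averages (3 + 5) / 2 = 4
```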
Traditional use of Hadoop • Upload data to HDFS (the Hadoop Distributed File System) • Write map / reduce functions: the default is Java; most languages are supported: C, C++, C#, JavaScript, Python, … • Compile and upload code: for Java, you upload a .jar file; for others, an .exe or script • Submit MapReduce job • Wait for job to complete
When to use Hadoop? • Queries against big datasets • Embarrassingly-parallel problems • Solution must fit into map-reduce framework • Non-real-time demands • Hadoop is not for: • Small datasets (< 1GB?) • Sub-second / real-time needs (though clearly Google makes it work)
Data set for demo • We’ll be working with Chicago crime data… • https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 • http://www.cityofchicago.org/city/en/narr/foia/CityData.html • Size: 1 GB, 5M rows
Goal?
• Compute top-10 crimes…
    IUCR    Count
    0486    366903
    0820    308074
    . . .
    0890    166916
• IUCR = Illinois Uniform Crime Reporting codes: https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
Demo • Hadoop on Azure… • Supports traditional Hadoop usage • Upload data • Write MapReduce program • Submit job • Additional features: • Allows access to persistent data from Azure Storage Vault • Provides interactive JavaScript console • Built-in higher-level query languages (PIG, HIVE)
Demo: map / reduce functions
    // Javascript version:
    var map = function (key, value, context)
    {
      var values = value.split(",");
      // field 4 contains the IUCR code; emit a count of 1 for it:
      context.write(values[4], 1);
    };
    var reduce = function (key, values, context)
    {
      var sum = 0;
      while (values.hasNext())
      {
        sum += parseInt(values.next());
      }
      context.write(key, sum);
    };
  Output:
    0486    366903
    0820    308074
    . . .
Demo: PIG command
    // interactive PIG with explicit Map-Reduce functions:
    pig.from("asv://datafiles/CC-from-2001.txt").
        mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")
    // visualize the results:
    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)
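The pipeline in the PIG command above (count rows per IUCR code, order by count descending, take 10) can be mirrored in plain JavaScript to check the logic on a handful of rows. This is a sketch, not the console's implementation: `topCrimes` is a hypothetical helper, the sample rows are made up, and it assumes the IUCR code sits in field 4 of a comma-separated row, as in the demo's map function. A naive `split(",")` would also mis-handle quoted fields that contain commas.

```javascript
// Counts rows per IUCR code, sorts by count descending, keeps the top n.
function topCrimes(rows, n) {
  const counts = new Map();
  for (const row of rows) {
    const iucr = row.split(",")[4];                // map: emit <IUCR, 1>
    counts.set(iucr, (counts.get(iucr) || 0) + 1); // reduce: sum the 1s
  }
  return [...counts.entries()]
    .map(([IUCR, Count]) => ({ IUCR, Count }))
    .sort((a, b) => b.Count - a.Count)             // orderBy("Count DESC")
    .slice(0, n);                                  // take(n)
}

// Made-up rows: only field 4 (the IUCR code) matters here.
const sampleRows = [
  "id1,case1,date,block,0486",
  "id2,case2,date,block,0820",
  "id3,case3,date,block,0486",
  "id4,case4,date,block,0890",
  "id5,case5,date,block,0486",
  "id6,case6,date,block,0820",
];

console.log(topCrimes(sampleRows, 2));
// [ { IUCR: '0486', Count: 3 }, { IUCR: '0820', Count: 2 } ]
```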
Hadoop on Azure • Microsoft is offering free access to Hadoop • Request invitation @ http://www.hadooponazure.com/ • Hadoop connector for Excel • Process data using Hadoop, analyze/visualize using Excel
That’s it!
Summary • Hadoop is all about big data processing • Scalable, parallel, fault-tolerant • Easy to understand programming model • Map-Reduce • But then solution must fit into this framework… • Rich ecosystem developing around Hadoop • Technologies: PIG, HIVE, HBase, … • Companies: Cloudera, Hortonworks, MapR, …
Thank you for attending • Presenter: Joe Hummel • Email: joe@joehummel.net • Materials: http://www.joehummel.net/downloads.html • For more info: • http://www.hadooponazure.com/ • http://msdn.microsoft.com/en-us/magazine/jj190805.aspx • Overview, including how to access via .NET API: • http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/