
“Introducing Hadoop on Azure: hello Map-Reduce!”



Presentation Transcript


1. “Introducing Hadoop on Azure: hello Map-Reduce!”
Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html
Email: joe@joehummel.net

2. Agenda
• A little history…
• Why Hadoop?
• How it works
• Demos
• Summary

3. A little history…
• Map-Reduce is from functional programming:

    // function returns 1 if i is prime, 0 if not:
    let isPrime(i) = ...

    // sums 2 numbers:
    let sum(x, y) = return x + y

    // count the number of primes in 1..N:
    let countPrimes(N) =
        let L = [ 1 .. N ]            // [ 1, 2, 3, 4, 5, 6, ... ]
        let T = map isPrime L         // [ 0, 1, 1, 0, 1, 0, ... ]
        let count = reduce sum T      // 42
        return count
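For readers who want to run the idea, the same pattern can be expressed with JavaScript's built-in Array.map / Array.reduce. This is an illustrative sketch added here, not part of the original deck and not Hadoop code:

    // Runnable JavaScript sketch of the same map/reduce pattern:
    var isPrime = function (i) {              // returns 1 if i is prime, 0 if not
      if (i < 2) return 0;
      for (var d = 2; d * d <= i; d++)
        if (i % d === 0) return 0;
      return 1;
    };

    var sum = function (x, y) { return x + y; };   // sums 2 numbers

    var countPrimes = function (N) {          // count the number of primes in 1..N
      var L = [];
      for (var i = 1; i <= N; i++) L.push(i); // [ 1, 2, 3, ..., N ]
      var T = L.map(isPrime);                 // [ 0, 1, 1, 0, 1, 0, ... ]
      return T.reduce(sum, 0);                // e.g. countPrimes(10) === 4
    };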

4. A little more history…
• Created by Google to drive internet search
• BIG data ― scalable to TBs and beyond
• Parallelism: to get the performance
• Data partitioning: to drive the parallelism
• Fault tolerance: at this scale, machines are going to crash, a lot…
[Slide graphic: BIG data of page hits]

5. Who’s using Hadoop
• Search engines: Google, Yahoo, Bing
• Facebook
• Twitter
• Financials
• Health industry
• Insurance
• Credit card companies
• Just about any company collecting user data…

6. Hadoop today
• Freely-available framework for big data
• http://hadoop.apache.org/
• Based on the concept of Map-Reduce
[Slide diagram: BIG data fanned out across many Map tasks; the map function's intermediate results feed a Reduce step that produces the result R]

7. Massively-parallel
[Slide diagram: many Mapper tasks feeding a much smaller number of Reducer tasks]

8. Workflow
• Data → Map: [ <key1,value>, <key4,value>, <key2,value>, … ]
• Sort: [ <key1,value>, <key1,value>, … ]
• Merge: [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
• Reduce → R: [ <key1, value>, <key2, value>, … ]
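To make the four phases concrete, here is a small self-contained JavaScript simulation of the pipeline. It is illustrative only: the names mapPhase, mergePhase, and reducePhase are invented for this sketch and are not Hadoop APIs.

    // "map" phase: each input record emits zero or more <key,value> pairs:
    var mapPhase = function (records, mapFn) {
      var pairs = [];
      records.forEach(function (rec) {
        mapFn(rec, function (key, value) { pairs.push([key, value]); });
      });
      return pairs;                       // [ [key1,value], [key4,value], ... ]
    };

    // "sort + merge" phase: group all emitted values by key:
    var mergePhase = function (pairs) {
      var groups = {};
      pairs.forEach(function (p) {
        (groups[p[0]] = groups[p[0]] || []).push(p[1]);
      });
      return groups;                      // { key1: [value,value,...], key2: [...] }
    };

    // "reduce" phase: collapse each key's value list into a single result:
    var reducePhase = function (groups, reduceFn) {
      var results = {};
      Object.keys(groups).forEach(function (key) {
        results[key] = reduceFn(key, groups[key]);
      });
      return results;                     // { key1: value, key2: value, ... }
    };

    // Example: word count over one line of text:
    var counts = reducePhase(
      mergePhase(
        mapPhase([ "to be or not to be" ],
          function (line, emit) { line.split(" ").forEach(function (w) { emit(w, 1); }); })),
      function (word, ones) { return ones.length; });
    // counts: { to: 2, be: 2, or: 1, not: 1 }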

9. Example: average rating…
• Netflix data-mining…
• Input: Netflix Movie Reviews (.txt), fed to the Netflix Data Mining App
• Record format: movieid,userid,rating,date
    1,2390087,3,2005-09-06
    217,5567801,5,2006-01-03
    42,1121098,3,2006-03-25
    1,8972234,5,2003-12-02
    . . .

10. Netflix Workflow
• Data → Map: [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
• Sort: [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
• Merge: [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
• Reduce → R: [ <1, 4>, <42, 2>, <134, ?>, … ]

11. Netflix map / reduce functions?
• To compute the average rating for every movie:

    // JavaScript version:
    var map = function (key, value, context)
    {
      var values = value.split(",");
      // field 0 contains movieid, field 2 the rating:
      context.write(values[0], values[2]);
    };

    var reduce = function (key, values, context)
    {
      var sum = 0;
      var count = 0;
      while (values.hasNext())
      {
        count++;
        sum += parseInt(values.next());
      }
      context.write(key, sum/count);
    };
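As a sanity check (not part of the original deck), the two functions above can be exercised locally by faking the context object and the values iterator that the Hadoop on Azure JavaScript runtime passes in; the harness below assumes only the write() and hasNext()/next() calls shown on the slide.

    // Hypothetical local smoke test for the map()/reduce() functions above:
    var emitted = [];
    var mapContext = { write: function (k, v) { emitted.push([k, v]); } };

    // feed a few of the sample Netflix lines through map():
    [ "1,2390087,3,2005-09-06",
      "217,5567801,5,2006-01-03",
      "1,8972234,5,2003-12-02" ].forEach(function (line) {
      map(null, line, mapContext);
    });
    // emitted is now [ ["1","3"], ["217","5"], ["1","5"] ]

    // group the intermediate pairs by movie id (what sort/merge does):
    var groups = {};
    emitted.forEach(function (p) { (groups[p[0]] = groups[p[0]] || []).push(p[1]); });

    // run reduce() once per key, with a fake hasNext()/next() iterator:
    var results = {};
    Object.keys(groups).forEach(function (key) {
      var i = 0, vals = groups[key];
      var iter = { hasNext: function () { return i < vals.length; },
                   next:    function () { return vals[i++]; } };
      reduce(key, iter, { write: function (k, v) { results[k] = v; } });
    });
    // results: { "1": 4, "217": 5 }   -- average rating per movie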

12. Traditional use of Hadoop
• Upload data to HDFS (the Hadoop file system)
• Write map / reduce functions
  • default is to use Java
  • most languages supported: C, C++, C#, JavaScript, Python, …
• Compile and upload code
  • for Java, you upload a .jar file
  • for others, an .exe or script
• Submit MapReduce job
• Wait for job to complete

13. When to use Hadoop?
• Queries against big datasets
• Embarrassingly-parallel problems
• Solution must fit into the map-reduce framework
• Non-real-time demands
• Hadoop is not for:
  • small datasets (< 1GB?)
  • sub-second / real-time needs (though clearly Google makes it work)

14. Data set for demo
• We’ll be working with Chicago crime data (1 GB, 5M rows)…
• https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
• http://www.cityofchicago.org/city/en/narr/foia/CityData.html

15. Goal?
• Compute the top-10 crimes…

    IUCR    Count
    0486    366903
    0820    308074
    . . .
    0890    166916

• IUCR = Illinois Uniform Crime Codes
• https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

16. Demo
• Hadoop on Azure…
• Supports traditional Hadoop usage:
  • upload data
  • write MapReduce program
  • submit job
• Additional features:
  • allows access to persistent data from Azure Storage Vault (ASV)
  • provides an interactive JavaScript console
  • built-in higher-level query languages (PIG, HIVE)

17. Demo: map / reduce functions
• Count crimes per IUCR code:

    // JavaScript version:
    var map = function (key, value, context)
    {
      var values = value.split(",");
      // field 4 contains the IUCR code:
      context.write(values[4], 1);
    };

    var reduce = function (key, values, context)
    {
      var sum = 0;
      while (values.hasNext())
      {
        sum += parseInt(values.next());
      }
      context.write(key, sum);
    };

    // sample output:
    // 0486  366903
    // 0820  308074
    // . . .
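The map/reduce above yields one <IUCR, count> pair per crime code; the actual top-10 selection is done by the PIG pipeline on the next slide (orderBy + take). For intuition only, the equivalent post-processing in plain JavaScript might look like the sketch below, where the results object of the form { iucr: count } is an assumption of this sketch, not something the deck provides.

    // hypothetical: results = { "0486": 366903, "0820": 308074, ... }
    var top10 = Object.keys(results)
      .map(function (iucr) { return [iucr, results[iucr]]; })
      .sort(function (a, b) { return b[1] - a[1]; })   // descending by count
      .slice(0, 10);                                   // keep the ten largest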

18. Demo: PIG command

    // interactive PIG with explicit Map-Reduce functions:
    pig.from("asv://datafiles/CC-from-2001.txt").
        mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")

    // visualize the results:
    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)

19. Hadoop on Azure
• Microsoft is offering free access to Hadoop
  • request an invitation @ http://www.hadooponazure.com/
• Hadoop connector for Excel
  • process data using Hadoop, analyze/visualize using Excel

20. That’s it!

21. Summary
• Hadoop is all about big data processing
  • scalable, parallel, fault-tolerant
• Easy-to-understand programming model: Map-Reduce
  • but then the solution must fit into this framework…
• Rich ecosystem developing around Hadoop
  • technologies: PIG, HIVE, HBase, …
  • companies: Cloudera, Hortonworks, MapR, …

22. Thank you for attending
• Presenter: Joe Hummel
• Email: joe@joehummel.net
• Materials: http://www.joehummel.net/downloads.html
• For more info:
  • http://www.hadooponazure.com/
  • http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
  • overview, including how to access via .NET API: http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/
