
Project 1 : Who is Popular, and Who is Not.



Presentation Transcript


  1. Project 1 : Who is Popular, and Who is Not. Angel Trifonov, Anh Pham, Xiao Qin

  2. Tasks
     • Tasks b and c, each in both Pig and Java
     • Task h in Java

  3. Task b in Java
     Write a job(s) that reports, for each country, how many of its citizens have a Facebook page.
     • Single map-reduce job
     • Input: MyPage dataset
     • Mapper: examines each file line by line
       • Each line is converted to a string
       • The string is split using the “,” delimiter
       • The nationality is extracted as the key and mapped to an IntWritable (a count of 1)
     • Reducer: takes all pairs and sums the values for each key
     • Output: number of users per nationality
     • Single reducer
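
The mapper and reducer steps above can be sketched in plain Java, without the Hadoop classes, to make the data flow concrete. This is only an illustration under stated assumptions: the class name `NationalityCount`, the method names, and the position of the nationality column (third field) are hypothetical and do not come from the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Task b map and reduce steps (no Hadoop dependency).
class NationalityCount {
    // Assumption: nationality is the third comma-separated field of a MyPage record.
    static final int NATIONALITY_FIELD = 2;

    // Map step: split one line on "," and return the nationality as the key.
    // The value paired with every key is implicitly 1.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        return fields[NATIONALITY_FIELD].trim();
    }

    // Reduce step: sum the 1s emitted for each nationality.
    static Map<String, Integer> countByNationality(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapLine(line), 1, Integer::sum);
        }
        return counts;
    }
}
```

In the actual Hadoop job the key would be a `Text` and the value an `IntWritable`, and the grouping by key would be done by the framework between the map and reduce phases rather than by a single in-memory map.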

  4. Task b in Pig
     • Group the MyPage dataset by country code:
       countrygrp = group mypage by cc;
     • Report the number of people that have a Facebook page for each country:
       taskb = foreach countrygrp generate group, COUNT(mypage.id);
       dump taskb;
     Running time comparison (job time):
     • Plain MapReduce: 1 min 36 sec
     • Pig: 24 sec

  5. Task c in Java
     Find the top 10 interesting Facebook pages, namely, those that got the most accesses based on your AccessLog dataset compared to all other pages.
     • Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1))
     • Input: AccessLog
     • 1st round:
       • Mapper(s): parse the input data and get WhatPage. Set WhatPage as the key and the constant 1 as the value.
       • Reducer: for each key, sum the values. Set WhatPage as the key and the total count as the value.
     • 2nd round:
       • Swap the key and value (InverseMapper.class)
     • Output: [Count], [WhatPage] (in descending order)
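
The two rounds above can be sketched in plain Java as a count step followed by a swap-and-sort step. This is a minimal illustration, not the job from the slides: the class name `TopPages`, the method names, the position of WhatPage (assumed to be the third field), and the tie-break on page ID are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the two Task c rounds (no Hadoop dependency).
class TopPages {
    // Assumption: WhatPage is the third comma-separated field of an AccessLog record.
    static final int WHAT_PAGE_FIELD = 2;

    // Round 1: emit (WhatPage, 1) for each line and sum the 1s per page.
    static Map<String, Long> countAccesses(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            String page = line.split(",")[WHAT_PAGE_FIELD].trim();
            counts.merge(page, 1L, Long::sum);
        }
        return counts;
    }

    // Round 2: order by count descending and keep the first k entries,
    // written as "count,page" to mirror the swapped key and value.
    static List<String> topK(Map<String, Long> counts, int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending count
            // Tie-break on page ID is an assumption added for deterministic output.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            Map.Entry<String, Long> e = entries.get(i);
            result.add(e.getValue() + "," + e.getKey());
        }
        return result;
    }
}
```

Calling `TopPages.topK(TopPages.countAccesses(lines), 10)` corresponds to the top-10 output. In Hadoop, swapping key and value alone sorts counts in ascending order, so a descending sort comparator (for example `LongWritable.DecreasingComparator` when the counts are `LongWritable`) is typically set on the second job; the slides do not say which approach was used.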

  6. Task c in Pig
     • Group the AccessLog dataset by the accessed Facebook ID:
       access_fid_grp = group alog by fid;
     • Get the access count for each accessed Facebook ID:
       grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;
     • Order the counts descending:
       grporder = order grpcnt by alogcnt desc;
     • List the top 10:
       taskc = limit grporder 10;
       dump taskc;
     Running time comparison (job time):
     • Plain MapReduce: 2 min 1 sec
     • Pig: 1 min 52 sec

  7. Task h: Define Potential Stalkers
     A potential stalker is a person who visits another person’s Facebook page too often even though the two are not friends.

  8. Mapper
     Input record: 1st field, PersonID, 3rd field…
     • Output key: 2nd field (person ID), as IntWritable
     • Output value: “<dataset tag>, <ID>”, as Text
       • Friends: personID, (f, friendID)
       • AccessLog: personID, (a, visitedID)

  9. Reducer
     Key: <personID>
     Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>
     • Sort the list by the second field of each element.
     • All visitedID and friendID entries that have the same value are then placed next to each other.
     • If all entries for an ID are visitedID entries and the ID appears too many times (based on a predefined threshold), the person is a potential stalker.
     • Output: personID visitedID
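
The tagging mapper and the sort-then-scan reducer described on the last two slides can be sketched in plain Java as follows. This is a hedged illustration only: the class name `StalkerSketch`, the method names, the assumption that the other ID is the third field in both datasets, and the reading of "too many times" as strictly more than the threshold are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the Task h reduce-side join (no Hadoop dependency).
class StalkerSketch {
    // Map step: tag the record with its dataset ("f" or "a") and group it by person ID.
    // Assumption: person ID is field 2 and the friend/visited ID is field 3.
    static void mapRecord(String tag, String line, Map<Integer, List<String[]>> grouped) {
        String[] f = line.split(",");
        int personId = Integer.parseInt(f[1].trim());
        grouped.computeIfAbsent(personId, k -> new ArrayList<>())
               .add(new String[] {tag, f[2].trim()});
    }

    // Reduce step for one person: sort by ID so equal IDs are adjacent, then report
    // every ID that has no "f" entry and more than `threshold` "a" entries.
    static List<String> reduce(int personId, List<String[]> values, int threshold) {
        List<String[]> sorted = new ArrayList<>(values);
        sorted.sort((x, y) -> x[1].compareTo(y[1]));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String id = sorted.get(i)[1];
            int visits = 0;
            boolean friend = false;
            int j = i;
            // Scan the run of entries that share the same ID.
            while (j < sorted.size() && sorted.get(j)[1].equals(id)) {
                if (sorted.get(j)[0].equals("f")) {
                    friend = true;
                } else {
                    visits++;
                }
                j++;
            }
            if (!friend && visits > threshold) {
                out.add(personId + " " + id);
            }
            i = j;
        }
        return out;
    }
}
```

Sorting makes each ID's friend and visit entries contiguous, so a single pass decides per ID whether a friendship exists and how many visits occurred. The threshold value itself is left as a parameter because the slides do not state it.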

  10. Sample Result

  11. Thank you! Questions?
