
Moving Social Network Mining Algorithms to the MapReduce World


Presentation Transcript


  1. Moving Social Network Mining Algorithms to the MapReduce World August 23, 2012 KAIST Jae-Gil Lee

  2. Speaker Introduction - Prof. Jae-Gil Lee • Career • Dec. 2010~present: Assistant Professor, Dept. of Knowledge Service Engineering, KAIST • Sep. 2008~Nov. 2010: Researcher, IBM Almaden Research Center • Jul. 2006~Aug. 2008: Postdoctoral Researcher, University of Illinois at Urbana-Champaign • Research Areas • Spatio-temporal data mining (trajectory and traffic data) • Social network and graph data mining • Big data analysis (MapReduce and Hadoop) • Contact • E-mail: • Homepage: http://dm.kaist.ac.kr/jaegil

  3. KAIST Dept. of Knowledge Service Engineering Knowledge service engineering aims to research and develop intelligent knowledge services that innovate the communication and collaboration between humans and IT systems by converging knowledge-related technologies such as cognitive engineering, artificial intelligence, IT, decision making, HCI, and big data analytics; it is a discipline at the center of the advancement of knowledge services. Homepage: http://kse.kaist.ac.kr/

  4. Contents • 1. Big Data and Social Networks • 2. MapReduce and Hadoop • 3. Data Mining with MapReduce • 4. Social Network Data Mining with MapReduce • 5. Conclusions

  5. 1. Big Data and Social Networks

  6. Big Data (1/2) • Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze [IDC] • IDC forecasts the size of the "digital universe" in 2011 to be 1.8 zettabytes (a zettabyte is one billion terabytes!) • The size of datasets that qualify as big data will also increase • The three V's: Volume, Velocity, Variety • Volume refers to "internet scale" • Velocity refers to the low-latency, real-time speed at which analytics needs to be applied • Variety means that data comes in all sorts of forms from all over the place

  7. Big Data (2/2)

  8. Big Data and Social Networks • The online social network (OSN) is one of the main sources of big data

  9. Data Growth in Facebook

  10. Data Growth in Twitter

  11. Some Statistics on OSNs • Twitter is estimated to have 140 million users, generating 340 million tweets a day and handling over 1.6 billion search queries per day • As of May 2012, Facebook has more than 900 million active users; Facebook had 138.9 million monthly unique U.S. visitors in May 2011

  12. Social Network Service • An online service, platform, or site that focuses on building and reflecting social networks or social relations among people who, for example, share interests and/or activities [Wikipedia] • Consists of a representation of each user (often a profile), his/her social links, and a variety of additional services • Provides services via the web or mobile phones

  13. Popular OSN Services • Facebook, Twitter, LinkedIn, Cyworld (싸이월드), KakaoStory (카카오스토리), me2day (미투데이) • Inherently created for social networking • Flickr, YouTube • Originally created for content sharing • Also allow an extensive level of social interaction • e.g., subscription features • Foursquare, Google Latitude, I'm IN (아임IN) • Geosocial networking services

  14. Other Forms of OSNs • MSN Messenger, Skype, Google Talk, KakaoTalk (카카오톡) • Can be considered an indirect form of social networks • Bibliographic networks (e.g., DBLP, Google Scholar) • Co-authorship data • Citation data • Blogs (e.g., Naver Blog) • Neighbor list

  15. Data Characteristics • Relationship data: e.g., followers, … • Content data: e.g., tweets, … • Location data

  16. Graph Data • A social network is usually modeled as a graph • A node → an actor • An edge → a relationship or an interaction

  17. Directed or Undirected? • Edges can be either directed or undirected • Undirected edge (or symmetric relationship) • No direction in edges • Facebook friendship: if A is a friend of B, then B is also a friend of A • Directed edge (or asymmetric relationship) • Direction does matter in edges • Twitter following: although A is a follower of B, B may not be a follower of A

  18. Weight on Edges? • Edges can be weighted • Examples of weight? • Geographical social networking data → ? • DBLP co-authorship data → ? • …
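To make slides 16~18 concrete, here is a minimal Python sketch of a social network as a graph; the node names and weights are made up for illustration and are not from the original deck.

# A directed, weighted graph as an adjacency list: node -> {neighbor: weight}.
# For an undirected graph (e.g., Facebook friendship), store each edge
# in both directions.
graph = {
    "A": {"B": 3},            # A follows B; a weight of 3 could be, e.g.,
    "B": {},                  # the number of co-authored papers in DBLP
    "C": {"A": 1, "B": 2},
}

def out_degree(g, node):
    """Number of outgoing edges (e.g., accounts a user follows)."""
    return len(g.get(node, {}))

print(out_degree(graph, "C"))  # 2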

  19. Two Types of Graph Data • A single large graph • e.g., social network data (as in the previous pages) ← the scope of this tutorial • Multiple graphs (each of which may be of modest size) • e.g., a chemical compound database

  20. 2. MapReduce and Hadoop Note: Some of the slides in this section are from KDD 2011 tutorial “Large-scale Data Mining: MapReduce and Beyond”

  21. Big Data Analysis • To handle big data, Google proposed a new approach called MapReduce • MapReduce can crunch huge amounts of data by splitting the task over multiple computers that operate in parallel • No matter how large the problem is, you can always increase the number of processors (which today are relatively cheap)

  22. MapReduce Basics • Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node. • Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. Example: see the word-count sketch below.
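The example on the original slide was an image that did not survive extraction; as a stand-in, here is the canonical word-count example written as plain Python functions so the map and reduce roles are explicit (the function and variable names are my own).

from collections import defaultdict

def map_step(line):
    """Map: turn one line of input into (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_step(pairs):
    """Reduce: sum the counts for each word."""
    counts = defaultdict(int)
    for word, c in pairs:
        counts[word] += c
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for line in lines for pair in map_step(line)]
print(reduce_step(intermediate))  # {'the': 2, 'quick': 1, ...}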

  23. Example – Programming Model Q: "What is the frequency of each first name?" employees.txt (tab-separated: LAST FIRST SALARY): Smith John $90,000; Brown David $70,000; Johnson George $95,000; Yates John $80,000; Miller Bill $65,000; Moore Jack $85,000; Taylor Fred $75,000; Smith David $80,000; Harris John $90,000; …

# mapper
def getName(line):
    return line.split('\t')[1]

# reducer
def addCounts(hist, name):
    hist[name] = hist.get(name, 0) + 1
    return hist

input = open('employees.txt', 'r')
intermediate = map(getName, input)
result = reduce(addCounts, intermediate, {})  # Python 3: from functools import reduce

  24. Example – Programming Model (with key-value iterators) Q: "What is the frequency of each first name?" (same employees.txt as above)

# mapper: emit (name, 1) key-value pairs
def getName(line):
    return (line.split('\t')[1], 1)

# reducer: sum the counts per name
def addCounts(hist, pair):
    name, c = pair   # Python 2 allowed unpacking in the signature
    hist[name] = hist.get(name, 0) + c
    return hist

input = open('employees.txt', 'r')
intermediate = map(getName, input)
result = reduce(addCounts, intermediate, {})
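A quick sanity test of the functions above on an in-memory sample (my addition, not part of the original deck):

from functools import reduce  # needed in Python 3

def getName(line):
    return (line.split('\t')[1], 1)

def addCounts(hist, pair):
    name, c = pair
    hist[name] = hist.get(name, 0) + c
    return hist

sample = ["Smith\tJohn\t$90,000",
          "Brown\tDavid\t$70,000",
          "Yates\tJohn\t$80,000"]
print(reduce(addCounts, map(getName, sample), {}))  # {'John': 2, 'David': 1}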

  25. Example – Programming Model: Hadoop / Java

public class HistogramJob extends Configured implements Tool {

  public static class FieldMapper extends MapReduceBase
      implements Mapper<LongWritable,Text,Text,LongWritable> {
    private static LongWritable ONE = new LongWritable(1);
    private static Text firstname = new Text();

    @Override
    public void map(LongWritable key, Text value,
        OutputCollector<Text,LongWritable> out, Reporter r)
        throws IOException {
      firstname.set(value.toString().split("\t")[1]);
      out.collect(firstname, ONE);
    }
  } // class FieldMapper

  26. Example – Programming Model: Hadoop / Java

  public static class LongSumReducer extends MapReduceBase
      implements Reducer<Text,LongWritable,Text,LongWritable> {
    private static LongWritable sum = new LongWritable();

    @Override
    public void reduce(Text key, Iterator<LongWritable> vals,
        OutputCollector<Text,LongWritable> out, Reporter r)
        throws IOException {
      long s = 0;
      while (vals.hasNext()) s += vals.next().get();
      sum.set(s);
      out.collect(key, sum);
    }
  } // class LongSumReducer

  27. Example – Programming Model: Hadoop / Java

  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), HistogramJob.class);
    job.setJobName("Histogram");
    FileInputFormat.setInputPaths(job, args[0]);
    job.setMapperClass(FieldMapper.class);
    job.setCombinerClass(LongSumReducer.class);
    job.setReducerClass(LongSumReducer.class);
    // ...
    JobClient.runJob(job);
    return 0;
  } // run()

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new Configuration(), new HistogramJob(), args);
  } // main()
} // class HistogramJob

  28. Execution Model: Flow [Diagram: the input file is divided into SPLITs 0~3, each sequentially scanned by a MAPPER; e.g., the records "Smith John $90,000" and "Yates John $80,000" each produce (John, 1). The map output is hash-partitioned all-to-all and sort-merged into key/value iterators, so a REDUCER sees (John, 2) and writes an output PART file.]

  29. Execution Model: Placement [Diagram: SPLITs 0~4 are each stored as three replicas spread over HOSTs 0~6; the MAPPERs are co-located with the data as much as possible.]

  30. Execution Model: Placement [Diagram: the same cluster, now with a COMBINER (C) running beside each mapper and the REDUCER placed in a rack/network-aware manner.]

  31. Apache Hadoop • The most popular open-source implementation of MapReduce • http://hadoop.apache.org/ • Ecosystem stack: HBase, Pig, Hive, Chukwa, MapReduce, HDFS, ZooKeeper, Core, Avro

  32. Apache Mahout (1/2) • A scalable machine learning and data mining library built on Hadoop • Currently at version 0.7 (the implementation details are not well documented) • http://mahout.apache.org/ • Supported algorithms • Collaborative filtering • User- and item-based recommenders • K-Means and Fuzzy K-Means clustering • Singular value decomposition • Parallel frequent pattern mining • Complementary Naive Bayes classifier • Random forest decision tree based classifier • …

  33. Apache Mahout (2/2) • Data structures for vectors and matrices • Vectors • Dense vectors as a double[] • Sparse vectors as a HashMap<Integer, Double> • Operations: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum, and cross • Matrices • Dense matrix as a double[][] • SparseRowMatrix or SparseColumnMatrix as a Vector[] holding the rows or columns of the matrix as SparseVectors • SparseMatrix as a HashMap<Integer, Vector> • Operations: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart, and zSum
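The dense/sparse trade-off on this slide is easy to see outside of Java as well; below is a small Python analogue (my own illustration, not Mahout code) of the two vector representations and a dot product over them.

# Dense vector: every coordinate stored, like Mahout's double[].
dense = [0.0, 2.5, 0.0, 0.0, 1.0]

# Sparse vector: only nonzero coordinates stored,
# like Mahout's HashMap<Integer, Double>.
sparse = {1: 2.5, 4: 1.0}

def dot_dense_sparse(d, s):
    """Dot product; iterate over the sparse side only."""
    return sum(d[i] * v for i, v in s.items())

print(dot_dense_sparse(dense, sparse))  # 2.5*2.5 + 1.0*1.0 = 7.25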

  34. 3. Data Mining with MapReduce

  35. Clustering Basics • Grouping data to form new categories (clusters) • Principle: maximizing intra-cluster similarity and minimizing inter-cluster similarity • e.g., customer locations in a city [Figure: points forming two clusters]

  36. k-Means Clustering (1/3) 1. Arbitrarily choose k points from D as the initial cluster centers 2. (Re)assign each point to the cluster to which it is most similar, based on the mean value of the points in the cluster (the centroid) 3. Update the cluster centroids, i.e., calculate the mean value of the points in each cluster 4. Repeat steps 2~3 until the criterion function converges
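Before the MapReduce version, here is a compact single-machine sketch of these four steps in Python (my own illustration; 2-D points and squared Euclidean distance assumed, with a fixed number of iterations instead of a convergence test, for brevity).

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)      # step 1: initial centers
    for _ in range(iters):                    # step 4: repeat
        clusters = [[] for _ in range(k)]
        for x, y in points:                   # step 2: assign to closest
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        for i, c in enumerate(clusters):      # step 3: update centroids
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

pts = [(1, 1), (1, 2), (8, 8), (9, 8)]
print(kmeans(pts, 2))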

  37. k-Means Clustering (2/3)

  38. k-Means Clustering (3/3)

  39. k-Means on MapReduce (1/2) [Chu et al., 2006] • Map: assigning each point to the closest centroid

Map(point p, the set of centroids):
    for c in centroids do
        if dist(p, c) < minDist then
            minDist = dist(p, c)
            closestCentroid = c
    emit (closestCentroid, p)

[Diagram: each split of the data goes to a mapper, which emits (centroid id, point) pairs, e.g., (1, …), (2, …) from Split1 and (3, …), (4, …) from Split2, routed to the reducers.]

  40. k-Means on MapReduce (2/2) [Chu et al., 2006] • Reduce: updating each centroid with the newly assigned points

Reduce(centroid c, the set of points):
    for p in points do
        coordinates += p
        count += 1
    emit (c, coordinates / count)

• Repeat the map and reduce steps until convergence [Diagram: each reducer emits new centroids, e.g., Reducer1 the new centroids for C1 and C2, Reducer2 those for C3 and C4.]
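The two pseudocode fragments above can be simulated in a few lines of plain Python; this sketch (my own, with a dict grouping standing in for the framework's shuffle) runs one map/reduce iteration and prints the new centroids.

from collections import defaultdict

def kmeans_map(point, centroids):
    """Map: emit (index of the closest centroid, point)."""
    x, y = point
    d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
    return (d.index(min(d)), point)

def kmeans_reduce(cid, points):
    """Reduce: emit (centroid id, mean of the assigned points)."""
    n = len(points)
    return (cid, (sum(p[0] for p in points) / n,
                  sum(p[1] for p in points) / n))

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(0, 0), (10, 10)]

# "Shuffle": group the mapper output by key, as the framework would.
groups = defaultdict(list)
for cid, p in (kmeans_map(p, centroids) for p in points):
    groups[cid].append(p)
print([kmeans_reduce(cid, ps) for cid, ps in groups.items()])
# [(0, (1.0, 1.5)), (1, (8.5, 8.0))]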

  41. Classification Basics [Diagram: training data labeled with a class label (Tenured = Yes) goes through feature generation to train a classifier; given the features of unseen data, e.g., (Jeff, Professor, 4, ?), the classifier outputs a prediction.]

  42. k-NN Classification (1/2) • Intuition behind k-NN classification [Diagram: given a test record and the training records, compute the distance from the test record to every training record and choose the k "nearest" records.]

  43. k-NN Classification (2/2) • Compute the distance to the other training records • Identify the k nearest neighbors • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by majority vote)
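As with k-means, a single-machine Python sketch (my own illustration; 2-D points, squared Euclidean distance, majority vote) makes the three steps concrete before the MapReduce version.

from collections import Counter

def knn_classify(query, training, k=3):
    """training: list of ((x, y), label) pairs."""
    qx, qy = query
    # Steps 1-2: sort by distance to the query and keep the k nearest.
    nearest = sorted(training,
                     key=lambda r: (r[0][0] - qx) ** 2 + (r[0][1] - qy) ** 2)[:k]
    # Step 3: majority vote over the neighbors' labels.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), train))  # "A"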

  44. k-NN Classification on MapReduce (1/2) • Map: finding candidates for the k-nearest neighbors • Obtaining the local k-nearest neighbors in each split (k=3)

Map(query q, the set of points):
    knns = find the k-nearest neighbors in the given set of points
    // Output the k-NNs of the split
    emit (q, knns)

[Diagram: each mapper emits the local k-nearest neighbors of its split for the query.]

  45. k-NN Classification on MapReduce (2/2) • Reduce: finding the true k-nearest neighbors • Obtaining the global k-nearest neighbors (k=3); the reducer runs only once

Reduce(query q, local neighbors):
    knns = find the k-nearest neighbors among all local neighbors
    emit (q, knns)

[Diagram: a single reducer merges the mappers' local k-nearest neighbors into the global k-nearest neighbors.]
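The local/global trick above is just a distributed top-k; this sketch (mine, reusing the distance logic from the earlier k-NN example) shows why merging the per-split k-NN lists preserves the exact global answer.

def local_knn(query, split, k=3):
    """Map side: the k nearest points within one split."""
    qx, qy = query
    return sorted(split,
                  key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2)[:k]

def global_knn(query, local_lists, k=3):
    """Reduce side: merge all local candidates and take the k nearest.
    Correct because every global k-NN point must already be among
    the k nearest of its own split."""
    merged = [p for lst in local_lists for p in lst]
    return local_knn(query, merged, k)

split1 = [(1, 1), (2, 2), (9, 9)]
split2 = [(0, 1), (5, 5), (1, 0)]
q = (0, 0)
print(global_knn(q, [local_knn(q, split1), local_knn(q, split2)]))
# [(0, 1), (1, 0), (1, 1)]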

  46. Naïve Bayes Classifiers (1/2) • The probability model for a classifier is a conditional model P(C | A1, …, An) • Using Bayes' theorem, we write P(C | A1, …, An) = P(C) P(A1, …, An | C) / P(A1, …, An) • Under the conditional independence assumption, the conditional distribution can be expressed as P(C | A1, …, An) ∝ P(C) × P(A1 | C) × … × P(An | C)

  47. Naïve Bayes Classifiers (2/2) • Example [training data table omitted in transcript]: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(buy = "yes") = 9/14 = 0.643, P(buy = "no") = 5/14 = 0.357
P(age = "<=30" | buy = "yes") = 2/9 = 0.222, P(age = "<=30" | buy = "no") = 3/5 = 0.6
P(income = "medium" | buy = "yes") = 4/9 = 0.444, P(income = "medium" | buy = "no") = 2/5 = 0.4
P(student = "yes" | buy = "yes") = 6/9 = 0.667, P(student = "yes" | buy = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buy = "yes") = 6/9 = 0.667, P(credit_rating = "fair" | buy = "no") = 2/5 = 0.4
P(X | buy = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044, P(X | buy = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
0.643 x 0.044 = 0.028 vs. 0.357 x 0.019 = 0.007 → predict buy = "yes"

  48. Naïve Bayes Classifiers on MapReduce (1/3) [Chu et al., 2006] • We have to estimate P(C = 1), P(C = 0), and P(Aj | C) from the training data • For simplicity, we consider two-class problems (C = 1 or C = 0) • Thus, we count the numbers of records in parallel: (1) records with Aj = a and C = 1, (2) records with Aj = a and C = 0, (3) records with C = 1, (4) records with C = 0

  49. Naïve Bayes Classifiers on MapReduce (2/3) [Chu et al., 2006] • Map: counting the records for (1)~(4) on the previous page

(1) For the conditional probabilities:
Map(record):
    emit ((Aj, C), 1)

(2) For the class distribution:
Map(record):
    emit (C, 1)

[Diagram: each mapper emits the counts from its split; one reducer collects the counts for C=1 and another those for C=0.]

  50. Naïve Bayes Classifiers on MapReduce (3/3) [Chu et al., 2006] • Reduce: summing up the counts; the reducers run only once

(1) For the conditional probabilities:
Reduce((Aj, C), counts):
    total += counts
    emit ((Aj, C), total)

(2) For the class distribution:
Reduce(C, counts):
    total += counts
    emit (C, total)
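As a final illustration, here is a small Python simulation (my own, with Counter standing in for the shuffle and the summing reducers) of the counting scheme above; dividing the attribute counts by the class counts then yields the estimates of P(Aj | C) and P(C).

from collections import Counter

# Toy training records: ({attribute: value, ...}, class label).
records = [({"age": "<=30", "student": "yes"}, 1),
           ({"age": "<=30", "student": "no"},  0),
           ({"age": ">40",  "student": "yes"}, 1)]

attr_counts, class_counts = Counter(), Counter()
for attrs, c in records:             # map phase: emit the count pairs
    class_counts[c] += 1             # emit (C, 1)
    for j, a in attrs.items():
        attr_counts[(j, a, c)] += 1  # emit ((Aj, C), 1)

# The reduce phase is the summation Counter already performed; e.g.:
print(class_counts[1])                          # 2
print(attr_counts[("student", "yes", 1)]
      / class_counts[1])                        # P(student=yes | C=1) = 1.0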
