
Introduction to Large-Scale Graph Computation


Presentation Transcript


  1. Introduction to Large-Scale Graph Computation + GraphLab and GraphChi Aapo Kyrola, akyrola@cs.cmu.edu Feb 27, 2013

  2. Acknowledgments • Many slides (the pretty ones) are from Joey Gonzalez’ lecture (2012) • Many people involved in the research: Haijie Gu Danny Bickson Arthur Gretton Yucheng Low Joey Gonzalez Carlos Guestrin Guy Blelloch Joe Hellerstein David O’Hallaron Alex Smola

  3. Contents • Introduction to Big Graphs • Properties of Real-World Graphs • Why MapReduce is not a good fit for big graphs → specialized systems • Vertex-Centric Programming Model • GraphLab -- distributed computation • GraphChi -- disk-based

  4. Basic vocabulary • Graph (network) • Vertex (node) • Edge (link), in-edge, out-edge • Sparse graph / matrix • Terms: for a directed edge e from vertex A to vertex B, e is an out-edge of A and an in-edge of B.
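As a concrete illustration of the vocabulary above, here is a minimal sketch (not from the slides) of a sparse directed graph stored as adjacency lists in C++; the vertex ids and the single example edge are made up.

#include <cstdio>
#include <vector>

struct Graph {
    int num_vertices;
    std::vector<std::vector<int>> out_edges;  // out_edges[v] = targets of v's out-edges
    std::vector<std::vector<int>> in_edges;   // in_edges[v]  = sources of v's in-edges

    explicit Graph(int n) : num_vertices(n), out_edges(n), in_edges(n) {}

    void add_edge(int src, int dst) {          // directed edge e = (src, dst)
        out_edges[src].push_back(dst);         // e is an out-edge of src
        in_edges[dst].push_back(src);          // e is an in-edge of dst
    }
};

int main() {
    Graph g(2);
    g.add_edge(0, 1);  // A = 0, B = 1: the edge is an out-edge of A and an in-edge of B
    std::printf("out-degree of A: %zu, in-degree of B: %zu\n",
                g.out_edges[0].size(), g.in_edges[1].size());
    return 0;
}

Storing only the edges that exist, rather than a dense n-by-n matrix, is what makes computation on graphs with billions of edges feasible at all.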

  5. Introduction to Big Graphs

  6. What is a “Big” Graph? • Definition changes rapidly: • GraphLab paper 2009: biggest graph 200M edges • GraphLab & GraphChi papers 2012: biggest graph 6.7B edges • GraphChi @ Twitter: many times bigger. • Depends on the computation as well • matrix factorization (collaborative filtering) or Belief Propagation is much more expensive than PageRank

  7. What is a “Big” Graph? • Big Graphs are always extremely sparse. • Biggest graphs available to researchers: • Altavista: 6.7B edges, 1.4B vertices • Twitter 2010: 1.5B edges, 68M vertices • Common Crawl (2012): 5 billion web pages • But the industry has even bigger ones: • Facebook (Oct 2012): 144B friendships, 1B users • Twitter (2011): 15B follower-edges • When reading about graph processing systems, be critical of the problem sizes – are they really big? • Shun, Blelloch (2013, PPoPP): use a single machine (256 GB RAM) for in-memory computation on the same graphs as the GraphLab/GraphChi papers.

  8. Examples of Big Graphs • Twitter – what kind of graphs? • follow graph • engagement graph • list-members graph • topic-authority graph (consumers -> producers)

  9. Example of Big Graphs • Facebook: extended social graph • FB friend-graph: differences to Twitter’s graph? Slide from Facebook Engineering’s presentation

  10. Other Big Networks • WWW • Academic Citations • Internet traffic • Phone calls

  11. What can we compute from social networks / web graphs? • Influence ranking • PageRank, TunkRank, SALSA, HITS • Analysis • triangle counting (clustering coefficient), community detection, information propagation, graph radii, ... • Recommendations • who-to-follow, who-to-follow for topic T • similarities • Search enhancements • Facebook’s Graph Search • But actually: it is a hard question by itself!

  12. Sparse Matrices How to represent sparse matrices as graphs? • User x Item/Product matrices • explicit feedback (ratings) • implicit feedback (seen or not seen) • typically very sparse

  13. Product – Item bipartite graph [Figure: users on one side, movies on the other (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita); edges carry ratings such as 2, 3, 4, 5]

  14. What can we compute from user-item graphs? • Collaborative filtering (recommendations) • Recommend products that users with similar tastes have recommended. • Similarity / distance metrics • Matrix factorization • Random walk based methods • Lots of algorithms available. See Danny Bickson’s CF toolkit for GraphChi: • http://bickson.blogspot.com/2012/08/collaborative-filtering-with-graphchi.html
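To make the matrix-factorization bullet concrete, here is a hedged sketch (not Danny Bickson's toolkit) of stochastic gradient descent factorization over a user-item rating graph; the tiny rating set, latent dimension, learning rate, and regularization values are illustrative assumptions.

#include <cstdio>
#include <random>
#include <vector>

struct Rating { int user, item; double value; };

int main() {
    const int num_users = 3, num_items = 4, D = 2;   // D = latent dimension
    const double eta = 0.05, lambda = 0.02;          // learning rate, regularization
    std::vector<Rating> ratings = {{0,0,5}, {0,1,3}, {1,1,4}, {1,2,2}, {2,3,5}};

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> init(0.0, 0.1);
    std::vector<std::vector<double>> U(num_users, std::vector<double>(D));
    std::vector<std::vector<double>> V(num_items, std::vector<double>(D));
    for (auto& row : U) for (double& x : row) x = init(rng);
    for (auto& row : V) for (double& x : row) x = init(rng);

    for (int iter = 0; iter < 100; ++iter) {
        for (const Rating& r : ratings) {
            double pred = 0;
            for (int d = 0; d < D; ++d) pred += U[r.user][d] * V[r.item][d];
            double err = r.value - pred;
            for (int d = 0; d < D; ++d) {            // gradient step on both factors
                double u = U[r.user][d], v = V[r.item][d];
                U[r.user][d] += eta * (err * v - lambda * u);
                V[r.item][d] += eta * (err * u - lambda * v);
            }
        }
    }
    // Predict an unseen (user, item) pair from the learned factors.
    double pred = 0;
    for (int d = 0; d < D; ++d) pred += U[2][d] * V[0][d];
    std::printf("predicted rating for user 2, item 0: %.2f\n", pred);
    return 0;
}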

  15. Probabilistic Graphical Models • Each vertex represents a random variable • Edges between vertices represent dependencies • modelled with conditional probabilities • Bayes networks • Markov Random Fields • Conditional Random Fields • Goal: given evidence (observed variables), compute likelihood of the unobserved variables • Exact inference generally intractable • Need to use approximations.

  16. [Figure: example graphical model with variables for Shopper 1, Shopper 2, Cooking, and Cameras]

  17. Image Denoising [Figure: synthetic noisy image, the corresponding graphical model, and the result after a few updates]

  18. Still more examples • CompBio • Protein-Protein interaction network • Activator/deactivator gene network • DNA assembly graph • Text modelling • word-document graphs • Knowledge bases • NELL project at CMU • Planar graphs • Road network • Implicit Graphs • k-NN graphs

  19. Resources • Stanford SNAP datasets: • http://snap.stanford.edu/data/index.html • ClueWeb (CMU): • http://lemurproject.org/clueweb09/ • Univ. of Milan’s repository: • http://law.di.unimi.it/datasets.php

  20. Properties of Real-World Graphs [Image: Twitter network visualization by Akshay Java, 2009]

  21. Natural Graphs [Image from WikiCommons]

  22. Natural Graphs • Grids and other Planar Graphs are “easy” • Easy to find separators • The fundamental properties of natural graphs make them computationally challenging

  23. Power-Law • Degree of a vertex = number of adjacent edges • in-degree and out-degree

  24. Power-Law = Scale-free • Fraction of vertices having k neighbors: • P(k) ∝ k^(-α) • Generative models: • rich-get-richer (preferential attachment) • copy-model • Kronecker graphs (Leskovec, Faloutsos, et al.) • Other phenomena with power-law characteristics?
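As a sketch of the rich-get-richer (preferential attachment) model listed above, the following toy generator attaches each new vertex to an existing vertex chosen with probability proportional to its current degree, which yields a roughly power-law degree distribution; the graph size and random seed are arbitrary choices for the demo.

#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int num_vertices = 100000;
    std::mt19937 rng(1);
    // endpoints[] holds one entry per edge endpoint, so sampling a uniform
    // element of it picks a vertex with probability proportional to degree.
    std::vector<int> endpoints = {0, 1};            // start from a single edge 0-1
    std::vector<int> degree(num_vertices, 0);
    degree[0] = degree[1] = 1;

    for (int v = 2; v < num_vertices; ++v) {
        std::uniform_int_distribution<size_t> pick(0, endpoints.size() - 1);
        int target = endpoints[pick(rng)];          // preferential attachment
        endpoints.push_back(v);
        endpoints.push_back(target);
        ++degree[v];
        ++degree[target];
    }

    // Print the low end of the degree distribution; on log-log axes it is roughly a line.
    std::vector<int> count(64, 0);
    for (int v = 0; v < num_vertices; ++v)
        if (degree[v] < 64) ++count[degree[v]];
    for (int k = 1; k < 16; ++k)
        std::printf("P(degree = %2d) ~= %.5f\n", k, count[k] / double(num_vertices));
    return 0;
}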

  25. Natural Graphs → Power Law • Top 1% of vertices is adjacent to 53% of the edges! [Plot: degree distribution of the Altavista web graph (1.4B vertices, 6.7B edges) on log-log axes; the “power law” slope is α ≈ 2]

  26. Properties of Natural Graphs • Great talk by M. Mahoney: “Extracting insight from large networks: implications of small-scale and large-scale structure” • Small diameter • expected distance between two nodes in Facebook: 4.74 (2011) • Nice local structure, but no global structure [From Michael Mahoney’s (Stanford) presentation]

  27. Graph Compression • Local structure helps compression: • Blelloch et al. (2003): compress web graph to 3-4 bits / link • WebGraph framework from the University of Milan • social graphs ~ 10 bits / edge (2009) • Basic idea: • order the vertices so that topologically close vertices have ids close to each other • difference encoding
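The following is a minimal sketch of the difference-encoding idea, not the actual WebGraph format: sort a vertex's out-neighbors, store the gaps between consecutive ids, and pack each gap as a variable-length integer. With a locality-preserving vertex ordering the gaps are small, which is where the few-bits-per-edge figures come from. The encode_varint and compress_adjacency helpers are hypothetical names for this illustration.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Encode one unsigned value as a varint (7 bits per byte, high bit = "more bytes follow").
static void encode_varint(uint64_t value, std::vector<uint8_t>* out) {
    while (value >= 0x80) {
        out->push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out->push_back(static_cast<uint8_t>(value));
}

// Compress a sorted adjacency list as gaps; the first neighbor id is stored absolutely.
static std::vector<uint8_t> compress_adjacency(std::vector<uint64_t> nbrs) {
    std::sort(nbrs.begin(), nbrs.end());
    std::vector<uint8_t> bytes;
    uint64_t prev = 0;
    for (uint64_t n : nbrs) {
        encode_varint(n - prev, &bytes);   // small gaps -> one byte each
        prev = n;
    }
    return bytes;
}

int main() {
    // With a locality-preserving ordering, neighbor ids cluster close together.
    std::vector<uint64_t> nbrs = {1000002, 1000003, 1000007, 1000020, 1000021};
    std::vector<uint8_t> bytes = compress_adjacency(nbrs);
    std::printf("%zu neighbors encoded in %zu bytes\n", nbrs.size(), bytes.size());
    return 0;
}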

  28. Computational Challenge • Natural Graphs are very hard to partition • Hard to distribute computation to many nodes in balanced way, so that the number of edges crossing partitions is minimized • Why? Think about stars. • Graph partitioning algorithms: • METIS • Spectral clustering • Not feasible on very large graphs! • Vertex-cuts better than edge cuts (talk about this later with GraphLab)

  29. Why MapReduce is not enough: large-scale graph computation systems

  30. Parallel Graph Computation • Distributed computation and/or multicore parallelism • Sometimes confusing. We will talk mostly about distributed computation. • Are classic graph algorithms parallelizable? What about distributed? • Depth-first search? • Breadth-first search? • Priority-queue based traversals (Dijkstra’s, Prim’s algorithms)

  31. MapReduce for Graphs • Graph computation almost always iterative • MapReduce ends up shipping the whole graph on each iteration over the network (map->reduce->map->reduce->...) • Mappers and reducers are stateless

  32. Iterative Computation is Difficult • System is not optimized for iteration [Figure: in MapReduce, each iteration re-reads its input from disk and re-writes its output, paying a startup penalty and a disk penalty on every pass]

  33. MapReduce and Partitioning • Map-Reduce splits the keys randomly between mappers/reducers • But on natural graphs, high-degree vertices (keys) may have a million times more edges than the average • Extremely uneven distribution • Time of iteration = time of slowest job.

  34. Curse of the Slow Job [Figure: BSP iterations separated by barriers; at every barrier all CPUs wait for the slowest job before the next iteration can start] http://www.www2011india.com/proceeding/proceedings/p607.pdf

  35. Map-Reduce is Bulk-Synchronous Parallel • Bulk-Synchronous Parallel = BSP (Valiant, 80s) • Each iteration sees only the values of previous iteration. • In linear systems literature: Jacobi iterations • Pros: • Simple to program • Maximum parallelism • Simple fault-tolerance • Cons: • Slower convergence • Iteration time = time taken by the slowest node

  36. Asynchronous Computation • Alternative to BSP • Linear systems: Gauss-Seidel iterations • When computing value for item X, can observe the most recently computed values of neighbors • Often relaxed: can see most recent values available on a certain node • Consistency issues: • Prevent parallel threads from over-writing or corrupting values (race conditions)
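To make the Jacobi vs. Gauss-Seidel contrast concrete, here is a small sketch that solves a tiny diagonally dominant linear system both ways; the matrix and vector values are arbitrary. The Jacobi (BSP-style) sweep reads only the previous iteration's values, while the Gauss-Seidel (asynchronous-style) sweep reads the freshest values already written in the current sweep and here reaches the same tolerance in fewer sweeps.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const std::vector<std::vector<double>> A = {{4, 1, 1}, {1, 5, 2}, {1, 2, 6}};
    const std::vector<double> b = {6, 8, 9};
    const int n = 3;
    const double tol = 1e-9;

    auto sweep_count = [&](bool gauss_seidel) {
        std::vector<double> x(n, 0.0);
        for (int iter = 1; iter <= 1000; ++iter) {
            std::vector<double> x_old = x;
            double max_change = 0.0;
            for (int i = 0; i < n; ++i) {
                double sum = b[i];
                for (int j = 0; j < n; ++j) {
                    if (j == i) continue;
                    // Gauss-Seidel sees the freshest value; Jacobi sees last iteration's.
                    sum -= A[i][j] * (gauss_seidel ? x[j] : x_old[j]);
                }
                double new_xi = sum / A[i][i];
                max_change = std::max(max_change, std::fabs(new_xi - x[i]));
                x[i] = new_xi;
            }
            if (max_change < tol) return iter;
        }
        return 1000;
    };

    std::printf("Jacobi (BSP-style) sweeps:         %d\n", sweep_count(false));
    std::printf("Gauss-Seidel (async-style) sweeps: %d\n", sweep_count(true));
    return 0;
}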

  37. MapReduce’s (Hadoop’s) poor performance on huge graphs has motivated the development of specialized graph-computation systems

  38. Specialized Graph Computation Systems (Distributed) • Common to all: graph partitions resident in memory on the computation nodes • Avoid shipping the graph over and over • Pregel (Google, 2010): • “Think like a vertex” • Messaging model • BSP • Open source: Giraph, Hama, Stanford GPS, ... • GraphLab (2010, 2012) [CMU] • Asynchronous (also BSP) • Version 2.1 (“PowerGraph”) uses vertex-partitioning → extremely good performance on natural graphs • + Others • But do you need a distributed framework?

  39. “Think like a vertex”: vertex-centric programming

  40. Vertex-Centric Programming • “Think like a Vertex” (Google, 2010) • Historically, a similar idea was used in systolic computation, data-flow systems, the Connection Machine, and others. • Basic idea: each vertex computes its value individually [in parallel] • Program state = vertex (and edge) values • Pregel: vertices send messages to each other • GraphLab/Chi: vertex reads its neighbors’ and edge values, modifies edge values (can be used to simulate messaging) • Iterative • Fixed-point computations are typical: iterate until the state does not change (much).

  41. Computational Model (GraphLab and GraphChi) • Graph G = (V, E) • directed edges: e = (source, destination) • each edge and vertex associated with a value (user-defined type) • vertex and edge values can be modified • (GraphChi: structure modification also supported) [Figure: directed edge e from vertex A to vertex B, with a data value attached to every vertex and edge; e is an out-edge of A and an in-edge of B]

  42. Vertex Update Function [Figure: a vertex and its neighborhood; every vertex and edge carries a data value] MyFunc(vertex) { // modify neighborhood }
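Below is a slightly fuller version of the MyFunc sketch, written against a made-up minimal interface rather than the actual GraphLab/GraphChi API: the update function reads the values on the in-edges, recomputes the vertex value, and writes onto the out-edges for the neighbors to read.

#include <vector>

struct Edge { double value; };

struct Vertex {
    double value;
    std::vector<Edge*> in_edges;   // edges pointing to this vertex
    std::vector<Edge*> out_edges;  // edges leaving this vertex
};

// "Think like a vertex": program state lives in vertex and edge values, and
// the same function is applied to every (scheduled) vertex on each iteration.
void my_update(Vertex& v) {
    double sum = 0.0;
    for (const Edge* e : v.in_edges) sum += e->value;        // read the neighborhood
    v.value = v.in_edges.empty() ? v.value : sum / v.in_edges.size();
    for (Edge* e : v.out_edges) e->value = v.value;          // expose result to neighbors
}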

  43. Parallel Computation • Bulk-Synchronous: All vertices update in parallel (note: need 2x memory – why?) • Asynchronous: • Basic idea: if two vertices are not connected, can update them in parallel • Two-hop connections • GraphLab supports different consistency models allowing user to specify the level of “protection” = locking • Efficient locking is complicated on distributed computation (hidden from user) – why?

  44. Scheduling • Often, some parts of the graph require more iterations to converge than others: • Remember power-law structure • Wasteful to update all vertices equal number of times.

  45. The Scheduler • The scheduler determines the order in which vertices are updated [Figure: CPUs take vertices from the scheduler, update them, and may add their neighbors back to the schedule] • The process repeats until the scheduler is empty

  46. Types of Schedulers (GraphLab) • Round-robin • Selective scheduling (skipping): • round-robin, but jump over unscheduled vertices • FIFO • Priority scheduling • Approximations used in distributed computation (each node has its own priority queue) • Rarely used in practice (why?)
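As one way to picture selective scheduling, here is a sketch under an assumed minimal interface (the Scheduler type, add_task, and graph.neighbors are hypothetical names, not GraphLab's API): a FIFO queue plus a per-vertex flag so each vertex is queued at most once, with vertices re-scheduling their neighbors only when their own value changed enough to matter.

#include <queue>
#include <vector>

struct Scheduler {
    std::queue<int> fifo;
    std::vector<bool> scheduled;

    explicit Scheduler(int num_vertices) : scheduled(num_vertices, false) {}

    void add_task(int v) {
        if (!scheduled[v]) { scheduled[v] = true; fifo.push(v); }   // at most once in queue
    }
    bool empty() const { return fifo.empty(); }
    int next_task() {
        int v = fifo.front(); fifo.pop(); scheduled[v] = false; return v;
    }
};

// Driver loop: run until no vertex is scheduled (convergence), instead of
// updating every vertex the same number of times.
template <typename Graph, typename UpdateFn>
void run(Graph& graph, Scheduler& sched, UpdateFn update, double threshold) {
    while (!sched.empty()) {
        int v = sched.next_task();
        double change = update(graph, v);           // returns how much v's value moved
        if (change > threshold)
            for (int nbr : graph.neighbors(v)) sched.add_task(nbr);
    }
}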

  47. Example: PageRank • Express PageRank in words in the vertex-centric model
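One possible answer to this exercise, written as code instead of words, is the synchronous vertex-centric PageRank sketch below on a tiny hard-coded graph: each vertex sums the contributions its in-neighbors exposed on the in-edges, recomputes its rank, and exposes rank / out-degree on its out-edges. The 4-vertex graph and the damping factor 0.85 are illustrative choices, not from the slides.

#include <cstdio>
#include <vector>

int main() {
    const int n = 4;
    const double d = 0.85;                                   // damping factor
    // out[v] = list of v's out-neighbors.
    std::vector<std::vector<int>> out = {{1, 2}, {2}, {0}, {0, 2}};
    std::vector<std::vector<int>> in(n);
    for (int v = 0; v < n; ++v)
        for (int u : out[v]) in[u].push_back(v);

    std::vector<double> rank(n, 1.0 / n), contrib(n, 0.0);
    for (int iter = 0; iter < 30; ++iter) {
        // "Out-edge" phase: each vertex exposes rank / out-degree to its neighbors.
        for (int v = 0; v < n; ++v)
            contrib[v] = out[v].empty() ? 0.0 : rank[v] / out[v].size();
        // Update phase: each vertex sums the contributions on its in-edges.
        for (int v = 0; v < n; ++v) {
            double sum = 0.0;
            for (int u : in[v]) sum += contrib[u];
            rank[v] = (1.0 - d) / n + d * sum;
        }
    }
    for (int v = 0; v < n; ++v) std::printf("rank[%d] = %.4f\n", v, rank[v]);
    return 0;
}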

  48. Example: Connected Components [Figure: a small example graph with vertices 1-7] • First iteration: each vertex chooses label = its id.

  49. Example: Connected Components [Figure: labels after one update; several vertices now carry the smallest id among their neighbors] • Update: my label = minimum of my neighbors’ labels (and my own).

  50. Example: Connected Components • How many iterations are needed for convergence? (In the synchronous model) • What about the asynchronous model? [Figure: converged labeling; every vertex in a component carries the component’s smallest id] • Component id = leader id (smallest id in the component)
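The sketch below implements the label-propagation algorithm of slides 48-50 in the synchronous (BSP) model on a small made-up graph with two components, and counts the sweeps until nothing changes; in this model the count grows roughly with the diameter of the largest component, whereas an asynchronous schedule can propagate the minimum label further within a single sweep.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 7;
    // Undirected edges of a small graph with two components.
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {2, 3}, {4, 5}, {5, 6}};

    std::vector<int> label(n);
    for (int v = 0; v < n; ++v) label[v] = v;   // first iteration: label = own id

    int iterations = 0;
    bool changed = true;
    while (changed) {
        ++iterations;
        changed = false;
        std::vector<int> next = label;          // synchronous: read old labels, write new ones
        for (auto [u, v] : edges) {
            next[u] = std::min(next[u], label[v]);
            next[v] = std::min(next[v], label[u]);
        }
        if (next != label) { label = next; changed = true; }
    }

    std::printf("converged after %d sweeps\n", iterations);
    for (int v = 0; v < n; ++v)
        std::printf("vertex %d -> component leader %d\n", v, label[v]);
    return 0;
}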
