
Link Analysis: PageRank

Link Analysis: PageRank. Ranking nodes on the graph: web pages are not equally “important” (www.joe-schmoe.com vs. www.stanford.edu). Since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure. Link analysis algorithms.



Presentation Transcript


  1. Link Analysis: PageRank

  2. Ranking Nodes on the Graph • Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu • Since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure Slides by Jure Leskovec: Mining Massive Datasets

  3. Link Analysis Algorithms • We will cover the following link analysis approaches to computing the importance of nodes in a graph: • PageRank • Hubs and Authorities (HITS) • Topic-Specific (Personalized) PageRank • Web Spam Detection Algorithms

  4. Links as Votes • Idea: links as votes • A page is more important if it has more links • In-coming links? Out-going links? • Think of in-links as votes: • www.stanford.edu has 23,400 in-links • www.joe-schmoe.com has 1 in-link • Are all in-links equal? • Links from important pages count more • Recursive question!

  5. Simple Recursive Formulation • Each link’s vote is proportional to the importance of its source page • If page p with importance x has n out-links, each link gets x/n votes • Page p’s own importance is the sum of the votes on its in-links
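Written as a single formula, the three bullets above amount to the following. The symbols are assumptions chosen to match the later slides: $r_j$ is the importance of page $j$, $d_i$ is the number of out-links of page $i$, and $i \to j$ means that page $i$ links to page $j$.

```latex
r_j = \sum_{i \to j} \frac{r_i}{d_i}
```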

  6. PageRank: The “Flow” Model • A “vote” from an important page is worth more • A page is important if it is pointed to by other important pages • Define a “rank” $r_j$ for node $j$ • Example graph, “The web in 1839”, with pages y, a, m: y links to itself and to a; a links to y and m; m links to a • Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$

  7. Solving the Flow Equations • Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$ • 3 equations, 3 unknowns, no constants • No unique solution • An additional constraint forces uniqueness: $r_y + r_a + r_m = 1$ • Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$ • Gaussian elimination works for small examples, but we need a better method for large web-size graphs
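As a small illustration of the Gaussian elimination route, the sketch below solves the three-page example exactly. It assumes one of the (linearly dependent) flow equations is replaced by the constraint $r_y + r_a + r_m = 1$; the helper name `solve` is hypothetical, and it simply fails on a singular system.

```python
from fractions import Fraction as F

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination on exact fractions."""
    n = len(b)
    # Augmented matrix [a | b], converted to Fractions to avoid float division
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap a row with a non-zero pivot into position
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Clear this column in every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns ordered (r_y, r_a, r_m):
#   r_y - r_y/2 - r_a/2 = 0
#   r_a - r_y/2 - r_m   = 0
#   r_y + r_a + r_m     = 1   (replaces the redundant third flow equation)
a = [[F(1, 2), F(-1, 2), 0],
     [F(-1, 2), 1, -1],
     [1, 1, 1]]
b = [0, 0, 1]
r = solve(a, b)
print(r)  # [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
```

Elimination costs on the order of $N^3$ operations, which is why the slides move on to an iterative method for web-size graphs.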

  8. PageRank: Matrix Formulation • Stochastic adjacency matrix $M$ • Let page $j$ have $d_j$ out-links • If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$ • $M$ is a column-stochastic matrix • Columns sum to 1 • Rank vector $r$: vector with an entry per page • $r_i$ is the importance score of page $i$ • $\sum_i r_i = 1$ • The flow equations can be written $r = M \cdot r$
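A minimal sketch of how $M$ could be built from out-link lists, using the three-page y/a/m graph from the flow-model slide as input. The function name `stochastic_matrix` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction

def stochastic_matrix(out_links, pages):
    """Return M as a list of rows, with M[i][j] = 1/d_j if j links to i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    m = [[Fraction(0)] * n for _ in range(n)]
    for j, dests in out_links.items():
        for i in dests:
            # Column j spreads page j's vote evenly over its d_j out-links
            m[index[i]][index[j]] = Fraction(1, len(dests))
    return m

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, pages)

# Every column should sum to 1 because every page here has at least one out-link
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```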

  9. Example • Suppose page $j$ links to 3 pages, including $i$ • Then $M_{ij} = 1/3$, so in $r = M \cdot r$ page $i$ receives $r_j/3$ from page $j$

  10. Eigenvector Formulation • The flow equations can be written $r = M \cdot r$ • So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$ • In fact, it is the first, or principal, eigenvector, with corresponding eigenvalue 1
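The eigenvector claim can be checked directly on the three-page example: multiplying $M$ by the solution $(2/5, 2/5, 1/5)$ should return the same vector, i.e. $M r = 1 \cdot r$. This is only a verification sketch with exact fractions; `matvec` is a hypothetical helper.

```python
from fractions import Fraction as F

# Rows and columns ordered y, a, m
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]
r = [F(2, 5), F(2, 5), F(1, 5)]

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

Mr = matvec(M, r)
is_eigenvector = (Mr == r)  # True means r is an eigenvector with eigenvalue 1
```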

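  11. Example: Flow Equations & M • Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$ • In matrix form, $r = M \cdot r$, with rows and columns ordered y, a, m: $\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$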

  12. Power Iteration Method • Given a web graph with $N$ nodes, where the nodes are pages and the edges are hyperlinks • Power iteration: a simple iterative scheme • Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$ • Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, i.e. $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)} / d_i$, where $d_i$ is the out-degree of node $i$ • Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$ • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm • Any other vector norm, e.g. Euclidean, can be used instead
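The scheme above can be sketched in a few lines. This is a dense, in-memory illustration only; the tolerance `eps` and the safety cap `max_iter` are assumed values, not part of the slides.

```python
def power_iteration(m, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]
        # L1 norm of the difference between successive iterates
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r  # not converged within max_iter; returns the last iterate

# Three-page example, rows and columns ordered y, a, m
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
```

On this example the result should agree with the exact solution $(2/5, 2/5, 1/5)$ up to the chosen tolerance.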

  13. PageRank: How to solve? • Power iteration: • Set $r_j = 1/N$ • And iterate $r_i = \sum_j M_{ij} \cdot r_j$ • Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$ • Example, $(r_y, r_a, r_m)$ at iterations 0, 1, 2, 3, …: (1/3, 1/3, 1/3), (1/3, 3/6, 1/6), (5/12, 1/3, 3/12), (9/24, 11/24, 1/6), …, converging to (6/15, 6/15, 3/15)

  14. Random Walk Interpretation • Imagine a random web surfer: • At any time $t$, the surfer is on some page $u$ • At time $t+1$, the surfer follows an out-link from $u$ uniformly at random • Ends up on some page $v$ linked from $u$ • The process repeats indefinitely • Let: • $p(t)$ be the vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$ • $p(t)$ is a probability distribution over pages

  15. The Stationary Distribution • Where is the surfer at time $t+1$? • Follows a link uniformly at random: $p(t+1) = M \cdot p(t)$ • Suppose the random walk reaches a state where $p(t+1) = M \cdot p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk • Our rank vector $r$ satisfies $r = M \cdot r$ • So it is a stationary distribution for the random walk
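To make the random-surfer reading concrete, the sketch below simulates a surfer on the three-page y/a/m graph and records how often each page is visited. The step count and the fixed seed are assumptions chosen for a reproducible illustration; the visit frequencies should come out close to the stationary distribution $(0.4, 0.4, 0.2)$, not exactly equal to it.

```python
import random

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate(out_links, start, steps, seed=0):
    """Follow out-links uniformly at random and return visit frequencies."""
    rng = random.Random(seed)
    visits = {p: 0 for p in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])  # pick one out-link uniformly
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

freq = simulate(out_links, "y", 200_000)
```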

  16. PageRank: Three Questions • $r_j^{(t+1)} = \sum_{i \to j} r_i^{(t)} / d_i$, or equivalently $r = M \cdot r$ • Does this converge? • Does it converge to what we want? • Are results reasonable?

  17. Does This Converge? • Example: two pages, a and b, that link only to each other • $(r_a, r_b)$ at iterations 0, 1, 2, 3, …: (1, 0), (0, 1), (1, 0), (0, 1), … • The values oscillate and never settle

  18. Does it Converge to What We Want? • Example: page a links to page b, and b has no out-links • $(r_a, r_b)$ at iterations 0, 1, 2, 3, …: (1, 0), (0, 1), (0, 0), (0, 0), … • The iteration converges, but to the all-zero vector
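Both two-page examples can be replayed in a few lines. The matrices below are assumptions read off the iteration tables: in the first, a and b link to each other; in the second, a links to b and b is a dead end, so its column is all zeros.

```python
def iterate(m, r, steps):
    """Apply r <- M r repeatedly and return every iterate, starting with r itself."""
    history = [r]
    for _ in range(steps):
        n = len(r)
        r = [sum(m[i][j] * r[j] for j in range(n)) for i in range(n)]
        history.append(r)
    return history

# Rows and columns ordered a, b
cycle = [[0, 1],
         [1, 0]]     # a -> b and b -> a
dead_end = [[0, 0],
            [1, 0]]  # a -> b only; b has no out-links

cycle_history = iterate(cycle, [1, 0], 3)
dead_history = iterate(dead_end, [1, 0], 3)
```

The first history alternates between (1, 0) and (0, 1) forever, and the second loses all of its mass after two steps.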

  19. Problems with the “Flow” Model • 2 problems: • Some pages are “dead ends” (have no out-links) • Such pages cause importance to “leak out” • Spider traps (all out-links are within the group) • Eventually spider traps absorb all importance

  20. Problem: Spider Traps • Power iteration: • Set $r_j = 1/N$ • And iterate $r_i = \sum_j M_{ij} \cdot r_j$ • Flow equations (m links only to itself): $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2 + r_m$ • Example, $(r_y, r_a, r_m)$ at iterations 0, 1, 2, 3, …: (1/3, 1/3, 1/3), (2/6, 1/6, 3/6), (3/12, 2/12, 7/12), (5/24, 3/24, 16/24), …, converging to (0, 0, 1)

  21. Solution: Random Teleports • The Google solution for spider traps: at each time step, the random surfer has two options: • With probability $\beta$, follow a link at random • With probability $1-\beta$, jump to some page uniformly at random • Common values for $\beta$ are in the range 0.8 to 0.9 • The surfer will teleport out of a spider trap within a few time steps

  22. Problem: Dead Ends • Power iteration: • Set $r_j = 1/N$ • And iterate $r_i = \sum_j M_{ij} \cdot r_j$ • Flow equations (m has no out-links): $r_y = r_y/2 + r_a/2$, $r_a = r_y/2$, $r_m = r_a/2$ • Example, $(r_y, r_a, r_m)$ at iterations 0, 1, 2, 3, …: (1/3, 1/3, 1/3), (2/6, 1/6, 1/6), (3/12, 2/12, 1/12), (5/24, 3/24, 2/24), …, converging to (0, 0, 0)

  23. Solution: Dead Ends • Teleports: follow random teleport links with probability 1.0 from dead ends • Adjust the matrix accordingly

  24. Why Do Teleports Solve the Problem? • Markov chains: • Set of states $X$ • Transition matrix $P$, where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$ • $\pi$ specifies the probability of being at each state $x \in X$ • The goal is to find $\pi$ such that $\pi = P \pi$

  25. Why is This Analogy Useful? • Theory of Markov chains • Fact: for any start vector, the power method applied to a Markov transition matrix $P$ will converge to a unique positive stationary vector as long as $P$ is stochastic, irreducible and aperiodic.

  26. Make M Stochastic • Stochastic: every column sums to 1 • A possible solution: add green links (teleport links out of each dead end to every page), which gives $S = M + \frac{1}{N} \mathbf{1} a^T$ • $a_i = 1$ if node $i$ has out-degree 0, $a_i = 0$ otherwise • $\mathbf{1}$ is the vector of all 1s • Example (m is a dead end): $r_y = r_y/2 + r_a/2 + r_m/3$, $r_a = r_y/2 + r_m/3$, $r_m = r_a/2 + r_m/3$
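A minimal sketch of the dead-end fix on the three-page example: every all-zero column of $M$ is replaced by a uniform column of $1/N$. The function name `fix_dead_ends` is hypothetical, and the dense list-of-rows layout is only meant for small examples.

```python
from fractions import Fraction as F

def fix_dead_ends(m):
    """Return a copy of m in which every all-zero column becomes uniform 1/n."""
    n = len(m)
    s = [row[:] for row in m]
    for j in range(n):
        if all(m[i][j] == 0 for i in range(n)):  # page j has out-degree 0
            for i in range(n):
                s[i][j] = F(1, n)
    return s

# Rows and columns ordered y, a, m; m is a dead end (zero column)
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(0)]]
S = fix_dead_ends(M)
column_sums = [sum(S[i][j] for i in range(3)) for j in range(3)]
```

Reading the rows of `S` back as equations should reproduce the three flow equations listed on the slide.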

  27. Make M Aperiodic • A chain is periodic if there exists $k > 1$ such that the interval between two visits to some state $s$ is always a multiple of $k$ • A possible solution: add green links

  28. Make M Irreducible • Irreducible: from any state, there is a non-zero probability of reaching any other state • A possible solution: add green links

  29. Solution: Random Jumps • Google’s solution that does it all: • Makes $M$ stochastic, aperiodic, irreducible • At each step, the random surfer has two options: • With probability $\beta$, follow a link at random • With probability $1-\beta$, jump to some random page • PageRank equation [Brin-Page, 98]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$, where $d_i$ is the out-degree of node $i$ • From now on, we assume $M$ has no dead ends; that is, we follow random teleport links with probability 1.0 from dead ends

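  30. The Google Matrix • PageRank equation [Brin-Page, 98]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1-\beta) \frac{1}{N}$ • The Google Matrix $A$: $A = \beta M + (1-\beta) \frac{1}{N} \mathbf{1} \mathbf{1}^T$ • $A$ is stochastic, aperiodic and irreducible, so $r^{(t+1)} = A \cdot r^{(t)}$ converges • What is $\beta$? • In practice $\beta = 0.85$ (make roughly 5 steps, then jump)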

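  31. Random Teleports ($\beta = 0.8$) • Spider-trap graph: y links to y and a; a links to y and m; m links only to itself • $A = 0.8 \cdot S + 0.2 \cdot \frac{1}{N} \mathbf{1} \mathbf{1}^T$, where $S$ is the column-stochastic link matrix: $\begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix} = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$ • Edge weights: each existing link out of y or a has weight $0.8 \cdot \frac{1}{2} + 0.2 \cdot \frac{1}{3} = 7/15$; the self-link at m has weight $0.8 + 0.2 \cdot \frac{1}{3} = 13/15$; every other pair has only the teleport weight $0.2 \cdot \frac{1}{3} = 1/15$ • $(r_y, r_a, r_m)$ at iterations 0, 1, 2, 3, …: (1/3, 1/3, 1/3), (0.33, 0.20, 0.47), (0.28, 0.20, 0.52), (0.26, 0.18, 0.56), …, converging to (7/33, 5/33, 21/33)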
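The worked example above can be reproduced with exact fractions. The sketch below builds $A$ for $\beta = 4/5$, runs two power-iteration steps, and checks that $(7/33, 5/33, 21/33)$ is a fixed point; `google_matrix` and `matvec` are hypothetical helper names.

```python
from fractions import Fraction as F

def google_matrix(s, beta):
    """A = beta * S + (1 - beta) * (1/n) * ones, for a column-stochastic S."""
    n = len(s)
    return [[beta * s[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

# Spider-trap graph, rows and columns ordered y, a, m
S = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]
A = google_matrix(S, F(4, 5))

r0 = [F(1, 3)] * 3
r1 = matvec(A, r0)   # second column of the iteration table
r2 = matvec(A, r1)   # third column of the iteration table

fixed = [F(7, 33), F(5, 33), F(21, 33)]
is_fixed_point = (matvec(A, fixed) == fixed)
```

With teleports, the trap page m ends up with the largest score but no longer absorbs all of the importance.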

  32. Computing PageRank • $A = \beta \cdot M + (1-\beta) [1/N]_{N \times N}$ • Example with $\beta = 0.8$: $A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$ • Key step is matrix-vector multiplication • $r^{new} = A \cdot r^{old}$ • Easy if we have enough main memory to hold $A$, $r^{old}$, $r^{new}$ • Say $N$ = 1 billion pages • We need 4 bytes for each entry (say) • 2 billion entries for the two vectors, approx 8GB • Matrix $A$ has $N^2$ entries • $10^{18}$ is a large number!

  33. Matrix Formulation • Suppose there are $N$ pages • Consider a page $j$, with set of out-links $d_j$ • We have $M_{ij} = 1/|d_j|$ when $j \to i$ and $M_{ij} = 0$ otherwise • The random teleport is equivalent to • Adding a teleport link from $j$ to every other page with probability $(1-\beta)/N$ • Reducing the probability of following each out-link from $1/|d_j|$ to $\beta/|d_j|$ • Equivalent: tax each page a fraction $(1-\beta)$ of its score and redistribute it evenly

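  34. Rearranging the Equation • $r = A \cdot r$, where $A_{ij} = \beta M_{ij} + \frac{1-\beta}{N}$ • $r_i = \sum_{j=1}^{N} A_{ij} r_j = \sum_{j=1}^{N} \left[ \beta M_{ij} + \frac{1-\beta}{N} \right] r_j = \sum_{j=1}^{N} \beta M_{ij} r_j + \frac{1-\beta}{N} \sum_{j=1}^{N} r_j = \sum_{j=1}^{N} \beta M_{ij} r_j + \frac{1-\beta}{N}$, since $\sum_j r_j = 1$ • So we get: $r = \beta M \cdot r + \left[ \frac{1-\beta}{N} \right]_N$ • $[x]_N$ is a vector of length $N$ with all entries $x$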

  35. Sparse Matrix Formulation • We just rearranged the PageRank equation: $r = \beta M \cdot r + [(1-\beta)/N]_N$ • where $[(1-\beta)/N]_N$ is a vector with all $N$ entries $(1-\beta)/N$ • $M$ is a sparse matrix! • 10 links per node, approx $10N$ entries • So in each iteration, we need to: • Compute $r^{new} = \beta M \cdot r^{old}$ • Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$

  36. Sparse Matrix Encoding • Encode the sparse matrix using only nonzero entries: each row stores a source node, its degree, and its destination nodes • Space proportional roughly to number of links • Say $10N$, or 4*10*1 billion = 40GB • Still won’t fit in memory, but will fit on disk

  37. Basic Algorithm: Update Step • Assume enough RAM to fit $r^{new}$ into memory • Store $r^{old}$ and matrix $M$ on disk • Then 1 step of power iteration is: • Initialize all entries of $r^{new}$ to $(1-\beta)/N$ • For each page $p$ (of out-degree $n$): • Read into memory: $p$, $n$, $dest_1, \ldots, dest_n$, $r^{old}(p)$ • for $j = 1 \ldots n$: $r^{new}(dest_j)$ += $\beta \cdot r^{old}(p) / n$ • [Figure: $r^{old}$ and $r^{new}$ vectors beside the (src, degree, destination) table for nodes 0 to 6]
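The update step can be sketched as follows, assuming the matrix arrives as a stream of (source, degree, destinations) records in the sparse encoding described earlier. Disk I/O is left out, so `records` is just an in-memory list here; the records shown encode the spider-trap graph with pages numbered y = 0, a = 1, m = 2.

```python
from fractions import Fraction as F

def update_step(records, r_old, beta):
    """One power-iteration step: r_new = beta * M * r_old + (1 - beta) / N."""
    n = len(r_old)
    # Every page starts with its teleport share
    r_new = [(1 - beta) / n] * n
    for src, degree, dests in records:
        # Page src splits beta * r_old[src] evenly over its out-links
        share = beta * r_old[src] / degree
        for d in dests:
            r_new[d] += share
    return r_new

# (source, degree, destinations): y -> y, a;  a -> y, m;  m -> m
records = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_new = update_step(records, [F(1, 3)] * 3, F(4, 5))
```

Only `r_new` needs random access; the records and `r_old` are read sequentially, which is why they can stay on disk in the setting of the slide.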

  38. Analysis • Assume enough RAM to fit $r^{new}$ into memory • Store $r^{old}$ and matrix $M$ on disk • In each iteration, we have to: • Read $r^{old}$ and $M$ • Write $r^{new}$ back to disk • IO cost = $2|r| + |M|$ • Question: • What if we could not even fit $r^{new}$ in memory?

  39. Block-based Update Algorithm • [Figure: $r^{new}$ split into blocks, next to the (src, degree, destination) table and $r^{old}$, for nodes 0 to 5]

  40. Analysis of Block Update • Similar to nested-loop join in databases • Break $r^{new}$ into $k$ blocks that fit in memory • Scan $M$ and $r^{old}$ once for each block • $k$ scans of $M$ and $r^{old}$ • Cost per iteration: $k(|M| + |r|) + |r| = k|M| + (k+1)|r|$ • Can we do better? • Hint: $M$ is much bigger than $r$ (approx 10-20x), so we must avoid reading it $k$ times per iteration
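A rough in-memory sketch of the block-based update, assuming the same (source, degree, destinations) records as the basic algorithm. Each pass over `records` stands in for one full scan of $M$ and $r^{old}$ from disk, and only one block of $r^{new}$ is held at a time. The function name and the choice of contiguous, equally sized blocks are assumptions.

```python
from fractions import Fraction as F

def block_update(records, r_old, beta, k):
    """One power-iteration step that builds r_new in at most k blocks."""
    n = len(r_old)
    block_size = -(-n // k)  # ceil(n / k)
    r_new = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        # Only this block of r_new is kept in memory
        block = [(1 - beta) / n] * (end - start)
        for src, degree, dests in records:  # one full scan of M per block
            share = beta * r_old[src] / degree
            for d in dests:
                if start <= d < end:        # skip destinations outside the block
                    block[d - start] += share
        r_new.extend(block)                 # stands in for writing the block to disk
    return r_new

# Spider-trap graph: y -> y, a;  a -> y, m;  m -> m  (pages numbered 0, 1, 2)
records = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_new = block_update(records, [F(1, 3)] * 3, F(4, 5), k=2)
```

The result is the same as for the unblocked step; what changes is that `records` is scanned once per block, which is the $k|M|$ term in the cost.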

  41. Block-Stripe Update Algorithm • [Figure: the (src, degree, destination) table split into stripes, one per block of $r^{new}$, next to $r^{old}$, for nodes 0 to 5]

  42. Block-Stripe Analysis • Break $M$ into stripes • Each stripe contains only destination nodes in the corresponding block of $r^{new}$ • Some additional overhead per stripe • But it is usually worth it • Cost per iteration: $|M|(1+\varepsilon) + (k+1)|r|$
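To compare the three I/O cost formulas side by side, the sketch below plugs in the sizes used earlier in the deck: $|M|$ = 40GB and $|r|$ = 4GB (1 billion entries at 4 bytes each). The number of blocks `k = 4` and the stripe overhead `eps = 0.1` are assumed values for illustration only.

```python
def basic_cost(m, r):
    """Read r_old and M, write r_new: 2|r| + |M|."""
    return 2 * r + m

def block_cost(m, r, k):
    """k scans of M and r_old, plus writing r_new: k|M| + (k+1)|r|."""
    return k * m + (k + 1) * r

def stripe_cost(m, r, k, eps):
    """M read once with overhead eps, r_old read k times: |M|(1+eps) + (k+1)|r|."""
    return m * (1 + eps) + (k + 1) * r

M_GB, R_GB = 40, 4   # sizes in GB, taken from the earlier slides
k, eps = 4, 0.1      # assumed values

costs = {
    "basic": basic_cost(M_GB, R_GB),
    "block": block_cost(M_GB, R_GB, k),
    "stripe": stripe_cost(M_GB, R_GB, k, eps),
}
```

Under these assumptions the block-based scheme reads far more data per iteration than the striped one, because it rescans $M$ for every block.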

  43. Some Problems with PageRank • Measures generic popularity of a page • Biased against topic-specific authorities • Solution: Topic-Specific PageRank (next) • Uses a single measure of importance • Other models, e.g. hubs-and-authorities • Solution: Hubs-and-Authorities (next) • Susceptible to link spam • Artificial link topographies created in order to boost PageRank • Solution: TrustRank (next)
